Models Comparison

Detailed comparison of all ML models from the CVE exploitation prediction paper

Early Premium

Best Overall AUC

0.9913
AUC-ROC Score
Features
25
Best For
SPARSE

Full MLP

Deep Neural Network

0.9719
AUC-ROC Score
Features
66
Best For
RICH

GNN (GraphSAGE)

Knowledge Graph Reasoning

0.9344
AUC-ROC Score
Features
18 + KG
Best For
Interpretability

Model Performance Comparison

Data Regime Distribution

SPARSE (Early Premium) 71.6%
MODERATE (Balanced) 25.2%
RICH (Full MLP) 3.2%

71.6% of CVEs fall into the SPARSE regime at publication time, requiring the Early Premium model, which works without EPSS, sightings, or CVSS data.

Ensemble Weights by Regime

P_final = w_ml × P_ml + w_kg × P_kg + w_sim × P_sim

Model Selection Logic

SPARSE (no CWE, no CPE)

  • No EPSS score available
  • No sightings data
  • Limited NVD enrichment
  • Similarity leads (66.7%)

MODERATE (CWE or CPE)

  • Some CVSS data available
  • Partial enrichment
  • Balanced ensemble
  • ML: 41.5%, KG: 16.9%, Sim: 41.6%

RICH (CVSS + CWE + CPE)

  • Full EPSS, CVSS, sightings
  • Complete NVD data
  • ML leads (39.9%)
  • Full MLP model used
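The selection logic above can be sketched as a small function. The boolean field names (`has_cwe`, `has_cpe`, `has_cvss`) are illustrative assumptions, not identifiers from the paper.

```python
# Hypothetical sketch of the regime-selection logic described above;
# the field names are illustrative, not from the paper's code.
def select_regime(has_cwe: bool, has_cpe: bool, has_cvss: bool) -> str:
    """Assign a CVE to a data regime based on available enrichment."""
    if has_cvss and has_cwe and has_cpe:
        return "RICH"      # full enrichment -> Full MLP, ML-led ensemble
    if has_cwe or has_cpe:
        return "MODERATE"  # partial enrichment -> balanced ensemble
    return "SPARSE"        # no enrichment -> Early Premium, similarity-led
```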

Early Premium Model

Best performance with minimal data - AUC 0.9913

Why "Early Premium"?

Named because it works at the earliest stage of the CVE lifecycle (before NVD enrichment) while achieving premium (best-in-class) performance. It outperforms even the full 66-feature MLP because it focuses on the most predictive signals.

Training Details

  • Dataset: 289,705 CVEs (2010-2025)
  • Split: 80% train, 10% val, 10% test
  • Epochs: 100 with early stopping
  • Optimizer: Adam (lr=0.001)
  • Regularization: Dropout 0.3, L2 weight decay

Architecture

Input Layer (25 features)
Dense(256) + ReLU + BatchNorm + Dropout(0.3)
Dense(128) + ReLU + BatchNorm + Dropout(0.3)
Dense(64) + ReLU + BatchNorm + Dropout(0.3)
Output Layer (1) + Sigmoid → P(exploit)
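The stack above can be sketched in PyTorch. This is a minimal sketch assuming the layer sizes listed here; it is not the paper's implementation.

```python
import torch
import torch.nn as nn

# Sketch of the Early Premium architecture described above (PyTorch);
# layer sizes follow the text, everything else is an assumption.
class EarlyPremiumMLP(nn.Module):
    def __init__(self, n_features: int = 25, dropout: float = 0.3):
        super().__init__()
        layers, in_dim = [], n_features
        for out_dim in (256, 128, 64):
            layers += [
                nn.Linear(in_dim, out_dim),
                nn.ReLU(),
                nn.BatchNorm1d(out_dim),
                nn.Dropout(dropout),
            ]
            in_dim = out_dim
        layers.append(nn.Linear(in_dim, 1))  # logit; sigmoid -> P(exploit)
        self.net = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.net(x)).squeeze(-1)
```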

Loss Function

Weighted Binary Cross-Entropy with class weights to handle imbalanced data (only ~5% of CVEs are exploited).
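A minimal sketch of weighted binary cross-entropy under that base rate. The ~19:1 positive weight is derived from the ~5% exploitation rate above, not quoted from the paper.

```python
import numpy as np

# Minimal sketch of weighted binary cross-entropy; positives are
# up-weighted to counter the ~5% exploitation base rate.
def weighted_bce(y_true, p_pred, pos_weight):
    p = np.clip(p_pred, 1e-7, 1 - 1e-7)  # avoid log(0)
    losses = -(pos_weight * y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    return losses.mean()

# With ~5% positives, a natural weight is n_neg / n_pos ~= 0.95 / 0.05 = 19
```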

Key Features (25 Total)

Vendor → Product → Version Knowledge Graph

Early Premium uses a hierarchical knowledge graph capturing vendor, product, and version relationships. This enables risk inheritance: a vendor with historically exploited products signals higher risk for new CVEs.

Vendor Features

  • vendor_exploit_rate
  • vendor_cve_count
  • vendor_avg_cvss
  • vendor_risk_score

Product Features

  • product_exploit_rate
  • product_cve_count
  • product_avg_severity
  • product_cwe_diversity

Version Features

  • version_exploit_rate
  • version_exploit_rate_before
  • version_exploit_rate_after
  • version_risk_delta
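One way such hierarchical exploit-rate features could be computed from a CVE table; the DataFrame and its column names here are hypothetical, used only to illustrate the groupby pattern.

```python
import pandas as pd

# Illustrative sketch of vendor/product exploit-rate features;
# the table and column names are assumptions, not from the paper.
cves = pd.DataFrame({
    "vendor":    ["acme", "acme", "acme", "globex"],
    "product":   ["app",  "app",  "db",   "srv"],
    "exploited": [1,      0,      1,      0],
})

vendor_exploit_rate = cves.groupby("vendor")["exploited"].mean()
product_exploit_rate = cves.groupby(["vendor", "product"])["exploited"].mean()
# A new acme CVE inherits vendor_exploit_rate["acme"] even before any
# product-level history exists -- the risk inheritance described above.
```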

Version Risk Inheritance

The model captures how vulnerability patterns propagate across versions:

v2.0 (45% exploited)
v2.1 (32% exploited)
v2.2 (18% exploited)
v3.0 (5% exploited)

version_risk_delta = version_exploit_rate_before - version_exploit_rate_after (positive = improving)
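A worked example of the delta, using the version rates shown above.

```python
# Worked example of version_risk_delta using the rates listed above.
rates = {"v2.0": 0.45, "v2.1": 0.32, "v2.2": 0.18, "v3.0": 0.05}

def version_risk_delta(before: float, after: float) -> float:
    # positive delta = exploitation rate dropping across versions (improving)
    return before - after

delta = version_risk_delta(rates["v2.0"], rates["v3.0"])  # 0.45 - 0.05 = 0.40
```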

Full MLP Model

Deep learning with all 66 features - AUC 0.9719

When to Use

Used for RICH-regime CVEs (3.2% of the total) that have full NVD enrichment: EPSS scores, CVSS data, sightings, and complete ATT&CK mappings.

Additional Features (41 more)

  • EPSS score and percentile
  • Sighting counts and recency
  • Full CVSS v3 vector components
  • ATT&CK tactic encodings
  • Reference type features

Architecture

Numerical Input (52 features)
Categorical Embeddings (14 features)
↓ Concatenate
Dense(256) + ReLU + BN + Dropout(0.3)
Dense(128) + ReLU + BN + Dropout(0.3)
Dense(64) + ReLU + BN + Dropout(0.3)
Output(1) + Sigmoid
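A PyTorch sketch of the numeric-plus-embedding design above. The categorical cardinalities and embedding size are illustrative assumptions; only the layer widths come from the text.

```python
import torch
import torch.nn as nn

# Sketch of the Full MLP with categorical embeddings (PyTorch);
# cardinalities (8 each) and emb_dim=4 are illustrative assumptions.
class FullMLP(nn.Module):
    def __init__(self, n_numerical=52, cat_cardinalities=(8,) * 14, emb_dim=4):
        super().__init__()
        self.embeddings = nn.ModuleList(
            nn.Embedding(card, emb_dim) for card in cat_cardinalities
        )
        in_dim = n_numerical + emb_dim * len(cat_cardinalities)
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(), nn.BatchNorm1d(256), nn.Dropout(0.3),
            nn.Linear(256, 128), nn.ReLU(), nn.BatchNorm1d(128), nn.Dropout(0.3),
            nn.Linear(128, 64), nn.ReLU(), nn.BatchNorm1d(64), nn.Dropout(0.3),
            nn.Linear(64, 1),
        )

    def forward(self, x_num, x_cat):
        # embed each categorical column, then concatenate with numerics
        embs = [emb(x_cat[:, i]) for i, emb in enumerate(self.embeddings)]
        x = torch.cat([x_num] + embs, dim=1)
        return torch.sigmoid(self.net(x)).squeeze(-1)
```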

Why Lower AUC?

More features do not always mean better performance. The Early Premium model achieves a higher AUC because it concentrates on the most predictive signals and avoids noise from less informative features.

Feature Categories (66 Total)

EPSS Features (4)

  • epss_score
  • epss_percentile
  • epss_score_30d_change
  • has_epss

Sighting Features (6)

  • sighting_total
  • sighting_recent_30d
  • sighting_recency_days
  • sighting_velocity
  • has_sightings
  • sighting_sources

CVSS Features (15)

  • cvss_v3_score
  • attack_vector_encoded
  • attack_complexity_encoded
  • privileges_required_encoded
  • user_interaction_encoded
  • scope_encoded
  • + 9 more CIA impact metrics

Temporal Features (8)

  • days_since_published
  • days_to_first_sighting
  • change_total
  • change_velocity
  • publication_year
  • publication_month
  • is_recent_30d
  • is_recent_90d

ATT&CK Features (8)

  • technique_count
  • tactic_initial_access
  • tactic_execution
  • tactic_persistence
  • tactic_privilege_escalation
  • tactic_defense_evasion
  • tactic_credential_access
  • has_attack_mapping

Early Premium (25)

  • desc_similarity_max/mean
  • vendor/product exploit rates
  • version risk features
  • CWE exploitation rates
  • Reference counts by type
  • + more (see Early Premium tab)

GNN (GraphSAGE) Model

Knowledge graph reasoning - AUC 0.9344

How It Works

Uses GraphSAGE (Graph Sample and Aggregate) to learn from the CVE knowledge graph. Each CVE node aggregates information from its neighboring CWE, CAPEC, and ATT&CK technique nodes to create a rich representation.

Graph Structure

  • CVE Nodes: 289,705 vulnerabilities
  • CWE Nodes: 1,425 weakness types
  • CAPEC Nodes: 559 attack patterns
  • Technique Nodes: 652 ATT&CK techniques
  • Edges: 1.2M+ relationships

Architecture

Node Features (18 dims)
GraphSAGE Layer 1 (mean aggregation)
↓ Aggregate from CWE neighbors
GraphSAGE Layer 2 (mean aggregation)
↓ Aggregate from CAPEC/ATT&CK
Dense(64) + ReLU + Dropout
Output(1) + Sigmoid

Aggregation Formula

h_v^(k) = σ(W · CONCAT(h_v^(k-1), AGG({h_u : u ∈ N(v)})))
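The aggregation formula can be illustrated with a small numpy implementation of one mean-aggregation layer (ReLU standing in for σ; the weight matrix is supplied by the caller). This is a sketch of the mechanism, not the paper's code.

```python
import numpy as np

# One GraphSAGE layer with mean aggregation, matching the formula above:
# h_v = sigma(W . CONCAT(h_v, mean({h_u : u in N(v)})))
def graphsage_layer(H, neighbors, W):
    """H: (n_nodes, d) node features; neighbors: dict node -> neighbor ids;
    W: (d_out, 2*d) weight matrix. Returns (n_nodes, d_out) features."""
    out = []
    for v in range(H.shape[0]):
        nbrs = neighbors.get(v, [])
        agg = H[nbrs].mean(axis=0) if nbrs else np.zeros(H.shape[1])
        h = np.concatenate([H[v], agg])     # CONCAT(h_v, AGG(...))
        out.append(np.maximum(W @ h, 0.0))  # sigma = ReLU in this sketch
    return np.stack(out)
```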

GNN Input Features (18 dimensions)

Original CVE (7)

  1. cvss_v3_score
  2. epss_score
  3. epss_percentile
  4. sighting_total (log)
  5. sighting_exploited (log)
  6. change_total (log)
  7. days_since_pub (log)

CWE-Derived (3)

  8. is_high_risk_cwe (binary)
  9. num_cwes (log)
  10. attack_surface (CAPEC count, log)

High-risk CWEs: CWE-78, 77, 89, 94, 502 (injection); CWE-119, 120, 787 (memory); CWE-22, 434, 287, 798

ATT&CK-Derived (8)

  11. num_techniques (log)
  12. num_tactics (log)
  13. has_execution (binary)
  14. has_initial_access (binary)
  15. has_persistence (binary)
  16. has_priv_escalation (binary)
  17. has_defense_evasion (binary)
  18. severity_score (0-1, weighted)

Feature Processing

All count features (sightings, days, techniques) are log-transformed using log1p(x). Binary flags are set based on presence in high-risk sets. Severity score is normalized to 0-1 using weighted tactic contributions.
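The log1p transform described above in a couple of lines: it compresses heavy-tailed counts while keeping zero counts at exactly 0.

```python
import numpy as np

# log1p(x) = log(1 + x): zero counts stay 0, large counts are compressed
counts = np.array([0, 1, 10, 1000])
transformed = np.log1p(counts)
```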

Why Use GNN?

Interpretability

Can explain predictions through the graph path: "This CVE is risky because CWE-89 links to T1190 (Exploit Public-Facing Application) used by APT28."

Transductive Learning

Benefits from the entire graph structure. New CVEs inherit knowledge from similar CWEs that were exploited before.

Threat Intelligence

Connects CVEs to real threat groups and malware. If APT29 uses techniques linked to a CVE's CWE, risk increases.

Adaptive Ensemble System

The final prediction combines three components with regime-specific weights learned through logistic regression.

P_final = w_ml × P_ml + w_kg × P_kg + w_sim × P_sim

P_ml

ML Model Prediction

Early Premium or MLP based on regime

P_kg

Knowledge Graph Score

CWE rates + ATT&CK + vendor history

P_sim

Similarity Score

Sentence-BERT description similarity

How Weights Were Learned

# From: 10_adaptive_risk_model.py
# Learn optimal weights via logistic regression on component scores

import numpy as np
from sklearn.linear_model import LogisticRegression

# Stack the three component scores
X = np.column_stack([ml_scores, kg_scores, similarity_scores])
y = exploitation_labels

# Train logistic regression to learn the optimal combination
model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# Extract learned weights (normalized to sum to 1)
raw_weights = model.coef_[0]
weights = raw_weights / raw_weights.sum()

# Result: [0.442, 0.312, 0.246] for global weights
# Per-regime weights vary based on data availability

Learned Weights by Data Regime

Regime      ML Weight   KG Weight   Similarity Weight   Dominant
SPARSE      33.3%       0.0%        66.7%               Similarity
MODERATE    41.5%       16.9%       41.6%               Balanced
RICH        39.9%       27.2%       33.0%               ML Model
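Applying these per-regime weights is a dictionary lookup plus the weighted sum from the ensemble formula; a small sketch:

```python
# Regime-specific weights from the table above, applied to get P_final.
REGIME_WEIGHTS = {
    "SPARSE":   {"ml": 0.333, "kg": 0.000, "sim": 0.667},
    "MODERATE": {"ml": 0.415, "kg": 0.169, "sim": 0.416},
    "RICH":     {"ml": 0.399, "kg": 0.272, "sim": 0.330},
}

def p_final(regime: str, p_ml: float, p_kg: float, p_sim: float) -> float:
    # P_final = w_ml * P_ml + w_kg * P_kg + w_sim * P_sim
    w = REGIME_WEIGHTS[regime]
    return w["ml"] * p_ml + w["kg"] * p_kg + w["sim"] * p_sim
```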

Similarity Component (P_sim)

Uses Sentence-BERT (all-MiniLM-L6-v2) to compute semantic similarity between the target CVE's description and descriptions of known exploited CVEs.

How It Works

  1. Encode target CVE description → 384-dim vector
  2. Compare against 10,000+ exploited CVE embeddings
  3. Find top-k most similar exploited CVEs
  4. Compute weighted average similarity as P_sim

Why 17.9% Feature Importance?

desc_similarity_max is the #1 most important feature because CVE descriptions contain strong signals about exploitability. Similar language to past exploited CVEs indicates similar attack surface.

# Similarity calculation
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-MiniLM-L6-v2')

# Encode CVE description
embedding = model.encode(cve_description)  # 384-dim vector

# Compare to exploited CVE index
similarities = cosine_similarity([embedding], exploited_embeddings)[0]

# P_sim = weighted blend of top-k similarity statistics
max_similarity = similarities.max()
mean_top10 = np.sort(similarities)[-10:].mean()
weighted_avg = np.average(similarities, weights=similarities)  # e.g. similarity-weighted mean
p_sim = 0.5 * max_similarity + 0.3 * mean_top10 + 0.2 * weighted_avg

Complete Feature List

Top 15 Features by Importance (Early Premium)

Feature Categories