Models Comparison

Detailed comparison of all ML models from the CVE exploitation prediction paper

Early Premium

Best Overall AUC

0.9913
AUC-ROC Score
Features
25
Best For
SPARSE

Full MLP

Deep Neural Network

0.9719
AUC-ROC Score
Features
66
Best For
RICH

GNN (GraphSAGE)

Knowledge Graph Reasoning

0.9344
AUC-ROC Score
Features
18 + KG
Best For
Interpretability

Model Performance Comparison

Data Regime Distribution

SPARSE (Early Premium) 71.6%
MODERATE (Balanced) 25.2%
RICH (Full MLP) 3.2%

71.6% of CVEs fall into the SPARSE regime at publication time, requiring the Early Premium model, which works without EPSS, sightings, or CVSS data.

Ensemble Weights by Regime

P_final = w_ml × P_ml + w_kg × P_kg + w_sim × P_sim

Model Selection Logic

SPARSE (no CWE, no CPE)

  • No EPSS score available
  • No sightings data
  • Limited NVD enrichment
  • Similarity leads (66.7%)

MODERATE (CWE or CPE)

  • Some CVSS data available
  • Partial enrichment
  • Balanced ensemble
  • ML: 41.5%, KG: 16.9%, Sim: 41.6%

RICH (CVSS + CWE + CPE)

  • Full EPSS, CVSS, sightings
  • Complete NVD data
  • ML leads (39.9%)
  • Full MLP model used
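The selection logic above can be sketched as a small function. The boolean field names (`has_cwe`, `has_cpe`, `has_cvss`) are illustrative assumptions, not identifiers from the paper.

```python
# Hypothetical sketch of the regime-selection logic described above;
# the field names are illustrative, not from the paper's code.
def select_regime(has_cwe: bool, has_cpe: bool, has_cvss: bool) -> str:
    """Assign a CVE to a data regime based on available enrichment."""
    if has_cvss and has_cwe and has_cpe:
        return "RICH"      # full enrichment -> Full MLP, ML-led ensemble
    if has_cwe or has_cpe:
        return "MODERATE"  # partial enrichment -> balanced ensemble
    return "SPARSE"        # no enrichment -> Early Premium, similarity-led
```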

Early Premium Model

Best performance with minimal data - AUC 0.9913

Why "Early Premium"?

Named because it works at the earliest stage of the CVE lifecycle (before NVD enrichment) while achieving premium (best-in-class) performance. It outperforms even the full 66-feature MLP because it focuses on the most predictive signals.

Training Details

  • Dataset: 289,705 CVEs (2010-2025)
  • Split: 80% train, 10% val, 10% test
  • Epochs: 100 with early stopping
  • Optimizer: Adam (lr=0.001)
  • Regularization: Dropout 0.3, L2 weight decay

Architecture

Input Layer (25 features)
Dense(256) + ReLU + BatchNorm + Dropout(0.3)
Dense(128) + ReLU + BatchNorm + Dropout(0.3)
Dense(64) + ReLU + BatchNorm + Dropout(0.3)
Output Layer (1) + Sigmoid → P(exploit)
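The stack above can be sketched in PyTorch. This is a minimal sketch assuming the layer sizes listed here; it is not the paper's implementation.

```python
import torch
import torch.nn as nn

# Sketch of the Early Premium architecture described above (PyTorch);
# layer sizes follow the text, everything else is an assumption.
class EarlyPremiumMLP(nn.Module):
    def __init__(self, n_features: int = 25, dropout: float = 0.3):
        super().__init__()
        layers, in_dim = [], n_features
        for out_dim in (256, 128, 64):
            layers += [
                nn.Linear(in_dim, out_dim),
                nn.ReLU(),
                nn.BatchNorm1d(out_dim),
                nn.Dropout(dropout),
            ]
            in_dim = out_dim
        layers.append(nn.Linear(in_dim, 1))  # logit; sigmoid -> P(exploit)
        self.net = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.net(x)).squeeze(-1)
```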

Loss Function

Weighted Binary Cross-Entropy with class weights to handle imbalanced data (only ~5% of CVEs are exploited).
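A minimal sketch of weighted binary cross-entropy under that base rate. The ~19:1 positive weight is derived from the ~5% exploitation rate above, not quoted from the paper.

```python
import numpy as np

# Minimal sketch of weighted binary cross-entropy; positives are
# up-weighted to counter the ~5% exploitation base rate.
def weighted_bce(y_true, p_pred, pos_weight):
    p = np.clip(p_pred, 1e-7, 1 - 1e-7)  # avoid log(0)
    losses = -(pos_weight * y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    return losses.mean()

# With ~5% positives, a natural weight is n_neg / n_pos ~= 0.95 / 0.05 = 19
```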

Key Features (25 Total)

Vendor → Product → Version Knowledge Graph

Early Premium uses a hierarchical knowledge graph capturing vendor, product, and version relationships. This enables risk inheritance: a vendor with historically exploited products signals higher risk for new CVEs.

Vendor Features

  • vendor_exploit_rate
  • vendor_cve_count
  • vendor_avg_cvss
  • vendor_risk_score

Product Features

  • product_exploit_rate
  • product_cve_count
  • product_avg_severity
  • product_cwe_diversity

Version Features

  • version_exploit_rate
  • version_exploit_rate_before
  • version_exploit_rate_after
  • version_risk_delta
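One way such hierarchical exploit-rate features could be computed from a CVE table; the DataFrame and its column names here are hypothetical, used only to illustrate the groupby pattern.

```python
import pandas as pd

# Illustrative sketch of vendor/product exploit-rate features;
# the table and column names are assumptions, not from the paper.
cves = pd.DataFrame({
    "vendor":    ["acme", "acme", "acme", "globex"],
    "product":   ["app",  "app",  "db",   "srv"],
    "exploited": [1,      0,      1,      0],
})

vendor_exploit_rate = cves.groupby("vendor")["exploited"].mean()
product_exploit_rate = cves.groupby(["vendor", "product"])["exploited"].mean()
# A new acme CVE inherits vendor_exploit_rate["acme"] even before any
# product-level history exists -- the risk inheritance described above.
```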

Version Risk Inheritance

The model captures how vulnerability patterns propagate across versions:

v2.0 (45% exploited)
v2.1 (32% exploited)
v2.2 (18% exploited)
v3.0 (5% exploited)

version_risk_delta = version_exploit_rate_before - version_exploit_rate_after (positive = improving)
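A worked example of the delta, using the version rates shown above.

```python
# Worked example of version_risk_delta using the rates listed above.
rates = {"v2.0": 0.45, "v2.1": 0.32, "v2.2": 0.18, "v3.0": 0.05}

def version_risk_delta(before: float, after: float) -> float:
    # positive delta = exploitation rate dropping across versions (improving)
    return before - after

delta = version_risk_delta(rates["v2.0"], rates["v3.0"])  # 0.45 - 0.05 = 0.40
```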

Full MLP Model

Deep learning with all 66 features - AUC 0.9719

When to Use

Used for RICH-regime CVEs (3.2% of the total) that have full NVD enrichment: EPSS scores, CVSS data, sightings, and complete ATT&CK mappings.

Additional Features (41 more)

  • EPSS score and percentile
  • Sighting counts and recency
  • Full CVSS v3 vector components
  • ATT&CK tactic encodings
  • Reference type features

Architecture

Numerical Input (52 features)
Categorical Embeddings (14 features)
↓ Concatenate
Dense(256) + ReLU + BN + Dropout(0.3)
Dense(128) + ReLU + BN + Dropout(0.3)
Dense(64) + ReLU + BN + Dropout(0.3)
Output(1) + Sigmoid
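A PyTorch sketch of the numeric-plus-embedding design above. The categorical cardinalities and embedding size are illustrative assumptions; only the layer widths come from the text.

```python
import torch
import torch.nn as nn

# Sketch of the Full MLP with categorical embeddings (PyTorch);
# cardinalities (8 each) and emb_dim=4 are illustrative assumptions.
class FullMLP(nn.Module):
    def __init__(self, n_numerical=52, cat_cardinalities=(8,) * 14, emb_dim=4):
        super().__init__()
        self.embeddings = nn.ModuleList(
            nn.Embedding(card, emb_dim) for card in cat_cardinalities
        )
        in_dim = n_numerical + emb_dim * len(cat_cardinalities)
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(), nn.BatchNorm1d(256), nn.Dropout(0.3),
            nn.Linear(256, 128), nn.ReLU(), nn.BatchNorm1d(128), nn.Dropout(0.3),
            nn.Linear(128, 64), nn.ReLU(), nn.BatchNorm1d(64), nn.Dropout(0.3),
            nn.Linear(64, 1),
        )

    def forward(self, x_num, x_cat):
        # embed each categorical column, then concatenate with numerics
        embs = [emb(x_cat[:, i]) for i, emb in enumerate(self.embeddings)]
        x = torch.cat([x_num] + embs, dim=1)
        return torch.sigmoid(self.net(x)).squeeze(-1)
```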

Why Lower AUC?

More features do not always mean better performance. The Early Premium model achieves a higher AUC because it concentrates on the most predictive signals and avoids noise from less informative features.

Feature Categories (66 Total)

EPSS Features (4)

  • epss_score
  • epss_percentile
  • epss_score_30d_change
  • has_epss

Sighting Features (6)

  • sighting_total
  • sighting_recent_30d
  • sighting_recency_days
  • sighting_velocity
  • has_sightings
  • sighting_sources

CVSS Features (15)

  • cvss_v3_score
  • attack_vector_encoded
  • attack_complexity_encoded
  • privileges_required_encoded
  • user_interaction_encoded
  • scope_encoded
  • + 9 more CIA impact metrics

Temporal Features (8)

  • days_since_published
  • days_to_first_sighting
  • change_total
  • change_velocity
  • publication_year
  • publication_month
  • is_recent_30d
  • is_recent_90d

ATT&CK Features (8)

  • technique_count
  • tactic_initial_access
  • tactic_execution
  • tactic_persistence
  • tactic_privilege_escalation
  • tactic_defense_evasion
  • tactic_credential_access
  • has_attack_mapping

Early Premium (25)

  • desc_similarity_max/mean
  • vendor/product exploit rates
  • version risk features
  • CWE exploitation rates
  • Reference counts by type
  • + more (see Early Premium tab)

GNN (GraphSAGE) Model

Knowledge graph reasoning - AUC 0.9344

How It Works

Uses GraphSAGE (Graph Sample and Aggregate) to learn from the CVE knowledge graph. Each CVE node aggregates information from its neighboring CWE, CAPEC, and ATT&CK technique nodes to create a rich representation.

Graph Structure

  • CVE Nodes: 289,705 vulnerabilities
  • CWE Nodes: 1,425 weakness types
  • CAPEC Nodes: 559 attack patterns
  • Technique Nodes: 652 ATT&CK techniques
  • Edges: 1.2M+ relationships

Architecture

Node Features (18 dims)
GraphSAGE Layer 1 (mean aggregation)
↓ Aggregate from CWE neighbors
GraphSAGE Layer 2 (mean aggregation)
↓ Aggregate from CAPEC/ATT&CK
Dense(64) + ReLU + Dropout
Output(1) + Sigmoid

Aggregation Formula

h_v^(k) = σ(W · CONCAT(h_v^(k-1), AGG({h_u : u ∈ N(v)})))
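The aggregation formula can be illustrated with a small numpy implementation of one mean-aggregation layer (ReLU standing in for σ; the weight matrix is supplied by the caller). This is a sketch of the mechanism, not the paper's code.

```python
import numpy as np

# One GraphSAGE layer with mean aggregation, matching the formula above:
# h_v = sigma(W . CONCAT(h_v, mean({h_u : u in N(v)})))
def graphsage_layer(H, neighbors, W):
    """H: (n_nodes, d) node features; neighbors: dict node -> neighbor ids;
    W: (d_out, 2*d) weight matrix. Returns (n_nodes, d_out) features."""
    out = []
    for v in range(H.shape[0]):
        nbrs = neighbors.get(v, [])
        agg = H[nbrs].mean(axis=0) if nbrs else np.zeros(H.shape[1])
        h = np.concatenate([H[v], agg])     # CONCAT(h_v, AGG(...))
        out.append(np.maximum(W @ h, 0.0))  # sigma = ReLU in this sketch
    return np.stack(out)
```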

GNN Input Features (18 dimensions)

Original CVE (7)

  1. cvss_v3_score
  2. epss_score
  3. epss_percentile
  4. sighting_total (log)
  5. sighting_exploited (log)
  6. change_total (log)
  7. days_since_pub (log)

CWE-Derived (3)

  8. is_high_risk_cwe (binary)
  9. num_cwes (log)
  10. attack_surface (CAPEC count, log)

High-risk CWEs: CWE-78, 77, 89, 94, 502 (injection); CWE-119, 120, 787 (memory); CWE-22, 434, 287, 798

ATT&CK-Derived (8)

  11. num_techniques (log)
  12. num_tactics (log)
  13. has_execution (binary)
  14. has_initial_access (binary)
  15. has_persistence (binary)
  16. has_priv_escalation (binary)
  17. has_defense_evasion (binary)
  18. severity_score (0-1, weighted)

Feature Processing

All count features (sightings, days, techniques) are log-transformed using log1p(x). Binary flags are set based on presence in high-risk sets. Severity score is normalized to 0-1 using weighted tactic contributions.
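The log1p transform described above in a couple of lines: it compresses heavy-tailed counts while keeping zero counts at exactly 0.

```python
import numpy as np

# log1p(x) = log(1 + x): zero counts stay 0, large counts are compressed
counts = np.array([0, 1, 10, 1000])
transformed = np.log1p(counts)
```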

Why Use GNN?

Interpretability

Can explain predictions through the graph path: "This CVE is risky because CWE-89 links to T1190 (Exploit Public-Facing Application) used by APT28."

Transductive Learning

Benefits from the entire graph structure. New CVEs inherit knowledge from similar CWEs that were exploited before.

Threat Intelligence

Connects CVEs to real threat groups and malware. If APT29 uses techniques linked to a CVE's CWE, risk increases.

Adaptive Ensemble System

The final prediction combines three components with regime-specific weights learned through logistic regression.

P_final = w_ml × P_ml + w_kg × P_kg + w_sim × P_sim

P_ml

ML Model Prediction

Early Premium or MLP based on regime

P_kg

Knowledge Graph Score

CWE rates + ATT&CK + vendor history

P_sim

Similarity Score

Sentence-BERT description similarity

How Weights Were Learned

# From: 10_adaptive_risk_model.py
# Learn optimal weights via logistic regression on component scores

import numpy as np
from sklearn.linear_model import LogisticRegression

# Stack the three component scores
X = np.column_stack([ml_scores, kg_scores, similarity_scores])
y = exploitation_labels

# Train logistic regression to learn the optimal combination
model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# Extract learned weights (normalized to sum to 1)
raw_weights = model.coef_[0]
weights = raw_weights / raw_weights.sum()

# Result: [0.442, 0.312, 0.246] for global weights
# Per-regime weights vary based on data availability

Learned Weights by Data Regime

Regime      ML Weight   KG Weight   Similarity Weight   Dominant
SPARSE      33.3%       0.0%        66.7%               Similarity
MODERATE    41.5%       16.9%       41.6%               Balanced
RICH        39.9%       27.2%       33.0%               ML Model
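Applying these per-regime weights is a dictionary lookup plus the weighted sum from the ensemble formula; a small sketch:

```python
# Regime-specific weights from the table above, applied to get P_final.
REGIME_WEIGHTS = {
    "SPARSE":   {"ml": 0.333, "kg": 0.000, "sim": 0.667},
    "MODERATE": {"ml": 0.415, "kg": 0.169, "sim": 0.416},
    "RICH":     {"ml": 0.399, "kg": 0.272, "sim": 0.330},
}

def p_final(regime: str, p_ml: float, p_kg: float, p_sim: float) -> float:
    # P_final = w_ml * P_ml + w_kg * P_kg + w_sim * P_sim
    w = REGIME_WEIGHTS[regime]
    return w["ml"] * p_ml + w["kg"] * p_kg + w["sim"] * p_sim
```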

Similarity Component (P_sim)

Uses Sentence-BERT (all-MiniLM-L6-v2) to compute semantic similarity between the target CVE's description and descriptions of known exploited CVEs.

How It Works

  1. Encode target CVE description → 384-dim vector
  2. Compare against 10,000+ exploited CVE embeddings
  3. Find top-k most similar exploited CVEs
  4. Compute weighted average similarity as P_sim

Why 17.9% Feature Importance?

desc_similarity_max is the #1 most important feature because CVE descriptions contain strong signals about exploitability. Similar language to past exploited CVEs indicates similar attack surface.

# Similarity calculation
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-MiniLM-L6-v2')

# Encode CVE description
embedding = model.encode(cve_description)  # 384-dim vector

# Compare to exploited CVE index
similarities = cosine_similarity([embedding], exploited_embeddings)[0]

# P_sim = weighted blend of top-k similarity statistics
max_similarity = similarities.max()
mean_top10 = np.sort(similarities)[-10:].mean()
weighted_avg = np.average(similarities, weights=similarities)  # e.g. similarity-weighted mean
p_sim = 0.5 * max_similarity + 0.3 * mean_top10 + 0.2 * weighted_avg

Complete Feature List

Top 15 Features by Importance (Early Premium)

Feature Categories