Monte Carlo Dropout is a Bayesian approximation technique. Instead of making a single prediction, we run the model 50 times with dropout enabled, so a different random subset of neurons is disabled on each pass. The variance across these predictions tells us how uncertain the model is.
Epistemic (model) uncertainty - How unfamiliar the model is with this CVE pattern.
High epistemic = a novel CVE type; the model is extrapolating beyond its training data. Consider expert review.
Aleatoric (data) uncertainty - Inherent ambiguity in the CVE's features.
High aleatoric = the CVE has characteristics of both exploited and non-exploited classes. It's genuinely borderline.
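The two uncertainty types can be separated from the MC Dropout samples themselves. The decomposition below (variance of the per-pass probabilities for epistemic, mean Bernoulli variance for aleatoric) is a common illustrative approximation; the sample values are invented:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical MC Dropout output: 50 per-pass exploit probabilities for one CVE
probs = np.clip(rng.normal(loc=0.62, scale=0.05, size=50), 0.0, 1.0)

# Epistemic (model) uncertainty: spread of the predictions across passes.
# High when the passes disagree, i.e. the model is unsure of its own weights.
epistemic = probs.var()

# Aleatoric (data) uncertainty: mean Bernoulli variance p * (1 - p).
# Largest when predictions hover near 0.5, i.e. a genuinely borderline CVE.
aleatoric = (probs * (1.0 - probs)).mean()

# Total predictive uncertainty under this decomposition
total = epistemic + aleatoric
```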
Security teams need to know not just what the prediction is, but how confident the model is. High uncertainty predictions should be escalated to human experts, while confident predictions can be used for automated prioritization.
We use two trained ML regression models to predict how many days after publication a CVE will be exploited. Both models were trained on 39,426 CVEs with known exploit dates from 6 sources (NVD exploit tags, ExploitDB, VulnCheck KEV, CISA KEV, threat sightings, and ZeroDay tracker).
The number of days between a CVE's publication date and its first known exploit. Negative values mean the exploit appeared before the CVE was published (zero-day).
Gradient-boosted decision trees (XGBoost). Learns non-linear patterns from 24 features including vendor history, CVSS score, and description similarity to known exploits.
A 4-layer neural network (24 → 128 → 64 → 32 → 1) trained with the same 24 features. Captures different patterns than XGBoost.
The final estimate averages both models. When models agree closely, we have higher confidence.
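The averaging step can be sketched as follows; the day counts for the three hypothetical CVEs are invented, and the relative-gap agreement measure is illustrative rather than the production metric:

```python
import numpy as np

# Hypothetical days-to-exploit predictions for three CVEs
xgb_days = np.array([30.0, 120.0, 15.0])   # XGBoost estimate
nn_days  = np.array([34.0, 200.0, 14.0])   # neural-network estimate

# Final estimate: simple average of both models
final_days = (xgb_days + nn_days) / 2.0

# Agreement: relative gap between the two models.
# A small gap (models agree closely) suggests a more trustworthy estimate.
rel_gap = np.abs(xgb_days - nn_days) / np.maximum(final_days, 1e-9)
```

Here the second CVE has the widest gap (120 vs 200 days), so its averaged estimate deserves the least confidence.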
CVSS score, vendor exploit/patch/critical rates, product exploit rate, version history,
description similarity to known exploits (Sentence-BERT), patch information, and more.
These are the same features as the Early Premium classification model, minus
days_since_pub (which would leak time information).
Confidence measures how much the three ensemble components agree:
High confidence (>80%): All three methods agree → reliable prediction
Low confidence (<60%): Methods disagree → consider manual review
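One illustrative way to turn three-way agreement into a confidence score; this spread-based formula is an assumption for demonstration, not the production calculation:

```python
import numpy as np

def ensemble_confidence(ml: float, kg: float, sim: float) -> float:
    """Illustrative agreement score: 1 minus the spread of the three
    component probabilities (all assumed to lie in [0, 1])."""
    scores = np.array([ml, kg, sim])
    return 1.0 - (scores.max() - scores.min())

# Close agreement between the three methods -> high confidence
high = ensemble_confidence(0.81, 0.78, 0.84)
# Wide disagreement -> low confidence, flag for manual review
low = ensemble_confidence(0.90, 0.30, 0.55)
```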
Neural networks trained on CVE exploitation data
Heuristic calculations (not trained models)
The features vendor_exploit_rate, product_exploit_rate, version_exploit_rate_before, version_exploit_rate_after, and version_risk_delta are used to predict exploitation risk based on historical patterns.
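These heuristic rates can be sketched as simple group means over historical data. The tiny table, vendor names, and column layout below are invented for illustration; only the feature names come from the text:

```python
import pandas as pd

# Toy historical CVE table (invented data)
df = pd.DataFrame({
    "vendor":       ["acme", "acme", "acme", "globex"],
    "product":      ["widget", "widget", "gadget", "portal"],
    "is_exploited": [1, 0, 1, 0],
})

# Heuristic rate features: historical exploitation frequency per group,
# computed directly from the data with no trained model involved
df["vendor_exploit_rate"] = df.groupby("vendor")["is_exploited"].transform("mean")
df["product_exploit_rate"] = df.groupby("product")["is_exploited"].transform("mean")
```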
CVEs with similar descriptions that were exploited (Sentence-BERT similarity)
Higher confidence = better agreement between ML, KG, and Similarity scores
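The Sentence-BERT similarity signal can be illustrated with cosine similarity over precomputed embeddings. The 3-dimensional vectors below are toy stand-ins for real sentence embeddings (which typically have hundreds of dimensions):

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical description embeddings (invented values)
new_cve     = np.array([0.9, 0.1, 0.4])
exploited_1 = np.array([0.8, 0.2, 0.5])   # known-exploited CVE, similar wording
exploited_2 = np.array([0.1, 0.9, 0.0])   # known-exploited CVE, unrelated wording

# Similarity signal: closeness to the nearest known-exploited description
sim_score = max(cosine_sim(new_cve, e) for e in (exploited_1, exploited_2))
```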
The model is 95% confident the true exploitation probability falls within this range. Narrow = more certain.
How unfamiliar the model is with this CVE's pattern
Inherent ambiguity in the CVE's features
High uncertainty detected (std > 4%). This prediction should be reviewed by a security expert before it informs critical decisions.
Low uncertainty — model predictions are consistent across all MC passes. Safe for automated decision-making.
Click "Run MC Dropout" to quantify prediction uncertainty
Runs 50 stochastic forward passes through the neural network with dropout enabled, producing a distribution of predictions to estimate how confident the model is.
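The procedure can be sketched end to end with a toy NumPy network standing in for the real model. The architecture, weights, and dropout rate below are invented; only the 50-pass loop, the 95% interval, and the std > 4% review threshold mirror the text:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy one-hidden-layer network in place of the real neural network
W1 = rng.normal(size=(24, 32)) * 0.1
b1 = np.zeros(32)
W2 = rng.normal(size=(32, 1)) * 0.1
b2 = np.zeros(1)
x = rng.normal(size=24)  # one CVE's 24-dim feature vector (invented)

def forward_with_dropout(p_drop: float = 0.2) -> float:
    h = np.maximum(x @ W1 + b1, 0.0)        # ReLU hidden layer
    mask = rng.random(h.shape) >= p_drop    # dropout stays ON at inference
    h = h * mask / (1.0 - p_drop)           # inverted dropout scaling
    z = (h @ W2 + b2).item()                # scalar logit
    return 1.0 / (1.0 + np.exp(-z))         # exploit probability

# 50 stochastic forward passes -> a distribution of predictions
samples = np.array([forward_with_dropout() for _ in range(50)])
mean_p, std_p = samples.mean(), samples.std()

# Empirical 95% interval; narrow = more certain
lo, hi = np.percentile(samples, [2.5, 97.5])

# Escalate to an expert when the spread exceeds the 4% threshold
needs_review = std_p > 0.04
```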
Every CVE is scored by three independent signals: ML model, Knowledge Graph, and Text Similarity. The data regime (SPARSE / MODERATE / RICH) determines how much to trust each signal, using weights learned from actual exploitation data via logistic regression.
Weights are not set manually. They are learned empirically from 322,763 CVEs with exploitation labels from 7 independent sources (~40K exploited CVEs from ExploitDB, CISA KEV, VulnCheck KEV, etc.).
LogisticRegression(is_exploited ~ kg_score + ml_score + sim_score)

At inference, we use the full LR model (not just the weights) to preserve the intercept and scaling.
This achieves AUC ~0.84, compared to AUC ~0.68 from naive weighted averaging, because the LR model captures intercept and input standardisation that raw weighting misses.
Weight learning (11_learn_ensemble_weights.py):
# Per-regime logistic regression training
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

feature_names = ['kg_score', 'ml_score', 'sim_score']
ensemble_weights = {}
for regime in ['sparse', 'moderate', 'rich']:
    mask = df['regime'] == regime
    X = df.loc[mask, feature_names]
    y = df.loc[mask, 'is_exploited']
    scaler = StandardScaler().fit(X)
    X_scaled = scaler.transform(X)
    lr = LogisticRegression().fit(X_scaled, y)
    # Normalise coefficients for interpretable weights
    abs_coef = np.abs(lr.coef_[0])
    weights = abs_coef / abs_coef.sum()
    # Save full model for scoring
    ensemble_weights[regime] = {
        'weights': dict(zip(feature_names, weights)),
        'lr_model': {
            'raw_coef': lr.coef_[0].tolist(),
            'intercept': lr.intercept_[0],
            'scaler_mean': scaler.mean_.tolist(),
            'scaler_scale': scaler.scale_.tolist(),
        },
    }
Inference scoring (predictor.py):
raw_coef = np.array(lr_model['raw_coef'])
intercept = lr_model['intercept']
scaler_mean = np.array(lr_model['scaler_mean'])
scaler_scale = np.array(lr_model['scaler_scale'])
# Order: [kg_score, ml_score, sim_score]
x = np.array([p_kg, p_ml, p_sim])
x_scaled = (x - scaler_mean) / scaler_scale
z = np.dot(raw_coef, x_scaled) + intercept
p_final = 1.0 / (1.0 + np.exp(-z))
| Regime | ML weight | KG weight | Similarity weight | Behaviour |
|---|---|---|---|---|
| SPARSE | 33.3% | 0.0% | 66.7% | Similarity-heavy |
| MODERATE | 41.5% | 16.9% | 41.6% | Balanced ML + Sim |
| RICH | 39.9% | 27.2% | 33.0% | All signals balanced |