Independent Research

Temporal Cross-Validation

Clean year-over-year prediction: FRED scores computed entirely from 2024 data, tested against 2025 crash outcomes. No data leakage. No circular validation. Honest temporal prediction.

441,199 carriers
0.585 temporal AUC
19.9× grade EB-rate separation
4.85× crash history RR

1 Study Design

Temporal Split

Y1 (2024) predictors: All FRED component scores (crash, behavioral, equipment, and severe RRs, plus peer_index) recomputed from 2024 raw CSV data using Empirical Bayes. No database FRED scores used.
Y2 (2025) outcomes: Crash events from FMCSA crashes.csv with REPORT_DATE in 2025.
Why this matters: The original FRED scores in the database use a 24-month window that includes 2025 data. Using them to "predict" 2025 crashes would be circular. This study recomputes all scores from scratch using only 2024 data.

Population

Total eligible carriers: 441,199
Y2 crash rate: 10.2%
Small (1-5 trucks): 366,822 (83.1%)
Medium (6-20): 54,515 (12.4%)
Large (21-100): 16,813 (3.8%)
XLarge (101+): 3,049 (0.7%)

Method

  • Empirical Bayes (Gamma-Poisson) per fleet-size band
  • 4-component model: crash (56%), behavioral (18%), equipment (14%), severe (12%)
  • AUC with 1,000-iteration bootstrap 95% CIs
  • Logistic regression with standardized coefficients
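The AUC and bootstrap steps listed above can be sketched as follows. This is a minimal standard-library illustration, not the study's code: the percentile bootstrap, the function names `auc` and `bootstrap_auc_ci`, and the handling of single-class resamples are all assumptions, since the source only states "1,000-iteration bootstrap 95% CIs".

```python
import random

def auc(scores, labels):
    """Rank-based AUC: P(positive outranks negative), ties counted as 1/2."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied scores so they share an average rank.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative")
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def bootstrap_auc_ci(scores, labels, n_iter=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for AUC (assumed variant), resampling carriers."""
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    while len(stats) < n_iter:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain only one class
            stats.append(auc([scores[i] for i in idx], ys))
    stats.sort()
    lo = stats[int((alpha / 2) * n_iter)]
    hi = stats[int((1 - alpha / 2) * n_iter) - 1]
    return lo, hi

# Toy data: label 1 means the carrier had any Y2 crash.
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_labels = [0, 0, 1, 1]
print(auc(toy_scores, toy_labels))
```

In the toy data, three of the four crash/non-crash pairs are ordered correctly, so the AUC is 0.75.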

2 Model Discrimination (AUC)

AUC measures the probability that a randomly chosen crash carrier ranks higher than a randomly chosen non-crash carrier. All models are tested against the same Y2 (2025) binary outcome: did the carrier have any crash?

Model | Source | AUC | 95% CI
FRED Peer Index | Y1-only EB scores | 0.585 | 0.582–0.588
ISS Score | FMCSA Inspection Selection | 0.676 | 0.673–0.678
ML20 Crash Prob | FMCSA Machine Learning | 0.752 | 0.749–0.754
Interpreting these AUCs: FRED's peer_index is a rate-based score (crash risk per 100k miles), while the binary outcome (any crash?) is heavily driven by fleet size and mileage exposure. A large carrier running 50M miles/year almost certainly has at least one crash regardless of safety quality. ISS and ML20 are trained on binary outcomes and naturally predict them better. For rate-based metrics like EB crash rates, FRED shows 19.9× separation between best and worst grades.
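The exposure effect described above can be made concrete with a small sketch. It assumes crashes follow a Poisson process with a constant rate per 100k miles, a simplification of the Gamma-Poisson model used in the study. The two rates are the Excellent and Critical EB rates from Section 3, the 50M-mile figure is the example above, and the 100k-mile small carrier is hypothetical.

```python
import math

def p_any_crash(rate_per_100k_miles, annual_miles):
    """P(at least one crash in a year) under a Poisson model (assumption)."""
    expected_crashes = rate_per_100k_miles * annual_miles / 100_000
    return 1.0 - math.exp(-expected_crashes)

# Low-rate carrier with very high exposure (expected crashes = 7.0).
large_low_rate = p_any_crash(0.014, 50_000_000)
# High-rate carrier with low, hypothetical exposure (expected crashes = 0.273).
small_high_rate = p_any_crash(0.273, 100_000)

print(f"large, low rate:  {large_low_rate:.3f}")
print(f"small, high rate: {small_high_rate:.3f}")
```

Under these assumptions the low-rate carrier is almost certain to record a crash while the high-rate carrier is not, which is why a rate-based score can rank carriers sensibly and still score a modest AUC on the binary outcome.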

FRED AUC by Fleet-Size Band

Band | n | AUC | 95% CI
Small (1-5) | 366,822 | 0.582 | 0.577–0.586
Medium (6-20) | 54,515 | 0.553 | 0.547–0.558
Large (21-100) | 16,813 | 0.582 | 0.574–0.591
XLarge (101+) | 3,049 | 0.649 | 0.620–0.683

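AUC is highest in the XLarge band (0.649), where there is more data per carrier, so EB estimates are more stable and the rate-to-binary mismatch is reduced. The trend is not monotonic, however: Medium (0.553) scores below both Small and Large (0.582 each).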

3 Grade-Level Separation

Grades assigned from Y1-only peer_index using production thresholds. Two metrics shown: binary crash rate (% with any crash) and EB crash rate (crashes per 100k miles).

Grade | n | Crashed | % Crashed | 95% CI | EB Rate (crashes per 100k mi)
Excellent | 21,950 | 2,386 | 10.87% | 10.46–11.27 | 0.01374
Strong | 34,099 | 2,700 | 7.92% | 7.63–8.21 | 0.02113
Satisfactory | 188,361 | 13,096 | 6.95% | 6.84–7.07 | 0.03272
Marginal | 81,744 | 10,730 | 13.13% | 12.89–13.35 | 0.04552
Poor | 54,758 | 8,905 | 16.26% | 15.97–16.56 | 0.05888
Critical | 60,287 | 6,990 | 11.59% | 11.34–11.85 | 0.27295
EB Rate Separation: 19.9×
Critical carriers have 19.9× the crash rate per 100k miles vs Excellent. EB rates are perfectly monotonic across all six grades (0.014 → 0.273).
Binary Rate Non-Monotonic
The "% crashed" metric (any crash: yes/no) is not monotonic because it conflates rate and exposure. A high-mileage carrier rated "Critical" via violations may still have lower binary crash probability than a moderate carrier with substantial crash exposure.
Chi-square: 5,483 (p < 10⁻³⁰⁰); grades are not independent of crash outcome
Spearman rho: 0.080 (p < 10⁻³⁰⁰); positive monotone correlation with Y2 crash
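The two separation figures in this section can be re-derived from the published grade counts. The sketch below only redoes that arithmetic; the variable and function names are illustrative.

```python
# (grade, carriers, carriers with a Y2 crash, EB crash rate per 100k miles)
GRADES = [
    ("Excellent", 21_950, 2_386, 0.01374),
    ("Strong", 34_099, 2_700, 0.02113),
    ("Satisfactory", 188_361, 13_096, 0.03272),
    ("Marginal", 81_744, 10_730, 0.04552),
    ("Poor", 54_758, 8_905, 0.05888),
    ("Critical", 60_287, 6_990, 0.27295),
]

def is_strictly_increasing(values):
    return all(a < b for a, b in zip(values, values[1:]))

eb_rates = [row[3] for row in GRADES]
binary_rates = [row[2] / row[1] for row in GRADES]

# Worst grade divided by best grade, for each metric.
eb_separation = eb_rates[-1] / eb_rates[0]
binary_separation = binary_rates[-1] / binary_rates[0]

print(f"EB rate separation:     {eb_separation:.1f}x")
print(f"Binary rate separation: {binary_separation:.1f}x")
print("EB monotonic:", is_strictly_increasing(eb_rates))
print("Binary monotonic:", is_strictly_increasing(binary_rates))
```

The EB-rate ratio rounds to 19.9× and the binary ratio to 1.1×, and only the EB rates increase at every grade step.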

4 Crash History (Clean Temporal)

2024 crashes counted directly from FMCSA crash CSV, tested against 2025 crash outcomes. This is the strongest single predictor.

Had Y1 crash vs not: 4.85× (35.6% vs 7.3%)
Y2 rate with 3+ Y1 crashes: 82.3% (n = 5,628)
Y2 rate with 0 Y1 crashes: 7.3% (n = 397,206)

Dose-Response: More Y1 Crashes → Higher Y2 Risk

Y1 Crashes | n | Y2 Crash Rate | 95% CI | RR vs 0
0 | 397,206 | 7.3% | 7.3–7.4 | reference
1 | 32,019 | 24.7% | 24.2–25.2 | 3.38×
2 | 6,346 | 49.0% | 47.8–50.3 | 6.71×
3+ | 5,628 | 82.3% | 81.3–83.3 | 11.27×
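The relative risks in the dose-response table are each group's Y2 crash rate divided by the zero-crash baseline. The sketch below recomputes them from the rounded rates and adds a normal-approximation (Wald) interval for each rate. The Wald interval is an assumption: the source does not say how its 95% CIs were computed, and results from rounded inputs can differ from the table in the last digit.

```python
import math

# (Y1 crash count, carriers, Y2 crash rate) copied from the dose-response table
ROWS = [
    ("0", 397_206, 0.073),
    ("1", 32_019, 0.247),
    ("2", 6_346, 0.490),
    ("3+", 5_628, 0.823),
]

def wald_ci(p, n, z=1.96):
    """Normal-approximation 95% interval for a proportion (assumed method)."""
    half_width = z * math.sqrt(p * (1.0 - p) / n)
    return p - half_width, p + half_width

def dose_response(rows):
    """Return (label, RR vs first row, CI low, CI high) for each row."""
    baseline = rows[0][2]
    out = []
    for label, n, rate in rows:
        lo, hi = wald_ci(rate, n)
        out.append((label, rate / baseline, lo, hi))
    return out

for label, rr, lo, hi in dose_response(ROWS):
    print(f"{label:>2}  RR={rr:5.2f}x  95% CI={lo:.3f}-{hi:.3f}")
```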

5 Behavioral Violation Dose-Response (Clean Temporal)

2024 violation types from raw CSV, tested against 2025 crash outcomes. Clean temporal separation — no score contamination. Behavioral violations show a strong, monotonic dose-response relationship with future crash risk.

Violation Factor | n (with) | Crash % (with) | Crash % (without) | RR
Reckless Driving | 219 | 53.9% | 10.1% | 5.32×
Drugs / Alcohol | 2,381 | 36.8% | 10.0% | 3.68×
Any Speeding | 41,520 | 34.3% | 7.6% | 4.49×
1+ behavioral types | 88,138 | 25.1% | 6.4% | 3.92×
2+ behavioral types | 31,918 | 41.3% | 7.7% | 5.35×
3+ behavioral types | 13,861 | 57.8% | 8.6% | 6.72×
4+ behavioral types | 6,896 | 71.9% | 9.2% | 7.84×
1+ equipment types | 132,567 | 20.0% | 5.9% | 3.36×
2+ equipment types | 75,511 | 25.5% | 7.0% | 3.65×

6 Severe Violations (Severity Weight ≥ 7)

Overall

With severe violations (n=118,986): 21.6% Y2 crash rate
No severe violations (n=322,213): 5.9% Y2 crash rate
Relative Risk: 3.63× (Mann-Whitney p < 10⁻³⁰⁰)

Specific Risk Factors

Factor | n | RR
Reckless Driving | 219 | 5.32×
Drugs / Alcohol | 2,381 | 3.68×

7 Component Analysis

Logistic Regression Importance

Standardized coefficients from Y1 component RRs predicting Y2 binary crash.

Component | Importance | OR
Equipment RR | 42.0% | 0.997
Severe RR | 29.2% | 1.011
Crash RR | 28.1% | 1.036
Behavioral RR | 0.7% | 1.000

Ablation Study (AUC)

Single-component AUC and impact of removing each component from the full model.

Component | Alone AUC | AUC Drop (when removed)
Behavioral RR | 0.626 | +0.000
Crash RR | 0.542 | +0.025
Equipment RR | 0.529 | +0.011
Severe RR | 0.528 | +0.017
Key insight: Behavioral violations (AUC=0.626) are clearly the strongest single temporal predictor among the four FRED components. Crash history (0.542), equipment (0.529), and severe (0.528) are all near random when used alone. The behavioral component carries nearly all the genuine temporal signal, consistent with prior research showing behavioral violations are the strongest predictors of future crashes.

8 Age × Fleet Size (Simpson's Paradox)

Among small and medium carriers, younger companies have higher crash rates. The apparent reversal for large/xlarge fleets is a sample artifact: "young large fleets" barely exist (n=195 and n=21), since fleets grow over time and rarely appear as large carriers from day one. The few that do are typically corporate restructurings or rebrands, not genuinely new operations.

Fleet Band | Young (0-2yr) | Old (21+yr) | n (young) | Direction
Small (1-5) | 5.9% | 4.6% | 30,390 | Young riskier (1.28×)
Medium (6-20) | 21.6% | 19.9% | 1,341 | Young riskier (1.09×)
Large (21-100) | 39.0% | 53.0% | 195 | Unreliable (tiny n)
XLarge (101+) | 71.4% | 90.8% | 21 | Unreliable (tiny n)
Takeaway: For small and medium carriers (95.5% of the population), younger companies are genuinely riskier — a 1.09-1.28× relative risk that is robust across large sample sizes. The "reversal" for large/xlarge fleets is not meaningful because young large carriers are not a natural category; fleets grow over time and the handful of exceptions are likely restructured entities, not inexperienced operations.

The aggregate "old riskier" pattern (0.45×) is a textbook Simpson's paradox: large carriers are both older and have higher binary crash rates (due to mileage exposure), dragging the aggregate in the opposite direction from the within-band relationship.
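The mechanism can be illustrated with a small sketch. The within-band crash rates and the young-carrier counts are taken from the table above; the old-carrier counts are HYPOTHETICAL, chosen only to show how a size mix skewed toward large fleets flips the aggregate comparison. The sketch is not expected to reproduce the reported 0.45× figure.

```python
# band: (young rate, young n, old rate, old n)
# Rates and young n come from the table; old n values are HYPOTHETICAL.
BANDS = {
    "Small": (0.059, 30_390, 0.046, 40_000),
    "Medium": (0.216, 1_341, 0.199, 15_000),
    "Large": (0.390, 195, 0.530, 8_000),
    "XLarge": (0.714, 21, 0.908, 2_000),
}

def aggregate_rate(rates_and_counts):
    """Pooled crash rate across bands, weighting each band by its carrier count."""
    crashed = sum(rate * n for rate, n in rates_and_counts)
    total = sum(n for _, n in rates_and_counts)
    return crashed / total

young_agg = aggregate_rate([(yr, yn) for yr, yn, _, _ in BANDS.values()])
old_agg = aggregate_rate([(orate, on) for _, _, orate, on in BANDS.values()])

print(f"aggregate young: {young_agg:.3f}")
print(f"aggregate old:   {old_agg:.3f}")
print(f"young / old:     {young_agg / old_agg:.2f}x")
```

Young carriers are riskier within the Small and Medium bands, yet the pooled young rate comes out lower, because almost all young carriers sit in the low-exposure Small band.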

9 Decile Capture Analysis

Carriers ranked by Y1-only peer_index (highest risk first), then split into deciles.

Top 30% captures: 43.3% of Y2 crashes
Top 20% captures: 19.1% of Y2 crashes
Top 10% captures: 7.3% of Y2 crashes
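Decile | n | Y2 Crashes (count) | % of Total | Cumulative | Crash Rate (% with any crash)
D1 (riskiest) | 44,119 | 6,629 | 7.3% | 7.3% | 11.1%
D2 | 44,120 | 10,795 | 11.8% | 19.1% | 14.0%
D3 | 44,120 | 22,118 | 24.2% | 43.3% | 19.4%
D4 | 44,120 | 14,491 | 15.9% | 59.2% | 11.0%
D5 | 44,120 | 11,573 | 12.7% | 71.9% | 11.0%
D6 | 44,120 | 8,029 | 8.8% | 80.7% | 8.2%
D7 | 44,120 | 3,699 | 4.0% | 84.7% | 5.5%
D8 | 44,120 | 4,778 | 5.2% | 89.9% | 5.4%
D9 | 44,120 | 3,527 | 3.9% | 93.8% | 6.3%
D10 (safest) | 44,120 | 5,760 | 6.3% | 100.0% | 9.7%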
Note: The peer_index is a rate-based score. Decile crash capture (count-based) is expected to be lower than for count-based models like ML20, because rate-based scores rank by risk intensity, not total expected crashes. The top 30% capturing 43.3% of all crashes demonstrates meaningful risk stratification despite this mismatch.
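The capture percentages are running shares of the Y2 crash counts per decile. A minimal sketch that recomputes them from the decile table (the function name is illustrative):

```python
# Y2 crash counts per decile, D1 (riskiest) through D10 (safest), from the table
DECILE_CRASHES = [6_629, 10_795, 22_118, 14_491, 11_573,
                  8_029, 3_699, 4_778, 3_527, 5_760]

def cumulative_capture(counts):
    """Share of all crashes captured by the top k deciles, for k = 1..len(counts)."""
    total = sum(counts)
    running = 0
    shares = []
    for count in counts:
        running += count
        shares.append(running / total)
    return shares

capture = cumulative_capture(DECILE_CRASHES)
for k in (1, 2, 3):
    print(f"Top {k * 10}% captures {capture[k - 1] * 100:.1f}% of Y2 crashes")
```

A score with no ranking power would capture about 10% per decile, so the top three deciles would hold roughly 30% of crashes rather than 43.3%.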

10 Comparison with Original Study

How this clean temporal validation compares with the original empirical validation.

Metric | Original Study | This Study (Y1-only) | Notes
FRED AUC | 0.852 | 0.585 | Original used 24-mo window (includes Y2 data)
Grade RR (% crashed) | 18.2× | 1.1× | Binary rate confounded by exposure
Grade RR (EB rate) | n/a | 19.9× | New metric: EB crash rate per 100k mi
Crash history RR | 1.41× | 4.85× | Clean CSV-based vs DB (leaked)
Behavioral 4+ types RR | 8.17× | 7.84× | Both clean temporal; consistent
Reckless driving RR | 3.94× | 5.32× | Both clean temporal; consistent
Population | 180,402 | 441,199 | Different eligibility filters
Score source | DB FRED scores | Y1-only EB recomputation | Key methodological difference
Summary: The original study's high AUC (0.852) and grade RR (18.2×) were inflated by temporal leakage: the FRED scores included 2025 crash data that was also the test outcome. With clean temporal separation, the FRED peer_index shows genuine but modest binary prediction (AUC=0.585). However, the violation-type analyses (Sections 5-6) were already temporally clean in both studies and show consistent results: behavioral violations are powerful predictors of future crash risk (RR 3.9-7.8×).

11 Spearman Rank Correlations with Y2 Crash

Predictor | Spearman rho | p-value
ML20 crash prob | 0.264 | < 10⁻³⁰⁰
ISS score | 0.184 | < 10⁻³⁰⁰
Behavioral RR (Y1) | 0.132 | < 10⁻³⁰⁰
FRED Peer Index (Y1) | 0.089 | < 10⁻³⁰⁰
Crash RR (Y1) | 0.044 | 1.2 × 10⁻¹⁸⁵
Equipment RR (Y1) | 0.030 | 6.2 × 10⁻⁸⁹
Severe RR (Y1) | 0.029 | 4.3 × 10⁻⁸²

All correlations are highly significant. Behavioral violations (rho=0.132) stand out as the strongest FRED component: 3× stronger than crash history (0.044) and more than 4× stronger than equipment (0.030) or severe (0.029). The composite peer_index (0.089) is diluted by the heavy crash weight (56%), because crash RR carries only a third of the behavioral component's rank correlation.

Methodology & Limitations

Empirical Bayes Computation: For each of the 4 components (crash, behavioral, equipment, severe), Gamma(alpha, beta) priors are estimated per fleet-size band via exposure-weighted method of moments with 1% trimming. EB posterior mean = (alpha + count) / (beta + exposure). Relative Risk = EB_rate / band_mean. Peer_index = 0.56×crash_rr + 0.18×behavioral_rr + 0.14×equipment_rr + 0.12×severe_rr.
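The posterior mean, relative risk, and peer_index formulas above can be sketched as follows. The prior parameters and band mean in the example are HYPOTHETICAL placeholders; the exposure-weighted method-of-moments fit with 1% trimming is not reproduced here.

```python
# Component weights as stated in the peer_index formula
WEIGHTS = {"crash": 0.56, "behavioral": 0.18, "equipment": 0.14, "severe": 0.12}

def eb_rate(count, exposure, alpha, beta):
    """Gamma-Poisson posterior mean rate: (alpha + count) / (beta + exposure)."""
    return (alpha + count) / (beta + exposure)

def relative_risk(count, exposure, alpha, beta, band_mean):
    """EB rate divided by the mean rate of the carrier's fleet-size band."""
    return eb_rate(count, exposure, alpha, beta) / band_mean

def peer_index(component_rrs):
    """Weighted sum of the four component relative risks."""
    return sum(WEIGHTS[name] * component_rrs[name] for name in WEIGHTS)

# HYPOTHETICAL prior: alpha=0.5, beta=10, i.e. prior mean 0.05 events per 100k miles.
ALPHA, BETA, BAND_MEAN = 0.5, 10.0, 0.05

# Two carriers with 200k miles (2 exposure units): zero crashes vs three crashes.
clean = relative_risk(0, 2.0, ALPHA, BETA, BAND_MEAN)
crashy = relative_risk(3, 2.0, ALPHA, BETA, BAND_MEAN)
print(f"RR with 0 crashes: {clean:.2f}")
print(f"RR with 3 crashes: {crashy:.2f}")
```

With little exposure, the zero-crash carrier is pulled toward the band mean rather than scored as risk-free, which is the shrinkage the Gamma prior provides.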
Rate vs Count Mismatch: The FRED peer_index is a rate-based metric (risk per unit of exposure). The binary outcome "any crash in Y2" is count-based and heavily influenced by fleet size. This mismatch systematically depresses AUC and binary RR metrics. EB rates and Spearman correlations are more appropriate for evaluating rate-based scores.
Exposure Definition: Annual mileage from FMCSA MCS-150 filing, converted to 100k-mile units. This is a self-reported annual figure and may not precisely match actual 2024 mileage.
Prior Fallback: For components where the method-of-moments estimation degenerates (alpha or beta ≤ 0.01), a pooled prior is computed from the non-degenerate fleet-size bands, weighted by sample size. This primarily affects the severe and equipment components in the small-carrier band, where most carriers have zero violations and the Gamma-Poisson model cannot distinguish "low risk" from "no data."
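One way the fallback could work is sketched below. The source does not say exactly which quantity is pooled, so averaging alpha and beta across non-degenerate bands, weighted by carrier count, is an assumption, and all prior values in the example are hypothetical.

```python
DEGENERATE_THRESHOLD = 0.01  # alpha or beta at or below this counts as degenerate

def is_degenerate(alpha, beta):
    return alpha <= DEGENERATE_THRESHOLD or beta <= DEGENERATE_THRESHOLD

def pooled_prior(band_priors):
    """Sample-size-weighted mean of (alpha, beta) over non-degenerate bands.

    band_priors maps band name -> (alpha, beta, n). ASSUMED pooling rule.
    """
    usable = [(a, b, n) for a, b, n in band_priors.values() if not is_degenerate(a, b)]
    if not usable:
        raise ValueError("no non-degenerate band to pool from")
    total_n = sum(n for _, _, n in usable)
    alpha = sum(a * n for a, _, n in usable) / total_n
    beta = sum(b * n for _, b, n in usable) / total_n
    return alpha, beta

def resolve_priors(band_priors):
    """Keep each band's own prior unless it is degenerate; then use the pooled one."""
    resolved = {}
    for band, (alpha, beta, _) in band_priors.items():
        if is_degenerate(alpha, beta):
            resolved[band] = pooled_prior(band_priors)
        else:
            resolved[band] = (alpha, beta)
    return resolved

# HYPOTHETICAL fitted priors; the small band's alpha has collapsed toward zero.
example = {
    "small": (0.001, 5.0, 366_822),
    "medium": (1.0, 10.0, 1),
    "large": (2.0, 20.0, 3),
}
print(resolve_priors(example)["small"])
```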
Y1 Window: January 1 – December 31, 2024 for all predictor events. Y2 outcome window: January 1 – December 31, 2025. No overlap between predictor and outcome periods.