Fleetidy — Temporal Cross-Validation Study

1 Study Design

Temporal Split

Y1 2024 predictors: All FRED component scores (crash, behavioral, equipment, and severe RRs, plus peer_index) were recomputed from 2024 raw CSV data using Empirical Bayes. No database FRED scores were used.

Y2 2025 outcomes: Crash events from FMCSA crashes.csv with REPORT_DATE in 2025.

Why this matters: The original FRED scores in the database use a 24-month window that includes 2025 data. Using them to "predict" 2025 crashes would be circular. This study recomputes all scores from scratch using only 2024 data.

Population

Total eligible carriers	441,199
Y2 crash rate	10.2%
Small (1-5 trucks)	366,822 (83.1%)
Medium (6-20)	54,515 (12.4%)
Large (21-100)	16,813 (3.8%)
XLarge (101+)	3,049 (0.7%)

Method

Empirical Bayes (Gamma-Poisson) per fleet-size band
4-component model: crash (56%), behavioral (18%), equipment (14%), severe (12%)
AUC with 1,000-iteration bootstrap 95% CIs
Logistic regression with standardized coefficients

2 Model Discrimination (AUC)

AUC measures the probability that a randomly chosen crash carrier ranks higher than a randomly chosen non-crash carrier. All models are tested against the same Y2 (2025) binary outcome: did the carrier have any crash?
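The pairwise definition above, together with the bootstrap CIs listed under Method, can be sketched as follows. This is a minimal standard-library illustration, not the study's actual code: the function names, the percentile-style interval, and the toy data are all assumptions.

```python
import random


def auc(scores, labels):
    """P(random positive outranks random negative); tied scores count as 1/2."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative")
    pairs = sorted(zip(scores, labels))
    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        # Find the block of tied scores pairs[i:j] and give it the average rank.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # 1-based ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(label for _, label in pairs[i:j])
        i = j
    # Mann-Whitney U statistic divided by the number of positive/negative pairs.
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def bootstrap_auc_ci(scores, labels, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI (assumed method; the report does not specify one)."""
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        boot_labels = [labels[k] for k in idx]
        if 0 < sum(boot_labels) < n:  # skip resamples containing a single class
            stats.append(auc([scores[k] for k in idx], boot_labels))
    stats.sort()
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Toy data (hypothetical): label 1 = carrier crashed in Y2.
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_labels = [0, 0, 1, 1]
toy_auc = auc(toy_scores, toy_labels)  # 3 of 4 positive/negative pairs ordered correctly
```

On a population of 441,199 carriers a vectorized implementation would be preferable; the sketch only makes the definition concrete.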

Model	Description	AUC	95% CI
FRED Peer Index	Y1-only EB scores	0.585	0.582–0.588
ISS Score	FMCSA Inspection Selection	0.676	0.673–0.678
ML20 Crash Prob	FMCSA Machine Learning	0.752	0.749–0.754

Interpreting these AUCs: FRED's peer_index is a rate-based score (crash risk per 100k miles), while the binary outcome (any crash?) is heavily driven by fleet size and mileage exposure. A large carrier running 50M miles/year almost certainly has at least one crash regardless of safety quality. ISS and ML20 are trained on binary outcomes and naturally predict them better. For rate-based metrics like EB crash rates, FRED shows 19.9× separation between best and worst grades.
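To make the exposure effect concrete, the sketch below converts a per-100k-mile crash rate into the probability of at least one crash in a year. It assumes crashes follow a Poisson process with a constant rate (in line with the Gamma-Poisson model used here, but still a simplification), and it borrows the Excellent and Critical grade-average EB rates from the Section 3 table purely as illustrative inputs. The function name and the 100k-mile small-carrier example are hypothetical.

```python
import math


def prob_any_crash(rate_per_100k_miles, annual_miles):
    """P(at least one crash in a year) under an assumed Poisson crash process."""
    expected_crashes = rate_per_100k_miles * annual_miles / 100_000
    return 1.0 - math.exp(-expected_crashes)


# Excellent-grade average EB rate, very large carrier (50M miles/year).
p_large_excellent = prob_any_crash(0.01374, 50_000_000)

# Critical-grade average EB rate, small carrier (100k miles/year, hypothetical).
p_small_critical = prob_any_crash(0.27295, 100_000)

print(f"Excellent, 50M miles:  {p_large_excellent:.3f}")
print(f"Critical, 100k miles:  {p_small_critical:.3f}")
```

Under these assumptions the safest-rated large carrier is far more likely to record a crash than the worst-rated small carrier, which is the mismatch between rate-based scores and the binary outcome.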

FRED AUC by Fleet-Size Band

Band	n	AUC	95% CI
Small (1-5)	366,822	0.582	0.577–0.586
Medium (6-20)	54,515	0.553	0.547–0.558
Large (21-100)	16,813	0.582	0.574–0.591
XLarge (101+)	3,049	0.649	0.620–0.683

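AUC is highest for the XLarge band (0.649), where carriers have more stable EB estimates (more data per carrier) and the rate-to-binary mismatch is reduced. The trend is not monotonic, however: Small and Large are tied at 0.582, and Medium is lowest at 0.553.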

3 Grade-Level Separation

Grades assigned from Y1-only peer_index using production thresholds. Two metrics shown: binary crash rate (% with any crash) and EB crash rate (crashes per 100k miles).

Grade	n	Crashed	% Crashed	95% CI	EB Rate
Excellent	21,950	2,386	10.87%	10.46–11.27	0.01374
Strong	34,099	2,700	7.92%	7.63–8.21	0.02113
Satisfactory	188,361	13,096	6.95%	6.84–7.07	0.03272
Marginal	81,744	10,730	13.13%	12.89–13.35	0.04552
Poor	54,758	8,905	16.26%	15.97–16.56	0.05888
Critical	60,287	6,990	11.59%	11.34–11.85	0.27295

EB Rate Separation: 19.9×
Critical carriers have 19.9× the crash rate per 100k miles vs Excellent. EB rates are perfectly monotonic across all six grades (0.014 → 0.273).

Binary Rate Non-Monotonic
The "% crashed" metric (any crash: yes/no) is not monotonic because it conflates rate and exposure. A low-mileage carrier rated "Critical" via violations may still have a lower binary crash probability than a moderately rated carrier with substantial mileage exposure.

Chi-square: 5,483 (p < 10^-300) — grades are not independent of crash outcome

Spearman rho: 0.080 (p < 10^-300) — positive monotone correlation with Y2 crash
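The chi-square statistic can be checked directly from the grade table. The sketch below runs a Pearson chi-square test of independence on the 6×2 table (grade × crashed / not crashed), using only the n and Crashed columns above. The variable names are illustrative, and the report does not say which software produced its figure.

```python
# (n, crashed) per grade, copied from the grade table above.
grades = {
    "Excellent": (21_950, 2_386),
    "Strong": (34_099, 2_700),
    "Satisfactory": (188_361, 13_096),
    "Marginal": (81_744, 10_730),
    "Poor": (54_758, 8_905),
    "Critical": (60_287, 6_990),
}

total_n = sum(n for n, _ in grades.values())
total_crashed = sum(c for _, c in grades.values())
overall_rate = total_crashed / total_n

chi2 = 0.0
for n, crashed in grades.values():
    expected_crashed = n * overall_rate
    expected_clean = n - expected_crashed
    observed_clean = n - crashed
    # Pearson contribution from both cells (crashed, not crashed) of this row.
    chi2 += (crashed - expected_crashed) ** 2 / expected_crashed
    chi2 += (observed_clean - expected_clean) ** 2 / expected_clean

dof = (len(grades) - 1) * (2 - 1)
print(f"N = {total_n:,}, crashed = {total_crashed:,}, rate = {overall_rate:.3%}")
print(f"chi-square = {chi2:,.1f} on {dof} degrees of freedom")
```

The row totals reproduce the eligible population (441,199) and the 10.2% Y2 crash rate, and the statistic should land close to the reported 5,483.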

4 Crash History (Clean Temporal)

2024 crashes counted directly from FMCSA crash CSV, tested against 2025 crash outcomes. This is the strongest single predictor.

Metric	Value	Detail
Had Y1 crash vs not (RR)	4.85×	35.6% vs 7.3%
Y2 rate with 3+ Y1 crashes	82.3%	n = 5,628
Y2 rate with 0 Y1 crashes	7.3%	n = 397,206

Dose-Response: More Y1 Crashes → Higher Y2 Risk

Y1 Crashes	n	Y2 Crash Rate	95% CI	RR vs 0
0	397,206	7.3%	7.3–7.4	—
1	32,019	24.7%	24.2–25.2	3.38×
2	6,346	49.0%	47.8–50.3	6.71×
3+	5,628	82.3%	81.3–83.3	11.27×
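The RR column is each group's Y2 crash rate divided by the rate for carriers with zero Y1 crashes. The sketch below recomputes it from the rounded rates in the table and adds a normal-approximation (Wald) interval for a proportion. The Wald method is an assumption: the report does not say how its CIs were computed, although this choice reproduces the 3+ row.

```python
import math

# (n, Y2 crash rate) per Y1 crash count, from the table above (rates are rounded).
y2_by_y1_crashes = {
    "0": (397_206, 0.073),
    "1": (32_019, 0.247),
    "2": (6_346, 0.490),
    "3+": (5_628, 0.823),
}


def wald_ci(p, n, z=1.96):
    """Normal-approximation 95% CI for a proportion (assumed method)."""
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


def relative_risk(p_group, p_reference):
    """Ratio of a group's crash rate to the reference group's crash rate."""
    return p_group / p_reference


baseline_rate = y2_by_y1_crashes["0"][1]
rr = {
    group: relative_risk(rate, baseline_rate)
    for group, (_, rate) in y2_by_y1_crashes.items()
    if group != "0"
}
ci_3plus = wald_ci(*reversed(y2_by_y1_crashes["3+"]))

for group, value in rr.items():
    print(f"{group} Y1 crashes: RR vs 0 = {value:.2f}x")
print(f"3+ group 95% CI: {ci_3plus[0]:.1%} to {ci_3plus[1]:.1%}")
```

Because the inputs are rounded to one decimal place, recomputed values can differ from the table in the last digit.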

5 Behavioral Violation Dose-Response (Clean Temporal)

2024 violation types from raw CSV, tested against 2025 crash outcomes. Clean temporal separation — no score contamination. Behavioral violations show a strong, monotonic dose-response relationship with future crash risk.

Violation Factor	n with	Crash % (with)	Crash % (without)	RR
Reckless Driving	219	53.9%	10.1%	5.32×
Drugs / Alcohol	2,381	36.8%	10.0%	3.68×
Any Speeding	41,520	34.3%	7.6%	4.49×
1+ behavioral types	88,138	25.1%	6.4%	3.92×
2+ behavioral types	31,918	41.3%	7.7%	5.35×
3+ behavioral types	13,861	57.8%	8.6%	6.72×
4+ behavioral types	6,896	71.9%	9.2%	7.84×
1+ equipment types	132,567	20.0%	5.9%	3.36×
2+ equipment types	75,511	25.5%	7.0%	3.65×

6 Severe Violations (Severity Weight ≥ 7)

Overall

Group	n	Y2 crash rate
With severe violations	118,986	21.6%
No severe violations	322,213	5.9%

Relative Risk: 3.63× (Mann-Whitney p < 10^-300)

Specific Risk Factors

Factor	n	RR
Reckless Driving	219	5.32×
Drugs / Alcohol	2,381	3.68×

7 Component Analysis

Logistic Regression Importance

Standardized coefficients from Y1 component RRs predicting Y2 binary crash.

Component	Importance	OR
Equipment RR	42.0%	0.997
Severe RR	29.2%	1.011
Crash RR	28.1%	1.036
Behavioral RR	0.7%	1.000

Ablation Study (AUC)

Single-component AUC and impact of removing each component from the full model.

Component	Alone AUC	AUC Drop
Behavioral RR	0.626	+0.000
Crash RR	0.542	+0.025
Equipment RR	0.529	+0.011
Severe RR	0.528	+0.017

Key insight: Used alone, the behavioral RR (AUC=0.626) is clearly the strongest temporal predictor among the four FRED components. Crash RR (0.542), equipment RR (0.529), and severe RR (0.528) are all close to chance (0.5) when used alone. The behavioral component therefore carries most of the standalone temporal signal, consistent with prior research showing behavioral violations are the strongest predictors of future crashes. This standalone strength is not reflected in the full model, however: removing the behavioral RR leaves AUC unchanged (+0.000), and its logistic-regression importance is only 0.7%.

8 Age × Fleet Size (Simpson's Paradox)

Among small and medium carriers, younger companies have higher crash rates. The apparent reversal for large/xlarge fleets is a sample artifact: "young large fleets" barely exist (n=195 and n=21), since fleets grow over time and rarely appear as large carriers from day one. The few that do are typically corporate restructurings or rebrands, not genuinely new operations.

Fleet Band	Young (0-2yr)	Old (21+yr)	n (young)	Direction
Small (1-5)	5.9%	4.6%	30,390	Young riskier (1.28×)
Medium (6-20)	21.6%	19.9%	1,341	Young riskier (1.09×)
Large (21-100)	39.0%	53.0%	195	Unreliable (tiny n)
XLarge (101+)	71.4%	90.8%	21	Unreliable (tiny n)

Takeaway: For small and medium carriers (95.5% of the population), younger companies are genuinely riskier — a 1.09-1.28× relative risk that is robust across large sample sizes. The "reversal" for large/xlarge fleets is not meaningful because young large carriers are not a natural category; fleets grow over time and the handful of exceptions are likely restructured entities, not inexperienced operations.

The aggregate "old riskier" pattern (0.45×) is a textbook Simpson's paradox: large carriers are both older and have higher binary crash rates (due to mileage exposure), dragging the aggregate in the opposite direction from the within-band relationship.
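The mechanism can be reproduced with a small two-band example. The counts below are entirely hypothetical (they are not the study's data) and were chosen so that young carriers are riskier within each band while the pooled comparison points the other way, because old carriers are concentrated in the high-exposure band.

```python
# Hypothetical (n_carriers, n_crashed) counts; NOT taken from the study.
counts = {
    "small": {"young": (9_000, 540), "old": (5_000, 250)},
    "large": {"young": (100, 55), "old": (5_000, 2_500)},
}


def crash_rate(n, crashed):
    return crashed / n


# Within-band relative risk of young vs old carriers.
within_band_rr = {
    band: crash_rate(*groups["young"]) / crash_rate(*groups["old"])
    for band, groups in counts.items()
}


def pooled_rate(age_group):
    """Crash rate after collapsing the fleet-size bands together."""
    n = sum(groups[age_group][0] for groups in counts.values())
    crashed = sum(groups[age_group][1] for groups in counts.values())
    return crashed / n


aggregate_rr = pooled_rate("young") / pooled_rate("old")

for band, value in within_band_rr.items():
    print(f"{band}: young vs old RR = {value:.2f}")
print(f"aggregate: young vs old RR = {aggregate_rr:.2f}")
```

In this toy example every within-band RR is above 1 while the aggregate RR is well below 1, the same reversal described above.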

9 Decile Capture Analysis

Carriers ranked by Y1-only peer_index (highest risk first), then split into deciles.

Top share of carriers	Share of Y2 crashes captured
Top 10%	7.3%
Top 20%	19.1%
Top 30%	43.3%

Decile	n	Y2 Crashes	% of Total	Cumulative	Crash Rate
D1 (riskiest)	44,119	6,629	7.3%	7.3%	11.1%
D2	44,120	10,795	11.8%	19.1%	14.0%
D3	44,120	22,118	24.2%	43.3%	19.4%
D4	44,120	14,491	15.9%	59.2%	11.0%
D5	44,120	11,573	12.7%	71.9%	11.0%
D6	44,120	8,029	8.8%	80.7%	8.2%
D7	44,120	3,699	4.0%	84.7%	5.5%
D8	44,120	4,778	5.2%	89.9%	5.4%
D9	44,120	3,527	3.9%	93.8%	6.3%
D10 (safest)	44,120	5,760	6.3%	100.0%	9.7%

Note: The peer_index is a rate-based score. Decile crash capture (count-based) is expected to be lower than for count-based models like ML20, because rate-based scores rank by risk intensity, not total expected crashes. The top 30% captures 43.3% of all crashes, which indicates meaningful risk stratification despite this mismatch, although the riskiest decile alone captures only 7.3%, less than the 10% expected from a random ranking.
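The capture figures follow directly from the Y2 Crashes column of the decile table. A minimal sketch of the cumulative-capture calculation, with illustrative variable names:

```python
from itertools import accumulate

# Y2 crash counts per decile, D1 (riskiest) to D10 (safest), from the table above.
decile_crashes = [6_629, 10_795, 22_118, 14_491, 11_573,
                  8_029, 3_699, 4_778, 3_527, 5_760]

total_crashes = sum(decile_crashes)

# Share of all Y2 crashes falling in each decile, and the running total.
share = [c / total_crashes for c in decile_crashes]
cumulative = list(accumulate(share))

for i, (s, cum) in enumerate(zip(share, cumulative), start=1):
    print(f"D{i}: {s:6.1%} of crashes, cumulative {cum:6.1%}")

top10, top20, top30 = cumulative[0], cumulative[1], cumulative[2]
```

Cumulative values computed from the raw counts can differ by 0.1 percentage points from values obtained by summing the rounded per-decile percentages.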

10 Comparison with Original Study

How this clean temporal validation compares with the original empirical validation.

Metric	Original Study	This Study (Y1-only)	Notes
FRED AUC	0.852	0.585	Original used 24-mo window (includes Y2 data)
Grade RR (% crashed)	18.2×	1.1×	Binary rate confounded by exposure
Grade RR (EB rate)	—	19.9×	New metric: EB crash rate per 100k mi
Crash history RR	1.41×	4.85×	Clean CSV-based vs DB (leaked)
Behavioral 4+ types RR	8.17×	7.84×	Both clean temporal — consistent
Reckless driving RR	3.94×	5.32×	Both clean temporal — consistent
Population	180,402	441,199	Different eligibility filters
Score source	DB FRED scores	Y1-only EB recomputation	Key methodological difference

Summary: The original study's high AUC (0.852) and grade RR (18.2×) were inflated by temporal leakage: the FRED scores included 2025 crash data that was also the test outcome. With clean temporal separation, the FRED peer_index shows genuine but modest binary prediction (AUC=0.585). However, the violation-type analyses (Sections 5-6) were already temporally clean in both studies and show consistent results: behavioral violations are powerful predictors of future crash risk (RR 3.9-7.8×).

11 Spearman Rank Correlations with Y2 Crash

Predictor	Spearman rho	p-value
ML20 crash prob	0.264	< 10^-300
ISS score	0.184	< 10^-300
Behavioral RR (Y1)	0.132	< 10^-300
FRED Peer Index (Y1)	0.089	< 10^-300
Crash RR (Y1)	0.044	1.2 × 10^-185
Equipment RR (Y1)	0.030	6.2 × 10^-89
Severe RR (Y1)	0.029	4.3 × 10^-82

All correlations are highly significant. Behavioral RR (rho=0.132) stands out as the strongest FRED component: 3× stronger than crash RR (0.044) and more than 4× stronger than equipment RR (0.030) or severe RR (0.029). The composite peer_index (0.089) is diluted by the heavy crash weight (56%), given that crash RR has a far weaker temporal signal than behavioral RR.

Methodology & Limitations

Empirical Bayes Computation: For each of the 4 components (crash, behavioral, equipment, severe), Gamma(alpha, beta) priors are estimated per fleet-size band via exposure-weighted method of moments with 1% trimming. EB posterior mean = (alpha + count) / (beta + exposure). Relative Risk = EB_rate / band_mean. Peer_index = 0.56×crash_rr + 0.18×behavioral_rr + 0.14×equipment_rr + 0.12×severe_rr.
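The formulas above can be sketched in a few lines. The prior parameters and carrier figures below are hypothetical, and the sketch takes alpha and beta as given rather than reproducing the exposure-weighted method-of-moments fit or the 1% trimming. Only the posterior-mean formula and the component weights come from the text.

```python
# Component weights as stated in the text.
WEIGHTS = {"crash": 0.56, "behavioral": 0.18, "equipment": 0.14, "severe": 0.12}


def eb_rate(alpha, beta, count, exposure):
    """Gamma-Poisson posterior mean rate: (alpha + count) / (beta + exposure)."""
    return (alpha + count) / (beta + exposure)


def relative_risk(rate, band_mean):
    """EB rate relative to the mean rate of the carrier's fleet-size band."""
    return rate / band_mean


def peer_index(component_rr):
    """Weighted sum of the four component relative risks."""
    return sum(WEIGHTS[name] * component_rr[name] for name in WEIGHTS)


# Hypothetical prior: mean rate alpha / beta = 0.05 events per 100k miles.
alpha, beta = 0.5, 10.0
exposure = 2.0  # 200k annual miles, expressed in 100k-mile units

rate_no_crash = eb_rate(alpha, beta, count=0, exposure=exposure)
rate_two_crashes = eb_rate(alpha, beta, count=2, exposure=exposure)

# The raw rate for two crashes would be 2 / 2.0 = 1.0; the EB estimate is
# shrunk toward the prior mean because the carrier has little exposure.
print(f"EB rate, 0 crashes: {rate_no_crash:.4f}")
print(f"EB rate, 2 crashes: {rate_two_crashes:.4f}")

# A carrier exactly at its band mean on every component has peer_index 1.0.
average_carrier = {name: 1.0 for name in WEIGHTS}
print(f"peer_index at band mean: {peer_index(average_carrier):.2f}")
```

The shrinkage is the point of the method: with low exposure, a single event moves the estimate much less than it would move a raw rate.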

Rate vs Count Mismatch: The FRED peer_index is a rate-based metric (risk per unit of exposure). The binary outcome "any crash in Y2" is count-based and heavily influenced by fleet size. This mismatch systematically depresses AUC and binary RR metrics. EB rates and Spearman correlations are more appropriate for evaluating rate-based scores.

Exposure Definition: Annual mileage from FMCSA MCS-150 filing, converted to 100k-mile units. This is a self-reported annual figure and may not precisely match actual 2024 mileage.

Prior Fallback: For components where the method-of-moments estimation degenerates (alpha or beta ≤ 0.01), a pooled prior is computed from the non-degenerate fleet-size bands, weighted by sample size. This primarily affects the severe and equipment components in the small-carrier band, where most carriers have zero violations and the Gamma-Poisson model cannot distinguish "low risk" from "no data."
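One possible reading of the fallback rule is sketched below. The text states the 0.01 threshold and the sample-size weighting. How the pooled prior is formed is an assumption: here it is a weighted mean of alpha and of beta across the non-degenerate bands. The example priors are hypothetical; the band sizes are the population counts from Section 1.

```python
DEGENERATE_THRESHOLD = 0.01  # alpha or beta at or below this is treated as degenerate


def is_degenerate(prior):
    alpha, beta = prior
    return alpha <= DEGENERATE_THRESHOLD or beta <= DEGENERATE_THRESHOLD


def pooled_prior(priors, sizes):
    """Sample-size-weighted mean of (alpha, beta) over non-degenerate bands.

    Averaging alpha and beta separately is an assumption of this sketch.
    """
    good = [band for band, prior in priors.items() if not is_degenerate(prior)]
    if not good:
        raise ValueError("every band is degenerate; nothing to pool")
    total = sum(sizes[band] for band in good)
    alpha = sum(priors[band][0] * sizes[band] for band in good) / total
    beta = sum(priors[band][1] * sizes[band] for band in good) / total
    return alpha, beta


def resolve_priors(priors, sizes):
    """Replace each degenerate band prior with the pooled prior."""
    fallback = None
    resolved = {}
    for band, prior in priors.items():
        if is_degenerate(prior):
            if fallback is None:
                fallback = pooled_prior(priors, sizes)
            resolved[band] = fallback
        else:
            resolved[band] = prior
    return resolved


band_sizes = {"small": 366_822, "medium": 54_515, "large": 16_813, "xlarge": 3_049}

# Hypothetical (alpha, beta) estimates for one component; "small" is degenerate.
example_priors = {
    "small": (0.004, 0.9),
    "medium": (0.8, 12.0),
    "large": (1.1, 15.0),
    "xlarge": (1.6, 20.0),
}

resolved = resolve_priors(example_priors, band_sizes)
```

Because the medium band dominates the non-degenerate sample, the pooled prior in this example sits close to the medium band's values.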

Y1 Window: January 1 – December 31, 2024 for all predictor events. Y2 outcome window: January 1 – December 31, 2025. No overlap between predictor and outcome periods.