Fleetidy — Empirical Validation Study

The Question

How much harm will a motor carrier do on the road over the next year? The FRED Score answers that directly: it predicts each carrier's severity-weighted crash burden over the forward 12 months — not just how many crashes, but how bad — then grades it against similarly-sized peers.

We fit the model on one year of carrier history (2024) and test it against the crash burden that actually followed (2025). Because the model is never judged on the same data it learned from, the validation is genuinely out-of-time. Across roughly 1.1 million graded carriers, the fitted model ranks forward risk far better than fleet size alone and stays unbiased within every size band.

What the Model Predicts:

Burden = Frequency × Severity

Severity weight = 1 + 12×min(fatalities, 3) + 4×min(injuries, 5) + 3×hazmat
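The severity formula above can be sketched in a few lines of Python. This is a minimal illustration, not the production code: the `Crash` record, its field names, and the treatment of hazmat as a yes/no flag are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Crash:
    fatalities: int
    injuries: int
    hazmat: bool  # assumed to be a yes/no flag, so it contributes 0 or 3


def severity_weight(crash: Crash) -> int:
    """1 + 12*min(fatalities, 3) + 4*min(injuries, 5) + 3*hazmat."""
    return (
        1
        + 12 * min(crash.fatalities, 3)
        + 4 * min(crash.injuries, 5)
        + 3 * int(crash.hazmat)
    )


def crash_burden(crashes: Iterable[Crash]) -> int:
    """Severity-weighted sum over a carrier's crashes in the period."""
    return sum(severity_weight(c) for c in crashes)


tow_away = Crash(fatalities=0, injuries=0, hazmat=False)
fatal = Crash(fatalities=1, injuries=2, hazmat=False)
print(severity_weight(tow_away))        # 1
print(severity_weight(fatal))           # 21
print(crash_burden([tow_away, fatal]))  # 22
```

The caps mean a single crash can weigh at most 1 + 36 + 20 + 3 = 60, so one catastrophic event cannot dominate a carrier's burden without limit.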

A transparent actuarial credibility model grades each carrier on its own observed record. Crash burden and each violation signal are converted to a credibility-shrunk relativity against the carrier's size cohort, then combined in a geometric blend, with observed crashes setting a floor. Because a fatal crash counts far more than a minor tow-away, the score reflects expected harm, not just event counts. Thin-data carriers are shrunk toward their size-cohort prior (Empirical Bayes / Bühlmann-Straub) and tagged with a confidence tier.
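As a rough sketch of the geometric blend with a crash floor: the signal names, the weights, the weight normalization, and the way the floor is supplied are all hypothetical here. The study states only that the relativities are blended geometrically and that observed crashes set a floor.

```python
import math
from typing import Dict


def geometric_blend(relativities: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted geometric mean of per-signal relativities (all must be > 0)."""
    total = sum(weights.values())
    log_blend = sum(w * math.log(relativities[name]) for name, w in weights.items())
    return math.exp(log_blend / total)


def blended_relativity(
    relativities: Dict[str, float],
    weights: Dict[str, float],
    crash_floor: float,
) -> float:
    """Blend the signals, then apply the floor implied by observed crashes."""
    return max(geometric_blend(relativities, weights), crash_floor)


# Hypothetical carrier: elevated crash relativity, clean violation record.
rel = {"crash": 2.0, "vehicle_oos": 1.0, "severe": 0.5}
w = {"crash": 1.0, "vehicle_oos": 1.0, "severe": 1.0}  # illustrative equal weights

print(geometric_blend(rel, w))                       # blend alone averages the crash signal down
print(blended_relativity(rel, w, crash_floor=1.3))   # the floor keeps it from being averaged away
```

Working in log space is what makes the blend multiplicative: a relativity of 2.0 and one of 0.5 cancel exactly, which is why a separate floor is needed to stop clean violation signals from erasing real crash history.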

How Well Does It Rank Forward Risk?

Out-of-Time Gini: ≈ 0.20 (normalized, overall, rated carriers, forward year)

By Fleet-Size Cohort: 0.11–0.64 (strongest where exposure is richest: large fleets)

Grade Gate: monotone (realized burden rises Excellent→Critical, or the run aborts)

On a carrier-disjoint holdout of the following year, the deployed model reaches an overall normalized Gini of about 0.20 among rated carriers. Discrimination rises with exposure, from 0.11 on single-power-unit carriers to 0.64 on 100+ unit fleets, where the data is richest. Realized forward burden rises monotonically from Excellent to Critical within every fleet-size cohort, and observed-over-expected burden lands between 0.97 and 1.06. Across roughly 1.1 million graded carriers, the model ranks risk and stays calibrated in every size band. (Figures are from the latest validated refit, outcome year ending 2026-03-08; the earlier development prototype, fit on a narrower high-data sub-population, reported a higher Gini of 0.59–0.61.)
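For readers unfamiliar with the metric, a normalized Gini can be computed as below. This is a generic, unweighted sketch; whether the study's figure is exposure-weighted or handles ties differently is not stated, so treat those details as assumptions.

```python
from typing import Sequence


def gini(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Gini of `actual` when carriers are ranked by `predicted`, riskiest first."""
    n = len(actual)
    # Ties in the prediction are broken by original position (an arbitrary choice).
    order = sorted(range(n), key=lambda i: (-predicted[i], i))
    total = sum(actual)
    cumulative = 0.0
    area = 0.0
    for i in order:
        cumulative += actual[i]
        area += cumulative / total
    return (area - (n + 1) / 2.0) / n


def normalized_gini(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Scale by the Gini of a perfect ranking: 1.0 is perfect, 0 is random."""
    return gini(actual, predicted) / gini(actual, actual)


realized_burden = [0, 0, 1, 3]
print(normalized_gini(realized_burden, realized_burden))  # 1.0
print(normalized_gini(realized_burden, [3, 2, 1, 0]))     # -1.0
```

Because the score is normalized by the best achievable ranking, values from cohorts with very different crash frequencies (single-unit carriers versus 100+ unit fleets) sit on the same 0-to-1 scale.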

Do Grades Predict Future Crashes?

Year 2 (2025) Crash Rates by Safety Grade

242,969 graded carriers • EB-adjusted rates per 100k miles

Relative Risk, Critical vs Excellent (EB rate): 3.35×

RR (% Crashed): 1.93×

Critical Crash Rate: 18.7%

Carriers graded Critical in Year 1 had a Year 2 EB-adjusted crash rate 3.35× that of Excellent carriers, after controlling for fleet size via EB adjustment. By share of carriers with any Year 2 crash, 9.7% of Excellent carriers crashed, compared with 18.7% of Critical carriers (1.93×).

Grade | EB-adjusted crash rate per 100k miles | % of carriers with any Year 2 crash
Excellent | 0.041 | 9.7%
Strong | 0.049 | 13.0%
Satisfactory | 0.056 | 18.5%
Marginal | 0.067 | 22.2%
Poor | 0.083 | 22.1%
Critical | 0.137 | 18.7%
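The two headline ratios can be recomputed from the rounded figures in the grade table. A small sketch follows; the variable and function names are illustrative. The rounded inputs give roughly 3.34× for the EB rate, slightly below the published 3.35×, presumably because the published figure uses unrounded rates.

```python
GRADES = ["Excellent", "Strong", "Satisfactory", "Marginal", "Poor", "Critical"]
EB_RATE_PER_100K = [0.041, 0.049, 0.056, 0.067, 0.083, 0.137]
PCT_ANY_CRASH = [9.7, 13.0, 18.5, 22.2, 22.1, 18.7]


def critical_vs_excellent(values):
    """Ratio of the last grade (Critical) to the first (Excellent)."""
    return values[-1] / values[0]


def is_strictly_increasing(values):
    return all(a < b for a, b in zip(values, values[1:]))


print(f"EB-rate RR:   {critical_vs_excellent(EB_RATE_PER_100K):.2f}x")
print(f"%-crashed RR: {critical_vs_excellent(PCT_ANY_CRASH):.2f}x")
print("EB rate rises with every grade:", is_strictly_increasing(EB_RATE_PER_100K))
print("%-crashed rises with every grade:", is_strictly_increasing(PCT_ANY_CRASH))
```

The EB-adjusted rate rises at every grade step, while the unadjusted share of carriers with any crash peaks at Marginal, which is one reason the size-adjusted rate is the more informative column.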

How the Model Is Built and Tested

Our Methodology

Rather than hand-assigning weights, we let the data set them. The pipeline:

1. Define the target: Each Year-2 crash is severity-weighted (1 + 12×min(fatalities,3) + 4×min(injuries,5) + 3×hazmat) and summed into a carrier's crash burden.
2. Build Year-1 signals: The carrier's crash burden plus its vehicle-out-of-service, severe-violation, hours-of-service/fatigue, unsafe-driving, and speeding rates — each shrunk to a credibility-weighted relativity vs. its size cohort.
3. Blend into a risk relativity: The per-signal relativities combine in a transparent geometric blend whose weights are learned out-of-time (a non-negative Poisson fit) from how strongly each predicts next-year burden; observed crashes set a floor that can't be averaged away. No black-box model sits between the carrier's record and its grade.
4. Apply credibility: Bühlmann-Straub shrinkage blends each carrier toward its size-band prior, yielding a confidence tier. Credibility is set by the carrier's observed evidence — the count of roadside inspections and adverse events, plus a fleet-size floor — not by self-reported mileage, so a thin record can't earn a top grade on the absence of data.
5. Validate out-of-time: Score the forward year and measure discrimination (normalized Gini) and calibration (per-band observed-over-expected).
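The calibration half of step 5, together with the monotone grade gate described earlier, could look roughly like this. It is a sketch under assumed data shapes: the row tuples, function names, and the choice of `RuntimeError` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

GRADE_ORDER = ["Excellent", "Strong", "Satisfactory", "Marginal", "Poor", "Critical"]


def observed_over_expected(rows: Iterable[Tuple[str, float, float]]) -> Dict[str, float]:
    """rows are (size_band, observed_burden, expected_burden), one per carrier."""
    observed = defaultdict(float)
    expected = defaultdict(float)
    for band, obs, exp in rows:
        observed[band] += obs
        expected[band] += exp
    return {band: observed[band] / expected[band] for band in observed}


def grade_gate(realized_by_grade: Dict[str, float]) -> None:
    """Abort unless realized burden strictly rises from Excellent to Critical."""
    values = [realized_by_grade[g] for g in GRADE_ORDER]
    if any(a >= b for a, b in zip(values, values[1:])):
        raise RuntimeError("realized burden is not monotone across grades")


rows = [("small", 10.0, 8.0), ("small", 2.0, 4.0), ("large", 5.0, 5.0)]
print(observed_over_expected(rows))  # each band's ratio should sit near 1.0

grade_gate(dict(zip(GRADE_ORDER, [0.04, 0.05, 0.06, 0.07, 0.08, 0.14])))  # passes silently
```

Summing observed and expected burden before dividing keeps the ratio stable for bands full of zero-crash carriers, where a per-carrier ratio would be undefined or noisy.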

Critical: Exposure enters as a log-offset so the model predicts a rate of crash burden, not a raw count, and calibration is enforced band-by-band. That is what lets a 3-truck owner-operator and a 3,000-truck fleet be compared fairly.
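To make the log-offset idea concrete, here is a minimal sketch. The exposure unit and the `linear_predictor` value are placeholders, not fitted quantities from the study.

```python
import math


def expected_burden(exposure: float, linear_predictor: float) -> float:
    """Poisson mean with exposure as a log-offset:
    log E[burden] = log(exposure) + linear_predictor.
    """
    return math.exp(math.log(exposure) + linear_predictor)


def burden_rate(exposure: float, linear_predictor: float) -> float:
    """Expected burden per unit of exposure; the offset cancels out."""
    return expected_burden(exposure, linear_predictor) / exposure


same_risk = -2.0  # hypothetical linear predictor shared by both carriers
owner_operator = expected_burden(exposure=3, linear_predictor=same_risk)
large_fleet = expected_burden(exposure=3000, linear_predictor=same_risk)

print(large_fleet / owner_operator)  # expected burden scales with exposure
print(burden_rate(3, same_risk), burden_rate(3000, same_risk))  # rates match
```

Because the offset has a fixed coefficient of one, the fitted part of the model describes only the rate, so two carriers with the same signals get the same predicted rate regardless of fleet size.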

1. Prior-Crash EB Relativity

The single strongest feature in the fit

#1

Top Predictor

Past crashes are the single strongest predictor of future crashes. Carriers with elevated crash rates in Year 1 consistently had elevated rates in Year 2. This signal is robust across all fleet sizes and carrier ages.

No 2024 crash (baseline): 12.6% crashed in 2025

Had 2024 crash: 42.6% crashed in 2025

EB-adjusted relative risk: 1.41×

Why it leads: Prior crashes are the most direct evidence of forward burden, entered as a credibility-shrunk relativity rather than a raw count. A single crash doesn't doom a small carrier, because Bühlmann-Straub shrinkage blends each carrier's record toward its size-band prior — so the signal is reliable for both 3-truck operations and 3,000-truck fleets.
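The shrinkage described above follows the standard credibility form Z = n / (n + k). A minimal sketch, assuming a hypothetical credibility constant `k` and an evidence count built the way the methodology describes (inspections plus adverse events, with a fleet-size floor):

```python
def credibility(evidence: float, k: float) -> float:
    """Bühlmann-style credibility factor Z = n / (n + k), between 0 and 1."""
    return evidence / (evidence + k)


def shrunk_relativity(
    observed: float,
    evidence: float,
    k: float = 50.0,      # hypothetical credibility constant
    prior: float = 1.0,   # size-band prior: 1.00x means typical for the band
) -> float:
    """Blend the carrier's own relativity with its size-band prior."""
    z = credibility(evidence, k)
    return z * observed + (1.0 - z) * prior


# Same observed relativity, very different amounts of evidence.
small = shrunk_relativity(observed=4.0, evidence=5)
large = shrunk_relativity(observed=4.0, evidence=5000)
print(round(small, 3))  # 1.273
print(round(large, 3))  # 3.97
```

With little evidence the small carrier is pulled most of the way back to its band prior, while the large fleet keeps nearly all of its observed relativity.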

2. Behavioral Violation Rate

The best leading indicator the fit recovers

#2

Top Predictor

Driver decision violations — speeding, reckless driving, HOS violations, drug and alcohol offenses — are the strongest leading indicator among violation types. These capture the culture and discipline of a carrier's operation before crashes actually happen.

Empirical Relative Risk by Violation Type:

Reckless Driving 1.49× relative risk

Jumping OOS / Driving Fatigued 1.36× relative risk

Drugs / Alcohol 1.29× relative risk

Dangerous Driving 1.37× relative risk

Severe Speeding (15+ mph over) 1.30× relative risk

Texting / Phone Use 1.16× relative risk

False Log (HOS fraud) 1.12× relative risk

Cumulative Behavioral Dose-Response (EB-Adjusted):

Distinct behavioral violation types in Year 1 | % of carriers with any Year 2 crash | EB RR (size-band normalized)
0 types | 11.3% | 1.00×
≥1 type | 31.2% | 1.34×
≥2 types | 50.2% | 1.50×
≥3 types | 68.6% | 1.57×
≥4 types | 84.7% | 1.64×
≥5 types | 92.5% | 1.54×

Why it ranks high: Behavioral violations are the best leading indicator — they reveal risk before crashes happen. A carrier whose drivers regularly speed or drive fatigued carries a higher forward burden. The behavioral rate enters the model as a feature whose importance is learned from how strongly it predicts the following year's crash burden; the relativities above are illustrative of the ordering the fit recovers and align with the Violation Types study.

3. Equipment Violation Rate

An independent maintenance signal

#3

Top Predictor

Equipment violations — brakes, tires, lighting, cargo securement — reflect a carrier's maintenance standards and operational discipline. Unlike behavioral violations (driver choices), equipment condition reveals systemic quality.

Lighting Violations 1.17× relative risk

Brakes Out of Adjustment 1.13× relative risk

Tire Violations 1.16× relative risk

Other Brake Issues 1.13× relative risk

Why it matters: Equipment violations are a moderate but consistent predictor. A carrier that regularly fails brake inspections has systemic maintenance issues. The equipment rate carries less weight in the fit than the behavioral rate (among 10+ year carriers, the behavioral worst-to-best quintile ratio of 9.34× is about 23% higher than the equipment ratio of 7.57×) but remains an important independent signal of operational quality.

4. Severe-Violation Rate

Captures tail risk other features miss

#4

Top Predictor

The severe-violation rate summarizes the most dangerous violations — those with FMCSA severity weight ≥ 7. It enters the model as its own exposure-normalized feature, so carriers with these violations are predicted to carry a higher forward crash burden than their other signals alone would suggest.

Critical Behavioral Flags:

These violations are flagged on carrier records and feed the severe-violation rate the model consumes — no hard score caps, just a learned feature importance:

Reckless Driving (any) Severity Weight ≥ 7

Lifts the severe-violation rate — carriers with these violations show a markedly higher predicted burden

Drug / Alcohol (any) Behavioral + Severe

Substance violations raise both the behavioral and the severe rate — a double signal that amplifies the predicted burden

Extreme Speeding Patterns Dose-Response

Among carriers with 10+ speeding violations, 98.4% crashed in the following year, a strong upward push on predicted burden

Why it matters: Severity captures tail risk the other features might miss. A carrier with critical violations (severity weight ≥ 7) represents a qualitatively different risk profile. Because the rate is exposure-normalized and the carrier is credibility-shrunk toward its size-band prior, this signal is reliable across fleet sizes without disproportionately penalizing small carriers for isolated incidents.

Supporting Evidence: The Carrier Age Effect

Why Empirical Bayes Matters

Carrier age shows a moderate risk gradient (about 1.37× peak-to-low) captured through EB shrinkage, not a separate weight

Crash Rate by Years in Operation

Carriers: 242,969 • Peak Risk: <1 year (0.077/100k) • Lowest Risk: 10-19 years (0.056/100k) • Relative Risk: 1.37×

Carrier age remains relevant, but the effect is more modest in this cohort: the highest rates are among the newest carriers, and the lowest rates are in the 10-19 year range after normalizing for fleet size and exposure.

Rather than giving age its own weight in the formula, our Empirical Bayes shrinkage naturally handles this effect. New carriers with limited data get pulled toward their peer-group average (which includes the age-related risk), while mature carriers with extensive records keep their observed rates. This is more principled than adding experience as an arbitrary weighted factor.

FMCSA's BASIC Score Problem: BASIC treats all carriers equally regardless of age. A 20-year carrier with 2 crashes gets the same treatment as a 1-year carrier with 2 crashes. When we tested BASIC scores as a predictor, they showed inverse correlation (RR=0.17×) — carriers with better BASIC scores had higher crash rates, likely because established carriers accumulate more inspection history. We don't use BASIC in our model.

Supporting Evidence: The Fleet Size Effect

Small Carriers Have Higher Crash Rates Across All Ages

Comparing fleet-size segments in the Crash Rate by Years in Operation chart reveals a consistent pattern: smaller fleets have higher crash rates regardless of experience. This is why we normalize per 100k miles and apply Empirical Bayes adjustment by fleet-size peer group: doing so controls for the effect rather than penalizing small carriers arbitrarily.

Crash Rates by Fleet Size (1-2 Year Carriers)

Fleet size (trucks) | Crash rate per 100k miles
Small (1-5) | 0.189
Medium (6-20) | 0.136
Large (21-100) | 0.060
Enterprise (100+) | 0.044

Small carriers in their first two years have crash rates 4.3× those of enterprise carriers of the same age.

How Our Model Handles This

Empirical Bayes by size band: Each fleet-size peer group (small, medium, large, enterprise) has its own prior. A small carrier's rate is shrunk toward the small-carrier average, not the overall fleet average.

Within-band grading: Grades compare each carrier to others of the same size. A risk relativity of 1.00× means "typical for your size band." This prevents small carriers from being automatically graded worse simply for being small.

Band-calibrated expectation: Each fleet-size band is calibrated so its total predicted crashes (and burden) match that band's own recent observed totals (observed-over-expected ≈ 1.0). A given grade therefore means the same forward risk regardless of fleet size.
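Band calibration of this kind amounts to one scaling factor per size band. The sketch below assumes carriers arrive as `(band, predicted_burden)` pairs and that observed totals per band are already known; both shapes are illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def calibration_factors(
    carriers: List[Tuple[str, float]], observed_by_band: Dict[str, float]
) -> Dict[str, float]:
    """Factor per band = observed total / predicted total."""
    predicted = defaultdict(float)
    for band, pred in carriers:
        predicted[band] += pred
    return {band: observed_by_band[band] / predicted[band] for band in predicted}


def calibrate(
    carriers: List[Tuple[str, float]], factors: Dict[str, float]
) -> List[Tuple[str, float]]:
    """Rescale each prediction by its band's factor; ranking within a band is unchanged."""
    return [(band, pred * factors[band]) for band, pred in carriers]


carriers = [("small", 0.2), ("small", 0.6), ("enterprise", 30.0), ("enterprise", 50.0)]
observed = {"small": 1.0, "enterprise": 72.0}

factors = calibration_factors(carriers, observed)
calibrated = calibrate(carriers, factors)
print(factors)
print(calibrated)
```

Since every carrier in a band is multiplied by the same factor, calibration fixes the band's level without disturbing the within-band ordering that drives grades.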

What Predicts Risk for Mature Carriers?

10+

The Predictive Hierarchy Shifts After 10 Years

Once carriers reach 10+ years, operational performance metrics dominate. Crash history becomes the primary differentiator, while behavioral violations remain the best leading indicator of emerging risk.

Behavioral vs Equipment Violations (10+ Year Carriers)

When normalized per 100,000 miles to control for fleet size, behavioral violations emerge as the stronger predictor:

Behavioral Violations 9.34×

Speeding, reckless driving, HOS, drugs/alcohol

Best quintile: 0.041 crashes/100k mi
Worst quintile: 0.381 crashes/100k mi

Equipment Violations 7.57×

Brakes, tires, lights, cargo securement

Best quintile: 0.041 crashes/100k mi
Worst quintile: 0.312 crashes/100k mi

Key insight: For mature carriers, behavioral violations are 23% more predictive than equipment violations. This is why the model carries behavioral and equipment rates as distinct features rather than lumping all violations together — the fit gives the behavioral rate the heavier weight on its own.

Implications for Insurance Underwriting

For newer carriers (<10 years): The EB shrinkage toward peer-group priors provides natural conservatism. Limited data means the score reflects the peer average more than individual history.

For mature carriers (10+ years): Crash history carries the most weight because these carriers have enough data for reliable rate estimation. Behavioral violations provide the best early warning of deteriorating safety culture before crashes materialize.

Grade Distribution

Grades are assigned from each carrier's within-band burden percentile — where its credibility-shrunk predicted crash burden sits among same-size peers. Alongside the grade, a risk relativity of 1.00× means the carrier is typical for its size band; below 1 is safer, above is riskier. Because the comparison is within band, small and large fleets are graded fairly against their own peers.
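A within-band percentile grade could be assigned as sketched below. The cutoffs are an assumption: they are simply the cumulative grade shares reported in this study (46.1%, 54.4%, 73.8%, 83.9%, 92.3%, 100%), not published thresholds, and the percentile definition is likewise illustrative.

```python
from bisect import bisect_right
from typing import Sequence

# Hypothetical cutoffs: cumulative shares of the reported grade distribution.
GRADE_CUTOFFS = [
    (46.1, "Excellent"),
    (54.4, "Strong"),
    (73.8, "Satisfactory"),
    (83.9, "Marginal"),
    (92.3, "Poor"),
    (100.0, "Critical"),
]


def percentile_within_band(burden: float, band_burdens: Sequence[float]) -> float:
    """Percent of same-band peers whose predicted burden is <= this carrier's."""
    ordered = sorted(band_burdens)
    return 100.0 * bisect_right(ordered, burden) / len(ordered)


def grade(percentile: float) -> str:
    for cutoff, name in GRADE_CUTOFFS:
        if percentile <= cutoff:
            return name
    return "Critical"


small_band = [0.0, 0.0, 0.0, 0.0, 0.2, 0.5, 0.9, 1.4, 2.5, 6.0]
print(grade(percentile_within_band(0.0, small_band)))  # Excellent
print(grade(percentile_within_band(2.5, small_band)))  # Poor
```

Ranking only against `small_band` is what keeps the comparison fair: the same burden value could earn a different grade in a band with a different distribution.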

Grade | Share of graded carriers | Burden tier | % crashed in Year 2
Excellent | 46.1% | lowest-burden tier | 9.7%
Strong | 8.3% | below-typical burden | 13.0%
Satisfactory | 19.4% | typical for size band | 18.5%
Marginal | 10.1% | above-typical burden | 22.2%
Poor | 8.4% | high burden | 22.1%
Critical | 7.7% | highest-burden tier | 18.7%

242,969 graded carriers in the study population. The distribution peaks at Excellent (46.1%), with a right-skewed tail reflecting that most carriers have zero or very few crashes. The crash figure for each grade is the percentage of carriers in that grade with at least one crash in 2025. By that measure, Critical carriers crashed 1.93× as often as Excellent carriers.

Key Takeaways

1. Prior Crash Burden Is the Dominant Predictor

A carrier's credibility-shrunk prior-crash relativity is the single strongest feature in the fit. Bühlmann-Straub shrinkage makes this robust for all fleet sizes — small carriers aren't penalized by statistical noise, and large carriers keep their reliable observed history.

2. Behavioral Violations Are the Best Leading Indicator

Driver decision violations (speeding, reckless driving, substances, HOS fraud) are the strongest leading indicator of future crash burden. They capture risk before it materializes. For mature carriers, behavioral violations are 23% more predictive than equipment violations.

3. Equipment Condition Reflects Systemic Quality

Brake failures, tire issues, and lighting problems signal maintenance standards and operational discipline. The equipment rate carries less weight than the behavioral rate, but contributes an independent and consistent signal.

4. Severity Captures Tail Risk

The most dangerous violations — reckless driving, substance offenses, extreme speeding — carry high severity weights and lift the carrier's exposure-normalized severe-violation rate. Carriers with these violations are predicted to carry a higher forward burden, catching qualitatively different risk that crash history alone might miss.

5. Fleet Size Is Controlled, Not Penalized

Small carriers (1-5 trucks) have crash rates 2-4× higher than enterprise carriers, but within-band grading, the log-exposure offset, and per-band calibration ensure carriers are compared to peers of similar size. A small carrier with clean records can still earn an Excellent grade.
