How We Score Safety
The FRED Score is a forward-looking model that predicts a carrier's expected severity-weighted crash burden over the next 12 months, then grades it against similarly-sized peers. It is fit on one year of history to predict the following year — out-of-time validated, not scored on the same data it learned from.
Scoring Pipeline at a Glance
Data Sources
Seven FMCSA datasets feed the scoring pipeline
Every score starts with public data from the Federal Motor Carrier Safety Administration. We pull seven distinct datasets, join them on DOT number, and filter to the population of active for-hire property carriers.
| Dataset | What It Contains | Updates |
|---|---|---|
| Census | Carrier registration, fleet size, address, authority type, officer names | Daily |
| Inspections | Roadside inspections with driver and vehicle out-of-service (OOS) counts | Daily |
| Crashes | Reportable crash events with fatality, injury, and tow-away details | Monthly |
| Violations | Individual violation records with category, severity weight, and inspection date | Monthly |
| BASIC Scores | SMS safety measure scores — SMS AB (interstate + intrastate hazmat) and SMS C (intrastate non-hazmat) merged by DOT number | Monthly |
| SMS Census | Authority classification fields: for-hire, exempt, private property, government, etc. | Monthly |
| History | Operating authority orders, revocations, and docket status changes | Daily |
Carrier Eligibility
Who gets scored — and who doesn't
Not every FMCSA-registered entity is a for-hire trucking carrier. We apply hard exclusions that remove non-carriers entirely — but beyond those, nearly every carrier with power units and usable exposure is graded. Thin-data carriers are not dropped to N/A; their estimate is shrunk toward their size-band prior and tagged with a confidence tier. This is why the FRED Score covers about 1.1 million carriers, versus roughly 475k under the prior approach that excluded thin carriers outright.
Pre-Scoring Exclusions
Carriers matching any of these criteria are removed before scoring begins:
Confidence Tiers, Not Exclusion
Rather than dropping carriers with thin data, the model shrinks their estimate toward their size-band prior and labels how much real track record backs the score:
Data & Exposure Normalization
Why raw counts are misleading
Public FMCSA records — crashes, inspections, and violations — are aggregated for each carrier over its most recent observation window. But comparing a 3-truck fleet to a 500-truck fleet by raw event counts is inherently unfair. A larger fleet drives more miles and naturally encounters more events. Exposure is how the model accounts for that: it enters the prediction as a log-exposure offset, so the model predicts a rate of crash burden, not a raw count.
Exposure is the carrier's mileage over the scoring window, in 100k-mile units. It enters the model as log(E) — an offset that puts every metric on a per-100k-mile footing so small and large fleets are predicted on the same scale:
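As a minimal sketch of how a log-exposure offset behaves, the snippet below converts mileage to 100k-mile units and shows that two fleets with the same underlying risk get the same per-unit rate regardless of size. The function names and the value of `eta` (a stand-in for the fitted linear predictor) are illustrative assumptions, not taken from the production pipeline.

```python
import math

MILES_PER_UNIT = 100_000  # exposure E is expressed in 100k-mile units


def exposure_units(miles: float) -> float:
    """Convert window mileage to exposure units E."""
    return miles / MILES_PER_UNIT


def expected_burden(miles: float, eta: float) -> float:
    """Expected burden with a log-exposure offset: exp(log(E) + eta) = E * exp(eta).

    `eta` stands in for the fitted linear predictor (hypothetical value here).
    """
    return math.exp(math.log(exposure_units(miles)) + eta)


# Two fleets with the same underlying risk (same eta) but 100x different mileage.
small = expected_burden(300_000, eta=-1.0)
large = expected_burden(30_000_000, eta=-1.0)

# Dividing by exposure recovers the same per-100k-mile rate for both fleets.
small_rate = small / exposure_units(300_000)
large_rate = large / exposure_units(30_000_000)
```

The expected burden scales with mileage, while the rate depends only on the risk features; that is the property the offset is there to provide.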
Credibility & Confidence
How much to trust a carrier’s own history
Even on a per-mile basis, a tiny fleet with 1 crash in 50k miles (2.0 crashes per 100k miles) looks ten times worse than a large fleet with 10 crashes in 5 million miles (0.2 per 100k miles), even though the small fleet's rate is mostly noise. One lucky or unlucky year can swing it wildly. Rather than throwing thin carriers away, the model blends each carrier's own signal with its size-band prior using Bühlmann-Straub credibility.
Each carrier gets a credibility weight $Z = E/(E+\beta_{\text{band}})$ — the share of the estimate that comes from the carrier's own record, with the size-band prior filling the rest. Carriers with lots of exposure ($E \gg \beta$) are governed almost entirely by their own history; thin carriers ($E \ll \beta$) are pulled ("shrunk") toward the band prior.
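The blend can be sketched in a few lines. The formula for $Z$ is the one given above; the band constant `BETA_BAND` and the prior rate used here are made-up numbers chosen only to make the arithmetic visible, and the function names are hypothetical.

```python
def credibility_weight(exposure: float, beta_band: float) -> float:
    """Z = E / (E + beta_band): share of the estimate from the carrier's own record."""
    return exposure / (exposure + beta_band)


def shrunk_rate(own_rate: float, prior_rate: float,
                exposure: float, beta_band: float) -> float:
    """Blend the carrier's own rate with its size-band prior using weight Z."""
    z = credibility_weight(exposure, beta_band)
    return z * own_rate + (1.0 - z) * prior_rate


BETA_BAND = 10.0   # hypothetical band constant, in 100k-mile units
PRIOR_RATE = 0.3   # hypothetical size-band prior, crashes per 100k miles

# Small fleet: 1 crash in 50k miles -> E = 0.5, raw rate 2.0 per 100k miles.
z_small = credibility_weight(0.5, BETA_BAND)
small = shrunk_rate(2.0, PRIOR_RATE, 0.5, BETA_BAND)

# Large fleet: 10 crashes in 5M miles -> E = 50, raw rate 0.2 per 100k miles.
z_large = credibility_weight(50.0, BETA_BAND)
large = shrunk_rate(0.2, PRIOR_RATE, 50.0, BETA_BAND)
```

With these illustrative inputs the small fleet keeps under 5% of its own record and lands near the prior, while the large fleet keeps over 80% of its own history.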
That same credibility weight drives each carrier's confidence tier:
Frequency × Severity Model
Predicting next year’s crash burden
The FRED Score targets a carrier's severity-weighted crash burden — not just how many crashes, but how bad. Each crash is weighted by its outcome, so a fatal collision counts far more than a minor tow-away:
That severity-weighted burden is modeled as frequency × severity with a Poisson / Tweedie GLM and a log-exposure offset, so the model predicts the carrier's expected burden per mile driven over the forward 12 months. A gradient-boosted XGBoost-Tweedie model is run alongside as a cross-check.
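For orientation, a log-link GLM with an exposure offset conventionally takes the form below. This is the generic textbook shape, written under the assumption that $\mu_i$ is carrier $i$'s expected severity-weighted burden, $E_i$ its exposure in 100k-mile units, and $x_{ik}$ its predictor values; the exact specification used in production may differ.

$$\log \mu_i = \log E_i + \beta_0 + \sum_k \beta_k\, x_{ik} \quad\Longleftrightarrow\quad \mu_i = E_i \,\exp\!\Big(\beta_0 + \sum_k \beta_k\, x_{ik}\Big)$$

The regression coefficients $\beta_k$ in this form are unrelated to the credibility constant $\beta_{\text{band}}$ used in the shrinkage weight.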
The coefficients $\beta_k$ are fit from data, not hand-set. In order of predictive strength, the strongest predictors are:
Relative-strength illustration; exact coefficients are refit each cycle from the most recent year-over-year history.
Violations as Fitted Predictors
Roadside violations are summarized into behavioral (driver-conduct) and equipment (vehicle-condition) rates, plus a severe-violation rate. These enter the model as features whose weight is learned from how strongly they predict the following year's crash burden — the relativities below are illustrative of the ordering the fit recovers, not pre-set multipliers baked into the score:
| Tier | Violation Type | Forward-crash RR |
|---|---|---|
| Critical: Immediate danger behaviors | | |
| | Reckless Driving | 1.49 |
| | Dangerous Driving | 1.37 |
| | Jumping OOS / Driving Fatigued | 1.36 |
| High: Serious behavioral risks | | |
| | Speeding (high & excessive) | 1.30 |
| | Drugs / Alcohol | 1.29 |
| | Alcohol Possession | 1.27 |
| | Moderate Speeding | 1.20 |
| | Phone Call / Texting | 1.16–1.18 |
| Moderate: Concerning behaviors | | |
| | False Log | 1.12 |
| | Seat Belt | 1.11 |
| Equipment: Vehicle condition | | |
| | Lighting | 1.17 |
| | Tires | 1.16 |
| | Brakes (all types) | 1.13 |
Each figure is the empirical relative risk — the ratio of next-year crash burden for carriers with that violation type versus those without. The model learns how much weight to give each behavioral and equipment feature directly from the year-over-year data, so its influence on the score reflects measured forward risk rather than a fixed assumption.
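A minimal sketch of that ratio follows, assuming burden is compared per unit of exposure so that fleet size does not drive the result. The record fields (`violations`, `next_year_burden`, `next_year_exposure`) and the toy data are hypothetical, not the pipeline's schema.

```python
def burden_rate(carriers):
    """Pooled next-year burden per unit of exposure for a group of carriers."""
    total_burden = sum(c["next_year_burden"] for c in carriers)
    total_exposure = sum(c["next_year_exposure"] for c in carriers)
    return total_burden / total_exposure


def relative_risk(carriers, violation: str) -> float:
    """Rate for carriers with the violation divided by the rate for those without."""
    with_v = [c for c in carriers if violation in c["violations"]]
    without_v = [c for c in carriers if violation not in c["violations"]]
    return burden_rate(with_v) / burden_rate(without_v)


# Toy data: four carriers, each with 2 exposure units in the following year.
carriers = [
    {"violations": {"speeding"}, "next_year_burden": 3.0, "next_year_exposure": 2.0},
    {"violations": {"speeding", "tires"}, "next_year_burden": 3.0, "next_year_exposure": 2.0},
    {"violations": set(), "next_year_burden": 2.0, "next_year_exposure": 2.0},
    {"violations": {"tires"}, "next_year_burden": 2.0, "next_year_exposure": 2.0},
]

rr_speeding = relative_risk(carriers, "speeding")
```

A value above 1.0 means carriers with that violation type went on to carry more burden per mile than those without it.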
Per-Band Calibration
Unbiased predictions across every fleet size
A model can rank carriers well yet still systematically over- or under-predict for a given fleet size. To prevent that, each fleet-size band is calibrated so the total predicted burden matches the total observed burden — the observed-over-expected ratio lands at ≈ 1.0:
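One simple way to achieve this is a multiplicative factor per band, equal to total observed burden over total predicted burden. The sketch below assumes that approach; the band labels and numbers are invented for illustration, and the production calibration may use a different method.

```python
from collections import defaultdict


def band_factors(rows):
    """rows: iterable of (band, predicted_burden, observed_burden) tuples.

    Returns the observed/predicted ratio per band, used as a scaling factor.
    """
    predicted = defaultdict(float)
    observed = defaultdict(float)
    for band, pred, obs in rows:
        predicted[band] += pred
        observed[band] += obs
    return {b: observed[b] / predicted[b] for b in predicted if predicted[b] > 0}


def calibrate(rows, factors):
    """Scale each prediction by its band's factor; observed values are unchanged."""
    return [(band, pred * factors[band], obs) for band, pred, obs in rows]


# Hypothetical bands: the model under-predicts "1-5" and over-predicts "6-20".
rows = [
    ("1-5", 1.0, 3.0), ("1-5", 1.0, 1.0),
    ("6-20", 4.0, 2.0), ("6-20", 6.0, 3.0),
]
factors = band_factors(rows)
calibrated = calibrate(rows, factors)
post_factors = band_factors(calibrated)  # observed-over-expected after calibration
```

Re-running `band_factors` on the calibrated rows returns 1.0 for every band, which is the check the calibration gate relies on.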
The forward expectation is then surfaced for underwriting as two figures per carrier — expected crashes and expected severity-weighted burden over the next 12 months:
Grades & Risk Relativity
Where a carrier sits among same-size peers
Each carrier's predicted burden is first expressed as a risk relativity — its predicted burden divided by the level that is typical for its size band. 1.00× means typical for its size; below 1 is safer than peers, above 1 is riskier.
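The relativity can be sketched as follows. The text does not say how the "typical" level is defined, so this example assumes the band median; the carrier keys and band labels are placeholders.

```python
from statistics import median


def risk_relativities(predicted_rate: dict, band: dict) -> dict:
    """Divide each carrier's predicted burden rate by its band's typical level.

    ASSUMPTION: "typical" is taken to be the band median.
    """
    by_band = {}
    for dot, rate in predicted_rate.items():
        by_band.setdefault(band[dot], []).append(rate)
    typical = {b: median(rates) for b, rates in by_band.items()}
    return {dot: rate / typical[band[dot]] for dot, rate in predicted_rate.items()}


# Hypothetical carriers keyed by placeholder DOT numbers.
predicted_rate = {"A": 0.2, "B": 0.4, "C": 0.8, "D": 0.1, "E": 0.3}
band = {"A": "small", "B": "small", "C": "small", "D": "large", "E": "large"}

rel = risk_relativities(predicted_rate, band)
```

Carrier B sits exactly at its band's typical level (1.00×), A is at half of it, and C is at double.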
Grades are then assigned from the credibility-shrunk within-band burden-rate percentile — where the carrier sits among same-size peers. Because the comparison is within band, a small fleet is graded against other small fleets, never against the national giants.
The percentile thresholds are the same for all size bands — a carrier in the safest tier earns “Excellent” whether it runs 3 trucks or 3,000. The 0–100 FRED Score (100 = safest) is the same ranking expressed on a friendlier scale.
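As a rough sketch of the percentile-to-score step, the function below ranks carriers within a single size band and maps the safest to 100 and the riskiest to 0. The exact percentile definition, the rounding, and the grade cut points are not stated here, so everything in this example is an assumption for illustration.

```python
def within_band_scores(rates: dict) -> dict:
    """Map shrunk burden rates for ONE size band to 0-100 scores (100 = safest).

    ASSUMPTION: percentile = share of peers with a strictly lower rate.
    """
    n = len(rates)
    scores = {}
    for dot, rate in rates.items():
        lower = sum(1 for other in rates.values() if other < rate)
        # A lone carrier has no peers to rank against; place it mid-scale.
        pct = lower / (n - 1) if n > 1 else 0.5
        scores[dot] = round(100 * (1 - pct))
    return scores


# Five hypothetical same-band carriers with increasing burden rates.
band_rates = {"A": 0.1, "B": 0.2, "C": 0.3, "D": 0.4, "E": 0.5}
scores = within_band_scores(band_rates)
```

Because the function only ever sees one band at a time, a small fleet's score depends solely on other small fleets.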
Rolling Refit
Out-of-time, refreshed weekly
The model is refit on the most recent complete year→year pair — coefficients are learned from one year of carrier history paired with the crash burden that actually followed. Every carrier is then scored on its latest 12 months of history to predict the forward 12 months. Because the model is never scored on the same data it learned from, the FRED Score is genuinely out-of-time.
The data pipeline refreshes the score weekly, so each carrier's grade reflects its current safety posture. A carrier that improves will see it as older events age out; one that deteriorates feels the impact within months, not years.
The most recent ~45 days are held out for crash-reporting lag — crashes take time to appear in FMCSA's feed, so the very latest weeks aren't yet mature enough to score against. This keeps the forward target honest rather than artificially low for recent activity.
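A minimal sketch of the maturity cutoff, assuming a fixed 45-day lag applied to the refresh date; the function names are hypothetical and the real pipeline may treat the lag differently.

```python
from datetime import date, timedelta

REPORTING_LAG_DAYS = 45  # approximate lag stated above


def maturity_cutoff(as_of: date) -> date:
    """Latest crash date considered mature enough to count toward the target."""
    return as_of - timedelta(days=REPORTING_LAG_DAYS)


def is_mature(crash_date: date, as_of: date) -> bool:
    """True if the crash falls on or before the maturity cutoff."""
    return crash_date <= maturity_cutoff(as_of)


cutoff = maturity_cutoff(date(2024, 3, 1))
```

Crashes dated after the cutoff are left out of the forward target until a later refresh, when reporting has caught up.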
Automatic Rules & Flags
Hard overrides, eligibility gates, and informational flags
Beyond the statistical model, a set of deterministic rules handles edge cases where the math alone isn't sufficient. These fall into four categories: score overrides, eligibility gates, data-quality adjustments, and informational flags.
Score Overrides
These rules supersede the calculated score entirely:
If a carrier holds an Unsatisfactory (U) FMCSA safety rating, the FRED Score is forced to 0 and the grade to Critical, regardless of the modeled burden — a regulator’s adverse finding overrides the statistics.
If a carrier has interstate-only scope, holds no active operating authority, and is not exempt, the FRED Score is set to N/A (no score produced). N/A is reserved for defunct or no-authority carriers, never for being small.
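The two overrides can be sketched as a small post-processing function. The argument names are hypothetical, and the order in which the rules are checked is an assumption, since the text does not say which takes precedence when both apply.

```python
from typing import Optional, Tuple


def apply_overrides(score: int, grade: str, *,
                    safety_rating: Optional[str],
                    interstate_only: bool,
                    has_active_authority: bool,
                    is_exempt: bool) -> Tuple[Optional[int], str]:
    """Return (score, grade) after applying the two hard overrides."""
    # ASSUMPTION: the no-authority rule is checked first.
    if interstate_only and not has_active_authority and not is_exempt:
        return None, "N/A"        # no score produced
    if safety_rating == "U":      # Unsatisfactory FMCSA safety rating
        return 0, "Critical"
    return score, grade           # modeled result stands


unsat = apply_overrides(82, "Excellent", safety_rating="U",
                        interstate_only=False, has_active_authority=True,
                        is_exempt=False)
defunct = apply_overrides(60, "Fair", safety_rating=None,
                          interstate_only=True, has_active_authority=False,
                          is_exempt=False)
```

The grade labels passed in here are placeholders; only "Critical" and "N/A" come from the rules themselves.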
Eligibility Gates
A thin carrier is shrunk toward its size-band prior, not blocked. These gates only apply when there is no usable exposure at all or the reported data is implausible:
No reported mileage and zero inspections — nothing to anchor even a prior-based estimate. A carrier with any usable exposure is still scored as Provisional on its size-band prior.
Reported mileage is implausible: >300k miles per truck, <1k miles per truck for fleets ≥10, or >500M total miles. Scoring is blocked to prevent extreme rates.
Reports >300k miles per truck and has fewer than 2 inspections — high mileage can't be corroborated by inspection activity.
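The mileage thresholds in the implausibility gate translate directly into a check like the one below. Only the three numeric limits come from the text; the function name and reason labels are hypothetical, and the inspection-corroboration gate is deliberately left out of this sketch.

```python
from typing import List


def mileage_implausibility(miles: float, power_units: int) -> List[str]:
    """Return the reasons (possibly none) the reported mileage looks implausible."""
    if power_units <= 0:
        raise ValueError("carrier must have at least one power unit")
    per_truck = miles / power_units
    reasons = []
    if per_truck > 300_000:
        reasons.append("over_300k_per_truck")
    if power_units >= 10 and per_truck < 1_000:
        reasons.append("under_1k_per_truck")  # low-mileage limit applies to fleets >= 10
    if miles > 500_000_000:
        reasons.append("over_500m_total")
    return reasons
```

For example, 1.2 million miles across 3 trucks is 400k per truck and trips the first limit, while 5,000 miles across 9 trucks passes because the low-mileage limit only applies to fleets of 10 or more.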
Data Quality Adjustments
Modifications to component scores when data quality is degraded:
When mileage is missing or implausible but the carrier has observed activity (inspections, crashes, or violations), the model falls back to inspection-based exposure rather than trusting the bad odometer figure.
Calculated exposure is floored at 50k miles (0.5 units) to prevent extremely volatile rates from tiny denominators.
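These two adjustments can be sketched together. The 50k-mile floor is from the text; how the inspection-based mileage estimate is derived is not described, so the sketch simply takes it as an input, and the function and parameter names are hypothetical.

```python
from typing import Optional

MILES_PER_UNIT = 100_000
FLOOR_MILES = 50_000  # 0.5 exposure units


def exposure_units(reported_miles: Optional[float],
                   reported_is_usable: bool,
                   inspection_based_miles: float) -> float:
    """Pick the mileage source, apply the floor, and convert to 100k-mile units.

    `inspection_based_miles` is a stand-in for the fallback estimate, whose
    derivation is not specified here.
    """
    if reported_is_usable and reported_miles:
        miles = reported_miles
    else:
        miles = inspection_based_miles
    return max(miles, FLOOR_MILES) / MILES_PER_UNIT


tiny = exposure_units(10_000, True, 0.0)           # floored
normal = exposure_units(2_000_000, True, 0.0)      # reported mileage used as-is
fallback = exposure_units(None, False, 120_000.0)  # inspection-based estimate used
```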
Informational Flags
These flags are attached to carrier records for context but do not directly alter the score:
Chameleon detection cross-references every revoked DOT (any historical REVOCATION order) against active carriers using normalized address, officer name, phone, and DUNS. Matches are tiered by signal strength. The flag is informational and does not affect the FRED score — the underwriter decides what to do with the linkage.
Validation Standards
How we know it works
Every refit is validated out-of-time: the model is fit on one year, then judged on whether it ranks and calibrates the following year's crash burden on a carrier-disjoint holdout. Before a score update goes live it must clear a battery of discrimination and calibration gates — if any fails, the update is held.
Normalized Gini of ≈ 0.59 – 0.61 on the forward year — versus ~0.33 for a naive fleet-size-only baseline.
Observed-over-Expected must hold within tolerance of 1.0 in every fleet-size band — predictions are unbiased across sizes.
Observed forward burden must rise monotonically across risk deciles, with Spearman and Tweedie-deviance checks on the holdout.
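Two of these gates are easy to illustrate. The sketch below uses one common formulation of normalized Gini (the model's Gini divided by the Gini of a perfect ranking) and a strict monotonicity check across risk deciles; the production gates, tolerances, and tie handling may differ.

```python
def gini(actual, predicted):
    """Unnormalized Gini: cumulative share of actual burden captured when
    carriers are sorted from highest to lowest predicted risk."""
    n = len(actual)
    order = sorted(range(n), key=lambda i: (-predicted[i], i))
    total = sum(actual)
    running = 0.0
    area = 0.0
    for i in order:
        running += actual[i]
        area += running / total
    return (area - (n + 1) / 2) / n


def normalized_gini(actual, predicted):
    """Model Gini divided by the Gini of a perfect ranking (1.0 = perfect)."""
    return gini(actual, predicted) / gini(actual, actual)


def rises_monotonically(decile_burden):
    """True if observed burden strictly increases from decile to decile."""
    return all(a < b for a, b in zip(decile_burden, decile_burden[1:]))


actual = [1.0, 2.0, 3.0]
perfect = normalized_gini(actual, [0.1, 0.2, 0.3])   # same ordering as actual
inverted = normalized_gini(actual, [0.3, 0.2, 0.1])  # exactly reversed ordering
```

A perfect ranking scores 1.0 and an exactly reversed ranking scores -1.0, which puts the reported forward-year range in context against the fleet-size-only baseline.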
All scoring, calibration, and out-of-time validation are performed by the reproducible pipeline fred_postgres_v4.py.