How We Score Safety

The FRED Score is a forward-looking model that predicts a carrier's expected severity-weighted crash burden over the next 12 months, then grades it against similarly sized peers. It is fit on one year of history to predict the following year, so it is validated out-of-time rather than scored on the same data it learned from.

Scoring Pipeline at a Glance

1. Collect FMCSA Data: crashes, inspections, violations, census records
2. Set the Exposure Base: mileage and inspection activity as the model's exposure offset
3. Fit Frequency × Severity: a Tweedie/Poisson GLM predicts next-year crash burden
4. Apply Credibility & Confidence: shrink thin-data carriers toward their size-band prior; assign a confidence tier
5. Calibrate Per Size Band: tune each fleet-size band so observed/expected ≈ 1.0
6. Assign Grade & Relativity: Excellent through Critical from within-band burden percentile; risk relativity vs. size-typical
7. Rolling Refit: coefficients refit weekly on the latest year-pair; each carrier scored on its latest 12 months

FRED Score & Grade: 0–100 (100 = safest) plus an Excellent→Critical band
Expected Crash Burden: expected crashes and severity-weighted burden, next 12 months
Risk Relativity & Confidence: 1.00× = typical for size; High→Prior-only confidence tier

Data Sources

Seven FMCSA datasets feed the scoring pipeline

Every score starts with public data from the Federal Motor Carrier Safety Administration. We pull seven distinct datasets, join them on DOT number, and filter to the population of active for-hire property carriers.

| Dataset | What It Contains | Updates |
|---|---|---|
| Census | Carrier registration, fleet size, address, authority type, officer names | Daily |
| Inspections | Roadside inspections with driver and vehicle out-of-service (OOS) counts | Daily |
| Crashes | Reportable crash events with fatality, injury, and tow-away details | Monthly |
| Violations | Individual violation records with category, severity weight, and inspection date | Monthly |
| BASIC Scores | SMS safety measure scores: SMS AB (interstate + intrastate hazmat) and SMS C (intrastate non-hazmat), merged by DOT number | Monthly |
| SMS Census | Authority classification fields: for-hire, exempt, private property, government, etc. | Monthly |
| History | Operating authority orders, revocations, and docket status changes | Daily |

The raw census file contains ~9 million rows. After joining with SMS Census authority fields and applying exclusion filters, roughly 1.1 million active for-hire property carriers with power units are graded — including thin-data carriers, which are shrunk toward their size-band prior rather than dropped.

Carrier Eligibility

Who gets scored — and who doesn't

Not every FMCSA-registered entity is a for-hire trucking carrier. We apply hard exclusions that remove non-carriers entirely — but beyond those, nearly every carrier with power units and usable exposure is graded. Thin-data carriers are not dropped to N/A; their estimate is shrunk toward their size-band prior and tagged with a confidence tier. This is why the FRED Score covers about 1.1 million carriers, versus roughly 475k under the prior approach that excluded thin carriers outright.

Pre-Scoring Exclusions

Carriers matching any of these criteria are removed before scoring begins:

Inactive Status — carrier's FMCSA status is not "Active"
Passenger Operations — operates buses, coaches, school buses, vans, or limos (Fleetidy covers freight/property only)
No Truck Power Units — zero trucks or tractors across all ownership types (owned, leased, trip-leased)
No For-Hire Authority — private-property-only carriers without authorized or exempt for-hire classification
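
The four exclusions above can be read as one predicate over a census record. The sketch below is illustrative only: the `CensusRecord` fields and the `exclusion_reason` helper are hypothetical names chosen for this example, not the pipeline's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CensusRecord:
    # Hypothetical field names; the real census columns may differ.
    status: str                # FMCSA status, e.g. "Active"
    passenger_operation: bool  # buses, coaches, school buses, vans, limos
    truck_power_units: int     # trucks + tractors: owned, leased, trip-leased
    for_hire: bool             # authorized or exempt for-hire classification


def exclusion_reason(rec: CensusRecord) -> Optional[str]:
    """Return the first matching pre-scoring exclusion, or None if the carrier proceeds."""
    if rec.status != "Active":
        return "Inactive Status"
    if rec.passenger_operation:
        return "Passenger Operations"
    if rec.truck_power_units <= 0:
        return "No Truck Power Units"
    if not rec.for_hire:
        return "No For-Hire Authority"
    return None
```

A carrier that returns `None` here moves on to scoring, however thin its data.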

Confidence Tiers, Not Exclusion

Rather than dropping carriers with thin data, the model shrinks their estimate toward their size-band prior and labels how much real track record backs the score:

High / Moderate — enough miles and inspection activity that the carrier's own history carries most of the weight.
Low / Prior-only — thin exposure, so the size-band prior dominates. These are shown as Provisional and capped at “Satisfactory” — a carrier can’t earn a top grade without a track record. Each gets a plain-English reason.
N/A is reserved for carriers with no active operating authority or defunct interstate operation — not for being small.
Carriers with no operating authority or implausible reported mileage still appear in search results with their census data, but won’t receive a FRED Score or grade.

Data & Exposure Normalization

Why raw counts are misleading

Public FMCSA records — crashes, inspections, and violations — are aggregated for each carrier over its most recent observation window. But comparing a 3-truck fleet to a 500-truck fleet by raw event counts is inherently unfair. A larger fleet drives more miles and naturally encounters more events. Exposure is how the model accounts for that: it enters the prediction as a log-exposure offset, so the model predicts a rate of crash burden, not a raw count.

3 trucks, 2 crashes: looks dangerous?
500 trucks, 30 crashes: actually much safer per mile.

Exposure is the carrier's mileage over the scoring window, in 100k-mile units. It enters the model as log(E) — an offset that puts every metric on a per-100k-mile footing so small and large fleets are predicted on the same scale:

$$E_i = \text{window\_miles}_i / 100{,}000$$
Exposure is floored at 0.5 (50k miles) to prevent extreme, meaningless rates for carriers reporting very low mileage.
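
As a small worked sketch of the exposure definition and its floor (the function names are hypothetical; only the 100k-mile unit and the 0.5 floor come from the text):

```python
import math

MILES_PER_UNIT = 100_000   # exposure is measured in 100k-mile units
EXPOSURE_FLOOR = 0.5       # 50k miles, per the text


def exposure_units(window_miles: float) -> float:
    """Exposure E_i in 100k-mile units, floored at 0.5."""
    return max(window_miles / MILES_PER_UNIT, EXPOSURE_FLOOR)


def log_offset(window_miles: float) -> float:
    """The log(E) term that enters the model as an offset."""
    return math.log(exposure_units(window_miles))


small = exposure_units(20_000)      # below the floor, so 0.5
large = exposure_units(5_000_000)   # 50.0 units
```

Without the floor, a carrier reporting 20,000 miles would get E = 0.2, and a single crash would translate into a rate of 5 per 100k miles.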

Credibility & Confidence

How much to trust a carrier’s own history

Even on a per-mile basis, a tiny fleet with 1 crash in 50k miles looks far worse than a large fleet with 10 crashes in 5 million miles — even though the small fleet's rate is mostly just noise. One lucky or unlucky year can swing it wildly. Rather than throwing thin carriers away, the model blends each carrier's own signal with its size-band prior using Bühlmann-Straub credibility.

How Credibility Shrinkage Works
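[Figure: How Credibility Shrinkage Works. On an axis running from low rate to high rate, the raw rate of a 3-truck fleet is pulled toward the band prior to give its stabilized rate, while a large fleet's rate stays where it is.]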

Each carrier gets a credibility weight $Z = E/(E+\beta_{\text{band}})$ — the share of the estimate that comes from the carrier's own record, with the size-band prior filling the rest. Carriers with lots of exposure ($E \gg \beta$) are governed almost entirely by their own history; thin carriers ($E \ll \beta$) are pulled ("shrunk") toward the band prior.

$$Z_i = \frac{E_i}{E_i + \beta_{\text{band}}}, \quad \hat{\lambda}_i = Z_i\,\lambda^{\text{own}}_i + (1-Z_i)\,\lambda^{\text{prior}}_{\text{band}}$$

That same credibility weight drives each carrier's confidence tier:

High / Moderate — the carrier's own history carries most of the weight.
Low / Prior-only — the band prior dominates; shown as Provisional, capped at “Satisfactory”.
When $E_i$ is large relative to $\beta_{\text{band}}$, $Z_i \to 1$ and the estimate is the carrier's own rate. When $E_i$ is small, $Z_i \to 0$ and the size-band prior dominates. Bigger sample → more credibility, higher confidence tier.
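
The blend can be sketched in a few lines. The value of `beta_band` and the prior rate below are made-up numbers for illustration; the source does not publish the fitted values.

```python
def credibility_weight(exposure: float, beta_band: float) -> float:
    """Z = E / (E + beta_band): the share of the estimate taken from the carrier's own record."""
    return exposure / (exposure + beta_band)


def shrunk_rate(own_rate: float, prior_rate: float,
                exposure: float, beta_band: float) -> float:
    """Credibility-weighted blend of the carrier's own rate and its size-band prior."""
    z = credibility_weight(exposure, beta_band)
    return z * own_rate + (1.0 - z) * prior_rate


BETA = 10.0    # hypothetical band parameter, in 100k-mile units
PRIOR = 0.30   # hypothetical band prior rate

# Tiny fleet: 1 crash in 50k miles (E = 0.5, raw rate 2.0 per 100k miles)
tiny = shrunk_rate(own_rate=2.0, prior_rate=PRIOR, exposure=0.5, beta_band=BETA)
# Large fleet: 10 crashes in 5M miles (E = 50, raw rate 0.2 per 100k miles)
big = shrunk_rate(own_rate=0.2, prior_rate=PRIOR, exposure=50.0, beta_band=BETA)
```

Under these assumed numbers the tiny fleet's rate lands close to the prior, while the large fleet's rate barely moves from its own history.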

Frequency × Severity Model

Predicting next year’s crash burden

The FRED Score targets a carrier's severity-weighted crash burden — not just how many crashes, but how bad. Each crash is weighted by its outcome, so a fatal collision counts far more than a minor tow-away:

$$w_{\text{crash}} = 1 + 12\,(\text{fatal}) + 4\,(\text{injury}) + 3\,(\text{hazmat released})$$
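
A direct transcription of the weight formula, assuming that `fatal`, `injury`, and `hazmat released` are 0/1 indicators per crash (the text does not say whether they are indicators or counts):

```python
def crash_weight(fatal: bool, injury: bool, hazmat_released: bool) -> int:
    """Severity weight for one crash: 1 + 12*fatal + 4*injury + 3*hazmat_released."""
    return 1 + 12 * int(fatal) + 4 * int(injury) + 3 * int(hazmat_released)


def severity_weighted_burden(crashes) -> int:
    """Sum of weights over an iterable of (fatal, injury, hazmat_released) tuples."""
    return sum(crash_weight(*crash) for crash in crashes)


tow_away_only = crash_weight(False, False, False)
fatal_crash = crash_weight(True, False, False)
burden = severity_weighted_burden([(False, False, False), (False, True, False)])
```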

That severity-weighted burden is modeled as frequency × severity with a Poisson / Tweedie GLM and a log-exposure offset, so the model predicts the carrier's expected burden per mile driven over the forward 12 months. A gradient-boosted XGBoost-Tweedie model is run alongside as a cross-check.

$$\mathbb{E}[\text{burden}_i^{\text{next 12mo}}] = \exp\!\big(\log E_i + \beta_0 + \textstyle\sum_k \beta_k x_{ik}\big)$$
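
To see what the log-exposure offset does, the mean formula can be evaluated by hand. The coefficient values and feature names below are invented for illustration; they are not the fitted model.

```python
import math


def expected_burden(exposure: float, intercept: float,
                    coefs: dict, features: dict) -> float:
    """exp(log E + beta_0 + sum_k beta_k * x_k), the GLM mean with a log-exposure offset."""
    linear = intercept + sum(coefs[name] * features[name] for name in coefs)
    return math.exp(math.log(exposure) + linear)


# Hypothetical coefficients and feature values
coefs = {"prior_crash_relativity": 0.40, "severe_violation_rate": 0.15}
features = {"prior_crash_relativity": 1.2, "severe_violation_rate": 0.5}

one_unit = expected_burden(1.0, -2.0, coefs, features)
ten_units = expected_burden(10.0, -2.0, coefs, features)
```

Because the offset coefficient is fixed at 1, ten times the exposure gives exactly ten times the expected burden with the same features, which is what makes the fitted part of the model a rate.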

The coefficients $\beta_k$ are fit from data, not hand-set. In order of predictive strength, the strongest predictors are:

1. Prior-crash EB relativity (strongest)
2. Fleet size
3. Inspection exposure
4. Behavioral / severe-violation rates

Relative-strength illustration; exact coefficients are refit each cycle from the most recent year-over-year history.

Violations as Fitted Predictors

Roadside violations are summarized into behavioral (driver-conduct) and equipment (vehicle-condition) rates, plus a severe-violation rate. These enter the model as features whose weight is learned from how strongly they predict the following year's crash burden — the relativities below are illustrative of the ordering the fit recovers, not pre-set multipliers baked into the score:

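| Tier | Violation Type | Forward-crash RR |
|---|---|---|
| Critical: immediate-danger behaviors | Reckless Driving | 1.49 |
| Critical: immediate-danger behaviors | Dangerous Driving | 1.37 |
| Critical: immediate-danger behaviors | Jumping OOS / Driving Fatigued | 1.36 |
| High: serious behavioral risks | Speeding (high & excessive) | 1.30 |
| High: serious behavioral risks | Drugs / Alcohol | 1.29 |
| High: serious behavioral risks | Alcohol Possession | 1.27 |
| High: serious behavioral risks | Moderate Speeding | 1.20 |
| High: serious behavioral risks | Phone Call / Texting | 1.16–1.18 |
| Moderate: concerning behaviors | False Log | 1.12 |
| Moderate: concerning behaviors | Seat Belt | 1.11 |
| Equipment: vehicle condition | Lighting | 1.17 |
| Equipment: vehicle condition | Tires | 1.16 |
| Equipment: vehicle condition | Brakes (all types) | 1.13 |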

Each figure is the empirical relative risk — the ratio of next-year crash burden for carriers with that violation type versus those without. The model learns how much weight to give each behavioral and equipment feature directly from the year-over-year data, so its influence on the score reflects measured forward risk rather than a fixed assumption.
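
A minimal sketch of how such a relative risk could be computed from year-pair data. It assumes the comparison is made on burden per unit of exposure, which the text does not state explicitly, and the sample rows are made up.

```python
def relative_risk(carriers) -> float:
    """carriers: iterable of (has_violation, next_year_burden, exposure_units).

    Returns the next-year burden rate of carriers with the violation
    divided by the rate of carriers without it.
    """
    burden = {True: 0.0, False: 0.0}
    exposure = {True: 0.0, False: 0.0}
    for has_violation, next_burden, units in carriers:
        burden[bool(has_violation)] += next_burden
        exposure[bool(has_violation)] += units
    rate_with = burden[True] / exposure[True]
    rate_without = burden[False] / exposure[False]
    return rate_with / rate_without


# Made-up sample: (has violation, next-year burden, exposure in 100k miles)
sample = [(True, 4.0, 2.0), (True, 2.0, 2.0), (False, 10.0, 10.0)]
rr = relative_risk(sample)  # (6 / 4) / (10 / 10)
```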

Per-Band Calibration

Unbiased predictions across every fleet size

A model can rank carriers well yet still systematically over- or under-predict for a given fleet size. To prevent that, each fleet-size band is calibrated so the total predicted burden matches the total observed burden — the observed-over-expected ratio lands at ≈ 1.0:

$$\text{O/E}_{\text{band}} = \frac{\sum_{i \in \text{band}} \text{observed burden}_i}{\sum_{i \in \text{band}} \mathbb{E}[\text{burden}_i]} \approx 1.0$$

The forward expectation is then surfaced for underwriting as two figures per carrier — expected crashes and expected severity-weighted burden over the next 12 months:

$$\mathbb{E}[\text{crashes}_i] = \hat{\lambda}^{\text{freq}}_i \times E_i, \qquad \mathbb{E}[\text{burden}_i] = \hat{\lambda}^{\text{burden}}_i \times E_i$$
Because calibration is enforced band-by-band, predictions are unbiased for everyone from single-truck owner-operators to large fleets — a small carrier’s expected burden is just as trustworthy in aggregate as a large carrier’s. Per-band O/E is re-checked on every refresh.
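
One simple way to enforce O/E ≈ 1.0 per band is a multiplicative rescale of each band's predictions. Whether the pipeline uses exactly this adjustment is an assumption; the band labels and numbers are invented.

```python
from collections import defaultdict


def band_oe(rows) -> dict:
    """rows: iterable of (band, observed_burden, expected_burden). Returns O/E per band."""
    observed = defaultdict(float)
    expected = defaultdict(float)
    for band, obs, pred in rows:
        observed[band] += obs
        expected[band] += pred
    return {band: observed[band] / expected[band]
            for band in expected if expected[band] > 0}


def calibrate(rows) -> list:
    """Scale each prediction by its band's O/E so band totals match observed totals."""
    factors = band_oe(rows)
    return [(band, obs, pred * factors.get(band, 1.0)) for band, obs, pred in rows]


rows = [("1-5 trucks", 2.0, 1.0), ("1-5 trucks", 0.0, 3.0), ("100+ trucks", 30.0, 20.0)]
factors = band_oe(rows)                  # small band over-predicted, large band under-predicted
recalibrated = band_oe(calibrate(rows))  # every band back to 1.0
```

The rescale changes the level of predictions within a band but not their ordering, so ranking quality is unaffected.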

Grades & Risk Relativity

Where a carrier sits among same-size peers

Each carrier's predicted burden is first expressed as a risk relativity — its predicted burden divided by the level that is typical for its size band. 1.00× means typical for its size; below 1 is safer than peers, above 1 is riskier.

$$\text{RiskRelativity}_i = \frac{\mathbb{E}[\text{burden}_i]}{\text{burden typical for size band}}$$
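
As an illustration, the sketch below takes "typical for size band" to be the band median of predicted burden. That choice is an assumption, since the text does not define the denominator, and the DOT-style keys are placeholders.

```python
from statistics import median


def risk_relativities(predicted_by_carrier: dict) -> dict:
    """Divide each carrier's predicted burden by the band's typical (here: median) level."""
    typical = median(predicted_by_carrier.values())
    return {carrier: value / typical for carrier, value in predicted_by_carrier.items()}


# Placeholder carriers in one size band, with made-up predicted burden rates
band = {"carrier_a": 0.5, "carrier_b": 1.0, "carrier_c": 2.0}
relativities = risk_relativities(band)
```

Here `carrier_a` comes out at 0.50× (safer than peers) and `carrier_c` at 2.00× (riskier).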

Grades are then assigned from the credibility-shrunk within-band burden-rate percentile — where the carrier sits among same-size peers. Because the comparison is within band, a small fleet is graded against other small fleets, never against the national giants.

Safety Grade Scale (safest to riskiest within size band): Excellent → Strong → Satisfactory → Marginal → Poor → Critical

The percentile thresholds are the same for all size bands — a carrier in the safest tier earns “Excellent” whether it runs 3 trucks or 3,000. The 0–100 FRED Score (100 = safest) is the same ranking expressed on a friendlier scale.

Carriers in the Low or Prior-only confidence tier are shown as Provisional and capped at “Satisfactory” — without enough track record, a carrier can’t earn a top grade on luck alone. Each Provisional carrier carries a plain-English reason explaining why.
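
The grading and Provisional cap can be sketched as a lookup plus one rule. The percentile cut points below are purely hypothetical, because the source does not publish them; only the grade names, the shared thresholds across bands, and the cap at Satisfactory come from the text.

```python
from bisect import bisect_right

GRADES = ["Excellent", "Strong", "Satisfactory", "Marginal", "Poor", "Critical"]
# Hypothetical upper bounds on the within-band burden percentile (0 = safest).
CUTS = [0.10, 0.35, 0.65, 0.85, 0.95]
PROVISIONAL_TIERS = {"Low", "Prior-only"}


def assign_grade(percentile: float, confidence_tier: str):
    """Return (grade, is_provisional) for a within-band burden percentile."""
    index = bisect_right(CUTS, percentile)
    provisional = confidence_tier in PROVISIONAL_TIERS
    if provisional:
        # A thin-data carrier cannot be graded better than Satisfactory.
        index = max(index, GRADES.index("Satisfactory"))
    return GRADES[index], provisional
```

The cap only ever moves a grade down toward Satisfactory; a Provisional carrier that already grades Marginal or worse keeps that grade.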

Rolling Refit

Out-of-time, refreshed weekly

The model is refit on the most recent complete year→year pair — coefficients are learned from one year of carrier history paired with the crash burden that actually followed. Every carrier is then scored on its latest 12 months of history to predict the forward 12 months. Because the model is never scored on the same data it learned from, the FRED Score is genuinely out-of-time.

The data pipeline refreshes the score weekly, so each carrier's grade reflects its current safety posture. A carrier that improves will see it as older events age out; one that deteriorates feels the impact within months, not years.

The most recent ~45 days are held out for crash-reporting lag — crashes take time to appear in FMCSA's feed, so the very latest weeks aren't yet mature enough to score against. This keeps the forward target honest rather than artificially low for recent activity.
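
One reading of the lag rule is that crash data is treated as mature only up to 45 days before the refresh date, and the 12-month window ends there. The sketch below encodes that reading; the function name and the 365-day window length are assumptions.

```python
from datetime import date, timedelta


def mature_window(as_of: date, lag_days: int = 45, window_days: int = 365):
    """Return (start, end) of the latest 12-month window that excludes the reporting-lag period."""
    end = as_of - timedelta(days=lag_days)
    start = end - timedelta(days=window_days)
    return start, end


start, end = mature_window(date(2025, 3, 1))
```

For a refresh on 2025-03-01, crashes after 2025-01-15 would be left out of the target because they may not have reached FMCSA's feed yet.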

Automatic Rules & Flags

Hard overrides, eligibility gates, and informational flags

Beyond the statistical model, a set of deterministic rules handles edge cases where the math alone isn't sufficient. These fall into four categories: score overrides, eligibility gates, data-quality adjustments, and informational flags.

Score Overrides

These rules supersede the calculated score entirely:

1. FMCSA Adverse Safety Rating

If a carrier holds an Unsatisfactory (U) FMCSA safety rating, the FRED Score is forced to 0 and the grade to Critical, regardless of the modeled burden — a regulator’s adverse finding overrides the statistics.

2. Defunct / No Operating Authority

If a carrier has interstate-only scope, no active operating authority, and is not exempt — the FRED Score is set to N/A (no score produced). N/A is reserved for defunct or no-authority carriers, never for being small.
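
Both overrides are simple enough to express as one function applied after the model has produced a score. The parameter names are hypothetical, and checking the N/A rule before the adverse-rating rule is an assumption about precedence that the text does not settle.

```python
from typing import Optional, Tuple


def apply_overrides(score: float, grade: str, fmcsa_rating: Optional[str],
                    interstate_only: bool, has_active_authority: bool,
                    is_exempt: bool) -> Tuple[Optional[float], str]:
    """Apply the two score overrides; return (score, grade), with score None for N/A."""
    # Rule 2: interstate-only, no active authority, not exempt -> no score produced.
    if interstate_only and not has_active_authority and not is_exempt:
        return None, "N/A"
    # Rule 1: an Unsatisfactory FMCSA rating forces the lowest score and grade.
    if fmcsa_rating == "U":
        return 0, "Critical"
    return score, grade
```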

Eligibility Gates

A thin carrier is shrunk toward its size-band prior, not blocked. These gates only apply when there is no usable exposure at all or the reported data is implausible:

3. No Usable Exposure

No reported mileage and zero inspections — nothing to anchor even a prior-based estimate. A carrier with any usable exposure is still scored as Provisional on its size-band prior.

4. Mileage Outlier

Reported mileage is implausible: >300k miles per truck, <1k miles per truck for fleets ≥10, or >500M total miles. Scoring is blocked to prevent extreme rates.

5. Unverifiable High Mileage

Reports >300k miles per truck and has fewer than 2 inspections — high mileage can't be corroborated by inspection activity.
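
The three gates can be sketched as a single check that returns the name of the gate that fired. The thresholds are taken from the text; the function name, argument names, and the order in which the gates are evaluated are assumptions.

```python
from typing import Optional


def eligibility_gate(total_miles: Optional[float], trucks: int,
                     inspections: int) -> Optional[str]:
    """Return the name of the blocking gate, or None if the carrier can be scored."""
    if not total_miles:
        # No mileage: blocked only when there are no inspections either.
        return "No Usable Exposure" if inspections == 0 else None
    per_truck = total_miles / trucks  # trucks > 0 is guaranteed by pre-scoring exclusions
    if per_truck > 300_000 and inspections < 2:
        return "Unverifiable High Mileage"
    if (per_truck > 300_000
            or total_miles > 500_000_000
            or (trucks >= 10 and per_truck < 1_000)):
        return "Mileage Outlier"
    return None
```

A 3-truck carrier reporting 300,000 total miles (100k per truck) passes every gate, while a 2-truck carrier reporting 1,000,000 miles is blocked under either the outlier or the unverifiable-mileage rule, depending on its inspection count.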

Data Quality Adjustments

Modifications to component scores when data quality is degraded:

6. Unreliable Mileage

When mileage is missing or implausible but the carrier has observed activity (inspections, crashes, or violations), the model falls back to inspection-based exposure rather than trusting the bad odometer figure.

7. Exposure Floor

Calculated exposure is floored at 50k miles (0.5 units) to prevent extremely volatile rates from tiny denominators.

Informational Flags

These flags are attached to carrier records for context but do not directly alter the score:

NO_OPERATING_AUTHORITY — docket revoked or inactive
LOW_RELIABILITY — fewer than 5 inspections
INSPECTION_PER_PU_OUTLIER — inspection rate outside 1st–99th percentile
GOVERNMENT_ENTITY — federal, state, or local government carrier
HHG_ONLY — exclusively hauls household goods
MEXICAN_CARRIER — domiciled in Mexico
CANADIAN_CARRIER — domiciled in Canada
CHAMELEON_SUSPECT_HIGH — FMCSA links this DOT to a prior revoked DOT
CHAMELEON_SUSPECT_MEDIUM — shares address + officer with a revoked DOT (within 36 months)
CHAMELEON_SUSPECT_LOW — shares DUNS, or address + phone, with a revoked DOT

Chameleon detection cross-references every revoked DOT (any historical REVOCATION order) against active carriers using normalized address, officer name, phone, and DUNS. Matches are tiered by signal strength. The flag is informational and does not affect the FRED score — the underwriter decides what to do with the linkage.
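
A rough sketch of the tiering logic for one active/revoked pair. The normalization (lowercase, letters and digits only), the dictionary keys, and the reading of "within 36 months" as months since the revocation order are all assumptions; real address and name matching would need to be considerably more careful.

```python
import re
from typing import Optional


def norm(value: Optional[str]) -> str:
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", (value or "").lower())


def chameleon_flag(active: dict, revoked: dict, fmcsa_linked: bool,
                   months_since_revocation: int) -> Optional[str]:
    """Return the strongest chameleon flag for this pair, or None."""
    def same(key: str) -> bool:
        a, b = norm(active.get(key)), norm(revoked.get(key))
        return a != "" and a == b  # empty fields never count as a match

    if fmcsa_linked:
        return "CHAMELEON_SUSPECT_HIGH"
    if same("address") and same("officer") and months_since_revocation <= 36:
        return "CHAMELEON_SUSPECT_MEDIUM"
    if same("duns") or (same("address") and same("phone")):
        return "CHAMELEON_SUSPECT_LOW"
    return None


# Made-up records for illustration
active = {"address": "12 Main St., Tulsa OK", "officer": "Jane Q. Doe", "phone": "(555) 010-2000"}
revoked = {"address": "12 MAIN ST TULSA OK", "officer": "JANE Q DOE", "phone": "555-010-9999"}
```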

Validation Standards

How we know it works

Every refit is validated out-of-time: the model is fit on one year, then judged on whether it ranks and calibrates the following year's crash burden on a carrier-disjoint holdout. Before a score update goes live it must clear a battery of discrimination and calibration gates — if any fails, the update is held.

Out-of-Time Gini

Normalized Gini of ≈ 0.59 – 0.61 on the forward year — versus ~0.33 for a naive fleet-size-only baseline.

Per-Band O/E

Observed-over-Expected must hold within tolerance of 1.0 in every fleet-size band — predictions are unbiased across sizes.

Decile Monotonicity

Observed forward burden must rise monotonically across risk deciles, with Spearman and Tweedie-deviance checks on the holdout.
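
Two of these gates are easy to illustrate. The sketch below computes an unweighted normalized Gini (the pipeline's version may well be exposure-weighted; that detail is not given) and a strict decile-monotonicity check. The sample numbers are invented.

```python
def gini(actual, predicted) -> float:
    """Gini of `actual` when carriers are ranked by `predicted`, riskiest first."""
    n = len(actual)
    order = sorted(range(n), key=lambda i: (-predicted[i], i))
    total = sum(actual)
    running = 0.0
    area = 0.0
    for i in order:
        running += actual[i]
        area += running / total
    return area / n - (n + 1) / (2 * n)


def normalized_gini(actual, predicted) -> float:
    """Gini of the model ranking divided by the Gini of a perfect ranking."""
    return gini(actual, predicted) / gini(actual, actual)


def rises_monotonically(decile_burden) -> bool:
    """True when observed burden strictly increases from the safest decile to the riskiest."""
    return all(a < b for a, b in zip(decile_burden, decile_burden[1:]))


observed = [0.0, 0.0, 1.0, 3.0]
perfect = normalized_gini(observed, [0.1, 0.2, 0.5, 0.9])
inverted = normalized_gini(observed, [0.9, 0.5, 0.2, 0.1])
```

A perfect ranking scores 1.0 and a fully inverted one scores -1.0, which puts the reported 0.59–0.61 and the ~0.33 size-only baseline on a common scale.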

All scoring, calibration, and out-of-time validation is performed by the reproducible pipeline fred_postgres_v4.py.