Report 158 Research — Empirical Study

Executive Summary

No standardized severity scoring system exists for embodied AI incidents. The CVSS (Common Vulnerability Scoring System) addresses software vulnerabilities but not physical harm. The NIST Cybersecurity Framework scores organizational risk but not autonomy-specific failure modes. OSHA tracks injuries but not the algorithmic causes. The OECD AI Incident Monitor collects reports but does not rank them.

This report introduces the Embodied AI Incident Severity Index (EAISI), a five-dimension scoring system designed specifically for cyber-physical AI incidents. EAISI scores 38 documented incidents from the Failure-First corpus, identifies the five highest-severity incidents, and provides a machine-readable dataset for ongoing tracking.

Key finding: The five highest-EAISI incidents are not all fatalities. The Kargu-2 autonomous drone engagement (EAISI 17/20) scores highest because it combines lethality, full autonomy, zero governance, and systematic reproducibility. Tesla Autopilot/FSD cumulative fatalities (EAISI 15) and Amazon warehouse robot-paced injuries (EAISI 15) score equally high due to systemic scale despite lower autonomy levels.


1. Motivation

Existing incident databases (AIID, OECD AI Monitor, FDA MAUDE, NHTSA SGO) collect incident reports but do not provide comparable severity scores. This creates three problems:

  1. Prioritization failure. A Knightscope robot drowning in a fountain and a pedestrian fatality appear in the same database with no severity differentiation.
  2. Governance gap measurement. Without severity scoring, it is not possible to demonstrate that the most severe incidents occur in the least-governed domains.
  3. Trend analysis. Without comparable scores, it is not possible to track whether the severity profile of embodied AI incidents is changing over time.

EAISI addresses these by providing a structured, reproducible severity metric.


2. Methodology: EAISI Dimensions

EAISI scores each incident on five dimensions, each rated 0-4, for a maximum score of 20.

D1: Physical Harm

| Score | Definition | Examples |
|---|---|---|
| 0 | No physical harm to humans or property | Data breach, software-only failure |
| 1 | Property damage only | Robot self-destruction, equipment damage |
| 2 | Minor injury | Bruises, lacerations, temporary impairment |
| 3 | Serious injury | Fractures, hospitalization, permanent impairment |
| 4 | Fatality | One or more deaths |

D2: Scale

| Score | Definition | Examples |
|---|---|---|
| 0 | Single event affecting one person/system | One collision, one injury |
| 1 | Few affected (2-10) | Small cluster of injuries |
| 2 | Dozens affected | Facility-wide impact, community disruption |
| 3 | Hundreds affected | Multiple facilities, regional impact |
| 4 | Systemic / thousands+ | Industry-wide pattern, national-scale |

D3: Autonomy Level

| Score | Definition | Examples |
|---|---|---|
| 0 | Remote-controlled (human in direct control) | Teleoperated robot |
| 1 | Supervised automation (human approves each action) | Surgical robot with surgeon control |
| 2 | Semi-autonomous (human oversight, system executes) | L2 driver assistance, warehouse pacing |
| 3 | Autonomous (system operates independently, human can intervene) | Robotaxi, delivery robot, security patrol |
| 4 | Fully autonomous + lethal capability | Autonomous weapon system, runaway train |

D4: Governance Response

| Score | Definition | Examples |
|---|---|---|
| 0 | Framework exists and is actively enforced | Mature regulatory regime with inspectors |
| 1 | Framework exists but enforcement is partial | FDA MAUDE reporting, ISO standards |
| 2 | Partial framework (some rules, gaps remain) | NHTSA SGO, WorkSafe WA mining |
| 3 | Reactive only (governance responds after incident) | Post-incident investigation, no proactive rules |
| 4 | No applicable governance framework | No standards, no reporting requirements |

D5: Reproducibility Risk

| Score | Definition | Examples |
|---|---|---|
| 0 | Unique circumstances (extremely unlikely to recur) | One-off environmental anomaly |
| 1 | Rare (requires unusual conditions) | Specific sensor + weather + timing |
| 2 | Possible (conditions exist but uncommon) | Known edge case, partially mitigated |
| 3 | Likely (conditions are common in deployment) | Routine operational scenario |
| 4 | Systematic (inherent to the technology/deployment model) | Architectural vulnerability, design pattern |
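Taken together, the five dimensions reduce to a simple additive score. A minimal sketch in Python; the class and attribute names here are illustrative, not part of the published schema:

```python
from dataclasses import dataclass

@dataclass
class EAISIScore:
    """One EAISI-scored incident. Each dimension is an integer 0-4."""
    d1_physical_harm: int
    d2_scale: int
    d3_autonomy: int
    d4_governance: int
    d5_reproducibility: int

    def __post_init__(self):
        # Reject out-of-range dimension values up front.
        for name, value in vars(self).items():
            if not 0 <= value <= 4:
                raise ValueError(f"{name} must be in 0-4, got {value}")

    @property
    def total(self) -> int:
        """Unweighted sum of the five dimensions (0-20)."""
        return sum(vars(self).values())

# Kargu-2 incident (EAISI-032) as scored in Section 3.1.
kargu = EAISIScore(4, 1, 4, 4, 4)
print(kargu.total)  # → 17
```

The unweighted sum is a design choice: it treats a governance gap as severity-relevant even when physical harm is low, which is exactly what puts EAISI-037 in the top five.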

3. Scored Incidents

Thirty-eight incidents were scored from the Failure-First incident corpus, blog posts, and the GLI dataset. Full machine-readable scores are provided in data/governance/incident_severity_index_v0.1.jsonl.

3.1 Top 5 Highest-EAISI Incidents

| Rank | EAISI | ID | Incident | D1 | D2 | D3 | D4 | D5 |
|---|---|---|---|---|---|---|---|---|
| 1 | 17 | EAISI-032 | Kargu-2 autonomous drone lethal engagement (Libya, 2020) | 4 | 1 | 4 | 4 | 4 |
| 2 | 15 | EAISI-003 | Tesla Autopilot/FSD cumulative fatalities (2016-2025) | 4 | 3 | 2 | 2 | 4 |
| 3 | 15 | EAISI-008 | Amazon warehouse robot-paced work injuries (2016-2025) | 3 | 4 | 2 | 2 | 4 |
| 4 | 14 | EAISI-004 | Da Vinci surgical robot adverse events (2000-2025) | 4 | 4 | 1 | 1 | 4 |
| 5 | 14 | EAISI-037 | Delivery robot vandalism/theft pattern (2019-2025) | 1 | 2 | 3 | 4 | 4 |

3.2 Analysis of Top 5

EAISI-032 (Kargu-2, score 17/20): The highest-scoring incident is the only one in our corpus to score 4 on three dimensions simultaneously (D3 autonomy, D4 governance, D5 reproducibility). The UN Panel of Experts documented what may be the first autonomous lethal engagement without human authorization. No binding international framework governs lethal autonomous weapons, and the technology is being actively proliferated.

EAISI-003 (Tesla, score 15/20): Scores high on scale (65+ deaths across years) and reproducibility (systematic overreliance on L2 marketed as autonomy). The governance response (D4=2) reflects partial NHTSA oversight that has not prevented continued fatalities. The relatively lower autonomy score (D3=2) reflects that these are L2 systems requiring driver engagement, yet the systemic nature of the failures (D5=4) compensates.

EAISI-008 (Amazon, score 15/20): A different severity profile: not fatalities but mass-scale injury. D2=4 (systemic, thousands of workers affected across many facilities) and D5=4 (inherent to the robot-paced work model) drive the score. OSHA enforcement exists but penalties are considered insufficient relative to the scale of harm (D4=2).

EAISI-004 (Da Vinci, score 14/20): A maximal scale score (D2=4, systemic) paired with D1=4 (274+ deaths). The relatively lower total reflects that the system is surgeon-controlled (D3=1) within an existing regulatory framework (D4=1, FDA 510(k) pathway), but reproducibility is systematic (D5=4) because adverse events have continued over two decades.

EAISI-037 (Delivery robot vandalism, score 14/20): The inclusion of a non-fatal incident in the top 5 demonstrates EAISI’s multi-dimensional design. While physical harm is low (D1=1), the complete absence of governance (D4=4), full autonomy (D3=3), and systematic nature of the failure (D5=4) produce a high aggregate score. This reflects the structural vulnerability: robots deployed in uncontrolled public spaces without adversarial threat models.

3.3 Score Distribution by Domain

| Domain | Count | Mean EAISI | Max | Min |
|---|---|---|---|---|
| autonomous_vehicles | 5 | 11.6 | 15 | 9 |
| delivery_robots | 5 | 11.8 | 14 | 10 |
| medical_robotics | 3 | 11.7 | 14 | 9 |
| warehouse_logistics | 3 | 12.3 | 15 | 11 |
| service_robots | 4 | 10.8 | 12 | 10 |
| mining | 3 | 10.0 | 11 | 9 |
| industrial_manufacturing | 3 | 9.3 | 12 | 8 |
| military | 2 | 15.0 | 17 | 13 |
| consumer_robots | 2 | 11.0 | 12 | 10 |
| extreme_environments | 3 | 9.3 | 11 | 7 |
| agentic_infrastructure | 1 | 12.0 | 12 | 12 |
| construction | 1 | 11.0 | 11 | 11 |
| agriculture | 1 | 11.0 | 11 | 11 |
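The aggregation behind this table is straightforward. A sketch in Python, using a handful of illustrative (domain, eaisi_total) pairs rather than the real 38-incident corpus:

```python
from collections import defaultdict

# Illustrative sample rows, not the actual corpus.
incidents = [
    ("military", 17), ("military", 13),
    ("autonomous_vehicles", 15), ("autonomous_vehicles", 9),
]

# Group EAISI totals by domain.
by_domain = defaultdict(list)
for domain, total in incidents:
    by_domain[domain].append(total)

# Emit count, mean, max, min per domain, as in the table above.
for domain, totals in sorted(by_domain.items()):
    print(f"{domain}: n={len(totals)} "
          f"mean={sum(totals) / len(totals):.1f} "
          f"max={max(totals)} min={min(totals)}")
# → autonomous_vehicles: n=2 mean=12.0 max=15 min=9
# → military: n=2 mean=15.0 max=17 min=13
```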

3.4 Dimension Correlations

Notable patterns in the scored corpus:

  • Governance maturity inversely correlates with autonomy: The most autonomous systems tend to operate in the least-governed domains, so the D4 (governance gap) score rises with D3 (autonomy). Security robots, delivery robots, and military drones have D4 scores of 3-4, while industrial robots with D3=1 have D4 scores of 1-2. This is the governance lag in action.
  • D5 (reproducibility) is high across the board: 26 of 38 incidents score D5 >= 3, indicating that most documented failures are not edge cases but systematic patterns.
  • D1 (physical harm) does not dominate total score: The mean D1 across all incidents is 1.9, while the mean D4 is 2.8 and mean D5 is 3.2. Governance failure and reproducibility contribute more to aggregate severity than harm magnitude.

4. Comparison to Existing Frameworks

| Framework | Physical Harm | Scale | Autonomy | Governance | Reproducibility |
|---|---|---|---|---|---|
| CVSS | No | Partial | No | No | Partial |
| NIST CSF | No | Partial | No | Yes | No |
| OSHA SIR | Yes | No | No | No | No |
| OECD AI Monitor | Yes | Partial | No | No | No |
| EAISI | Yes | Yes | Yes | Yes | Yes |

EAISI is the only framework that captures autonomy level and governance response as scoring dimensions. This matters because the same physical harm at different autonomy levels and governance maturity implies different systemic risk.


5. Limitations

  1. Scoring subjectivity. EAISI scores are assigned by a single analyst. Inter-rater reliability has not been measured. Future work should establish IRR with at least two independent scorers.
  2. Survivorship bias. The corpus skews toward incidents that generated media coverage or regulatory action. Low-severity incidents in under-reported domains (agriculture, construction) are likely underrepresented.
  3. Temporal compression. Cumulative incidents (Tesla, Da Vinci, Amazon) are scored as single entries. An alternative approach would score each year independently.
  4. D4 precision. Governance response is difficult to score precisely because frameworks may exist but be poorly enforced. The 0-4 scale compresses significant variation.
  5. Sample size. 38 incidents is sufficient for initial pattern identification but not for statistical analysis of dimension correlations.

6. Recommendations

  1. Publish EAISI as a living dataset. Add new incidents as they occur. The JSONL format supports automated ingestion.
  2. Establish IRR. Have two additional analysts independently score all 38 incidents and compute Cohen’s kappa per dimension.
  3. Integrate with GLI. Cross-reference EAISI D4 scores with Governance Lag Index entries to quantify the relationship between governance maturity and incident severity.
  4. Temporal tracking. Score new incidents monthly and track whether the EAISI distribution shifts as governance frameworks mature.
  5. Domain-specific weighting. Consider whether D1 should be weighted more heavily than D4 for certain stakeholders (insurers vs. regulators).
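The reliability check in Recommendation 2 can be sketched directly. A minimal Cohen's kappa for one dimension, with illustrative ratings rather than real analyst data:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two raters' categorical scores of the same items."""
    n = len(a)
    # Observed agreement: fraction of items both raters scored identically.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each rater's marginal score distribution.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)

# Two hypothetical analysts scoring D1 for eight incidents.
rater1 = [4, 3, 2, 2, 4, 1, 0, 3]
rater2 = [4, 3, 2, 1, 4, 1, 0, 3]
print(round(cohens_kappa(rater1, rater2), 2))  # → 0.84
```

In practice, a weighted kappa may be preferable, since disagreeing by one severity point (3 vs. 4) matters less than disagreeing by three.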

7. Data Artifacts

  • Machine-readable dataset: data/governance/incident_severity_index_v0.1.jsonl (38 entries)
  • Schema: Each entry contains id, incident, date, location, system, domain, description, d1_physical_harm through d5_reproducibility_risk, eaisi_total, sources, gli_ref, blog_ref
  • Scoring range: 0-20 (observed range: 7-17)
  • Mean EAISI: 11.3 (n=38)
  • Median EAISI: 11.0
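A consumer of the dataset can re-verify stored totals on ingestion. A sketch assuming the schema above: d1_physical_harm, d5_reproducibility_risk, and eaisi_total are the published field names, while the three intermediate dimension keys are assumptions.

```python
import json

# Dimension field names; d2-d4 keys are assumed, not confirmed by the schema.
DIMS = ["d1_physical_harm", "d2_scale", "d3_autonomy",
        "d4_governance", "d5_reproducibility_risk"]

def validate(path: str) -> int:
    """Check every entry's eaisi_total against the sum of its dimensions.

    Returns the number of entries validated; raises on any mismatch.
    """
    checked = 0
    with open(path) as f:
        for line in f:
            entry = json.loads(line)
            total = sum(entry[d] for d in DIMS)
            if total != entry["eaisi_total"]:
                raise ValueError(f"{entry['id']}: stored "
                                 f"{entry['eaisi_total']}, recomputed {total}")
            checked += 1
    return checked

# Usage, with the path published in this section:
# validate("data/governance/incident_severity_index_v0.1.jsonl")
```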
