Report 158 Research — Empirical Study

Executive Summary

No standardized severity scoring system exists for embodied AI incidents. The CVSS (Common Vulnerability Scoring System) addresses software vulnerabilities but not physical harm. The NIST Cybersecurity Framework scores organizational risk but not autonomy-specific failure modes. OSHA tracks injuries but not the algorithmic causes. The OECD AI Incident Monitor collects reports but does not rank them.

This report introduces the Embodied AI Incident Severity Index (EAISI), a five-dimension scoring system designed specifically for cyber-physical AI incidents. EAISI scores 38 documented incidents from the Failure-First corpus, identifies the five highest-severity incidents, and provides a machine-readable dataset for ongoing tracking.

Key finding: The five highest-EAISI incidents are not all fatalities. The Kargu-2 autonomous drone engagement (EAISI 17/20) scores highest because it combines lethality, full autonomy, zero governance, and systematic reproducibility. Tesla Autopilot/FSD cumulative fatalities (EAISI 15) and Amazon warehouse robot-paced injuries (EAISI 15) score equally high due to systemic scale despite lower autonomy levels.


1. Motivation

Existing incident databases (AIID, OECD AI Monitor, FDA MAUDE, NHTSA SGO) collect incident reports but do not provide comparable severity scores. This creates three problems:

  1. Prioritization failure. A Knightscope robot drowning in a fountain and a pedestrian fatality appear in the same database with no severity differentiation.
  2. Governance gap measurement. Without severity scoring, it is not possible to demonstrate that the most severe incidents occur in the least-governed domains.
  3. Trend analysis. Without comparable scores, it is not possible to track whether the severity profile of embodied AI incidents is changing over time.

EAISI addresses these by providing a structured, reproducible severity metric.


2. Methodology: EAISI Dimensions

EAISI scores each incident on five dimensions, each rated 0-4, for a maximum score of 20.

D1: Physical Harm

| Score | Definition | Examples |
|---|---|---|
| 0 | No physical harm to humans or property | Data breach, software-only failure |
| 1 | Property damage only | Robot self-destruction, equipment damage |
| 2 | Minor injury | Bruises, lacerations, temporary impairment |
| 3 | Serious injury | Fractures, hospitalization, permanent impairment |
| 4 | Fatality | One or more deaths |

D2: Scale

| Score | Definition | Examples |
|---|---|---|
| 0 | Single event affecting one person/system | One collision, one injury |
| 1 | Few affected (2-10) | Small cluster of injuries |
| 2 | Dozens affected | Facility-wide impact, community disruption |
| 3 | Hundreds affected | Multiple facilities, regional impact |
| 4 | Systemic / thousands+ | Industry-wide pattern, national-scale |

D3: Autonomy Level

| Score | Definition | Examples |
|---|---|---|
| 0 | Remote-controlled (human in direct control) | Teleoperated robot |
| 1 | Supervised automation (human approves each action) | Surgical robot with surgeon control |
| 2 | Semi-autonomous (human oversight, system executes) | L2 driver assistance, warehouse pacing |
| 3 | Autonomous (system operates independently, human can intervene) | Robotaxi, delivery robot, security patrol |
| 4 | Fully autonomous + lethal capability | Autonomous weapon system, runaway train |

D4: Governance Response

| Score | Definition | Examples |
|---|---|---|
| 0 | Framework exists and is actively enforced | Mature regulatory regime with inspectors |
| 1 | Framework exists but enforcement is partial | FDA MAUDE reporting, ISO standards |
| 2 | Partial framework (some rules, gaps remain) | NHTSA SGO, WorkSafe WA mining |
| 3 | Reactive only (governance responds after incident) | Post-incident investigation, no proactive rules |
| 4 | No applicable governance framework | No standards, no reporting requirements |

D5: Reproducibility Risk

| Score | Definition | Examples |
|---|---|---|
| 0 | Unique circumstances (extremely unlikely to recur) | One-off environmental anomaly |
| 1 | Rare (requires unusual conditions) | Specific sensor + weather + timing |
| 2 | Possible (conditions exist but uncommon) | Known edge case, partially mitigated |
| 3 | Likely (conditions are common in deployment) | Routine operational scenario |
| 4 | Systematic (inherent to the technology/deployment model) | Architectural vulnerability, design pattern |
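Taken together, the five dimensions reduce to a simple additive score. A minimal sketch in Python; the class and attribute names here are illustrative, not part of the published schema:

```python
from dataclasses import dataclass

@dataclass
class EAISIScore:
    """One EAISI-scored incident. Each dimension is an integer 0-4."""
    d1_physical_harm: int
    d2_scale: int
    d3_autonomy: int
    d4_governance: int
    d5_reproducibility: int

    def __post_init__(self):
        # Reject out-of-range dimension values up front.
        for name, value in vars(self).items():
            if not 0 <= value <= 4:
                raise ValueError(f"{name} must be in 0-4, got {value}")

    @property
    def total(self) -> int:
        """Unweighted sum of the five dimensions (0-20)."""
        return sum(vars(self).values())

# Kargu-2 incident (EAISI-032) as scored in Section 3.1.
kargu = EAISIScore(4, 1, 4, 4, 4)
print(kargu.total)  # → 17
```

The unweighted sum is a design choice: it treats a governance gap as severity-relevant even when physical harm is low, which is exactly what puts EAISI-037 in the top five.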

3. Scored Incidents

Thirty-eight incidents were scored from the Failure-First incident corpus, blog posts, and the GLI dataset. Full machine-readable scores are provided in data/governance/incident_severity_index_v0.1.jsonl.

3.1 Top 5 Highest-EAISI Incidents

| Rank | EAISI | ID | Incident | D1 | D2 | D3 | D4 | D5 |
|---|---|---|---|---|---|---|---|---|
| 1 | 17 | EAISI-032 | Kargu-2 autonomous drone lethal engagement (Libya, 2020) | 4 | 1 | 4 | 4 | 4 |
| 2 | 15 | EAISI-003 | Tesla Autopilot/FSD cumulative fatalities (2016-2025) | 4 | 3 | 2 | 2 | 4 |
| 3 | 15 | EAISI-008 | Amazon warehouse robot-paced work injuries (2016-2025) | 3 | 4 | 2 | 2 | 4 |
| 4 | 14 | EAISI-004 | Da Vinci surgical robot adverse events (2000-2025) | 4 | 4 | 1 | 1 | 4 |
| 5 | 14 | EAISI-037 | Delivery robot vandalism/theft pattern (2019-2025) | 1 | 2 | 3 | 4 | 4 |

3.2 Analysis of Top 5

EAISI-032 (Kargu-2, score 17/20): The highest-scoring incident is the only one in our corpus to score 4 on three dimensions simultaneously (D3 autonomy, D4 governance, D5 reproducibility). The UN Panel of Experts documented what may be the first autonomous lethal engagement without human authorization. No binding international framework governs lethal autonomous weapons, and the technology is being actively proliferated.

EAISI-003 (Tesla, score 15/20): Scores high on scale (65+ deaths across years) and reproducibility (systematic overreliance on L2 marketed as autonomy). The governance response (D4=2) reflects partial NHTSA oversight that has not prevented continued fatalities. The relatively lower autonomy score (D3=2) reflects that these are L2 systems requiring driver engagement, yet the systemic nature of the failures (D5=4) compensates.

EAISI-008 (Amazon, score 15/20): A different severity profile: not fatalities but mass-scale injury. D2=4 (systemic, thousands of workers affected across many facilities) and D5=4 (inherent to the robot-paced work model) drive the score. OSHA enforcement exists but penalties are considered insufficient relative to the scale of harm (D4=2).

EAISI-004 (Da Vinci, score 14/20): A maximal scale score (D2=4, systemic) paired with D1=4 (274+ deaths). The relatively lower total reflects that the system is surgeon-controlled (D3=1) within an existing regulatory framework (D4=1, FDA 510(k) pathway), but reproducibility is systematic (D5=4) because adverse events have continued over two decades.

EAISI-037 (Delivery robot vandalism, score 14/20): The inclusion of a non-fatal incident in the top 5 demonstrates EAISI’s multi-dimensional design. While physical harm is low (D1=1), the complete absence of governance (D4=4), full autonomy (D3=3), and systematic nature of the failure (D5=4) produce a high aggregate score. This reflects the structural vulnerability: robots deployed in uncontrolled public spaces without adversarial threat models.

3.3 Score Distribution by Domain

| Domain | Count | Mean EAISI | Max | Min |
|---|---|---|---|---|
| autonomous_vehicles | 5 | 11.6 | 15 | 9 |
| delivery_robots | 5 | 11.8 | 14 | 10 |
| medical_robotics | 3 | 11.7 | 14 | 9 |
| warehouse_logistics | 3 | 12.3 | 15 | 11 |
| service_robots | 4 | 10.8 | 12 | 10 |
| mining | 3 | 10.0 | 11 | 9 |
| industrial_manufacturing | 3 | 9.3 | 12 | 8 |
| military | 2 | 15.0 | 17 | 13 |
| consumer_robots | 2 | 11.0 | 12 | 10 |
| extreme_environments | 3 | 9.3 | 11 | 7 |
| agentic_infrastructure | 1 | 12.0 | 12 | 12 |
| construction | 1 | 11.0 | 11 | 11 |
| agriculture | 1 | 11.0 | 11 | 11 |
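The aggregation behind this table is straightforward. A sketch in Python, using a handful of illustrative (domain, eaisi_total) pairs rather than the real 38-incident corpus:

```python
from collections import defaultdict

# Illustrative sample rows, not the actual corpus.
incidents = [
    ("military", 17), ("military", 13),
    ("autonomous_vehicles", 15), ("autonomous_vehicles", 9),
]

# Group EAISI totals by domain.
by_domain = defaultdict(list)
for domain, total in incidents:
    by_domain[domain].append(total)

# Emit count, mean, max, min per domain, as in the table above.
for domain, totals in sorted(by_domain.items()):
    print(f"{domain}: n={len(totals)} "
          f"mean={sum(totals) / len(totals):.1f} "
          f"max={max(totals)} min={min(totals)}")
# → autonomous_vehicles: n=2 mean=12.0 max=15 min=9
# → military: n=2 mean=15.0 max=17 min=13
```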

3.4 Dimension Correlations

Notable patterns in the scored corpus:

  • Governance maturity inversely correlates with autonomy: The most autonomous systems tend to operate in the least-governed domains, so the D4 (governance gap) score rises with D3 (autonomy). Security robots, delivery robots, and military drones have D4 scores of 3-4, while industrial robots with D3=1 have D4 scores of 1-2. This is the governance lag in action.
  • D5 (reproducibility) is high across the board: 26 of 38 incidents score D5 >= 3, indicating that most documented failures are not edge cases but systematic patterns.
  • D1 (physical harm) does not dominate total score: The mean D1 across all incidents is 1.9, while the mean D4 is 2.8 and mean D5 is 3.2. Governance failure and reproducibility contribute more to aggregate severity than harm magnitude.

4. Comparison to Existing Frameworks

| Framework | Physical Harm | Scale | Autonomy | Governance | Reproducibility |
|---|---|---|---|---|---|
| CVSS | No | Partial | No | No | Partial |
| NIST CSF | No | Partial | No | Yes | No |
| OSHA SIR | Yes | No | No | No | No |
| OECD AI Monitor | Yes | Partial | No | No | No |
| EAISI | Yes | Yes | Yes | Yes | Yes |

EAISI is the only framework that captures autonomy level and governance response as scoring dimensions. This matters because the same physical harm at different autonomy levels and governance maturity implies different systemic risk.


5. Limitations

  1. Scoring subjectivity. EAISI scores are assigned by a single analyst. Inter-rater reliability has not been measured. Future work should establish IRR with at least two independent scorers.
  2. Survivorship bias. The corpus skews toward incidents that generated media coverage or regulatory action. Low-severity incidents in under-reported domains (agriculture, construction) are likely underrepresented.
  3. Temporal compression. Cumulative incidents (Tesla, Da Vinci, Amazon) are scored as single entries. An alternative approach would score each year independently.
  4. D4 precision. Governance response is difficult to score precisely because frameworks may exist but be poorly enforced. The 0-4 scale compresses significant variation.
  5. Sample size. 38 incidents is sufficient for initial pattern identification but not for statistical analysis of dimension correlations.

6. Recommendations

  1. Publish EAISI as a living dataset. Add new incidents as they occur. The JSONL format supports automated ingestion.
  2. Establish IRR. Have two additional analysts independently score all 38 incidents and compute Cohen’s kappa per dimension.
  3. Integrate with GLI. Cross-reference EAISI D4 scores with Governance Lag Index entries to quantify the relationship between governance maturity and incident severity.
  4. Temporal tracking. Score new incidents monthly and track whether the EAISI distribution shifts as governance frameworks mature.
  5. Domain-specific weighting. Consider whether D1 should be weighted more heavily than D4 for certain stakeholders (insurers vs. regulators).
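The reliability check in Recommendation 2 can be sketched directly. A minimal Cohen's kappa for one dimension, with illustrative ratings rather than real analyst data:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two raters' categorical scores of the same items."""
    n = len(a)
    # Observed agreement: fraction of items both raters scored identically.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each rater's marginal score distribution.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)

# Two hypothetical analysts scoring D1 for eight incidents.
rater1 = [4, 3, 2, 2, 4, 1, 0, 3]
rater2 = [4, 3, 2, 1, 4, 1, 0, 3]
print(round(cohens_kappa(rater1, rater2), 2))  # → 0.84
```

In practice, a weighted kappa may be preferable, since disagreeing by one severity point (3 vs. 4) matters less than disagreeing by three.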

7. Data Artifacts

  • Machine-readable dataset: data/governance/incident_severity_index_v0.1.jsonl (38 entries)
  • Schema: Each entry contains id, incident, date, location, system, domain, description, d1_physical_harm through d5_reproducibility_risk, eaisi_total, sources, gli_ref, blog_ref
  • Scoring range: 0-20 (observed range: 7-17)
  • Mean EAISI: 11.3 (n=38)
  • Median EAISI: 11.0
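A consumer of the dataset can re-verify stored totals on ingestion. A sketch assuming the schema above: d1_physical_harm, d5_reproducibility_risk, and eaisi_total are the published field names, while the three intermediate dimension keys are assumptions.

```python
import json

# Dimension field names; d2-d4 keys are assumed, not confirmed by the schema.
DIMS = ["d1_physical_harm", "d2_scale", "d3_autonomy",
        "d4_governance", "d5_reproducibility_risk"]

def validate(path: str) -> int:
    """Check every entry's eaisi_total against the sum of its dimensions.

    Returns the number of entries validated; raises on any mismatch.
    """
    checked = 0
    with open(path) as f:
        for line in f:
            entry = json.loads(line)
            total = sum(entry[d] for d in DIMS)
            if total != entry["eaisi_total"]:
                raise ValueError(f"{entry['id']}: stored "
                                 f"{entry['eaisi_total']}, recomputed {total}")
            checked += 1
    return checked

# Usage, with the path published in this section:
# validate("data/governance/incident_severity_index_v0.1.jsonl")
```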
