VULCAN: Vision-Language-Model Enhanced Multi-Agent Cooperative Navigation for Indoor Fire-Disaster Response
Evaluates multi-agent cooperative navigation systems under realistic fire-disaster conditions using VLM-enhanced perception, identifying critical failure modes in smoke, thermal hazards, and sensor degradation.
1. Introduction: The High Stakes of Fire-Disaster Response
Indoor fire disasters represent one of the most hostile frontiers for autonomous systems. In these high-stakes environments, rapid environmental evolution, extreme thermal gradients, and dense smoke create a “perceptual blackout” that mirrors the disorientation experienced by human first responders. A sobering motivator for this research is the fire in Tai Po, Hong Kong—the deadliest in decades—where the total collapse of situational awareness during the early response phase led to catastrophic casualties.
While multi-agent cooperative navigation offers a theoretical path toward faster, safer search and rescue (SAR), current vision-based systems are fundamentally built for “benign” settings. When these models encounter adversarial fire zones, their reliance on standard visual cues leads to systemic failure. We introduce VULCAN (named for the Roman deity of fire), a framework designed to grant multi-agent teams a multi-modal “sixth sense.” By integrating Vision-Language Models (VLMs) with smoke-robust sensing, VULCAN pivots from simple pathfinding to hazard-aware survival and exploration.
2. Why AI Fails When Things Get Hot: The Perceptual Breakdown
Our research identifies three fundamental challenges that trigger the collapse of standard robotic navigation in fire scenes: the rapid evolution of temperature and smoke, severe sensor degradation, and a total lack of hazard-aware decision logic. When standard vision-based agents are pushed into smoke-filled corridors, they exhibit three critical failure modes:
- Perception Failure: Smoke causes a catastrophic drop in semantic grounding. For instance, an agent's confidence in detecting a target such as a chair can collapse under smoke occlusion, leading to missed detections. Conversely, visual noise triggers "hallucinations," where agents falsely identify objects like toilets in empty, smoke-obscured hallways.
- Inefficient Exploration: High sensing uncertainty destabilizes the agent’s spatial abstraction. Instead of expanding the map frontier, agents exhibit redundant framing and erratic trajectories, failing to push into unexplored territory due to unreliable geometric cues.
- Unsafe Planning: Standard planners prioritize the shortest path based on binary occupancy. Without a “thermal prior,” these agents frequently plan trajectories directly through high-risk zones, ignoring lethal thermal hazards that would destroy the robot and jeopardize the mission.
3. The VULCAN Framework: A Multi-Modal “Sixth Sense”
The VULCAN architecture represents a shift toward a hierarchical, failure-aware design. The perception pipeline is engineered to extract stable structural cues and detect hazards even when RGB-D data is compromised.
| Sensory Modality | Smoke-Robust Contribution |
|---|---|
| RGB | Provides semantic context; aligned pixel-wise with the thermal channel for hazard estimation. |
| Depth | Captures 3D geometry; monitored for frame-to-frame consistency to infer smoke density. |
| Thermal | Essential for fire mapping; detects high-temperature zones and victim heat signatures. |
| mmWave Radar | The ultimate fallback; penetrates dense smoke to provide reliable distance/occupancy data. |
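The fallback behavior implied by the table can be sketched as a simple gating rule: depth inconsistency is read as a smoke-density proxy, and range sensing switches to mmWave radar once depth becomes unreliable. The thresholds and function names below are illustrative assumptions, not the paper's actual consistency metric:

```python
def estimate_smoke_density(depth_var: float,
                           var_clear: float = 0.01,
                           var_max: float = 0.5) -> float:
    """Map frame-to-frame depth inconsistency to a [0, 1] smoke-density proxy.
    Thresholds are hypothetical: var_clear = clean-air baseline, var_max = whiteout."""
    return min(max((depth_var - var_clear) / (var_max - var_clear), 0.0), 1.0)


def select_range_source(smoke_density: float, threshold: float = 0.6) -> str:
    """Fall back to mmWave radar once smoke renders depth geometry unreliable."""
    return "mmwave" if smoke_density >= threshold else "depth"
```

In clean air the gate keeps the higher-resolution depth sensor; in dense smoke it hands range estimation to radar, matching the "ultimate fallback" role described above.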
At the heart of VULCAN is a VLM-based fusion operator. This operator processes aligned RGB, depth, and thermal observations to approximate a "smoke-transparent" view. This fused representation is passed to an Open-Vocabulary Detector (Det) and a Class-Agnostic Segmentation (Seg) model, allowing the system to identify objects and hazards without being restricted to a fixed label set.
The resulting local hazard-aware 3D point cloud is augmented with a hazard attribute vector that encodes temperature, smoke density, and sensing uncertainty. These local maps are merged into a global representation, which is projected into a 2D grid whose per-cell hazard intensity is a weighted combination of temperature, smoke density, and perceptual uncertainty. This allows the system to balance the contributions of all three factors during planning.
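The weighted balance on the 2D grid can be sketched as follows. The weights, the temperature normalization range, and the function name are illustrative assumptions, not the paper's exact formula:

```python
import numpy as np


def hazard_intensity(temp_c: np.ndarray,
                     smoke: np.ndarray,
                     uncertainty: np.ndarray,
                     w=(0.5, 0.3, 0.2)) -> np.ndarray:
    """Per-cell hazard as a weighted combination of normalized temperature,
    smoke density, and perceptual uncertainty. Weights w and the 20-600 degC
    normalization range are assumed for illustration."""
    t = np.clip((temp_c - 20.0) / (600.0 - 20.0), 0.0, 1.0)
    return w[0] * t + w[1] * smoke + w[2] * uncertainty
```

Because every term is normalized to [0, 1] and the weights sum to 1, the resulting hazard intensity stays in [0, 1], which keeps the downstream planner's risk penalty well-scaled.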
4. Global and Local Planning: The Brain and the Feet
VULCAN decouples high-level semantic reasoning from low-level motion control to maintain responsiveness.
The VLM Global Planner
The "brain" uses a Vision-Language Model to reason over multi-modal prompts. The VLM receives a top-down visual map and a textual Hazard Report in JSON format for each candidate frontier. This report includes critical safety metrics:
{
"frontier": 1,
"smoke": 0.1,
"temperature": 300,
"severity": "safe",
"confidence": 0.9
}
By analyzing “frontier utility” against risk, the VLM assigns goals that maximize exploration while favoring lower-risk areas with high-confidence perception.
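Given a parsed Hazard Report like the JSON above, the utility-versus-risk trade-off might be scored as below. The scoring function, the survivability cutoff, and the weights are hypothetical stand-ins for the VLM's reasoning, not the paper's implementation:

```python
def score_frontier(report: dict, info_gain: float, risk_weight: float = 2.0) -> float:
    """Trade exploration utility against hazard: reward high-confidence,
    low-smoke frontiers and veto zones hotter than a survivable limit.
    The 400 degC cutoff and linear risk model are illustrative assumptions."""
    if report["temperature"] > 400:
        return float("-inf")  # hard veto: robot would not survive
    risk = report["smoke"] + report["temperature"] / 400.0
    return info_gain * report["confidence"] - risk_weight * risk
```

Ranking candidate frontiers by this score reproduces the behavior described above: exploration value is discounted by perceptual uncertainty and penalized by thermal and smoke risk.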
Hazard-Aware FMM Local Planner
To execute these goals, we utilize a hazard-aware Fast Marching Method (FMM), modulating the wavefront propagation speed as a decreasing function of the local hazard intensity. A tunable parameter controls the safety-efficiency trade-off. This modulation ensures that paths are biased toward low-risk corridors, even if they are physically longer, effectively penalizing traversal through hazardous regions.
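The effect of the speed modulation can be illustrated with a discrete stand-in: Dijkstra on the hazard grid with step cost exp(lam * H), which corresponds to an FMM propagation speed of exp(-lam * H). The exponential form and the value of lam are assumptions for illustration, not the paper's exact modulation:

```python
import heapq
import math


def hazard_aware_path_cost(hazard, start, goal, lam: float = 3.0) -> float:
    """Dijkstra over a 2D hazard grid with step cost exp(lam * hazard),
    a discrete stand-in for FMM with speed F = exp(-lam * H).
    Larger lam biases paths more strongly toward low-risk cells."""
    rows, cols = len(hazard), len(hazard[0])
    dist = {start: 0.0}
    pq = [(0.0, start)]
    while pq:
        d, (r, c) = heapq.heappop(pq)
        if (r, c) == goal:
            return d
        if d > dist.get((r, c), float("inf")):
            continue  # stale queue entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                nd = d + math.exp(lam * hazard[nr][nc])
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    heapq.heappush(pq, (nd, (nr, nc)))
    return float("inf")
```

With lam = 3, a cell at full hazard costs about 20x a clear cell, so the planner detours around hot zones exactly as the FMM speed modulation described above intends.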
5. Benchmarking Resilience: VULCAN vs. The Baselines
We extended the Habitat-Matterport3D benchmark using a Gazebo-based physics engine with particle-emitter support to simulate realistic smoke and thermal effects. We compared VULCAN’s representative VLM planner (Co-NavGPT) against traditional Greedy, Cost-Utility, and Random Sample baselines.
Table 1: Normal Conditions
| Method | Steps (NS) | Success (SR) | SPL | Hazard (CHE) |
|---|---|---|---|---|
| Greedy | 219.03 | 0.686 | 0.322 | 0 |
| Cost-Utility | 199.60 | 0.628 | 0.315 | 0 |
| Random Sample | 206.62 | 0.631 | 0.258 | 0 |
| Co-NavGPT | 185.43 | 0.666 | 0.388 | 0 |
Table 2: Fire Conditions
| Method | Steps (NS) | Success (SR) | SPL | Hazard (CHE) |
|---|---|---|---|---|
| Greedy | 267.40 | 0.651 | 0.319 | 11.233 |
| Cost-Utility | 207.49 | 0.608 | 0.306 | 8.517 |
| Random Sample | 214.96 | 0.625 | 0.251 | 7.174 |
| Co-NavGPT | 187.89 | 0.660 | 0.381 | 4.873 |
Analysis of the "Success Rate Paradox" In Table 1, the Greedy baseline actually achieves a slightly higher Success Rate (0.686) than Co-NavGPT (0.666). However, a closer look tells a different story: Co-NavGPT is far more efficient, with significantly fewer steps and a superior SPL (0.388 vs 0.322). Greedy navigation often stumbles into success through exhaustive, inefficient movement, whereas VLM-based planning uses semantic priors to reach goals via optimized paths.
In fire conditions, the safety gap is undeniable. While Greedy’s Cumulative Hazard Exposure (CHE) spikes to 11.233, Co-NavGPT maintains a CHE of only 4.873. By nearly halving the risk exposure, VULCAN proves that hazard-aware planning is a prerequisite for safety-critical embodied AI.
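For readers unfamiliar with the SPL column in the tables above, a minimal sketch assuming the standard success-weighted-by-path-length definition (success times the ratio of shortest-path length to the greater of taken and shortest path length, averaged over episodes):

```python
def spl(episodes) -> float:
    """Success weighted by Path Length, averaged over episodes.
    Each episode is (success in {0, 1}, shortest_path_len, taken_path_len)."""
    total = 0.0
    for success, shortest, taken in episodes:
        total += success * shortest / max(taken, shortest)
    return total / len(episodes)
```

This is why SPL separates Co-NavGPT from Greedy even when raw success rates are close: a success reached via a wandering, twice-as-long trajectory contributes only half credit.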
6. Future Horizons and Red-Teaming for Safety
For AI safety researchers, VULCAN serves as a “red-teaming” tool for navigation, exposing how covert failure modes emerge when vision-only systems are stressed.
- Beyond RGB-D: Mission-critical SAR requires multi-modal fusion. RGB-D is a liability in smoke; thermal and mmWave radar are non-negotiable for robust perception.
- Semantic Reasoning in Chaos: VLMs excel as high-level reasoners in noisy environments where traditional cost-benefit algorithms become brittle and uncoordinated.
- The Power of High-Fidelity Simulation: Utilizing physics-based particle emitters in Gazebo is essential. We cannot trust systems that have only been tested in “clean” simulations that ignore the fluid dynamics of smoke.
7. Conclusion: From Simulation to the Front Lines
The VULCAN framework demonstrates that by combining robust perception with semantic spatial abstraction, we can build agents capable of operating where human life cannot be risked. While our transition from Gazebo-based physics to real-world deployment is ongoing, the empirical results confirm that hazard-aware planning is the only viable path forward.
As we deploy these systems in increasingly volatile environments, we must ask: Can our agents remain “hazard-aware” as disasters evolve unpredictably? Our ability to answer this will determine if autonomous SAR becomes a reality or remains obscured by the smoke.
Read the full paper on arXiv · PDF