Daily Paper

Robust Multispectral Semantic Segmentation under Missing or Full Modalities via Structured Latent Projection

Introduces CBC-SLP, a structured latent projection that keeps multispectral remote-sensing segmentation accurate when input modalities drop out from sensor failure — without trading away performance when all modalities are present.

arXiv:2604.15856 Empirical Study

Irem Ulku, Erdem Akagündüz, Ömer Özgür Tanrıöver

multispectral-segmentation · missing-modality · remote-sensing · sensor-failure · computer-vision

1. Introduction: The Fragility of Multimodal Vision

In the operational theater of Earth observation and autonomous sensing, the assumption of “perfect data” is a dangerous fallacy. Real-world deployments are perpetually compromised by hardware degradation and environmental volatility. Historically, systems like the Landsat-7 satellite suffered from Scan Line Corrector (SLC) failure, leading to systematic data gaps, while the MODIS sensor frequently encounters total information loss in specific spectral bands due to dense cloud cover.

Traditional multimodal AI models, which fuse heterogeneous sources such as RGB, Synthetic Aperture Radar (SAR), and Digital Elevation Models (DEM), typically struggle with a critical trade-off. By relying on “shared representation” learning, where multiple modalities are forced into a single latent space, these models often achieve robustness at the cost of peak performance. Specifically, the pressure to align features forces highly informative sensors to down-sample their richness to match the lowest common denominator. This paper introduces CBC-SLP (Cross-Band Correlation with Structured Latent Projection), an architecture designed to break this bottleneck and achieve “failure-resilient” embodied AI.

2. The Modality Alignment Trap

A foundational objective in multimodal research has been the pursuit of “perfect alignment” between sensors in latent space. However, as demonstrated by Theorem 1, this pursuit is fundamentally flawed for downstream prediction tasks.

Enforcing alignment, where the encoded features satisfy $Z_1 = Z_2 = \dots = Z_M$, acts as an information-theoretic bottleneck. The cost of this alignment is quantified by the information gap $\Delta_p$, defined as the difference between the most and least informative sensors with respect to the target $Y$:

$$\Delta_p = \max_{i \in \{1,\dots,M\}} I(X_i;Y) - \min_{i \in \{1,\dots,M\}} I(X_i;Y)$$

where $I(X;Y)$ denotes mutual information. By forcing identical representations, the model discards modality-specific nuances: the aligned representation can carry no more information about $Y$ than the least informative sensor offers, so the “infimum” (the lowest achievable prediction error) rises accordingly.

  • Modality-Invariant Information: Shared features (e.g., the structural outline of a canopy visible in both SAR and Optical data). While useful for consistency, these features lack the precision of raw sensor data.
  • Modality-Specific Information: Unique cues (e.g., SWIR’s sensitivity to moisture content or a DSM’s geometric context).

CBC-SLP avoids this trap by ensuring that $H(Y \mid Z_1, \dots, Z_M)$ stays close to $H(Y \mid X_1, \dots, X_M)$, preserving the semantic richness that alignment-based models typically filter out as noise.
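To make the information gap concrete, here is a minimal sketch that computes $\Delta_p$ for a toy two-sensor problem with discrete variables. The joint probability tables and the function names `mutual_information` and `information_gap` are illustrative assumptions, not part of the paper; real sensors would need an estimator rather than exact tables.

```python
import numpy as np

def mutual_information(joint):
    """Return I(X;Y) in bits from a joint probability table p(x, y)."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)   # marginal p(x), shape (|X|, 1)
    py = joint.sum(axis=0, keepdims=True)   # marginal p(y), shape (1, |Y|)
    nz = joint > 0                          # skip zero cells (0 * log 0 = 0)
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (px @ py)[nz])))

def information_gap(joints):
    """Delta_p = max_i I(X_i;Y) - min_i I(X_i;Y) over a list of joint tables."""
    mis = [mutual_information(j) for j in joints]
    return max(mis) - min(mis)

# Toy setup (hypothetical): Y is a fair coin.
# Sensor 1 reveals Y exactly; sensor 2 is independent of Y.
joint_x1_y = [[0.5, 0.0], [0.0, 0.5]]
joint_x2_y = [[0.25, 0.25], [0.25, 0.25]]

gap = information_gap([joint_x1_y, joint_x2_y])
print(f"information gap: {gap:.3f} bits")
```

In this toy case the gap is one full bit: aligning the two sensors perfectly would force the representation down to what the uninformative sensor can express.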

3. Architecture of Resilience: The CBC-SLP Approach

The CBC-SLP architecture implements Structured Latent Projection as an architectural inductive bias rather than a regularization term. The model manages shared and private information through a specific structural flow:

  1. Inflation-Based Extraction: Heterogeneous modalities are processed by dedicated encoders.
  2. Structured Decomposition: The latent multimodal representation is projected into a modality-invariant Shared Latent Space ($z_{sh}$) and multiple Private Latent Spaces ($z_{pr,m}$).
  3. Multiplicative Gating: A Binary Availability Mask ($s \in \{0, 1\}^{B \times M}$) acts as the routing mechanism. If a sensor fails ($s_m = 0$), its private latent contribution is nullified.
  4. Decoder Composition: The decoder receives a concatenated representation $z_6 \in \mathbb{R}^{B \times ((M+1)C_z) \times D_t \times H_t \times W_t}$, ensuring a consistent input structure regardless of sensor status.
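The gating and composition steps above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the authors' implementation: the function name `compose_decoder_input`, the random latents, and the tiny tensor sizes are all hypothetical, and the encoders and projections that would produce the latents are omitted.

```python
import numpy as np

def compose_decoder_input(z_sh, z_pr, s):
    """Gate private latents by availability and concatenate with the shared latent.

    z_sh: shared latent, shape (B, Cz, D, H, W)
    z_pr: list of M private latents, each of shape (B, Cz, D, H, W)
    s:    binary availability mask, shape (B, M)
    Returns an array of shape (B, (M + 1) * Cz, D, H, W).
    """
    batch = z_sh.shape[0]
    parts = [z_sh]
    for m, z in enumerate(z_pr):
        # Broadcast the per-sample gate over channel and spatial axes.
        gate = s[:, m].reshape(batch, 1, 1, 1, 1)
        parts.append(gate * z)  # a missing sensor contributes zeros
    return np.concatenate(parts, axis=1)

rng = np.random.default_rng(0)
B, M, Cz, D, H, W = 2, 3, 4, 1, 5, 5
z_sh = rng.normal(size=(B, Cz, D, H, W))
z_pr = [rng.normal(size=(B, Cz, D, H, W)) for _ in range(M)]
# Sample 0 has all sensors; sample 1 has lost the second sensor.
s = np.array([[1, 1, 1],
              [1, 0, 1]])

z6 = compose_decoder_input(z_sh, z_pr, s)
print(z6.shape)
```

Because a failed sensor is zeroed rather than dropped, the channel count stays at $(M+1)C_z$ for every availability pattern, so the decoder never needs a different input layer.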

Architectural Inductive Bias vs. Loss-Based Design

| Feature | Traditional Loss-Based Models | CBC-SLP (Architectural Bias) |
| --- | --- | --- |
| Alignment Method | Additional loss terms (BCE/Regularization) | Direct architectural routing/splitting |
| Information Retention | Compresses features into a shared space | Isolates shared and private components |
| Gradient Interference | High interference between availability states | Decoupled gradients via projection |
| Availability Gating | Implicit (often fails during inference) | Explicit multiplicative gating ($s$) |

By utilizing the Binary Availability Mask ($s$), the architecture prevents “gradient pollution” during training. In a unified model trained with random modality dropout, the weights associated with private spaces are only updated when the corresponding sensor is active, protecting the shared latent space from noise injected by missing inputs.
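Why the gate blocks updates can be seen from the chain rule. The sketch below assumes a simplified private branch, a single linear projection $z = s \cdot (W x)$, and computes the weight gradient by hand; the function name `private_weight_grad` and the toy numbers are hypothetical, and optimizer effects such as momentum or weight decay are ignored.

```python
import numpy as np

def private_weight_grad(x, upstream, s):
    """Gradient of a scalar loss w.r.t. W for the gated projection z_b = s_b * (W @ x_b).

    x:        inputs to the private branch, shape (B, Cin)
    upstream: dLoss/dz for each sample, shape (B, Cout)
    s:        availability of this modality per sample, shape (B,)
    Returns dLoss/dW with shape (Cout, Cin).
    """
    # Chain rule: dLoss/dW = sum_b s_b * outer(upstream_b, x_b).
    gated_upstream = s[:, None] * upstream
    return gated_upstream.T @ x

x = np.array([[1.0, 2.0],
              [3.0, 4.0]])
upstream = np.array([[0.5, -1.0, 2.0],
                     [1.0, 1.0, 1.0]])

# Sensor present for sample 0 only: sample 1 contributes nothing.
grad_mixed = private_weight_grad(x, upstream, np.array([1.0, 0.0]))
# Sensor missing for the whole batch: the private weights receive no gradient.
grad_missing = private_weight_grad(x, upstream, np.array([0.0, 0.0]))
print(grad_mixed)
print(grad_missing)
```

The same multiplication that zeroes the forward contribution also zeroes the backward signal, which is why the gating needs no extra loss term.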

4. Decoding the Method: From Encoders to Latent Routing

The functional pipeline of CBC-SLP is designed to make full use of every available photon or radar return:

  • Intra-modal Encoders: The system utilizes ResNet-50 backbones inflated into 3D convolutions. This allows the model to capture volumetric features across homogeneous (spectral) and heterogeneous (structural/topographic) data.
  • Inter-modal Correlation: Pixel-wise attention weights ($\alpha$) facilitate cross-modal communication. One sensor guides another: for instance, SAR structural cues can help interpret land cover in cloud-obscured Optical regions.
  • Latent Space Routing: The final representation $z_6$ concatenates the shared latent $z_{sh}$ (derived from the inter-modal fused latent $X_{inter}^6$) with all routed private components $z_{pr,m}$. This ensures the decoder always has access to the most semantically dense representation available.
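Two of the building blocks above can be illustrated with NumPy. Both functions are sketches under assumptions the summary does not confirm: `inflate_conv2d_kernel` uses the common I3D-style recipe (repeat the 2D kernel along a new depth axis and rescale by the depth), and `pixelwise_attention_fusion` takes $\alpha$ to be a softmax over modalities at each pixel. The paper's exact inflation and attention formulations may differ, and the depth axis of the features is dropped here for brevity.

```python
import numpy as np

def inflate_conv2d_kernel(w2d, depth):
    """Inflate a 2D kernel (Cout, Cin, kH, kW) to 3D (Cout, Cin, depth, kH, kW).

    The kernel is repeated along the new depth axis and divided by depth, so an
    input that is constant along depth yields the same response as the 2D kernel.
    """
    return np.repeat(w2d[:, :, None, :, :], depth, axis=2) / depth

def pixelwise_attention_fusion(features, scores):
    """Fuse M modality feature maps with per-pixel attention weights.

    features: (M, C, H, W) feature maps, one per modality
    scores:   (M, H, W) unnormalised attention scores
    Returns (fused, alpha) with shapes (C, H, W) and (M, H, W).
    """
    scores = scores - scores.max(axis=0, keepdims=True)  # numerical stability
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=0, keepdims=True)            # softmax over modalities
    fused = (alpha[:, None, :, :] * features).sum(axis=0)
    return fused, alpha

rng = np.random.default_rng(0)
w2d = rng.normal(size=(8, 3, 3, 3))
w3d = inflate_conv2d_kernel(w2d, depth=4)

features = rng.normal(size=(3, 16, 6, 6))
scores = rng.normal(size=(3, 6, 6))
fused, alpha = pixelwise_attention_fusion(features, scores)
print(w3d.shape, fused.shape)
```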

5. Empirical Evidence: Robustness Across Three Benchmarks

The model was rigorously evaluated across three datasets representing distinct sensing challenges: DSTL (Homogeneous: RGB/NIR/SWIR), Potsdam (Heterogeneous: RGB/IRRG/DSM), and Hunan (Heterogeneous: MSI/SAR/DEM).

Comparison of IoU Scores (Target Land-Cover)

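| Dataset (class) | Scenario | CMX | MMANet | CBC-SLP | Gain (CBC-SLP vs. best baseline) |
| --- | --- | --- | --- | --- | --- |
| DSTL (Crop) | Full Modality | 0.860 | 0.876 | 0.903 | +0.027 |
| DSTL (Crop) | Only SWIR | 0.849 | 0.851 | 0.865 | +0.014 |
| Potsdam (Tree) | Full Modality | 0.541 | 0.699 | 0.766 | +0.067 |
| Potsdam (Tree) | Only DSM | 0.444 | 0.561 | 0.576 | +0.015 |
| Hunan (Forest) | Full Modality | 0.651 | 0.650 | 0.659 | +0.008 |
| Hunan (Forest) | Only MSI | 0.599 | 0.603 | 0.616 | +0.013 |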

One exception should be noted: in the Hunan DEM-only scenario, CBC-SLP did not come out ahead, scoring slightly lower (0.510 IoU). This is attributed to DEM features being significantly less discriminative for forest segmentation than the shared features learned from MSI/SAR. In such cases, the model correctly prioritizes the shared latent space over less informative private cues.
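For readers unfamiliar with the metric, the scores above are per-class Intersection over Union (IoU) values. A minimal sketch of the computation follows; the function name `class_iou` and the tiny label maps are illustrative, not taken from the paper's evaluation code.

```python
import numpy as np

def class_iou(pred, target, cls):
    """Intersection over Union for one class between two integer label maps."""
    pred_c = pred == cls
    target_c = target == cls
    union = np.logical_or(pred_c, target_c).sum()
    if union == 0:
        return float("nan")  # class absent from both maps: IoU is undefined
    return float(np.logical_and(pred_c, target_c).sum() / union)

# Hypothetical 2x3 prediction and ground truth; class 1 is the target class.
pred = np.array([[1, 1, 0],
                 [0, 1, 0]])
target = np.array([[1, 0, 0],
                   [0, 1, 1]])

iou = class_iou(pred, target, cls=1)
print(iou)
```

Here two pixels are labelled class 1 in both maps and four pixels are labelled class 1 in at least one, giving an IoU of 0.5.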

Information Gap Analysis ($\Delta_p(m)$)

To validate the model’s internal richness, an Information Gap Analysis was conducted using an auxiliary 1x1 convolution classifier. This classifier was trained to measure conditional uncertainty ($H(Y \mid U_m)$) at the decoder input. The positive $\Delta_p(m)$ values across nearly all scenarios indicate that the CBC-SLP decoder input remains semantically more informative than that of the baseline models, recovering complementary information that alignment-based methods discard.
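The probing idea can be sketched as follows. A 1x1 convolution is the same linear classifier applied at every pixel, and its mean cross-entropy upper-bounds the conditional entropy $H(Y \mid U_m)$ of the features it reads. The sketch only evaluates a probe with hand-set weights; training is omitted, the function name `probe_cross_entropy` and the toy features are hypothetical, and the sign convention (baseline uncertainty minus CBC-SLP uncertainty) is an assumption chosen so that positive values favour CBC-SLP, as in the text.

```python
import numpy as np

def probe_cross_entropy(features, labels, weight, bias):
    """Mean per-pixel cross-entropy (nats) of a 1x1-conv probe.

    features: (C, H, W) decoder-input features
    labels:   (H, W) integer class labels
    weight:   (K, C) probe weights; bias: (K,)
    """
    # A 1x1 convolution is a per-pixel linear map over channels.
    logits = np.einsum("kc,chw->khw", weight, features) + bias[:, None, None]
    logits = logits - logits.max(axis=0, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=0, keepdims=True))
    h_idx, w_idx = np.indices(labels.shape)
    return float(-log_probs[labels, h_idx, w_idx].mean())

labels = np.array([[0, 1],
                   [2, 1]])
K = 3
# Hypothetical features: one set encodes the label, the other carries nothing.
informative = np.eye(K)[labels].transpose(2, 0, 1)   # (C=3, H, W) one-hot
uninformative = np.zeros_like(informative)
weight, bias = 5.0 * np.eye(K), np.zeros(K)

ce_informative = probe_cross_entropy(informative, labels, weight, bias)
ce_uninformative = probe_cross_entropy(uninformative, labels, weight, bias)
gap = ce_uninformative - ce_informative  # assumed sign convention
print(ce_informative, ce_uninformative, gap)
```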

6. Implications for AI Safety and Red-Teaming

For AI safety researchers and practitioners, CBC-SLP provides a blueprint for “Failure-First” engineering:

  1. Mitigating Covert Degradation: Traditional fusion models may fail silently when a sensor’s quality drops. The structured routing in CBC-SLP makes the impact of a missing sensor predictable, allowing system health monitors to correlate sensor outages with specific performance bounds.
  2. Gradient Isolation: By using the multiplicative gating mechanism, the architecture prevents “gradient interference.” The model does not learn to compensate for missing data by distorting its understanding of available data.
  3. The Unified Model Advantage: Training a single “Unified Model” with random modality dropout is far cheaper than training $2^M - 1$ separate models, one for each non-empty sensor combination. CBC-SLP approaches the theoretical upper bound of specialized models without the combinatorial training overhead.
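The combinatorics behind the unified model are easy to check. The sketch below enumerates the non-empty availability patterns and draws a random mask per training sample; sampling uniformly over patterns is an assumption, since the dropout schedule used in the paper is not given here, and both function names are hypothetical.

```python
import itertools
import numpy as np

def availability_patterns(num_modalities):
    """All binary availability patterns with at least one sensor present."""
    return [p for p in itertools.product([0, 1], repeat=num_modalities) if any(p)]

def sample_availability_mask(batch_size, num_modalities, rng):
    """Draw a (B, M) mask, one non-empty pattern per sample (uniform, assumed)."""
    patterns = np.array(availability_patterns(num_modalities))
    idx = rng.integers(0, len(patterns), size=batch_size)
    return patterns[idx]

patterns = availability_patterns(3)
print(len(patterns))  # 2**3 - 1 specialised models would be needed otherwise

rng = np.random.default_rng(0)
mask = sample_availability_mask(batch_size=8, num_modalities=3, rng=rng)
print(mask)
```

With three modalities, as in each benchmark above, a specialised-model approach needs seven networks, while the unified model sees all seven patterns through one mask sampler.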

7. Conclusion: Takeaways for the Future of Robust AI

The transition from loss-based alignment to architectural structured latent projection is essential for the next generation of robust AI.

Critical Takeaways:

  1. Inductive Bias > Loss terms: Architectural constraints are more reliable than soft regularization for preserving modality-specific cues.
  2. Preservation of Discriminative Cues: Maintaining an information gap ($\Delta_p$) is necessary for peak accuracy; “perfect alignment” is an information-theoretic sacrifice.
  3. Unified Resilience: A single model can achieve state-of-the-art performance across all sensor failure states when structured correctly.

Future research will extend these structured projections to include non-visual data, such as text-based metadata, to further insulate autonomous systems against unpredictable environmental variables.


Resources

  • Code: GitHub Repository
  • Acknowledgement: This research was supported by the Scientific and Technological Research Council of Turkey (TUBITAK) under Grant Number 124E725.
