RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework
RAD-2 combines diffusion-based trajectory generation with RL-optimized discriminator reranking to improve closed-loop autonomous driving planning, validated through simulation and real-world closed-loop driving tests.
1. Introduction: Beyond Imitation in Motion Planning
Current autonomous driving systems rely heavily on imitation learning (IL) and diffusion-based planners to model complex, multimodal trajectory distributions. However, these models face significant hurdles in safety-critical urban environments. Pure IL planners often suffer from stochastic instabilities, producing low-quality trajectories when the training data is noisy or unevenly distributed. Crucially, the traditional open-loop training paradigm of IL causes a structural mismatch with the closed-loop nature of real-world driving. This leads to causal confusion, where agents learn superficial “shortcut behaviors”—correlating states and actions without understanding the underlying causal factors of safety.
RAD-2 addresses these failures by bridging the gap between high-dimensional trajectory generation and low-dimensional reinforcement learning (RL) rewards. Framed as an inference-time scaling solution, RAD-2 raises the performance upper bound of autonomous agents without requiring additional expert supervision. By introducing a framework that evaluates long-term outcomes rather than just mimicking short-term expert samples, we provide a mechanism for corrective negative feedback that is missing in pure imitation frameworks.
2. The Architecture: A Decoupled Generator-Discriminator Synergy
RAD-2 utilizes a two-part design that separates the task of “imagining” trajectories from the task of “evaluating” them. This decoupling is essential for stabilizing RL optimization by preventing the application of sparse, low-dimensional rewards directly to high-dimensional, temporally structured trajectory spaces.
The Diffusion-Based Generator
The generator models a multimodal distribution over future trajectories, p(τ | E). It first encodes the observation into a unified scene embedding E by tokenizing static map elements (E_map), dynamic agents (E_agent), and navigation waypoints (E_nav). These embeddings are fused via a cross-attention module, E = CrossAttn(E_map, E_agent, E_nav). This embedding conditions a Diffusion Transformer (DiT), which iteratively denoises N independent trajectory candidates. By using a diffusion process, the model maintains an expressive manifold of feasible future possibilities.
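As a rough sketch of this generate-by-denoising loop: the `denoise_step` below is a toy stand-in for the DiT (a linear pull toward a goal-conditioned prior plus shrinking noise), and the goal vector, step counts, and shapes are all illustrative assumptions, not the paper's model.

```python
import numpy as np

def denoise_step(x, t, scene_emb, rng):
    """Placeholder for the DiT denoiser: nudges noisy trajectories toward a
    straight-line prior conditioned on the scene embedding. A real model
    would predict noise with a Transformer conditioned on E."""
    target = np.linspace(0, 1, x.shape[1])[None, :, None] * scene_emb[:2]
    return x + 0.2 * (target - x) + 0.05 * np.sqrt(t) * rng.standard_normal(x.shape)

def generate_candidates(scene_emb, n_candidates=8, horizon=20, steps=10, seed=0):
    """Iteratively denoise N independent trajectory candidates, each a
    sequence of (x, y) waypoints, starting from pure Gaussian noise."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_candidates, horizon, 2))  # noise at t = steps
    for t in range(steps, 0, -1):
        x = denoise_step(x, t / steps, scene_emb, rng)
    return x

goal = np.array([10.0, 2.0])       # hypothetical navigation waypoint
trajs = generate_candidates(goal)
print(trajs.shape)                 # (8, 20, 2)
```

Because each candidate is denoised independently from a distinct noise seed, the batch covers multiple behavioral modes rather than collapsing to one average plan.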
The RL-Optimized Discriminator
The discriminator functions as a learnable preference model over the generated trajectory manifold. It uses a Transformer-based architecture to process the candidates. Each trajectory τ_i is embedded, prepended with a learnable summary token, and passed through a Transformer encoder to produce a trajectory-level query q_i. The discriminator performs multi-source cross-attention between q_i and the scene context E, then maps the result to a scalar “driving quality” score s_i ∈ (0, 1) via a sigmoid activation.
Stability Through Decoupling:
Restricting RL to the discriminator allows the optimization signal to align naturally with low-dimensional scalar rewards, while the generator handles the high-dimensional spatial constraints required for physically feasible motion.
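The decoupled pipeline reduces, at inference time, to a best-of-N loop: generate candidates, score each with the discriminator, execute the argmax. In the sketch below, the clearance-based `score_trajectory` is a hypothetical stand-in for the learned discriminator, kept only to make the sigmoid-scored reranking concrete.

```python
import numpy as np

def score_trajectory(traj, obstacle):
    """Hypothetical stand-in for the RL-optimized discriminator: maps a
    trajectory to a scalar 'driving quality' score in (0, 1) via a
    sigmoid, here driven by clearance from a single obstacle."""
    clearance = np.min(np.linalg.norm(traj - obstacle, axis=-1))
    return 1.0 / (1.0 + np.exp(-(clearance - 1.0)))  # sigmoid on the margin

def select_best(candidates, obstacle):
    """Generator-discriminator synergy at inference: score every candidate
    and latch onto the highest-scoring one for execution."""
    scores = np.array([score_trajectory(c, obstacle) for c in candidates])
    return candidates[int(np.argmax(scores))], scores

rng = np.random.default_rng(0)
candidates = rng.standard_normal((8, 20, 2))   # 8 candidate trajectories
best, scores = select_best(candidates, obstacle=np.array([0.0, 0.0]))
print(scores.shape)   # (8,)
```

Raising the candidate count N widens the search over the generator's manifold without touching either network's weights, which is what makes this an inference-time scaling knob.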
3. Cracking the Credit Assignment Problem: TC-GRPO
In continuous driving spaces, agents face a severe credit assignment problem due to weak instantaneous reward-action correlations. To solve this, RAD-2 introduces Temporally Consistent Group Relative Policy Optimization (TC-GRPO).
- Latched Execution Strategy: To ensure behavioral coherence, a selected trajectory is reused over a fixed horizon H. This prevents high-frequency mode-switching that would otherwise inject noise into the advantage signal. Crucially, the strategy supports asynchronous termination if safety constraints are violated, preserving reactive capability.
- Group Advantage Standardization: Advantages are computed and standardized relative to a group of G rollouts generated from the same initial state, A_i = (R_i − mean(R_1..G)) / std(R_1..G). This ensures the reinforcement signal specifically rewards coherent trajectory hypotheses that outperform the group mean.
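A minimal sketch of the two mechanisms together, assuming a toy one-dimensional environment; the `env` lambda, horizon, and reward values are invented for illustration and are not from the paper.

```python
import numpy as np

def latched_return(traj, env_step, horizon):
    """Latched execution: follow one selected trajectory for a fixed
    horizon H, terminating early if a safety constraint is violated."""
    total = 0.0
    for t in range(min(horizon, len(traj))):
        reward, safe = env_step(traj[t])
        total += reward
        if not safe:          # asynchronous termination on violation
            break
    return total

def group_advantages(returns, eps=1e-8):
    """TC-GRPO-style standardization: advantages are computed relative
    to a group of rollouts sharing the same initial state."""
    r = np.asarray(returns, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Toy environment: reward 1 per step, unsafe once position passes x = 3.
env = lambda x: (1.0, x < 3.0)
group = [latched_return(np.arange(6) * s, env, horizon=6) for s in (0.5, 1.0, 2.0)]
adv = group_advantages(group)
print(group, np.round(adv, 2))
```

Because each rollout commits to one trajectory for the whole latch window, the standardized advantage credits the trajectory hypothesis as a unit instead of individual noisy steps.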
The Multi-Objective Reward Function:
- Safety-Criticality (R_safe): A “bottleneck-style formulation” using the worst-case temporal margin. It computes Time-To-Collision (TTC) through counterfactual interpolation; any momentary safety violation within the sequence dominates the final reward.
- Navigational Efficiency (R_eff): Anchors the ego-vehicle’s progress within a target interval relative to expert demonstrations, penalizing both sluggishness and overly aggressive deviations.
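A hedged sketch of how such bottleneck and interval-anchored terms might look; the thresholds (`ttc_floor`, `ttc_cap`, the 0.8–1.2 progress band) are invented for illustration, not taken from the paper.

```python
import numpy as np

def safety_reward(ttc_sequence, ttc_floor=0.5, ttc_cap=3.0):
    """Bottleneck-style safety term: the worst-case (minimum) TTC over the
    whole sequence dominates the reward, so one momentary violation cannot
    be averaged away by otherwise-safe steps."""
    worst = np.min(ttc_sequence)
    return float(np.clip((worst - ttc_floor) / (ttc_cap - ttc_floor), 0.0, 1.0))

def efficiency_reward(ego_progress, expert_progress, lo=0.8, hi=1.2):
    """Interval-anchored efficiency term: full reward when ego progress lies
    in a band around the expert's, penalizing sluggishness and aggression."""
    ratio = ego_progress / expert_progress
    if lo <= ratio <= hi:
        return 1.0
    return float(max(0.0, 1.0 - abs(ratio - np.clip(ratio, lo, hi))))

print(safety_reward([4.0, 2.5, 0.4, 3.0]))  # worst-case TTC dominates -> 0.0
print(efficiency_reward(45.0, 50.0))        # within the target band -> 1.0
```

Note the min (rather than mean) in `safety_reward`: that single design choice is what makes the formulation a bottleneck.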
4. Iterative Refinement: On-Policy Generator Optimization (OGO)
While the discriminator reranks candidates, the generator must be shifted toward high-reward manifolds. On-policy Generator Optimization (OGO) converts closed-loop feedback into structured longitudinal optimization signals. OGO preserves the spatial path of a trajectory while modifying only the temporal progression (velocity profile).
| Scenario | Action | Condition |
|---|---|---|
| Safety-driven | Deceleration via a fixed ratio | TTC below the safety threshold |
| Efficiency-driven | Acceleration via a fixed ratio | Ego Progress lags the target while TTC stays above the safety threshold |
These optimized segments are aggregated into an on-policy dataset to fine-tune the generator via a mean squared error loss, progressively shifting the distribution density toward safer behaviors.
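One way to realize path-preserving retiming is to re-parameterize the trajectory by arc length and rescale progress along it. The sketch below assumes a 2-D waypoint trajectory with distinct points and is an illustrative stand-in, not the paper's exact procedure.

```python
import numpy as np

def retime_trajectory(traj, speed_ratio):
    """OGO-style longitudinal edit: keep the spatial path fixed and rescale
    only the temporal progression along it (< 1 decelerates for safety,
    > 1 accelerates for efficiency)."""
    # Arc-length parameterization of the original path.
    seg = np.linalg.norm(np.diff(traj, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])
    # Same timestamps, scaled distance-along-path (clamped to the path end).
    s_new = np.clip(s * speed_ratio, 0.0, s[-1])
    # Resample x and y at the new arc-length positions.
    x = np.interp(s_new, s, traj[:, 0])
    y = np.interp(s_new, s, traj[:, 1])
    return np.stack([x, y], axis=1)

path = np.stack([np.linspace(0, 10, 11), np.zeros(11)], axis=1)  # straight line
slower = retime_trajectory(path, 0.5)   # same path, half the progress per step
print(slower[-1])                        # final point lands at [5., 0.]
```

Because only the progression along the fixed path changes, the edited segment stays on drivable geometry while its velocity profile shifts toward the targeted behavior.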
5. Scaling Training with BEV-Warp Simulation
Large-scale RL requires high-throughput simulation. RAD-2 introduces BEV-Warp, which operates directly in the Bird’s-Eye View feature space, bypassing the computational overhead and “sim-to-real” gaps of image-level rendering.
- Spatial Equivariance: This is the core technical justification for BEV-Warp. Geometric transformations in the feature space (using a warp matrix derived from ego-pose deviation) correspond strictly to physical movements in the world.
- The Mechanism: Synthesized features are generated via bilinear interpolation, F̂(p) = BilinearSample(F, W · p), where W is the warp matrix derived from the ego-pose deviation and p ranges over BEV grid locations.
- Advantages: Unlike generative world models, which are susceptible to cumulative temporal drift, BEV-Warp maintains high feature-level fidelity and preserves complex semantic geometries like lane topologies.
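A minimal NumPy sketch of such a feature-space warp, assuming an (H, W, C) BEV grid and a 2×3 affine warp matrix; borders are clamped for simplicity, whereas a real system would handle newly out-of-view regions explicitly.

```python
import numpy as np

def bev_warp(feat, warp):
    """Warp a BEV feature map (H, W, C) by a 2x3 affine matrix derived from
    the ego-pose deviation, using bilinear interpolation. Under spatial
    equivariance, this feature-space transform mirrors the physical motion."""
    H, W, C = feat.shape
    ys, xs = np.mgrid[0:H, 0:W].astype(float)
    # Source coordinates for each output pixel.
    src_x = warp[0, 0] * xs + warp[0, 1] * ys + warp[0, 2]
    src_y = warp[1, 0] * xs + warp[1, 1] * ys + warp[1, 2]
    x0 = np.clip(np.floor(src_x).astype(int), 0, W - 2)
    y0 = np.clip(np.floor(src_y).astype(int), 0, H - 2)
    wx = np.clip(src_x - x0, 0.0, 1.0)[..., None]
    wy = np.clip(src_y - y0, 0.0, 1.0)[..., None]
    top = feat[y0, x0] * (1 - wx) + feat[y0, x0 + 1] * wx
    bot = feat[y0 + 1, x0] * (1 - wx) + feat[y0 + 1, x0 + 1] * wx
    return top * (1 - wy) + bot * wy

feat = np.arange(16.0).reshape(4, 4, 1)               # toy single-channel BEV grid
shift = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])  # ego moved one cell along x
warped = bev_warp(feat, shift)
print(warped[0, 0, 0])   # 1.0, the value originally at (x=1, y=0)
```

Because the warp is a pure resampling of existing features, repeated application cannot hallucinate new content, which is the source of its robustness to the cumulative drift seen in generative world models.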
6. The Proof: Benchmarks and Safety Improvements
RAD-2 was validated on the photorealistic Senna-2 benchmark and large-scale closed-loop tests. Quantitative results show that RAD-2 significantly raises the performance ceiling compared to strong diffusion baselines like ResAD.
- Collision Reduction: RAD-2 achieved a 56% reduction in collision rate (CR) in safety-oriented scenarios.
- Performance Precision: In head-to-head testing against ResAD, the At-Fault Collision Rate (AF-CR) dropped from 0.264 to 0.092, while Safety@1s improved from 0.418 to 0.730.
Key Performance Metrics (Senna-2 & BEV-Warp):
- Collision Rate (CR): Overall frequency of safety incidents.
- At-Fault Collision Rate (AF-CR): Incidents specifically attributable to ego-vehicle decision errors.
- Safety@1s / Safety@2s: Proportion of clips where minimum TTC remained above the 1s or 2s safety buffer.
- Ego Progress (EP): Reliability of task completion relative to the reference route.
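For concreteness, the Safety@k metric as defined above can be computed directly from per-clip TTC logs; the clip values below are made up for illustration.

```python
import numpy as np

def safety_at(ttc_per_clip, buffer_s):
    """Safety@k: fraction of clips whose minimum TTC stays above a
    k-second safety buffer, per the metric definition in the text."""
    mins = np.array([np.min(clip) for clip in ttc_per_clip])
    return float(np.mean(mins > buffer_s))

clips = [[3.0, 2.5, 1.8], [0.9, 2.2, 4.0], [2.1, 2.4, 2.2]]
print(safety_at(clips, 1.0))  # 2 of 3 clips stay above the 1s buffer
print(safety_at(clips, 2.0))  # 1 of 3 clips stay above the 2s buffer
```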
7. Conclusion: Takeaways for AI Safety Researchers
The RAD-2 framework demonstrates that safety in autonomous systems is not merely a product of data volume, but of architectural intent.
Three Critical Takeaways:
- Inference-time Scaling: By decoupling generation from ranking, we can increase the sample count at test-time to explore a denser action space, identifying safer solutions without retraining the core model.
- Temporal Consistency as a Physical Prior: Implementing “latched execution” and TC-GRPO is a prerequisite for solving the credit assignment problem in continuous robotics, ensuring rewards are tied to coherent intentions.
- The Efficiency of Feature-Level Simulation: Leveraging Spatial Equivariance through feature-warping provides a scalable path for closed-loop RL training, avoiding the fidelity loss and latency of generative video models.
Ultimately, RAD-2 provides a robust methodology for converting sparse, real-world environmental feedback into high-dimensional policy refinements, raising the performance upper bound for safe, human-aligned autonomous agents.
Read the full paper on arXiv · PDF