Daily Paper

RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework

RAD-2 combines diffusion-based trajectory generation with RL-optimized discriminator reranking to improve closed-loop autonomous driving planning, validated through simulation and real-world...

arXiv:2604.15308 · Empirical Study

Hao Gao, Shaoyu Chen, Yifan Zhu, Yuehao Song et al.

autonomous-driving-planning · diffusion-models-control · reinforcement-learning-trajectory · closed-loop-feedback · multimodal-uncertainty · credit-assignment-problem


1. Introduction: Beyond Imitation in Motion Planning

Current autonomous driving systems rely heavily on imitation learning (IL) and diffusion-based planners to model complex, multimodal trajectory distributions. However, these models face significant hurdles in safety-critical urban environments. Pure IL planners often suffer from stochastic instabilities, producing low-quality trajectories when the training data is noisy or unevenly distributed. Crucially, the traditional open-loop training paradigm of IL causes a structural mismatch with the closed-loop nature of real-world driving. This leads to causal confusion, where agents learn superficial “shortcut behaviors”—correlating states and actions without understanding the underlying causal factors of safety.

RAD-2 addresses these failures by bridging the gap between high-dimensional trajectory generation and low-dimensional reinforcement learning (RL) rewards. Framed as an inference-time scaling solution, RAD-2 raises the performance upper bound of autonomous agents without requiring additional expert supervision. By introducing a framework that evaluates long-term outcomes rather than just mimicking short-term expert samples, we provide a mechanism for corrective negative feedback that is missing in pure imitation frameworks.


2. The Architecture: A Decoupled Generator-Discriminator Synergy

RAD-2 utilizes a two-part design that separates the task of “imagining” trajectories from the task of “evaluating” them. This decoupling is essential for stabilizing RL optimization by preventing the application of sparse, low-dimensional rewards directly to high-dimensional, temporally structured trajectory spaces.

The Diffusion-Based Generator

The generator models a multimodal distribution over future trajectories, $G_{\theta}(\tau \mid o_t)$. It first encodes the observation into a unified scene embedding $E_{scene}$ by tokenizing static map elements ($X_{map}$), dynamic agents ($X_{agent}$), and navigation waypoints ($X_{nav}$). These embeddings are fused via a cross-attention module $F(\cdot)$:

$$E_{scene} = F(T_b, T_m, T_a, T_n)$$

This embedding conditions a Diffusion Transformer (DiT), which iteratively denoises $M$ independent trajectory candidates. By using a diffusion process, the model maintains an expressive manifold of feasible future possibilities.
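To make the conditioning path concrete, here is a minimal PyTorch sketch of scene fusion followed by iterative denoising. Shapes, layer sizes, and the `denoise_step` method are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

class SceneFusion(nn.Module):
    """Fuses BEV, map, agent, and navigation tokens into E_scene via
    cross-attention, mirroring E_scene = F(T_b, T_m, T_a, T_n)."""
    def __init__(self, dim: int = 256, heads: int = 8, n_queries: int = 64):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.query = nn.Parameter(torch.randn(1, n_queries, dim))  # learned scene queries

    def forward(self, t_b, t_m, t_a, t_n):
        tokens = torch.cat([t_b, t_m, t_a, t_n], dim=1)   # (B, N_total, dim)
        q = self.query.expand(tokens.size(0), -1, -1)
        e_scene, _ = self.attn(q, tokens, tokens)         # (B, n_queries, dim)
        return e_scene

@torch.no_grad()
def sample_trajectories(dit, e_scene, M: int = 16, horizon: int = 40, steps: int = 20):
    """Iteratively denoises M independent candidates, each a (horizon, 2)
    sequence of future (x, y) waypoints, conditioned on E_scene."""
    tau = torch.randn(M, horizon, 2)                      # start from pure noise
    for t in reversed(range(steps)):
        tau = dit.denoise_step(tau, t, e_scene)           # hypothetical DiT API
    return tau
```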

The RL-Optimized Discriminator

The discriminator functions as a learnable preference model over the generated trajectory manifold. It utilizes a Transformer-based architecture to process the $M$ candidates. Each trajectory is embedded, prepended with a learnable $[CLS]$ token, and passed through a Transformer encoder to produce a trajectory-level query $Q_{\tau}$. The discriminator performs multi-source cross-attention between $Q_{\tau}$ and the scene context to produce a scalar "driving quality" score $s(\tau) \in [0, 1]$ via a sigmoid activation $\sigma$.
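A toy version of this scoring head, under assumed shapes and layer counts (the $[CLS]$ prepending, trajectory-level query, cross-attention, and sigmoid output follow the description above):

```python
import torch
import torch.nn as nn

class TrajectoryDiscriminator(nn.Module):
    """Maps each candidate trajectory to a driving-quality score in [0, 1]."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.embed = nn.Linear(2, dim)                    # (x, y) waypoints -> tokens
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))   # learnable [CLS] token
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, 1)

    def forward(self, tau, e_scene):
        # tau: (M, T, 2) candidates; e_scene: (M, N, dim) scene context
        x = torch.cat([self.cls.expand(tau.size(0), -1, -1), self.embed(tau)], dim=1)
        q_tau = self.encoder(x)[:, :1]                    # trajectory-level query Q_tau
        ctx, _ = self.cross_attn(q_tau, e_scene, e_scene) # cross-attention with the scene
        return torch.sigmoid(self.head(ctx.squeeze(1)))   # s(tau) in [0, 1]
```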

Stability Through Decoupling:

Restricting RL to the discriminator allows the optimization signal to align naturally with low-dimensional scalar rewards, while the generator handles the high-dimensional spatial constraints required for physically feasible motion.
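At inference, the two halves compose into a simple rank-and-select loop. The sketch below assumes generic `generator` and `discriminator` callables standing in for the modules above, not the paper's exact API:

```python
import torch

@torch.no_grad()
def plan(generator, discriminator, e_scene, M: int = 16):
    """Propose M candidates, score each, and execute the top-ranked plan."""
    candidates = generator(e_scene, M)             # (M, T, 2) trajectory proposals
    scores = discriminator(candidates, e_scene)    # (M, 1) driving-quality scores
    return candidates[scores.squeeze(-1).argmax()]
```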


3. Cracking the Credit Assignment Problem: TC-GRPO

In continuous driving spaces, agents face a severe credit assignment problem due to weak instantaneous reward-action correlations. To solve this, RAD-2 introduces Temporally Consistent Group Relative Policy Optimization (TC-GRPO).

  • Latched Execution Strategy ($H_{reuse}$): To ensure behavioral coherence, a selected trajectory is reused over a fixed horizon $H_{reuse}$. This prevents high-frequency mode-switching that would otherwise inject noise into the advantage signals. Crucially, this strategy supports asynchronous termination if safety constraints are violated, maintaining reactive capability.
  • Group Advantage Standardization: Advantages $A_i$ are computed and standardized relative to a group of $G$ rollouts generated from the same initial state: $$A_i = \frac{r_i - \mathrm{mean}(\{r_1, \dots, r_G\})}{\mathrm{std}(\{r_1, \dots, r_G\})}$$ This ensures the reinforcement signal specifically rewards coherent trajectory hypotheses that outperform the group mean (a minimal sketch follows this list).
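The standardization itself is a few lines; this sketch assumes raw scalar rewards per rollout:

```python
import numpy as np

def group_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """TC-GRPO group-relative advantages: standardize the G rollout rewards
    from one initial state, so only above-mean hypotheses are reinforced."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# e.g. G = 4 rollouts from the same state:
# group_advantages(np.array([0.9, 0.2, 0.5, 0.4]))  # positive only for the first
```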

The Multi-Objective Reward Function:

  • Safety-Criticality ($r_{coll}$): A "bottleneck-style" formulation using the worst-case temporal margin. It computes the Time-To-Collision ($T_t$) through counterfactual interpolation; any momentary safety violation within the sequence dominates the final reward.
  • Navigational Efficiency ($r_{eff}$): Anchors the ego-vehicle's progress within a target interval relative to expert demonstrations, penalizing both sluggishness and overly aggressive deviations. (Both terms are sketched below.)
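A hedged sketch of how these two terms might be computed; the thresholds and shaping functions below are assumptions for illustration, not the paper's values:

```python
import numpy as np

def collision_reward(ttc: np.ndarray, gamma_safe: float = 2.0) -> float:
    """Bottleneck-style safety term: the worst-case temporal margin over the
    rollout dominates, so one momentary violation caps the whole reward."""
    worst = ttc.min()                                # assumed bottleneck aggregation
    return float(np.clip(worst / gamma_safe, 0.0, 1.0))

def efficiency_reward(progress: float, lo: float = 0.8, hi: float = 1.2) -> float:
    """Anchors ego progress (relative to the expert) inside [lo, hi]."""
    if progress < lo:
        return progress / lo                         # penalize sluggishness
    if progress > hi:
        return max(0.0, 1.0 - (progress - hi))       # penalize aggressive deviation
    return 1.0
```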

4. Iterative Refinement: On-Policy Generator Optimization (OGO)

While the discriminator reranks candidates, the generator must be shifted toward high-reward manifolds. On-policy Generator Optimization (OGO) converts closed-loop feedback into structured longitudinal optimization signals. OGO preserves the spatial path of a trajectory while modifying only the temporal progression (velocity profile).

  • Safety-driven: deceleration via a fixed ratio $\rho \in (0,1)$, triggered when $T_t < \gamma_{safe}$.
  • Efficiency-driven: acceleration via a fixed ratio $\rho' > 1$, triggered when ego progress lags and $T_t$ exceeds the safety threshold.

These optimized segments $\tau^{opt}$ are aggregated into an on-policy dataset to fine-tune the generator $G_{\theta}$ via a mean-squared-error loss, progressively shifting the distribution density toward safer behaviors.
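One way to realize the retiming step is sketched below; the arc-length resampling is an assumed implementation detail, with only the ratio $\rho$ and the path-preserving constraint coming from the text:

```python
import numpy as np

def retime_trajectory(tau: np.ndarray, rho: float) -> np.ndarray:
    """tau: (T, 2) waypoints at fixed time steps. Returns waypoints along the
    identical spatial path, but covering rho times the original arc length,
    so rho < 1 decelerates and rho > 1 accelerates."""
    seg = np.linalg.norm(np.diff(tau, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])              # arc length per waypoint
    s_new = np.clip(np.linspace(0.0, s[-1] * rho, len(tau)), 0.0, s[-1])
    x = np.interp(s_new, s, tau[:, 0])
    y = np.interp(s_new, s, tau[:, 1])
    return np.stack([x, y], axis=1)
```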


5. Scaling Training with BEV-Warp Simulation

Large-scale RL requires high-throughput simulation. RAD-2 introduces BEV-Warp, which operates directly in the Bird’s-Eye View feature space, bypassing the computational overhead and “sim-to-real” gaps of image-level rendering.

  • Spatial Equivariance: This is the core technical justification for BEV-Warp. Geometric transformations in the feature space (using a warp matrix $M_{t+1}$ derived from ego-pose deviation) correspond strictly to physical movements in the world.
  • The Mechanism: Synthesized features $B_{t+1}$ are generated via bilinear interpolation: $B_{t+1} = W(B^{ref}_{t+1}, M_{t+1})$ (see the sketch after this list).
  • Advantages: Unlike generative world models, which are susceptible to cumulative temporal drift, BEV-Warp maintains high feature-level fidelity and preserves complex semantic geometries like lane topologies.
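In PyTorch terms, the warp reduces to an affine grid plus bilinear sampling; encoding $M_{t+1}$ as a 2x3 matrix in normalized grid coordinates is an assumption of this sketch:

```python
import torch
import torch.nn.functional as F

def bev_warp(b_ref: torch.Tensor, m: torch.Tensor) -> torch.Tensor:
    """b_ref: (B, C, H, W) reference BEV features; m: (B, 2, 3) warp matrices
    in normalized grid coordinates. Returns synthesized features B_{t+1}
    via bilinear interpolation, mirroring B_{t+1} = W(B_ref, M_{t+1})."""
    grid = F.affine_grid(m, b_ref.shape, align_corners=False)
    return F.grid_sample(b_ref, grid, mode="bilinear", align_corners=False)

# Usage: a pure ego translation by (dx, dy) in normalized coordinates
# b_next = bev_warp(b_t, torch.tensor([[[1.0, 0.0, dx], [0.0, 1.0, dy]]]))
```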

6. The Proof: Benchmarks and Safety Improvements

RAD-2 was validated on the photorealistic Senna-2 benchmark and large-scale closed-loop tests. Quantitative results show that RAD-2 significantly raises the performance ceiling compared to strong diffusion baselines like ResAD.

  • Collision Reduction: RAD-2 achieved a 56% reduction in collision rate (CR) in safety-oriented scenarios.
  • Performance Precision: In head-to-head testing against ResAD, the At-Fault Collision Rate (AF-CR) dropped from 0.264 to 0.092, while Safety@1s improved from 0.418 to 0.730.

Key Performance Metrics (Senna-2 & BEV-Warp):

  1. Collision Rate (CR): Overall frequency of safety incidents.
  2. At-Fault Collision Rate (AF-CR): Incidents specifically attributable to ego-vehicle decision errors.
  3. Safety@1s / Safety@2s: Proportion of clips where the minimum TTC remained above the 1 s or 2 s safety buffer (tallied as in the sketch after this list).
  4. Ego Progress (EP): Reliability of task completion relative to the reference route.
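For concreteness, Safety@1s/2s can be computed from per-clip minimum TTC values; the aggregation below is an assumption about how the benchmark scores clips:

```python
import numpy as np

def safety_at(min_ttc_per_clip: np.ndarray, buffer_s: float) -> float:
    """Fraction of clips whose minimum TTC stayed above the safety buffer."""
    return float((min_ttc_per_clip > buffer_s).mean())

min_ttc = np.array([0.7, 1.4, 2.6, 3.1, 0.9])   # toy per-clip minima, in seconds
print(safety_at(min_ttc, 1.0))                  # Safety@1s = 0.6
print(safety_at(min_ttc, 2.0))                  # Safety@2s = 0.4
```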

7. Conclusion: Takeaways for AI Safety Researchers

The RAD-2 framework demonstrates that safety in autonomous systems is not merely a product of data volume, but of architectural intent.

Three Critical Takeaways:

  1. Inference-time Scaling: By decoupling generation from ranking, we can increase the sample count $M$ at test time to explore a denser action space, identifying safer solutions without retraining the core model.
  2. Temporal Consistency as a Physical Prior: Implementing “latched execution” and TC-GRPO is a prerequisite for solving the credit assignment problem in continuous robotics, ensuring rewards are tied to coherent intentions.
  3. The Efficiency of Feature-Level Simulation: Leveraging Spatial Equivariance through feature-warping provides a scalable path for closed-loop RL training, avoiding the fidelity loss and latency of generative video models.

Ultimately, RAD-2 provides a robust methodology for converting sparse, real-world environmental feedback into high-dimensional policy refinements, raising the performance upper bound for safe, human-aligned autonomous agents.

Read the full paper on arXiv · PDF