Daily Paper

RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework

RAD-2 combines diffusion-based trajectory generation with RL-optimized discriminator reranking to improve closed-loop autonomous driving planning, validated through simulation and real-world...

arXiv:2604.15308 · Empirical Study

Hao Gao, Shaoyu Chen, Yifan Zhu, Yuehao Song et al.

autonomous-driving-planning · diffusion-models-control · reinforcement-learning-trajectory · closed-loop-feedback · multimodal-uncertainty · credit-assignment-problem


1. Introduction: Beyond Imitation in Motion Planning

Current autonomous driving systems rely heavily on imitation learning (IL) and diffusion-based planners to model complex, multimodal trajectory distributions. However, these models face significant hurdles in safety-critical urban environments. Pure IL planners often suffer from stochastic instabilities, producing low-quality trajectories when the training data is noisy or unevenly distributed. Crucially, the traditional open-loop training paradigm of IL causes a structural mismatch with the closed-loop nature of real-world driving. This leads to causal confusion, where agents learn superficial “shortcut behaviors”—correlating states and actions without understanding the underlying causal factors of safety.

RAD-2 addresses these failures by bridging the gap between high-dimensional trajectory generation and low-dimensional reinforcement learning (RL) rewards. Framed as an inference-time scaling solution, RAD-2 raises the performance upper bound of autonomous agents without requiring additional expert supervision. By introducing a framework that evaluates long-term outcomes rather than just mimicking short-term expert samples, we provide a mechanism for corrective negative feedback that is missing in pure imitation frameworks.


2. The Architecture: A Decoupled Generator-Discriminator Synergy

RAD-2 utilizes a two-part design that separates the task of “imagining” trajectories from the task of “evaluating” them. This decoupling is essential for stabilizing RL optimization by preventing the application of sparse, low-dimensional rewards directly to high-dimensional, temporally structured trajectory spaces.

The Diffusion-Based Generator

The generator models a multimodal distribution over future trajectories, $G_{\theta}(\tau \mid o_t)$. It first encodes the observation into a unified scene embedding $E_{scene}$ by tokenizing static map elements ($X_{map}$), dynamic agents ($X_{agent}$), and navigation waypoints ($X_{nav}$). These embeddings are fused via a cross-attention module $F(\cdot)$:

$$E_{scene} = F(T_b, T_m, T_a, T_n)$$

This embedding conditions a Diffusion Transformer (DiT), which iteratively denoises $M$ independent trajectory candidates. By using a diffusion process, the model maintains an expressive manifold of feasible future possibilities.
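To make the conditioning path concrete, here is a minimal PyTorch sketch of scene fusion followed by iterative denoising. Shapes, layer sizes, and the `denoise_step` method are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

class SceneFusion(nn.Module):
    """Fuses BEV, map, agent, and navigation tokens into E_scene via
    cross-attention, mirroring E_scene = F(T_b, T_m, T_a, T_n)."""
    def __init__(self, dim: int = 256, heads: int = 8, n_queries: int = 64):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.query = nn.Parameter(torch.randn(1, n_queries, dim))  # learned scene queries

    def forward(self, t_b, t_m, t_a, t_n):
        tokens = torch.cat([t_b, t_m, t_a, t_n], dim=1)   # (B, N_total, dim)
        q = self.query.expand(tokens.size(0), -1, -1)
        e_scene, _ = self.attn(q, tokens, tokens)         # (B, n_queries, dim)
        return e_scene

@torch.no_grad()
def sample_trajectories(dit, e_scene, M: int = 16, horizon: int = 40, steps: int = 20):
    """Iteratively denoises M independent candidates, each a (horizon, 2)
    sequence of future (x, y) waypoints, conditioned on E_scene."""
    tau = torch.randn(M, horizon, 2)                      # start from pure noise
    for t in reversed(range(steps)):
        tau = dit.denoise_step(tau, t, e_scene)           # hypothetical DiT API
    return tau
```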

The RL-Optimized Discriminator

The discriminator functions as a learnable preference model over the generated trajectory manifold. It utilizes a Transformer-based architecture to process the $M$ candidates. Each trajectory is embedded, prepended with a learnable $[CLS]$ token, and passed through a Transformer encoder to produce a trajectory-level query $Q_{\tau}$. The discriminator performs multi-source cross-attention between $Q_{\tau}$ and the scene context to produce a scalar "driving quality" score $s(\tau) \in [0, 1]$ via a sigmoid activation $\sigma$.
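A toy version of this scoring head, under assumed shapes and layer counts (the $[CLS]$ prepending, trajectory-level query, cross-attention, and sigmoid output follow the description above):

```python
import torch
import torch.nn as nn

class TrajectoryDiscriminator(nn.Module):
    """Maps each candidate trajectory to a driving-quality score in [0, 1]."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.embed = nn.Linear(2, dim)                    # (x, y) waypoints -> tokens
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))   # learnable [CLS] token
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, 1)

    def forward(self, tau, e_scene):
        # tau: (M, T, 2) candidates; e_scene: (M, N, dim) scene context
        x = torch.cat([self.cls.expand(tau.size(0), -1, -1), self.embed(tau)], dim=1)
        q_tau = self.encoder(x)[:, :1]                    # trajectory-level query Q_tau
        ctx, _ = self.cross_attn(q_tau, e_scene, e_scene) # cross-attention with the scene
        return torch.sigmoid(self.head(ctx.squeeze(1)))   # s(tau) in [0, 1]
```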

Stability Through Decoupling:

Restricting RL to the discriminator allows the optimization signal to align naturally with low-dimensional scalar rewards, while the generator handles the high-dimensional spatial constraints required for physically feasible motion.
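At inference, the two halves compose into a simple rank-and-select loop. The sketch below assumes generic `generator` and `discriminator` callables standing in for the modules above, not the paper's exact API:

```python
import torch

@torch.no_grad()
def plan(generator, discriminator, e_scene, M: int = 16):
    """Propose M candidates, score each, and execute the top-ranked plan."""
    candidates = generator(e_scene, M)             # (M, T, 2) trajectory proposals
    scores = discriminator(candidates, e_scene)    # (M, 1) driving-quality scores
    return candidates[scores.squeeze(-1).argmax()]
```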


3. Cracking the Credit Assignment Problem: TC-GRPO

In continuous driving spaces, agents face a severe credit assignment problem due to weak instantaneous reward-action correlations. To solve this, RAD-2 introduces Temporally Consistent Group Relative Policy Optimization (TC-GRPO).

  • Latched Execution Strategy ($H_{reuse}$): To ensure behavioral coherence, a selected trajectory is reused over a fixed horizon $H_{reuse}$. This prevents high-frequency mode-switching that would otherwise inject noise into the advantage signals. Crucially, this strategy supports asynchronous termination if safety constraints are violated, maintaining reactive capability.
  • Group Advantage Standardization: Advantages $A_i$ are computed and standardized relative to a group of $G$ rollouts generated from the same initial state: $$A_i = \frac{r_i - \mathrm{mean}(\{r_1, \dots, r_G\})}{\mathrm{std}(\{r_1, \dots, r_G\})}$$ This ensures the reinforcement signal specifically rewards coherent trajectory hypotheses that outperform the group mean (a minimal sketch follows this list).
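The standardization itself is a few lines; this sketch assumes raw scalar rewards per rollout:

```python
import numpy as np

def group_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """TC-GRPO group-relative advantages: standardize the G rollout rewards
    from one initial state, so only above-mean hypotheses are reinforced."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# e.g. G = 4 rollouts from the same state:
# group_advantages(np.array([0.9, 0.2, 0.5, 0.4]))  # positive only for the first
```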

The Multi-Objective Reward Function:

  • Safety-Criticality ($r_{coll}$): A "bottleneck-style" formulation using the worst-case temporal margin. It computes the Time-To-Collision ($T_t$) through counterfactual interpolation; any momentary safety violation within the sequence dominates the final reward.
  • Navigational Efficiency ($r_{eff}$): Anchors the ego-vehicle's progress within a target interval relative to expert demonstrations, penalizing both sluggishness and overly aggressive deviations. (Both terms are sketched below.)
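A hedged sketch of how these two terms might be computed; the thresholds and shaping functions below are assumptions for illustration, not the paper's values:

```python
import numpy as np

def collision_reward(ttc: np.ndarray, gamma_safe: float = 2.0) -> float:
    """Bottleneck-style safety term: the worst-case temporal margin over the
    rollout dominates, so one momentary violation caps the whole reward."""
    worst = ttc.min()                                # assumed bottleneck aggregation
    return float(np.clip(worst / gamma_safe, 0.0, 1.0))

def efficiency_reward(progress: float, lo: float = 0.8, hi: float = 1.2) -> float:
    """Anchors ego progress (relative to the expert) inside [lo, hi]."""
    if progress < lo:
        return progress / lo                         # penalize sluggishness
    if progress > hi:
        return max(0.0, 1.0 - (progress - hi))       # penalize aggressive deviation
    return 1.0
```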

4. Iterative Refinement: On-Policy Generator Optimization (OGO)

While the discriminator reranks candidates, the generator must be shifted toward high-reward manifolds. On-policy Generator Optimization (OGO) converts closed-loop feedback into structured longitudinal optimization signals. OGO preserves the spatial path of a trajectory while modifying only the temporal progression (velocity profile).

  • Safety-driven: deceleration via a fixed ratio $\rho \in (0,1)$, triggered when $T_t < \gamma_{safe}$.
  • Efficiency-driven: acceleration via a fixed ratio $\rho' > 1$, triggered when ego progress lags and $T_t$ exceeds the safety threshold.

These optimized segments $\tau^{opt}$ are aggregated into an on-policy dataset to fine-tune the generator $G_{\theta}$ via a mean-squared-error loss, progressively shifting the distribution density toward safer behaviors.
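One way to realize the retiming step is sketched below; the arc-length resampling is an assumed implementation detail, with only the ratio $\rho$ and the path-preserving constraint coming from the text:

```python
import numpy as np

def retime_trajectory(tau: np.ndarray, rho: float) -> np.ndarray:
    """tau: (T, 2) waypoints at fixed time steps. Returns waypoints along the
    identical spatial path, but covering rho times the original arc length,
    so rho < 1 decelerates and rho > 1 accelerates."""
    seg = np.linalg.norm(np.diff(tau, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])              # arc length per waypoint
    s_new = np.clip(np.linspace(0.0, s[-1] * rho, len(tau)), 0.0, s[-1])
    x = np.interp(s_new, s, tau[:, 0])
    y = np.interp(s_new, s, tau[:, 1])
    return np.stack([x, y], axis=1)
```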


5. Scaling Training with BEV-Warp Simulation

Large-scale RL requires high-throughput simulation. RAD-2 introduces BEV-Warp, which operates directly in the Bird’s-Eye View feature space, bypassing the computational overhead and “sim-to-real” gaps of image-level rendering.

  • Spatial Equivariance: This is the core technical justification for BEV-Warp. Geometric transformations in the feature space (using a warp matrix $M_{t+1}$ derived from ego-pose deviation) correspond strictly to physical movements in the world.
  • The Mechanism: Synthesized features $B_{t+1}$ are generated via bilinear interpolation: $B_{t+1} = W(B^{ref}_{t+1}, M_{t+1})$ (see the sketch after this list).
  • Advantages: Unlike generative world models, which are susceptible to cumulative temporal drift, BEV-Warp maintains high feature-level fidelity and preserves complex semantic geometries like lane topologies.
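In PyTorch terms, the warp reduces to an affine grid plus bilinear sampling; encoding $M_{t+1}$ as a 2x3 matrix in normalized grid coordinates is an assumption of this sketch:

```python
import torch
import torch.nn.functional as F

def bev_warp(b_ref: torch.Tensor, m: torch.Tensor) -> torch.Tensor:
    """b_ref: (B, C, H, W) reference BEV features; m: (B, 2, 3) warp matrices
    in normalized grid coordinates. Returns synthesized features B_{t+1}
    via bilinear interpolation, mirroring B_{t+1} = W(B_ref, M_{t+1})."""
    grid = F.affine_grid(m, b_ref.shape, align_corners=False)
    return F.grid_sample(b_ref, grid, mode="bilinear", align_corners=False)

# Usage: a pure ego translation by (dx, dy) in normalized coordinates
# b_next = bev_warp(b_t, torch.tensor([[[1.0, 0.0, dx], [0.0, 1.0, dy]]]))
```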

6. The Proof: Benchmarks and Safety Improvements

RAD-2 was validated on the photorealistic Senna-2 benchmark and large-scale closed-loop tests. Quantitative results show that RAD-2 significantly raises the performance ceiling compared to strong diffusion baselines like ResAD.

  • Collision Reduction: RAD-2 achieved a 56% reduction in collision rate (CR) in safety-oriented scenarios.
  • Performance Precision: In head-to-head testing against ResAD, the At-Fault Collision Rate (AF-CR) dropped from 0.264 to 0.092, while Safety@1s improved from 0.418 to 0.730.

Key Performance Metrics (Senna-2 & BEV-Warp):

  1. Collision Rate (CR): Overall frequency of safety incidents.
  2. At-Fault Collision Rate (AF-CR): Incidents specifically attributable to ego-vehicle decision errors.
  3. Safety@1s / Safety@2s: Proportion of clips where the minimum TTC remained above the 1 s or 2 s safety buffer (tallied as in the sketch after this list).
  4. Ego Progress (EP): Reliability of task completion relative to the reference route.
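For concreteness, Safety@1s/2s can be computed from per-clip minimum TTC values; the aggregation below is an assumption about how the benchmark scores clips:

```python
import numpy as np

def safety_at(min_ttc_per_clip: np.ndarray, buffer_s: float) -> float:
    """Fraction of clips whose minimum TTC stayed above the safety buffer."""
    return float((min_ttc_per_clip > buffer_s).mean())

min_ttc = np.array([0.7, 1.4, 2.6, 3.1, 0.9])   # toy per-clip minima, in seconds
print(safety_at(min_ttc, 1.0))                  # Safety@1s = 0.6
print(safety_at(min_ttc, 2.0))                  # Safety@2s = 0.4
```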

7. Conclusion: Takeaways for AI Safety Researchers

The RAD-2 framework demonstrates that safety in autonomous systems is not merely a product of data volume, but of architectural intent.

Three Critical Takeaways:

  1. Inference-time Scaling: By decoupling generation from ranking, we can increase the sample count $M$ at test time to explore a denser action space, identifying safer solutions without retraining the core model.
  2. Temporal Consistency as a Physical Prior: Implementing “latched execution” and TC-GRPO is a prerequisite for solving the credit assignment problem in continuous robotics, ensuring rewards are tied to coherent intentions.
  3. The Efficiency of Feature-Level Simulation: Leveraging Spatial Equivariance through feature-warping provides a scalable path for closed-loop RL training, avoiding the fidelity loss and latency of generative video models.

Ultimately, RAD-2 provides a robust methodology for converting sparse, real-world environmental feedback into high-dimensional policy refinements, raising the performance upper bound for safe, human-aligned autonomous agents.

Read the full paper on arXiv · PDF