Aligning Agents via Planning: A Benchmark for Trajectory-Level Reward Modeling

Introduces Plan-RewardBench, a trajectory-level preference benchmark for evaluating reward models in tool-using agent scenarios, and benchmarks three RM families (generative, discriminative, and general LLM judges) on it.

arXiv:2604.08178 · Empirical Study

Jiaxuan Wang, Yulan Hu, Wenjin Yang, Zheng Pan et al.

reward-modeling · trajectory-level-preferences · tool-use-agents · rlhf-benchmarking · agentic-alignment · long-horizon-planning

1. Introduction: The Evolution from Chatbots to Proactive Agents

We are witnessing the death of the passive chatbot. The era of Large Language Models (LLMs) that merely react to prompts with static text is being superseded by “proactive agents”—systems capable of autonomous tool invocation, environment interaction, and multi-step reasoning. In this new agentic frontier, model behavior is no longer defined by a single response; it is defined by a trajectory.

A trajectory is a complex sequence: user intent, internal reasoning, tool calls, environment feedback, and strategy shifts. This shift imposes a massive new requirement on Reinforcement Learning from Human Feedback (RLHF). Our Reward Models (RMs) must stop judging the destination and start judging the journey. If we only evaluate the final word, we miss the logic gaps, safety violations, and inefficiencies lurking in the intermediate steps. To solve this, we are introducing Plan-RewardBench, a trajectory-level preference benchmark designed to measure how well reward models handle the messy, long-horizon reality of agentic planning.
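To make the unit of evaluation concrete, here is a minimal sketch of how a trajectory might be represented. The schema is our own illustrative assumption, not the paper's data format:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Step:
    """One turn in an agent episode (hypothetical schema)."""
    role: str                         # "user" | "assistant" | "tool"
    content: str                      # reasoning text, user message, or raw tool output
    tool_call: Optional[dict] = None  # {"name": ..., "args": ...} when a tool is invoked

@dataclass
class Trajectory:
    """A full episode: the unit a trajectory-level reward model must score."""
    user_intent: str
    steps: list[Step] = field(default_factory=list)
```

A response-level RM sees only the last assistant message; a trajectory-level RM receives the entire `steps` list, including every tool call and the environment feedback it produced.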

2. The Evaluation Gap: Why Standard Benchmarks Fall Short

The current evaluation landscape is built for a different age. Existing Reward Model benchmarks primarily focus on short-context, response-level preferences. Even specialized “tool-use” benchmarks often only check if a single API call is formatted correctly, completely ignoring the coherence of a long-horizon plan.

This creates a “critical void” in alignment. Standard evaluators often fall victim to fluent but false reasoning—what we call “tool-grounded contradictions”—where an agent provides a polished answer that actually contradicts the raw data returned by a tool. Plan-RewardBench closes this gap by shifting the unit of evaluation from the response to the full trajectory.
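As a hypothetical illustration of such a contradiction: in the snippet below, the final answer is fluent and confident, yet it flatly contradicts the raw tool output. A response-level evaluator that never sees the log has no way to catch this (all names and values are invented for illustration):

```python
# Invented example of a "tool-grounded contradiction".
tool_log = {"tool": "get_weather", "result": {"city": "Oslo", "temp_c": -3}}
final_answer = "It's a pleasant 18°C in Oslo today, perfect for a walk!"

# A response-level RM scores only `final_answer` and sees polished prose.
# A trajectory-level RM can cross-check the claim against `tool_log`:
claimed_temp, actual_temp = 18, tool_log["result"]["temp_c"]
print(claimed_temp == actual_temp)  # False -> fluent but false
```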

| Benchmark | Unit of Evaluation | Multi-turn Context | Tool Execution Logs | Error Recovery |
| --- | --- | --- | --- | --- |
| RewardBench / FC-RewardBench | Single Response / Tool Call | No | No | No |
| Plan-RewardBench (Ours) | Full Trajectory | Yes | Yes | Included (Env Feedback) |

3. Inside Plan-RewardBench: The Four Pillars of Agentic Evaluation

We designed Plan-RewardBench around four representative “task families” that serve as a stress test for modern agents (a rubric sketch follows the list):

  • Safety Refusal: Evaluating trajectory-level safety. We distinguish between robust policy-based refusals and “late refusal” (where an agent starts a harmful task before stopping) or unsafe compliance.
  • Tool-Irrelevance / Unavailability: Assessing if an agent stays honest about its limits. We treat Tool-Grounded Fabrication—claiming tool use without an actual call—as a critical failure (score 1 in our rubric).
  • Complex Planning: Testing tool-grounded logic. This pillar penalizes agents that fabricate facts contradicting tool responses or generate redundant, inefficient plans.
  • Robust Error Recovery: Measuring how agents handle setbacks. We reward a “strategy shift” (diagnosing and fixing an error) and penalize a “blind retry” (repeating a failed command without change).
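
To show how such a rubric might be operationalized, here is a minimal scoring sketch. Only the mapping of Tool-Grounded Fabrication to score 1 comes from the text above; every other label and score is an illustrative assumption on a hypothetical 1-5 scale:

```python
# Hypothetical rubric: only "tool_grounded_fabrication" -> 1 is stated above;
# the remaining labels and scores are illustrative assumptions.
RUBRIC = {
    "unsafe_compliance": 1,
    "tool_grounded_fabrication": 1,  # claimed tool use without an actual call
    "late_refusal": 2,               # started the harmful task before stopping
    "blind_retry": 2,                # repeated a failed command without change
    "redundant_plan": 3,
    "strategy_shift_recovery": 4,    # diagnosed and fixed the error
    "robust_policy_refusal": 5,
}

def trajectory_score(labels: list[str]) -> int:
    """Score a labeled trajectory by its worst behavior (5 if unlabeled)."""
    return min((RUBRIC[label] for label in labels), default=5)

print(trajectory_score(["redundant_plan", "blind_retry"]))  # 2
```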

4. The “Secret Sauce”: Constructing High-Resolution Hard Negatives

To ensure models are judged on logic rather than superficial cues like verbosity, we developed a “reusable blueprint” for creating preference data. Our dataset is composed of 70% natural rollouts from Qwen and OpenAI models, 8% rule-based injections, and 22% perturbations.

The true “logic-trap” for evaluators lies in our Minimal-edit Perturbations. We take a successful, high-scoring trajectory and make surgical edits to the assistant’s reasoning text while leaving the tool logs identical. This forces the Reward Model to identify failures in planning logic rather than just spotting an execution error. This rigorous approach was validated by a multi-LLM judge panel and a human audit, achieving a substantial Cohen’s κ agreement of 0.71 to 0.86, proving our labels are high-fidelity signals for alignment.
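A minimal sketch of the perturbation idea, under an assumed step schema: corrupt only the assistant's reasoning and assert that the tool logs stay byte-identical, so an evaluator can only separate the pair on planning logic:

```python
import copy

def minimal_edit_perturbation(trajectory: list[dict], corrupt) -> list[dict]:
    """Build a hard negative: edit only assistant reasoning, keep tool logs intact.

    Assumed schema: each step is {"role": ..., "content": ...}; `corrupt` is
    any text-level edit, e.g. inverting a conclusion or swapping a constraint.
    """
    negative = copy.deepcopy(trajectory)
    for step in negative:
        if step["role"] == "assistant":
            step["content"] = corrupt(step["content"])

    # Invariant: everything the tools returned is byte-identical across the pair.
    assert [s["content"] for s in trajectory if s["role"] == "tool"] == \
           [s["content"] for s in negative if s["role"] == "tool"]
    return negative
```

Because the tool logs match exactly, any preference signal between the chosen and rejected trajectory must come from the reasoning text, which is precisely the trap described above.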

5. The Reality Check: How Modern Evaluators Stack Up

We tested three evaluator families: Discriminative RMs (Pointwise), Generative RMs (Pairwise), and General LLM Judges. The results are a wake-up call for the industry:

  • Top Performers: Qwen-Plus (69.96%) and the scalar Inf-ORM-70B (69.21%) lead the pack, but no model dominates across the board.
  • Safety-First, Logic-Second: Safety Refusal accuracy is highly polarized. For example, GPT-5 achieved the highest Safety Refusal score (84.80%) but a much lower overall average (68.54%) due to poor performance on Multi-turn Planning. SOTA models are often over-aligned for safety at the expense of agentic logic.
  • Context Collapse: We observed a sharp performance drop-off once trajectories exceed the 32k token threshold. Pairwise judges suffer significantly more because they must process two full trajectories simultaneously, doubling the context load and leading to a collapse in judgment quality (see the sketch after this list).
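
The arithmetic behind the pairwise penalty is simple; the sketch below makes it explicit (the overhead figure and function names are our own assumptions):

```python
# Why pairwise judging hits the context ceiling roughly twice as fast.
CONTEXT_LIMIT = 32_000   # tokens; the threshold where collapse is observed
PROMPT_OVERHEAD = 1_000  # assumed budget for rubric and instructions

def fits_pointwise(traj_tokens: int) -> bool:
    """Pointwise RM: one trajectory plus the prompt must fit."""
    return traj_tokens + PROMPT_OVERHEAD <= CONTEXT_LIMIT

def fits_pairwise(tokens_a: int, tokens_b: int) -> bool:
    """Pairwise judge: BOTH trajectories share a single context window."""
    return tokens_a + tokens_b + PROMPT_OVERHEAD <= CONTEXT_LIMIT

print(fits_pointwise(20_000))         # True
print(fits_pairwise(20_000, 20_000))  # False: the pairwise judge saturates first
```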

6. The Diagnosis: Why Judges Fail

Our diagnostic analysis revealed four qualitative failure modes that every developer should have on their radar:

  • Effort Bias: Evaluators often reward “performative” behavior. They praise agents for using tools even when a direct, efficient answer was possible, essentially training agents to be verbose rather than effective.
  • Compliance Inertia (The Halo Effect): Judges often overlook a safety violation at the very end of a trajectory if the agent was helpful and compliant in all previous turns.
  • Stale Constraints: Evaluators struggle to track mid-trajectory updates. If a user changes their mind halfway through, weak judges still reward the agent for completing its original, now-irrelevant plan.
  • Superficial Recovery: Many reward models praise a “blind retry” simply because the agent attempted a fix, failing to notice that the retry had no logical chance of succeeding (a simple detector is sketched after this list).
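
As a minimal sketch of how one might flag this failure mode automatically (the schema and function are our own illustration, not the paper's tooling):

```python
def is_blind_retry(prev_call: dict, prev_result: dict, next_call: dict) -> bool:
    """Flag a blind retry: the previous tool call failed and the agent
    reissues the identical call with identical arguments.

    Assumed schema: calls are {"name": str, "args": dict}; failed results
    carry an "error" key.
    """
    failed = "error" in prev_result
    identical = (prev_call["name"] == next_call["name"]
                 and prev_call["args"] == next_call["args"])
    return failed and identical

# A repeat of the same failing query verbatim should be penalized; a strategy
# shift (different tool or corrected arguments) should be rewarded instead.
call = {"name": "search_flights", "args": {"date": "2025-13-01"}}  # invalid date
print(is_blind_retry(call, {"error": "invalid date"}, call))  # True
```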

7. Conclusion: The Future of Agentic Alignment

Plan-RewardBench is more than just a leaderboard; it is a recipe for building better agents. We have shown that alignment is shifting from “outcome-based” to “process-based.” If we want reliable agents, we must train them with reward signals that understand the nuances of planning and tool grounding.

Key Takeaways for Developers:

  1. Trajectory is King: Move beyond final-word evaluation. Your reward signals must encompass thoughts, tool calls, and recoveries.
  2. Beware the Long-Horizon: Performance collapses after 32k tokens. Specialized training is required to help models maintain logic in deep trajectories.
  3. Logic over Verbosity: Fight “Effort Bias.” Ensure your RMs penalize unnecessary tool use and prioritize efficient, grounded reasoning.

The code and benchmark data for Plan-RewardBench are being released on HuggingFace and GitHub. It’s time to move beyond the response and start aligning the plan.

Read the full paper on arXiv · PDF