Pause or Fabricate? Training Language Models for Grounded Reasoning
Large language models have achieved remarkable progress on complex reasoning tasks.
Pause or Fabricate? Training Language Models for Grounded Reasoning
Introduction: The High Cost of Confident Fabrication
In the current landscape of large language model (LLM) evaluation, standard benchmarks operate under a strong implicit assumption: every input is information-complete. When presented with a mathematical word problem, the model is expected to proceed directly to an answer. However, real-world deployment rarely mirrors these sanitized conditions. Users provide information incrementally, express requirements ambiguously, or omit critical premises entirely.
When faced with these gaps, most current LLMs exhibit a “covert” failure mode we term ungrounded reasoning. Rather than identifying missing information, models implicitly fabricate the necessary premises and proceed with elaborate reasoning chains built on invented foundations. These outputs often appear internally coherent and valid, yet they are fundamentally unmoored from reality. This failure stems not from a lack of raw reasoning capability, but from a lack of inferential boundary awareness—the meta-cognitive ability to recognize when the necessary premises for valid inference are absent.
Crucially, this is an architectural byproduct of standard training. Current RLHF (Reinforcement Learning from Human Feedback) and RLVR (Reinforcement Learning from Vocabulary Rewards) paradigms create massive optimization pressure to produce an answer at all costs. This “pleaser” dynamic forces models to override their own uncertainty signals, prioritizing response completion over grounded truth.
The Anatomy of a Failure: Early Uncertainty vs. Final Hallucination
Empirical analysis of model trajectories reveals a characteristic “early suspicion, late action” pattern. Often, a model will express initial uncertainty—using phrases like “we need to know”—only to override its own signal seconds later. Within the same response, it typically transitions to fabrication (“let us assume”) and generates long, baseless inference chains. This represents a fundamental failure of inferential control: the model identifies a gap but lacks the mechanism to halt.
To quantify this behavior, we employ two specific metrics:
- GapRatio: The proportion of tokens generated after the model first expresses uncertainty (). A higher ratio indicates more severe ungrounded inference.
- Premise Detection Rate (PD): The frequency with which a model correctly identifies an input as incomplete and requests clarification rather than attempting a solution.
Current models frequently exhibit GapRatios exceeding 47.7% while maintaining PD rates below 41%. This confirms that while models often possess the “suspicion” of a gap, they are structurally unable to act on it.
Introducing GRIL: Grounded Reasoning via Interactive Reinforcement Learning
To address this, the Grounded Reasoning via Interactive Reinforcement Learning (GRIL) framework rehashes reasoning as a sequential decision-making process. GRIL moves beyond single-turn response generation, training the model to interact with its environment to ensure grounding. The reasoning process is decomposed into two distinct stages:
- Stage 1: Clarify and Pause: The model assesses input sufficiency. If premises are missing, it must proactively halt and request the specific information needed.
- Stage 2: Grounded Reasoning: Once the “Interactive Environment” provides the missing premise, the model integrates this new data to solve the task.
The environment utilizes asymmetric feedback dynamics. If a model attempts to solve an incomplete problem, it receives negative feedback. Conversely, a correct clarify action triggers the provision of the missing premise, enabling the transition to Stage 2.
The Reward Mechanism: Incentivizing Silence and Accuracy
The core of GRIL is a stage-specific reward design that uses mathematical rigor to penalize hallucination trajectories and reward efficient detection.
- Detection Rewards (): To solve the “late action” issue, we implement a time-decay factor. The reward is formulated as: Where is the base reward, is the decay factor (e.g., 0.5), and is the number of tokens generated before clarification. This provides a strong incentive to stop as early as possible.
- Solving Rewards (): Success is only rewarded if the final answer is correct after information integration. This ensures clarification is a means to accuracy, not a “lazy” escape.
- Penalty for Over-Caution: To prevent “false alarms” on complete problems, a penalty coefficient () is applied for unnecessary clarifications, forcing the model to learn a precise decision boundary.
Empirical Benchmarks: Quantifying the Shift in Inferential Control
The impact of GRIL training is most evident on the GSM8K-Insufficient benchmark. Notably, GRIL significantly outperforms both zero-shot baselines and Supervised Fine-Tuning (SFT), proving that RL is superior to simple imitation for learning boundary awareness.
| Model & Method | Success Rate (SR) | Premise Detection (PD) | Response Length (Tokens) |
|---|---|---|---|
| Qwen2.5-1.5B (Base) | 1.8% | 4.6% | 810 |
| Qwen2.5-1.5B (SFT) | 31.6% | 64.6% | 750 |
| Qwen2.5-1.5B (GRIL) | 61.6% | 90.8% | 479 |
| Qwen2.5-3B (Base) | 20.6% | 28.0% | 887 |
| Qwen2.5-3B (SFT) | 62.7% | 90.3% | 575 |
| Qwen2.5-3B (GRIL) | 73.5% | 88.0% | 448 |
GRIL achieves a 34x improvement in Success Rate for the 1.5B scale and a 40-70% reduction in response length. By learning to stop ungrounded chains, the model becomes both more accurate and significantly more efficient.
Safety and Robustness: Beyond Mathematical Reasoning
GRIL’s benefits generalize to out-of-distribution (OOD) domains, suggesting the model learns a meta-cognitive boundary rather than a domain-specific heuristic:
- Multi-Domain Generalization: GRIL improved PD and SR on HotpotQA-Insufficient (multi-hop) and CommonsenseQA-Insufficient. The model learns when “knowing” is impossible across diverse logical structures.
- Resilience to Noisy Interactions: In real-world scenarios where user feedback is messy, GRIL-trained models show remarkable robustness. On Qwen2.5-1.5B, the Success Rate under Noisy Feedback (irrelevant conversation) jumped from 12.8% to 47.2%—a 3.6x improvement in the ability to filter distractions and extract relevant premises. It also handles Uninformative Responses (e.g., “I don’t know”) by terminating gracefully rather than hallucinating a path forward.
Conclusion: Strategic Takeaways for AI Practitioners
Inferential boundary awareness is a distinct, trainable skill. The “capability boost” observed—where training for detection improves performance on complete problems—suggests that learning when to pause forces the model into a more thorough structural analysis of every prompt.
Practitioner Takeaways:
- Architecture: Moving Beyond “Answering Machines”: Optimization pressure in standard RLHF forces models to provide answers to unsolvable queries. Deployment-ready models require explicit “Clarify and Pause” training.
- Implementation: Prioritize Time-Decay Incentives: To curb hallucination trajectories, use with exponential decay (). This effectively counters the model’s tendency to override its own internal uncertainty signals.
- Safety as a Capability Enhancer: Training for boundary awareness does not degrade performance; it enhances it. By forcing a structural assessment of problem sufficiency, models become more effective at grounded solving for all query types.
Limitations: While GRIL demonstrates high efficacy, it currently utilizes an “oracle-like” environment for premise provision. Future iterations must address real-scenario dialogues where user feedback may be partial, inconsistent, or driven by subjective preferences.
Read the full paper on arXiv · PDF