A Physical Agentic Loop for Language-Guided Grasping with Execution-State Monitoring
Introduces a physical agentic loop that wraps learned grasp primitives with execution monitoring and bounded recovery policies to handle failures in language-guided robotic manipulation.
The Hook: Why Robots “Fail Silently”
Current language-guided robotic systems, including state-of-the-art Vision-Language-Action (VLA) models, typically operate in a “plan once, execute once” (open-loop) manner. This approach relies on an implicit assumption that the physical world will perfectly mirror the model’s intent. In practice, however, robots frequently encounter silent execution failures: the gripper may stall, an object might slip during the lift, or a visual distractor may lead the robot to pick the wrong item entirely.
For AI safety and red-teaming practitioners, these failures are catastrophic precisely because they are not surfaced to the decision layer in a structured way. When a robot “hallucinates” task completion while its gripper is empty, the failure is “covert”—evading standard evaluations that only track coordinate reaching. To bridge this gap, physical actions must be treated like digital “tool calls” in software agents, characterized by a standardized lifecycle (start, progress, success, or failure) that makes physical outcomes legible to the agent’s high-level reasoning.
The “Physical Agentic Loop” Framework
Recent research introduces a Physical Agentic Loop that reformulates robotic manipulation as a bounded autonomous process operating over grounded execution states. A significant value proposition of this framework is its “wrapper” architecture: it improves the reliability of existing, unmodified manipulation primitives (such as ForceSight) without requiring expensive retraining or risking the “catastrophic forgetting” associated with fine-tuning foundation models.
The loop consists of four distinct stages:
- Observe: The agent receives a structured goal (e.g., “I want the toy”) and processes current RGB-D observations.
- Act: The robot invokes a manipulation primitive to perform the grasp-and-lift sequence.
- Evaluate: This is the critical “missing link.” The system infers a discrete execution outcome from sensory feedback, converting continuous telemetry into a physical tool-state.
- Decide: A deterministic policy maps the evaluated outcome to a high-level decision, such as finalizing the task or triggering a recovery.
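The four stages can be sketched as a small control loop. This is a minimal illustration, not the authors' implementation: `run_loop`, `StepRecord`, the callback signatures, the `max_steps` cap, and the `"RETRY"`/`"FINALIZE"` strings are all hypothetical names chosen for this example, and the stubs at the bottom stand in for real perception and control.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class StepRecord:
    """One pass through Observe -> Act -> Evaluate -> Decide."""
    attempt: int
    outcome: str
    decision: str


def run_loop(
    observe: Callable[[], Any],
    act: Callable[[Any], Any],
    evaluate: Callable[[Any], str],
    decide: Callable[[str, int], str],
    max_steps: int = 3,  # assumed cap; guarantees the loop terminates
) -> List[StepRecord]:
    trace: List[StepRecord] = []
    for attempt in range(1, max_steps + 1):
        observation = observe()             # Observe: goal + current sensing
        telemetry = act(observation)        # Act: invoke the grasp primitive
        outcome = evaluate(telemetry)       # Evaluate: discrete outcome label
        decision = decide(outcome, attempt) # Decide: deterministic mapping
        trace.append(StepRecord(attempt, outcome, decision))
        if decision != "RETRY":
            break
    return trace


# Stubbed usage: the first grasp comes back EMPTY, the second succeeds.
scripted_outcomes = iter(["EMPTY", "SUCCESS"])
trace = run_loop(
    observe=lambda: {"goal": "toy"},
    act=lambda observation: {"closure": 0.0},
    evaluate=lambda telemetry: next(scripted_outcomes),
    decide=lambda outcome, attempt: "RETRY" if outcome == "EMPTY" else "FINALIZE",
)
for record in trace:
    print(record)
```

Because the primitive is only reached through the `act` callback, the loop never needs to know how the grasp is produced, which is the sense in which the framework acts as a wrapper.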
The Watchdog Layer: Converting Noise into Data
The “Evaluate” stage is powered by Watchdog, a modular monitoring layer designed to produce a physical state abstraction from noisy sensor streams. Unlike simple binary detectors, Watchdog employs contact-aware fusion and temporal stabilization to interpret gripper telemetry (effort, current dynamics, and closure) alongside post-action visual cues.
To ensure high precision, Watchdog requires a “short settle window” where cues remain consistent and utilizes micro-lift evidence to reject contact-only false positives—instances where the gripper touches an object but fails to secure it. This process converts continuous signals into the discrete “tool-states” listed below.
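One way to picture the settle window and the micro-lift check is the sketch below. It is an assumption-laden simplification rather than Watchdog's actual logic: the function names, the per-frame label input, and the default window of five frames are hypothetical, and the real system fuses several telemetry and visual cues instead of two booleans.

```python
from typing import Optional, Sequence


def stable_label(raw_labels: Sequence[str], settle_frames: int = 5) -> Optional[str]:
    """Return a label only if the last `settle_frames` readings all agree.

    `settle_frames` is an assumed window length, not a value from the paper.
    """
    if settle_frames <= 0 or len(raw_labels) < settle_frames:
        return None
    window = raw_labels[-settle_frames:]
    first = window[0]
    return first if all(label == first for label in window) else None


def acquisition_confirmed(contact_detected: bool, holds_after_micro_lift: bool) -> bool:
    """Contact alone is not enough; the object must also survive a small lift."""
    return contact_detected and holds_after_micro_lift


# A flickering stream yields no label until the readings settle.
print(stable_label(["SUCCESS", "EMPTY", "SUCCESS"], settle_frames=3))
print(stable_label(["EMPTY", "SUCCESS", "SUCCESS", "SUCCESS"], settle_frames=3))
# Touching the object without holding it through the lift is rejected.
print(acquisition_confirmed(contact_detected=True, holds_after_micro_lift=False))
```

Returning `None` while the cues disagree keeps the decision layer from acting on a transient reading.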
Watchdog Outcome Labels: The New Physical Tool-State
| Label | Definition (high-level) |
|---|---|
| SUCCESS | Execution evidence consistent with stable object acquisition (physical completion). |
| EMPTY | Execution evidence consistent with no object acquired (recoverable failure candidate). |
| SLIP | Transient acquisition followed by loss consistent with slipping. |
| WEAK | Marginal or unstable acquisition consistent with a fragile grasp. |
| STALL | Execution fails to progress to a stable terminal signature (stalled motion/closure). |
| TIMEOUT | Execution exceeds a fixed time budget without reaching a stable terminal signature. |
Bounded Recovery: The Policy for Safe Retries
In embodied AI, “boundedness” is a core safety requirement to prevent infinite loops or unsafe oscillations. The agentic loop employs a deterministic policy that maps each Watchdog outcome to a decision drawn from a small, fixed decision space.
The logic is strictly governed by a fixed retry budget to guarantee finite termination:
- If EMPTY: The agent triggers a RETRY/RESELECT action. This may involve a bounded re-selection step—re-evaluating candidate object instances and repositioning before the next attempt.
- If Ambiguity Persists: If the system cannot confirm a successful grasp after the retry budget is exhausted, it escalates to WAIT CLARIFY, bringing a human into the loop to resolve semantic or physical uncertainty.
- If Physical Failure (SLIP/WEAK/STALL/TIMEOUT): The system defaults to FINALIZE (safe termination). This conservative approach prevents the robot from performing potentially damaging maneuvers after mechanical instability is detected.
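The rules above amount to a small lookup with a counter. The sketch below is one possible encoding, assuming a retry budget of 2 (the article does not state the value) and using hypothetical identifiers `Outcome`, `Decision`, and `decide`; `WAIT_CLARIFY` is written with an underscore only to make it a valid Python name.

```python
from enum import Enum


class Outcome(Enum):
    SUCCESS = "SUCCESS"
    EMPTY = "EMPTY"
    SLIP = "SLIP"
    WEAK = "WEAK"
    STALL = "STALL"
    TIMEOUT = "TIMEOUT"


class Decision(Enum):
    FINALIZE = "FINALIZE"
    RETRY_RESELECT = "RETRY/RESELECT"
    WAIT_CLARIFY = "WAIT CLARIFY"


def decide(outcome: Outcome, retries_used: int, retry_budget: int = 2) -> Decision:
    """Deterministic outcome-to-decision mapping with a fixed retry budget.

    `retry_budget=2` is an assumed value for illustration.
    """
    if outcome is Outcome.SUCCESS:
        return Decision.FINALIZE
    if outcome is Outcome.EMPTY:
        # Recoverable: retry while budget remains, then ask a human.
        if retries_used < retry_budget:
            return Decision.RETRY_RESELECT
        return Decision.WAIT_CLARIFY
    # SLIP, WEAK, STALL, TIMEOUT: stop conservatively.
    return Decision.FINALIZE


print(decide(Outcome.EMPTY, retries_used=0))
print(decide(Outcome.EMPTY, retries_used=2))
print(decide(Outcome.SLIP, retries_used=0))
```

Since `retries_used` only grows and every branch other than the budgeted retry is terminal, a caller that increments the counter after each retry is guaranteed to stop after at most `retry_budget` retries.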
Empirical Proof: How the Agentic Loop Outperforms Baselines
Validation was performed on the Hello Robot Stretch 3 platform using an eye-in-hand D405 camera. The experiments compared an open-loop baseline (ForceSight) against the Agentic Loop wrapper across scenarios specifically designed to induce failures.
Quantitative Results (Success Rates)
| Scenario | Baseline Success Rate | Agentic Loop Success Rate |
|---|---|---|
| Single Target | 80% | 100% |
| Ambiguity (Color/Spatial) | 40% | 80% |
| Distractor Robustness | 0% | 100% |
| Multiple Identical Targets | 10% | 100% |
A qualitative behavioral trace of an “Induced Empty Grasp” scenario demonstrates the loop’s efficacy: when a target is moved during execution, Watchdog identifies the resulting EMPTY state via gripper closure dynamics, triggers a bounded retry, and successfully secures the object on the second attempt. This robustness comes with marginal cost; the mean runtime for the agentic loop is 15.94s, representing only a 7.8% increase in latency over the 14.78s baseline.
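The latency overhead follows directly from the two reported mean runtimes. The short check below uses only those two figures; the variable names are illustrative.

```python
baseline_s = 14.78  # mean runtime of the open-loop baseline, as reported
agentic_s = 15.94   # mean runtime of the agentic loop, as reported

overhead_s = agentic_s - baseline_s
overhead_pct = 100 * overhead_s / baseline_s

print(f"{overhead_s:.2f} s extra, {overhead_pct:.1f}% slower")
# prints "1.16 s extra, 7.8% slower"
```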
Why This Matters for AI Safety Researchers
This research provides a concrete model for building failure-aware embodied systems. By forcing a robot to verify its own execution outcomes, we can surface the “covert failures” that typically evade safety benchmarks. The transition from continuous sensor noise to discrete, “legible” events allows high-level planners to operate with a grounded understanding of the physical world, rather than hallucinating task progress.
For safety researchers, the modularity of this approach is key: execution monitoring serves as a plug-and-play safety layer that can be wrapped around any VLA or black-box controller to enforce deterministic, bounded, and interpretable behavior.
Conclusion & Key Takeaways
True robustness in robotics is not simply a product of better perception models, but of a more reliable, closed-loop architecture. By treating physical interactions as verifiable tool calls, we can transform fragile, single-shot execution into a resilient agentic process.
Final Takeaways:
- Visibility: Physical outcomes become legible and actionable when execution states (like EMPTY or SLIP) are discretized through contact-aware fusion.
- Safety: Bounded recovery policies with fixed retry budgets ensure finite termination and prevent unsafe robot oscillations.
- Modularity: This framework acts as a modular safety wrapper that can be applied to unmodified, existing manipulation primitives.
Resources
For source code, demonstrations, and further technical details, visit the project page: https://wenzewwz123.github.io/Agentic-Loop/