Daily Paper

Reinforcing 3D Understanding in Point-VLMs via Geometric Reward Credit Assignment

Point-VLMs suffer geometric hallucination where predicted 3D structures contradict observed 2D reality. Geometric Reward Credit Assignment disentangles holistic supervision into field-specific signals, boosting 3D keypoint accuracy from 0.64 to 0.93.

arXiv:2604.21160 Empirical Study
point-vlmsgeometric-hallucinationreinforcement-learningcredit-assignmentembodied-ai3d-understanding

Reinforcing 3D Understanding in Point-VLMs via Geometric Reward Credit Assignment

Point-Vision-Language Models promise embodied agents spatial reasoning they can act on — but they frequently suffer from geometric hallucination, where predicted 3D structures contradict the 2D reality the model is observing. This is not a minor perceptual glitch; it is a systematic failure mode that produces physically impossible spatial predictions, which translate directly into unsafe actions when deployed on real robots.

The Problem: Reward Dilution in Sequence-Level Supervision

The root cause is causal dilution: in standard RL training for point-VLMs, sparse geometric tokens (the ones encoding spatial information) are drowned out by noisy, broadcast sequence-level rewards. The reward signal evaluates the entire output sequence, and geometric tokens receive the same gradient signal as every other token. The model optimizes for sequence-level fluency at the expense of geometric precision.

This is structurally analogous to a known failure mode in the failure-first framework: constraint erosion under optimization pressure. The model receives supervision, but the supervision is structurally incapable of producing the alignment the evaluator expects.

Geometric Reward Credit Assignment

The proposed fix is architecturally clean: Geometric Reward Credit Assignment (GRCA) disentangles holistic supervision into field-specific signals and routes them exclusively to their responsible token spans. Instead of one broadcast reward, each geometric field (x, y, z coordinates, bounding box dimensions, rotation angles) receives its own targeted reward signal.

A Reprojection-Consistency term serves as a cross-modal verifier: it penalizes predicted 3D geometries that, when projected back to 2D, contradict the observed image. This is a physical constraint, not a learned one — predictions must be geometrically consistent with the evidence.

Results

On a calibrated benchmark derived from ShapeNetCore:

MetricBaselineGRCA (Ours)
3D KPA0.640.93
3D BBox IoU0.686
Reprojection Consistency0.852

The 0.64 → 0.93 KPA jump crosses the threshold from “plausible text” to “physically verifiable spatial predictions.”

Failure-First Implications

The geometric hallucination problem and its fix through targeted credit assignment reveal a general principle: safety-relevant tokens need safety-relevant supervision. Broadcasting a single reward across all output tokens dilutes the signal that matters most for embodied safety — the spatial tokens that determine what the robot actually does.

The reprojection consistency term is notable as a physics-based verification mechanism: before executing a spatial action, verify that the predicted outcome is consistent with current observation. This is a different verification strategy than text-only refusal detection, and it translates directly to physical systems.

Read the full paper on arXiv · PDF