Daily Paper

PokeVLA: Empowering Pocket-Sized Vision-Language-Action Model with Comprehensive World Knowledge Guidance

Recent advances in Vision-Language-Action (VLA) models have opened new avenues for robot manipulation, yet existing methods exhibit limited efficiency and a lack of high-level knowledge and spatial...

arXiv:2604.20834 Empirical Study

Yupeng Zheng, Xiang Li, Songen Gu, Yuhang Zheng et al.



1. Introduction: The Fragility of Current VLA Systems

Current Vision-Language-Action (VLA) models, while impressive in controlled laboratory settings, are fragile under the variability of real-world conditions. Models like OpenVLA have demonstrated that scaling parameters does not by itself solve the problem of environmental distribution shift. From a robustness perspective, the failure modes of these models are predictable and stem from three core architectural and data-driven bottlenecks:

  1. Domain Gap in Pre-trained Knowledge: A disconnect exists between general-purpose web knowledge and the specific physical priors required for robotic manipulation.
  2. Lack of Multi-View Spatial Consistency: Standard VLAs often fail on instructions involving absolute or relative positioning because they lack a unified 3D spatial representation across disparate camera views.
  3. Absence of High-Level Goal Guidance: Without an explicit mechanism to filter visual inputs, models struggle to distinguish the manipulation target from irrelevant environmental noise.

PokeVLA emerges as a “pocket-sized” (1.22B parameter) solution to these systematic failures. By implementing a framework of “World Knowledge Guidance,” it achieves state-of-the-art (SOTA) performance, proving that targeted representation injection is more critical for resilience than raw parameter scale.


2. The Two-Stage Training Paradigm: From VLM to VLA

PokeVLA utilizes a two-stage framework designed to first build cognitive foundations and then specialize those foundations for manipulation.

Stage 1: VLM Pre-training (PokeVLM)

We curated 2.4 million samples to train PokeVLM, a tiny-scale model utilizing a Qwen2.5-0.5B backbone. This stage establishes the “common sense” of the physical world—grounding objects in space and understanding their functional affordances.

| Category | Samples | Representative Datasets |
|---|---|---|
| General Understanding | 665K | LLaVA-Instruct-665K |
| Grounding | 694K | RoboPoint, RefSpatial, RoboRefit, RoboSpatial |
| Affordance | 553K | HOVA (Ego4D, Epic100), MolmoAct |
| Reasoning | 511K | RefSpatial (2D Reasoning), Cosmos-Reason1-SFT (Agibot), Cosmos-Reason1-SFT (Holoassist) |
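As a quick check on the data mix, the short sketch below sums the per-category counts from the table and prints each category's share of the total. The dictionary and function names are illustrative only; the counts are copied from the table and nothing else about the sampling strategy is assumed.

```python
# Sample counts (in thousands) copied from the Stage 1 table.
STAGE1_MIX_K = {
    "General Understanding": 665,
    "Grounding": 694,
    "Affordance": 553,
    "Reasoning": 511,
}


def mixture_shares(counts_k):
    """Return each category's share of the total as a fraction in [0, 1]."""
    total = sum(counts_k.values())
    return {name: count / total for name, count in counts_k.items()}


total_k = sum(STAGE1_MIX_K.values())  # thousands of samples
shares = mixture_shares(STAGE1_MIX_K)

for name, share in shares.items():
    print(f"{name:<22} {share:6.1%}")
print(f"Total: {total_k}K samples")
```

The total comes to 2,423K, which matches the roughly 2.4 million samples quoted above, and the four categories are fairly evenly balanced, with grounding the largest.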

Stage 2: VL-Action Post-training

The model transitions to PokeVLA by injecting manipulation-relevant representations. We utilize Action Latents ($a_t \in \mathbb{R}^{(T \times D_a) \times D}$) to generate action chunks, optimizing for efficiency and temporal consistency.


3. Core Methodology: Semantic and Geometric Alignment

PokeVLA’s resilience is built on the explicit injection of semantic and geometric data into the action space.

  • Goal-Aware Segmentation (<SEG> Token): We employ an “embedding-as-mask” paradigm where a special <SEG> token is added to the vocabulary. The hidden state embedding of this token guides a Coarse-to-Fine decoding process.
    • Coarse Stage: Captures holistic contextual relationships via SAM prompt encoders.
    • Fine-Grained Stage: Utilizes the coarse logits as mask prompts to generate refined prediction maps. This ensures the model maintains a unified, cross-view consistent representation of the target object, effectively filtering out background interference.
  • Geometry Alignment: To mitigate the failure of 2D-only models in 3D reasoning, we leverage the VGGT geometric foundation model. During training, we use cosine similarity to align the visual hidden states with VGGT features. This “internalizes” 3D structural awareness into the 1.22B backbone without adding a single millisecond to inference latency.
  • The Action Head Hierarchy: The action head employs a specific 4-module attention hierarchy to fuse representations:
    1. Self-Attention (SA): Over action latents.
    2. Query Cross-Attention (CA): Between latents, query embeddings, and robot state.
    3. Vision CA: Incorporating geometrically-aligned visual states.
    4. SEG CA: Injecting target-specific semantic embeddings.
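The ordering of the four attention modules can be illustrated with a toy, dependency-free sketch. This is not the paper's implementation: it assumes single-head scaled dot-product attention with identity projections and a residual add, and the function names, tensor sizes, and the choice to concatenate the robot state with the query embeddings are all hypothetical. It only shows how action latents could be refined by each context in turn.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, context):
    """Toy attention: identity projections, single head, residual add."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        mixed = [sum(w * v[j] for w, v in zip(weights, context)) for j in range(dim)]
        out.append([a + m for a, m in zip(q, mixed)])
    return out


def action_head_block(latents, query_emb, robot_state, vision_states, seg_emb):
    """Apply the four modules in the order listed above (assumed wiring)."""
    x = attend(latents, latents)               # 1. Self-Attention over action latents
    x = attend(x, query_emb + [robot_state])   # 2. Query CA: query embeddings + robot state
    x = attend(x, vision_states)               # 3. Vision CA: geometry-aligned visual states
    x = attend(x, [seg_emb])                   # 4. SEG CA: target-specific embedding
    return x


# Tiny example with hidden size 4 and two action latents (made-up values).
latents = [[0.1, 0.2, 0.3, 0.4], [0.0, -0.1, 0.2, 0.5]]
query_emb = [[1.0, 0.0, 0.0, 0.0]]
robot_state = [0.0, 1.0, 0.0, 0.0]
vision_states = [[0.5, 0.5, 0.5, 0.5], [0.1, 0.1, 0.1, 0.1], [0.2, 0.0, 0.2, 0.0]]
seg_emb = [0.0, 0.0, 0.0, 1.0]

fused = action_head_block(latents, query_emb, robot_state, vision_states, seg_emb)
print(len(fused), len(fused[0]))
```

The output keeps the shape of the input latents, so a block like this could be stacked; a real action head would add learned projections, multiple heads, normalization, and a final projection to the action space.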

Training Objective: The total loss is defined as $L = L_{action} + \lambda_{seg} L_{seg} + \lambda_{geo} L_{geo}$, where we set $\lambda_{seg} = 0.2$ and $\lambda_{geo} = 0.4$.
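A minimal sketch of how the weighted objective combines its terms is given below. The two weights are taken from the text; the exact form of the geometry term is not given here, so the sketch assumes it is one minus the mean cosine similarity between paired visual hidden states and VGGT features. The function names and the example vectors are hypothetical.

```python
import math

LAMBDA_SEG = 0.2  # weight stated in the text
LAMBDA_GEO = 0.4  # weight stated in the text


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def geometry_alignment_loss(visual_states, vggt_features):
    """Assumed form: 1 - mean cosine similarity over paired token features."""
    sims = [cosine_similarity(h, g) for h, g in zip(visual_states, vggt_features)]
    return 1.0 - sum(sims) / len(sims)


def total_loss(l_action, l_seg, l_geo):
    """L = L_action + lambda_seg * L_seg + lambda_geo * L_geo."""
    return l_action + LAMBDA_SEG * l_seg + LAMBDA_GEO * l_geo


# Made-up example: one token is perfectly aligned, the other is orthogonal.
visual_states = [[1.0, 0.0], [0.0, 1.0]]
vggt_features = [[1.0, 0.0], [1.0, 0.0]]

l_geo = geometry_alignment_loss(visual_states, vggt_features)
loss = total_loss(l_action=1.0, l_seg=0.5, l_geo=l_geo)
print(f"L_geo = {l_geo:.2f}, total = {loss:.2f}")
```

Because the geometry term only enters through the training loss, it needs VGGT features at training time but not at inference, which is consistent with the claim that alignment adds no inference latency.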


4. Failure-First Robustness Analysis: LIBERO-Plus Transfer Performance

The true test of an embodied agent is the “Transfer” setting—evaluating a model on unseen perturbations without task-specific fine-tuning. We compared PokeVLA against OpenVLA-OFT (7B) and VLA-Adapter (1.22B).

Success Rates in LIBERO-Plus Transfer Setting

| Perturbation Category | OpenVLA-OFT (7B) | VLA-Adapter (1.22B) | PokeVLA (1.22B) |
|---|---|---|---|
| Camera Viewpoint | 56.4% | 36.2% | 84.7% |
| Robot Initialization | 31.9% | 37.9% | 46.1% |
| Language Instruction | 79.5% | 74.6% | 84.8% |
| Lighting Conditions | 88.7% | 70.6% | 94.6% |
| Background Texture | 93.3% | 76.1% | 82.6% |
| Sensor Noise | 75.8% | 58.0% | 89.8% |
| Object Layout | 74.2% | 69.7% | 77.2% |
| Average Total | 69.6% | 59.1% | 79.3% |

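PokeVLA’s 20.2 percentage-point improvement in average success rate over the comparable 1.22B baseline (VLA-Adapter) is attributed primarily to its goal-aware segmentation filter. By focusing attention on the manipulation target, the model rejects the visual noise that typically causes VLA policies to drift. The one category where PokeVLA does not lead is Background Texture, where OpenVLA-OFT (93.3%) remains ahead of PokeVLA (82.6%).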
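The gaps in the transfer results can be recomputed directly from the reported success rates. The sketch below copies the numbers from the LIBERO-Plus table and reports percentage-point differences; the variable names are illustrative, and no figures beyond those in the table are assumed.

```python
# Success rates (%) from the LIBERO-Plus transfer table,
# ordered as (OpenVLA-OFT 7B, VLA-Adapter 1.22B, PokeVLA 1.22B).
RESULTS = {
    "Camera Viewpoint": (56.4, 36.2, 84.7),
    "Robot Initialization": (31.9, 37.9, 46.1),
    "Language Instruction": (79.5, 74.6, 84.8),
    "Lighting Conditions": (88.7, 70.6, 94.6),
    "Background Texture": (93.3, 76.1, 82.6),
    "Sensor Noise": (75.8, 58.0, 89.8),
    "Object Layout": (74.2, 69.7, 77.2),
    "Average Total": (69.6, 59.1, 79.3),
}

# Percentage-point gaps of PokeVLA over each baseline.
gap_vs_adapter = {k: round(poke - adapter, 1) for k, (_, adapter, poke) in RESULTS.items()}
gap_vs_oft = {k: round(poke - oft, 1) for k, (oft, _, poke) in RESULTS.items()}

# Categories where the 7B baseline is still ahead of PokeVLA.
behind_oft = [k for k, gap in gap_vs_oft.items() if gap < 0]

# Relative improvement of the average over VLA-Adapter, for comparison.
_, adapter_avg, poke_avg = RESULTS["Average Total"]
relative_gain = (poke_avg - adapter_avg) / adapter_avg

for name in RESULTS:
    print(f"{name:<22} vs Adapter {gap_vs_adapter[name]:+5.1f}  vs OFT {gap_vs_oft[name]:+5.1f}")
print("Behind OpenVLA-OFT on:", behind_oft)
print(f"Relative gain over VLA-Adapter (average): {relative_gain:.1%}")
```

Note the distinction between the absolute gap of 20.2 percentage points on the average and the relative gain, which is larger because the baseline average is below 100%.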


5. Real-World Deployment and Generalization

Deployment was validated on a UFACTORY xArm7 with RealSense D435 sensors. We collected 3,000 trajectories across 97 objects. PokeVLA consistently outperformed baselines in eight complex tasks involving spatial referencing and vertical-level discrimination.

Performance Under Real-World Perturbations (P1–P5): While VLA-Adapter struggled with environmental shifts, PokeVLA maintained a significant gap in robustness:

  • Initial Pose (P1): PokeVLA achieved a 70% success rate on Task 3, compared to VLA-Adapter’s 56%.
  • Lighting Fluctuations (P4): PokeVLA reached 80% success on Task 3, well ahead of the 1.22B baseline (VLA-Adapter).
  • Overall Resilience: Across five perturbation types (Initial Pose, Object Interference, Background, Lighting, and Language), PokeVLA improved the average success rate by 20% over models of a similar scale.

6. Insights for AI Safety and Red-Teaming

For safety practitioners, PokeVLA demonstrates that systematic failures are often data-deficiency problems rather than scaling problems. Our “Failure-First” takeaways include:

  1. Grounding Data Prevents Hallucination: Visual grounding is the primary defense against language-scene matching failures. Without it, models frequently “hallucinate” targets in complex scenes.
  2. Affordance Data and “Grasp Point” Recovery: Explicit affordance training (HOVA/MolmoAct) allows the robot to identify valid interaction regions. This is critical for safety; if the robot’s initial pose is shifted (P1), affordance knowledge allows for grasp point recovery rather than a catastrophic collision or missed attempt.
  3. Reasoning Data as Adversarial Defense: By training on “Cosmos-Reason1-SFT,” the model gains the logical depth required to parse non-standard or “adversarial” instructions that might otherwise trigger a policy failure.

7. Conclusion: The Future of Efficient Embodied AI

PokeVLA proves that SOTA robustness can be achieved with a 1.22B backbone if that backbone is properly “guided” by world knowledge. The success of this model underscores a shift in the field: from raw parameter competition to the strategic injection of grounding, affordance, and reasoning data. Multi-dimensional pre-training is no longer optional for creating resilient AI agents—it is the prerequisite for safe, real-world autonomy.

Project Link: https://getterupper.github.io/PokeVLA
