Daily Paper

Can Vision Language Models Judge Action Quality? An Empirical Evaluation

Comprehensive evaluation of state-of-the-art Vision Language Models on Action Quality Assessment tasks, revealing systematic failure modes and biases that prevent reliable performance.

arXiv:2604.08294 · Empirical Study

Miguel Monte e Freitas, Rui Henriques, Ricardo Rei, Pedro Henrique Martins

vision-language-models · action-quality-assessment · fine-grained-video-understanding · model-bias-analysis · embodied-task-evaluation · prompt-engineering-limitations


1. Introduction: The Gap Between General Capability and Fine-Grained Precision

Current state-of-the-art Vision Language Models (VLMs), such as Gemini 3.1 Pro Preview and Qwen3-VL-235B-A22B, are frequently lauded for their “general intelligence” and zero-shot reasoning across multiple modalities. However, for those of us in the safety and red-teaming space, these broad capabilities mask a significant, high-risk deficiency: the inability to perform Action Quality Assessment (AQA) with any degree of reliability.

AQA—the automated evaluation of how well an action is executed—is a critical frontier for embodied AI. Its applications are high-stakes, ranging from autonomous physical therapy feedback to professional sports coaching and objective competitive judging. Despite their massive parameter counts and extensive pre-training, empirical evidence reveals that off-the-shelf VLMs perform at near-random levels on AQA tasks. This is not merely a data scarcity issue; it is a fundamental diagnostic failure driven by deep-seated systematic biases where linguistic priors consistently override visual ground truth.

2. The Empirical Challenge: Benchmarking the Best

A recent comprehensive evaluation tested the flagship models from three leading VLM families to determine if general multimodal training translates to movement expertise.

Models Evaluated:

  • Gemini 3.1 Pro Preview: Google’s most capable model, configured for high thinking levels.
  • Qwen3-VL-235B-A22B: Evaluated in both Instruct and Thinking (reasoning-based) variants.
  • InternVL3.5-241B-A28B: An open-source flagship tested in standard and thinking modes.

Evaluation Scope: The models were subjected to a rigorous battery across five specific datasets covering diverse activity domains:

  • Gym/Fitness: LLM-FMS (image-based VQA), EgoExo-Fitness (egocentric/exocentric technical adherence), and Fitness-AQA (error detection in squats/overhead press).
  • Competitive Sports: FineFS (figure skating Grade of Execution, or GOE, scores) and MTL-AQA (diving execution scores).

The results highlight a “covert failure” mode. In classification tasks, models achieved balanced accuracy scores only marginally above random chance. While Gemini 3.1 Pro Preview showed stronger performance on static image-based VQA (LLM-FMS), all models struggled significantly with the temporal complexity of video. Furthermore, in regression tasks, the Spearman rank correlation values remained weak and inconsistent across all architectures, falling short of the requirements for professional-grade coaching or therapy.
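To make the two headline metrics concrete, here is a minimal sketch of balanced accuracy (classification) and Spearman rank correlation (regression); the arrays and values are illustrative toys, not results from the paper.

```python
# Minimal sketch of the two headline metrics; the data is illustrative only.
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import balanced_accuracy_score

# Classification: balanced accuracy averages per-class recall, so a model
# that always answers "correct" on an imbalanced set scores 0.5 (chance),
# not the flattering raw accuracy of the majority class.
y_true = np.array([1, 1, 1, 1, 0, 0])  # 1 = correct form, 0 = error
y_pred = np.array([1, 1, 1, 1, 1, 1])  # always predicts "correct"
print(balanced_accuracy_score(y_true, y_pred))  # 0.5

# Regression: Spearman rank correlation measures whether the model orders
# executions correctly, independent of the absolute score scale.
scores_true = np.array([7.5, 8.0, 6.0, 9.5, 5.0])
scores_pred = np.array([7.1, 7.3, 7.0, 7.2, 7.2])  # clustered guesses
rho, _ = spearmanr(scores_true, scores_pred)
print(rho)  # weak rank agreement (~0.4) despite numerically "close" scores
```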

3. Anatomy of a Failure: Modality Collapse and Systematic Bias

The failure of VLMs in AQA is not stochastic; it is the result of what we term Modality Collapse—a phenomenon where the model’s linguistic knowledge overrides its visual processing.

The Correctness Prediction Bias

VLMs exhibit a pathological tendency to predict that an action is performed “correctly” regardless of the visual evidence. This is a prior-knowledge trap: because the training distribution is saturated with “ideal” form descriptions, the model defaults to a linguistic script. If the model recognizes a “squat,” it assumes the trunk must be parallel to the calf because that is the definition of a correct squat. It isn’t observing the human in the video; it is hallucinating a “correct” version of the action label it has identified.
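This bias is directly measurable. The sketch below is hypothetical: `query_vlm` stands in for whatever inference call you use, and the `clip` fields (`video`, `label`) are assumed, but the diagnostic (comparing the model’s “correct” rate on clips with and without ground-truth errors) follows the reported failure mode.

```python
# Hypothetical probe for the correctness prediction bias. `query_vlm`
# and the clip fields are placeholders, not an API from the paper.
from collections import Counter

def correctness_rate_by_class(clips, query_vlm):
    """Fraction of clips the VLM calls 'correct', split by ground truth."""
    yes = Counter()
    total = Counter()
    prompt = "Is this exercise performed correctly? Answer yes or no."
    for clip in clips:
        answer = query_vlm(clip.video, prompt).strip().lower()
        total[clip.label] += 1
        yes[clip.label] += int(answer.startswith("yes"))
    return {label: yes[label] / total[label] for label in total}

# An unbiased judge yields a rate near 1.0 for genuinely correct clips and
# near 0.0 for error clips; the reported bias shows high rates for both,
# i.e., "correct" is predicted regardless of the visual evidence.
```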

Linguistic Sensitivity (Framing Bias)

Model outputs are dangerously sensitive to superficial prompt phrasing. By merely flipping the polarity of a guideline—even when the semantic meaning is identical—we can trigger a total shift in the model’s judgment.

| Prompt Type | Description | Observed Effect on Prediction |
| --- | --- | --- |
| Positive Guidelines | Phrases describing correct form (e.g., “The trunk should be parallel to the calf.”) | Artificially inflates the likelihood of a “True” or “Correct” prediction. |
| Negative Guidelines | Phrases describing common errors (e.g., “The trunk not being parallel to the calf.”) | Triggers a bias toward detecting errors, regardless of actual execution quality. |

This sensitivity suggests the models are exploiting linguistic shortcuts in the prompt rather than performing genuine visual reasoning.
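The framing bias can be probed by pairing polarity-flipped versions of the same guideline, as in the hypothetical sketch below; `query_vlm` is again a placeholder and the prompt wording is illustrative, not taken from the paper.

```python
# Illustrative framing-bias probe: the two prompts encode the same
# guideline with flipped polarity, so a consistent judge must give
# logically complementary answers on the same clip.
POSITIVE = ("Guideline: the trunk should be parallel to the calf. "
            "Does this repetition satisfy the guideline? Answer yes or no.")
NEGATIVE = ("Common error: the trunk is not parallel to the calf. "
            "Does this repetition exhibit the error? Answer yes or no.")

def inconsistency_rate(clips, query_vlm):
    """Fraction of clips where the verdict depends on phrasing alone."""
    inconsistent = 0
    for clip in clips:
        says_ok = query_vlm(clip.video, POSITIVE).strip().lower().startswith("yes")
        says_err = query_vlm(clip.video, NEGATIVE).strip().lower().startswith("yes")
        # Semantically, says_ok should equal (not says_err); agreement of
        # the two flags means the phrasing, not the video, drove the answer.
        inconsistent += int(says_ok == says_err)
    return inconsistent / len(clips)
```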

4. The Limits of Common Fixes: The “Regression Trap” and Skeleton Failures

Standard mitigation strategies—visual preprocessing and advanced prompting—provide a false sense of security while offering minimal gains.

  • The Skeleton Failure: One might assume that providing explicit pose information via SAM 3D Body model estimations would ground the model. However, “skeleton-only” renders (using the COCO 17 keypoint format) caused performance to collapse. VLMs struggle to interpret abstract skeletal structures without the RGB context. They are not “seeing” the movement; they are recognizing a “scene” and applying a text-based heuristic.
  • The Regression Trap: In scoring tasks (regression), models occasionally show lower relative ℓ2 distance (R-ℓ2) values. A naive observer might see this as a success. In reality, this is a covert failure: the models are defaulting to conservative, low-variance predictions that cluster near the ground truth mean. They are simply “guessing the average” rather than accurately judging the nuance of the movement (see the numeric sketch after this list).
  • Advanced Prompting Limitations: Strategies such as Visual Grounding and Structured Reasoning (using XML tags like <look>, <decompose>, and <analyse>) failed to yield consistent gains. The reasoning traces often became “deceptive,” providing a confident-sounding narrative that had no basis in the actual video frames.
  • The ICL Bottleneck: While In-Context Learning (ICL) improved performance for static images, it is non-viable for video AQA. The massive token overhead required to process multiple few-shot video examples quickly exceeds the context window limitations of even the most advanced models.
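The regression trap is easy to reproduce numerically. The sketch below uses one common AQA definition of relative ℓ2 distance (normalized squared error); the paper may use a variant, so treat the metric and the synthetic scores as assumptions.

```python
# Demonstrating the "regression trap": a constant mean prediction looks
# acceptable under relative L2 distance yet carries zero ranking signal.
import numpy as np
from scipy.stats import spearmanr

def relative_l2(y_true, y_pred):
    """One common AQA convention: squared error normalized by score span."""
    span = y_true.max() - y_true.min()
    return np.mean(((y_true - y_pred) / span) ** 2)

rng = np.random.default_rng(0)
y_true = rng.uniform(4.0, 10.0, size=200)         # synthetic execution scores
mean_guess = np.full_like(y_true, y_true.mean())  # always predict the average

print(relative_l2(y_true, mean_guess))  # bounded, respectable-looking error
rho, _ = spearmanr(y_true, mean_guess)
print(rho)  # nan: a constant predictor has no rank information at all
```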

5. Qualitative Red Flags: Deceptive Alignment in Reasoning Traces

An analysis of the “Thinking” traces from models like Qwen3-VL-235B-A22B-Thinking reveals a disturbing disconnect between reasoning and observation. The models frequently use prior knowledge as a substitute for visual grounding, creating a form of “deceptive alignment” where the reasoning looks correct but the conclusion is a hallucination.

Diagnostic Quotes from Reasoning Traces:

  • “Considering the Olympics, the execution is likely high.” (Contextual assumption over visual observation).
  • “The person is in a squat, so the trunk is likely parallel.” (Action-definition as a substitute for observation).
  • “Overhead press is usually standing.” (Relying on “standard” definitions to bypass visual verification).
  • “Higher jumps usually get better GOE.” (Heuristic-based scoring rather than measurement).

These traces prove that the models are not analyzing the specific instance in front of them; they are reciting a script about the general category of the action.

6. Conclusion: Moving Toward Reliable Embodied AI

General-purpose VLMs are currently unfit for safety-critical AQA deployment. Their reliance on linguistic priors and their susceptibility to framing make them “covert” failures: systems that provide confident, plausible-sounding explanations for incorrect assessments.

Actionable Takeaways for Researchers and Red-Teamers:

  1. Mitigate Covert Hallucination in Reasoning Traces: We must develop evaluation protocols that penalize “thinking” steps not grounded in specific pixel-level or temporal evidence.
  2. Benchmark Beyond Generalities: Standard benchmarks are too easy to “game” via linguistic priors. Use Contrastive Task Reformulation, asking the model to distinguish which of two similar videos is “better”, to strip away correctness bias (a sketch of this setup follows the list). Note, however, that currently, models only succeed here when the quality gap is massive.
  3. Address Context and Modality Constraints: We need architectural shifts that prevent “modality collapse.” Visual evidence must be weighted such that it can override linguistic priors in the final decision layer.
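A sketch of the contrastive setup referenced in takeaway 2: `query_vlm` and the clip fields are placeholders, and sweeping `min_gap` upward reproduces the observation that models only rank pairs reliably when the quality gap is large.

```python
# Hypothetical harness for Contrastive Task Reformulation: show two
# attempts at the same action and ask which is better, rather than
# asking whether a single attempt is "correct".
import itertools
import math

PAIR_PROMPT = ("You will see two attempts at the same exercise. "
               "Which attempt shows better execution, A or B? Answer A or B.")

def pairwise_accuracy(clips, query_vlm, min_gap=0.0):
    """Accuracy at ordering pairs whose ground-truth score gap >= min_gap."""
    correct = total = 0
    for a, b in itertools.combinations(clips, 2):
        if abs(a.score - b.score) < min_gap:
            continue  # raise min_gap to find where the model starts to succeed
        answer = query_vlm((a.video, b.video), PAIR_PROMPT).strip().upper()
        correct += int(answer.startswith("A") == (a.score > b.score))
        total += 1
    return correct / total if total else math.nan
```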

Until these systematic failure modes are addressed, the use of VLMs for human-centric physical assessment should be viewed as a high-risk endeavor. We do not yet have models that “see” movement; we have models that “know” definitions.

Read the full paper on arXiv · PDF