Gemini: A Family of Highly Capable Multimodal Models
Introduces the Gemini family of multimodal models capable of reasoning across text, images, audio, and video, demonstrating state-of-the-art performance on 30 of 32 benchmarks while detailing the safety evaluation framework for natively multimodal systems.
Focus: Google’s Gemini models were the first natively multimodal large models, processing text, images, audio, and video within a single architecture. The paper documented both the capability frontier and the expanded safety evaluation required when models can reason across modalities.
Key Insights
- Native multimodality expands the attack surface multiplicatively. Unlike models that bolt vision onto language, Gemini’s native multimodal architecture meant that adversarial inputs could exploit interactions between modalities. An adversarial image could influence text generation, and cross-modal prompt injection became possible.
- Capability gains outpace safety evaluation. Gemini Ultra achieved state-of-the-art on 30 of 32 benchmarks, including the first score above 90% on MMLU. But the safety evaluation could not comprehensively cover the combinatorial space of cross-modal interactions.
- Safety evaluation must become multimodal. Existing safety benchmarks were predominantly text-based, creating a systemic gap in evaluation infrastructure for multimodal systems.
Executive Summary
The Gemini technical report introduced three model sizes designed for different deployment contexts:
Model Family
- Gemini Ultra: Maximum capability for complex tasks (data center)
- Gemini Pro: Balanced performance and efficiency (cloud deployment)
- Gemini Nano: On-device deployment (mobile, edge)
All were trained from the ground up as multimodal models capable of processing interleaved text, image, audio, and video inputs.
Capability Achievements
- MMLU: 90.0%, the first reported score to surpass human expert performance
- State-of-the-art on 30/32 benchmarks across text, vision, and multimodal tasks
- Native multimodal reasoning across text, image, audio, and video
Safety Evaluation
The safety section documented multi-stage assessment:
- Pre-training data analysis — Content filtering and bias detection in training data
- Safety fine-tuning — RLHF with safety-specific preference data
- Multi-modal red-teaming — Testing across text, image, and combined inputs
- Safety classifiers — Applied to both inputs and outputs across modalities
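The last stage, input/output classifiers per modality, can be sketched as a simple moderation gate. This is a hypothetical illustration, not the Gemini report's implementation: the classifier functions, `ModerationResult` type, and keyword heuristic below are all placeholder assumptions standing in for real per-modality safety models.

```python
# Hypothetical sketch of a per-modality safety gate for a multimodal model.
# The classifier names and logic are illustrative placeholders, not from the
# Gemini report; a real deployment would call trained safety models here.

from dataclasses import dataclass, field

@dataclass
class ModerationResult:
    allowed: bool
    reasons: list = field(default_factory=list)

def text_classifier(text: str) -> bool:
    # Placeholder heuristic: flag one obviously unsafe phrase.
    return "build a weapon" not in text.lower()

def image_classifier(image_bytes: bytes) -> bool:
    # Placeholder: a real system would run a vision safety model here.
    return len(image_bytes) > 0

def moderate(inputs: dict) -> ModerationResult:
    """Run each modality's classifier; block if any classifier fires."""
    result = ModerationResult(allowed=True)
    if "text" in inputs and not text_classifier(inputs["text"]):
        result.allowed = False
        result.reasons.append("text")
    if "image" in inputs and not image_classifier(inputs["image"]):
        result.allowed = False
        result.reasons.append("image")
    return result
```

The same gate would run again on model outputs, since a safe input combination can still yield an unsafe cross-modal generation.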
Cross-Modal Attack Surfaces
The report acknowledged fundamental challenges:
- Image-to-text influence: Images containing adversarial perturbations could manipulate text outputs.
- Text-to-image manipulation: Text prompts could bias how the model interpreted ambiguous images.
- Audio-based injection: Voice inputs could contain adversarial commands.
- Combinatorial explosion: The space of possible multimodal inputs is vastly larger than text-only, making comprehensive testing impractical.
Relevance to Failure-First
Gemini’s multimodal architecture represents the type of system the failure-first framework was designed to evaluate:
- Combinatorial failure space. When models process multiple input modalities, the space of possible failures grows combinatorially, exactly analogous to embodied AI systems that integrate diverse sensor inputs with language reasoning.
- Evaluation-capability gap. The acknowledgment that safety evaluation could not keep pace with multimodal capability validates the framework’s position that adversarial evaluation must be continuous and adaptive.
- Embodied-specific attack vectors. Cross-modal attacks introduce failure modes specific to embodied systems: adversarial manipulation of visual scenes, corrupted sensor data influencing safety reasoning, and audio-based prompt injection in voice-controlled robots.