January 14, 2026 Daily Paper

GPT-4 Technical Report

Documents the capabilities and safety evaluation of GPT-4, a large multimodal model that accepts image and text inputs, demonstrating substantial improvements over GPT-3.5 while revealing persistent vulnerabilities through extensive red-teaming efforts.

arXiv:2303.08774 Empirical Study

OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal et al.

foundation-modelsmultimodal-aisafety-evaluationred-teamingcapability-assessmentdeployment-safety

GPT-4 Technical Report

Focus: OpenAI’s GPT-4 technical report documented both the model’s substantial capability improvements and the extensive safety evaluation infrastructure deployed before release, including red-teaming by over 50 external experts. The report established a precedent for pre-deployment safety assessment while revealing the persistent gap between safety investment and adversarial robustness.

Key Insights

Multimodal capabilities expand the attack surface. GPT-4’s ability to process both text and images introduced new vulnerability classes: adversarial images, cross-modal prompt injection, and attacks exploiting the interaction between visual and textual reasoning.
Expert red-teaming finds risks that automated testing misses. The 50+ domain experts recruited for red-teaming discovered risks in specialized domains (chemistry, biology, cybersecurity) that general-purpose automated evaluation would not have identified.
Safety training reduces but does not eliminate sophisticated attacks. GPT-4 showed an 82% reduction in disallowed content responses compared to GPT-3.5. However, adversarial prompts from the red-teaming effort could still bypass these protections.

Executive Summary

The GPT-4 technical report described a large multimodal transformer model trained on both text and image data and fine-tuned using RLHF.

Capability Achievements

On standardized benchmarks, GPT-4 demonstrated performance at or above the 90th percentile on many professional exams:

Bar exam: ~90th percentile
SAT Math: ~89th percentile
GRE Quantitative: ~80th percentile
AP exams: High scores across multiple subjects

This represented a substantial leap over GPT-3.5 in both breadth and depth of capability.

Safety Evaluation Infrastructure

The safety section was notably more comprehensive than previous model releases:

External red-teaming. Over 50 domain experts in areas from cybersecurity to biosecurity tested the model before release.
Risk taxonomy. A structured taxonomy spanning cybersecurity, biological threats, persuasion, and autonomous replication.
Iterative safety training. Multiple rounds of RLHF with safety-specific preference data, informed by red-teaming findings.
Deployment mitigations. Input and output classifiers, rate limiting, and monitoring systems.

Persistent Limitations

Despite the extensive safety investment, the report acknowledged:

GPT-4 still hallucinated factual information.
It exhibited demographic and ideological biases.
Adversarial prompts could elicit harmful outputs.
The absence of training details (dataset, model size, compute) undermined reproducibility.

Relevance to Failure-First

GPT-4’s technical report is both a milestone and a cautionary tale for the failure-first framework:

Aggregate vs. adversarial safety. The 82% reduction in disallowed content demonstrates that safety training works in the aggregate — but the remaining vulnerability includes the most consequential failures, which is what adversarial evaluation targets.
Multimodal attack surface expansion. Capability improvements create new failure modes faster than safety measures can address them, validating the failure-first prediction.
Expert evaluation limitations. Even 50 experts cannot comprehensively evaluate a model with GPT-4’s breadth, motivating systematic, automated adversarial evaluation supplemented by expert testing.

Read the full paper on arXiv · PDF