GPT-4 Launches
OpenAI released GPT-4, a multimodal model capable of processing both text and images with significantly improved reasoning abilities compared to its predecessors. It scored in the top percentiles on professional exams including the bar exam and medical licensing tests. GPT-4 set a new benchmark for what large language models could achieve.
On March 14, 2023, OpenAI released GPT-4, a multimodal large language model that represented a significant leap in capability over its predecessors. GPT-4 could process both text and images as input, demonstrated substantially improved reasoning abilities, and scored at or above the 90th percentile on numerous professional and academic exams. It quickly became the benchmark against which all other AI models were measured.
The Capabilities
GPT-4's performance on standardized tests was striking. It scored in the 90th percentile on the Uniform Bar Exam (compared to GPT-3.5's 10th percentile), the 99th percentile on the Biology Olympiad, and passed the US Medical Licensing Exam with a comfortable margin. It demonstrated improved ability to handle complex multi-step reasoning, follow nuanced instructions, and produce factually accurate responses. While it still made errors, the reduction in hallucinations compared to GPT-3.5 was substantial.
Multimodal Input
For the first time, a GPT model could accept images as input alongside text. Users could submit photographs, charts, diagrams, or screenshots and ask GPT-4 to analyze, describe, or reason about them, although at launch image input was limited to a research preview and reached general users only later in 2023. The model could read text in images, interpret charts, solve visual puzzles, and even explain memes. While it could not generate images (that capability came later with DALL-E integration), its visual understanding opened new application possibilities.
The Mystery of Scale
OpenAI made the unusual decision to release almost no technical details about GPT-4's architecture, training data, or size. The technical report focused on capabilities and safety evaluations rather than methodology. This departure from traditional academic transparency was criticized by many researchers but reflected OpenAI's increasing focus on competitive advantage and safety concerns about enabling replication of the most powerful AI systems.
Safety Efforts
GPT-4 represented OpenAI's most significant investment in safety and alignment to date. OpenAI spent six months on safety and alignment work before release, including red-teaming by external experts in areas like cybersecurity, persuasion, and biosecurity. According to OpenAI, the resulting mitigations made GPT-4 82 percent less likely than GPT-3.5 to respond to requests for disallowed content. However, creative jailbreaks continued to emerge, highlighting the ongoing challenge of making AI systems robustly safe.
Integration and Applications
GPT-4 was rapidly integrated into products across industries. Microsoft embedded it in Bing Chat, GitHub Copilot X, and Microsoft 365 Copilot. Khan Academy used it for personalized tutoring, Duolingo for language learning, and Morgan Stanley to organize and search its wealth management knowledge base, while Be My Eyes used it to help visually impaired users understand their surroundings. The breadth of applications demonstrated GPT-4's versatility as a general-purpose reasoning engine.
The Competitive Response
GPT-4's release intensified the AI race. Google accelerated its Gemini project. Anthropic scaled up Claude's capabilities. Openly released models such as Meta's Llama 2 aimed to close the gap. The model's demonstrated capabilities raised the stakes for everyone in the industry and increased the pace of development across the board.
Legacy
GPT-4 was an inflection point in AI capability. It was the first AI model that many experts considered genuinely useful for professional-level cognitive work across a wide range of domains. Whether drafting legal briefs, analyzing medical images, tutoring students, or writing code, GPT-4 demonstrated that AI could serve as a capable assistant for knowledge work at a level that was previously the exclusive domain of trained professionals.
Lasting Impact
GPT-4 set a new standard for AI capability, demonstrating professional-level performance across diverse domains and becoming the first model widely considered genuinely useful for professional cognitive work. It accelerated the AI arms race and expanded the commercial applications of large language models.