Anthropic's Claude 3.5 Sonnet Sets New Benchmark Standards
GPT-4o Has Competition, and Anthropic Responds to the Gemini Threat
Anthropic released a new AI foundation model today. Claude 3.5 Sonnet is the latest iteration of the company's large language model (LLM), which has evolved into a multimodal AI model with capabilities spanning language and images.
The new offering arrived less than four months after the introduction of the Claude 3 model family, which was among the first to claim superiority over OpenAI’s GPT-4. Last month, OpenAI introduced GPT-4o, which set a new standard for quality and capability. While Claude 3 Opus is still widely regarded as a leading frontier AI model, the introduction of Claude 3.5 Sonnet is an important step, given Anthropic’s objective to surpass the performance of the OpenAI GPT-4o offering.
Claude 3.5 Sonnet raises the industry bar for intelligence, outperforming competitor models and Claude 3 Opus on a wide range of evaluations, with the speed and cost of our mid-tier model, Claude 3 Sonnet.
Claude 3.5 Sonnet is now available for free on Claude.ai and the Claude iOS app, while Claude Pro and Team plan subscribers can access it with significantly higher rate limits. It is also available via the Anthropic API, Amazon Bedrock, and Google Cloud’s Vertex AI.
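For developers, API access goes through Anthropic's Messages endpoint. Below is a minimal sketch using only the Python standard library (the official `anthropic` SDK wraps the same call); it assumes an `ANTHROPIC_API_KEY` environment variable and uses `claude-3-5-sonnet-20240620`, the model identifier published at release.

```python
import json
import os
import urllib.request

API_URL = "https://api.anthropic.com/v1/messages"  # Anthropic Messages API endpoint


def build_request(prompt: str,
                  model: str = "claude-3-5-sonnet-20240620",
                  max_tokens: int = 1024) -> dict:
    """Assemble the JSON body for a Messages API call."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }


def send(body: dict) -> dict:
    """POST the request; requires ANTHROPIC_API_KEY in the environment."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "x-api-key": os.environ["ANTHROPIC_API_KEY"],
            "anthropic-version": "2023-06-01",
            "content-type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Calling `send(build_request("Hello, Claude"))` returns the model's reply as a parsed JSON object; the same request body works unchanged against Amazon Bedrock and Vertex AI via their respective client libraries, with provider-specific model IDs.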
…
Our aim is to substantially improve the tradeoff curve between intelligence, speed, and cost every few months. To complete the Claude 3.5 model family, we’ll be releasing Claude 3.5 Haiku and Claude 3.5 Opus later this year.
Benchmark Standard
Claude 3 Opus is widely regarded as one of the top three generative AI foundation models. Many see it as superior to GPT-4 Turbo and competitive with GPT-4o. However, Claude 3.5 Sonnet has set a new standard among Anthropic’s AI models. This is significant, in part, because it is a smaller model.
The diagram at the top of the page emphasizes the implications of delivering higher quality with a smaller model. Claude 3.5 Sonnet is “more intelligent” than Claude 3 Opus, and it delivers this quality at a far lower price point.
It is worth noting that the Claude 3 announcement included results from ten language and reasoning AI model benchmarks, while the Claude 3.5 Sonnet announcement presents only eight. The two missing benchmarks are ARC-Challenge and HellaSwag. You can draw your own conclusions about why results for those benchmarks didn’t make it into the current announcement.
However, it is to Anthropic’s credit that the current release added data for 0-shot MMLU, which is one of two reported results where GPT-4o outperforms Claude 3.5 Sonnet. Granted, the results are so close that you can easily conclude the models are comparable in the undergraduate-level knowledge domain.
The other benchmark where GPT-4o surpasses Claude 3.5 Sonnet is MATH. GPT-4o achieved a score of 76.6% for 0-shot chain of thought, whereas Claude 3.5 Sonnet scored 71.1%.
The new model excels across the other benchmarks. Some show performance comparable to GPT-4o, while others demonstrate a significant lead. Claude 3.5 Sonnet's score on the HumanEval benchmark for coding was seven percentage points higher, and its score on GPQA Diamond for graduate-level reasoning was six percentage points higher. Anthropic also reported better scores than GPT-4o in four of five vision benchmarks.
Addressing the Google Challenge
GPT-4o is clearly a target competitor for the Claude 3.5 model family, but so are Google's Gemini 1.5 Pro and Gemini 1.5 Flash. First, the Gemini 1.5 models dethroned Anthropic as the leader in LLM context window size. While Anthropic was promoting an industry-leading 200,000-token context window, Google casually introduced a model that could handle 1,000,000 tokens. Less than three months later, Google said a 2,000,000-token context window would be available this year.
Google also started establishing impressive benchmark scores, and then, with Gemini 1.5 Flash, it added high quality combined with low latency and cost. Anthropic had held the position of key alternative to OpenAI's foundation models for a year. Suddenly, Google was making a credible claim to that position. The higher performance and lower latency of Claude 3.5 Sonnet directly address the advances made in the Gemini 1.5 Pro and Flash models.
What’s Next?
Anthropic has raised a lot of funding and secured a number of customers along the way. Its $18 billion valuation from the most recent funding round will be hard to justify if its market position erodes. However, Anthropic had already established itself as OpenAI’s key LLM substitute. That position is under threat, but the Claude 3.5 family is likely to keep Anthropic in the first tier of LLM providers.
Beyond performance, Anthropic will add new experiences and capabilities that may provide enhanced value for users.
In addition to working on our next-generation model family, we are developing new modalities and features to support more use cases for businesses, including integrations with enterprise applications. Our team is also exploring features like Memory, which will enable Claude to remember a user’s preferences and interaction history as specified, making their experience even more personalized and efficient.
OpenAI has not wrapped up the market. The company is clearly ahead of its competitors in several areas and has a far larger market share for model use and digital assistant adoption. Anthropic represents the key alternative to OpenAI. Claude 3.5 Sonnet and its sister models are likely to keep the company at or near the top of the foundation model hierarchy.
The main improvement in the new Claude seems to be that it offers a similar level of intelligence at a lower price (just as OpenAI did with GPT-4o). Another data point that suggests diminishing returns?
PS. I love the updates to the Claude user interface; Anthropic deserves a shout-out for that, too. If model improvement continues to be slow and incremental, I predict the focus will shift toward UX, a truly underestimated component of delivering value to users.