Anthropic's Claude 2.1 LLM Has a 200K Context Window, API Tools and Poses a New Challenge to OpenAI
Developers will also appreciate increased accuracy and system prompts
Anthropic has introduced Claude 2.1. Its latest large language model (LLM) boasts “an industry-leading 200K token context window,” reduced hallucinations, APIs for tool use, and system prompts. It also arrives with a price reduction for model use.
This announcement follows OpenAI’s introduction of a 128K context window for GPT-4 that bested Anthropic’s previous version of 100K. It also follows a weekend when more than 100 enterprise OpenAI users reached out to Anthropic seeking a backstop to their reliance on the industry’s most popular LLM provider, which is in the throes of a complete meltdown.
The timely release of the new model with enhanced features will only make Anthropic appear more attractive. The $4 billion in support from Amazon and $2.5 billion from Google, as well as promotion from AWS and Google Cloud, had already set Anthropic up as the logical alternative to OpenAI. The context window differentiation, improved accuracy, and access to tools will likely reinforce that sentiment.
Larger Context, More Use Cases
The context window refers to how much information the LLM can hold in its working memory while executing user tasks. It is measured in tokens, and, depending on the tokenizer and the text, the equivalent word count is roughly 75% of the token count. Thus, 200K tokens translate into roughly 150,000 words.
The more tokens a model supports, the more information can be added to an LLM conversation as context. That means more source data can be included and longer conversations can run without losing context.
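As a rough back-of-the-envelope check, here is the conversion arithmetic behind the 150,000-word figure. The 0.75 words-per-token ratio is the rule of thumb cited above; the actual ratio depends on the tokenizer and the text.

```python
# Back-of-the-envelope token/word conversion using the ~0.75 words-per-token
# rule of thumb cited above. The true ratio varies with tokenizer and content.

WORDS_PER_TOKEN = 0.75

def tokens_to_words(tokens: int) -> int:
    """Estimate how many English words fit in a given token budget."""
    return round(tokens * WORDS_PER_TOKEN)

def words_to_tokens(words: int) -> int:
    """Estimate how many tokens a given word count will consume."""
    return round(words / WORDS_PER_TOKEN)

print(tokens_to_words(200_000))    # ~150,000 words in Claude 2.1's window
print(words_to_tokens(500 * 300))  # a 500-page, ~300-words-per-page document
```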
Granted, most LLM use cases do not require a giant context window; smaller windows of 8K, 16K, or 32K tokens cover the majority of them. For some use cases, however, a large context window is essential. That matters for organizations that want a single model to support use cases with both small and large context requirements. Anthropic said the 200K context window was developed in response to user requests:
In discussions with our users, they’ve asked for larger context windows and more accurate outputs when working with long documents.
In response, we’re doubling the amount of information you can relay to Claude with a limit of 200,000 tokens, translating to roughly 150,000 words, or over 500 pages of material. Our users can now upload technical documentation like entire codebases, financial statements like S-1s, or even long literary works like The Iliad or The Odyssey.
While the needs of classics professors might not be the most critical market for Anthropic to serve, financial services and software development are. The larger context window directly targets high-value customer segments that might otherwise have been inclined to give GPT-4’s new 128K context window a try.
Better Accuracy
Hallucinations, the inaccurate responses LLMs sometimes produce, are problematic. I have encountered this firsthand with different LLMs, but Anthropic had particularly notable trouble. Much of that came down to the use of the larger context window feature, which, to be fair, was not available from other models at the time. The more data in context memory, the more likely the model is to become confused and return an untruthful response or ignore information provided in the context altogether.
Anthropic says it has improved truthfulness both when answering hard questions and when working with long-context questions. For what the company characterizes as hard questions (e.g., “What is the fifth most populous city in Bolivia?”), Claude 2.0’s error rate was nearly 50%, and the model declined to answer, citing its uncertainty, only about 25% of the time. Claude 2.1 cuts the error rate to about 25% and declines to answer about 45% of the time. These numbers are not great, but keep in mind they reflect deliberately hard questions; error rates on a set that includes easy questions would be much lower.
Users should continue to use retrieval augmented generation (RAG) and other grounding methods to drive these error rates down to single-digit percentages or below one percent.
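For readers who have not set this up, here is a minimal sketch of the RAG pattern: retrieve only the passages relevant to a question and instruct the model to answer strictly from them, declining when the context does not contain the answer. The keyword-overlap retriever is a deliberately naive stand-in (a real system would use an embedding index), and the Claude call assumes the Anthropic Python SDK’s text-completions interface as it existed for Claude 2.x.

```python
# Minimal RAG sketch: ground answers in retrieved passages so they can be
# checked against sources. The keyword-overlap retriever is a naive stand-in
# for an embedding index; the Claude call assumes the Anthropic Python SDK's
# text-completions interface for Claude 2.x.
import anthropic  # pip install anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def retrieve(question: str, passages: list[str], k: int = 3) -> list[str]:
    """Rank passages by keyword overlap with the question and keep the top k."""
    terms = set(question.lower().split())
    ranked = sorted(passages,
                    key=lambda p: len(terms & set(p.lower().split())),
                    reverse=True)
    return ranked[:k]

def answer(question: str, passages: list[str]) -> str:
    context = "\n\n".join(
        f"[{i + 1}] {p}" for i, p in enumerate(retrieve(question, passages))
    )
    prompt = (
        f"{anthropic.HUMAN_PROMPT} Answer the question using ONLY the passages "
        "below. If they do not contain the answer, say you don't know.\n\n"
        f"Passages:\n{context}\n\nQuestion: {question}{anthropic.AI_PROMPT}"
    )
    resp = client.completions.create(
        model="claude-2.1",
        max_tokens_to_sample=300,
        prompt=prompt,
    )
    return resp.completion
```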
For long-context question answering, Claude 2.1 also shows dramatic improvements, particularly for material located in the beginning and middle of long documents. According to the company:
Claude 2.1 demonstrated a 30% reduction in incorrect answers and a 3-4x lower rate of mistakenly concluding a document supports a particular claim.
While we are encouraged by these accuracy improvements, enhancing the precision and dependability of outputs for our users remains a top priority for our product and research teams.
These numbers are still not nearly good enough, which means users must continue to verify every output. However, considering the time it can take to review some of these documents manually, a partial solution plus verification is still a significant time saver. In my own experience with long-text analysis, OpenAI’s GPT models with Code Interpreter were more accurate than Claude 2.0, though both were prone to errors. It will be interesting to see whether Claude 2.1 proves superior.
API Tools
Introducing APIs for accessing third-party tools may be the most important capability arriving with Claude 2.1. More use cases benefit from integration with external services than require large context windows. Anthropic elaborated on this topic:
By popular demand, we’ve also added tool use, a new beta feature that allows Claude to integrate with users' existing processes, products, and APIs. This expanded interoperability aims to make Claude more useful across our users’ day-to-day operations.
Claude can now orchestrate across developer-defined functions or APIs, search over web sources, and retrieve information from private knowledge bases. Users can define a set of tools for Claude to use and specify a request. The model will then decide which tool is required to achieve the task and execute an action on their behalf, such as:
Using a calculator for complex numerical reasoning
Translating natural language requests into structured API calls
Answering questions by searching databases or using a web search API
Taking simple actions in software via private APIs
Connecting to product datasets to make recommendations and help users complete purchases
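Anthropic has not publicly documented the beta’s interface, so the snippet below is only a generic illustration of the pattern rather than Claude’s actual tool-use API: the developer registers functions with descriptions, asks the model to pick one and emit a JSON call, then parses and executes that call.

```python
# Generic illustration of a tool-use loop, NOT Anthropic's beta interface
# (which is not publicly documented): the model is shown a tool menu, asked to
# reply with a JSON tool call, and the application executes the chosen tool.
import json

TOOLS = {
    "calculator": {
        "description": "Evaluate a basic arithmetic expression, e.g. '17 * 23'.",
        "fn": lambda expression: str(eval(expression, {"__builtins__": {}})),  # demo only
    },
    "order_lookup": {
        "description": "Look up an order's status by order_id.",
        "fn": lambda order_id: {"order_id": order_id, "status": "shipped"},
    },
}

def tool_menu() -> str:
    return "\n".join(f"- {name}: {t['description']}" for name, t in TOOLS.items())

def run_tool_call(model_output: str):
    """Expects JSON like {"tool": "calculator", "arguments": {"expression": "17 * 23"}}."""
    call = json.loads(model_output)
    return TOOLS[call["tool"]]["fn"](**call["arguments"])

prompt = (
    "You may use exactly one of these tools:\n"
    f"{tool_menu()}\n"
    'Reply with JSON only: {"tool": <name>, "arguments": {...}}.\n'
    "Task: What is 17 * 23?"
)
# model_output = ask_claude(prompt)  # in practice, send the prompt to Claude 2.1
model_output = '{"tool": "calculator", "arguments": {"expression": "17 * 23"}}'
print(run_tool_call(model_output))   # -> 391
```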
Ease of API connectivity has been a clear differentiator for OpenAI, particularly after it added the Assistants API earlier this month. Anthropic admits these features are “early in development,” but they do demonstrate where the product is headed.
System Prompts
The introduction of system prompts is also a welcome addition. System prompts enable developers to supply background instructions and context to the model ahead of each user message sent to the LLM. This strategy is effective for improving overall LLM performance and use case alignment.
System prompts set helpful context that enhances Claude’s ability to take on specified personalities and roles or structure responses in a more customizable, consistent way aligned with user needs.
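Here is a minimal sketch of how a developer might use a system prompt with Claude 2.1. It assumes the Anthropic Python SDK’s text-completions interface and the convention Anthropic documented at the time, in which the system text is placed ahead of the first Human turn; the support-agent persona is just a hypothetical example.

```python
# Sketch of a system prompt with Claude 2.1, assuming the text-completions
# convention documented at the time: system text goes ahead of the first
# Human turn. The support-agent persona is a hypothetical example.
import anthropic  # pip install anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are a customer support agent for a subscription software product. "
    "Answer in two sentences or fewer, cite the relevant policy section, "
    "and never promise a refund."
)

def ask(user_message: str) -> str:
    prompt = (
        f"{SYSTEM_PROMPT}"
        f"{anthropic.HUMAN_PROMPT} {user_message}"
        f"{anthropic.AI_PROMPT}"
    )
    resp = client.completions.create(
        model="claude-2.1",
        max_tokens_to_sample=200,
        prompt=prompt,
    )
    return resp.completion

print(ask("My renewal was charged twice. What are my options?"))
```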
Lower Cost
To top it off, Anthropic lowered its usage pricing for Claude 2.0 and 2.1, setting the cost below the price of OpenAI’s GPT-4. The current pricing:
OpenAI GPT-4 Turbo: $0.01 per 1K input tokens and $0.03 per 1K output tokens
OpenAI GPT-4: $0.03 per 1K input tokens and $0.06 per 1K output tokens
OpenAI GPT-3.5 Turbo: $0.001 per 1K input tokens and $0.002 per 1K output tokens
Claude 2.0 and 2.1: $0.008 per 1K input tokens and $0.024 per 1K output tokens
The new Anthropic pricing reflects a 20% discount to OpenAI’s most powerful models but is still considerably higher than the price of the less powerful, smaller-context-window GPT-3.5 Turbo.
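A quick sanity check on the list above, using a hypothetical request with 10,000 input tokens and 1,000 output tokens; the per-request totals also confirm the 20% gap between Claude 2.1 and GPT-4 Turbo.

```python
# Cost comparison using the per-1K-token prices listed above, applied to a
# hypothetical request with 10,000 input tokens and 1,000 output tokens.
PRICES = {  # model: (input $ per 1K tokens, output $ per 1K tokens)
    "gpt-4-turbo":   (0.010, 0.030),
    "gpt-4":         (0.030, 0.060),
    "gpt-3.5-turbo": (0.001, 0.002),
    "claude-2.1":    (0.008, 0.024),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    in_price, out_price = PRICES[model]
    return (input_tokens / 1000) * in_price + (output_tokens / 1000) * out_price

for model in PRICES:
    print(f"{model:14s} ${request_cost(model, 10_000, 1_000):.3f}")
# claude-2.1 comes out to $0.104 vs. $0.130 for gpt-4-turbo, the 20% discount
# noted above, and well above gpt-3.5-turbo's $0.012.
```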
Anthropic is positioning Claude as the alternative among the most advanced models and offering a meaningful price advantage. A key strategic question is whether Anthropic intends to provide more cost-effective models that can be price-competitive for more use cases.
Anthropic’s Rise
Anthropic already had a good story as a competent and somewhat differentiated alternative to OpenAI for users who want access to the most powerful LLMs. The tenuous status of OpenAI created by the firing of CEO Sam Altman, followed by the threatened departure of 85% of its staff, has enterprises and software developers urgently seeking an alternative to reduce their own product risk.
Amazon was also separately seeking an alternative offering for its AWS users that could compete effectively with OpenAI’s GPT-4. At the same time, Google wanted to hedge its bets in case its PaLM LLM couldn’t gain traction or its forthcoming Gemini model faced delays, which is apparently happening. The backing of AWS and Google Cloud has positioned Anthropic nicely as the key alternative to OpenAI and the Azure cloud ecosystem. The latest product updates should reinforce that positioning.
Every LLM provider is getting a fresh look from prospective customers. What better time to provide far-reaching product updates and lower pricing to look even more attractive?
Let me know your thoughts in the comments. Is Anthropic about to be the biggest winner coming out of the OpenAI meltdown?
Synthedia is a community-supported publication. Please check out our sponsor, Dabble Lab, an independent research and software development agency that is entirely automated with a mix of in-house AI tools and GitHub Copilot. The founder, Steve Tingiris, is also the author of Exploring GPT-3, the first developer training manual for building applications on GPT-3.
It's worth mentioning that Anthropic's API is in closed beta. I've submitted a request for access every month or so and never even got an email confirming my request.
Still very encouraging stuff.
FYI, Perplexity does offer an API, though I have not yet rolled up my sleeves to run it through its paces. More info here: https://blog.perplexity.ai/blog/introducing-pplx-api