New Research: Anthropic’s Claude 2.1 LLM Remains Inferior to OpenAI’s GPT-4 at Context Recall

This is not investment advice. The author has no position in any of the stocks mentioned.

The limited ability of current Large Language Models (LLMs) to comprehend growing amounts of context remains one of the biggest impediments to achieving AI singularity, a threshold at which artificial intelligence demonstrably exceeds human intelligence. At first glance, the 200K-token context window of Anthropic’s Claude 2.1 LLM appears impressive. However, its context recall leaves much to be desired, especially when compared with the relatively robust recall of OpenAI’s GPT-4.

Anthropic announced yesterday that its latest Claude 2.1 LLM now supports an “industry-leading” context window of 200K tokens while delivering a 2x decrease in model hallucinations. A hallucination occurs when a generative AI model perceives non-existent patterns or objects, often as a result of unclear or contradictory input, and delivers an inaccurate or nonsensical output.

For the benefit of those who might not be aware, a token is a basic unit of text or code that LLMs use to process and generate language. Depending on the tokenization method employed, a token might be a character, word, subword, or an entire segment of text or code. Claude 2.1’s enlarged context window allows the LLM to understand and process a nearly 470-page book.
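To see roughly how a token count maps to a page count, consider the back-of-the-envelope sketch below. Both conversion factors are assumptions made for illustration and do not come from Anthropic or from this article: about 0.75 English words per token (a common rule of thumb) and about 320 words per printed page. Actual ratios vary with the tokenizer and the book's layout.

```python
# Rough token-to-page estimate. Both constants are illustrative assumptions.
WORDS_PER_TOKEN = 0.75  # assumption: rule of thumb for English text
WORDS_PER_PAGE = 320    # assumption: typical printed page


def estimate_pages(tokens, words_per_token=WORDS_PER_TOKEN,
                   words_per_page=WORDS_PER_PAGE):
    """Convert a token count into an approximate number of pages."""
    words = tokens * words_per_token
    return words / words_per_page


if __name__ == "__main__":
    for label, tokens in [("200K window", 200_000), ("128K window", 128_000)]:
        print(f"{label}: about {round(estimate_pages(tokens))} pages")
```

Under these assumed ratios, a 200K-token window works out to just under 470 pages, while a 128K-token window comes to about 300 pages.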

Of course, the 200K-token context window of Anthropic’s Claude 2.1 is quite impressive when compared with OpenAI’s GPT-4, which only supports a 128K-token window. However, the real-world application of this enlarged context window loses some of its luster when one considers Claude 2.1’s less-than-impressive ability to recall context.

Context Recall: Anthropic’s Claude 2.1 vs. OpenAI’s GPT-4

AI expert Greg Kamradt recently pitted Claude 2.1 against GPT-4 in a standardized test designed to determine how accurately each model recalls a specific fact embedded at varying depths within a passage.

Specifically, Kamradt embedded the following text at varying passage depths:

“The best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day.”

The researcher divided his input text into 35 equal parts and then placed the above fact at each of these 35 depths, asking Claude 2.1 to answer a related question each time. The researcher also varied the context window, which ranged from 1K tokens all the way to 200K tokens, divided into 35 equal increments. Go to this X post for further details on the methodology employed.
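The test grid described above can be sketched in a few lines of Python. This is a minimal illustration and not Kamradt's actual code: the function names are hypothetical, whitespace-separated words stand in for real tokens, and the depths are assumed to run evenly from 0 percent (start of the document) to 100 percent (end).

```python
# Illustrative sketch of the test setup; not the researcher's own code.
NEEDLE = ("The best thing to do in San Francisco is eat a sandwich "
          "and sit in Dolores Park on a sunny day.")


def linspace(start, stop, steps):
    """Return `steps` evenly spaced values from start to stop, inclusive."""
    return [start + (stop - start) * i / (steps - 1) for i in range(steps)]


def build_test_grid(steps=35, min_len=1_000, max_len=200_000):
    """Pair every context length with every depth: steps * steps test cases."""
    lengths = [round(x) for x in linspace(min_len, max_len, steps)]
    depths = linspace(0, 100, steps)  # assumption: depth as percent of document
    return [(length, depth) for length in lengths for depth in depths]


def insert_needle(filler_tokens, needle, context_len, depth_pct):
    """Trim the filler to fit context_len and place the needle at depth_pct."""
    needle_tokens = needle.split()  # assumption: words as a proxy for tokens
    body = filler_tokens[: max(context_len - len(needle_tokens), 0)]
    cut = round(len(body) * depth_pct / 100)
    return body[:cut] + needle_tokens + body[cut:]


if __name__ == "__main__":
    grid = build_test_grid()
    print(len(grid), "test cases")  # 35 lengths x 35 depths
    filler = ["lorem"] * 5_000
    doc = insert_needle(filler, NEEDLE, context_len=1_000, depth_pct=50)
    print(len(doc), "tokens in the sample document")
```

Each (context length, depth) pair would then be sent to the model along with a question about the embedded fact, and the answer scored for accuracy. The scoring step is omitted here because the article does not describe it.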


The chart above shows how accurately Anthropic’s Claude 2.1 was able to recall the embedded fact at a given document depth and context length. Each red block represents a failure to recall. As the chart makes evident, the LLM’s recall ability progressively degrades as the context length increases.

GPT-4 Test Results

For comparison, the results of a similar test conducted with OpenAI’s GPT-4 are displayed above. Here, both the depth at which the fact was embedded and the context length were varied in 15 distinct increments. Head over to this X post for further details.

Note that GPT-4 registers materially fewer complete (100 percent) recall failures at its maximum context window length of 128K tokens.

We had noted in a previous post that GPT-4 outscored xAI’s Grok and Anthropic’s Claude 2 LLMs in a held-out math exam. It remains to be seen how Claude 2.1 performs against GPT-4 in the same setting.
