X.ai Announces Grok-1.5V Multimodal Foundation Model and a New Benchmark
This should strengthen Musk's argument for a decacorn-level valuation
X.ai announced a new multimodal large language model (LLM) yesterday: Grok-1.5V. The “V” stands for vision; the model can interpret both text and images. It arrived the same week OpenAI’s GPT-4 Turbo with Vision model became generally available. From the announcement:
Introducing Grok-1.5V, our first-generation multimodal model. In addition to its strong text capabilities, Grok can now process a wide variety of visual information, including documents, diagrams, charts, screenshots, and photographs. Grok-1.5V will be available soon to our early testers and existing Grok users.
Grok-1.5V is competitive with existing frontier multimodal models in a number of domains, ranging from multi-disciplinary reasoning to understanding documents, science diagrams, charts, screenshots, and photographs. We are particularly excited about Grok’s capabilities in understanding our physical world. Grok outperforms its peers in our new RealWorldQA benchmark that measures real-world spatial understanding.
With the usual caveats about cherry-picked benchmark comparisons, Grok does look very competitive with the leading frontier LLMs such as GPT-4 and Claude 3 Opus. Across the MMMU, MathVista, AI2D, TextVQA, ChartQA, DocVQA, and RealWorldQA benchmarks, Grok-1.5V scored comparably to the other models. Google’s self-reported Gemini Ultra results for these evaluations are not shown, but they generally outpace Grok-1.5V. However, the difference is not substantial, and independent testing by some CMU AI researchers suggested Gemini Ultra performs slightly worse than Google reported.
The key takeaway is that Grok-1.5V appears to compare favorably with the leading multimodal LLMs. This is a significant development for the LLM competitive landscape, especially if Grok-1.5 is released under an open-source license, as Grok-1 was.
Multimodal Examples
X.ai provided several examples of Grok-1.5V correctly interpreting images, including calculating calories, explaining a meme, and converting a table to CSV format. Other examples included generating a story from a simple hand drawing, diagnosing rotten wood on a deck, and solving two software coding problems. You can see all of them here.
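For readers curious what a task like the table-to-CSV conversion looks like in practice, here is a minimal sketch of a multimodal request in the widely used OpenAI-compatible chat message format. X.ai had not published a public Grok API at the time of the announcement, so the endpoint, key, and model name below are placeholders for illustration, not a documented interface.

```python
# Hypothetical sketch: sending an image of a table to a multimodal chat model
# and asking for CSV output. The endpoint and model name are placeholders;
# X.ai had not published a Grok API when Grok-1.5V was announced.
import base64
import requests

API_URL = "https://api.example.com/v1/chat/completions"  # placeholder endpoint
API_KEY = "YOUR_KEY"  # placeholder credential

# Encode the local image as a base64 data URL, the common convention for
# inline images in OpenAI-style multimodal requests.
with open("table.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "model": "grok-1.5v",  # placeholder model identifier
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Convert the table in this image to CSV."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
}

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=60,
)
print(response.json()["choices"][0]["message"]["content"])
```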
I wrote recently about Yann LeCun’s hypothesis that autoregressive LLMs similar to GPT-4 and Grok-1.5V cannot reason and do not have a world model. His argument is compelling. However, when you look at the meme interpretation, it’s hard not to think these foundation models exhibit both. Feel free to draw your own conclusions on those points. I’m still leaning toward LeCun’s position, but you can see why X.ai and OpenAI freely talk about LLMs’ ability to reason.
A New Benchmark
The Grok-1.5V announcement coincided with X.ai releasing a new public benchmark for multimodal LLMs with vision capabilities. X.ai described it this way:
In order to develop useful real-world AI assistants, it is crucial to advance a model's understanding of the physical world. Towards this goal, we are introducing a new benchmark, RealWorldQA. This benchmark is designed to evaluate basic real-world spatial understanding capabilities of multimodal models. While many of the examples in the current benchmark are relatively easy for humans, they often pose a challenge for frontier models.
The initial release of the RealWorldQA consists of over 700 images, with a question and easily verifiable answer for each image. The dataset consists of anonymized images taken from vehicles, in addition to other real-world images. We are excited to release RealWorldQA to the community, and we intend to expand it as our multimodal models improve. RealWorldQA is released under CC BY-ND 4.0.
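Because each item pairs one image with a question and a short, easily verifiable answer, scoring reduces to exact-match accuracy. Here is a minimal sketch of an evaluation loop over the dataset, assuming a hypothetical JSONL layout of image/question/answer records; X.ai has not specified the file format here, so that structure is my assumption.

```python
# Hypothetical sketch of scoring a model on RealWorldQA-style data. The file
# layout is an assumption: a JSONL file of {"image", "question", "answer"}
# records, mirroring X.ai's description of one question and one easily
# verifiable answer per image.
import json

def evaluate(records_path: str, ask_model) -> float:
    """Return exact-match accuracy. `ask_model(image_path, question)` is any
    callable that queries a multimodal model and returns its text answer."""
    correct = total = 0
    with open(records_path) as f:
        for line in f:
            record = json.loads(line)
            prediction = ask_model(record["image"], record["question"])
            # Normalize whitespace and case before the exact-match comparison.
            correct += prediction.strip().lower() == record["answer"].strip().lower()
            total += 1
    return correct / total if total else 0.0
```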
The dataset’s images, particularly those collected from vehicles, almost certainly draw on data gathered to train Tesla’s self-driving AI models. They could come from elsewhere, but Musk already integrates Grok technology into the X social media platform, and at one time he proposed that Tesla acquire OpenAI.
Musk’s ownership of several companies that employ AI (X.ai, X, Tesla, SpaceX, and Neuralink, among others) or hold large stores of language and image data (X and Tesla) gives X.ai a significant market advantage. Large quantities of high-quality data are likely to drive the next wave of AI foundation model advances.
It is unclear whether other organizations will also test on this benchmark, but it looks like a strong contribution to the corpus of foundation model test data. Since X.ai referred to this as an initial release, it is likely the corpus will grow significantly in the future.
Grokmentum
X.ai was founded in July 2023 and debuted its first LLM, Grok-1, less than four months later. It also introduced a beta of the Grok AI assistant on the X platform, available exclusively to X Premium+ subscribers. In March, the company released Grok-1 under an Apache 2.0 open-source license and, in early April, announced the Grok-1.5 model. With its 1.5V model arriving less than two weeks later, it is clear that X.ai has a lot of product momentum.
The other companies with similarly torrid product release cadences are OpenAI, Google, Mistral, and Anthropic. OpenAI has introduced at least nine models since GPT-4. Google has introduced at least eight proprietary and two open-source models. Mistral has introduced six. Anthropic introduced Claude 1.3, Claude 2, Claude 2.1, and three models in the Claude 3 family. Meta also qualifies for this category, though mostly because of its wide variety of models, most of which are not available to the public.
The number of models released does not necessarily correlate with users. However, it is a market signal to take the developers seriously, particularly those with ever-increasing scores on public benchmarks. A key concern for many corporate buyers is selecting a vendor that falls behind state-of-the-art (SOTA) progress, leaving them struggling to access new capabilities. Frequent model updates are no guarantee, but they are a signal that the vendor is likely to keep pace with market developments. X.ai has quickly joined this cohort, characterized by strong performance and fast innovation cycles.
It seems many market observers have underestimated Musk and X.ai. It’s unclear whether this is driven by acrimony associated with his acquisition of Twitter, a belief that X.ai was too late to market, or another reason. However, Grok-1.5V should be viewed as a wake-up call for developers interested in open-source models. X.ai is clearly not a hobby project for Musk. It now appears to be among the leaders in AI foundation model development.
These developments may justify, to some investors, Musk’s hoped-for $18 billion valuation for X.ai a mere ten months after its founding. Granted, that will still be a stretch in the highly competitive AI foundation model market. Add a few large customers and a cloud hosting partnership with AWS to the model performance momentum, and the argument at least looks more plausible than it did a month ago. Still, the gulf between technology innovation and customer adoption can be very large. This is Musk’s challenge, and frankly, he has a good track record on this front.