Google's Gemini LLM Arrives Next Week and It May Just Outperform GPT-4 (sort of)
Access to Gemini Pro Begins December 13th, and a New Bard Is Coming
Google CEO Sundar Pichai announced in a blog post that the Gemini large language model (LLM) will launch next week as an API in Google AI Studio and in Vertex AI on Google Cloud. Gemini is positioned as a replacement for the company’s current flagship LLM, PaLM, making it the third flagship LLM family Google has fielded this year, after PaLM and its predecessor, LaMDA. There is no word on PaLM’s future, but it will likely be deprecated at some point in favor of the new Gemini foundation models.
Demis Hassabis, CEO and co-founder of Google DeepMind, wrote today that Gemini 1.0 will be offered in three sizes:
Gemini is also our most flexible model yet — able to efficiently run on everything from data centers to mobile devices. Its state-of-the-art capabilities will significantly enhance the way developers and enterprise customers build and scale with AI.
We’ve optimized Gemini 1.0, our first version, for three different sizes:
Gemini Ultra — our largest and most capable model for highly complex tasks.
Gemini Pro — our best model for scaling across a wide range of tasks.
Gemini Nano — our most efficient model for on-device tasks.
A fine-tuned version of Gemini Pro will power Google Bard starting today. While Bard currently supports more than 40 languages and is available in 230 countries, the Gemini update will initially be English-only and text-only, available in 170 countries. Hassabis said that new languages and modalities will arrive “in the near future.”
Next year, Gemini is expected to arrive in other Google products such as Duet AI, the Pixel 8, Ads, Search, and the Search Generative Experience (SGE). The SGE upgrade is expected to reduce latency by as much as 40% while delivering better quality. These will presumably all be powered by Gemini Pro. Gemini Ultra, the highest-performing model, won’t be available until early next year.
Beating GPT-4 in Text and Reasoning
Gemini’s comparison with OpenAI’s GPT-4 is the most significant element of the long-awaited announcement. The Gemini technical paper lists a large set of benchmarks and other tests, and Google DeepMind’s chief scientist, Jeff Dean, called out Gemini’s state-of-the-art performance on 30 of 32 benchmarks.
The most notable score is Gemini Ultra’s 90.04% on MMLU, which covers general knowledge across 57 subjects, using chain-of-thought (CoT) prompting. That compares favorably with GPT-4’s 87.29%. However, GPT-4 still wins the 5-shot (i.e., five worked examples) evaluation, at 86.4% to Gemini’s 83.7%. The 90% figure is called out as a significant milestone: it is the first time an LLM has surpassed “human expert performance” on the benchmark.
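For readers unfamiliar with the “5-shot” terminology, the setup can be sketched in a few lines of Python: the evaluation simply prepends five worked examples to the question being scored, so the model can infer the expected answer format. The example items below are invented for illustration; real MMLU questions span 57 academic and professional subjects.

```python
# Sketch of how a few-shot prompt is assembled for an MMLU-style
# multiple-choice evaluation. The sample items are invented; a real
# 5-shot run would prepend five questions from the same subject.

def build_few_shot_prompt(examples, question, choices):
    """Prepend worked examples (shots) to the question under test."""
    parts = []
    for ex in examples:
        opts = "\n".join(f"({k}) {v}" for k, v in ex["choices"].items())
        parts.append(f"Question: {ex['question']}\n{opts}\nAnswer: ({ex['answer']})")
    opts = "\n".join(f"({k}) {v}" for k, v in choices.items())
    # The prompt ends mid-answer so the model completes the choice letter.
    parts.append(f"Question: {question}\n{opts}\nAnswer: (")
    return "\n\n".join(parts)

shots = [
    {"question": "What is 2 + 2?",
     "choices": {"A": "3", "B": "4", "C": "5", "D": "6"},
     "answer": "B"},
]  # a 5-shot evaluation would use five such items

prompt = build_few_shot_prompt(
    shots,
    "Which planet is closest to the Sun?",
    {"A": "Venus", "B": "Earth", "C": "Mercury", "D": "Mars"},
)
print(prompt)
```

The CoT variant differs in that each shot’s answer includes a written-out reasoning chain before the final choice, which is what lifted Gemini Ultra past the 90% mark.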
Table 2 of the technical report shows that Gemini also beats GPT-4 on other popular benchmarks, such as GSM8K, MATH, and HumanEval. However, GPT-4 still holds a substantial lead on HellaSwag.
The report also includes a note on data contamination affecting at least one benchmark, and the affected result was removed. Data contamination occurs when the pre-training dataset includes material from the benchmark itself. For HellaSwag, overlapping material appears in many LLM training datasets, and Google removed it before training Gemini. When that material was added back into training, Gemini achieved a score comparable to GPT-4’s.
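The decontamination step described above is typically implemented as an n-gram overlap filter over the training corpus. Here is a minimal sketch of the idea; the 8-gram window and the sample strings are illustrative assumptions, not Google’s actual procedure.

```python
# Illustrative n-gram overlap filter of the kind used to detect benchmark
# contamination in a pre-training corpus. The 8-gram window and the sample
# documents are invented for illustration, not Google's actual pipeline.

def ngrams(text, n=8):
    """Return the set of lowercase word n-grams in a string."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(document, benchmark_items, n=8):
    """Flag a training document that shares any n-gram with a benchmark item."""
    doc_grams = ngrams(document, n)
    return any(doc_grams & ngrams(item, n) for item in benchmark_items)

benchmark = ["the quick brown fox jumps over the lazy dog near the river"]
clean_doc = "an entirely unrelated paragraph about cloud computing and large models"
leaky_doc = "as noted the quick brown fox jumps over the lazy dog near the bank"

print(is_contaminated(clean_doc, benchmark))  # → False (no shared 8-grams)
print(is_contaminated(leaky_doc, benchmark))  # → True (shares an 8-gram)
```

Documents flagged this way are dropped before training, which is why adding the overlapping material back in, as Google’s ablation did, lifts the benchmark score.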
Google’s researchers also note that Gemini Pro compares favorably to GPT-3.5. As the smaller model in its family, Gemini Pro is positioned as an alternative to GPT-3.5, a model many other LLM developers also like to benchmark against, since its performance trails GPT-4 considerably across numerous tests. Google’s position seems to be that Pro and GPT-3.5 are the smaller LLMs in their respective model families and can be compared directly. Gemini Pro posted higher results in six of eight head-to-head tests with GPT-3.5 covering text, coding, and reasoning.
Image and Video Understanding
Gemini Ultra also posted better results than GPT-4V for image understanding, and Gemini Pro performed comparably to or better than GPT-4V in two of eight benchmarks in the category. Overall, the gap between Gemini Ultra and GPT-4 on image understanding tasks is small.
A potentially more significant result is Gemini Ultra’s performance in video understanding. This is a category where you might expect Google to establish a technical lead, given its large video catalog on YouTube and the features it could develop for users. There is no comparable data for GPT-4 for these tests, though the Gemini results for both the Ultra and Pro models outpace previous SoTA scores.
There is so much video content today that extracting information from it could support numerous use cases. You can pull useful data from transcripts and conduct textual analysis, but that overlooks the rich information resident in the visuals. Google is positioning Gemini as a model that can automate the extraction of both language and visual information from video.
Bard Upgrade
Bard is also getting an immediate upgrade with a switchover from the PaLM 2 model to the Gemini Pro model for the English-language version. Sissie Hsiao, Google’s vice president and general manager of Google Assistant and Bard, commented in a blog post:
Gemini is rolling out to Bard in two phases: Starting today, Bard will use a specifically tuned version of Gemini Pro in English for more advanced reasoning, planning, understanding and more. And early next year, we’ll introduce Bard Advanced, which gives you first access to our most advanced models and capabilities — starting with Gemini Ultra…
We’ve specifically tuned Gemini Pro in Bard to be far more capable at things like understanding, summarizing, reasoning, coding and planning. And we’re seeing great results: In blind evaluations with our third-party raters, Bard is now the most preferred free chatbot compared to leading alternatives.
That sounds like Google is saying Bard with Gemini is preferred to ChatGPT, but the language is vague, so it could mean Anthropic’s Claude. Regardless, there is a pretty amusing video of YouTube science creator Mark Rober using Bard to plan and execute a new challenge.
An upgraded Bard will be necessary to compete effectively with ChatGPT, which has a significant lead in mindshare and user base. Gemini Pro capabilities should help close the feature gap, but it may take the Gemini Ultra features to make Bard stand out. I attempted to try Bard with Gemini Pro earlier today, but it is unclear whether the update has been deployed yet.
The Alternative to GPT-4?
While Bard fights to become the primary alternative to ChatGPT, Gemini is angling to be the primary alternative to GPT-4. Buyers in every market want alternatives. There are many choices in the LLM segment. Anthropic has a lot of momentum as a key alternative, as evidenced by the strong interest in accessing its models during OpenAI’s recent management crisis. However, the runner-up position is a wide-open competition, with Meta, Amazon, and now Google making strong moves.
Any company wanting to dethrone OpenAI must first become the top alternative and build upon that beachhead. Gemini is the first model to claim across-the-board superiority to GPT-4. With Google’s resources behind it, Gemini has a shot.
The key drawback will be that Google’s LLMs are only available through Google Cloud, a distant third in cloud hyperscaler market share. That will put a cap on Gemini’s attractiveness. A few enterprises are trying Azure for the first time in order to access OpenAI’s models, but the appetite for switching to Google Cloud will likely be lower. Google will surely announce some high-profile Gemini customers who are already Google Cloud users, and others who want free computing credits, but the barrier to adoption is real.
Of course, Google could offer Gemini through AWS, which would be a strategic masterstroke for both companies. I suspect the likelihood of that scenario is slim. Regardless, using Gemini within Google applications will be a significant benefit over PaLM’s somewhat limited repertoire.
This means that a Gemini-fueled Bard might be the better near-term opportunity for a break-out hit. ChatGPT has a lot of users, but the generative AI assistant market is in its infancy, and there is no obvious alternative with any significant user base.
I agree with my colleague Eric Schwartz that Google should rename Bard as Gemini and relaunch the assistant in the first quarter with the new features. It would be the fastest way to make a big splash and quickly establish the primary alternative to ChatGPT. Google is probably better positioned to win in the consumer generative AI space. A rebranding is also unlikely, but it would certainly be interesting.
Let me know what you think in the comments.