GPT-4 is Better Than GPT-3.5 - Here Are Some Key Differences
Numerous Upgrades Available to ChatGPT Plus Users and Some Developers
TL;DR
OpenAI announced GPT-4 today and it is available to a limited set of API developers and to ChatGPT Plus subscribers.
GPT-4’s multimodal features include only image inputs at this point, and those are available only to developers through the API.
GPT-4 outperforms GPT-3.5 in just about every evaluation, though it is slower to generate outputs, likely because it is a larger model.
GPT-4 also apparently outperforms both GPT-3.5 and Anthropic’s latest model on truthfulness. OpenAI also showed data suggesting hallucinations are less frequent in GPT-4 than in GPT-3.5, which translates into a higher factuality score (the inverse of hallucination frequency).
An evaluation by Synthedia in ChatGPT Plus confirms that GPT-4 is slower to respond but does appear to produce superior output.
OpenAI did not disclose how many parameters are in the GPT-4 model, breaking from past practice, and instead talked solely about performance improvements.
The New Bing search engine already uses the GPT-4 model with some Microsoft customizations for the search use case.
The Wait is Over … Sort of
OpenAI today announced the availability of GPT-4. It is better than GPT-3.5 in many ways but slower to generate results than its predecessor. That is likely a function of the model’s size and the fact that it may take time to optimize performance. The official announcement on OpenAI’s blog stated:
We’ve created GPT-4, the latest milestone in OpenAI’s effort in scaling up deep learning. GPT-4 is a large multimodal model (accepting image and text inputs, emitting text outputs) that, while less capable than humans in many real-world scenarios, exhibits human-level performance on various professional and academic benchmarks. For example, it passes a simulated bar exam with a score around the top 10% of test takers; in contrast, GPT-3.5’s score was around the bottom 10%. We’ve spent 6 months iteratively aligning GPT-4 using lessons from our adversarial testing program as well as ChatGPT, resulting in our best-ever results (though far from perfect) on factuality, steerability, and refusing to go outside of guardrails…
In a casual conversation, the distinction between GPT-3.5 and GPT-4 can be subtle. The difference comes out when the complexity of the task reaches a sufficient threshold—GPT-4 is more reliable, creative, and able to handle much more nuanced instructions than GPT-3.5.
Following User Instructions More Closely
Greg Brockman, president and co-founder of OpenAI, started off the developer demonstration by commenting:
The first thing I want to show you is the first task that GPT-4 can do that we never really got GPT-3.5 to do…You can paste anything you want as a user and the model will return messages as an assistant. The way to think of it is we are moving away from raw text in and raw text out, where you can’t tell where different parts of the conversation come from, towards this much more structured format that gives the model the opportunity to know this is the user asking me to do something that the developer didn’t intend.
He then demonstrated how GPT-4 could summarize an article in a single sentence in which every word began with the letter “g,” and repeated the trick with “a” and “q.” He also showed how GPT-4 could compare two articles and extract common themes. If you watch the video, be sure to start after 54 minutes because otherwise you will just be watching a blank screen.
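To make the point about structure concrete, here is a minimal sketch of what the chat-style format looks like through the OpenAI Python library. The model name, system prompt, and user message below are illustrative placeholders, not taken from the demo, and the exact client syntax may differ depending on the library version you have installed.

```python
# A minimal sketch of the structured chat format, assuming the openai Python
# package as it existed around the GPT-4 launch. The prompts below are
# illustrative placeholders, not the ones used in the demo.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        # The system message carries the developer's instructions.
        {"role": "system", "content": "You are a summarizer. Reply with one sentence."},
        # The user message is labeled separately, so the model can tell the
        # user's request apart from the developer's intent.
        {"role": "user", "content": "Summarize this article: <article text>"},
    ],
)

print(response["choices"][0]["message"]["content"])
```

The point is the separation of roles: because the user’s input arrives in its own clearly labeled message, the model can recognize when a user is asking for something the developer’s instructions did not intend.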
Hallucinations Remain but Are Less Frequent
Another update that will interest readers is the reduction in hallucinations in GPT-4 compared to earlier models. OpenAI positions this as an improvement in factuality.
While still a real issue, GPT-4 significantly reduces hallucinations relative to previous models (which have themselves been improving with each iteration). GPT-4 scores 40% higher than our latest GPT-3.5 on our internal adversarial factuality evaluations:
OpenAI also showed results for GPT-3.5, GPT-4, and Anthropic’s model on the TruthfulQA benchmark. The benchmark is an 817-question test across a variety of categories designed to measure the truthfulness of a large language model (LLM), that is, how infrequently it generates incorrect answers. Both of OpenAI’s models show superior performance according to the data presented. To be clear, this appears to be OpenAI data. Anthropic may have something to say about it at some point.
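For readers curious about what TruthfulQA actually contains, the questions are publicly available on the Hugging Face hub under the identifier "truthful_qa." Below is a rough sketch of pulling a few questions and sending them to a model; the model name and client syntax are my own assumptions, and simply printing the answers is not the benchmark’s official grading procedure.

```python
# A rough sketch of sampling TruthfulQA questions, not the official evaluation.
# Assumes the `datasets` and `openai` packages; field names follow the public
# "truthful_qa" dataset card on the Hugging Face hub.
import openai
from datasets import load_dataset

openai.api_key = "YOUR_API_KEY"  # placeholder

questions = load_dataset("truthful_qa", "generation", split="validation")

for row in questions.select(range(3)):  # a handful of questions for illustration
    answer = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": row["question"]}],
    )["choices"][0]["message"]["content"]
    print("Q:", row["question"])
    print("Model:", answer)
    print("Reference best answer:", row["best_answer"])
```

Proper scoring compares each answer against the benchmark’s curated correct and incorrect reference answers rather than eyeballing the output as this sketch does.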
Comparing GPT-4 and GPT-3.5 in ChatGPT
OpenAI suggested a limited number of developers can now access the GPT-4 API, and everyone else can get access through ChatGPT. That second statement appears to be a bit misleading. GPT-4 is now an option for ChatGPT Plus subscribers but does not appear as an option for the free ChatGPT version.
In ChatGPT Plus, OpenAI provides a model selection drop-down at the top of the page, and as you select each model, it provides a short summary of the model and how the company ranks its performance in reasoning, speed, and conciseness.
Legacy GPT-3.5 is the model ChatGPT launched with in November 2022. The Default GPT-3.5 is the new turbo model announced on March 1st. So, you are really choosing between Default and GPT-4. Keeping the Legacy model visible is a savvy move by OpenAI, since many users may not yet have tried the new Default turbo model, and now they have a way to assess the difference.
The ratings show the Default model as inferior to GPT-4 in terms of reasoning and conciseness. However, GPT-4 is no match for the Default GPT-3.5 model for speed. Bigger model, more latency? That seems likely.
However, what is missing from all of the discussion around GPT-4 is the model size. I could not find a reference to this formerly all-important LLM talking point in the OpenAI announcements or the research paper the company published about GPT-4. We noted this was also true for AI21 Labs’ new Jurassic-2 model announced last week. The absence of this data point may signal a new trend of LLM vendors revealing less about their models.
Multimodal Limitations
The multimodal features are less ambitious than many people expected. You may recall that Microsoft Germany’s CTO said GPT-4 would be announced this week, would be multimodal, and would include video. I guess two out of three isn’t too bad. Or maybe one and a half, since the image input isn’t really available yet except for one limited-access app.
GPT-4 will allow text and image inputs with text outputs for the developer API edition and for the Be My Eyes app. That app looks extraordinary and merits its own dedicated newsletter article. However, the image input option is not available in ChatGPT. For most people, GPT-4 will remain single-mode, text in and text out, until new applications are released.
The Drip Feature Release Approach
I mentioned in the article about GPT-4’s imminent release that OpenAI CEO Sam Altman had suggested GPT-4 features may be implemented over time as they undergo additional testing. That is what we saw today: performance improvements in the core model and more faithfulness to the user’s prompt. Most of the novel features will arrive in future GPT-4 updates.
The demo of the Be My Eyes app showed a multimodal input capability that, for the moment, is limited to blind users. To be clear, the demo showed a more robust set of use cases than any image input capability I have seen to date. It will be particularly helpful for blind and low-vision users, but it is also likely to be popular with sighted users.
GPT-4 won’t be a game changer right now, but it is likely to be the platform of choice very soon. Well, that will definitely be true if the token cost is comparable and OpenAI can improve the speed of text generation. 😀
One last point. Microsoft’s Yusuf Mehdi confirmed today that the New Bing search engine uses GPT-4.
We are happy to confirm that the new Bing is running on GPT-4, which we’ve customized for search. If you’ve used the new Bing preview at any time in the last five weeks, you’ve already experienced an early version of this powerful model. As OpenAI makes updates to GPT-4 and beyond, Bing benefits from those improvements.