Falcon-40B LLM, Trained on 1 Trillion Data Tokens, Knocks LLaMA from Open-Source LLM Top Spot
Abu Dhabi's Technology Innovation Institute makes a strong debut
There’s a new large language model (LLM) leader on the Open LLM Leaderboard hosted by Hugging Face. Falcon-40B from the Technology Innovation Institute (TII) is a 40-billion-parameter LLM trained on 1 trillion tokens from the Falcon RefinedWeb dataset.
Meta’s LLaMA-65B model and a variant called LLaMA-30B-Supercot had been at the top of the open-source LLM ranking since LLaMA debuted in February. The leaderboard calculates an average performance across four benchmarks from the EleutherAI Language Model Evaluation Harness. Falcon-40B scored an average of 60.4, beating the 59.8 for LLaMA-30B-Supercot and 58.3 for LLaMA-65B.
Tests include the AI2 Reasoning Challenge (science questions), HellaSwag (commonsense inference), MMLU (multitask accuracy across elementary mathematics, US history, computer science, law, and other subjects), and TruthfulQA (how truthfully the model answers questions). Falcon-40B bested LLaMA-65B in every test and was ahead of Supercot in each test except TruthfulQA.
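For readers who want to reproduce scores like these, the benchmarks can be run directly with EleutherAI’s lm-evaluation-harness; the leaderboard average is simply the mean of the four benchmark scores. Below is a minimal sketch using the harness’s Python API as it looked in mid-2023. The function signature and task names are assumptions that may have changed since (MMLU maps to the harness’s hendrycksTest-* subtasks), so check the project’s repository for the current interface.

```python
# Sketch only: follows EleutherAI's lm-evaluation-harness
# (https://github.com/EleutherAI/lm-evaluation-harness) circa mid-2023.
# Falcon-7B is used here because the 40B model needs multi-GPU hardware.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",
    model_args="pretrained=tiiuae/falcon-7b",
    # MMLU is the set of hendrycksTest-* subtasks, omitted here for brevity.
    tasks=["arc_challenge", "hellaswag", "truthfulqa_mc"],
    batch_size=8,
)
print(results["results"])  # per-task metrics; average them for a leaderboard-style score
```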
Why This Matters
It may seem that all of the action is centered on OpenAI, Microsoft, and Google, with some interesting start-ups and Nvidia thrown into the mix. The common thread among these companies is proprietary LLMs. Open source provides a true market alternative that is not completely controlled by a single company.
Of course, that is only true if the open-source foundation models deliver sufficient performance. More on that topic below, but the general trend favors open source.
The New Model
Thomas Wolf, a Hugging Face co-founder, posted about the achievement on LinkedIn and pointed out a couple of interesting features. The model card largely echoes Wolf’s comments and adds some additional color, including:
It is the best open-source model currently available. Falcon-40B outperforms LLaMA, StableLM, RedPajama, MPT, etc. See the OpenLLM Leaderboard.
It features an architecture optimized for inference, with FlashAttention (Dao et al., 2022) and multiquery attention (Shazeer et al., 2019); see the sketch after this list.
It is made available under a license allowing commercial use; see the details of the TII Falcon LLM License below.
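Neither FlashAttention nor multiquery attention is Falcon-specific, and TII’s actual implementation is not reproduced here. As a minimal PyTorch sketch of the multiquery idea, in which every query head shares a single key/value head to shrink the inference-time KV cache:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiQueryAttention(nn.Module):
    """Multi-query attention (Shazeer, 2019): all query heads share one
    key/value head, shrinking the KV cache during autoregressive decoding."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)             # one query per head
        self.kv_proj = nn.Linear(d_model, 2 * self.head_dim)  # a single shared K and V
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k, v = self.kv_proj(x).split(self.head_dim, dim=-1)
        # Broadcast the single K/V head across all query heads.
        k = k.unsqueeze(1).expand(-1, self.n_heads, -1, -1)
        v = v.unsqueeze(1).expand(-1, self.n_heads, -1, -1)
        # On PyTorch >= 2.0 this dispatches to a FlashAttention-style
        # fused kernel on supported GPUs.
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out_proj(y.transpose(1, 2).reshape(b, t, d))


attn = MultiQueryAttention(d_model=64, n_heads=8)
out = attn(torch.randn(2, 16, 64))  # -> shape (2, 16, 64)
```

The design choice matters at serving time: with one key/value head instead of n_heads of them, the cache kept in GPU memory per generated token shrinks by roughly a factor of n_heads.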
Note that LLaMA was developed by Meta and only offers a non-commercial license. It was released only to academic researchers, but it didn’t take long for the weights to leak on the internet. The Vicuna models are fine-tuned derivatives of LLaMA and carry the same non-commercial restriction, as do StableLM’s chat-tuned variants. Together, the primary developer behind RedPajama, just raised $20 million and offers RedPajama under an Apache 2.0 license that permits commercial use.
Falcon-40B offers a licensing agreement that TII says is based on Apache 2.0. A key difference is that commercial use requires a royalty to be negotiated with TII, with a base rate of 10% that will be reviewed annually. So, it is open-source, but not free if you plan to generate revenue.
Big Data
However, the most notable element of the Falcon-40B LLM may be its training dataset. The 1,000B tokens translate into 1 trillion, twice the size of Nemo-Megatron’s training corpus. This is likely a key reason it outperforms LLaMA-65B despite having fewer parameters.
The training dataset is called Falcon RefinedWeb. It was built by TII and is available under a straight Apache 2.0 open-source license. According to the dataset card:
RefinedWeb is built through stringent filtering and large-scale deduplication of CommonCrawl; we found models trained on RefinedWeb to achieve performance in-line or better than models trained on curated datasets, while only relying on web data.
RefinedWeb is also "multimodal-friendly": it contains links and alt texts for images in processed samples.
Falcon RefinedWeb was created to serve as an English large-scale dataset for the pretraining of large language models. It may be used on its own, or augmented with curated sources (e.g., Wikipedia, StackOverflow).
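For anyone who wants to look at the data itself, the dataset is hosted on the Hugging Face Hub. Here is a minimal sketch using the datasets library, assuming the tiiuae/falcon-refinedweb repo id and the content and image_urls field names from the public dataset card (verify both before relying on them):

```python
from datasets import load_dataset

# Stream RefinedWeb rather than downloading the full ~1T-token corpus locally.
# Repo id and field names follow the public dataset card and may change.
ds = load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)

for sample in ds.take(3):
    print(sample["content"][:200])   # raw web text
    print(sample.get("image_urls"))  # image links/alt text kept for multimodal use
```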
While the Falcon licensing approach may dissuade some users from adopting the model, RefinedWeb is almost certain to become a popular training dataset. It has scale, is said to be easily augmented, and the deduplication and curation efforts appear to have delivered high quality.
Multiple Model Options
Falcon-40B is a general-purpose model. There is also a Falcon-40B-instruct model, which is fine-tuned for chat applications and carries a similar commercial licensing approach to the foundation model.
There is also a Falcon-7B model, which currently ranks 24th on the Open LLM Leaderboard. It outperformed several other 7-billion-parameter models, though not all of them, and provides an option in the same model family that can be run at lower cost.
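As a rough sketch of how the 7B model can be tried with the Hugging Face transformers library (model ids follow the Falcon model cards; a GPU with roughly 16 GB of memory is assumed for bfloat16 weights):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "tiiuae/falcon-7b"  # or "tiiuae/falcon-7b-instruct" for chat-style use

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # halves memory vs. fp32
    trust_remote_code=True,      # Falcon shipped custom modeling code on the Hub
    device_map="auto",           # spread layers across available devices
)

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(generator("Open-source language models are", max_new_tokens=50)[0]["generated_text"])
```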
What is TII?
The Technology Innovation Institute is “part of Abu Dhabi Government’s Advanced Technology Research Council, which oversees technology research in the emirate.” Its mission is to “help society overcome its biggest hurdles through a rigorous approach to scientific discovery and inquiry, using state-of-the-art facilities and collaboration with leading international institutions.”
Open Source LLMs on the Rise
The rapid improvement of open-source LLMs was the main topic of the leaked Google engineer’s memo titled “We Have No Moat, and Neither Does OpenAI.” That document pointed out several advantages of open-source models and disadvantages of the very large LLMs, including:
Retraining models from scratch is the hard part (and the very expensive part)
Large models aren’t more capable in the long run if we can iterate faster on small models.
Data quality scales better than data size
Directly Competing With Open Source Is a Losing Proposition
This recent progress has direct, immediate implications for our business strategy. Who would pay for a Google product with usage restrictions if there is a free, high quality alternative without them?
The emergence of Falcon-40B follows the memo’s themes of smaller models, faster iteration cycles, and improved data quality. Of course, this model is not fully open source for commercial use, unlike several other models.
Whether or not TII’s foundation model becomes popular, you should expect another open-source model to surpass its performance relatively soon. That model might even use an augmented or refined version of the Falcon RefinedWeb training dataset.
It is unclear how these models perform against OpenAI’s GPT-3 or GPT-4, but there are some indicators that the gap is closing.
A study by LMSYS.org found that the Vicuna-13B model had reached 92% of ChatGPT’s performance. For many people, ChatGPT’s performance exceeds requirements, and a 90% solution could be good enough. That factor, along with steadily improving performance, the ability to control the model, and the option to make modifications, may make leading open-source models more attractive in the coming year.
For now, Falcon-40B is sure to get a close look from a wide variety of developers.
Impressive.