RedPajama's Giant 30T Token Dataset Shows that Data is the Next Frontier in LLMs
Data quality is important, but volume is essential
Move over, parameters. Training data tokens are the new metric driving large language model (LLM) performance, and Together AI has an impressive new open-source corpus. A year ago, parameter count was the most commonly cited metric when discussing an LLM. 175B parameters for GPT-3. Megatron-Turing at 530B. LaMDA at 137B.
However, times have changed. LLM providers have learned that adding parameters doesn't always deliver superior quality, but it does always increase processing latency and cost. A same-sized model trained on more and better data delivers better results, and smaller models trained on more and better data can often outperform larger ones while consuming less compute and costing less to run.
Llama 2 was trained on 2.4T tokens and PaLM 2 on 3.6T tokens. GPT-4 is thought to have been trained on roughly 4T tokens and may use a novel architecture of submodels to consume all of that data. Billions is so 2022. Trillions are the benchmark for training datasets in 2023, and they keep getting bigger.
Together AI introduced a 1 trillion (1T) token dataset called RedPajama in April 2023. A few days ago, it released RedPajama-V2, an open-source 30T token dataset drawn from over 100 billion documents. According to the announcement:
Over the last half a year, we have been pleased to see that RedPajama-1T, which we released in March, has ignited the creation of many new language models. So many people from the community have downloaded this 5TB dataset, more than 190,000 times, and have been using it in such creative ways! RedPajama-1T consists of 1 trillion high-quality English tokens, but it was only the first step. Today, with the release of RedPajama-V2, we are making a further step towards the development of open datasets by releasing a massive, 30 trillion token web dataset. This is, to our best knowledge, the largest public dataset released specifically for LLM training. Even more excitingly, we include 40+ pre-computed quality annotations, allowing the community to further filter and weigh the data. Specifically, this release includes:
Over 100 billion text documents with 100+ trillion raw tokens from 84 CommonCrawl dumps;
40+ of the most widely used quality annotations pre-computed for a deduplicated 30 trillion tokens subset;
Five languages: English, French, Spanish, German, and Italian;
All data processing scripts are open source and available on GitHub; all data are available on HuggingFace.
Just because you can easily access 30T tokens doesn't mean you will use them all to train your model; it depends on what you are trying to accomplish. That is what the annotations are for: they make it easier to extract the subsets of data that best meet your objectives. The critical development is that larger and more expertly filtered datasets are being made available to the open-source community.
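For a sense of what working with the annotations might look like, here is a minimal sketch using the Hugging Face `datasets` library. The dataset identifier, the `sample` configuration, and the `quality_signals` / `raw_content` field names follow the RedPajama-V2 dataset card as I understand it and may differ or change; treat them as assumptions to verify, not a definitive recipe.

```python
# Sketch: stream a slice of RedPajama-V2 and keep only documents whose
# pre-computed quality signals clear a simple threshold. Field and signal
# names ("quality_signals", "raw_content", "rps_doc_word_count") follow the
# dataset card and may need adjusting; this is illustrative, not canonical.
import json
from datasets import load_dataset

# Stream the small "sample" configuration rather than downloading 30T tokens.
ds = load_dataset(
    "togethercomputer/RedPajama-Data-V2",
    name="sample",
    split="train",
    streaming=True,
)

def good_enough(doc, min_words=200):
    """Keep documents whose document-level word count clears a threshold."""
    signals = json.loads(doc["quality_signals"])
    # Signals are stored as (start, end, value) spans; the first span covers
    # the whole document for document-level metrics.
    word_count = signals["rps_doc_word_count"][0][2]
    return word_count >= min_words

# Print a snippet of the first few documents that pass the filter.
for i, doc in enumerate(d for d in ds if good_enough(d)):
    print(doc["raw_content"][:120])
    if i == 4:
        break
```

In practice you would combine several of the 40+ signals (perplexity, repetition, language-ID confidence, dedup flags) and tune thresholds to your objective; the point is simply that the pre-computed annotations let you filter without reprocessing the raw crawl.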
The Shift from Parameters to Data
The shift to ultra-large datasets appears to have started at Google, though it is unclear whether that was obvious at the time. GPT-3 was trained on only slightly more tokens than it has parameters. Google's LaMDA, by contrast, was trained on 1.5T tokens, roughly ten times its parameter count.
OpenAI's GPT-3.5 seemed to maintain that low ratio of parameters to training tokens, while Llama 1 and 2 from Meta, GPT-4, and PaLM 2 have stretched the gap between parameter counts and training tokens even further. Training data is, in effect, another defining feature of an LLM, and more data appears to correlate with better output quality.
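To make that ratio concrete, here is a quick back-of-the-envelope calculation using the figures cited above plus GPT-3's widely reported ~300B training tokens; all numbers are approximate.

```python
# Rough tokens-per-parameter ratios. Parameter and token counts are the
# approximate public figures cited in this article (plus GPT-3's widely
# reported ~300B training tokens); treat them as ballpark numbers.
models = {
    #                 (parameters, training tokens)
    "GPT-3":          (175e9, 300e9),
    "LaMDA":          (137e9, 1.5e12),
    "Llama 2 (70B)":  (70e9, 2.4e12),
}

for name, (params, tokens) in models.items():
    print(f"{name:>14}: ~{tokens / params:.1f} tokens per parameter")
```

The jump from roughly two tokens per parameter to tens of tokens per parameter is the shift the rest of this section describes.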
The shift to more data tokens has been accompanied by more work around data filtering, deduplication, and other quality control efforts. Sam Altman, CEO of OpenAI, said on the Lex Fridman podcast, “a lot of our work is building a great dataset.”
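As a deliberately simplified illustration of what document-level deduplication means, the sketch below drops exact duplicates by hashing normalized text. Production pipelines for web-scale corpora typically add fuzzy near-duplicate detection (MinHash and the like) and many other quality filters that this example does not attempt.

```python
# Minimal illustration of exact document deduplication by hashing
# normalized text. Real LLM data pipelines go further (near-duplicate
# detection, quality classifiers, etc.); this only shows the basic idea.
import hashlib

def content_key(text: str) -> str:
    """Hash of lightly normalized text: lowercase, collapsed whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def deduplicate(documents):
    """Yield each document the first time its normalized content is seen."""
    seen = set()
    for doc in documents:
        key = content_key(doc)
        if key not in seen:
            seen.add(key)
            yield doc

docs = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick  brown fox jumps over the lazy dog.",  # duplicate after normalization
    "An entirely different document.",
]
print(list(deduplicate(docs)))  # two unique documents remain
```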
Together AI is helping the open-source community by offering the type of scale (and more) that the large proprietary LLM makers have assembled. The company added:
A central ingredient to state-of-the-art open LLMs like Llama, Mistral, Falcon, MPT, and the RedPajama models is the large amounts of high-quality data that these models are trained on. For example, Llama 2 is trained on 2.4 trillion carefully curated tokens. The most prominent data sources are the crawls made publicly available by CommonCrawl. However, this data is crude and is not ideal for direct use for LLM training due to artifacts arising from the conversion of HTML to plain text, sources of generally low quality, and biases inherent to the distribution of content on the web. Getting the right dataset and data mixture is painful and any LLM developer has to go through the laborious, time-consuming, energy-intensive and expensive steps of processing and filtering this crude data. Although there have been several community projects around this effort, such as C4, RedPajama-1T, Refinedweb (Falcon), Dolma (AI2) and SlimPajama, many of them only cover a small portion of the CommonCrawl crawls; moreover, they represent a very specific way in which data are filtered.
With RedPajama-Data-v2, our goal is to lift this burden off the community and provide a pool of web data serving as a base from which high quality datasets for LLM training can be extracted and based on which LLM training data can be thoroughly researched.
A key element of the LLM arms race is access to NVIDIA A100 and H100 GPUs for running foundation model training jobs. The reason these high-performance chips are in such high demand is that the training datasets keep getting larger, and more computing throughput is required to complete the processing.
RedPajama-V2 may be adding fuel to the fire, but it may also help open-source LLMs catch up with their proprietary peers.