Scale AI's $1B Funding Round Highlights a New Phase in the Data and AI Wars
The shift from data volume to data quality is well underway
AI foundation models, whether they are large language models (LLMs) or multimodal models with capabilities extending to audio, video, and images, rely on data to work their magic. Scale AI’s founder, Alexandr Wang, commented in a recent blog post:
AI is built from three fundamental pillars: data, compute, and algorithms.
In this construction, NVIDIA, other chip makers, and the cloud hyperscalers are focused on “compute.” Foundation model developers are responsible for “algorithms.” Companies like Scale live in the world of data.
However, when considering data and AI, it is useful to break that segment down further. One way to segment the solution space in relation to AI is to consider whether the feature focus relates to using, storing, labeling, collecting, or generating data.
Foundation models use data. Data lakes and databases store data. Scale AI and competitors, such as Snorkel, Appen, and TELUS International, label, collect, and generate data. This is a fiercely competitive segment that serves the needs of foundation model developers, teams looking to fine-tune foundation models, automobile manufacturers, application developers, and numerous other buyers. It is also differentiated in that automation is often coupled with humans in the loop.
The Big Round
The rise of AI for machine learning applications and self-driving had already created outsized demand for data labeling, collection, and generation services before 2022. Generative AI’s expansion over the past 18 months has amplified this demand, and Scale’s recent funding round is evidence of the data segment’s momentum.
Scale AI announced last week that it had raised $1 billion in new funding at a $14 billion valuation. That is nearly double the company’s $7.3 billion valuation from 2021 and four times the $3.5 billion in 2020. The round was led by Accel and drew a mix of prominent venture capital and private equity firms alongside tech giants including Amazon, Meta, NVIDIA, Qualcomm, and ServiceNow.
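The valuation comparison is simple division, and a quick sketch makes the multiples easy to verify. The figures are the ones reported in this article; the function and variable names are illustrative only.

```python
def valuation_multiple(new_valuation: float, old_valuation: float) -> float:
    """Return how many times larger the new valuation is than the old one."""
    return new_valuation / old_valuation


# Valuations in billions of USD, as reported in the article.
current_valuation = 14.0
valuation_2021 = 7.3
valuation_2020 = 3.5

multiple_vs_2021 = valuation_multiple(current_valuation, valuation_2021)
multiple_vs_2020 = valuation_multiple(current_valuation, valuation_2020)

print(f"vs. 2021: {multiple_vs_2021:.2f}x")  # a bit under 2x
print(f"vs. 2020: {multiple_vs_2020:.2f}x")  # 4x
```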
The funding round and valuation increase were driven by strong revenue growth, particularly in the generative AI segment. A Fortune interview with Wang revealed:
Scale AI provides human workers and software services that help companies label and test data for AI model training, a critical step in getting AI to be effective. For Scale AI, that business is growing quickly as corporate customers race to adopt products related to generative AI—so much so that a whopping 90% of its business is now driven by that spending on that subset of AI.
In an exclusive interview with Fortune, Scale AI CEO Alexandr Wang shared previously undisclosed details about the company’s financials that show just how quickly that growth is happening. The company’s annual recurring revenue—the money paid by businesses for Scale AI's services over extended periods of time—tripled in 2023 to an undisclosed amount and is expected to reach $1.4 billion by the end of 2024, he said.
Wang told Fortune that growth was 200% year-over-year. Measured against the $1.4 billion expected for 2024, that suggests revenue in 2023 was about $465 million, though some sources report it was $600 million to $700 million. He also said that Scale AI would be profitable by year-end and that the company was preparing for an IPO, though he did not indicate the timing.
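The implied 2023 figure comes from working backward from the 2024 projection. The sketch below assumes the 200% growth rate applies to the move from 2023 to the $1.4 billion expected for 2024. That is an interpretation of Wang's comments, not a disclosed figure, and the function name is illustrative.

```python
def implied_prior_revenue(target_revenue: float, growth_pct: float) -> float:
    """Back out the prior-year base from a target and a growth percentage.

    200% growth means the target is 3x the base: base * (1 + 200/100).
    """
    return target_revenue / (1 + growth_pct / 100)


# Assumption: $1.4B is the 2024 figure and 200% is the 2023-to-2024 growth rate.
revenue_2024 = 1.4e9
implied_2023 = implied_prior_revenue(revenue_2024, 200)

print(f"Implied 2023 revenue: ${implied_2023 / 1e6:,.0f} million")
```

The result lands in the mid-$460 millions, which falls well short of the $600 million to $700 million reported elsewhere. The gap may reflect a difference between annual recurring revenue and total revenue, or a different base period for the growth rate.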
Some of Scale AI’s competitors have already achieved profitability. However, investors have been attracted to the company’s strong growth rates and high-profile customers among generative AI foundation model builders, automakers, and U.S. government agencies.
From Services to Products
Another key shift in Scale AI’s business model is from a completely services-oriented model to a product model. While the company previously focused on collecting and annotating data using human raters, it has been developing curated datasets annotated by domain experts to sell to developers and companies for AI model fine-tuning. This presents an opportunity to drive higher margins from product sales.
Scale AI is also developing tools to evaluate models and applications based on these curated datasets. This will enable the company to provide value beyond serving as a data collection and labeling vendor.
What’s Next
Foundation model quality improvements were initially driven by increasing parameter counts. That was followed by a combination of novel architectures and significantly larger training datasets. More recently, foundation model improvements have been attributed to higher-quality datasets that include data curated by knowledge domain and verified by experts in law, medicine, science, mathematics, and other disciplines.
The next phase of development is likely to be a rise in more sophisticated model fine-tuning by enterprise users and developers of domain-optimized models. These data users start with general-purpose foundation models and enhance them with additional training that relies on high-quality datasets reviewed by domain experts.
However, these fine-tuning customers often have far smaller training budgets than the leading foundation model developers. Many will find it cost-prohibitive to pay for large-scale datasets verified by domain experts.
As a result, a practical business model choice is to develop and sell curated datasets to many customers instead of contracting the work to a single foundation model developer. This approach provides the opportunity to amortize the collection and annotation cost across a larger customer base and could eventually drive higher margins if demand rises as forecasted. By building these domain-curated datasets in advance of that demand increase, Scale AI could be positioned for another round of growth.
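The amortization argument can be illustrated with a small sketch. Every number below is hypothetical and chosen only to show the shape of the math. None of them comes from Scale AI or any other vendor.

```python
def cost_per_customer(dataset_cost: float, customers: int) -> float:
    """Spread a one-time dataset build cost evenly across paying customers."""
    if customers < 1:
        raise ValueError("customers must be at least 1")
    return dataset_cost / customers


# Hypothetical: an expert-verified dataset that costs $2,000,000 to build.
HYPOTHETICAL_BUILD_COST = 2_000_000

single_buyer = cost_per_customer(HYPOTHETICAL_BUILD_COST, 1)
twenty_buyers = cost_per_customer(HYPOTHETICAL_BUILD_COST, 20)

print(f"One customer bears:    ${single_buyer:,.0f}")
print(f"Each of 20 customers:  ${twenty_buyers:,.0f}")
```

Under these assumed numbers, a dataset that is out of reach for a single fine-tuning customer becomes a fraction of the cost once it is sold to many, and any price above the per-customer share contributes to margin.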
Scale AI will not be alone in these efforts. Its competitors have similar capabilities, and some have adopted similar strategies. The data lake wars and foundation model wars are already at a fever pitch and well-recognized by industry observers and the news media. The dataset wars are not as visible but are no less fierce. Scale AI’s latest funding round shines a spotlight on another AI landscape battlefront that is sure to become more visible over the next two years.