4 Shortcomings of Large Language Models - Yann LeCun, Research, and AGI
Academic research backs up Yann LeCun's conclusions
Large language models (LLMs) offer seemingly magical capabilities that are often mistaken for human-level qualities. However, Yann LeCun, Meta's chief AI scientist and Turing Award winner, recently laid out four reasons why the current crop of LLM architectures is unlikely to reach the goal of artificial general intelligence (AGI). During an interview on the Lex Fridman podcast, the host asked LeCun:
You've said that autoregressive LLMs are not the way we're going to make progress towards superhuman intelligence. These are the large language models like GPT-4, like LLaMA 2, and 3 soon, and so on. How do they work and why are they not going to take us all the way?
LeCun responded:
For a number of reasons. The first is that there is a number of characteristics of intelligent behavior. For example, the capacity to understand the world, understand the physical world, the ability to remember and retrieve things, persistent memory, the ability to reason and the ability to plan. Those are four essential characteristics of intelligent systems or entities, humans, animals.
LLMs can do none of those, or they can only do them in a very primitive way. And they don't really understand the physical world, they don't really have persistent memory, they can't really reason and they certainly can't plan. And so if you expect the system to become intelligent just without having the possibility of doing those things, you're making a mistake.
That is not to say that autoregressive LLMs are not useful, they're certainly useful, [or] that they're not interesting [or] that we can't build a whole ecosystem of applications around them. Of course we can. But as a path towards human level intelligence, they're missing essential components.
It is important to understand that LLMs mimic the type of intelligence we typically observe in high-functioning humans rather than possessing it. Few people make this mistake when it comes to empathy and personality. They recognize that an LLM is a machine and not a person with consciousness and human emotions. There are exceptions, as some people are fooled into thinking LLMs have developed consciousness, but they are in the minority. However, many more people impute humanlike intelligence qualities to LLMs.
LeCun's point is that superintelligence, or even humanlike intelligence, is not resident in LLMs today. He contends that because autoregressive architectures simply look back at the preceding words to predict the next word, true intelligence will not emerge from these systems. By contrast, LeCun believes that the Joint Embedding Predictive Architecture (JEPA) approach championed by the FAIR lab he heads is more promising, though he also acknowledges that it is still a long way from the goal.
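For readers who want to see what "autoregressive" means in practice, here is a minimal sketch of a greedy next-token generation loop. The `model` and `tokenizer` objects are hypothetical stand-ins, not any particular library's API.

```python
# Minimal sketch of autoregressive (next-token) generation.
# `model` and `tokenizer` are hypothetical stand-ins, not a specific library's API.

def generate(model, tokenizer, prompt: str, max_new_tokens: int = 50) -> str:
    tokens = tokenizer.encode(prompt)           # text -> list of token ids
    for _ in range(max_new_tokens):
        probs = model.next_token_probs(tokens)  # distribution over the vocabulary,
                                                # conditioned only on prior tokens
        next_token = max(range(len(probs)), key=lambda i: probs[i])  # greedy pick
        tokens.append(next_token)               # the prediction is fed back as input
    return tokenizer.decode(tokens)
```

Everything the system produces comes from repeating this loop; there is no separate module for memory, reasoning, or planning.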
Recent scientific research, along with more prosaic commentary from AI experts, backs up LeCun's claims. Below, I break down LeCun's list of shortcomings and reference research that supports his conclusions, along with the implications for the LLM market.
1. Understanding of the World
You will hear many people talk about whether or not LLMs have a "world model" as a foundation of humanlike intelligence. This matters because a world model is foundational to what we call "common sense," reasoning, and planning. Humans' capacity for deductive, inductive, and abductive reasoning often relies on an acquired model of the world: how events transpire within it, what properties it has, and so on. Chain of thought and "thinking step-by-step" are prompting techniques that can mimic these reasoning styles, but they do not arise naturally, and they are not an inherent quality of the AI models.
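For readers unfamiliar with the technique, here is a minimal, hypothetical illustration of the difference between a direct prompt and a "think step by step" prompt. The `ask_llm` function is a placeholder for whatever client you use, not a real API.

```python
def ask_llm(prompt: str) -> str:
    """Placeholder for a real LLM call (e.g., an HTTP request to a hosted model)."""
    raise NotImplementedError("connect this to your model of choice")

direct_prompt = (
    "A train leaves at 2:40 pm and the trip takes 95 minutes. When does it arrive?"
)

cot_prompt = (
    "A train leaves at 2:40 pm and the trip takes 95 minutes. "
    "Think step by step: split the 95 minutes into hours and minutes, "
    "add them to the departure time, then state the arrival time."
)

# The step-by-step wording often elicits text that looks like reasoning, but those
# steps are produced the same way as any other output: token by token, from
# patterns in the training data, not from an internal reasoning module.
```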
A key shortcoming of LLMs is that they typically lack visual and first-hand observational information about the world. LeCun says:
It's a big debate among philosophers and also cognitive scientists, like whether intelligence needs to be grounded in reality. I'm clearly in the camp that yes, intelligence cannot appear without some grounding in some reality. It doesn't need to be physical reality, it could be simulated, but the environment is just much richer than what you can express in language. Language is a very approximate representation of percepts and of mental models, right? I mean, there's a lot of tasks that we accomplish where we manipulate a mental model of the situation at hand, and that has nothing to do with language… Most of our knowledge is derived from that interaction with the physical world.
Ilya Sutskever, the chief scientist of OpenAI, drew a different conclusion in a recent discussion with NVIDIA CEO Jensen Huang. Sutskever said:
The way to think about it is that when you train a large neural network to accurately predict the next word in lots of different text from the internet, what we are doing is that we are learning a world model…It may look on the surface that we are just learning statistical correlations in text, but it turns out that to "just learn" statistical correlations in text, to compress them really well, what the neural network learns is some representation of the process that produced the text. This text is actually a projection of the world.
LeCun rejects this hypothesis, suggesting it is such a low-fidelity representation of the world that it is not a world model similar to that acquired by humans, nor is it likely to become one. He told Fridman:
Can you build [a world model] first of all by prediction? The answer is probably, yes. Can you build it by predicting words? The answer is most probably no, because language is very poor in terms of...weak or low bandwidth…there's just not enough information there. So building world models means observing the world and understanding why the world is evolving the way it is.
And then the extra component of a world model is something that can predict how the world is going to evolve as a consequence of an action you might take, right? So a world model really is, here is my idea of the state of the world at time T, here is an action I might take. What is the predicted state of the world at time T plus one? Now, that state of the world does not need to represent everything about the world, it just needs to represent enough that's relevant for this planning of the action, but not necessarily all the details. Now, here is the problem. You're not going to be able to do this with generative models.
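LeCun's description matches the classic forward-model formulation: predict the next abstract state from the current state and an action, then plan by searching over candidate action sequences. Here is a rough sketch under generic assumptions; the `encode`, `predict_next_state`, and `cost` functions are hypothetical placeholders, not the specifics of any Meta system.

```python
# Rough sketch of planning with a learned world model:
#   s[t+1] = f(s[t], a[t])  -- predict the next (abstract) world state from the
#                              current state and a candidate action.
# `encode`, `predict_next_state`, and `cost` are hypothetical placeholders.

def plan(observation, candidate_action_sequences, encode, predict_next_state, cost):
    """Pick the action sequence whose predicted rollout has the lowest cost."""
    best_actions, best_cost = None, float("inf")
    for actions in candidate_action_sequences:
        state = encode(observation)                    # abstract state, not raw pixels or words
        total = 0.0
        for action in actions:
            state = predict_next_state(state, action)  # one world-model step
            total += cost(state)                       # how far is this from the goal?
        if total < best_cost:
            best_actions, best_cost = actions, total
    return best_actions
```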
LeCun's argument is persuasive, particularly when combined with the vast amount of data humans consume visually through observation. You may be able to develop some comprehension of a Brazilian rainforest through words, but it will be a much lower-fidelity understanding than if you trekked through the forest or lived in it for some period of time.
Similarly, you can read all the first-hand accounts you want of French life in the 1800s or Caesar's Rome, but your understanding will be less than living in it. Even our lived experience in the twenty-first century provides greater insight into the lives of earlier generations than could ever be represented in text. Sensory and emotional elements provide a visceral knowledge of the world that text cannot replicate. Oral history may bring us closer to that knowledge, but an LLM will interpret it primarily for its textual content.
2. Persistent Memory
LeCun identified persistent memory as the second shortcoming of autoregressive LLMs. We are starting to see moves by OpenAI, Character.ai, and other providers to add memory capabilities to their LLM-based assistants. Other examples are Convai and Inworld in the gaming space, where non-player characters can recall their shared history with a player's character.
LeCun spends less time expanding on this concept. This may be because it is easy to understand (i.e., grok 😀) why it is important. People like to reference the movie Her and the evolution of the relationship between the main character and his AI girlfriend. A shared memory is a critical element of human relationships, for better or worse.
Granted, that story also points out that having a memory of interactions is insufficient on its own. There is a need for shared values, which are not something we typically assume an LLM possesses. If an LLM does express values, they are a function of its training data and training process, not values shaped by experience.
There is also a banal consideration regarding the usefulness of LLM-based assistants. In a 2021 interview with Rohit Prasad, Amazon's chief AI scientist and the former leader of the Alexa organization, we discussed voice assistants and memory. Amazon was experimenting with adding memory to Alexa to remember a user’s preferences. Most of the time, an assistant that remembers your preferences offers more value than one where you must teach it what to do every time. This is similar to the promise of Rabbit’s large action model (LAM) concept and the Open Interpreter open-source project.
However, Teachable AI, as Amazon called the feature, is a mechanism different from memory in the human sense. Instead of becoming embedded in a system's world model and its understanding of a particular person, teachable memories become rules to be followed when certain variables align. Human memory is employed in that way, but also in much subtler ways, to predict how things you have not been told about might affect a person or situation.
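To make the distinction concrete, here is a simplified, hypothetical sketch of what this kind of "teachable" preference memory amounts to: explicit condition-to-action rules that fire when a request matches, rather than knowledge woven into a model of the person. The structure below is illustrative only and is not Amazon's implementation.

```python
# Illustrative only: preference memory as explicit condition -> action rules,
# looked up when a request matches, rather than knowledge embedded in a world model.

preferences = {
    ("play music", "evening"): "start the jazz playlist",
    ("set temperature", "any"): "set the thermostat to 70F",
}

def handle(intent: str, context: str) -> str:
    for (stored_intent, stored_context), action in preferences.items():
        if stored_intent == intent and stored_context in ("any", context):
            return action
    return "ask the user what they want"  # nothing remembered for this case

print(handle("play music", "evening"))  # -> "start the jazz playlist"
```

A rule table like this can recall what you taught it, but it cannot generalize to situations you never described, which is exactly what human memory does routinely.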
Fable Studio founder Edward Saatchi told me in a 2021 podcast interview about his mission to develop AI-powered characters that could live and grow based on their experiences. This means you could interact with a character, and several days later, they could have “evolved” based on their interactions with other people or virtual characters. This would lead to a new, richer experience upon your next interaction than if the virtual character remained in stasis until your next meeting or never even remembered your past encounters.
This has led to a new project called The Simulation, in which virtual beings exist in a simulated environment. The goal is to create “the world's first genuinely intelligent AI virtual beings. Each one, a mirror of the human psyche, navigating the tumultuous seas of emotions and experiences in a digital cosmos of our creation.”
The memory added to Alexa in 2020 provided value for users, as will the memory added to ChatGPT. However, that does not necessarily mean it will lead to AGI. Memory is necessary but not sufficient. The mechanism matters.
Researchers who believe autoregressive LLMs can reach the AGI threshold typically promote either a neuro-symbolic or an agent-based memory approach. However, both groups recognize the importance of memory. Researchers from Cisco and the University of Texas at Austin published a paper in October 2023 that stated:
Large Language Models (LLMs) have made extraordinary progress in the field of Artificial Intelligence and have demonstrated remarkable capabilities across a large variety of tasks and domains. However, as we venture closer to creating Artificial General Intelligence (AGI) systems, we recognize the need to supplement LLMs with long-term memory to overcome the context window limitation and more importantly, to create a foundation for sustained reasoning, cumulative learning and long-term user interaction.
LeCun prefers the JEPA architecture over autoregressive LLMs because it predicts concepts instead of words. This means JEPA provides a more abstract form of memory than LLMs and focuses on the essential elements of information rather than the words that carry the information. If you adopt this line of reasoning, it is easy to understand why next-word prediction is a limiting factor in efficiently storing memories and applying them across a variety of contexts.
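A heavily simplified sketch may help here, assuming generic encoder and predictor networks; this is a toy illustration of latent-space prediction, not Meta's JEPA code. Instead of scoring every word in a vocabulary, the model is trained to predict the representation of the target from the representation of the context.

```python
import numpy as np

# Toy illustration of JEPA-style training: predict the *representation* of the
# target from the representation of the context, rather than predicting the
# exact next token. The encoders and predictor here are random stand-ins.

rng = np.random.default_rng(0)
W_ctx = rng.normal(size=(128, 64))   # toy context-encoder weights
W_tgt = rng.normal(size=(128, 64))   # toy target-encoder weights
W_pred = rng.normal(size=(64, 64))   # toy predictor weights

def encode(x, W):
    return np.tanh(x @ W)            # map raw input to an abstract representation

def jepa_style_loss(context, target):
    s_ctx = encode(context, W_ctx)               # abstract state of the context
    s_tgt = encode(target, W_tgt)                # abstract state of the target
    s_hat = s_ctx @ W_pred                       # predicted target representation
    return float(np.mean((s_hat - s_tgt) ** 2))  # distance in embedding space

context = rng.normal(size=(1, 128))
target = rng.normal(size=(1, 128))
print(jepa_style_loss(context, target))
```

The loss is computed in embedding space, so the model is free to ignore details that don't matter, which is the intuition behind "predicting concepts instead of words."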
3. Reasoning
Next-word prediction is not reasoning, even if LLMs can sometimes mimic human reasoning convincingly enough to produce correct results. Consider three types of errors: faulty reasoning that leads to incorrect results, inconsistent reasoning that produces different outputs for the same request, and a willingness to change an output when facing emphatic but faulty reasoning from the user.
A meta-analysis of academic studies measuring LLM reasoning across a variety of methods found recurring instances of faulty logic in model outputs and mechanisms. The research paper from the MaiNLP Center for Information and Language Processing and the Munich Center for Machine Learning concluded:
Large language models (LLMs) have recently shown impressive performance on tasks involving reasoning, leading to a lively debate on whether these models possess reasoning capabilities similar to humans. However, despite these successes, the depth of LLMs’ reasoning abilities remains uncertain. This uncertainty partly stems from the predominant focus on task performance, measured through shallow accuracy metrics, rather than a thorough investigation of the models’ reasoning behavior. This paper seeks to address this gap by providing a comprehensive review of studies that go beyond task accuracy, offering deeper insights into the models’ reasoning processes…Our review suggests that LLMs tend to rely on surface-level patterns and correlations in their training data, rather than on genuine reasoning abilities.
…
Studies highlight that LLMs tend to rely on superficial statistical features, rather than engaging in systematic reasoning. Chen et al. (2024b) illustrate that the premise order markedly influences the LLMs’ behavior in propositional reasoning tasks. Specifically, when premises are presented in an order that does not align with the ground-truth proof, models such as ChatGPT, GPT-4, PaLM 2-L (Anil et al., 2023) and Gemini Pro (Team et al., 2023) encounter significant difficulties within their reasoning, even though such an ordering does not change the underlying logic. Zhang et al. (2022) show that an over-reliance on statistical features in the training data can hinder a model’s reasoning and generalization capacity.
…
Qiu et al. (2024) find that LLMs such as GPT-3.5, GPT-4, Claude 2 (Anthropic, 2023), and LLaMA 2-70B are capable of inferring rules from given data. However, the models frequently err in the application of these rules, highlighting a gap between their ability to generate and apply rules. Moreover, the rules derived often diverge significantly from those humans might produce, exhibiting a tendency towards verbosity and an inability to concentrate on the fundamental patterns for generalization.
Researchers from NYU explored the ability of LLMs to maintain intellectual consistency. The research paper concluded:
Large language models (LLMs) have achieved widespread success on a variety of in-context few-shot tasks, but this success is typically evaluated via correctness rather than consistency. We argue that self-consistency is an important criteria for valid multi-step reasoning in tasks where the solution is composed of the answers to multiple sub-steps. We propose two types of self-consistency that are particularly important for multi-step reasoning – hypothetical consistency (a model’s ability to predict what its output would be in a hypothetical other context) and compositional consistency (consistency of a model’s final outputs when intermediate sub-steps are replaced with the model’s outputs for those steps). We demonstrate that multiple variants of the GPT-3/-4 models exhibit poor consistency rates across both types of consistency on a variety of tasks.
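As a rough illustration of what a compositional-consistency check involves (a hypothetical sketch, not the NYU authors' code): answer the question directly, then answer it again with the model's own sub-step answers substituted in, and see whether the final answer changes. The `ask_llm` helper is an assumed placeholder.

```python
# Hypothetical sketch of a compositional-consistency check.
# `ask_llm` is a placeholder for a real model call.

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("connect this to a real model to run the check")

def compositional_consistency(question: str, substeps: list[str]) -> bool:
    # 1. Answer the full question directly.
    direct_answer = ask_llm(question)
    # 2. Answer each sub-step, then re-ask the question with the model's own
    #    sub-step answers provided as givens.
    substep_answers = [ask_llm(step) for step in substeps]
    composed_prompt = question + "\nGiven: " + "; ".join(substep_answers)
    composed_answer = ask_llm(composed_prompt)
    # A consistent reasoner should land on the same final answer either way.
    return direct_answer.strip() == composed_answer.strip()
```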
Another challenge to LLM reasoning capabilities addressed how strongly a model holds to a correct answer when challenged by the user. A 2023 study from Ohio State University researchers set out this test of ChatGPT's reasoning capabilities:
Our task requires the LLM to not only achieve the correct answer on its own, but also be able to hold and defend its belief instead of blindly believing or getting misled by the user’s (invalid) arguments and critiques, thus testing in greater depth whether the LLM grasps the essence of the reasoning required to solve the problem. Across a range of complex reasoning benchmarks spanning math, commonsense, logic and BIG-Bench tasks, we find that despite their impressive performance as reported in existing work on generating correct step-by-step solutions in the beginning, LLMs like ChatGPT cannot maintain their beliefs in truth for a significant portion of examples when challenged by oftentimes absurdly invalid arguments.
These are three distinct challenges to LLM reasoning: using reasoning to determine an answer, applying that reasoning consistently across equivalent requests, and holding to sound reasoning when challenged with faulty arguments. The studies above show LLMs falling short on all three counts.
These results don’t prove that all models will fail these challenges or that they will be unable to overcome their current shortcomings. Nor do these tests account for the fact that humans also come up short on logic, consistency, and confidence. However, the examples suggest that LLMs don’t understand the concepts behind reasoning, even though next-word prediction often produces correct answers.
4. Planning
Planning is the final LLM shortcoming singled out by LeCun. Humans conduct planning regularly. It is the process of identifying the sequence of tasks required to accomplish a goal. However, planning requires reasoning, memory, and a comprehensive world model to be successful. It also requires agency and a goal.
Researchers from Princeton University and Microsoft Research confirmed LeCun’s commentary about LLM limitations and recommended a new approach to improve planning:
Large language models (LLMs) demonstrate impressive performance on a wide variety of tasks, but they often struggle with tasks that require multi-step reasoning or goal-directed planning. To address this, we take inspiration from the human brain, in which planning is accomplished via the recurrent interaction of specialized modules in the prefrontal cortex (PFC). These modules perform functions such as conflict monitoring, state prediction, state evaluation, task decomposition, and task coordination. We find that LLMs are sometimes capable of carrying out these functions in isolation, but struggle to autonomously coordinate them in the service of a goal.
Arizona State University researchers came to a similar conclusion in their February 2024 paper:
There is considerable confusion about the role of Large Language Models (LLMs) in planning and reasoning tasks. On one side are over-optimistic claims that LLMs can indeed do these tasks with just the right prompting or self-verification strategies. On the other side are perhaps over-pessimistic claims that all that LLMs are good for in planning/reasoning tasks are as mere translators of the problem specification from one syntactic format to another, and ship the problem off to external symbolic solvers.
In this position paper, we take the view that both these extremes are misguided. We argue that auto-regressive LLMs cannot, by themselves, do planning or self-verification (which is after all a form of reasoning), and shed some light on the reasons for misunderstandings in the literature. We will also argue that LLMs should be viewed as universal approximate knowledge sources that have much more meaningful roles to play in planning/reasoning tasks beyond simple front-end/back-end format translators.
LLM Value and Limitations
Autoregressive LLMs like GPT-4 provide a lot of value. Nearly every research paper begins with a statement about how impressive these systems are. However, as magical as they often seem, it is worthwhile to consider what is really going on inside the deep neural network (DNN) models and what that may indicate about their limitations. LeCun shared some math that compares the amount and nature of the data a human takes in with the data an LLM is trained on.
Those LLMs are trained on enormous amounts of text; basically the entirety of all publicly available text on the internet, right? That's typically on the order of 10 to the 13th tokens. Each token is typically two bytes. So, that's two [times] 10 to the 13th bytes as training data. It would take you or me 170,000 years to just read through this at eight hours a day. So, it seems like an enormous amount of knowledge, right, that those systems can accumulate. But then you realize it's really not that much data.
If you talk to developmental psychologists, they tell you a 4-year-old has been awake for 16,000 hours in his or her life, and the amount of information that has reached the visual cortex of that child in four years is about 10 to the 15th bytes. And you can compute this by estimating that the optical nerves carry about 20 megabytes per second, roughly. And so, 10 to the 15th bytes for a 4-year-old versus two times 10 to the 13th bytes for 170,000 years worth of reading.
What that tells you is that through sensory input, we see a lot more information than we do through language. And that despite our intuition, most of what we learn and most of our knowledge is through our observation and interaction with the real world, not through language. Everything that we learn in the first few years of life, and certainly everything that animals learn has nothing to do with language.
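LeCun's numbers are easy to check. Here is a back-of-the-envelope version of the arithmetic, using the figures from his quote above; the implied reading speed at the end is my own rough calculation.

```python
# Back-of-the-envelope check of LeCun's comparison (figures from his quote above).

llm_bytes = 1e13 * 2                 # ~1e13 tokens at ~2 bytes per token = 2e13 bytes

seconds_awake = 16_000 * 3600        # a 4-year-old's ~16,000 waking hours, in seconds
child_bytes = seconds_awake * 20e6   # optic nerve at ~20 MB/s -> ~1.15e15 bytes

print(f"LLM text data:        {llm_bytes:.1e} bytes")
print(f"Child's visual input: {child_bytes:.1e} bytes")
print(f"Ratio (child / LLM):  {child_bytes / llm_bytes:.0f}x")   # roughly 50-60x

# Reading-time check: 170,000 years at 8 hours/day is about 1.8e12 seconds,
# so 1e13 tokens works out to roughly 5-6 tokens per second of reading.
```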
JEPA may or may not be the architecture that delivers AGI, superintelligence, or whatever term is used to define the goal. However, consider the amazement that greeted Siri and Alexa. Both pushed the boundaries of what could be expected of natural language processing (NLP) approaches that leveraged DNNs. Compared to ChatGPT, those accomplishments look quaint. That doesn’t undermine their significance. It simply recognizes that what we’ve seen from LLMs is likely to follow a similar cycle of amazement giving way to banality as new technologies surpass them in capability.
LLMs are very effective today for a wide variety of use cases. They are poised to deliver many benefits, even if they fall short of the various definitions of AGI. We should applaud researchers pushing the technical boundaries and looking for the next breakthrough in machine intelligence because that will lead to another breakthrough in user value.
However, the next true breakthrough may take some time to materialize, and the results may not be as cost-effective as what we have today. Applying LLMs to individual and corporate tasks will have a tremendous impact on daily life and business operations long before the shortcomings in world models, persistent memory, reasoning, and planning are solved.