No-Code RAG and the Generative AI Knowledge Revolution
Vectors beat context windows and CustomGPT makes RAG easy
Retrieval-augmented generation (RAG) is probably the most referenced acronym in generative AI, aside from LLM. The reason is simple: RAG makes LLM-enabled knowledge assistants more accurate by grounding them in a specific dataset.
Generative AI is driving a knowledge assistance revolution, unlocking access to the information in our vast stores of text data. Multimodal models, of course, handle more than text, but text is the data that machines have long struggled to make sense of at scale. LLMs have changed that situation.
Large language models (LLMs) are amazing tools for handling text, but they suffer from a grounding challenge. The models are trained on so much data that they are excellent with language but not always strong at maintaining context or providing the right information to answer a user query. There are frequent issues pertaining to context windows (though X.ai may suggest otherwise with Grok 1.5), but even more so with knowledge sources. Failure to recall data that is in the dataset, failure to retrieve all of the right data, and hallucinations (making up data not in the dataset) are all common LLM problems.
RAG is popular because it dramatically reduces these errors when your use case is focused on leveraging a known knowledge source. If an LLM is a horse, RAG is a combination of reins and blinders. It is a technique that keeps the LLM-based assistant headed in the right direction and eliminates distractions from information outside the use case context. RAG leverages LLMs for their best qualities and minimizes important shortcomings.
RAG is all the RAGE
The term RAG first appeared in a 2020 research paper that included authors from Meta’s FAIR (Facebook AI Research) Lab and was presented at the NeurIPS 2020 conference. Long before FAIR published the open-source Llama models, text-to-image and text-to-video models, or Meta AI, the research team considered the more prosaic problem of accurately retrieving known information using an LLM. Key conclusions included:
Pre-trained neural language models have been shown to learn a substantial amount of in-depth knowledge from data. They can do so without any access to an external memory, as a parameterized implicit knowledge base. While this development is exciting, such models do have downsides: They cannot easily expand or revise their memory, can’t straightforwardly provide insight into their predictions, and may produce “hallucinations”. Hybrid models that combine parametric memory with non-parametric (i.e., retrieval-based) memories can address some of these issues because knowledge can be directly revised and expanded, and accessed knowledge can be inspected and interpreted…
Our results highlight the benefits of combining parametric and non-parametric memory with generation for knowledge-intensive tasks—tasks that humans could not reasonably be expected to perform without access to an external knowledge source.
The findings around the RAG architecture have been reinforced in several subsequent papers. A March 2024 paper evaluated more than 100 RAG-related research papers and identified several stages of development:
The development trajectory of RAG in the era of large models exhibits several distinct stage characteristics. Initially, RAG’s inception coincided with the rise of the Transformer architecture, focusing on enhancing language models by incorporating additional knowledge through Pre-Training Models (PTM). This early stage was characterized by foundational work aimed at refining pre-training techniques. The subsequent arrival of ChatGPT marked a pivotal moment, with LLM demonstrating powerful in context learning (ICL) capabilities. RAG research shifted towards providing better information for LLMs to answer more complex and knowledge-intensive tasks during the inference stage, leading to rapid development in RAG studies. As research progressed, the enhancement of RAG was no longer limited to the inference stage but began to incorporate more with LLM fine-tuning techniques.
Vectors of Knowledge
RAG relies on a technique known as vectorization, a process that transforms data into numerical vectors (embeddings) whose mathematical relationships capture how closely data elements are related; the vectors are stored in a vector database. This “nearness” measure between vectors increases the probability of factual answers from generative AI-enabled solutions. It often results in more robust responses as well.
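To make the idea of “nearness” concrete, here is a minimal sketch in Python. The embed() function is a hypothetical placeholder for any real embedding model, and a plain list stands in for a vector database; a production system would use an actual embedding API and a purpose-built vector store.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Hypothetical placeholder: a real system would call an embedding model.
    # This random projection only makes the sketch runnable; it carries no
    # semantic meaning.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(384)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # The "nearness" measure between two vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# A tiny in-memory stand-in for a vector database: each source chunk is
# stored alongside its embedding.
chunks = [
    "RAG grounds answers in a known dataset.",
    "Vector databases index embeddings for fast similarity search.",
    "Horses wear blinders to limit distractions.",
]
store = [(chunk, embed(chunk)) for chunk in chunks]

def nearest(query: str, k: int = 2) -> list[str]:
    # Retrieve the k chunks whose embeddings are closest to the query's.
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine_similarity(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

print(nearest("How does RAG reduce hallucinations?"))
```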
The diagram above reflects the typical difference between an LLM-powered solution with and without RAG. Prompting a RAG-supported system begins with the retrieval of relevant information chunks that are then synthesized by the LLM for the query response. This retrieval process improves factuality and also directs the LLM to focus on the vector database information and not its training data.
The result is the difference between scanning all knowledge to respond to a user query and consulting just the book that has the answer.
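As a rough sketch of that retrieve-then-synthesize flow, the following builds on the nearest() helper from the previous example. The generate() function is a hypothetical stand-in for any LLM completion call; the point is that the retrieved chunks, not the model’s training data, supply the grounding.

```python
def generate(prompt: str) -> str:
    # Hypothetical stand-in for an LLM completion call.
    return f"[LLM answer grounded in a prompt of {len(prompt)} characters]"

def answer_with_rag(query: str) -> str:
    context = "\n".join(nearest(query, k=2))  # retrieval step
    prompt = (
        "Answer using only the context below. If the answer is not in the "
        "context, say you do not know.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return generate(prompt)  # generation step

print(answer_with_rag("How does RAG reduce hallucinations?"))
```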
In Context or In Vectors
A key consideration in an “answer system” is where the data comes from. Historically, “answers” have been stored in databases. LLMs provide the option for information to be embedded in the model itself. RAG moves the data back into an external data store. However, another source of data is the context window.
Users can upload documents or spreadsheets and converse with their data. This is context window information retrieval. However, it is not infallible. “Lost in the middle” is a common occurrence whereby in-context data is not fully scanned by the LLM when formulating a response. Generally speaking, the recall rates (i.e., identifying and retrieving information when it exists) are inferior to those of RAG techniques. There can also be cost and latency issues when continually running large context window queries.
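As a simple illustration of the cost and latency point, continuing the earlier sketch (and its hypothetical generate() call), contrast stuffing an entire document into the context window with retrieving only a few relevant chunks:

```python
def answer_in_context(query: str, document: str) -> str:
    # "In context" approach: the entire document travels with every query,
    # so prompt size (and therefore cost and latency) grows with the source.
    prompt = f"Document:\n{document}\n\nQuestion: {query}"
    return generate(prompt)  # generate() as defined in the earlier sketch

# With RAG (answer_with_rag above), only a handful of retrieved chunks travel
# with each query, so prompt size stays roughly constant as the corpus grows.
```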
Researchers from the Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University, and other research institutions found that large context windows do not eliminate the value of RAG.
With the deepening of related research, the context of LLMs is continuously expanding. Presently, LLMs can effortlessly manage contexts exceeding 200,000 tokens. This capability signifies that long-document question answering, previously reliant on RAG, can now incorporate the entire document directly into the prompt. This has also sparked discussions on whether RAG is still necessary when LLMs are not constrained by context.
In fact, RAG still plays an irreplaceable role. On one hand, providing LLMs with a large amount of context at once will significantly impact its inference speed, while chunked retrieval and on-demand input can significantly improve operational efficiency. On the other hand, RAG-based generation can quickly locate the original references for LLMs to help users verify the generated answers. The entire retrieval and reasoning process is observable, while generation solely relying on long context remains a black box. Conversely, the expansion of context provides new opportunities for the development of RAG, enabling it to address more complex problems and integrative or summary questions that require reading a large amount of material to answer.
No-Code RAG
Recently, Synthedia hosted a webinar focused on RAG techniques for business users. One of the featured companies, CustomGPT, demonstrated a no-code RAG solution. Alden Do Rosario, CEO of CustomGPT.ai, commented:
We started with the idea that it was going to be RAG. So, on day one we created ChatGPT with your own data. From day one our vision was, “If you need to be a coder or if you need to be technical, then we have failed.” So, it was a true democratization and … in our mind we have the perfect user whose only technical expertise should be the ability to use a browser. That's it.
The video includes fireside chats with Do Rosario, MIT’s Doug Williams, and Adam Kamor, a co-founder of Tonic, along with a demo of the no-code RAG solution and details about the RAG evaluation suite from Tonic. Plus, Williams reviews what MIT has learned and what it is using CustomGPT for.
A key finding is that RAG can be complex, but it can be simplified to a level where a business analyst can build key software components.