Retrieval-augmented generation (RAG) combines large language models (LLMs) with external knowledge retrieval. Traditional LLMs generate responses based solely on pre-trained data. With RAG, the model can access updated and specific information at the time of inference, providing more accurate and context-rich responses. This method leverages repositories of external data, ensuring that the generated outputs are coherent, relevant, and up-to-date.
The core advantage of RAG is its ability to dynamically fetch and integrate information. It bridges the gap between static model knowledge and the fluidity of real-world data. For instance, while a standard LLM may struggle with rapidly changing facts or niche information outside its training data, RAG can retrieve and utilize external data sources to enhance its responses. This makes RAG valuable in applications requiring current or specialized knowledge, such as customer support, research, and content creation.
The first step in retrieval-augmented generation (RAG) involves creating or identifying external data outside the original training set of the large language model (LLM). This external data can come from various sources, including APIs, databases, or document repositories, and may exist in different formats, such as files, database records, or long-form text.
To make this data usable by the RAG model, it needs to be converted into a format that the model can understand. This is typically done using embedding techniques, where the data is transformed into numerical representations and stored in a vector database. This process builds a knowledge library that the LLM can access during inference to provide more informed and contextually relevant responses.
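To make this concrete, here is a minimal sketch of building such a knowledge library in memory. It assumes the sentence-transformers and numpy packages are installed; the model name and documents are illustrative placeholders, and a production system would store the vectors in a vector database rather than an in-memory array.

# A minimal sketch of building a vector "knowledge library".
# Assumes sentence-transformers and numpy are installed; the model name
# and documents below are illustrative placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer

documents = [
    "Employees receive 25 days of annual leave per calendar year.",
    "Unused leave can be carried over until the end of March.",
    "Parental leave requests must be submitted 30 days in advance.",
]

# Convert each document into a numerical vector (embedding)
model = SentenceTransformer("all-MiniLM-L6-v2")
doc_vectors = model.encode(documents, normalize_embeddings=True)

# In production these vectors would live in a vector database;
# here a NumPy array stands in for that store.
print(doc_vectors.shape)  # (3, embedding_dimension)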
Once the external data source is established, the next step is to retrieve information relevant to the user's query. This involves converting the user query into a vector representation, which is then matched against the vectorized data in the knowledge library. The goal is to find and retrieve the most relevant pieces of information that align with the query.
For example, if a chatbot is used to answer HR-related questions within a company, and an employee asks about their available annual leave, the system would search for and retrieve documents related to the company's leave policy and the employee's specific leave records. The relevance of these documents is determined through mathematical vector calculations.
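Continuing the illustrative sketch above, the user query is embedded with the same model and compared against the stored vectors; because the vectors are normalized, a dot product gives the cosine similarity used to rank the documents.

# Continues the sketch above (reuses model, documents, and doc_vectors).
# Embed the user query and rank documents by cosine similarity.
query = "How many days of annual leave do I have?"
query_vector = model.encode([query], normalize_embeddings=True)[0]

# Dot product of normalized vectors equals cosine similarity
scores = doc_vectors @ query_vector
best_match = documents[int(np.argmax(scores))]
print(best_match)  # the leave-policy sentence should score highest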
After retrieving the relevant information, the next step is to augment the LLM's prompt with this data. This involves integrating the retrieved information into the original user input to create a more contextually enriched prompt.
The augmented prompt is then fed into the LLM, which uses both its pre-existing knowledge and the new, context-specific data to generate a response. This step leverages prompt engineering techniques to ensure that the LLM can effectively utilize the additional information.
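A minimal sketch of this augmentation step, reusing the query and retrieved text from the sketch above; the template wording is illustrative, and real systems tune it carefully.

# A minimal sketch of augmenting the prompt with retrieved context.
# Reuses best_match and query from the retrieval sketch above.
retrieved_chunks = [best_match]

augmented_prompt = (
    "Answer the question using only the context below.\n\n"
    "Context:\n" + "\n".join(retrieved_chunks) + "\n\n"
    "Question: " + query
)

# The augmented prompt is then sent to the LLM of your choice
print(augmented_prompt)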
To ensure that the RAG system remains effective over time, it's crucial to keep the external data up-to-date. This involves regularly updating the data sources and refreshing the vector representations stored in the knowledge library.
Updating can be done either in real-time or through periodic batch processes, depending on the nature of the data and its rate of change. Keeping the external data current is essential for maintaining the accuracy and relevance of the information that the RAG system retrieves and integrates into responses.
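One possible batch-refresh pattern, sketched under the assumption that each record stores a hash of the text that was last embedded, is to re-embed only the entries whose content has changed:

# A sketched batch refresh: re-embed only documents whose text changed.
# Assumes each record is a dict with a "text" key and an optional
# "embedded_hash" key recording the last-embedded version.
import hashlib

def refresh_embeddings(records, model, doc_vectors):
    for i, record in enumerate(records):
        digest = hashlib.sha256(record["text"].encode("utf-8")).hexdigest()
        if digest != record.get("embedded_hash"):
            # Content changed since the last pass: recompute its embedding
            doc_vectors[i] = model.encode([record["text"]], normalize_embeddings=True)[0]
            record["embedded_hash"] = digest
    return doc_vectors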
Here are the top 10 tutorials showcasing the latest techniques for creating your own RAG systems:
Related content: Read our guide to RAG vs fine tuning
LangChain is a framework that simplifies the integration of large language models (LLMs) with external data sources and complex workflows. Among other capabilities, LangChain makes it easy to set up RAG. Below is a quick tutorial showing how to implement RAG with LangChain.
These instructions are adapted from the LangChain documentation.
1. Jupyter Notebook: Ensure you have Jupyter installed. If not, you can install it via pip:
pip install notebook
Jupyter notebooks are ideal for running this guide because they provide an interactive environment where you can run each step and see immediate results.
2. Install Required Packages: The primary packages needed for this tutorial are langchain, langchain_community, and langchain_chroma. Install them via pip:
pip install langchain langchain_community langchain_chroma
3. Set Up LangSmith: If you plan to monitor and trace your LangChain application, set up LangSmith for logging. First, sign up for an account, then set the necessary environment variables:
import getpass
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass()
This setup will help you trace and debug the chain of operations in your application.
The first step in building a RAG-based application is to load the data you want to use as a knowledge base. In this example, we load a Time Magazine article from a web URL using DocumentLoaders.
Here’s how you can load the content:
# Import the BeautifulSoup module for parsing HTML content
import bs4

# Import the WebBaseLoader class from the langchain_community.document_loaders module
from langchain_community.document_loaders import WebBaseLoader

# Define a SoupStrainer to filter out only specific HTML elements based on class names
bs4_strainer = bs4.SoupStrainer(class_=("post-title", "post-header", "post-content"))

# Create an instance of WebBaseLoader with the specified webpage URL.
# The 'bs_kwargs' argument passes BeautifulSoup-specific arguments
loader = WebBaseLoader(
    web_paths=("https://time.com/6556168/when-ai-outsmart-humans/",),
    bs_kwargs={"parse_only": bs4_strainer},
)

# Load the documents from the specified URL, applying the filtering specified by bs4_strainer
docs = loader.load()

# Print the length of the content of the first document loaded
print(len(docs[0].page_content))
This code loads the webpage content while stripping away unnecessary parts, keeping only the title, headers, and content.
The loaded document is usually too large to fit into the context window of most models. Therefore, the next step is to split it into manageable chunks that can be stored and searched efficiently. We use RecursiveCharacterTextSplitter for this.
Here’s the code:
# Import the RecursiveCharacterTextSplitter class from the langchain.text_splitter module
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Create a RecursiveCharacterTextSplitter with these parameters:
# - chunk_size: The maximum size of each text chunk
# - chunk_overlap: Characters that overlap between chunks
# - add_start_index: If True, adds the starting index of each chunk to the metadata
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)

# Split the documents into smaller chunks based on the specified chunk_size and chunk_overlap
all_splits = text_splitter.split_documents(docs)

# Print the total number of text chunks created from the documents
print(len(all_splits))

# Print the metadata of the first chunk to show additional information about it
print(all_splits[0].metadata)
This process ensures that the chunks are not only of manageable size but also retain contextual relevance due to the overlap.
Once the document is split into chunks, the next step is to index these chunks in a vector store. This allows for efficient retrieval based on the similarity of the chunks to a user query. We use Chroma as the vector store and OpenAIEmbeddings to embed each chunk.
Here’s how to do it:
# Import the Chroma vector store (from the langchain_chroma package installed earlier)
# and the OpenAI embeddings class. OpenAIEmbeddings requires the openai package
# and an OPENAI_API_KEY environment variable.
from langchain_chroma import Chroma
from langchain_community.embeddings import OpenAIEmbeddings

# Embed each chunk and index it in a Chroma vector store
vectorstore = Chroma.from_documents(documents=all_splits, embedding=OpenAIEmbeddings())
This code creates a vector store that you can query to find the most relevant document chunks based on user input.
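As a quick sanity check, you can also query the store directly with similarity_search; the query string below is illustrative.

# Quick sanity check: retrieve the chunks closest to an example query
results = vectorstore.similarity_search("What is the scaling hypothesis?", k=3)
print(results[0].page_content)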
The retrieval process involves searching the vector store to find document chunks relevant to a user's query. We use the VectorStoreRetriever for this. Here's the code:
# Create a retriever object from the vectorstore with these arguments:
# - search_type="similarity": The retriever should use similarity-based search
# - search_kwargs={"k": 6}: Sets the number of top similar documents to retrieve to 6
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 6})

# Use the retriever to find and retrieve documents relevant to the given query
retrieved_docs = retriever.invoke("What is the scaling hypothesis?")

# Print the number of documents retrieved
print(len(retrieved_docs))

# Print the content of the first retrieved document
print(retrieved_docs[0].page_content)
This retrieves the top 6 most relevant document chunks based on the similarity to the query.
Finally, we chain all the components together to generate a response. We pass the retrieved documents and the original user query to the language model to produce a final answer.
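The sketch below shows one way to wire this up with LangChain's expression language. It assumes the langchain-openai package is installed and an OpenAI API key is configured; the model name and prompt wording are illustrative choices rather than requirements.

# A minimal sketch of the retrieval + generation chain.
# Assumes: pip install langchain-openai and OPENAI_API_KEY is set.
# The model name and prompt wording below are illustrative.
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo")

prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the following context.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)

def format_docs(docs):
    # Join the retrieved chunks into a single context string
    return "\n\n".join(doc.page_content for doc in docs)

# Retrieve context for the question, fill the prompt, call the model,
# and parse the model output to a plain string
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

print(rag_chain.invoke("What is the scaling hypothesis?"))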
This setup ties together retrieval and generation, enabling the creation of a dynamic question-answering system.
To see what you can start building today with GPTScript, visit our docs at https://gptscript-ai.github.io/knowledge/. For a great example of RAG at work with GPTScript, check out our blog post GPTScript Knowledge Tool v0.3 Introduces One-Time Configuration for Embedding Model Providers.