Retrieval-augmented generation (RAG) combines large language models (LLMs) with external knowledge retrieval. Traditional LLMs generate responses based solely on pre-trained data. With RAG, the model can access updated and specific information at the time of inference, providing more accurate and context-rich responses. This method leverages repositories of external data, ensuring that the generated outputs are coherent, relevant, and up-to-date.
The core advantage of RAG is its ability to dynamically fetch and integrate information. It bridges the gap between static model knowledge and the fluidity of real-world data. For instance, while a standard LLM may struggle with rapidly changing facts or niche information outside its training data, RAG can retrieve and utilize external data sources to enhance its responses. This makes RAG valuable in applications requiring current or specialized knowledge, such as customer support, research, and content creation.
The first step in retrieval-augmented generation (RAG) involves creating or identifying external data outside the original training set of the large language model (LLM). This external data can come from various sources, including APIs, databases, or document repositories, and may exist in different formats, such as files, database records, or long-form text.
To make this data usable by the RAG model, it needs to be converted into a format that the model can understand. This is typically done using embedding techniques, where the data is transformed into numerical representations and stored in a vector database. This process builds a knowledge library that the LLM can access during inference to provide more informed and contextually relevant responses.
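To make this concrete, here is a minimal sketch of building such a knowledge library in memory. It assumes the sentence-transformers and numpy packages are installed; the model name and documents are illustrative placeholders, and a production system would store the vectors in a vector database rather than an in-memory array.

# A minimal sketch of building a vector "knowledge library".
# Assumes sentence-transformers and numpy are installed; the model name
# and documents below are illustrative placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer

documents = [
    "Employees receive 25 days of annual leave per calendar year.",
    "Unused leave can be carried over until the end of March.",
    "Parental leave requests must be submitted 30 days in advance.",
]

# Convert each document into a numerical vector (embedding)
model = SentenceTransformer("all-MiniLM-L6-v2")
doc_vectors = model.encode(documents, normalize_embeddings=True)

# In production these vectors would live in a vector database;
# here a NumPy array stands in for that store.
print(doc_vectors.shape)  # (3, embedding_dimension)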
Once the external data source is established, the next step is to retrieve information relevant to the user's query. This involves converting the user query into a vector representation, which is then matched against the vectorized data in the knowledge library. The goal is to find and retrieve the most relevant pieces of information that align with the query.
For example, if a chatbot is used to answer HR-related questions within a company, and an employee asks about their available annual leave, the system would search for and retrieve documents related to the company's leave policy and the employee's specific leave records. The relevance of these documents is determined through mathematical vector calculations.
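Continuing the illustrative sketch above, the user query is embedded with the same model and compared against the stored vectors; because the vectors are normalized, a dot product gives the cosine similarity used to rank the documents.

# Continues the sketch above (reuses model, documents, and doc_vectors).
# Embed the user query and rank documents by cosine similarity.
query = "How many days of annual leave do I have?"
query_vector = model.encode([query], normalize_embeddings=True)[0]

# Dot product of normalized vectors equals cosine similarity
scores = doc_vectors @ query_vector
best_match = documents[int(np.argmax(scores))]
print(best_match)  # the leave-policy sentence should score highest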
After retrieving the relevant information, the next step is to augment the LLM's prompt with this data. This involves integrating the retrieved information into the original user input to create a more contextually enriched prompt.
The augmented prompt is then fed into the LLM, which uses both its pre-existing knowledge and the new, context-specific data to generate a response. This step leverages prompt engineering techniques to ensure that the LLM can effectively utilize the additional information.
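A minimal sketch of this augmentation step, reusing the query and retrieved text from the sketch above; the template wording is illustrative, and real systems tune it carefully.

# A minimal sketch of augmenting the prompt with retrieved context.
# Reuses best_match and query from the retrieval sketch above.
retrieved_chunks = [best_match]

augmented_prompt = (
    "Answer the question using only the context below.\n\n"
    "Context:\n" + "\n".join(retrieved_chunks) + "\n\n"
    "Question: " + query
)

# The augmented prompt is then sent to the LLM of your choice
print(augmented_prompt)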
To ensure that the RAG system remains effective over time, it's crucial to keep the external data up-to-date. This involves regularly updating the data sources and refreshing the vector representations stored in the knowledge library.
Updating can be done either in real-time or through periodic batch processes, depending on the nature of the data and its rate of change. Keeping the external data current is essential for maintaining the accuracy and relevance of the information that the RAG system retrieves and integrates into responses.
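One possible batch-refresh pattern, sketched under the assumption that each record stores a hash of the text that was last embedded, is to re-embed only the entries whose content has changed:

# A sketched batch refresh: re-embed only documents whose text changed.
# Assumes each record is a dict with a "text" key and an optional
# "embedded_hash" key recording the last-embedded version.
import hashlib

def refresh_embeddings(records, model, doc_vectors):
    for i, record in enumerate(records):
        digest = hashlib.sha256(record["text"].encode("utf-8")).hexdigest()
        if digest != record.get("embedded_hash"):
            # Content changed since the last pass: recompute its embedding
            doc_vectors[i] = model.encode([record["text"]], normalize_embeddings=True)[0]
            record["embedded_hash"] = digest
    return doc_vectors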
Here are the top 10 tutorials showcasing the latest techniques for creating your own RAG systems:
Related content: Read our guide to RAG vs fine tuning
LangChain is a framework that simplifies the integration of large language models (LLMs) with external data sources and complex workflows. Among other capabilities, LangChain makes it easy to set up RAG. Below is a quick tutorial showing how to implement RAG with LangChain.
These instructions are adapted from the LangChain documentation.
1. Jupyter Notebook: Ensure you have Jupyter installed. If not, you can install it via pip:
pip install notebook
Jupyter notebooks are ideal for running this guide because they provide an interactive environment where you can run each step and see immediate results.
2. Install Required Packages: The primary packages needed for this tutorial are langchain, langchain_community, and langchain_chroma. Install them via pip:
pip install langchain langchain_community langchain_chroma
3. Set Up LangSmith: If you plan to monitor and trace your LangChain application, set up LangSmith for logging. First, sign up for an account, then set the necessary environment variables:
import getpass
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass()
This setup will help you trace and debug the chain of operations in your application.
The first step in building a RAG-based application is to load the data you want to use as a knowledge base. In this example, we load a Time Magazine article from a web URL using DocumentLoaders.
Here’s how you can load the content:
# Import the BeautifulSoup module for parsing HTML content
import bs4

# Import the WebBaseLoader class from the langchain_community.document_loaders module
from langchain_community.document_loaders import WebBaseLoader

# Define a SoupStrainer to filter out only specific HTML elements based on class names
bs4_strainer = bs4.SoupStrainer(class_=("post-title", "post-header", "post-content"))

# Create an instance of WebBaseLoader with the specified webpage URL.
# The 'bs_kwargs' argument passes BeautifulSoup-specific arguments
loader = WebBaseLoader(
    web_paths=("https://time.com/6556168/when-ai-outsmart-humans/",),
    bs_kwargs={"parse_only": bs4_strainer},
)

# Load the documents from the specified URL, applying the filtering specified by bs4_strainer
docs = loader.load()

# Print the length of the content of the first document loaded
print(len(docs[0].page_content))
This code loads the webpage content while stripping away unnecessary parts, keeping only the title, headers, and content.
The loaded document is usually too large to fit into the context window of most models. Therefore, the next step is to split it into manageable chunks that can be stored and searched efficiently. We use RecursiveCharacterTextSplitter for this.
Here’s the code:
# Import the RecursiveCharacterTextSplitter class from the langchain.text_splitter module
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Create a RecursiveCharacterTextSplitter with these parameters:
# - chunk_size: The maximum size of each text chunk
# - chunk_overlap: Characters that overlap between chunks
# - add_start_index: If True, adds the starting index of each chunk to the metadata
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)

# Split the documents into smaller chunks based on the specified chunk_size and chunk_overlap
all_splits = text_splitter.split_documents(docs)

# Print the total number of text chunks created from the documents
print(len(all_splits))

# Print the metadata of the first chunk to show additional information about it
print(all_splits[0].metadata)
This process ensures that the chunks are not only of manageable size but also retain contextual relevance due to the overlap.
Once the document is split into chunks, the next step is to index these chunks in a vector store. This allows for efficient retrieval based on the similarity of the chunks to a user query. We use Chroma as the vector store and OpenAIEmbeddings to embed each chunk.
Here’s how to do it:
# Import the Chroma vector store (from the langchain_chroma package installed earlier)
# and the OpenAI embeddings class. OpenAIEmbeddings requires the openai package
# and an OPENAI_API_KEY environment variable.
from langchain_chroma import Chroma
from langchain_community.embeddings import OpenAIEmbeddings

# Embed each chunk and index it in a Chroma vector store
vectorstore = Chroma.from_documents(documents=all_splits, embedding=OpenAIEmbeddings())
This code creates a vector store that you can query to find the most relevant document chunks based on user input.
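As a quick sanity check, you can also query the store directly with similarity_search; the query string below is illustrative.

# Quick sanity check: retrieve the chunks closest to an example query
results = vectorstore.similarity_search("What is the scaling hypothesis?", k=3)
print(results[0].page_content)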
The retrieval process involves searching the vector store to find document chunks relevant to a user's query. We use the VectorStoreRetriever for this. Here's the code:
# Create a retriever object from the vectorstore with these arguments:
# - search_type="similarity": The retriever should use similarity-based search
# - search_kwargs={"k": 6}: Sets the number of top similar documents to retrieve to 6
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 6})

# Use the retriever to find and retrieve documents relevant to the given query
retrieved_docs = retriever.invoke("What is the scaling hypothesis?")

# Print the number of documents retrieved
print(len(retrieved_docs))

# Print the content of the first retrieved document
print(retrieved_docs[0].page_content)
This retrieves the top 6 most relevant document chunks based on the similarity to the query.
Finally, we chain all the components together to generate a response. We pass the retrieved documents and the original user query to the language model to produce a final answer.
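The sketch below shows one way to wire this up with LangChain's expression language. It assumes the langchain-openai package is installed and an OpenAI API key is configured; the model name and prompt wording are illustrative choices rather than requirements.

# A minimal sketch of the retrieval + generation chain.
# Assumes: pip install langchain-openai and OPENAI_API_KEY is set.
# The model name and prompt wording below are illustrative.
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo")

prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the following context.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)

def format_docs(docs):
    # Join the retrieved chunks into a single context string
    return "\n\n".join(doc.page_content for doc in docs)

# Retrieve context for the question, fill the prompt, call the model,
# and parse the model output to a plain string
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

print(rag_chain.invoke("What is the scaling hypothesis?"))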
This setup ties together retrieval and generation, enabling the creation of a dynamic question-answering system.
To see what you can start building today with GPTScript, visit our docs at https://gptscript-ai.github.io/knowledge/. For a great example of RAG at work with GPTScript, check out our blog post GPTScript Knowledge Tool v0.3 Introduces One-Time Configuration for Embedding Model Providers.