Knowledge Hub
In our previous article on semantic search, we explored how to create embeddings and use them to rank documents by their relevance to a given query phrase. However, semantic search has broader applications beyond document ranking. It can serve as a mechanism to connect large language models (LLMs) to knowledge bases, enabling them to analyze and leverage external data sources.
By default, general LLMs only possess information from their training datasets. However, if we provide them with additional data, they can analyze it and uncover hidden connections between different pieces of information. LLMs can also be used to:
RAG, short for Retrieval-Augmented Generation, is a powerful framework that combines retrieval models with language generation models. It retrieves documents from a given resource based on their relevance to a specific context and uses them to augment the generation process, producing enhanced responses in conversational chat applications.
The key idea behind RAG is to use retrieval models to fetch relevant documents and language generation models to generate responses based on those documents. By incorporating a retrieval mechanism, RAG can provide more accurate and contextually relevant responses in conversational systems.
Some key use cases of RAG include:
However, it's essential to understand that LLMs themselves cannot access external resources or perform operations beyond generating text. This is where the LangChain library comes in. LangChain enables interaction between LLMs and external systems, allowing for seamless switching of context between the LLM and our software.
LangChain is a communication layer between our software and an LLM. It is a set of tools and functions that allow us to:
The chain of prompts helps to alleviate some of the weaknesses of LLMs. By adding prompts that can perform specific operations on the input, we can channel the dialog in a more structured way, while also enriching the data being fed to the LLM.
Now let's discuss certain limitations of large language models (LLMs) and understand the techniques we can utilize to overcome those limitations.
LLMs are stateless, which means they do not have the ability to remember previous messages or maintain short-term memory like humans do. To overcome this limitation, we can implement context windows and add memory to LLMs.
Context windows involve providing the LLM with the entire conversation history, including past user inputs and model responses. By feeding this context back, the LLM can generate more accurate and relevant responses in ongoing conversations.
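To make this concrete, here is a minimal sketch of a context window in practice, using LangChain's ChatOpenAI and message classes (the conversation shown is just an illustrative example):

import { ChatOpenAI } from "@langchain/openai";
import { AIMessage, HumanMessage, SystemMessage } from "@langchain/core/messages";

// A couple of hypothetical earlier turns of the conversation.
const history = [
  new HumanMessage("Who is Zeus?"),
  new AIMessage("Zeus is the king of the Greek gods and ruler of Olympus."),
];

const model = new ChatOpenAI({ modelName: "gpt-3.5-turbo" });

// Feed the whole conversation back to the model, then append the new question.
const response = await model.invoke([
  new SystemMessage("You are a helpful assistant."),
  ...history,
  new HumanMessage("What are his traits?"),
]);

console.log(response.content);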
While adding memory to LLMs using context windows is a significant step towards improving their performance, it is essential to take into account the limit on the number of tokens allowed in a single request. Token limits exist to ensure the efficient processing of large language models. For instance, models like gpt-3.5-turbo and gpt-3.5-turbo-16k have token limits of 4,096 and 16,385 tokens, respectively, while gpt-4 has a token limit of 8,192 tokens.
As the amount of context increases, the processing becomes slower and more resource-intensive. Therefore, merely adding an infinite amount of text is not feasible within these limitations.
To address this challenge, it is important to summarize the chat history and include only the most relevant messages. For example, if a chat contains 100 messages, it is pragmatic to include only the last 20 messages to ensure that the token count remains within the allowable limits.
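As a rough illustration, trimming can be as simple as keeping only the tail of the history before it is sent to the model. The ChatMessage shape and the helper below are hypothetical, not part of LangChain:

// Hypothetical message shape used by our chat UI.
type ChatMessage = { type: "human" | "ai"; content: string };

// Keep only the most recent messages so the prompt stays within the token limit.
function trimHistory(history: ChatMessage[], maxMessages = 20): ChatMessage[] {
  return history.slice(-maxMessages);
}

// Example: a 100-message history is reduced to its last 20 messages.
// const recentHistory = trimHistory(fullHistory, 20);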
In this guide, we will walk through the process of building a chat application where users can ask questions about Greek and Roman myths based on the book "Stories of Old Greece and Rome" by Emilie K. Baker.
Before we can proceed with building the application, there are several steps we need to follow. The first step involves preparing our data. Similar to the approach described in the ChromaDB semantic search article, we need to convert our data into documents and store them in Chroma.
Before we dive into the code, there are a few packages we need to install: @langchain/core, @langchain/openai, and @langchain/community.
Here's a brief summary of what each package does:
@langchain/core: provides the core methods of LangChain.
@langchain/openai: bundles the classes related to the OpenAI API, which we will use for automatic embedding generation.
@langchain/community: contains helper classes created by the community, including vector store utilities such as the Chroma class that we will use in our application.
To install these packages, you can run the following command in your terminal:
npm install @langchain/core @langchain/community @langchain/openai
# or yarn
yarn add @langchain/core @langchain/community @langchain/openai
Once the packages are installed, we can proceed to the next step of our application.
To begin with, we need to split the large plain-text file into smaller, more manageable documents. This is necessary because a single document cannot accommodate the entire book. To achieve this, we will utilize the TextLoader and RecursiveCharacterTextSplitter.
// Import necessary modules for file path handling and text processing.
import path from "path";
import { TextLoader } from "langchain/document_loaders/fs/text";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

// Asynchronous function to load and split a text document.
async function getDocuments() {
  // Create the full file path for 'greek_and_roman_myths.txt'.
  const pathToDocument = path.join(
    process.cwd(),
    "src/assets/docs/greek_and_roman_myths.txt"
  );

  // Initialize TextLoader to load the document.
  const loader = new TextLoader(pathToDocument);
  const docs = await loader.load(); // Load the document.

  // Initialize RecursiveCharacterTextSplitter to split the text into chunks.
  const textSplitter = new RecursiveCharacterTextSplitter({
    chunkSize: 3000,
    chunkOverlap: 200,
  });

  // Return the split text chunks.
  return textSplitter.splitDocuments(docs);
}
The RecursiveCharacterTextSplitter is a utility class that helps us split plain text files into smaller documents for our chat application. We can adjust two options: chunkSize and chunkOverlap.
To control the size of each document, we use the chunkSize option. By specifying the desired size, we can make the chunks more manageable for further processing.
Preserving context is important when splitting documents. The chunkOverlap option helps with this: it includes a portion of the end of one chunk at the beginning of the next, so each document retains the necessary context.
To find the best configuration, I suggest experimenting with different values for chunkSize and chunkOverlap, as shown in the small experiment below. This will help you determine the optimal chunk size based on the formatting of your text and how strongly adjacent sentences depend on each other.
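If you want to see the effect of these options before indexing the whole book, a quick standalone experiment like the one below can help (the sample text and sizes are arbitrary):

import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

// Small splitter just for experimentation.
const sampleSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 80,
  chunkOverlap: 20,
});

const sampleText =
  "Zeus ruled from Mount Olympus. His brother Poseidon ruled the sea. " +
  "Hades, the third brother, ruled the underworld and rarely left it.";

// createDocuments splits raw strings into Document objects.
const sampleDocs = await sampleSplitter.createDocuments([sampleText]);

// Inspect how the text was grouped and how much content is shared between chunks.
sampleDocs.forEach((doc, i) => console.log(i, doc.pageContent));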
We have successfully prepared our documents, and now it's time to store them in ChromaDB so that our chat application can access and retrieve them efficiently.
First, we need to establish a connection with ChromaDB and load our documents.
We can achieve this easily using LangChain's Chroma class from the @langchain/community package, as follows:
import { Chroma } from "@langchain/community/vectorstores/chroma";
import { OpenAIEmbeddings } from "@langchain/openai";

function createVectorStore() {
  const COLLECTION_NAME = "documents";

  const embeddings = new OpenAIEmbeddings();

  const vectorStore = new Chroma(embeddings, {
    url: process.env.CHROMADB_PATH,
    collectionName: COLLECTION_NAME,
  });

  return vectorStore;
}
We pass the OpenAIEmbeddings instance as the first argument to the Chroma constructor; it allows LangChain to automatically create embedding vectors for our documents. The second argument is the connection configuration for our vector store instance.
If you're interested in learning more about embeddings, I recommend referring to the semantic search article, where we delve deeper into the concept. There we use a plain ChromaDB package without LangChain. In a nutshell, embeddings are vector representations that capture the semantic meaning of text data.
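If you want to peek at what an embedding actually looks like, you can generate one directly with the same OpenAIEmbeddings class (shown here only as a quick standalone check):

import { OpenAIEmbeddings } from "@langchain/openai";

const embeddings = new OpenAIEmbeddings();

// Turn a piece of text into its vector representation.
const vector = await embeddings.embedQuery("Zeus is the king of the Greek gods.");

console.log(vector.length); // Dimensionality of the vector, e.g. 1536 for text-embedding-ada-002.
console.log(vector.slice(0, 5)); // The first few numbers of the vector.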
Here is the full code of the script responsible for preparing our data for further operations in our chat application. We will name this file index-docs.js.
// Load environment variables from a .env file.
import "dotenv/config";

// Import modules for file path handling and text processing.
import path from "path";
import { TextLoader } from "langchain/document_loaders/fs/text";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

// Import modules for vector storage and OpenAI embeddings.
import { Chroma } from "@langchain/community/vectorstores/chroma";
import { OpenAIEmbeddings } from "@langchain/openai";

// Asynchronously load and split a document into chunks.
async function getDocuments() {
  // Construct the full path to the document.
  const pathToDocument = path.join(
    process.cwd(),
    "src/assets/docs/greek_and_roman_myths.txt"
  );

  // Load the document using TextLoader.
  const loader = new TextLoader(pathToDocument);
  const docs = await loader.load();

  // Split the document into smaller parts with RecursiveCharacterTextSplitter.
  const textSplitter = new RecursiveCharacterTextSplitter({
    chunkSize: 3000,
    chunkOverlap: 200,
  });

  // Split and return the document.
  const splitDocs = await textSplitter.splitDocuments(docs);
  return splitDocs;
}

// Create a vector store for document handling.
function createVectorStore() {
  // Define the collection name for the documents.
  const COLLECTION_NAME = "documents";

  // Initialize OpenAI embeddings.
  const embeddings = new OpenAIEmbeddings();

  // Create and return a new Chroma vector store with specified settings.
  const vectorStore = new Chroma(embeddings, {
    url: process.env.CHROMADB_PATH,
    collectionName: COLLECTION_NAME,
  });

  return vectorStore;
}

async function build() {
  // Create a vector store and load documents.
  const vectorStore = await createVectorStore();
  const docs = await getDocuments();

  // Add the processed documents to the vector store.
  await vectorStore.addDocuments(docs);
}

// Run the build function.
build();
To handle user requests in our Astro application, we need to create an API route that supports the POST method. Since the specific implementation details of the Astro framework are beyond the scope of this article, I recommend referring to the practical guide of the Astro framework for more detailed information.
Here is the code for the API route:
import type { APIRoute } from "astro";

// POST API route with request handling.
export const POST: APIRoute = async ({ request }) => {
  // Parse JSON from the request.
  const body = await request.json();

  // Extract 'question' and 'history', with 'history' defaulting to empty.
  const { question, history = [] } = body;

  // Return an error response if 'question' is missing.
  if (!question) {
    return new Response(JSON.stringify("Please provide query phrase"), {
      status: 403,
    });
  }

  // Get and invoke the runnable sequence with the question and history.
  const chain = await getRunnableSequence();
  const result = await chain.invoke({ question, history });

  // Return the processed result as a JSON response.
  return new Response(JSON.stringify({ result }, null, 2));
};
In the provided code, notice the getRunnableSequence function. We will delve deeper into this function and its implementation details in a later section.
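For completeness, here is a sketch of how the UI might call this route from the browser. The /api/chat path is an assumption; it depends on where the route file lives in your Astro project:

// Hypothetical client-side helper for calling the chat API route.
async function askQuestion(
  question: string,
  history: { type: string; content: string }[]
) {
  const response = await fetch("/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ question, history }),
  });

  const { result } = await response.json();
  return result as string;
}

// Example: const answer = await askQuestion("Who is Zeus?", []);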
To implement a conversational chat application using the RAG (Retrieval-Augmented Generation) approach, we need to follow these steps:
In order to handle natural language questions from users and derive information from the chat history, we need to consider a scenario where users may ask questions that can be deduced from the chat history. For example, instead of explicitly asking "Give me the traits of Zeus!" a user might ask "Who is Zeus?" and then follow up with "What are his traits?"
To handle this scenario, we need to pass the chat history as context and instruct the model to rephrase the follow-up question into a standalone question. This enables the model to generate more coherent and relevant responses based on the conversation context.
We will refer to this prompt as condenseQuestionTemplate.
Here is the condenseQuestionTemplate prompt:
const condenseQuestionTemplate = `
If the user asks about mythology, use the conversation history.
Given the following conversation and a follow up question,
rephrase the follow up question to be a standalone question.
Chat History:
{chat_history}
Follow Up Input: {question}
Standalone question:
`;

const CONDENSE_QUESTION_PROMPT = PromptTemplate.fromTemplate(
  condenseQuestionTemplate
);
The PromptTemplate is a component provided by @langchain/core. LangChain offers multiple methods for creating prompts, and behind the scenes it generates specific objects that seamlessly interact with other LangChain components, such as chains and agents.
In this step, we will create an answering prompt to analyze the text document that we have provided, and generate a response to the specific question that we narrowed down in our previous example.
const answerTemplate = `
You should be nice to the user and provide short, witty but comprehensive answers.
Answer the question based only on the given context.
Step 1. Find the relevant answer based on the DOCUMENT.
Step 2. Format it in a readable, user-friendly markdown format.
DOCUMENT:
--------
{context}
Question:
---------
{question}
`;

const ANSWER_PROMPT = PromptTemplate.fromTemplate(answerTemplate);
Note that we use curly braces for {context} and {question} as part of LangChain's template replacement convention. This is particularly important because we will be creating a pipeline of chains that automatically fills in the necessary inputs.
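To see this replacement in action, you can format a prompt by hand; this is just a quick standalone check, not part of the application code:

import { PromptTemplate } from "@langchain/core/prompts";

const demoPrompt = PromptTemplate.fromTemplate(
  "Answer based on: {context}\nQuestion: {question}"
);

// The chain normally fills these variables for us; format() shows the final text the LLM receives.
const filled = await demoPrompt.format({
  context: "Zeus is the ruler of Olympus.",
  question: "Who rules Olympus?",
});

console.log(filled);
// Answer based on: Zeus is the ruler of Olympus.
// Question: Who rules Olympus?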
A crucial concept within LangChain is the use of Runnables. Runnables allow us to combine multiple LangChain components, such as LLM models, prompts, and input-processing functions, as well as entire chains, in a sequential manner using pipes.
Since this concept is essential to understand and implement successfully, I highly recommend referring to the official LangChain documentation for further details and examples.
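As a small taste of how Runnables compose before we build the real chains, here is a minimal prompt -> model -> parser pipeline (purely illustrative, not part of our application):

import { PromptTemplate } from "@langchain/core/prompts";
import { ChatOpenAI } from "@langchain/openai";
import { StringOutputParser } from "@langchain/core/output_parsers";

const jokePrompt = PromptTemplate.fromTemplate("Tell a one-line joke about {topic}.");
const model = new ChatOpenAI({ modelName: "gpt-3.5-turbo" });

// Each runnable receives the previous one's output; pipe() wires them together.
const miniChain = jokePrompt.pipe(model).pipe(new StringOutputParser());

// invoke() runs the whole pipeline and returns a plain string.
const joke = await miniChain.invoke({ topic: "Greek gods" });
console.log(joke);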
So now, let's summarize what we should do. To achieve the desired results, follow these steps:
1. Use the condenseQuestionTemplate prompt to rephrase the user's follow-up question into a standalone question.
2. Retrieve the documents most relevant to that standalone question from the vector store.
3. Pass the retrieved documents and the question to the answerTemplate prompt to obtain the final results.

Now, similar to our index-docs.js script, we need to initialize a vector store, this time using fromExistingCollection, which directly loads documents from our desired collection. In a nutshell, this is the same process as in index-docs.js; LangChain simply provides multiple ways to achieve the same thing. Next, we create a retriever from our database and configure it to retrieve the 20 most relevant documents to be included in our context.
Note that including a large number of documents in the retriever is not recommended, as it may exceed the limits of our context window. Adjust this number based on your specific needs and the nature of your data.
const vectorStore = await Chroma.fromExistingCollection(
  new OpenAIEmbeddings({ openAIApiKey: import.meta.env.OPENAI_API_KEY }),
  {
    collectionName: "documents",
    url: import.meta.env.CHROMADB_PATH,
  }
);

const retriever = vectorStore.asRetriever(20);
Since we have already created a retriever, let's combine everything into a Runnable sequence.
In the code snippet below, our sequence consists of the following steps:
1. Map the raw input into the question and chat_history variables, formatting the history with the formatChatHistory helper.
2. Fill the CONDENSE_QUESTION_PROMPT, which is the prompt we created from the PromptTemplate.
3. Pass the filled prompt to the model. We set its verbose flag to true in order to debug and understand how it works behind the scenes.
4. Parse the result with the StringOutputParser from @langchain/core/output_parsers, which directly parses our output into a string instead of the LangChain output object. It is important to note that it does not process the actual LLM output, but rather converts the LangChain returned object, so we get a string output.

// Format the chat history as plain text so it can be injected into the prompt.
function formatChatHistory(chatHistory: ChatMessage[]) {
  const formattedDialogueTurns = chatHistory.map((message) => {
    return `${message.type}: ${message.content}`;
  });

  return formattedDialogueTurns.join("\n");
}

// The chat model; verbose mode logs each step LangChain performs.
const model = new ChatOpenAI({
  modelName: "gpt-3.5-turbo-1106",
  openAIApiKey: import.meta.env.OPENAI_API_KEY,
  verbose: true,
});

// Chain: map inputs -> condense prompt -> model -> string output.
const standaloneQuestionChain = RunnableSequence.from([
  {
    question: (input) => input.question,
    chat_history: (input) => formatChatHistory(input.history),
  },
  CONDENSE_QUESTION_PROMPT,
  model,
  new StringOutputParser(),
]);
Furthermore, we need to create a Runnable called answerChain, which will retrieve the documents based on the condensed question and pass them to the answer prompt. As you can see, we use the Runnable interface to retrieve our documents and assign them to the context input variable of our prompt.
const answerChain = RunnableSequence.from([
  {
    context: retriever.pipe(formatDocumentsAsString),
    question: new RunnablePassthrough(),
  },
  ANSWER_PROMPT,
  model,
  new StringOutputParser(),
]);
And lastly, to connect both Runnable sequences together, we need to add the secret sauce of LangChain: piping. LangChain provides a piping interface, which is handy when you need to combine multiple chains together.
const chain = standaloneQuestionChain.pipe(answerChain);
Here is the full implementation of the getRunnableSequence function:
// Imports assumed by this module (paths may differ slightly between LangChain versions).
import { ChatOpenAI, OpenAIEmbeddings } from "@langchain/openai";
import { Chroma } from "@langchain/community/vectorstores/chroma";
import { PromptTemplate } from "@langchain/core/prompts";
import { RunnableSequence, RunnablePassthrough } from "@langchain/core/runnables";
import { StringOutputParser } from "@langchain/core/output_parsers";
import { formatDocumentsAsString } from "langchain/util/document";

async function getRunnableSequence() {
  const model = new ChatOpenAI({
    modelName: "gpt-3.5-turbo-1106",
    openAIApiKey: import.meta.env.OPENAI_API_KEY,
    verbose: true,
  });

  const condenseQuestionTemplate = `
If the user asks about mythology, use the conversation history.
Given the following conversation and a follow up question,
rephrase the follow up question to be a standalone question.
Chat History:
{chat_history}
Follow Up Input: {question}
Standalone question:
`;
  const CONDENSE_QUESTION_PROMPT = PromptTemplate.fromTemplate(
    condenseQuestionTemplate
  );

  const answerTemplate = `
You should be nice to the user and provide short, witty but comprehensive answers.
Answer the question based only on the given context.
Step 1. Find the relevant answer based on the DOCUMENT.
Step 2. Format it in a readable, user-friendly markdown format.
DOCUMENT:
--------
{context}
Question:
---------
{question}
`;
  const ANSWER_PROMPT = PromptTemplate.fromTemplate(answerTemplate);

  const vectorStore = await Chroma.fromExistingCollection(
    new OpenAIEmbeddings({ openAIApiKey: import.meta.env.OPENAI_API_KEY }),
    {
      collectionName: "documents",
      url: import.meta.env.CHROMADB_PATH,
    }
  );
  const retriever = vectorStore.asRetriever(20);

  // Condense the follow-up question into a standalone question.
  const standaloneQuestionChain = RunnableSequence.from([
    {
      question: (input) => input.question,
      chat_history: (input) => formatChatHistory(input.history),
    },
    CONDENSE_QUESTION_PROMPT,
    model,
    new StringOutputParser(),
  ]);

  // Retrieve relevant documents and answer based on them.
  const answerChain = RunnableSequence.from([
    {
      context: retriever.pipe(formatDocumentsAsString),
      question: new RunnablePassthrough(),
    },
    ANSWER_PROMPT,
    model,
    new StringOutputParser(),
  ]);

  // Pipe both sequences together: condensed question -> final answer.
  const chain = standaloneQuestionChain.pipe(answerChain);

  return chain;
}
Now, let's run the application and see how it looks! I've already prepared a nice UI for us to quickly test our application, so we can get started right away.
Moreover, since we set the verbose flag on our ChatOpenAI model, we can observe in the terminal the steps that LangChain performs in order to produce our response.
You can check out the demo and also dig into the project on GitHub.
The RAG approach can be applied in various areas, enabling users to access and retrieve information efficiently. Here are some examples:
Let's summarize what we have learned in today's journey:
In conclusion, I encourage you to delve further into LangChain, as it combines various methods of interaction with LLMs. This understanding will shed light on how to create software based on generative AI (GenAI) principles, opening up new possibilities and advancements in the field of conversational AI.