TL;DR
In this article, you'll learn how to build a RAG-based chatbot to chat with any PDF of your choice so you can achieve your lifelong dream of talking to PDFs 😏 At the end, I'll also show you how to test what you've built.
I know, I wrote something similar in my last article on building a customer support chatbot 😅 but this week we're going to dive deep into how to use the raw OpenAI API to chat with PDF data (including text trapped in visuals like tables) stored in ChromaDB, as well as how to use Streamlit to build the chatbot UI.
Introducing RAG, Vector Databases, and OCR
Before we dive into the code, let's unpack what we're going to implement 🕵️ To begin, OCR (Optical Character Recognition) is a computer vision technology that recognizes the characters present in a document and converts them into text - this is particularly helpful in the case of tables and charts in documents 😬 We'll be using the OCR provided by Azure Cognitive Services in this tutorial.
Once text chunks are extracted using OCR, they are converted into high-dimensional vectors (i.e., vectorized) using embedding models like Word2Vec, FastText, or BERT. These vectors, which encapsulate the semantic meaning of the text, are then indexed in a vector database. We'll be using ChromaDB as our in-memory vector database 🥳
Now, let's see what happens when a user asks their PDF something. First, the user query is vectorized using the same embedding model that was used to vectorize the extracted PDF text chunks. Then, the top K most semantically similar text chunks are fetched by searching through the vector database, which, remember, contains the text chunks from our PDF. The retrieved text chunks are then provided as context for ChatGPT to generate an answer based on information in the PDF. This is the process of Retrieval-Augmented Generation (RAG).
(Click here to learn how to evaluate RAG applications in CI/CD pipelines!)
Feeling educated? 😊 Let's begin.
Project Setup
First, I'm going to guide you through how to set up your project folders and any dependencies you need to install.
Create a project folder and a Python virtual environment by running the following commands:
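For example, assuming a Unix-like shell (the folder name is up to you):

```bash
# create the project folder and enter it
mkdir chat-with-pdf
cd chat-with-pdf

# create and activate a Python virtual environment
python3 -m venv venv
source venv/bin/activate
```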
Your terminal prompt should now look something like this:
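The exact prompt depends on your shell, but the `(venv)` prefix indicates the virtual environment is active:

```
(venv) $
```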
Installing dependencies
Run the following command to install Streamlit, the OpenAI SDK, ChromaDB, and Azure's Form Recognizer SDK:
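```bash
pip install streamlit azure-ai-formrecognizer chromadb openai
```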
Let's briefly go over what each of these packages does:
- streamlit - sets up the chat UI, which includes a PDF uploader (thank god 😌)
- azure-ai-formrecognizer - extracts textual content from PDFs using OCR
- chromadb - an in-memory vector database that stores the extracted PDF content
- openai - we all know what this does (receives relevant data from chromadb and returns a response based on your chatbot input)
Next, create a new main.py file - the entry point to your application.
Getting your API keys
Lastly, get your OpenAI and Azure API keys ready (click the hyperlinks to get them if you don't already have them).
Note: It's pretty troublesome to sign up for an account on Azure Cognitive Services. You'll need a credit card (although they won't charge you automatically) and a phone number 😔 but do give it a try if you're trying to build something serious!
Building the Chatbot UI with Streamlit
Streamlit is an easy way to build frontend applications using Python. Let's import Streamlit along with everything else we'll need:
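Here's a sketch of the imports and clients we'll be working with - it assumes the modern `openai>=1.0` client interface, and the environment variable names are placeholders for however you manage your keys:

```python
import os
import tempfile

import streamlit as st
import chromadb
from openai import OpenAI
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

# clients for OpenAI and Azure Form Recognizer -
# the environment variable names here are just placeholders
openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
azure_client = DocumentAnalysisClient(
    endpoint=os.environ["AZURE_ENDPOINT"],
    credential=AzureKeyCredential(os.environ["AZURE_KEY"]),
)
```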
Give our chat UI a title and create a file uploader:
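Something like this (the title text and label are just examples):

```python
st.title("Chat with your PDF 📄")
uploaded_file = st.file_uploader("Upload a PDF", type="pdf")
```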
Listen for a change event in `uploaded_file`. This will be triggered when you upload a file:
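A minimal sketch - we persist the upload to a temporary file so it can be handed to Azure in the next step. Note that everything from here on lives inside this `if` block:

```python
if uploaded_file is not None:
    # persist the uploaded bytes to disk so the OCR client can read them
    with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as f:
        f.write(uploaded_file.getbuffer())
        temp_file = f.name
```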
View your Streamlit app by running `main.py` through the Streamlit CLI (we'll implement the chat input UI later):
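```bash
streamlit run main.py
```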
That's the easy part done 🥳! Next comes the not so easy part...
Extracting text from PDFs
Carrying on from the previous code snippet, we're going to send `temp_file` to Azure Cognitive Services for OCR:
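Here's a minimal sketch of what that might look like, assuming the `prebuilt-document` model and the `azure_client` we created earlier (`to_dict()` converts the result into the plain dictionary we inspect next):

```python
    # (still inside the uploaded_file check from above)
    # send the temporary file to Azure for OCR and wait for the result
    with open(temp_file, "rb") as f:
        poller = azure_client.begin_analyze_document(
            "prebuilt-document", document=f
        )
    dict_info = poller.result().to_dict()
```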
Here, `dict_info` is a dictionary containing information on the extracted text chunks. It's a pretty complicated dictionary, so I would recommend printing it out and seeing for yourself what it looks like.
Paste in the following to finish processing the data received from Azure:
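A sketch of what this processing might look like - the exact keys depend on the Azure model version, so do verify against your own `dict_info`. The idea is to collect page text and flattened table rows into a single list of chunks, `res`:

```python
    res = []

    # grab the raw text line-by-line from each page
    for page in dict_info.get("pages", []):
        for line in page.get("lines", []):
            res.append(line["content"])

    # flatten each table into a single text chunk, row by row
    for table in dict_info.get("tables", []):
        rows = {}
        for cell in table.get("cells", []):
            rows.setdefault(cell["row_index"], []).append(cell["content"])
        table_text = "\n".join(" | ".join(cells) for cells in rows.values())
        res.append(table_text)
```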
Here, we accessed various properties of the dictionary returned by Azure to get the text on each page, as well as the data stored in tables. The logic is pretty complex because of all the nested structures 😨 but from personal experience, Azure OCR works well even for complex PDF structures, so I highly recommend giving it a try :)
Storing PDF content in ChromaDB
Still with me? 😅 Great, we're almost there so hang in there!
Paste in the code below to store extracted text chunks from `res` in ChromaDB.
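A sketch, with `pdf_chunks` as an assumed collection name. ChromaDB embeds the documents for you using its default embedding model:

```python
    chroma_client = chromadb.Client()

    # drop any collection left over from a previous upload so we can
    # re-upload PDFs without refreshing the page
    try:
        chroma_client.delete_collection(name="pdf_chunks")
    except Exception:
        pass

    collection = chroma_client.create_collection(name="pdf_chunks")
    collection.add(
        documents=res,
        ids=[str(i) for i in range(len(res))],
    )
```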
The `try` block ensures that we can keep uploading new PDFs without having to refresh the page.
You might have noticed that we add data into a collection and not to the database directly. A collection in ChromaDB is a vector space. When a user enters a query, the search runs inside this collection instead of across the entire database. In Chroma, a collection is identified by a unique name, and with a single line of code you can add all extracted text chunks to it via `collection.add(...)`.
Generating a response using OpenAI
I get asked a lot about how to build a RAG chatbot without relying on frameworks like LangChain and LlamaIndex. Well, here's how you do it - you construct a list of prompts dynamically based on the results retrieved from your vector database.
Paste in the following code to wrap things up:
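Something along these lines, assuming the modern `openai` client interface and `gpt-3.5-turbo` as a stand-in model name (use whichever model you prefer):

```python
    query = st.text_input("Ask your PDF something:")
    if query:
        # fetch the top 5 most semantically similar chunks from ChromaDB
        results = collection.query(query_texts=[query], n_results=5)
        retrieved_chunks = results["documents"][0]

        # one system message per retrieved chunk
        prompts = [
            {"role": "system", "content": f"Relevant PDF context:\n{chunk}"}
            for chunk in retrieved_chunks
        ]
        # ChromaDB returns chunks in descending order of relevance, but
        # later messages carry more weight - so reverse the list
        prompts.reverse()
        prompts.append({"role": "user", "content": query})

        response = openai_client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=prompts,
        )
        st.write(response.choices[0].message.content)
```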
Notice how we reversed `prompts` after constructing the list according to the retrieved text chunks from ChromaDB. This is because the results returned from ChromaDB are ordered by descending relevance, meaning the most relevant text chunk will always be first in the results list. However, ChatGPT tends to weigh the later prompts in a list more heavily, hence why we have to reverse it.
Run the Streamlit app and try things out for yourself 😙:
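```bash
streamlit run main.py
```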
🎉 Congratulations, you made it to the end!
Taking it a step further
As you know, LLM applications are a black box, so for production use cases you'll want to safeguard the performance of your PDF chatbot to keep your users happy. To learn how to build a simple evaluation framework that can get you set up in less than 30 minutes, click here.
Conclusion
In this article, you've learnt:
- what a vector database is and how to use ChromaDB
- how to use the raw OpenAI API to build a RAG-based chatbot without relying on 3rd party frameworks
- what OCR is and how to use Azure's OCR services
- how to quickly set up a beautiful chatbot UI using Streamlit, which includes a file uploader
This tutorial walked you through an example of how you can build a "chat with PDF" application using just Azure OCR, OpenAI, and ChromaDB. With what you've learnt, you can build powerful applications that help increase the productivity of workforces (at least that's the most prominent use case I've come across).
The source code for this tutorial is available here:
https://github.com/confident-ai/blog-examples/tree/main/chat-with-pdf
Thank you for reading!
Do you want to brainstorm how to evaluate your LLM (application)? Ask us anything in our Discord. I might give you an "aha!" moment, who knows?
Confident AI: The LLM Evaluation Platform
The all-in-one platform to evaluate and test LLM applications on the cloud, fully integrated with DeepEval.