In this story
Jeffrey Ip
Cofounder @ Confident, creator of DeepEval. Ex-Googler (YouTube), Microsoft AI (Office365). Working overtime to enforce responsible AI.

How to build a PDF QA chatbot using OpenAI and ChromaDB

January 18, 2025
·
14 min read
Presenting...
The open-source LLM evaluation framework.
Star on GitHub
How to build a PDF QA chatbot using OpenAI and ChromaDB

TL;DR

In this article, you'll learn how to build a RAG based chatbot to chat with any PDF of your choice so you can achieve your lifelong dream of talking to PDFs 😏 In the end, I'll also show how you can test what you've built.

I know, I wrote something similar in my last article on building a customer support chatbot 😅 but this week we're going to dive deep into how to use the raw OpenAI API to chat with PDF data (including text trapped in visuals like tables) stored in ChromaDB, as well as how to use Streamlit to build the chatbot UI.

Introducing RAG, Vector Databases, and OCR

Before we dive into the code, let's debunk what we're going to implement 🕵️ To begin, OCR (Optical Character Recognition) is a technology within the field of computer vision that recognizes the characters present in the document and converts them into text - this is particularly helpful in the case of tables and charts in documents 😬 We'll be using OCR provided by Azure Cognitive Services in this tutorial.

Once text chunks are extracted using OCR, they are converted into a high-dimensional vector (aka. vectorized) using embedding models like Word2Vec, FastText, or BERT. These vectors, which encapsulate the semantic meaning of the text, are then indexed in a vector database. We'll be using ChromaDB as our in-memory vector database 🥳

Now, let's see what happens when a user asks their PDF something. First, the user query is first vectorized using the same embedding model used to vectorize the extracted PDF text chunks. Then, the top K most semantically similar text chunk is fetched by searching through the vector database, which remember, contains the text chunks from our PDF. The retrieved text chunks are then provided as context for ChatGPT to generate an answer based on information in their PDF. This is the process of retrieval, augmented, generation (RAG).

(Click here to learn how to evaluate RAG applications in CI/CD pipelines!)

Feeling educated? 😊 Let's begin.

Project Setup

First, I'm going to guide you through how to set up your project folders and any dependencies you need to install.

Create a project folder and a python virtual environment by running the following command:


mkdir chat-with-pdf
cd chat-with-pdf
python3 -m venv venv
source venv/bin/activate

Your terminal should now start something like this:


(venv)

Installing dependencies

Run the following command to install OpenAI API, ChromaDB, and Azure:


pip install openai chromadb azure-ai-formrecognizer streamlit tabulate

Let's briefly go over what each of those package does:

  • streamlit - sets up the chat UI, which includes a PDF uploader (thank god 😌)
  • azure-ai-formrecognizer - extracts textual content from PDFs using OCR
  • chromadb - is an in-memory vector database that stores the extracted PDF content
  • openai - we all know what this does (receives relevant data from chromadb and returns a response based on your chatbot input)

Next, create a new main.py file - the entry point to your application


touch main.py

Getting your API keys

Lastly, get your OpenAI and Azure API key ready (click the hyperlink to get them if you don't already have one)

Note: It's pretty troublesome to sign up for an account on Azure Cognitive Services. You'll need a card (although they won't charge you automatically), and phone number 😔 but do give it a try if you're trying to build something serious!

Confident AI: The DeepEval LLM Evaluation Platform

The leading platform to evaluate and test LLM applications on the cloud, native to DeepEval.

Regression test and evaluate LLM apps.
Easily A|B test prompts and models.
Edit and manage datasets on the cloud.
LLM observability with online evals.
Publicly sharable testing reports.
Automated human feedback collection.

Got Red? Red team LLM systems today with Confident AI

The leading platform to safety-test LLM applications on the cloud, native to DeepEval.

Tailored frameworks (e.g. OWASP Top 10)
10+ LLM guardrails to guard malicious I/O
40+ plug-and-play vulnerabilities and 10+ attacks
Guardrails accuracy and latency reporting
Publicly sharable risk assessments.
On-demand custom guards available.

Building the Chatbot UI with Streamlit

Streamlit is an easy way to build frontend applications using python. Lets import streamlit along with setting up everything else we'll need:


import streamlit as st
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential
from tabulate import tabulate
from chromadb.utils import embedding_functions
import chromadb
import openai

# You'll need this client later to store PDF data
client = chromadb.Client()
client.heartbeat()

Give our chat UI a title and create a file uploader:


...
st.write("#Chat with PDF")

uploaded_file = st.file_uploader("Choose a PDF file", type="pdf")
...

Listen for a change event in `uploaded_file`. This will be triggered when you upload a file:


...
if uploaded_file is not None:
    # Create a temporary file to write the bytes to
    with open("temp_pdf_file.pdf", "wb") as temp_file:
        temp_file.write(uploaded_file.read())
...

View your streamlit app by running `main.py` (we'll implement the chat input UI later):


streamlit run main.py

That's the easy part done 🥳! Next comes the not so easy part...

Extracting text from PDFs

Carrying on from the previous code snippet, we're going to send `temp_file` to Azure Cognitive Services for OCR:


    ...
    # you can set this up in the azure cognitive services portal
    AZURE_COGNITIVE_ENDPOINT = "your-custom-azure-api-endpoint"
    AZURE_API_KEY = "your-azure-api-key"
    credential = AzureKeyCredential(AZURE_API_KEY)
    AZURE_DOCUMENT_ANALYSIS_CLIENT = DocumentAnalysisClient(AZURE_COGNITIVE_ENDPOINT, credential)

    # Open the temporary file in binary read mode and pass it to Azure
    with open("temp_pdf_file.pdf", "rb") as f:
        poller = AZURE_DOCUMENT_ANALYSIS_CLIENT.begin_analyze_document("prebuilt-document", document=f)
        doc_info = poller.result().to_dict()
    ...

Here, `dict_info` is a dictionary containing information on the extracted text chunks. It's a pretty complicated dictionary, so I would recommend printing it out and seeing for yourself what it looks like.

Paste in the following to finish processing the data received from Azure:


   ...
   res = []
   CONTENT = "content"
   PAGE_NUMBER = "page_number"
   TYPE = "type"
   RAW_CONTENT = "raw_content"
   TABLE_CONTENT = "table_content"

   for p in doc_info['pages']:
        dict = {}
        page_content = " ".join([line["content"] for line in p["lines"]])
        dict[CONTENT] = str(page_content)
        dict[PAGE_NUMBER] = str(p["page_number"])
        dict[TYPE] = RAW_CONTENT
        res.append(dict)

    for table in doc_info["tables"]:
        dict = {}
        dict[PAGE_NUMBER] = str(table["bounding_regions"][0]["page_number"])
        col_headers = []
        cells = table["cells"]
        for cell in cells:
            if cell["kind"] == "columnHeader" and cell["column_span"] == 1:
                for _ in range(cell["column_span"]):
                    col_headers.append(cell["content"])

        data_rows = [[] for _ in range(table["row_count"])]
        for cell in cells:
            if cell["kind"] == "content":
                for _ in range(cell["column_span"]):
                    data_rows[cell["row_index"]].append(cell["content"])
        data_rows = [row for row in data_rows if len(row) > 0]

        markdown_table = tabulate(data_rows, headers=col_headers, tablefmt="pipe")
        dict[CONTENT] = markdown_table
        dict[TYPE] = TABLE_CONTENT
        res.append(dict)
    ...

Here, we accessed various properties of the dictionary returned by Azure to get texts on the page, and data stored in tables. The logic is pretty complex because of all the nested structures 😨 but from personal experience, Azure OCR works well even for complex PDF structures, so I highly recommend giving it a try :)

Storing PDF content in ChromaDB

Still with me? 😅 Great, we're almost there so hang in there!

Paste in the code below to store extracted text chunks from `res` in ChromaDB.


    ...
    try:
        client.delete_collection(name="my_collection")
        st.session_state.messages = []
    except:
        print("Hopefully you'll never see this error.")

    openai_ef = embedding_functions.OpenAIEmbeddingFunction(api_key="your-openai-api-key", model_name="text-embedding-ada-002")
    collection = client.create_collection(name="my_collection", embedding_function=openai_ef)
    data = []
    id = 1
    for dict in res:
        content = dict.get(CONTENT, '')
        page_number = dict.get(PAGE_NUMBER, '')
        type_of_content = dict.get(TYPE, '')

        content_metadata = {   
            PAGE_NUMBER: page_number,
            TYPE: type_of_content
        }

        collection.add(
            documents=[content],
            metadatas=[content_metadata],
            ids=[str(id)]
        )
        id += 1
    ...

The first try block ensures that we can continue uploading PDFs without having to refresh the page.

You might have noticed that we add data into a collection and not to the database directly. A collection in ChromaDB is a vector space. When a user enters a query, it performs a search inside this collection, instead of the entire database. In Chroma, this collection is identified by a unique name, and with a simple line of code, you can add all extracted text chunks via to this collection via `collection.add(...)`.

Generating a response using OpenAI

I get asked a lot about how to build a RAG chatbot without relying on frameworks like langchain and lLamaIndex. Well here's how you do it - you construct a list of prompts dynamically based on the  results retrieved from your vector database.

Paste in the following code to wrap things up‍:


...
if "messages" not in st.session_state:
    st.session_state.messages = []

# Display chat messages from history on app rerun
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

if prompt := st.chat_input("What do you want to say to your PDF?"):
    # Display your message
    with st.chat_message("user"):
        st.markdown(prompt)
    # Add your message to chat history
    st.session_state.messages.append({"role": "user", "content": prompt})

    # query ChromaDB based on your prompt, taking the top 5 most relevant result. These results are ordered by similarity.
    q = collection.query(
        query_texts=[prompt],
        n_results=5,
    )
    results = q["documents"][0]

    prompts = []
    for r in results:
        # construct prompts based on the retrieved text chunks in results 
        prompt = "Please extract the following: " + prompt + "  solely based on the text below. Use an unbiased and journalistic tone. If you're unsure of the answer, say you cannot find the answer. \n\n" + r

        prompts.append(prompt)
    prompts.reverse()

    openai_res = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "assistant", "content": prompt} for prompt in prompts],
        temperature=0,
    )

    response = openai_res["choices"][0]["message"]["content"]
    with st.chat_message("assistant"):
        st.markdown(response)

    # append the response to chat history
    st.session_state.messages.append({"role": "assistant", "content": response})

Notice how we reversed `prompts` after constructing a list of prompts according to the list of retrieved text chunks from ChromaDB. This is because the results returned from ChromaDB is ordered in descending order, meaning the most relevant text chunk will always be the first in the results list. However, the way ChatGPT works is it considers the last prompt in a list of prompts more, hence why we have to reverse it.

Run the streamlit app and try things out for yourself 😙:


streamlit run main.py

🎉 Congratulations, you made it to the end!

Taking it a step further

As you know, LLM applications are a black box and so for production use cases, you'll want to safeguard the performance of your PDF chatbot to keep your users happy. To learn how to build a simple evaluation framework that could get you setup in less than 30 minutes, click here.

Conclusion

In this article, you've learnt:

  • what a vector database is a how to use ChromaDB
  • how to use the raw OpenAI API to build a RAG based chatbot without relying on 3rd party frameworks
  • what OCR is and how to use Azure's OCR services
  • how to quickly set up a beautiful chatbot UI using streamlit, which includes a file uploader.

This tutorial walked you through an example of how you can build a "chat with PDF" application using just Azure OCR, OpenAI, and ChromaDB. With what you've learnt, you can build powerful applications that help increase the productivity of workforces (at least that's the most prominent use case I've came across).

The source code for this tutorial is available here:
https://github.com/confident-ai/blog-examples/tree/main/chat-with-pdf

Thank you for reading!

Confident AI: The DeepEval LLM Evaluation Platform

The leading platform to evaluate and test LLM applications on the cloud, native to DeepEval.

Regression test and evaluate LLM apps.
Easily A|B test prompts and models.
Edit and manage datasets on the cloud.
LLM observability with online evals.
Publicly sharable testing reports.
Automated human feedback collection.

Got Red? Red team LLM systems today with Confident AI

The leading platform to safety-test LLM applications on the cloud, native to DeepEval.

Tailored frameworks (e.g. OWASP Top 10)
10+ LLM guardrails to guard malicious I/O
40+ plug-and-play vulnerabilities and 10+ attacks
Guardrails accuracy and latency reporting
Publicly sharable risk assessments.
On-demand custom guards available.

Confident AI: The DeepEval LLM Evaluation Platform

The leading platform to evaluate and test LLM applications on the cloud, native to DeepEval.

Regression test and evaluate LLM apps.
Easily A|B test prompts and models.
Edit and manage datasets on the cloud.
LLM observability with online evals.
Publicly sharable testing reports.
Automated human feedback collection.

Got Red? Red team LLM systems today with Confident AI

The leading platform to safety-test LLM applications on the cloud, native to DeepEval.

Tailored frameworks (e.g. OWASP Top 10)
10+ LLM guardrails to guard malicious I/O
40+ plug-and-play vulnerabilities and 10+ attacks
Guardrails accuracy and latency reporting
Publicly sharable risk assessments.
On-demand custom guards available.
* * * * *

Do you want to brainstorm how to evaluate your LLM (application)? Ask us anything in our discord. I might give you an “aha!” moment, who knows?

Confident AI: The DeepEval LLM Evaluation Platform

The leading platform to evaluate and test LLM applications on the cloud, native to DeepEval.

Regression test and evaluate LLM apps.
Easily A|B test prompts and models.
Edit and manage datasets on the cloud.
LLM observability with online evals.
Publicly sharable testing reports.
Automated human feedback collection.

Got Red? Red team LLM systems today with Confident AI

The leading platform to safety-test LLM applications on the cloud, native to DeepEval.

Tailored frameworks (e.g. OWASP Top 10)
10+ LLM guardrails to guard malicious I/O
40+ plug-and-play vulnerabilities and 10+ attacks
Guardrail accuracy and latency reporting
Publicly sharable risk assessments.
On-demand custom guards available.
Jeffrey Ip
Cofounder @ Confident, creator of DeepEval. Ex-Googler (YouTube), Microsoft AI (Office365). Working overtime to enforce responsible AI.

Stay Confident

Subscribe to our weekly newsletter to stay confident in the AI systems you build.

Thank you! You're now subscribed to Confident AI's weekly newsletter.
Oops! Something went wrong while submitting the form.