Building a PDF Ask Anything LLM Bot

Irtiza Hafiz
Jan 23, 2025



I recently built a Q&A bot that extracts content from PDF documents and answers questions about it using OpenAI’s language models and LangChain.

Here’s a step-by-step guide to how I did it.

The Idea

The bot has a simple workflow:

  1. Load content from a PDF file.
  2. Store the content in a vector store for easy searching.
  3. Retrieve relevant sections of the document based on a question.
  4. Use OpenAI’s language model to generate an answer.

Let me show you how it all comes together.

Step 1: Extract Content from a PDF

We start by extracting the text content of a PDF file. LangChain’s PyPDFLoader makes this process simple.

Environment Setup

from langchain_community.document_loaders import PyPDFLoader
from get_relevant_documents import get_answer_from_llm

pdf_file_path = "./data/sample-pdf.pdf"
  • PyPDFLoader: A tool that reads and extracts content from PDF files.
  • pdf_file_path: The path to the PDF file you want to process.

Extracting the PDF Content

loader = PyPDFLoader(pdf_file_path)
documents = []

for doc in loader.lazy_load():
    documents.append(doc)
  • loader.lazy_load(): Streams the content of the PDF file into Document objects.
  • documents: A list that stores the extracted content for further processing.

With the content loaded, we’re ready to move on to the next step.
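For reference, each item in `documents` is a LangChain `Document`, which pairs the text of one page with metadata such as the source file and page number. Here is a minimal stand-in (a plain class, not the real `langchain_core.documents.Document`) with made-up values, just to show the shape:

```python
# Minimal stand-in for LangChain's Document class, illustrating
# the shape of what PyPDFLoader yields: page text plus metadata.
class Document:
    def __init__(self, page_content, metadata):
        self.page_content = page_content  # extracted text of one page
        self.metadata = metadata          # e.g. {"source": ..., "page": ...}

# Hypothetical example of one loaded page.
doc = Document(
    page_content="Welcome to the team handbook.",
    metadata={"source": "./data/sample-pdf.pdf", "page": 0},
)
print(doc.metadata["page"], doc.page_content[:7])
```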

Step 2: Storing and Retrieving Relevant Documents

The next step is to store the extracted content in a vector store. This makes it easy to find the most relevant sections based on a question.

Setting Up OpenAI and Vector Store

import os

import openai
from dotenv import load_dotenv
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import OpenAIEmbeddings

load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")
  • InMemoryVectorStore: Temporarily stores document embeddings for searching.
  • OpenAIEmbeddings: Converts document content into embeddings for similarity search.

Retrieving Relevant Documents

def get_k_relevant_documents(documents, question, k=3):
    print(f"Storing {len(documents)} documents into the vector store.")
    vector_store = InMemoryVectorStore.from_documents(documents, OpenAIEmbeddings())

    print("Getting relevant documents from the in-memory vector store.")
    relevant_docs = vector_store.similarity_search(question, k=k)

    print(f"Retrieved {len(relevant_docs)} similar documents.")
    return relevant_docs
  • Retrieves the k most relevant documents based on a question.
  • Uses cosine similarity to find the most relevant documents by comparing their embeddings.
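As a quick aside, cosine similarity just measures the angle between two embedding vectors. This plain-Python sketch is not what the vector store actually calls internally, but it is the same math:

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same way score 1.0; orthogonal vectors score 0.0.
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # 0.0
```

The closer the score is to 1.0, the more similar the document is to the question, which is how `similarity_search` ranks its top `k` results.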

Step 3: Generating Answers with Context

Now that we have the relevant sections, we use OpenAI’s language model to generate an answer.

Querying the Language Model

from langchain_core.messages import HumanMessage, SystemMessage
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

def get_answer_from_llm(documents, question):
    print(f"Question: {question}")
    relevant_docs = get_k_relevant_documents(documents, question)
    model = ChatOpenAI(model="gpt-4o-mini")

    context_from_docs = "\n\n".join(doc.page_content for doc in relevant_docs)
    messages = [
        SystemMessage(
            content=f"Use the following context to answer my question: {context_from_docs}"
        ),
        HumanMessage(content=question),
    ]

    parser = StrOutputParser()
    chain = model | parser
    return chain.invoke(messages)
  • ChatOpenAI: A wrapper for interacting with OpenAI's language model.
  • context_from_docs: Combines the content of relevant documents into a single string for the model.
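To make the prompt construction concrete, here is what the assembled messages look like, with plain dicts standing in for `SystemMessage` and `HumanMessage` (the retrieved context and question below are made up for illustration):

```python
# Hypothetical retrieved page contents, joined the same way the function
# joins them: separated by a blank line.
context_from_docs = "\n\n".join([
    "Jane Doe joined the engineering team in March.",
    "John Smith joined the platform team in April.",
])
question = "Who are the new team members?"

# Plain-dict stand-ins for SystemMessage / HumanMessage.
messages = [
    {"role": "system",
     "content": f"Use the following context to answer my question: {context_from_docs}"},
    {"role": "user", "content": question},
]
print(messages[0]["content"])
```

The system message carries the retrieved context, and the human message carries the question, so the model only has to answer from what it was handed.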

Step 4: Putting It All Together

Here’s how everything ties together:

answer = get_answer_from_llm(
    documents=documents,
    question="Who are the new team members?",
)
print(answer)
print(answer)
  • Uses the extracted content to find relevant sections and generate an answer.
  • The model responds with an answer based on the PDF’s content.

Example Output

For the sample PDF, running the script might return something like:

“The new team members are Jane Doe and John Smith.”

Closing Thoughts

Okay folks, that’s all for today.

Again, if you want the full code, check out my GitHub.

If you have read this far, thank you for your time. I hope you found it valuable.

If you want to stay connected, here are a few ways you can do so: follow me on Medium or check out my website.

Written by Irtiza Hafiz

Engineering manager who writes about software development and productivity https://irtizahafiz.com
