Building a PDF Ask Anything LLM Bot
Interact with any PDF file through a chatbot.
I recently built a Q&A bot that extracts content from PDF documents and answers questions about it using OpenAI’s language models and LangChain.
Here’s a step-by-step guide to how I did it.
The Idea
The bot has a simple workflow:
- Load content from a PDF file.
- Store the content in a vector store for easy searching.
- Retrieve relevant sections of the document based on a question.
- Use OpenAI’s language model to generate an answer.
Let me show you how it all comes together.
Step 1: Extract Content from a PDF
We start by extracting the text content of a PDF file. LangChain’s PyPDFLoader
makes this process simple.
Environment Setup
from langchain_community.document_loaders import PyPDFLoader
from get_relevant_documents import get_answer_from_llm
pdf_file_path = "./data/sample-pdf.pdf"
- PyPDFLoader: A tool that reads and extracts content from PDF files.
- pdf_file_path: The path to the PDF file you want to process.
Extracting the PDF Content
loader = PyPDFLoader(pdf_file_path)
documents = []
for doc in loader.lazy_load():
    documents.append(doc)
- loader.lazy_load(): Streams the content of the PDF file into Document objects.
- documents: A list that stores the extracted content for further processing.
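It's worth a quick sanity check here. Each Document from PyPDFLoader holds one page of text in page_content, plus metadata such as the source path and page number (this snippet assumes the sample PDF has at least one page):

print(f"Loaded {len(documents)} pages")
print(documents[0].metadata)            # e.g. {'source': './data/sample-pdf.pdf', 'page': 0}
print(documents[0].page_content[:200])  # first 200 characters of the first page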
With the content loaded, we’re ready to move on to the next step.
Step 2: Storing and Retrieving Relevant Documents
The next step is to store the extracted content in a vector store. This makes it easy to find the most relevant sections based on a question.
Setting Up OpenAI and Vector Store
import os

import openai
from dotenv import load_dotenv
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import OpenAIEmbeddings

load_dotenv()  # loads variables from a local .env file into the environment
openai.api_key = os.getenv("OPENAI_API_KEY")
- InMemoryVectorStore: Temporarily stores document embeddings for searching.
- OpenAIEmbeddings: Converts document content into embeddings for similarity search.
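Both OpenAIEmbeddings and the chat model used later read OPENAI_API_KEY from the environment, so the simplest setup is a .env file next to the script (the value below is just a placeholder):

# .env — keep this file out of version control
OPENAI_API_KEY=sk-your-key-here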
Retrieving Relevant Documents
def get_k_relevant_documents(documents, question, k=3):
    print(f"Storing {len(documents)} documents in the vector store.")
    vector_store = InMemoryVectorStore.from_documents(documents, OpenAIEmbeddings())
    print("Getting relevant documents from the in-memory vector store.")
    relevant_docs = vector_store.similarity_search(question, k=k)
    print(f"Retrieved similar documents: {len(relevant_docs)}")
    return relevant_docs
- Retrieves the k most relevant documents based on a question.
- Uses cosine similarity to find the most relevant documents by comparing their embeddings.
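You can also try the retriever on its own before wiring up the language model (the question here is just an illustrative example):

# Standalone check of the retrieval step
top_docs = get_k_relevant_documents(documents, "Who are the new team members?", k=2)
for doc in top_docs:
    print(doc.metadata.get("page"), doc.page_content[:100])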
Step 3: Generating Answers with Context
Now that we have the relevant sections, we use OpenAI’s language model to generate an answer.
Querying the Language Model
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

def get_answer_from_llm(documents, question):
    print(f"Question: {question}")
    relevant_docs = get_k_relevant_documents(documents, question)
    model = ChatOpenAI(model="gpt-4o-mini")
    context_from_docs = "\n\n".join([doc.page_content for doc in relevant_docs])
    messages = [
        SystemMessage(
            content=f"Use the following context to answer my question: {context_from_docs}"
        ),
        HumanMessage(content=question),
    ]
    parser = StrOutputParser()
    chain = model | parser
    return chain.invoke(messages)
- ChatOpenAI: A wrapper for interacting with OpenAI's chat models.
- context_from_docs: Combines the content of the relevant documents into a single string for the model.
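The model | parser pipe is what turns the raw chat response into plain text: ChatOpenAI returns an AIMessage, and StrOutputParser extracts its string content. Inside the function, the chained call is roughly equivalent to:

raw_response = model.invoke(messages)       # AIMessage containing the model's reply
answer_text = parser.invoke(raw_response)   # plain string pulled from the message content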
Step 4: Putting It All Together
Here’s how everything ties together:
answer = get_answer_from_llm(
    documents=documents,
    question="Who are the new team members?",
)
print(answer)
- Uses the extracted content to find relevant sections and generate an answer.
- The model responds with an answer based on the PDF’s content.
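If you prefer the whole flow in a single runnable file, the pieces above combine into something like this sketch (the module and file names are assumptions based on the earlier snippets, so adjust them to your own layout):

# main.py — load the PDF, then ask a question about it
from dotenv import load_dotenv
from langchain_community.document_loaders import PyPDFLoader
from get_relevant_documents import get_answer_from_llm

load_dotenv()  # makes OPENAI_API_KEY available to the OpenAI clients

loader = PyPDFLoader("./data/sample-pdf.pdf")
documents = list(loader.lazy_load())

answer = get_answer_from_llm(
    documents=documents,
    question="Who are the new team members?",
)
print(answer)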
Example Output
For the sample PDF, running the script might return something like:
“The new team members are Jane Doe and John Smith.”
Closing Thoughts
Okay folks, that’s all for today.
Again, if you want the full code, check out my GitHub.
If you've read this far, thank you for your time. I hope you found it valuable.
If you want to stay connected, here are a few ways you can do so: follow me on Medium or check out my website.