Ask Questions to Your CSV Files (LangChain + OpenAI)
Building an LLM chat for your CSV files.
While playing around with LangChain, I found a surprisingly fun use case — a tool to ask questions about any CSV file.
I tried it on a couple of different CSV files, and it works great for most use cases!
Here’s a step-by-step guide to how it works.
Example Rows from the CSV
Here are a few example rows from the CSV file to give you an idea of the data.
Example Queries
Here are a few sample questions and how the bot might respond:
Query 1
Question: “What industry is Holder-Sellers in?”
Answer: “Holder-Sellers is in the Automotive industry.”
Query 2
Question: “How many employees does Carr Inc have?”
Answer: “Carr Inc has 8,167 employees.”
Query 3
Question: “What’s the purpose of Kidd Group?”
Answer: “Kidd Group focuses on proactive foreground paradigms.”
The Idea
The bot takes a simple approach:
- Load the content of a CSV file.
- Store the content in a vector store for efficient searching.
- Retrieve the most relevant rows based on a question.
- Use OpenAI’s language model to generate an answer.
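Before wiring in LangChain and OpenAI, the four steps above can be sketched in plain Python with toy stand-ins. Everything here (`load_rows`, `score`, `most_relevant`, word-overlap scoring) is hypothetical illustration, not the real library API — the real version replaces word overlap with embedding similarity and the final lookup with an LLM call.

```python
# Toy sketch of the four-step flow; names and scoring are illustrative only.

def load_rows(text):
    """Step 1: 'load' the CSV -- one dict per row."""
    lines = text.strip().splitlines()
    header = lines[0].split(",")
    return [dict(zip(header, row.split(","))) for row in lines[1:]]

def score(row, question):
    """Steps 2-3: stand-in for embedding similarity -- count shared words."""
    words = set(question.lower().replace("?", "").split())
    return sum(1 for v in row.values() if v.lower() in words)

def most_relevant(rows, question, k=1):
    """Step 3: keep the k best-matching rows."""
    return sorted(rows, key=lambda r: score(r, question), reverse=True)[:k]

csv_text = """name,industry
Holder-Sellers,Automotive
Carr Inc,Software"""

rows = load_rows(csv_text)
top = most_relevant(rows, "What industry is Holder-Sellers in?")
print(top[0]["industry"])  # Step 4 would hand this row to the LLM; prints: Automotive
```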
Let’s dive in!
Step 1: Load CSV Data
We start by loading data from a CSV file. LangChain’s CSVLoader
makes this simple and efficient.
Environment Setup
from langchain_community.document_loaders.csv_loader import CSVLoader
from get_relevant_documents import get_answer_from_llm
csv_file_path = "./data/organizations-100.csv"
CSVLoader: A tool to read and parse CSV files into structured data.
csv_file_path: The path to the CSV file you want to query.
Loading CSV Content
loader = CSVLoader(file_path=csv_file_path)
documents = loader.load()
loader.load(): Reads the entire CSV file and converts each row into a Document object, making it compatible with LangChain’s tools.
documents: Stores the loaded Document objects for further processing.
Once the CSV data is loaded, it’s ready to be queried.
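To make the loading step concrete: each row becomes a small text block of "column: value" lines, which is what later gets embedded and searched. The snippet below imitates that serialization with the standard library — a sketch of the idea, not the actual CSVLoader class.

```python
# Roughly how a CSV row is flattened into a Document's page_content:
# one "column: value" line per field (stdlib sketch, not the real loader).
import csv
import io

sample = "Name,Industry,Employees\nCarr Inc,Software,8167\n"
reader = csv.DictReader(io.StringIO(sample))
page_contents = ["\n".join(f"{k}: {v}" for k, v in row.items()) for row in reader]

print(page_contents[0])
# Name: Carr Inc
# Industry: Software
# Employees: 8167
```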
Step 2: Storing and Retrieving Relevant Rows
To answer questions efficiently, we need to store the loaded CSV data in a vector store and retrieve the most relevant rows.
Setting Up OpenAI and Vector Store
from dotenv import load_dotenv
import os
load_dotenv()
import openai
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import OpenAIEmbeddings
openai.api_key = os.getenv("OPENAI_API_KEY")
InMemoryVectorStore: Temporarily stores the document embeddings for searching.
OpenAIEmbeddings: Converts the rows of the CSV into embeddings that can be compared for similarity.
Retrieving Relevant Rows
def get_k_relevant_documents(documents, question, k=3):
    print(f"Storing {len(documents)} documents into the vector store.")
    vector_store = InMemoryVectorStore.from_documents(documents, OpenAIEmbeddings())
    print("Getting relevant documents from in-memory vector store.")
    relevant_docs = vector_store.similarity_search(question, k=k)
    print(f"Retrieved similar documents: {len(relevant_docs)}")
    return relevant_docs
- Finds the k most relevant rows from the CSV file based on the question.
- Uses cosine similarity between the embeddings of the question and the document rows to find matches.
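The cosine-similarity comparison the vector store performs can be shown in a few lines of pure Python. The toy 3-dimensional vectors below are made up for illustration — real OpenAI embeddings have over a thousand dimensions — but the ranking logic is the same.

```python
# Cosine similarity, as used to rank rows against the question
# (pure-Python sketch with made-up 3-D vectors).
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

question_vec = [0.9, 0.1, 0.0]
row_vecs = {
    "automotive row": [0.8, 0.2, 0.1],  # points in a similar direction
    "finance row": [0.1, 0.9, 0.3],     # points elsewhere
}

# Rank rows by similarity to the question, most similar first.
ranked = sorted(row_vecs, key=lambda name: cosine_similarity(question_vec, row_vecs[name]), reverse=True)
print(ranked[0])  # automotive row
```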
Step 3: Generating Answers
With the relevant rows in hand, we use OpenAI’s language model to generate an answer.
Querying the Language Model
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage
from langchain_core.output_parsers import StrOutputParser

def get_answer_from_llm(documents, question):
    print(f"Question: {question}")
    relevant_docs = get_k_relevant_documents(documents, question)
    model = ChatOpenAI(model="gpt-4o-mini")
    context_from_docs = "\n\n".join(doc.page_content for doc in relevant_docs)
    messages = [
        SystemMessage(
            content=f"Use the following context to answer my question: {context_from_docs}"
        ),
        HumanMessage(content=question),
    ]
    parser = StrOutputParser()
    chain = model | parser
    return chain.invoke(messages)
- Combines the most relevant rows into a single context string for the model.
- Sends that context and the question to OpenAI’s language model to generate an answer.
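It can help to see exactly what the model receives before the API call is made. The snippet below rebuilds just the prompt-assembly part with a stand-in `FakeDoc` class (hypothetical, used here only so the example runs without LangChain or an API key).

```python
# What the system prompt looks like once the retrieved rows are joined.
# FakeDoc is a stand-in for LangChain's Document; no API call happens here.
class FakeDoc:
    def __init__(self, page_content):
        self.page_content = page_content

relevant_docs = [
    FakeDoc("Name: Kidd Group\nIndustry: Consulting"),
    FakeDoc("Name: Carr Inc\nIndustry: Software"),
]

# Same join as in get_answer_from_llm: rows separated by a blank line.
context_from_docs = "\n\n".join(doc.page_content for doc in relevant_docs)
system_prompt = f"Use the following context to answer my question: {context_from_docs}"

print(system_prompt)
```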
Step 4: Putting It All Together
Here’s how everything ties together:
answer = get_answer_from_llm(
    documents=documents,
    question="What's the purpose of Hicks LLC as an organization?",
)
print(answer)
- Finds relevant rows from the CSV and generates an answer.
- The model provides a natural language answer based on the CSV content.
Closing Thoughts
Okay folks, that’s all for today.
Again, if you want the full code, check out my GitHub.
If you’ve read this far, thank you for your time. I hope you found it valuable.
If you want to stay connected, here are a few ways you can do so: follow me on Medium or check out my website.