
Build Asynchronous LLM APIs with Kafka & Redis

Decoupled asynchronous architecture to scale your LLM Applications.

4 min read · Sep 19, 2025



In today’s LLM applications, a very common API design pattern is to accept a user prompt, kick off background processing with Kafka, and send back a polling URL in the response so the client can periodically check whether processing is complete.

In this blog post, I will show you how to build such a pipeline using Kafka, Redis, and the OpenAI API.

Kafka Schemas

First, we design a Kafka schema for our API endpoint to publish to.

{
"conversation_id": "abc123",
"prompt": "What is your favorite food?"
}

It’s a very simple schema for a topic called async_user_prompt.

When a backend server receives a user prompt for the first time, it assigns a conversation_id, then writes a message with the user prompt to the async_user_prompt Kafka topic.
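As a hedged sketch (assuming Python and the kafka-python client; the broker address is a placeholder), publishing such a message might look like this:

# Hedged sketch: publish a user prompt to the async_user_prompt topic.
# Assumes the kafka-python client and a local broker; adjust for your setup.
import json
import uuid

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

conversation_id = uuid.uuid4().hex  # assigned on first contact
producer.send(
    "async_user_prompt",
    {"conversation_id": conversation_id, "prompt": "What is your favorite food?"},
)
producer.flush()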

The second Kafka topic we create is called async_llm_response.

Once the processing is complete, the system writes the LLM response to this topic, using the following schema:

{
"conversation_id": "abc123",
"user_prompt": "What is your favorite food?",
"llm_response": "Pizza."
}

These are the only two Kafka topics we need to build our system.
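As a hedged aside, the two topics could be created programmatically, for example with kafka-python’s admin client (partition counts and replication factor are placeholders):

# Hedged sketch: create the two topics with kafka-python's admin client.
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
admin.create_topics([
    NewTopic(name="async_user_prompt", num_partitions=3, replication_factor=1),
    NewTopic(name="async_llm_response", num_partitions=3, replication_factor=1),
])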

Backend Server

Now, let’s see how to use these two Kafka topics to design the system we want.


When the backend receives a user prompt through a GET request to the /answer route, it immediately assigns the message a conversation_id and writes it to the async_user_prompt Kafka topic.

We also have a Kafka consumer instance configured to listen for any new messages arriving on that same topic.

Once the server successfully publishes to Kafka, it returns an HTTP 202 Accepted response to the client with minimal latency.
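Here is a minimal sketch of that route, assuming FastAPI on top of the same kafka-python producer as above (the route name and fields follow this post; everything else is illustrative):

# Hedged sketch of the /answer route: assign a conversation_id, publish the
# prompt to Kafka, and return HTTP 202 immediately. Assumes FastAPI and kafka-python.
import json
import uuid

from fastapi import FastAPI
from kafka import KafkaProducer

app = FastAPI()

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

@app.get("/answer", status_code=202)
def answer(prompt: str):
    # First contact: assign a conversation_id and hand the prompt off to Kafka.
    conversation_id = uuid.uuid4().hex
    producer.send(
        "async_user_prompt",
        {"conversation_id": conversation_id, "prompt": prompt},
    )
    producer.flush()
    # Respond right away; the heavy lifting happens in the background.
    return {"status": "accepted", "conversation_id": conversation_id}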

Optionally, the server can return a polling_url that the client can use to check progress, and even receive the LLM response when ready.

Something like this:

GET /polling?conversationId=abc123
------------------------------------
Response:

HTTP 200
{
"status": "complete",
"conversation_id": "abc123",
"user_prompt": "What is your favorite food?",
"llm_response": "Pizza."
}

Or, for a pending task:

GET /polling?conversationId=abc123
------------------------------------
Response:

HTTP 200
{
"status": "pending",
"conversation_id": "abc123",
"user_prompt": "What is your favorite food?",
"llm_response": ""
}
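A matching sketch of the /polling route, assuming the consumer described in the next section stores completed conversations in Redis as JSON keyed by conversation_id (in practice this route would live on the same app as /answer):

# Hedged sketch of the /polling route. Assumes redis-py and that completed
# conversations are cached in Redis keyed by conversation_id (see the consumer below).
import json

import redis
from fastapi import FastAPI

app = FastAPI()  # in a real service, reuse the same app as the /answer route
cache = redis.Redis(host="localhost", port=6379, decode_responses=True)

@app.get("/polling")
def polling(conversationId: str):
    stored = cache.get(conversationId)
    if stored is None:
        # The background worker has not written a response yet.
        # (To echo the original prompt here, /answer could cache it at submit time.)
        return {
            "status": "pending",
            "conversation_id": conversationId,
            "user_prompt": "",
            "llm_response": "",
        }
    conversation = json.loads(stored)
    return {
        "status": "complete",
        "conversation_id": conversationId,
        "user_prompt": conversation["user_prompt"],
        "llm_response": conversation["llm_response"],
    }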

At this point, your backend server can forward a user prompt to Kafka for background processing and return a polling URL that the client can use to periodically check the status.

Kafka Consumer & Interested Parties

Next, we will look at what the Kafka consumer does and how this architecture can support a large number of use cases.

Let’s zoom into the second part of the process.


Once the Kafka consumer receives the prompt, it immediately sends it to OpenAI for processing. When processing is complete, it stores the conversation in a Redis database and writes the original prompt and the response back to Kafka, albeit to a different topic: async_llm_response.
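A hedged sketch of that worker loop, assuming kafka-python, redis-py, and the v1-style openai client (the model name and connection details are placeholders):

# Hedged sketch of the background worker: consume prompts from async_user_prompt,
# call OpenAI, cache the result in Redis, and publish to async_llm_response.
import json

import redis
from kafka import KafkaConsumer, KafkaProducer
from openai import OpenAI

consumer = KafkaConsumer(
    "async_user_prompt",
    bootstrap_servers="localhost:9092",
    group_id="llm-workers",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
cache = redis.Redis(host="localhost", port=6379, decode_responses=True)
client = OpenAI()  # reads OPENAI_API_KEY from the environment

for message in consumer:
    event = message.value
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": event["prompt"]}],
    )
    result = {
        "conversation_id": event["conversation_id"],
        "user_prompt": event["prompt"],
        "llm_response": completion.choices[0].message.content,
    }
    # Store the conversation for the polling endpoint, then fan it out on Kafka.
    cache.set(event["conversation_id"], json.dumps(result))
    producer.send("async_llm_response", result)
    producer.flush()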

Now, the async_llm_response topic has decoupled the API server from all interested parties who care about LLM responses across the company.

These parties (imagine other Kafka consumers) can just listen for messages and process OpenAI’s response in whatever way they like.

The producer does not care. The original server does not care.

This is a very common use case for message buses like Kafka: they decouple producers and consumers using schemas. As long as an interested party knows the async_llm_response schema, it can process the message.

So, you can imagine the following interested parties:

  1. A consumer that updates the relevant databases where the client is polling for a response
  2. A consumer that needs to do further processing on the LLM response, such as generating an AI photo or some other post-processing
  3. A consumer that checks the LLM’s response for profanity or other quality issues (sketched below)
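As a sketch of one such interested party, here is a hypothetical profanity checker; the word list and the flagging action are purely illustrative. All it needs to know is the topic name and the schema:

# Hedged sketch of an independent interested party: a consumer that flags
# LLM responses containing banned words. Word list and handling are illustrative.
import json

from kafka import KafkaConsumer

BANNED_WORDS = {"badword1", "badword2"}  # placeholder list

consumer = KafkaConsumer(
    "async_llm_response",
    bootstrap_servers="localhost:9092",
    group_id="profanity-checkers",  # its own consumer group, so it sees every message
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    event = message.value
    if any(word in event["llm_response"].lower() for word in BANNED_WORDS):
        # In a real system this might alert a moderation queue or redact the text.
        print(f"Flagged conversation {event['conversation_id']}")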

Closing Thoughts

With this architecture, not only do you minimize latency on your backend servers, but you also allow other stakeholders within your company to hook into the system you built, in a very decoupled way.

If you are still reading, I hope you found this valuable and worth your time.

For similar content, check out my YouTube channel or follow me here on Medium.

If you would like to get a copy of the illustration or extra notes, join my newsletter and I will email you a copy.


Written by Irtiza Hafiz

Engineering manager who writes about software development and productivity https://irtizahafiz.com
