Streaming Responses

In the previous tutorial, you learned how to manage conversations with AI models by storing message history. But you might have noticed something: when you ask a question, you have to wait for the complete response before seeing anything. What if the AI could start showing its answer immediately, word by word, like a human typing?

That's exactly what streaming responses do. Instead of waiting for the entire response, you get the AI's answer in real-time chunks, creating a much more interactive and engaging experience.

Why Streaming Matters

Let's break down the difference between regular and streaming responses:

Regular (Non-Streaming) Response:

  • You ask a question
  • You wait... and wait... (sometimes 10-30 seconds)
  • The complete answer appears all at once

Streaming Response:

  • You ask a question
  • Words start appearing immediately
  • You see the AI "thinking" and responding in real-time
  • Much better user experience

Streaming is especially important for longer responses. Instead of staring at a blank screen, users see progress immediately.

How Streaming Works

When you request a streaming response, the AI doesn't send one big message. Instead, it sends many small chunks of text as it generates them. Your code receives these chunks one by one and displays them as they arrive.

sequenceDiagram
    participant User
    participant App
    participant AI_API

    User->>App: Ask question
    App->>AI_API: Request streaming response
    AI_API-->>App: Chunk 1: "The"
    App-->>User: Display: "The"
    AI_API-->>App: Chunk 2: " answer"
    App-->>User: Display: "The answer"
    AI_API-->>App: Chunk 3: " is..."
    App-->>User: Display: "The answer is..."
    AI_API-->>App: [Stream complete]
    App-->>User: Final response ready

Understanding Async Iteration

Streaming responses use a JavaScript feature called async iteration. This lets you process data as it arrives, rather than waiting for everything at once.

Here's the basic pattern:

// This is how streaming data looks
for await (const chunk of streamingResponse) {
  // Process each chunk as it arrives
  console.log(chunk.text);
}

The for await loop waits for each chunk and processes it immediately. This is perfect for streaming responses.
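If async iteration is new to you, here's a small self-contained sketch you can run on its own, no API key needed. The fakeStream generator below is just a stand-in for a real streaming API:

async function* fakeStream(): AsyncGenerator<string> {
  const words = ["The", " answer", " is", " 42."];
  for (const word of words) {
    // Simulate network latency before each chunk
    await new Promise((resolve) => setTimeout(resolve, 200));
    yield word;
  }
}

async function demo() {
  // for await pulls each chunk as soon as the generator produces it
  for await (const chunk of fakeStream()) {
    process.stdout.write(chunk);
  }
  console.log(); // finish the line when the stream ends
}

demo();

Each word appears shortly after the previous one, which is exactly the pattern you'll use with real streaming responses below.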

Environment Setup

Make sure you have your setup from previous tutorials:

  • Google AI API key in your .env file
  • @google/genai package installed
  • Basic TypeScript project structure
  • Message management code from Tutorial 2.2
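If any of this is missing, you can install the packages with npm install @google/genai dotenv and create a .env file in your project root. The key name GEMINI_API_KEY is what the code in this tutorial reads, and the value below is a placeholder to replace with your own key:

GEMINI_API_KEY=your-api-key-here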

Working Code Example

Let's build a streaming chat system step by step, building on what you learned about conversation management.

Step 1: Import Required Modules

import { GoogleGenAI } from "@google/genai";
import * as dotenv from "dotenv";

dotenv.config();

We're using the same imports as before, but we'll use different methods for streaming.

Step 2: Set Up the AI Client

const genAI = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY! });

This creates our AI client, just like in previous tutorials.

Step 3: Create Message Management (From Tutorial 2.2)

interface Message {
  role: "system" | "user" | "assistant";
  content: string;
}

const messages: Message[] = [];

function addMessage(role: "system" | "user" | "assistant", content: string) {
  messages.push({ role, content });
}

We're reusing the message management system you learned in the previous tutorial.

Step 4: Format Messages for Streaming

function formatMessagesForStreaming(): string {
  return messages.map((msg) => `${msg.role}: ${msg.content}`).join("\n");
}

This formats our conversation history into a single prompt, just like before.
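To see what this produces, here's a quick sketch using the functions above. With one system message and one user message in the history, the prompt is just two labeled lines:

addMessage("system", "You are a helpful programming tutor.");
addMessage("user", "What is TypeScript?");

console.log(formatMessagesForStreaming());
// system: You are a helpful programming tutor.
// user: What is TypeScript?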

Step 5: Stream the Response and Display It Properly

function displayStreamingResponse(text: string) {
  // Write text without a newline (so it continues on same line)
  process.stdout.write(text || "");
}

async function streamingChatWithDisplay(userMessage: string): Promise<string> {
  addMessage("user", userMessage);
  const conversationPrompt = formatMessagesForStreaming();

  console.log(`\nYou: ${userMessage}`);
  console.log("AI: ");

  const result = await genAI.models.generateContentStream({
    model: "gemini-2.5-flash",
    contents: conversationPrompt,
  });
  let fullResponse = "";

  for await (const chunk of result) {
    const chunkText = chunk.text || "";
    fullResponse += chunkText;

    // Display each chunk immediately
    displayStreamingResponse(chunkText);
  }

  console.log("\n"); // Add newline when complete
  addMessage("assistant", fullResponse);

  return fullResponse;
}

Notice the key differences: we use generateContentStream() instead of generateContent(), and we process chunks with for await.

This version shows the user's message, then displays "AI: " and streams the response word by word.

Step 6: Test the Streaming Conversation

async function testStreamingConversation() {
  // Set up AI behavior
  addMessage(
    "system",
    "You are a helpful programming tutor. Keep answers practical and include examples."
  );

  console.log("Starting streaming conversation...\n");

  // First streaming message
  await streamingChatWithDisplay("What is TypeScript and why should I use it?");

  // Second streaming message - AI remembers context
  await streamingChatWithDisplay("Show me a simple example with types");

  // Third streaming message
  await streamingChatWithDisplay(
    "What are the main benefits over regular JavaScript?"
  );
}

testStreamingConversation();

Run this and you'll see the AI's responses appear word by word, while still maintaining conversation history.
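The filename is up to you. Assuming you saved the code as streaming.ts, you can run it with a TypeScript runner such as tsx or ts-node:

npx tsx streaming.ts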

Understanding the Streaming Process

Here's what happens step by step:

  1. User asks question → Add to message history
  2. Format conversation → Create prompt with full history
  3. Start stream → Call generateContentStream()
  4. Receive chunks → AI sends small pieces of text
  5. Display immediately → Show each chunk as it arrives
  6. Build full response → Combine all chunks
  7. Save to history → Add complete response to messages

The key insight is that you're building the full response piece by piece while displaying it in real-time.

Streaming vs Non-Streaming Comparison

Let's see both approaches side by side:

// Non-streaming (from Tutorial 2.2)
async function regularChat(userMessage: string): Promise<string> {
  addMessage("user", userMessage);
  const prompt = formatMessagesForStreaming();

  const response = await genAI.models.generateContent({
    model: "gemini-2.5-flash",
    contents: prompt,
  });
  const aiResponse = response.text || "";

  addMessage("assistant", aiResponse);
  return aiResponse;
}

// Streaming (new approach)
async function streamingChat(userMessage: string): Promise<string> {
  addMessage("user", userMessage);
  const prompt = formatMessagesForStreaming();

  const result = await genAI.models.generateContentStream({
    model: "gemini-2.5-flash",
    contents: prompt,
  });
  let fullResponse = "";

  for await (const chunk of result) {
    const chunkText = chunk.text || "";
    fullResponse += chunkText;
    process.stdout.write(chunkText);
  }

  addMessage("assistant", fullResponse);
  return fullResponse;
}

The main differences:

  • generateContent() vs generateContentStream()
  • Single response vs chunk processing
  • Immediate display vs waiting for completion

When to Use Streaming

Use streaming when:

  • Building interactive chat applications
  • Responses might be long (more than a few sentences)
  • User experience is important
  • You want to show progress to users

Use regular responses when:

  • Building simple scripts or automation
  • Processing responses programmatically
  • Response length is predictable and short
  • Real-time display isn't needed

Summary

Streaming responses transform the user experience by showing the AI's answer in real time instead of making users wait for the complete response. The key concepts: use generateContentStream() instead of generateContent(), process chunks with a for await loop, display each chunk immediately while accumulating the full response, and save that complete response to the conversation history just as you would with a regular reply. Streaming is essential for interactive applications where user experience matters, and it prepares you for building real-time chat interfaces.

Complete Code

You can find the complete, runnable code for this tutorial on GitHub: https://github.com/avestalabs/academy/tree/main/2-core-llm-interactions/streaming-responses
