Understanding LLM API Parameters - Control Your AI Responses
You've made some LLM calls and noticed that the AI's responses can vary quite a bit. Sometimes they're creative and expansive, other times they're concise and focused. The secret to controlling this behavior lies in understanding the parameters you can pass to LLM APIs.
Let's break down the most important parameters that work across virtually all LLM providers - from Google's Gemini to OpenAI to local models. Think of these as the "control knobs" for your AI interactions.
The Big Four: Essential Parameters Every Developer Should Know
1. Temperature - The Creativity Dial
Think of temperature like the tuning dial on a radio. At low settings, you get clear, consistent reception (predictable responses). At high settings, you pick up static and surprising variation (creative, unpredictable responses).
How it works: When an AI generates text, it doesn't just pick the most likely next word. Instead, it considers many possible words and their probabilities. Temperature controls how much the AI "explores" these alternatives.
- Range: typically 0.0 to 2.0, though some providers cap it at 1.0
- Low values (0.0-0.3): The AI picks the most likely words almost every time
  - Result: Deterministic, focused, consistent responses
  - Like asking a very serious, methodical person the same question - you'll get nearly identical answers
- Medium values (0.4-0.7): The AI sometimes picks less likely but still reasonable words
  - Result: Balanced creativity and consistency
  - Like asking a thoughtful person - they'll give similar but not identical answers
- High values (0.8-1.0+): The AI often picks surprising, less likely words
  - Result: Creative, varied, sometimes unpredictable responses
  - Like asking a very creative, spontaneous person - you'll get wildly different answers each time
Real-world examples:
- Code generation (use 0.1-0.3): You want the AI to write `for (let i = 0; i < array.length; i++)`, not `for (let unicorn = 0; unicorn < mysticalArray.size; unicorn++)`
- Creative writing (use 0.7-0.9): You want varied, interesting prose, not the same boring sentences
- Data extraction (use 0.0-0.2): You want consistent JSON like `{"name": "John"}`, not sometimes `{"person": "John"}` or `{name: John}`
- Brainstorming (use 0.8-1.0): You want diverse, unexpected ideas
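To make the mechanics above concrete, here's a minimal sketch in plain JavaScript (not tied to any provider's API) of how temperature rescales a probability distribution before a token is sampled. The token list and probabilities are made up for illustration; real models work over logits for tens of thousands of tokens.

```javascript
// Illustrative only: how temperature reshapes next-token probabilities.
function applyTemperature(probs, temperature) {
  // Convert to log-space, scale by 1/temperature, then re-normalize (softmax).
  const scaled = probs.map((p) => Math.log(p) / temperature);
  const maxVal = Math.max(...scaled);
  const exps = scaled.map((x) => Math.exp(x - maxVal)); // subtract max for numerical stability
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

const probs = [0.5, 0.3, 0.15, 0.05]; // e.g. "the", "a", "purple", "quantum"

console.log(applyTemperature(probs, 0.2)); // sharpened: the top token dominates
console.log(applyTemperature(probs, 1.0)); // unchanged distribution
console.log(applyTemperature(probs, 1.8)); // flattened: unlikely tokens gain probability
```

At temperature 1.0 the distribution comes back unchanged; below 1.0 the most likely tokens soak up even more of the probability, and above 1.0 the tail gets a real chance of being picked.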
2. Max Output Tokens - The Response Length Limiter
Imagine you're giving someone a word limit for an essay. That's exactly what max output tokens does - it sets a hard limit on how long the AI's response can be.
Why this matters:
- Too low: "The best way to learn JavaScript is to practice reg..." (cut off mid-sentence)
- Too high: You pay for tokens you don't need, and responses might be unnecessarily long
- Just right: Complete, concise responses that fit your needs
Common use cases:
- Short answers: 50-150 tokens
- Explanations: 200-500 tokens
- Code with explanations: 500-1000 tokens
- Long-form content: 1000+ tokens
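If you do set a tight limit, it's worth checking whether the response was actually cut off. The sketch below uses the same Gemini SDK as the working example later in this article; it assumes the response exposes a `finishReason` on its first candidate (as the Gemini API documents), so double-check the field name against your SDK version.

```javascript
import { GoogleGenAI } from "@google/genai";

// Assumes GEMINI_API_KEY is set in the environment.
const genAI = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

const response = await genAI.models.generateContent({
  model: "gemini-2.5-flash",
  contents: "Summarize the history of JavaScript.",
  config: { maxOutputTokens: 100 },
});

// If generation stopped because of the token cap, the text is likely truncated.
if (response.candidates?.[0]?.finishReason === "MAX_TOKENS") {
  console.warn("Response hit the token limit - consider raising maxOutputTokens.");
}
console.log(response.text);
```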
3. Top P (Nucleus Sampling) - The Alternative to Temperature
If temperature is like a creativity dial, top P is like a "focus filter." Instead of making the AI more or less creative overall, it controls which words the AI is even allowed to consider.
How it works: Imagine the AI has a list of possible next words, ranked by probability:
- "the" (30% likely)
- "a" (25% likely)
- "this" (20% likely)
- "some" (15% likely)
- "purple" (5% likely)
- "banana" (3% likely)
- "quantum" (2% likely)
With top P = 0.9 (90%): The AI only considers words that make up the top 90% of probability. So it would consider "the," "a," "this," and "some" (which add up to 90%) but ignore "purple," "banana," and "quantum."
With top P = 0.5 (50%): The AI only considers "the" and "a" (which add up to 55%, just over 50%).
Practical ranges:
- 0.1-0.5: Very focused, only considers the most likely words
- 0.8-0.95: Balanced, considers most reasonable options (most common setting)
- 0.95-1.0: Considers almost all possibilities, including unlikely ones
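Here's a small illustration, again in plain JavaScript with made-up probabilities, of how nucleus sampling trims the candidate list: tokens are sorted by probability and kept until their cumulative probability reaches topP.

```javascript
// Illustrative only: keep the smallest set of tokens whose cumulative probability >= topP.
function nucleusFilter(tokenProbs, topP) {
  const sorted = Object.entries(tokenProbs).sort((a, b) => b[1] - a[1]);
  const kept = [];
  let cumulative = 0;
  for (const [token, prob] of sorted) {
    kept.push(token);
    cumulative += prob;
    if (cumulative >= topP) break; // the "nucleus" now covers topP of the probability mass
  }
  return kept;
}

const candidates = {
  the: 0.3, a: 0.25, this: 0.2, some: 0.15, purple: 0.05, banana: 0.03, quantum: 0.02,
};

console.log(nucleusFilter(candidates, 0.9)); // ["the", "a", "this", "some"]
console.log(nucleusFilter(candidates, 0.5)); // ["the", "a"]
```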
Pro tip: As a starting point, adjust either temperature or top_p, not both at once. They both control randomness in different ways, and tuning both simultaneously makes results harder to reason about.
4. Stop Sequences - The Response Terminators
Stop sequences are like saying "stop talking when you reach this point." They're incredibly useful for controlling exactly where the AI stops generating text.
How they work: You give the AI a list of text patterns. As soon as the AI generates any of these patterns, it immediately stops, even if it hasn't reached the max token limit.
Common examples:
["\n\n"]
: Stop at double line breaks (end of paragraphs)["###"]
: Stop at three hashtags (common markdown separator)["Human:", "User:"]
: Stop when the AI tries to simulate a human response["```"]
: Stop at code block markers["\n4.", "4."]
: Stop before generating a 4th item in a list
Real-world scenarios:
Without stop sequences:
User: "List 3 benefits of exercise:"
AI: "1. Improves cardiovascular health
2. Builds muscle strength
3. Enhances mental well-being
4. Increases flexibility
5. Boosts immune system
6. Improves sleep quality..."
With stop sequences `["\n4.", "4."]`:
User: "List 3 benefits of exercise:"
AI: "1. Improves cardiovascular health
2. Builds muscle strength
3. Enhances mental well-being"
Perfect for:
- Controlling list lengths
- Preventing the AI from continuing conversations
- Stopping at specific formatting markers
- Creating clean, predictable outputs
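Providers apply stop sequences on the server, but the behavior is easy to picture as cutting the output at the first occurrence of any stop string. Here's a tiny sketch of that idea (not how any particular API implements it internally):

```javascript
// Illustrative only: truncate text at the earliest occurrence of any stop sequence.
function applyStopSequences(text, stopSequences) {
  let cutAt = text.length;
  for (const stop of stopSequences) {
    const index = text.indexOf(stop);
    if (index !== -1 && index < cutAt) cutAt = index;
  }
  return text.slice(0, cutAt);
}

const raw =
  "1. Improves cardiovascular health\n2. Builds muscle strength\n3. Enhances mental well-being\n4. Increases flexibility";

console.log(applyStopSequences(raw, ["\n4.", "4."]));
// -> ends after "3. Enhances mental well-being"
```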
Working Code Example
We'll use the same setup as before. Let's see these parameters in action with practical examples, starting with the client setup:
import { GoogleGenAI } from "@google/genai";
import dotenv from "dotenv";

// Load environment variables from .env
dotenv.config();
const apiKey = process.env.GEMINI_API_KEY;
if (!apiKey) {
console.error("GEMINI_API_KEY not found in environment variables");
process.exit(1);
}
const genAI = new GoogleGenAI({ apiKey });
This sets up our Google AI client using the API key from our environment variables.
async function demonstrateTemperature() {
const prompt = "what is javascript in one sentence?";
// Low temperature - consistent, focused
const response = await genAI.models.generateContent({
model: "gemini-2.5-flash",
contents: prompt,
config: {
temperature: 0.1
}
});
console.log("Conservative (temp 0.1):", response.text);
}
This function shows how low temperature produces more predictable, focused responses. The AI will give similar outputs each time you run this.
async function demonstrateCreativeTemperature() {
const prompt = "Write a creative opening line for a story about a robot chef.";
// High temperature - creative, varied
const response = await genAI.models.generateContent({
model: "gemini-2.5-flash",
contents: prompt,
config: { temperature: 0.9 }
});
console.log("Creative (temp 0.9):", response.text);
}
With high temperature, you'll get much more varied and creative responses. Run this multiple times and you'll see different outputs each time.
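To see the contrast directly, you can run the same prompt several times at a low and a high temperature and compare the outputs side by side. This helper is just a sketch that reuses the `genAI` client from the setup above:

```javascript
// Sketch: run the same prompt at two temperatures to compare variability.
async function compareTemperatures() {
  const prompt = "Write a creative opening line for a story about a robot chef.";
  for (const temperature of [0.1, 0.9]) {
    for (let run = 1; run <= 3; run++) {
      const response = await genAI.models.generateContent({
        model: "gemini-2.5-flash",
        contents: prompt,
        config: { temperature },
      });
      console.log(`temp ${temperature}, run ${run}:`, response.text);
    }
  }
}
```

At temperature 0.1 the three runs should look nearly identical; at 0.9 they usually diverge noticeably.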
async function demonstrateTokenLimits() {
const prompt = "Explain how photosynthesis works in plants in 2 lines.";
// Short response
const response = await genAI.models.generateContent({
model: "gemini-2.5-flash",
contents: prompt,
config: { maxOutputTokens: 600, temperature: 0.3 }
});
console.log("Limited to 600 tokens:", response.text);
}
This demonstrates how `maxOutputTokens` caps response length. A 600-token budget is enough for the brief explanation requested here; set the limit much lower and the response may get cut off mid-sentence.
async function demonstrateStopSequences() {
const prompt = "List four benefits of exercise:\n1.";
const response = await genAI.models.generateContent({
model: "gemini-2.5-flash",
contents: prompt,
config: { temperature: 0.3, stopSequences: ["\n4.", "4."] }
});
console.log("Stopped at item 4:", response.text);
}
The stop sequences ensure the AI stops generating after listing three items, preventing it from continuing to item 4.
async function demonstrateTopP() {
const prompt = "Generate two business ideas for a food truck.";
const response = await genAI.models.generateContent({
model: "gemini-2.5-flash",
contents: prompt,
config: {
topP: 0.8
}
});
console.log("Using topP (0.8):", response.text);
}
This shows how topP controls the diversity of token selection, offering an alternative way to manage creativity.
// Run all demonstrations
async function runAllExamples() {
await demonstrateTemperature();
await demonstrateCreativeTemperature();
await demonstrateTokenLimits();
await demonstrateStopSequences();
await demonstrateTopP();
}
runAllExamples().catch(console.error);
This function runs all our examples so you can see the different parameters in action.
Parameter Combinations That Work Well
Here are some tried-and-tested parameter combinations for common use cases:
Code Generation:
- Temperature: 0.1-0.2
- Max output tokens: 500-1000
- Stop sequences:
["```", "\n\n---"]
Creative Writing:
- Temperature: 0.7-0.8
- Max output tokens: 800-1500
- Top P: 0.9
Data Extraction:
- Temperature: 0.0-0.1
- Max output tokens: 200-500
- Stop sequences: Based on your expected format
Conversational AI:
- Temperature: 0.6-0.7
- Max output tokens: 300-600
- Top P: 0.95
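A convenient way to use these combinations is to keep them as named presets and spread the one you need into each request's config. This is just a sketch using the values above and the `genAI` client from the working example:

```javascript
// Sketch: reusable parameter presets based on the combinations above.
const PRESETS = {
  codeGeneration: { temperature: 0.2, maxOutputTokens: 1000, stopSequences: ["```", "\n\n---"] },
  creativeWriting: { temperature: 0.8, maxOutputTokens: 1500, topP: 0.9 },
  dataExtraction: { temperature: 0.0, maxOutputTokens: 500 },
  conversational: { temperature: 0.7, maxOutputTokens: 600, topP: 0.95 },
};

const response = await genAI.models.generateContent({
  model: "gemini-2.5-flash",
  contents: "Extract the name and email as JSON from: 'Contact John Doe at john@example.com'",
  config: { ...PRESETS.dataExtraction },
});
console.log(response.text);
```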
Cross-Provider Compatibility
The beauty of understanding these core parameters is that they work across different LLM providers with minor variations:
- Google (Gemini): Uses `temperature`, `maxOutputTokens`, `topP`, `stopSequences`
- OpenAI: Uses `temperature`, `max_tokens`, `top_p`, `stop`
- Anthropic (Claude): Uses `temperature` and `max_tokens`, and calls stop sequences `stop_sequences`
- Local models (Ollama, etc.): Usually support the same core parameters
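As a rough illustration, here's the same generic configuration translated into each provider's parameter names. The names below are the commonly documented ones at the time of writing; verify them against each SDK's docs before relying on them.

```javascript
// Sketch: one generic config mapped to provider-specific parameter names.
const generic = { temperature: 0.3, maxTokens: 500, topP: 0.9, stop: ["\n\n"] };

// Google Gemini (@google/genai) - camelCase
const geminiConfig = {
  temperature: generic.temperature,
  maxOutputTokens: generic.maxTokens,
  topP: generic.topP,
  stopSequences: generic.stop,
};

// OpenAI - snake_case
const openaiConfig = {
  temperature: generic.temperature,
  max_tokens: generic.maxTokens,
  top_p: generic.topP,
  stop: generic.stop,
};

// Anthropic (Claude) - snake_case, with stop_sequences
const anthropicConfig = {
  temperature: generic.temperature,
  max_tokens: generic.maxTokens,
  top_p: generic.topP,
  stop_sequences: generic.stop,
};
```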
Summary
Understanding LLM API parameters is crucial for building reliable AI applications. The four essential parameters - temperature, maxOutputTokens, topP, and stopSequences - give you precise control over AI behavior. Temperature controls creativity (0.0 for deterministic, higher values for creative output), maxOutputTokens limits response length, topP offers an alternative way to rein in randomness, and stop sequences provide clean termination points. These parameters work consistently across different LLM providers, making your knowledge transferable as you explore different AI services.
Complete Code
You can find the complete, runnable code for this tutorial on GitHub: