Understanding Large Language Models

Large Language Models (LLMs) have revolutionized natural language processing and artificial intelligence. This lesson explores their architecture, capabilities, limitations, and practical applications in modern AI systems.

What are Large Language Models?

Large Language Models are neural networks trained on vast amounts of text data to understand and generate human-like language. They can perform a wide variety of language tasks without task-specific training.

Key Characteristics

  1. Scale: Billions or trillions of parameters
  2. Generality: Can perform many language tasks
  3. Few-shot Learning: Learn new tasks from a handful of examples given in the prompt
  4. Emergent Abilities: Capabilities that appear only once models reach sufficient scale

The Evolution of Language Models

Traditional Approaches (Pre-2017)

  • Rule-based systems
  • Statistical models (N-grams; see the bigram sketch after this list)
  • Early neural networks (RNNs, LSTMs)
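
To make the contrast with modern LLMs concrete, here is a minimal bigram (N = 2) language model sketch; the toy corpus and add-one smoothing are illustrative choices:

from collections import Counter, defaultdict

def train_bigram_model(corpus):
    """Count bigram frequencies over a whitespace-tokenized corpus."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, curr in zip(tokens, tokens[1:]):
            counts[prev][curr] += 1
    return counts

def bigram_prob(counts, prev, curr):
    """P(curr | prev) with add-one (Laplace) smoothing."""
    vocab = {w for counter in counts.values() for w in counter}
    total = sum(counts[prev].values())
    return (counts[prev][curr] + 1) / (total + len(vocab))

# Toy corpus (illustrative)
corpus = ["the cat sat", "the cat ran", "the dog sat"]
counts = train_bigram_model(corpus)
print(bigram_prob(counts, "the", "cat"))  # P(cat | the)

Unlike an LLM, this model sees only one word of context and cannot generalize beyond exact counts.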

Transformer Era (2017-2019)

  • Attention mechanism
  • BERT (Bidirectional Encoder Representations from Transformers)
  • GPT-1 (Generative Pre-trained Transformer)

Large-Scale Era (2019-Present)

  • GPT-2, GPT-3, GPT-4
  • PaLM, LaMDA, Claude
  • Specialized and instruction-tuned models (Codex, ChatGPT)

LLM Architecture Deep Dive

Transformer Architecture

LLMs are built on the transformer architecture. Its core operation is scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, computed in parallel across multiple heads:

import torch
import torch.nn as nn
import math

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)

        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)

        attention_weights = torch.softmax(scores, dim=-1)
        output = torch.matmul(attention_weights, V)

        return output, attention_weights

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)

        # Linear transformations
        Q = self.W_q(query).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(key).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(value).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

        # Apply attention
        attention_output, attention_weights = self.scaled_dot_product_attention(Q, K, V, mask)

        # Concatenate heads
        attention_output = attention_output.transpose(1, 2).contiguous().view(
            batch_size, -1, self.d_model
        )

        # Final linear transformation
        output = self.W_o(attention_output)

        return output, attention_weights

class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model)
        )

        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Self-attention with residual connection
        attn_output, _ = self.attention(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))

        # Feed-forward with residual connection
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))

        return x

Scaling Laws

Model loss falls predictably as a power law in both parameter count and dataset size (Kaplan et al., 2020). A simplified sketch, treating loss as limited by whichever resource is the bottleneck:

def compute_scaling_law(N, D):
    """
    Estimate expected loss from simplified power-law scaling
    (constants from Kaplan et al., 2020)
    N: number of parameters
    D: dataset size in tokens
    """
    alpha_N = 0.076  # parameter scaling exponent
    alpha_D = 0.095  # data scaling exponent
    N_c = 8.8e13     # parameter scale constant
    D_c = 5.4e13     # data scale constant

    L_N = (N / N_c) ** (-alpha_N)  # loss if parameter-limited
    L_D = (D / D_c) ** (-alpha_D)  # loss if data-limited

    # The binding constraint (the larger of the two losses) dominates
    return max(L_N, L_D)

# Example: Compare different model sizes at ~300B training tokens
models = [
    ("GPT-3 Small", 125e6, 300e9),
    ("GPT-3 Medium", 1.3e9, 300e9),
    ("GPT-3 Large", 6.7e9, 300e9),
    ("GPT-3", 175e9, 300e9)
]

for name, params, data in models:
    loss = compute_scaling_law(params, data)
    print(f"{name}: {params/1e9:.1f}B params, Expected loss: {loss:.4f}")

Training Process

Pre-training

LLMs are trained in two main phases: large-scale pre-training on raw text with a next-token prediction objective, followed by fine-tuning. A simplified pre-training loop:

class LLMTrainer:
    def __init__(self, model, tokenizer, config):
        self.model = model
        self.tokenizer = tokenizer
        self.config = config
        self.optimizer = torch.optim.AdamW(
            model.parameters(),
            lr=config.learning_rate,
            weight_decay=config.weight_decay
        )

    def preprocess_batch(self, texts):
        """Preprocess a batch of texts for training"""
        # Tokenize texts
        tokens = self.tokenizer(
            texts,
            padding=True,
            truncation=True,
            max_length=self.config.max_length,
            return_tensors="pt"
        )

        # Create input and target sequences
        input_ids = tokens['input_ids'][:, :-1]
        target_ids = tokens['input_ids'][:, 1:]

        return input_ids, target_ids

    def compute_loss(self, input_ids, target_ids):
        """Compute next-token prediction loss"""
        outputs = self.model(input_ids)
        logits = outputs.logits

        # Flatten for cross-entropy loss
        logits_flat = logits.view(-1, logits.size(-1))
        targets_flat = target_ids.view(-1)

        loss = nn.CrossEntropyLoss(ignore_index=self.tokenizer.pad_token_id)
        return loss(logits_flat, targets_flat)

    def train_step(self, batch):
        """Single training step"""
        input_ids, target_ids = self.preprocess_batch(batch)

        # Forward pass
        loss = self.compute_loss(input_ids, target_ids)

        # Backward pass
        self.optimizer.zero_grad()
        loss.backward()

        # Gradient clipping
        torch.nn.utils.clip_grad_norm_(self.model.parameters(), 1.0)

        self.optimizer.step()

        return loss.item()

Fine-tuning Approaches

Supervised Fine-tuning (SFT)

def supervised_fine_tuning(model, instruction_dataset):
    """Fine-tune model on instruction-following tasks"""

    for batch in instruction_dataset:
        # Format: {"instruction": "...", "input": "...", "output": "..."}
        prompts = []
        targets = []

        for example in batch:
            prompt = f"Instruction: {example['instruction']}\nInput: {example['input']}\nOutput: "
            target = example['output']

            prompts.append(prompt)
            targets.append(target)

        # Train on the prompt -> target mapping
        # (train_step stands in for one optimization step on the
        #  tokenized pairs, as in LLMTrainer.train_step above)
        loss = train_step(model, prompts, targets)

    return model

Reinforcement Learning from Human Feedback (RLHF)

class RLHFTrainer:
    def __init__(self, policy_model, reward_model, ref_model):
        self.policy = policy_model
        self.reward_model = reward_model
        self.ref_model = ref_model
        self.kl_coeff = 0.1

    def compute_rewards(self, prompts, responses):
        """Compute rewards for generated responses"""
        # Reward model scores
        reward_scores = self.reward_model(prompts, responses)

        # KL penalty from reference model
        policy_logprobs = self.policy.get_logprobs(prompts, responses)
        ref_logprobs = self.ref_model.get_logprobs(prompts, responses)
        kl_penalty = self.kl_coeff * (policy_logprobs - ref_logprobs)

        return reward_scores - kl_penalty

    def ppo_step(self, prompts, responses, old_logprobs, rewards):
        """Proximal Policy Optimization step"""
        new_logprobs = self.policy.get_logprobs(prompts, responses)
        ratio = torch.exp(new_logprobs - old_logprobs)

        # PPO clipped objective (clip range [0.8, 1.2], i.e. epsilon = 0.2)
        advantages = rewards - rewards.mean()
        surr1 = ratio * advantages
        surr2 = torch.clamp(ratio, 0.8, 1.2) * advantages

        policy_loss = -torch.min(surr1, surr2).mean()

        return policy_loss

LLM Capabilities

Core Language Tasks

  1. Text Generation
def generate_text(model, tokenizer, prompt, max_new_tokens=100):
    """Generate a text continuation for the prompt"""
    inputs = tokenizer(prompt, return_tensors="pt")

    with torch.no_grad():
        outputs = model.generate(
            inputs.input_ids,
            max_new_tokens=max_new_tokens,
            temperature=0.7,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )

    # Decode, then strip the echoed prompt from the front
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return generated_text[len(prompt):]
  2. Few-shot Learning
def few_shot_classification(model, tokenizer, examples, query):
    """Perform classification using few-shot examples"""

    # Build prompt with examples
    prompt = "Classify the sentiment of these texts:\n\n"

    for example in examples:
        prompt += f"Text: {example['text']}\nSentiment: {example['label']}\n\n"

    prompt += f"Text: {query}\nSentiment:"

    # Generate a short classification label (a few new tokens suffice)
    response = generate_text(model, tokenizer, prompt, max_new_tokens=10)

    return response.strip()
  3. Reasoning and Problem Solving
def chain_of_thought_reasoning(model, tokenizer, problem):
    """Use chain-of-thought prompting for reasoning"""

    prompt = f"""
    Problem: {problem}

    Let me think step by step:

    Step 1:"""

    reasoning = generate_text(model, tokenizer, prompt, max_new_tokens=200)

    return reasoning

Emergent Abilities

Capabilities that appear at sufficient scale:

  1. In-context Learning: Learning from examples in the prompt
  2. Chain-of-thought Reasoning: Step-by-step problem solving
  3. Code Generation: Writing functional code
  4. Mathematical Reasoning: Solving complex math problems
  5. Multilingual Understanding: Working across languages

Practical Applications

Chatbots and Assistants

class LLMChatbot:
    def __init__(self, model, tokenizer, system_prompt=""):
        self.model = model
        self.tokenizer = tokenizer
        self.system_prompt = system_prompt
        self.conversation_history = []

    def chat(self, user_message):
        """Generate chatbot response"""
        # Build conversation context
        context = self.system_prompt + "\n\n"

        for turn in self.conversation_history:
            context += f"Human: {turn['user']}\nAssistant: {turn['assistant']}\n\n"

        context += f"Human: {user_message}\nAssistant:"

        # Generate response
        response = generate_text(
            self.model,
            self.tokenizer,
            context,
            max_new_tokens=150
        )

        # Update conversation history
        self.conversation_history.append({
            'user': user_message,
            'assistant': response
        })

        return response
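
A quick usage sketch, assuming model and tokenizer are an already-loaded causal LM pair (the system prompt and questions are illustrative):

bot = LLMChatbot(model, tokenizer, system_prompt="You are a helpful assistant.")
print(bot.chat("What is a transformer?"))
print(bot.chat("How does it differ from an RNN?"))  # history carries over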

Content Generation

class ContentGenerator:
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer

    def generate_article(self, topic, style="informative", length="medium"):
        """Generate article on given topic"""

        length_map = {
            "short": "Write a brief 200-word article",
            "medium": "Write a comprehensive 500-word article",
            "long": "Write a detailed 1000-word article"
        }

        prompt = f"""
        {length_map[length]} about {topic} in an {style} style.

        Title: {topic.title()}

        """

        article = generate_text(self.model, self.tokenizer, prompt, max_new_tokens=800)

        return article

    def generate_code(self, description, language="Python"):
        """Generate code from description"""

        prompt = f"""
        Write {language} code to {description}:

        ```{language.lower()}
        """

        code = generate_text(self.model, self.tokenizer, prompt, max_new_tokens=300)

        # Keep only the text before the closing backticks
        # (the prompt already opened the code fence)
        if "```" in code:
            code = code.split("```")[0]

        return code.strip()

Limitations and Challenges

1. Hallucination

LLMs can generate plausible but incorrect information:

def detect_hallucination(model, tokenizer, claim, knowledge_base):
    """Simple hallucination detection"""

    # Check if claim is supported by knowledge base
    verification_prompt = f"""
    Claim: {claim}

    Based on the following knowledge:
    {knowledge_base}

    Is this claim accurate? Answer: Yes or No
    Explanation:
    """

    verification = generate_text(model, tokenizer, verification_prompt, max_new_tokens=100)

    return "No" in verification[:10]  # Simple heuristic

2. Bias and Fairness

def bias_evaluation(model, tokenizer, templates):
    """Evaluate model bias across different groups"""

    results = {}

    for template in templates:
        for group in ["men", "women", "various ethnicities"]:
            prompt = template.format(group=group)
            response = generate_text(model, tokenizer, prompt, max_new_tokens=50)

            # Analyze sentiment/content of the response
            # (analyze_sentiment is a placeholder for any external
            #  sentiment classifier; it is not defined in this lesson)
            sentiment = analyze_sentiment(response)
            results[f"{template}_{group}"] = sentiment

    return results

3. Safety and Alignment

class SafetyFilter:
    def __init__(self, harmful_patterns):
        self.harmful_patterns = harmful_patterns

    def is_safe(self, text):
        """Check if generated text is safe"""
        text_lower = text.lower()

        for pattern in self.harmful_patterns:
            if pattern in text_lower:
                return False

        return True

    def safe_generate(self, model, tokenizer, prompt, max_attempts=3):
        """Generate text with safety filtering"""

        for attempt in range(max_attempts):
            response = generate_text(model, tokenizer, prompt)

            if self.is_safe(response):
                return response

        return "I cannot generate a safe response to this request."

Evaluation Metrics

Perplexity

Perplexity is the exponential of the average per-token cross-entropy; lower values mean the model assigns the text higher probability:

def calculate_perplexity(model, tokenizer, text):
    """Calculate perplexity of text under model"""

    tokens = tokenizer(text, return_tensors="pt")
    input_ids = tokens.input_ids

    with torch.no_grad():
        outputs = model(input_ids, labels=input_ids)
        loss = outputs.loss

    perplexity = torch.exp(loss)
    return perplexity.item()
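
A quick usage sketch, assuming a Hugging Face causal LM; the model name and sample text are illustrative:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ppl = calculate_perplexity(model, tokenizer, "The quick brown fox jumps over the lazy dog.")
print(f"Perplexity: {ppl:.2f}")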

BLEU Score for Generation

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def evaluate_generation(model, tokenizer, prompts, references):
    """Evaluate generation quality using BLEU"""

    bleu_scores = []

    for prompt, reference in zip(prompts, references):
        generated = generate_text(model, tokenizer, prompt)

        # Tokenize for BLEU calculation
        reference_tokens = reference.split()
        generated_tokens = generated.split()

        # Smoothing avoids zero scores on short generations
        bleu = sentence_bleu([reference_tokens], generated_tokens,
                             smoothing_function=SmoothingFunction().method1)
        bleu_scores.append(bleu)

    return sum(bleu_scores) / len(bleu_scores)

Future Directions

1. Multimodal LLMs

Integration with vision, audio, and other modalities

2. Efficient Architectures

  • Mixture of Experts (MoE; see the routing sketch after this list)
  • Sparse attention mechanisms
  • Model compression techniques
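
To illustrate the MoE idea, here is a minimal sketch of top-k token routing over small feed-forward experts; all sizes and routing details are illustrative simplifications:

import torch
import torch.nn as nn

class MoELayer(nn.Module):
    """Minimal mixture-of-experts layer with top-k routing per token."""
    def __init__(self, d_model, d_ff, num_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):
        # Flatten (batch, seq, d_model) into a list of tokens
        tokens = x.reshape(-1, x.size(-1))

        # Each token picks its top-k experts by router score
        gate_probs = torch.softmax(self.router(tokens), dim=-1)
        weights, indices = torch.topk(gate_probs, self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize

        # Only the chosen experts process each token
        output = torch.zeros_like(tokens)
        for i, expert in enumerate(self.experts):
            token_idx, slot = (indices == i).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue
            expert_out = expert(tokens[token_idx])
            output[token_idx] += weights[token_idx, slot].unsqueeze(-1) * expert_out

        return output.reshape(x.shape)

Because only top_k of num_experts experts run per token, parameter count can grow far faster than per-token compute, which is the appeal of sparse MoE models.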

3. Better Alignment

  • Constitutional AI
  • Improved RLHF methods
  • Value-based training

4. Specialized Applications

  • Scientific reasoning
  • Code generation
  • Creative applications

Key Takeaways

  • LLMs are transformer-based models trained on vast text corpora
  • They exhibit emergent abilities at scale, including few-shot learning
  • Training involves pre-training on text and fine-tuning for specific tasks
  • RLHF helps align models with human preferences
  • Applications span chatbots, content generation, and reasoning tasks
  • Key challenges include hallucination, bias, and safety concerns
  • Evaluation requires multiple metrics beyond perplexity

Practice Exercise

  1. Experiment with Prompting: Try different prompting strategies (zero-shot, few-shot, chain-of-thought) on a language model
  2. Bias Analysis: Evaluate a model's responses to prompts about different demographic groups
  3. Safety Testing: Test how a model responds to potentially harmful prompts
  4. Performance Comparison: Compare different sized models on the same task

Next Steps

Now that you understand LLMs, we'll explore the training techniques used to create these powerful models.

Continue to: LLM Training Techniques
