Text Embeddings


In the previous tutorial, we learned about RAG (Retrieval-Augmented Generation) and how it solves the knowledge gap problem in LLMs. Now, let's dive into one of the most crucial components that makes RAG possible: text embeddings.

Think of embeddings as a way to convert human-readable text into a numerical format that computers can understand and compare. It's like creating a mathematical "fingerprint" for each piece of text.

What are Embeddings?

Embeddings are numerical representations of data - in our case, text. Instead of storing text as words and sentences, embeddings convert text into arrays of numbers (vectors) that capture the semantic meaning.

Here's a simple analogy: imagine you're organizing books in a library. Instead of just sorting them alphabetically, you create a numerical system where books with similar topics get similar numbers. That's essentially what embeddings do for text.

graph LR
    A["Hello World"] --> B[Embedding Model]
    B --> C["[0.2, -0.1, 0.8, ...]"]
    D["Hi there"] --> B
    B --> E["[0.19, -0.09, 0.81, ...]"]
    F["Goodbye"] --> B
    B --> G["[-0.3, 0.7, -0.2, ...]"]

Notice how "Hello World" and "Hi there" have similar numbers because they're both greetings, while "Goodbye" has different numbers because it has a different meaning.

Text Embeddings Explained

Text embeddings are vectors (arrays of numbers) that represent the semantic meaning of text. Each dimension in the vector captures different aspects of the text's meaning, relationships, and context.

Key Properties of Text Embeddings:

1. Semantic Similarity: Texts with similar meanings have similar embedding vectors.

  • "dog" and "puppy" will have similar embeddings
  • "car" and "automobile" will be close in vector space

2. Mathematical Operations in Embeddings: Embeddings let us do “math” with words to find hidden relationships. A classic example is:

'king' - 'man' + 'woman' ≈ 'queen'

Here’s an easy way to understand it:

  1. Start with “king”: This represents both royalty and male.
  2. Subtract “man”: This removes the male part, leaving only royalty.
  3. Add “woman”: This adds the female part to the remaining royalty.
  4. Result: You get “queen”, which combines royalty + female.

Think of it like a simple recipe (a small code sketch follows the list):

  • king = royalty + male
  • king - man = royalty
  • royalty + woman = queen
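
To make the recipe concrete, here is a tiny sketch with made-up 3-dimensional vectors (real embeddings have hundreds of dimensions, and the relationship only holds approximately):

// Toy 3-dimensional "embeddings" - the numbers are invented purely for illustration.
// Imagine the dimensions roughly encode [royalty, maleness, femaleness].
const king  = [0.9, 0.8, 0.1];
const man   = [0.1, 0.9, 0.0];
const woman = [0.1, 0.0, 0.9];
const queen = [0.9, 0.0, 0.9];

// Element-wise vector arithmetic: king - man + woman
const result = king.map((value, i) => value - man[i] + woman[i]);

console.log(result); // roughly [0.9, -0.1, 1.0] - close to our toy "queen" vector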

3. Dimensionality: Think of a "dimension" as a single number in the embedding vector. Each number represents a different abstract feature or aspect of the text's meaning that the model has learned. Most modern embedding models create vectors with hundreds or thousands of dimensions (typically 256, 512, 768, or 1536 dimensions).

4. Contextual Understanding: Modern embeddings understand context, so "bank" near "river" vs "bank" near "money" get different embeddings.

Why Embeddings Matter for RAG:

In RAG systems, embeddings allow us to:

  • Convert both documents and user queries into the same numerical format
  • Find the most relevant documents by comparing embedding similarity (see the sketch after this list)
  • Retrieve context that's semantically related, not just keyword-matched
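
To preview that ranking step, here is a toy sketch using made-up vectors; later in this tutorial we generate real vectors with an embedding model and implement the same similarity function properly:

// Toy vectors standing in for real embeddings - the numbers are invented for illustration.
const queryVector = [0.8, 0.1, 0.3];
const documents = [
  { text: "TypeScript tutorial",   vector: [0.7, 0.2, 0.3] },
  { text: "Chocolate cake recipe", vector: [0.1, 0.9, 0.6] },
];

// Cosine similarity: how closely two vectors point in the same direction
function similarity(a: number[], b: number[]): number {
  const dot  = a.reduce((sum, v, i) => sum + v * b[i], 0);
  const magA = Math.sqrt(a.reduce((sum, v) => sum + v * v, 0));
  const magB = Math.sqrt(b.reduce((sum, v) => sum + v * v, 0));
  return dot / (magA * magB);
}

// Sort documents so the one most similar to the query comes first
const ranked = [...documents].sort(
  (a, b) => similarity(queryVector, b.vector) - similarity(queryVector, a.vector)
);

console.log(ranked[0].text); // "TypeScript tutorial"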

How to Create Text Embeddings

Creating text embeddings involves using pre-trained models that have learned to convert text into meaningful numerical representations. Google's Gemini API, OpenAI's text-embedding-3 models, and Cohere's multilingual embed models all provide powerful embedding capabilities that we can use directly.

The Process:

  1. Send text to an embedding model
  2. Receive back a numerical vector
  3. Store or use the vector for similarity comparisons

Embedding Models & Quality

Different embedding models have different strengths. Here are a few popular ones:

  • Google text-embedding-004:
    • 768 dimensions
    • Excellent for general-purpose text understanding and retrieval.
  • OpenAI text-embedding-ada-002:
    • 1536 dimensions
    • A widely used and powerful model, known for strong performance across many tasks.
  • Hugging Face all-MiniLM-L6-v2:
    • 384 dimensions
    • A very popular, lightweight model that runs locally. It's fast and great for applications where you don't need the power (or cost) of a large API-based model.

Quality Factors:

  • Dimension Size: Higher dimensions can capture more nuance but require more storage
  • Training Data: Models trained on diverse, high-quality data perform better
  • Task Specialization: Some models are optimized for specific tasks (search, classification, etc.)

Environment Setup

Before we start creating embeddings, let's set up our TypeScript environment:

# Install the required dependencies
npm install @google/generative-ai dotenv
npm install @types/node typescript ts-node

Make sure you have your Google API key ready in your .env file:

GOOGLE_API_KEY=your_gemini_api_key_here

Working Code Example

Let's break down how to create text embeddings step by step in our index.ts file.

Step 1: Import Dependencies and Set Up the Client

import { GoogleGenerativeAI } from '@google/generative-ai';
import * as dotenv from 'dotenv';

// Load environment variables
dotenv.config();

This imports the Google Generative AI library and loads our API key from the environment variables.

Step 2: Initialize the Embedding Model

const genAI = new GoogleGenerativeAI(process.env.GOOGLE_API_KEY!);

// Get the embedding model
const model = genAI.getGenerativeModel({ 
  model: "text-embedding-004" 
});

Here we're creating a client and specifically requesting the text embedding model. text-embedding-004 is Google's general-purpose text embedding model; it produces 768-dimensional vectors optimized for semantic understanding and retrieval.

Step 3: Create a Simple Embedding Function

async function createEmbedding(text: string) {
  try {
    const result = await model.embedContent(text);
    return result.embedding.values;
  } catch (error) {
    console.error('Error creating embedding:', error);
    throw error;
  }
}

This function takes any text string and returns its embedding as an array of numbers. The embedContent method handles all the complex processing internally.
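
If you need to embed many chunks at once (as a RAG indexing pipeline usually does), the SDK also offers a batch call. Here is a minimal sketch based on its batchEmbedContents method; the exact request shape can vary between SDK versions, so treat this as an assumption and check the @google/generative-ai docs:

// Sketch: embed several texts in a single request instead of looping over embedContent.
// Assumes the batchEmbedContents method and request shape - verify against the SDK docs.
async function createEmbeddings(texts: string[]) {
  const result = await model.batchEmbedContents({
    requests: texts.map((text) => ({
      content: { role: "user", parts: [{ text }] },
    })),
  });
  // One embedding (an array of numbers) per input text, in the same order
  return result.embeddings.map((embedding) => embedding.values);
}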

Step 4: Test with Different Texts

async function testEmbeddings() {
  const texts = [
    "I love programming in TypeScript",
    "TypeScript development is enjoyable",
    "The weather is sunny today"
  ];

  for (const text of texts) {
    const embedding = await createEmbedding(text);
    console.log(`Text: "${text}"`);
    console.log(`Embedding dimensions: ${embedding.length}`);
    console.log(`First 5 values: [${embedding.slice(0, 5).join(', ')}]`);
    console.log('---');
  }
}

This demonstrates creating embeddings for different texts. Notice how we only show the first 5 values since the full embedding has 768 numbers.

Step 5: Compare Embedding Similarity

function cosineSimilarity(vecA: number[], vecB: number[]): number {
  const dotProduct = vecA.reduce((sum, a, i) => sum + a * vecB[i], 0);
  const magnitudeA = Math.sqrt(vecA.reduce((sum, a) => sum + a * a, 0));
  const magnitudeB = Math.sqrt(vecB.reduce((sum, b) => sum + b * b, 0));
  return dotProduct / (magnitudeA * magnitudeB);
}

async function compareSimilarity() {
  const text1 = "I enjoy coding in TypeScript";
  const text2 = "TypeScript programming is fun";
  const text3 = "The cat is sleeping";

  const emb1 = await createEmbedding(text1);
  const emb2 = await createEmbedding(text2);
  const emb3 = await createEmbedding(text3);

  console.log('Similarity between programming texts:', 
    cosineSimilarity(emb1, emb2).toFixed(3));
  console.log('Similarity between programming and cat text:', 
    cosineSimilarity(emb1, emb3).toFixed(3));
}

This shows how to measure similarity between embeddings using cosine similarity. The programming-related texts will have a higher similarity score than unrelated texts.
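
A quick note on the math: cosine similarity measures the angle between two vectors rather than their length, so the result always lands between -1 and 1, where 1 means the vectors point in exactly the same direction. Here is a tiny worked example with made-up 2-dimensional vectors, reusing the cosineSimilarity function from above:

// Worked example: B points in the same direction as A, it is just twice as long.
const a = [1, 2];
const b = [2, 4];

// dot product        = 1*2 + 2*4          = 10
// |A| * |B|          = sqrt(5) * sqrt(20) = 10
// cosine similarity  = 10 / 10            = 1.0
console.log(cosineSimilarity(a, b)); // 1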

Step 6: Running Everything Together

async function main() {
  console.log('=== Creating Text Embeddings ===\n');
  await testEmbeddings();
  
  console.log('\n=== Comparing Similarities ===\n');
  await compareSimilarity();
}

main().catch(console.error);

This ties everything together, demonstrating both embedding creation and similarity comparison.
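
Assuming the code lives in index.ts as described above, you can run it with ts-node:

npx ts-node index.ts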


Summary

Text embeddings are the foundation of modern RAG systems. They convert text into numerical vectors that capture semantic meaning, allowing computers to understand and compare text content mathematically.

Key takeaways:

  • Embeddings represent text as arrays of numbers that capture meaning
  • Similar texts have similar embedding vectors
  • Google's text-embedding-004 model provides high-quality 768-dimensional embeddings
  • You can measure text similarity using cosine similarity between embeddings
  • Embeddings enable semantic search, not just keyword matching

In the next tutorial, we'll learn how to store these embeddings efficiently and perform similarity searches to find the most relevant documents for our RAG system.

Complete Code

You can find the complete, runnable code for this tutorial on GitHub: https://github.com/avestalabs/academy/tree/main/4-rag/text-embeddings
