Vector Storage & Similarity Search

Now that you understand how to create text embeddings, the next step is learning how to store these vectors efficiently and search through them to find the most relevant content. This is where vector storage and similarity search come in: together they form the backbone of any effective RAG system.

Think of vector storage as a specialized library system. Instead of organizing books alphabetically, you organize them by their "meaning coordinates" so you can quickly find books with similar topics, even if they use different words.

What is Vector Storage?

Vector storage refers to specialized systems designed to store, index, and retrieve high-dimensional vectors (like our text embeddings) efficiently. Unlike traditional databases that store text, numbers, or structured data, vector databases are optimized for mathematical operations on arrays of floating-point numbers.

Why Do We Need Specialized Vector Storage?

Regular databases aren't designed for vector operations. Here's why:

Traditional Database Challenge:

-- This doesn't work for finding similar vectors
SELECT * FROM documents 
WHERE embedding = [0.2, -0.1, 0.8, ...] -- Exact match only

What We Actually Need:

-- PostgreSQL with pgvector - finds similar vectors
SELECT id, content, 1 - (embedding <=> '[0.2, -0.1, 0.8, ...]') AS similarity
FROM documents
ORDER BY embedding <=> '[0.2, -0.1, 0.8, ...]'
LIMIT 5;

Key Requirements for Vector Storage:

  1. High-Dimensional Support: Handle vectors with hundreds or thousands of dimensions
  2. Similarity Search: Find vectors that are "close" to a query vector
  3. Scalability: Efficiently search through millions of vectors
  4. Speed: Return results in milliseconds, not seconds
  5. Indexing: Smart organization for fast retrieval

Types of Vector Databases

Let's explore the different categories of vector storage solutions:

1. Purpose-Built Vector Databases

These are databases designed specifically for vector operations:

Pinecone:

  • Fully managed cloud service
  • Excellent for production applications
  • Automatic scaling and optimization
  • Built-in metadata filtering

Weaviate:

  • Open-source with cloud options
  • Strong integration with ML models
  • GraphQL API
  • Supports multiple vector spaces

Qdrant:

  • Open-source, Rust-based
  • High performance and memory efficiency
  • Rich filtering capabilities
  • Easy to self-host

2. Traditional Databases with Vector Extensions

Existing databases that added vector capabilities:

PostgreSQL with pgvector:

  • Adds vector data type to PostgreSQL
  • Familiar SQL interface
  • Good for existing PostgreSQL users
  • ACID compliance

Redis with RediSearch:

  • In-memory vector search
  • Extremely fast for smaller datasets
  • Familiar Redis interface
  • Good for caching and real-time applications

3. In-Memory Solutions

For development and smaller applications:

Faiss (Facebook AI Similarity Search):

  • Library, not a database
  • Extremely fast similarity search
  • Requires custom integration
  • Great for research and prototyping

Simple Arrays in Memory:

  • Store vectors in application memory
  • Use for small datasets or development
  • No persistence without additional work

Choosing the Right Solution

For Learning/Development:

  • PostgreSQL with pgvector (what we'll use in this tutorial)
  • Use Faiss for experimentation

For Small to Medium Applications:

  • PostgreSQL with pgvector
  • Self-hosted Qdrant

For Large-Scale Production:

  • Pinecone (managed)
  • Weaviate (cloud or self-hosted)
  • Qdrant (cloud or self-hosted)

What is Similarity Search?

Similarity search is the process of finding vectors that are "close" to a given query vector in high-dimensional space. Instead of looking for exact matches, we find the most similar items based on mathematical distance.
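To make "close" concrete, here is a minimal sketch of cosine similarity and cosine distance in plain TypeScript. The function names are illustrative and not part of any library; the distance variant mirrors what pgvector's <=> operator returns (1 minus the cosine similarity).

```typescript
// Cosine similarity: 1 = same direction, 0 = unrelated, -1 = opposite direction.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) {
    throw new Error('Cosine similarity is undefined for a zero vector');
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Cosine distance: 0 = identical direction, 2 = opposite direction.
function cosineDistance(a: number[], b: number[]): number {
  return 1 - cosineSimilarity(a, b);
}
```

With toy two-dimensional vectors, cosineSimilarity([1, 0], [0, 1]) is 0 because the vectors are perpendicular, so their cosine distance is 1. Real embeddings behave the same way, just with hundreds of dimensions.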

Real-World Analogy

Imagine you're in a music store and you say, "I like this song, find me similar music." The store clerk doesn't look for the exact same song, but finds music with similar:

  • Genre
  • Tempo
  • Mood
  • Instruments

That's exactly what similarity search does with text embeddings: it finds content with similar meaning and context, even when the wording differs.

Types of Similarity Search

1. Exact Search (Brute Force):

  • Compare query vector with every stored vector
  • Guaranteed to find the best matches
  • Slow for large datasets (O(n) complexity)
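As a rough illustration of exact search, the sketch below scores every stored vector against the query and keeps the best topK. The StoredVector shape and the function names are assumptions made for this example. It is fine for a few thousand vectors in memory, but the work grows linearly with the number of stored vectors.

```typescript
interface StoredVector {
  id: number;
  vector: number[];
}

interface ScoredMatch {
  id: number;
  similarity: number;
}

// Cosine similarity between two vectors of equal length.
function similarityOf(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Brute force: compare the query with every stored vector, then sort and truncate.
function exactSearch(query: number[], stored: StoredVector[], topK: number): ScoredMatch[] {
  return stored
    .map(({ id, vector }) => ({ id, similarity: similarityOf(query, vector) }))
    .sort((x, y) => y.similarity - x.similarity)
    .slice(0, topK);
}
```

Because every vector is scored, the result is guaranteed to contain the true nearest neighbors, which makes this a useful baseline for checking the recall of an approximate index.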

2. Approximate Search (ANN - Approximate Nearest Neighbors):

  • Use indexing to quickly find "good enough" matches
  • Much faster for large datasets
  • Slight trade-off in accuracy for speed
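One common way to get approximate search is an inverted-file (IVF) layout, the idea behind pgvector's ivfflat index: vectors are grouped into lists around centroids, and a query scans only the few lists whose centroids are nearest. The sketch below is a simplified illustration, not pgvector's implementation. It assumes the centroids are supplied by the caller, whereas a real index would learn them from the data (for example with k-means), and all names are hypothetical.

```typescript
type Vec = number[];

interface IndexedItem {
  id: number;
  vector: Vec;
}

function cos(a: Vec, b: Vec): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Centroid indexes ordered from most to least similar to v.
function rankCentroids(v: Vec, centroids: Vec[]): number[] {
  return centroids
    .map((c, index) => ({ index, similarity: cos(v, c) }))
    .sort((x, y) => y.similarity - x.similarity)
    .map((entry) => entry.index);
}

// Build step: put each item into the list of its nearest centroid.
function buildLists(items: IndexedItem[], centroids: Vec[]): IndexedItem[][] {
  const lists: IndexedItem[][] = centroids.map((): IndexedItem[] => []);
  for (const item of items) {
    const nearest = rankCentroids(item.vector, centroids)[0];
    if (nearest !== undefined) {
      lists[nearest]?.push(item);
    }
  }
  return lists;
}

// Query step: scan only the `probes` nearest lists instead of every vector.
function approximateSearch(
  query: Vec,
  centroids: Vec[],
  lists: IndexedItem[][],
  topK: number,
  probes: number
): Array<{ id: number; similarity: number }> {
  const candidates: IndexedItem[] = [];
  for (const listIndex of rankCentroids(query, centroids).slice(0, probes)) {
    candidates.push(...(lists[listIndex] ?? []));
  }
  return candidates
    .map(({ id, vector }) => ({ id, similarity: cos(query, vector) }))
    .sort((x, y) => y.similarity - x.similarity)
    .slice(0, topK);
}
```

Raising probes scans more lists, which improves recall at the cost of speed. With probes equal to the number of lists, the search degenerates into the exact brute-force scan.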

How Similarity Search Works

Let's break down the similarity search process:

Step 1: Vector Preparation

graph LR
    A["User Query: 'How to debug TypeScript?'"] --> B[Create Embedding] --> C["Query Vector: [0.1, -0.3, 0.7, ...]"]

Step 2: Database Query

graph TD
    A[Query Vector] --> B[PostgreSQL with pgvector]
    C[Stored Vectors] --> B
    B --> D[Similarity Calculation]
    D --> E[Ranked Results]

Step 3: Ranking and Selection

graph LR
    A[SQL Query] --> B[ORDER BY Similarity] --> C[LIMIT Top K Results]

The Complete Process:

  1. Query Processing: Convert user query to embedding vector
  2. SQL Execution: Run similarity search query in PostgreSQL
  3. Ranking: Database sorts results by similarity score
  4. Filtering: Apply any metadata filters (date, category, etc.)
  5. Return: Provide top K most similar results

Environment Setup

For this tutorial, we'll use PostgreSQL with the pgvector extension for production-ready vector storage:

# Install dependencies
npm install @google/generative-ai pg dotenv
npm install --save-dev @types/pg @types/node typescript ts-node

# Install PostgreSQL and pgvector (if not already installed)
# On macOS with Homebrew:
brew install postgresql
brew install pgvector

# On Ubuntu/Debian:
sudo apt-get install postgresql postgresql-contrib
sudo apt-get install postgresql-14-pgvector  # replace 14 with your PostgreSQL major version

Working Code Example

Let's build a PostgreSQL-based vector storage and similarity search system:

Step 1: Database Setup

First, let's create the database schema in pgAdmin or psql:

-- Create the pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Create documents table with vector column
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT NOT NULL,
    embedding vector(768), -- Google's text-embedding-004 uses 768 dimensions
    metadata JSONB,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

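-- Create an index for faster similarity searches.
-- Note: ivfflat builds its lists from the rows that already exist, so on a real
-- dataset create this index after loading data rather than on an empty table.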
CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);

This SQL creates our vector-enabled table with proper indexing for fast similarity searches.
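The lists = 100 value is only a starting point. The pgvector documentation suggests roughly rows / 1000 lists for up to 1M rows and sqrt(rows) beyond that, with about sqrt(lists) probes at query time. The helper below is a small sketch of those rules of thumb. The function names are made up for this tutorial, and the numbers should be tuned against your own data.

```typescript
// Rule-of-thumb starting value for the ivfflat `lists` parameter.
function suggestIvfflatLists(rowCount: number): number {
  if (rowCount <= 1000000) {
    return Math.max(1, Math.round(rowCount / 1000));
  }
  return Math.round(Math.sqrt(rowCount));
}

// Rule-of-thumb starting value for the query-time `ivfflat.probes` setting.
function suggestIvfflatProbes(lists: number): number {
  return Math.max(1, Math.round(Math.sqrt(lists)));
}

// Render the statements you would run in psql or pgAdmin.
function ivfflatStatements(rowCount: number): { createIndex: string; setProbes: string } {
  const lists = suggestIvfflatLists(rowCount);
  const probes = suggestIvfflatProbes(lists);
  return {
    createIndex:
      `CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops) WITH (lists = ${lists});`,
    setProbes: `SET ivfflat.probes = ${probes};`,
  };
}
```

For a table of 100,000 documents this yields lists = 100 and probes = 10. More probes scan more lists per query, trading speed for recall.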

Step 2: TypeScript Database Connection

// index.ts
import { GoogleGenerativeAI } from '@google/generative-ai';
import { Client } from 'pg';
import * as dotenv from 'dotenv';

dotenv.config();

interface VectorDocument {
  id: number;
  content: string;
  embedding: number[];
  metadata?: Record<string, any>;
  created_at: Date;
}

This sets up our TypeScript interfaces and imports for PostgreSQL integration.

Step 3: Create the PostgreSQL Vector Store Class

// index.ts
class PostgreSQLVectorStore {
  private client: Client;
  private genAI: GoogleGenerativeAI;
  private model: any;

  constructor() {
    this.client = new Client({
      host: process.env.DB_HOST || 'localhost',
      port: parseInt(process.env.DB_PORT || '5432'),
      database: process.env.DB_NAME || 'vector_db',
      user: process.env.DB_USER || 'postgres',
      password: process.env.DB_PASSWORD || 'password',
    });

    this.genAI = new GoogleGenerativeAI(process.env.GOOGLE_API_KEY!);
    this.model = this.genAI.getGenerativeModel({ model: "text-embedding-004" });
  }

  async connect() {
    await this.client.connect();
    console.log('Connected to PostgreSQL database');
  }

  async disconnect() {
    await this.client.end();
    console.log('Disconnected from PostgreSQL database');
  }
}

This creates our PostgreSQL-based vector store with proper database connection management.

Step 4: Add Document Storage Method

// index.ts (add to PostgreSQLVectorStore class)
async addDocument(content: string, metadata?: Record<string, any>): Promise<number> {
  try {
    // Create embedding for the content
    const result = await this.model.embedContent(content);
    const embedding = result.embedding.values;

    // Insert into PostgreSQL
    const query = `
      INSERT INTO documents (content, embedding, metadata)
      VALUES ($1, $2, $3)
      RETURNING id
    `;
    
    const values = [
      content,
      `[${embedding.join(',')}]`, // Convert array to PostgreSQL vector format
      metadata ? JSON.stringify(metadata) : null
    ];

    const res = await this.client.query(query, values);
    const documentId = res.rows[0].id;
    
    console.log(`Added document with ID: ${documentId}`);
    return documentId;
  } catch (error) {
    console.error('Error adding document:', error);
    throw error;
  }
}

This method creates embeddings and stores them directly in PostgreSQL using the vector data type.
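The array-to-string conversion is a natural place to catch bad input before it reaches the database. The sketch below uses hypothetical helper names (toPgVector and fromPgVector): one validates an embedding and formats it as a pgvector literal, the other parses the text form back into numbers. It assumes the 768-dimension column defined earlier and that the pg driver hands vector columns back as strings like '[0.1,0.2]'.

```typescript
// Must match the vector(768) column in the documents table.
const EMBEDDING_DIMENSIONS = 768;

// Validate an embedding and format it as a pgvector literal such as "[0.1,-0.2,0.3]".
function toPgVector(values: number[], expectedDimensions: number = EMBEDDING_DIMENSIONS): string {
  if (values.length !== expectedDimensions) {
    throw new Error(`Expected ${expectedDimensions} dimensions, got ${values.length}`);
  }
  if (!values.every((v) => Number.isFinite(v))) {
    throw new Error('Embedding contains NaN or Infinity');
  }
  return `[${values.join(',')}]`;
}

// Parse the text form of a vector column back into a number array.
function fromPgVector(text: string): number[] {
  const parsed: unknown = JSON.parse(text);
  if (!Array.isArray(parsed) || !parsed.every((v) => typeof v === 'number')) {
    throw new Error(`Not a vector literal: ${text}`);
  }
  return parsed as number[];
}
```

Failing fast here gives a clearer error than the one PostgreSQL raises when the dimensions don't match the column definition.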

Step 5: Implement Similarity Search

// index.ts (add to PostgreSQLVectorStore class)
async search(query: string, topK: number = 5): Promise<Array<VectorDocument & { similarity: number }>> {
  try {
    // Create embedding for the search query
    const result = await this.model.embedContent(query);
    const queryEmbedding = result.embedding.values;

    // Execute similarity search using PostgreSQL
    const searchQuery = `
      SELECT 
        id,
        content,
        metadata,
        created_at,
        1 - (embedding <=> $1) AS similarity
      FROM documents
      ORDER BY embedding <=> $1
      LIMIT $2
    `;

    const queryVector = `[${queryEmbedding.join(',')}]`;
    const res = await this.client.query(searchQuery, [queryVector, topK]);

    return res.rows.map(row => ({
      id: row.id,
      content: row.content,
      embedding: [], // We don't need to return the full embedding
      metadata: row.metadata,
      created_at: row.created_at,
      similarity: parseFloat(row.similarity)
    }));
  } catch (error) {
    console.error('Error searching documents:', error);
    throw error;
  }
}

This implements similarity search using PostgreSQL's native vector operations with the <=> cosine distance operator.

Step 6: Advanced Search with Metadata Filtering

// index.ts (add to PostgreSQLVectorStore class)
async searchWithFilter(
  query: string, 
  topK: number = 5, 
  metadataFilter?: Record<string, any>
): Promise<Array<VectorDocument & { similarity: number }>> {
  try {
    const result = await this.model.embedContent(query);
    const queryEmbedding = result.embedding.values;

    let searchQuery = `
      SELECT 
        id,
        content,
        metadata,
        created_at,
        1 - (embedding <=> $1) AS similarity
      FROM documents
    `;

    const queryParams: any[] = [`[${queryEmbedding.join(',')}]`];

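    // Add metadata filtering if provided (skip empty filter objects)
    if (metadataFilter && Object.keys(metadataFilter).length > 0) {
      // Pass keys and values as parameters so nothing user-supplied is
      // interpolated into the SQL string (avoids SQL injection via key names)
      const filterConditions = Object.entries(metadataFilter).map(([key, value]) => {
        queryParams.push(key, String(value));
        return `metadata->>$${queryParams.length - 1}::text = $${queryParams.length}::text`;
      });

      searchQuery += ` WHERE ${filterConditions.join(' AND ')}`;
    }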

    searchQuery += ` ORDER BY embedding <=> $1 LIMIT $${queryParams.length + 1}`;
    queryParams.push(topK);

    const res = await this.client.query(searchQuery, queryParams);

    return res.rows.map(row => ({
      id: row.id,
      content: row.content,
      embedding: [],
      metadata: row.metadata,
      created_at: row.created_at,
      similarity: parseFloat(row.similarity)
    }));
  } catch (error) {
    console.error('Error searching with filter:', error);
    throw error;
  }
}

This adds advanced filtering capabilities using PostgreSQL's JSONB operators for metadata queries.
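Building the WHERE clause is easier to reason about, and to unit test, when it lives in a pure function that never touches the database. The sketch below is one way to do that. buildMetadataFilter is a hypothetical helper, and it assumes flat, top-level metadata keys compared as text. Both keys and values are returned as parameters, so neither is concatenated into the SQL.

```typescript
interface FilterClause {
  sql: string;      // "" when there is nothing to filter on
  params: string[]; // alternating key, value, key, value, ...
}

// firstParamIndex is the first free placeholder number, e.g. 2 when $1 holds the query vector.
function buildMetadataFilter(
  filter: Record<string, string | number | boolean>,
  firstParamIndex: number
): FilterClause {
  const params: string[] = [];
  const conditions = Object.entries(filter).map(([key, value]) => {
    const keyIndex = firstParamIndex + params.length;
    params.push(key, String(value));
    // ->> extracts the JSONB field as text, so the value is compared as text too
    return `metadata->>$${keyIndex}::text = $${keyIndex + 1}::text`;
  });
  return {
    sql: conditions.length > 0 ? ` WHERE ${conditions.join(' AND ')}` : '',
    params,
  };
}
```

Calling buildMetadataFilter({ category: 'programming' }, 2) returns the clause " WHERE metadata->>$2::text = $3::text" with params ['category', 'programming'], ready to append after the query vector parameter.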

Step 7: Test the PostgreSQL Vector Store

// index.ts
async function testPostgreSQLVectorStore() {
  const vectorStore = new PostgreSQLVectorStore();
  
  try {
    await vectorStore.connect();

    // Add some test documents
    await vectorStore.addDocument(
      "TypeScript is a strongly typed programming language that builds on JavaScript",
      { category: "programming", language: "typescript" }
    );
    
    await vectorStore.addDocument(
      "JavaScript is a versatile programming language for web development",
      { category: "programming", language: "javascript" }
    );
    
    await vectorStore.addDocument(
      "The weather forecast shows sunny skies for the weekend",
      { category: "weather", location: "general" }
    );
    
    await vectorStore.addDocument(
      "React is a popular JavaScript library for building user interfaces",
      { category: "programming", language: "javascript", framework: "react" }
    );

    // Perform similarity search
    console.log('\n=== Basic Search Results ===');
    const results = await vectorStore.search("programming languages", 3);
    
    results.forEach((result, index) => {
      console.log(`${index + 1}. ${result.content}`);
      console.log(`   Similarity: ${result.similarity.toFixed(3)}`);
      console.log(`   Category: ${result.metadata?.category}`);
      console.log('---');
    });

    // Perform filtered search
    console.log('\n=== Filtered Search Results (Programming only) ===');
    const filteredResults = await vectorStore.searchWithFilter(
      "web development", 
      2, 
      { category: "programming" }
    );
    
    filteredResults.forEach((result, index) => {
      console.log(`${index + 1}. ${result.content}`);
      console.log(`   Similarity: ${result.similarity.toFixed(3)}`);
      console.log(`   Language: ${result.metadata?.language}`);
      console.log('---');
    });

  } finally {
    await vectorStore.disconnect();
  }
}

This demonstrates both basic similarity search and advanced filtering using PostgreSQL.

Step 8: SQL Queries You Can Run in pgAdmin

Here are some useful SQL queries you can execute directly in pgAdmin to explore your vector data:

-- View all documents with their similarity to a specific query
SELECT 
  id,
  content,
  metadata,
  1 - (embedding <=> '[0.1, -0.2, 0.3, ...]') AS similarity
FROM documents
ORDER BY similarity DESC
LIMIT 10;

-- Find documents by category with similarity search
SELECT 
  id,
  content,
  metadata->>'category' as category,
  1 - (embedding <=> '[0.1, -0.2, 0.3, ...]') AS similarity
FROM documents
WHERE metadata->>'category' = 'programming'
ORDER BY similarity DESC
LIMIT 5;

-- Get statistics about your vector database
SELECT 
  COUNT(*) as total_documents,
  COUNT(DISTINCT metadata->>'category') as unique_categories,
  AVG(vector_dims(embedding)) as avg_dimensions
FROM documents;

Step 9: Complete Example

// index.ts
async function main() {
  console.log('=== PostgreSQL Vector Store Demo ===\n');
  await testPostgreSQLVectorStore();
}

main().catch(console.error);

This brings everything together for a complete PostgreSQL-based vector storage and similarity search system.

Summary

Vector storage and similarity search using PostgreSQL with pgvector provides a production-ready foundation for RAG systems. By leveraging SQL's familiar interface with vector capabilities, you can build scalable and maintainable vector search applications.

Key takeaways:

  • PostgreSQL with pgvector combines traditional database benefits with vector search
  • Use proper indexing (ivfflat) for fast similarity searches on large datasets
  • SQL queries make vector operations accessible and debuggable
  • Metadata filtering allows for sophisticated search refinement
  • The <=> operator computes cosine distance; subtract it from 1 to get cosine similarity
  • pgAdmin makes it easy to inspect and debug your vector data

In the next tutorial, we'll combine everything we've learned about embeddings and vector storage to build a complete RAG pipeline that can answer questions using your own documents.

Complete Code

You can find the complete, runnable code for this tutorial on GitHub: https://github.com/avestalabs/academy/tree/main/4-rag/vector-storage%26similarity-search
