# Vectorized Databases

Vectorized databases store data as numerical vectors (arrays of numbers) alongside the original text, enabling **semantic search** - finding content by meaning rather than exact keyword matches.

## Core Principle

**Traditional search (keyword-based):**

- "Find documents containing 'tritone substitution'"
- Returns only exact matches
- Misses synonyms, related concepts, paraphrases

**Vector search (semantic):**

- "Find documents about harmonic substitution techniques"
- Returns conceptually similar content even without exact keywords
- Matches: "flat-five sub", "altered dominant", "reharmonization"

## How Vector Search Works

### Step 1: Create Embeddings

**Embedding:** Convert text into a vector (list of numbers) that captures semantic meaning.

```javascript
import { OpenAIEmbeddings } from "langchain/embeddings/openai";

const embeddings = new OpenAIEmbeddings();

const text = "The tritone substitution uses a chord a tritone away";
const vector = await embeddings.embedQuery(text);

// Result: [0.023, -0.145, 0.876, ..., 0.432]
// (typically 1536 numbers for OpenAI embeddings)
```

**Key insight:** Similar concepts produce similar vectors.

```javascript
const vec1 = await embeddings.embedQuery("tritone substitution");
const vec2 = await embeddings.embedQuery("flat-five sub");
const vec3 = await embeddings.embedQuery("pizza recipe");

// vec1 and vec2 are close in vector space
// vec3 is far from both
```

### Step 2: Store Vectors in Database

A **vector database** stores these embeddings together with their source text.
```javascript
import { MemoryVectorStore } from "langchain/vectorstores/memory";

const vectorStore = await MemoryVectorStore.fromTexts(
  [
    "Tritone substitution replaces V7 with bII7",
    "Modal interchange borrows chords from parallel modes",
    "Voice leading creates smooth melodic motion"
  ],
  [{ source: "harmony" }, { source: "harmony" }, { source: "melody" }],
  embeddings
);
```

### Step 3: Query by Similarity

**Similarity search:** Find the vectors closest to the query vector.

```javascript
const results = await vectorStore.similaritySearch(
  "What are common reharmonization techniques?",
  2 // k: return top 2 matches
);

// Returns:
// 1. "Tritone substitution replaces V7 with bII7"
// 2. "Modal interchange borrows chords from parallel modes"

// Note: Neither document contains "reharmonization"!
```

## Vector Similarity Explained

### Cosine Similarity

Measures the angle between two vectors. Range: -1 (opposite) to 1 (identical).

```
similarity = (A · B) / (||A|| × ||B||)
```

**Example:**

- "jazz improvisation" ↔ "bebop soloing" → 0.85 (very similar)
- "jazz improvisation" ↔ "classical counterpoint" → 0.42 (somewhat related)
- "jazz improvisation" ↔ "car engine repair" → 0.05 (unrelated)

### Distance Metrics

**Common distance functions:**

- **Cosine similarity** - Best for text (ignores magnitude, focuses on direction)
- **Euclidean distance** - Straight-line distance (used for some embeddings)
- **Dot product** - Fast but sensitive to vector magnitude

## Vector Database Options

### In-Memory (Development)

**LangChain MemoryVectorStore:**

```javascript
const vectorStore = await MemoryVectorStore.fromDocuments(
  documents,
  embeddings
);
```

**Pros:** Simple, fast, no setup
**Cons:** Lost on restart, limited scale

### Production Databases

#### Pinecone

Cloud-hosted, managed service

```javascript
import { PineconeStore } from "langchain/vectorstores/pinecone";

const vectorStore = await PineconeStore.fromDocuments(
  documents,
  embeddings,
  { pineconeIndex }
);
```

#### Weaviate

Open-source,
self-hosted or cloud

```javascript
import { WeaviateStore } from "langchain/vectorstores/weaviate";
```

#### Supabase (PostgreSQL + pgvector)

Use an existing Postgres database

```javascript
import { SupabaseVectorStore } from "langchain/vectorstores/supabase";
```

#### Convex (Your Stack)

Store vectors in Convex tables with a custom similarity search

```javascript
// Convex schema (convex/schema.ts; the table name is the schema key)
export default defineSchema({
  embeddings: defineTable({
    text: v.string(),
    vector: v.array(v.float64()),
    metadata: v.object({ ... })
  })
});

// Query with a custom function. Convex filters can't run arbitrary JS,
// so fetch candidates and rank them in the function body:
const docs = await ctx.db.query("embeddings").collect();
const results = docs
  .filter(d => cosineSimilarity(d.vector, queryVector) > 0.7)
  .slice(0, 10);
```

## RAG (Retrieval Augmented Generation)

**Pattern:** Enhance LLM responses with relevant data from a vector store.

### Basic RAG Flow

```
User Question
    ↓
Embed Query → Vector Search → Retrieve Top K Documents
    ↓
LLM (Question + Retrieved Docs)
    ↓
Answer
```

### Implementation

```javascript
import { RetrievalQAChain } from "langchain/chains";

const chain = RetrievalQAChain.fromLLM(
  model,
  vectorStore.asRetriever()
);

const result = await chain.invoke({
  query: "How does Coltrane use tritone substitutions in Giant Steps?"
});

// Chain automatically:
// 1. Embeds the question
// 2. Searches the vector store
// 3. Passes relevant docs + question to the LLM
// 4. Returns a synthesized answer
```

### Advanced RAG: Conversational

**Add memory for multi-turn conversations:**

```javascript
import { ConversationalRetrievalQAChain } from "langchain/chains";

const chain = ConversationalRetrievalQAChain.fromLLM(
  model,
  vectorStore.asRetriever(),
  { memory: new BufferMemory({ memoryKey: "chat_history", returnMessages: true }) }
);

// First question
await chain.call({
  question: "What is modal interchange?"
});

// Follow-up (uses conversation history)
await chain.call({
  question: "Give me an example in Giant Steps"
});
// LLM knows the follow-up refers to modal interchange
```

## Embeddings in Depth

### What Gets Embedded

**Text chunks (documents):**

- Song annotations
- Literature notes
- NNT scores (as text representation)
- Article sections
- Code comments

**Optimal chunk size:** 200-500 tokens

- Too small: Loses context
- Too large: Dilutes semantic meaning

**Chunking strategy:**

```javascript
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 500,
  chunkOverlap: 50 // Maintain context between chunks
});

const chunks = await splitter.splitDocuments(documents);
```

### Embedding Models

**OpenAI text-embedding-ada-002:**

- 1536 dimensions
- Good general-purpose
- Cost: $0.0001 / 1K tokens

**OpenAI text-embedding-3-small:**

- 1536 dimensions
- Newer, better performance
- Cost: $0.00002 / 1K tokens

**Open-source alternatives:**

- sentence-transformers (Python)
- all-MiniLM-L6-v2 (lightweight)
- Custom fine-tuned models (for specialized domains like music theory)

## Vector Search vs Traditional Search

### When to Use Vector Search

**✅ Vector search excels at:**

- **Semantic similarity** - "Find songs about loss" (matches "grief", "heartbreak", "mourning")
- **Multilingual** - Embeddings work across languages
- **Fuzzy concepts** - "Songs with a dark, brooding atmosphere"
- **No exact keywords** - User can't articulate exact terms

**Example: Music research**

Query: "Examples of unexpected harmonic resolution"

Matches:

- "The chord progression avoided the expected cadence"
- "Deceptive cadence creates surprise"
- "Unresolved tension through non-functional harmony"

### When to Use Traditional Search (ripgrep)

**✅ Traditional search (ripgrep) excels at:**

- **Exact matches** - "Find all mentions of 'Cmaj7#11'"
- **Pattern matching** - "Find files containing /[A-G][#b]?maj7/"
- **Speed** -
Milliseconds vs seconds for vector search
- **No cost** - No API calls or embedding generation
- **Code search** - Finding function names, variables, filenames

**Example: Vault maintenance**

Query: "Find all files that use aliased wiki links"

Use: `rg '\[\[.*\|.*\]\]'` (regex pattern matching)

### Hybrid Approach (Best Practice)

**Combine both for optimal results:**

1. **Initial filter with ripgrep** (fast, exact)

   ```bash
   rg -l "Giant Steps" ~/vault/
   # → Returns 15 files
   ```

2. **Semantic search within results** (accurate, conceptual)

   ```javascript
   const filtered = await vectorStore.similaritySearch(
     "analysis of harmonic movement in Giant Steps",
     5, // k
     { file: giantStepsFiles } // filter syntax varies by vector store
   );
   ```

## Real-World Use Cases

### Use Case 1: Song Annotation Search

**Your problem:** Search across 100+ song annotation articles for "descending chromatic bass lines"

**Solution:**

```javascript
// Embed all song annotations
const songs = await loadSongArticles();
const vectorStore = await ConvexVectorStore.fromDocuments(
  songs,
  embeddings,
  { convex, table: "song_embeddings" }
);

// Semantic search
const results = await vectorStore.similaritySearch(
  "songs with chromatic descending bass motion",
  10 // k: top 10 matches
);
```

**Returns articles about:**

- "I'll Remember April" (walking bass with chromatic approach)
- "Stairway to Heaven" (descending chromatic line in intro)
- "My Funny Valentine" (chromatic voice leading in bridge)

### Use Case 2: PhD Literature Review

**Your problem:** "Find all notes related to notation systems in contemporary music"

**Solution:**

```javascript
const literatureStore = await ConvexVectorStore.fromTexts(
  obsidianVaultNotes,
  {}, // shared metadata for all notes
  embeddings
);

const papers = await literatureStore.similaritySearch(
  "contemporary music notation systems and graphic scores",
  20 // k: top 20 matches
);
```

**Finds notes even if they use different terminology:**

- "Extended techniques notation"
- "Graphic scores in 20th century music"
- "Alternative notation methods"

### Use Case 3: NNT Query Interface

**Your
vision:** "Natural language query for NNT scores"

**Implementation:**

```javascript
// User types: "Show me tritone substitutions in bebop tunes"

const nntAgent = await createAgent({
  tools: [
    nntVectorSearch, // Semantic search of scores
    nntParser,       // Parse NNT syntax
    convexQuery      // Structured data queries
  ],
  llm: model
});

const result = await nntAgent.invoke({
  input: "Show me tritone substitutions in bebop tunes"
});

// Agent:
// 1. Searches vector store for "tritone substitution" + "bebop"
// 2. Parses NNT scores to find bII7 chords
// 3. Queries Convex for tune metadata
// 4. Returns matching passages with playback links
```

## Performance and Cost Considerations

### Embedding Generation Costs

**OpenAI text-embedding-3-small:**

- $0.00002 per 1K tokens
- Average song annotation: ~500 tokens
- 1000 songs embedded: $0.01 (one cent!)

**One-time vs incremental:**

- Embed entire vault once: ~$1-5
- Incremental updates: pennies per day

### Query Performance

**Vector search speed:**

- In-memory (MemoryVectorStore): <50ms
- Pinecone (cloud): 50-200ms
- Convex (custom implementation): 100-500ms depending on index

**Optimization:**

- Pre-filter with metadata (e.g., artist, year)
- Cache popular queries
- Use approximate nearest neighbor (ANN) algorithms

### Storage Requirements

**Per document:**

- Text: 1-10KB
- Embedding (1536 dimensions × 4 bytes): ~6KB
- Total: ~16KB per document

**For 10,000 documents:** ~160MB (tiny!)
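The timings above all come down to the same underlying computation: score every stored vector against the query vector and keep the top k. A minimal brute-force sketch in plain JavaScript makes this concrete (the 3-dimensional vectors and sample texts are made-up toy values, not real 1536-dimensional embeddings):

```javascript
// Cosine similarity: (A · B) / (||A|| × ||B||), range -1 to 1
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Brute-force top-k search: score every document, sort descending, slice
function topK(store, queryVector, k) {
  return store
    .map(doc => ({ ...doc, score: cosineSimilarity(doc.vector, queryVector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Toy "embeddings": harmony-related texts point in a similar direction
const store = [
  { text: "tritone substitution", vector: [0.9, 0.1, 0.0] },
  { text: "flat-five sub",        vector: [0.8, 0.2, 0.1] },
  { text: "pizza recipe",         vector: [0.0, 0.1, 0.9] }
];

const results = topK(store, [0.85, 0.15, 0.05], 2);
// The two harmony entries rank far above "pizza recipe" (score ≈ 0.08)
```

This linear scan is exactly what an in-memory store or a custom Convex function does; ANN indexes (as in Pinecone or pgvector) exist to avoid scoring every vector once collections get large.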
## Getting Started with Vector Search

### Step 1: Choose Embedding Model

```bash
npm install @langchain/openai
```

```javascript
import { OpenAIEmbeddings } from "@langchain/openai";

const embeddings = new OpenAIEmbeddings({
  modelName: "text-embedding-3-small"
});
```

### Step 2: Prepare Documents

```javascript
import { Document } from "langchain/document";

const docs = songAnnotations.map(annotation =>
  new Document({
    pageContent: annotation.text,
    metadata: {
      title: annotation.title,
      artist: annotation.artist,
      section: annotation.section
    }
  })
);
```

### Step 3: Create Vector Store

```javascript
import { MemoryVectorStore } from "langchain/vectorstores/memory";

const vectorStore = await MemoryVectorStore.fromDocuments(
  docs,
  embeddings
);
```

### Step 4: Search

```javascript
const results = await vectorStore.similaritySearch(
  "jazz songs with modal interchange",
  4 // top 4 results
);

results.forEach(doc => {
  console.log(doc.metadata.title);
  console.log(doc.pageContent);
});
```

## See Also

- [[Langchain Overview]] - Framework for using vector stores
- [[Langchain with Convex]] - Storing vectors in your database
- [[ripgrep vs Vector Search]] - Choosing the right search tool
- [[Database Theory]] - Fundamentals of data storage