# Vectorized Databases
Vectorized databases store data as numerical vectors (arrays of numbers) rather than as raw text, enabling semantic search: finding content by meaning rather than by exact keyword matches.
## Core Principle
Traditional search (keyword-based):
- "Find documents containing 'tritone substitution'"
- Returns only exact matches
- Misses synonyms, related concepts, paraphrases
Vector search (semantic):
- "Find documents about harmonic substitution techniques"
- Returns conceptually similar content even without exact keywords
- Matches: "flat-five sub", "altered dominant", "reharmonization"
## How Vector Search Works

### Step 1: Create Embeddings

Embedding: Convert text into a vector (a list of numbers) that captures semantic meaning.

```javascript
import { OpenAIEmbeddings } from "langchain/embeddings/openai";

const embeddings = new OpenAIEmbeddings();

const text = "The tritone substitution uses a chord a tritone away";
const vector = await embeddings.embedQuery(text);
// Result: [0.023, -0.145, 0.876, ..., 0.432]
// (typically 1536 numbers for OpenAI embeddings)
```
Key insight: Similar concepts produce similar vectors.
```javascript
const vec1 = await embeddings.embedQuery("tritone substitution");
const vec2 = await embeddings.embedQuery("flat-five sub");
const vec3 = await embeddings.embedQuery("pizza recipe");

// vec1 and vec2 are close in vector space
// vec3 is far from both
```
### Step 2: Store Vectors in Database

A vector database stores these embeddings alongside their source text.

```javascript
import { MemoryVectorStore } from "langchain/vectorstores/memory";

const vectorStore = await MemoryVectorStore.fromTexts(
  [
    "Tritone substitution replaces V7 with bII7",
    "Modal interchange borrows chords from parallel modes",
    "Voice leading creates smooth melodic motion"
  ],
  [{ source: "harmony" }, { source: "harmony" }, { source: "melody" }],
  embeddings
);
```
### Step 3: Query by Similarity

Similarity search: Find the vectors closest to the query vector.

```javascript
const results = await vectorStore.similaritySearch(
  "What are common reharmonization techniques?",
  2 // return top 2 matches
);

// Returns:
// 1. "Tritone substitution replaces V7 with bII7"
// 2. "Modal interchange borrows chords from parallel modes"
// Note: Neither document contains "reharmonization"!
```
## Vector Similarity Explained

### Cosine Similarity

Measures the angle between two vectors. Range: -1 (opposite) to 1 (identical).

Similarity = (A · B) / (||A|| × ||B||)
Example:
- "jazz improvisation" ↔ "bebop soloing" ≈ 0.85 (very similar)
- "jazz improvisation" ↔ "classical counterpoint" ≈ 0.42 (somewhat related)
- "jazz improvisation" ↔ "car engine repair" ≈ 0.05 (unrelated)
### Distance Metrics

Common distance functions (the latter two are sketched after this list):
- Cosine similarity - Best for text (ignores magnitude, focuses on direction)
- Euclidean distance - Straight-line distance (used for some embeddings)
- Dot product - Fast but sensitive to vector magnitude
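Euclidean distance and dot product are equally short to implement — a sketch to complement the cosine function above:

```javascript
// Euclidean distance: straight-line distance; 0 means identical.
function euclideanDistance(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i];
    sum += d * d;
  }
  return Math.sqrt(sum);
}

// Dot product: fast (no square roots), but it grows with vector
// magnitude, so it only matches cosine ranking on normalized vectors.
function dotProduct(a, b) {
  let dot = 0;
  for (let i = 0; i < a.length; i++) dot += a[i] * b[i];
  return dot;
}
```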
## Vector Database Options

### In-Memory (Development)

LangChain MemoryVectorStore:

```javascript
const vectorStore = await MemoryVectorStore.fromDocuments(
  documents,
  embeddings
);
```
Pros: Simple, fast, no setup
Cons: Lost on restart, limited scale
### Production Databases

#### Pinecone

Cloud-hosted, managed service.

```javascript
import { PineconeStore } from "langchain/vectorstores/pinecone";

const vectorStore = await PineconeStore.fromDocuments(
  documents,
  embeddings,
  { pineconeIndex }
);
```
#### Weaviate

Open-source, self-hosted or cloud.

```javascript
import { WeaviateStore } from "langchain/vectorstores/weaviate";
```
#### Supabase (PostgreSQL + pgvector)

Use an existing Postgres database.

```javascript
import { SupabaseVectorStore } from "langchain/vectorstores/supabase";
```
#### Convex (Your Stack)

Store vectors in Convex tables with a custom similarity search.

```javascript
// Convex schema (convex/schema.ts)
import { defineSchema, defineTable } from "convex/server";
import { v } from "convex/values";

export default defineSchema({
  embeddings: defineTable({
    text: v.string(),
    vector: v.array(v.float64()),
    metadata: v.object({ ... })
  })
});

// Query with a custom function. Convex filters can't run arbitrary
// JavaScript such as cosineSimilarity, so rank candidates in the
// function body instead:
const docs = await ctx.db.query("embeddings").collect();
const results = docs
  .filter((doc) => cosineSimilarity(doc.vector, queryVector) > 0.7)
  .slice(0, 10);
```
## RAG (Retrieval-Augmented Generation)

Pattern: Enhance LLM responses with relevant data retrieved from a vector store.
### Basic RAG Flow

```
User Question
      ↓
Embed Query → Vector Search → Retrieve Top K Documents
      ↓
LLM (Question + Retrieved Docs)
      ↓
Answer
```
### Implementation

```javascript
import { RetrievalQAChain } from "langchain/chains";

const chain = RetrievalQAChain.fromLLM(
  model,
  vectorStore.asRetriever()
);

const result = await chain.invoke({
  query: "How does Coltrane use tritone substitutions in Giant Steps?"
});

// The chain automatically:
// 1. Embeds the question
// 2. Searches the vector store
// 3. Passes the relevant docs + question to the LLM
// 4. Returns a synthesized answer
```
### Advanced RAG: Conversational

Add memory for multi-turn conversations:

```javascript
import { ConversationalRetrievalQAChain } from "langchain/chains";
import { BufferMemory } from "langchain/memory";

const chain = ConversationalRetrievalQAChain.fromLLM(
  model,
  vectorStore.asRetriever(),
  {
    memory: new BufferMemory({
      memoryKey: "chat_history", // the key this chain expects
      returnMessages: true
    })
  }
);

// First question
await chain.call({ question: "What is modal interchange?" });

// Follow-up (uses conversation history)
await chain.call({ question: "Give me an example in Giant Steps" });
// The LLM resolves the follow-up against "modal interchange"
```
## Embeddings in Depth

### What Gets Embedded
Text chunks (documents):
- Song annotations
- Literature notes
- NNT scores (as text representation)
- Article sections
- Code comments
Optimal chunk size: 200-500 tokens
- Too small: Loses context
- Too large: Dilutes semantic meaning
Chunking strategy:
```javascript
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 500,
  chunkOverlap: 50 // maintain context between chunks
});

const chunks = await splitter.splitDocuments(documents);
```
### Embedding Models
OpenAI text-embedding-ada-002:
- 1536 dimensions
- Good general-purpose
- Cost: $0.0001 / 1K tokens
OpenAI text-embedding-3-small:
- 1536 dimensions
- Newer, better performance
- Cost: $0.00002 / 1K tokens
Open-source alternatives:
- sentence-transformers (Python)
- all-MiniLM-L6-v2 (lightweight)
- Custom fine-tuned models (for specialized domains like music theory)
## Vector Search vs Traditional Search

### When to Use Vector Search

✅ Vector search excels at:
- Semantic similarity - "Find songs about loss" (matches "grief", "heartbreak", "mourning")
- Multilingual - Embeddings work across languages
- Fuzzy concepts - "Songs with a dark, brooding atmosphere"
- No exact keywords - User can't articulate exact terms
Example: Music research
Query: "Examples of unexpected harmonic resolution"
Matches:
- "The chord progression avoided the expected cadence"
- "Deceptive cadence creates surprise"
- "Unresolved tension through non-functional harmony"
### When to Use Traditional Search (ripgrep)

✅ Traditional search (ripgrep) excels at:
- Exact matches - "Find all mentions of 'Cmaj7#11'"
- Pattern matching - "Find files containing /[A-G][#b]?maj7/"
- Speed - Milliseconds vs seconds for vector search
- No cost - No API calls or embedding generation
- Code search - Finding function names, variables, filenames
Example: Vault maintenance
Query: "Find all files with broken wiki links"
Use: rg '\[\[.*\|.*\]\]' (regex pattern matching)
### Hybrid Approach (Best Practice)

Combine both for optimal results (a glue sketch follows the two steps):

- Initial filter with ripgrep (fast, exact)

```bash
rg -l "Giant Steps" ~/vault/
# → returns 15 files
```

- Semantic search within the results (accurate, conceptual)

```javascript
// Filter syntax varies by store; MemoryVectorStore accepts a
// (doc) => boolean callback as the third argument.
const filtered = await vectorStore.similaritySearch(
  "analysis of harmonic movement in Giant Steps",
  4, // k
  (doc) => giantStepsFiles.includes(doc.metadata.file)
);
```
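A hypothetical way to wire the two steps together from Node.js: the `giantStepsFiles` array used in the filter above can come straight from ripgrep's output. `execSync`, the vault path, and the `metadata.file` field are illustrative assumptions, not a specific LangChain API.

```javascript
import { execSync } from "node:child_process";

// Exact-match pre-filter with ripgrep, which prints
// one matching file path per line.
const giantStepsFiles = execSync('rg -l "Giant Steps" ~/vault/')
  .toString()
  .trim()
  .split("\n");
```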
## Real-World Use Cases

### Use Case 1: Song Annotation Search
Your problem: Search across 100+ song annotation articles for "descending chromatic bass lines"
Solution:
```javascript
// Embed all song annotations
const songs = await loadSongArticles();

const vectorStore = await ConvexVectorStore.fromDocuments(
  songs,
  embeddings,
  { convex, table: "song_embeddings" }
);

// Semantic search
const results = await vectorStore.similaritySearch(
  "songs with chromatic descending bass motion",
  10 // return top 10 matches
);
```
Returns articles about:
- "I'll Remember April" (walking bass with chromatic approach)
- "Stairway to Heaven" (descending chromatic line in intro)
- "My Funny Valentine" (chromatic voice leading in bridge)
### Use Case 2: PhD Literature Review
Your problem: "Find all notes related to notation systems in contemporary music"
Solution:
```javascript
const literatureStore = await ConvexVectorStore.fromTexts(
  obsidianVaultNotes,
  {}, // shared metadata (fromTexts expects a metadata argument)
  embeddings
);

const papers = await literatureStore.similaritySearch(
  "contemporary music notation systems and graphic scores",
  20 // return top 20 matches
);
```
Finds notes even if they use different terminology:
- "Extended techniques notation"
- "Graphic scores in 20th century music"
- "Alternative notation methods"
### Use Case 3: NNT Query Interface
Your vision: "Natural language query for NNT scores"
Implementation:
```javascript
// User types: "Show me tritone substitutions in bebop tunes"
const nntAgent = await createAgent({
  tools: [
    nntVectorSearch, // semantic search of scores
    nntParser,       // parse NNT syntax
    convexQuery      // structured data queries
  ],
  llm: model
});

const result = await nntAgent.invoke({
  input: "Show me tritone substitutions in bebop tunes"
});

// The agent:
// 1. Searches the vector store for "tritone substitution" + "bebop"
// 2. Parses NNT scores to find bII7 chords
// 3. Queries Convex for tune metadata
// 4. Returns matching passages with playback links
```
## Performance and Cost Considerations

### Embedding Generation Costs
OpenAI text-embedding-3-small:
- $0.00002 per 1K tokens
- Average song annotation: ~500 tokens
- 1000 songs embedded: $0.01 (one cent!)
One-time vs incremental:
- Embed entire vault once: ~$1-5
- Incremental updates: pennies per day
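The arithmetic behind these estimates, as a quick sanity check (the price is the published rate quoted above; token counts are rough assumptions):

```javascript
// Back-of-envelope embedding cost for text-embedding-3-small.
const PRICE_PER_1K_TOKENS = 0.00002; // USD
const avgTokensPerDoc = 500;         // typical song annotation
const docCount = 1000;

const cost = (docCount * avgTokensPerDoc / 1000) * PRICE_PER_1K_TOKENS;
console.log(`~$${cost.toFixed(2)}`); // "~$0.01" — one cent
```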
### Query Performance
Vector search speed:
- In-memory (MemoryVectorStore): <50ms
- Pinecone (cloud): 50-200ms
- Convex (custom implementation): 100-500ms depending on index
Optimization:
- Pre-filter with metadata (e.g., artist, year)
- Cache popular queries
- Use approximate nearest neighbor (ANN) algorithms
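As an example of caching, repeated queries can skip the embedding API call entirely by memoizing query vectors. A minimal sketch, not a built-in LangChain feature:

```javascript
// Memoize query embeddings keyed by the query text.
const embeddingCache = new Map();

async function cachedEmbedQuery(text) {
  if (!embeddingCache.has(text)) {
    embeddingCache.set(text, await embeddings.embedQuery(text));
  }
  return embeddingCache.get(text);
}

// Search with the cached vector instead of re-embedding the query.
const queryVector = await cachedEmbedQuery("songs using modal interchange");
const results = await vectorStore.similaritySearchVectorWithScore(queryVector, 4);
```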
### Storage Requirements
Per document:
- Text: 1-10KB
- Embedding (1536 dimensions × 4 bytes): ~6KB
- Total: ~16KB per document
For 10,000 documents: ~160MB (tiny!)
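The same estimate in code (per-document sizes are the rough figures above):

```javascript
// Back-of-envelope storage estimate.
const dims = 1536;
const embeddingBytes = dims * 4;          // float32 → ~6 KB
const docBytes = embeddingBytes + 10_000; // plus up to ~10 KB of text
const totalMB = (10_000 * docBytes) / 1e6;
console.log(`~${Math.round(totalMB)} MB for 10,000 documents`); // ~161 MB
```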
## Getting Started with Vector Search

### Step 1: Choose Embedding Model

```bash
npm install @langchain/openai
```

```javascript
import { OpenAIEmbeddings } from "@langchain/openai";

const embeddings = new OpenAIEmbeddings({
  modelName: "text-embedding-3-small"
});
```
### Step 2: Prepare Documents

```javascript
import { Document } from "langchain/document";

const docs = songAnnotations.map(annotation =>
  new Document({
    pageContent: annotation.text,
    metadata: {
      title: annotation.title,
      artist: annotation.artist,
      section: annotation.section
    }
  })
);
```
### Step 3: Create Vector Store

```javascript
import { MemoryVectorStore } from "langchain/vectorstores/memory";

const vectorStore = await MemoryVectorStore.fromDocuments(
  docs,
  embeddings
);
```
### Step 4: Search

```javascript
const results = await vectorStore.similaritySearch(
  "jazz songs with modal interchange",
  4 // top 4 results
);

results.forEach(doc => {
  console.log(doc.metadata.title);
  console.log(doc.pageContent);
});
```
## See Also
- Langchain Overview - Framework for using vector stores
- Langchain with Convex - Storing vectors in your database
- ripgrep vs Vector Search - Choosing the right search tool
- Database Theory - Fundamentals of data storage