# Vector Databases
Vector databases store data as numerical vectors (embeddings) alongside the source content, enabling **semantic search** - finding content by meaning rather than exact keyword matches.
## Core Principle
**Traditional search (keyword-based):**
- "Find documents containing 'tritone substitution'"
- Returns only exact matches
- Misses synonyms, related concepts, paraphrases
**Vector search (semantic):**
- "Find documents about harmonic substitution techniques"
- Returns conceptually similar content even without exact keywords
- Matches: "flat-five sub", "altered dominant", "reharmonization"
## How Vector Search Works
### Step 1: Create Embeddings
**Embedding:** Convert text into a vector (list of numbers) that captures semantic meaning.
```javascript
import { OpenAIEmbeddings } from "langchain/embeddings/openai";
const embeddings = new OpenAIEmbeddings();
const text = "The tritone substitution uses a chord a tritone away";
const vector = await embeddings.embedQuery(text);
// Result: [0.023, -0.145, 0.876, ..., 0.432]
// (typically 1536 numbers for OpenAI embeddings)
```
**Key insight:** Similar concepts produce similar vectors.
```javascript
const vec1 = await embeddings.embedQuery("tritone substitution");
const vec2 = await embeddings.embedQuery("flat-five sub");
const vec3 = await embeddings.embedQuery("pizza recipe");
// vec1 and vec2 are close in vector space
// vec3 is far from both
```
### Step 2: Store Vectors in Database
**Vector database** stores these embeddings with their source text.
```javascript
import { MemoryVectorStore } from "langchain/vectorstores/memory";
const vectorStore = await MemoryVectorStore.fromTexts(
[
"Tritone substitution replaces V7 with bII7",
"Modal interchange borrows chords from parallel modes",
"Voice leading creates smooth melodic motion"
],
[{ source: "harmony" }, { source: "harmony" }, { source: "melody" }],
embeddings
);
```
### Step 3: Query by Similarity
**Similarity search:** Find vectors closest to query vector.
```javascript
const results = await vectorStore.similaritySearch(
  "What are common reharmonization techniques?",
  2 // return top 2 matches
);
// Returns:
// 1. "Tritone substitution replaces V7 with bII7"
// 2. "Modal interchange borrows chords from parallel modes"
// Note: Neither document contains "reharmonization"!
```
## Vector Similarity Explained
### Cosine Similarity
Measures the cosine of the angle between two vectors. Range: -1 (opposite) to 1 (identical).
```
Similarity = (A · B) / (||A|| × ||B||)
```
**Example:**
- "jazz improvisation" ↔ "bebop soloing" → 0.85 (very similar)
- "jazz improvisation" ↔ "classical counterpoint" → 0.42 (somewhat related)
- "jazz improvisation" ↔ "car engine repair" → 0.05 (unrelated)
### Distance Metrics
**Common distance functions:**
- **Cosine similarity** - Best for text (ignores magnitude, focuses on direction)
- **Euclidean distance** - Straight-line distance (used for some embeddings)
- **Dot product** - Fast but sensitive to vector magnitude
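Each of these metrics is only a few lines of arithmetic. A minimal sketch over plain number arrays (production vector stores use optimized, often approximate, versions of the same math):

```javascript
// Dot product: sum of pairwise products.
function dot(a, b) {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

// Vector magnitude (Euclidean norm).
function norm(a) {
  return Math.sqrt(dot(a, a));
}

// Cosine similarity: (A · B) / (||A|| × ||B||) - direction only, magnitude ignored.
function cosineSimilarity(a, b) {
  return dot(a, b) / (norm(a) * norm(b));
}

// Euclidean distance: straight-line distance between the two points.
function euclideanDistance(a, b) {
  return Math.sqrt(a.reduce((sum, x, i) => sum + (x - b[i]) ** 2, 0));
}

const a = [1, 0];
const b = [0.6, 0.8]; // both unit vectors, so cosine = dot product
console.log(cosineSimilarity(a, b)); // 0.6
console.log(euclideanDistance(a, b)); // ~0.894
```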
## Vector Database Options
### In-Memory (Development)
**LangChain MemoryVectorStore:**
```javascript
const vectorStore = await MemoryVectorStore.fromDocuments(
documents,
embeddings
);
```
**Pros:** Simple, fast, no setup
**Cons:** Lost on restart, limited scale
### Production Databases
#### Pinecone
Cloud-hosted, managed service
```javascript
import { PineconeStore } from "langchain/vectorstores/pinecone";
const vectorStore = await PineconeStore.fromDocuments(
documents,
embeddings,
{ pineconeIndex }
);
```
#### Weaviate
Open-source, self-hosted or cloud
```javascript
import { WeaviateStore } from "langchain/vectorstores/weaviate";
```
#### Supabase (PostgreSQL + pgvector)
Use existing Postgres database
```javascript
import { SupabaseVectorStore } from "langchain/vectorstores/supabase";
```
#### Convex (Your Stack)
Store vectors in Convex tables with custom similarity search
```javascript
// convex/schema.ts
import { defineSchema, defineTable } from "convex/server";
import { v } from "convex/values";

export default defineSchema({
  embeddings: defineTable({
    text: v.string(),
    vector: v.array(v.float64()),
    metadata: v.any() // shape left unspecified here
  })
});

// In a query function: Convex filters can't run arbitrary math,
// so collect candidates and rank them by similarity in code.
const rows = await ctx.db.query("embeddings").collect();
const results = rows
  .map(row => ({ ...row, score: cosineSimilarity(row.vector, queryVector) }))
  .filter(r => r.score > 0.7)
  .sort((a, b) => b.score - a.score)
  .slice(0, 10);
```
## RAG (Retrieval Augmented Generation)
**Pattern:** Enhance LLM responses with relevant data from vector store.
### Basic RAG Flow
```
User Question
↓
Embed Query → Vector Search → Retrieve Top K Documents
↓
LLM (Question + Retrieved Docs)
↓
Answer
```
### Implementation
```javascript
import { RetrievalQAChain } from "langchain/chains";
const chain = RetrievalQAChain.fromLLM(
model,
vectorStore.asRetriever()
);
const result = await chain.invoke({
query: "How does Coltrane use tritone substitutions in Giant Steps?"
});
// Chain automatically:
// 1. Embeds the question
// 2. Searches vector store
// 3. Passes relevant docs + question to LLM
// 4. Returns synthesized answer
```
### Advanced RAG: Conversational
**Add memory for multi-turn conversations:**
```javascript
import { ConversationalRetrievalQAChain } from "langchain/chains";
const chain = ConversationalRetrievalQAChain.fromLLM(
model,
vectorStore.asRetriever(),
{ memory: new BufferMemory() }
);
// First question
await chain.call({ question: "What is modal interchange?" });
// Follow-up (uses conversation history)
await chain.call({ question: "Give me an example in Giant Steps" });
// Chain uses the stored history to know the follow-up is about modal interchange
```
## Embeddings in Depth
### What Gets Embedded
**Text chunks (documents):**
- Song annotations
- Literature notes
- NNT scores (as text representation)
- Article sections
- Code comments
**Optimal chunk size:** 200-500 tokens
- Too small: Loses context
- Too large: Dilutes semantic meaning
**Chunking strategy:**
```javascript
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
const splitter = new RecursiveCharacterTextSplitter({
chunkSize: 500,
chunkOverlap: 50 // Maintain context between chunks
});
const chunks = await splitter.splitDocuments(documents);
```
### Embedding Models
**OpenAI text-embedding-ada-002:**
- 1536 dimensions
- Good general-purpose
- Cost: $0.0001 / 1K tokens
**OpenAI text-embedding-3-small:**
- 1536 dimensions
- Newer, better performance
- Cost: $0.00002 / 1K tokens
**Open-source alternatives:**
- sentence-transformers (Python)
- all-MiniLM-L6-v2 (lightweight)
- Custom fine-tuned models (for specialized domains like music theory)
## Vector Search vs Traditional Search
### When to Use Vector Search
**✅ Vector search excels at:**
- **Semantic similarity** - "Find songs about loss" (matches "grief", "heartbreak", "mourning")
- **Multilingual** - Embeddings work across languages
- **Fuzzy concepts** - "Songs with a dark, brooding atmosphere"
- **No exact keywords** - User can't articulate exact terms
**Example: Music research**
Query: "Examples of unexpected harmonic resolution"
Matches:
- "The chord progression avoided the expected cadence"
- "Deceptive cadence creates surprise"
- "Unresolved tension through non-functional harmony"
### When to Use Traditional Search (ripgrep)
**✅ Traditional search (ripgrep) excels at:**
- **Exact matches** - "Find all mentions of 'Cmaj7#11'"
- **Pattern matching** - "Find files containing /[A-G][#b]?maj7/"
- **Speed** - Milliseconds vs seconds for vector search
- **No cost** - No API calls or embedding generation
- **Code search** - Finding function names, variables, filenames
**Example: Vault maintenance**
Query: "Find all aliased wiki links to audit"
Use: `rg '\[\[.*\|.*\]\]'` (regex matching `[[target|alias]]` patterns)
### Hybrid Approach (Best Practice)
**Combine both for optimal results:**
1. **Initial filter with ripgrep** (fast, exact)
```bash
rg -l "Giant Steps" ~/vault/
# → Returns 15 files
```
2. **Semantic search within results** (accurate, conceptual)
```javascript
const filtered = await vectorStore.similaritySearch(
  "analysis of harmonic movement in Giant Steps",
  10, // top k
  (doc) => giantStepsFiles.includes(doc.metadata.file) // filter form varies by store
);
```
## Real-World Use Cases
### Use Case 1: Song Annotation Search
**Your problem:** Search across 100+ song annotation articles for "descending chromatic bass lines"
**Solution:**
```javascript
// Embed all song annotations
const songs = await loadSongArticles();
const vectorStore = await ConvexVectorStore.fromDocuments(
songs,
embeddings,
{ convex, table: "song_embeddings" }
);
// Semantic search
const results = await vectorStore.similaritySearch(
  "songs with chromatic descending bass motion",
  10 // return top 10 matches
);
```
**Returns articles about:**
- "I'll Remember April" (walking bass with chromatic approach)
- "Stairway to Heaven" (descending chromatic line in intro)
- "My Funny Valentine" (chromatic voice leading in bridge)
### Use Case 2: PhD Literature Review
**Your problem:** "Find all notes related to notation systems in contemporary music"
**Solution:**
```javascript
const literatureStore = await ConvexVectorStore.fromTexts(
  obsidianVaultNotes,
  obsidianVaultNotes.map(() => ({ source: "vault" })), // fromTexts needs per-text metadata
  embeddings,
  { ctx } // store-specific config; ConvexVectorStore runs inside a Convex action
);
const papers = await literatureStore.similaritySearch(
"contemporary music notation systems and graphic scores",
k: 20
);
```
**Finds notes even if they use different terminology:**
- "Extended techniques notation"
- "Graphic scores in 20th century music"
- "Alternative notation methods"
### Use Case 3: NNT Query Interface
**Your vision:** "Natural language query for NNT scores"
**Implementation:**
```javascript
// User types: "Show me tritone substitutions in bebop tunes"
// (sketch only: createAgent and these tool names are illustrative)
const nntAgent = await createAgent({
tools: [
nntVectorSearch, // Semantic search of scores
nntParser, // Parse NNT syntax
convexQuery // Structured data queries
],
llm: model
});
const result = await nntAgent.invoke({
input: "Show me tritone substitutions in bebop tunes"
});
// Agent:
// 1. Searches vector store for "tritone substitution" + "bebop"
// 2. Parses NNT scores to find bII7 chords
// 3. Queries Convex for tune metadata
// 4. Returns matching passages with playback links
```
## Performance and Cost Considerations
### Embedding Generation Costs
**OpenAI text-embedding-3-small:**
- $0.00002 per 1K tokens
- Average song annotation: ~500 tokens
- 1000 songs embedded: $0.01 (one cent!)
**One-time vs incremental:**
- Embed entire vault once: ~$1-5
- Incremental updates: pennies per day
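The one-cent figure above can be checked directly (assuming the ~500-token average and the text-embedding-3-small rate quoted above):

```javascript
// Cost of embedding 1000 song annotations with text-embedding-3-small.
const docs = 1000;
const tokensPerDoc = 500;          // rough average for a song annotation
const pricePer1kTokens = 0.00002;  // text-embedding-3-small rate

const totalTokens = docs * tokensPerDoc;              // 500,000 tokens
const cost = (totalTokens / 1000) * pricePer1kTokens; // ≈ 0.01
console.log(`$${cost.toFixed(2)}`); // "$0.01" - one cent for a thousand documents
```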
### Query Performance
**Vector search speed:**
- In-memory (MemoryVectorStore): <50ms
- Pinecone (cloud): 50-200ms
- Convex (custom implementation): 100-500ms depending on index
**Optimization:**
- Pre-filter with metadata (e.g., artist, year)
- Cache popular queries
- Use approximate nearest neighbor (ANN) algorithms
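The first optimization - metadata pre-filtering - can be sketched without any library: narrow the candidate set with a cheap field comparison first, then run the expensive similarity ranking only on the survivors. Field names and the toy 2-D vectors here are illustrative:

```javascript
// Cosine similarity over plain number arrays.
function cosineSimilarity(a, b) {
  const dot = a.reduce((s, x, i) => s + x * b[i], 0);
  const norm = v => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return dot / (norm(a) * norm(b));
}

// Pre-filter by metadata, then rank remaining rows by similarity.
function search(rows, queryVector, { artist, k = 10 }) {
  return rows
    .filter(row => !artist || row.metadata.artist === artist) // cheap metadata filter
    .map(row => ({ ...row, score: cosineSimilarity(row.vector, queryVector) }))
    .sort((a, b) => b.score - a.score)                        // highest similarity first
    .slice(0, k);
}

const rows = [
  { text: "Giant Steps solo analysis", vector: [1, 0],     metadata: { artist: "Coltrane" } },
  { text: "So What modal harmony",     vector: [0, 1],     metadata: { artist: "Davis" } },
  { text: "Naima ballad voicings",     vector: [0.9, 0.1], metadata: { artist: "Coltrane" } }
];

console.log(search(rows, [1, 0], { artist: "Coltrane", k: 1 })[0].text);
// "Giant Steps solo analysis"
```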
### Storage Requirements
**Per document:**
- Text: 1-10KB
- Embedding (1536 dimensions × 4 bytes): ~6KB
- Total: ~16KB per document
**For 10,000 documents:** ~160MB (tiny!)
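Spelled out (assuming float32 storage and the generous 10KB text figure above):

```javascript
// Back-of-envelope storage estimate for 10,000 embedded documents.
const dims = 1536;           // OpenAI embedding dimensions
const bytesPerFloat = 4;     // float32
const embeddingBytes = dims * bytesPerFloat;  // 6144 bytes ≈ 6KB
const textBytes = 10 * 1024; // generous 10KB of source text per document

const perDoc = embeddingBytes + textBytes;    // ≈ 16KB
const totalMB = (perDoc * 10_000) / (1024 * 1024);
console.log(`${totalMB.toFixed(0)} MB`); // "156 MB" - roughly the ~160MB above
```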
## Getting Started with Vector Search
### Step 1: Choose Embedding Model
```bash
npm install @langchain/openai
```
```javascript
import { OpenAIEmbeddings } from "@langchain/openai";
const embeddings = new OpenAIEmbeddings({
modelName: "text-embedding-3-small"
});
```
### Step 2: Prepare Documents
```javascript
import { Document } from "langchain/document";
const docs = songAnnotations.map(annotation =>
new Document({
pageContent: annotation.text,
metadata: {
title: annotation.title,
artist: annotation.artist,
section: annotation.section
}
})
);
```
### Step 3: Create Vector Store
```javascript
import { MemoryVectorStore } from "langchain/vectorstores/memory";
const vectorStore = await MemoryVectorStore.fromDocuments(
docs,
embeddings
);
```
### Step 4: Search
```javascript
const results = await vectorStore.similaritySearch(
"jazz songs with modal interchange",
4 // top 4 results
);
results.forEach(doc => {
console.log(doc.metadata.title);
console.log(doc.pageContent);
});
```
## See Also
- [[Langchain Overview]] - Framework for using vector stores
- [[Langchain with Convex]] - Storing vectors in your database
- [[ripgrep vs Vector Search]] - Choosing the right search tool
- [[Database Theory]] - Fundamentals of data storage