# Semantic Search for GPS Objects
## Problem Statement
Challenge: Traditional GPS searches rely on exact matches (coordinates, addresses, names) or rigid filters (tags, date ranges). How can you search for recordings using natural language and semantic meaning?
Example queries you want to support:
- "Find recordings with heavy bass near the border"
- "Venues with intimate atmosphere and live jazz"
- "Water infrastructure showing signs of corrosion"
- "Monitoring sites with elevated contamination readings"
Requirements:
- Natural language query understanding
- Semantic similarity matching (not just keyword search)
- Hybrid search (combine semantic + spatial + metadata filters)
- Fast performance (sub-second response for 10k+ objects)
- Integration with existing Convex architecture
- Support for follow-up questions and refinement
Solution: Langchain.js + OpenAI embeddings + Convex vector storage + hybrid query patterns.
## Architecture
```
┌──────────────────────────────────────────────────────┐
│ User Query: "Find jazz recordings near border"       │
└──────────────────────────────────────────────────────┘
                           ▼
┌──────────────────────────────────────────────────────┐
│ Query Processing (Langchain.js)                      │
│   • Generate embedding via OpenAI                    │
│   • Extract filters (location, date, tags)           │
└──────────────────────────────────────────────────────┘
                           ▼
┌──────────────────────────────────────────────────────┐
│ Hybrid Search (Convex)                               │
│   • Vector similarity search (semantic)              │
│   • Spatial filtering (near border = bbox)           │
│   • Metadata filtering (tags, date range)            │
│   • Combine scores and rank results                  │
└──────────────────────────────────────────────────────┘
                           ▼
┌──────────────────────────────────────────────────────┐
│ Results + Explanation                                │
│   • Top matches with similarity scores               │
│   • Why each result matched                          │
│   • Map visualization with highlighted results       │
└──────────────────────────────────────────────────────┘
```
## Embedding Strategy

### What to Embed
For each GPS object, concatenate these fields into searchable text:
```typescript
function buildSearchText(object: GPSObject): string {
  const parts = [
    object.name,
    object.description || "",
    object.tags.join(" "),
    JSON.stringify(object.properties)
  ];
  return parts.filter(p => p).join(" ");
}
```
Example for field recording:
```
"Tijuana Jazz Club - Friday Night Set Live jazz trio performance, intimate atmosphere jazz live tijuana trio {recording_date: '2025-11-01T21:30:00Z', equipment: 'Zoom H6, Rode NTG3', weather: 'Clear, 18°C, light winds', notes: 'Live jazz trio, moderate audience noise'}"
```
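The concatenation above can be exercised in isolation with a toy object. The minimal `GPSObjectLike` shape below is an illustrative stand-in for the full `GPSObject` type:

```typescript
// Runnable sketch of buildSearchText with a minimal object shape.
type GPSObjectLike = {
  name: string;
  description?: string;
  tags: string[];
  properties: Record<string, unknown>;
};

function buildSearchText(object: GPSObjectLike): string {
  const parts = [
    object.name,
    object.description || "",
    object.tags.join(" "),
    JSON.stringify(object.properties)
  ];
  // filter(p => p) drops empty fields so they don't add stray spaces
  return parts.filter(p => p).join(" ");
}

const text = buildSearchText({
  name: "Tijuana Jazz Club - Friday Night Set",
  tags: ["jazz", "live"],
  properties: { equipment: "Zoom H6" }
});
// text: 'Tijuana Jazz Club - Friday Night Set jazz live {"equipment":"Zoom H6"}'
```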
### Generating Embeddings
```typescript
// convex/embeddings.ts
import { OpenAI } from "openai";
import { internalMutation } from "./_generated/server";
import { v } from "convex/values";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Caveat: Convex mutations cannot make network calls. In practice, call the
// OpenAI API from an (internal) action and write the result via a mutation;
// the full flow is shown in one handler here for readability.
export const generateEmbedding = internalMutation({
  args: { objectId: v.id("gpsObjects") },
  handler: async (ctx, args) => {
    // Fetch GPS object
    const object = await ctx.db.get(args.objectId);
    if (!object) return;
    const dataset = await ctx.db.get(object.datasetId);
    if (!dataset) return;

    // Build searchable text
    const searchText = [
      object.name,
      object.description || "",
      object.tags.join(" "),
      JSON.stringify(object.properties)
    ].join(" ");

    // Generate embedding
    const response = await openai.embeddings.create({
      model: "text-embedding-3-small", // 1536 dimensions
      input: searchText
    });
    const embedding = response.data[0].embedding;

    // Extract coordinates for spatial queries
    let coordinates: [number, number] = [0, 0];
    if (object.geometry.type === "Point") {
      coordinates = object.geometry.coordinates as [number, number];
    } else if (object.geometry.type === "Polygon") {
      // Use centroid
      coordinates = calculateCentroid(object.geometry.coordinates);
    }

    // Store embedding
    await ctx.db.insert("gpsObjectEmbeddings", {
      objectId: args.objectId,
      datasetId: object.datasetId,
      embedding,
      searchText,
      metadata: {
        name: object.name,
        datasetType: dataset.datasetType,
        coordinates,
        tags: object.tags
      },
      createdAt: Date.now()
    });

    console.log(`✓ Generated embedding for: ${object.name}`);
  }
});
```
### Batch Embedding Generation
```typescript
// scripts/generate-embeddings.ts
import { ConvexHttpClient } from "convex/browser";
import { api } from "../convex/_generated/api";

async function generateAllEmbeddings(datasetId: string) {
  const client = new ConvexHttpClient(process.env.CONVEX_URL!);

  // Get all objects without embeddings
  const objects = await client.query(api.gpsObjects.getObjectsWithoutEmbeddings, {
    datasetId
  });

  console.log(`Generating embeddings for ${objects.length} objects...`);

  // Process in batches of 10 (rate limiting)
  for (let i = 0; i < objects.length; i += 10) {
    const batch = objects.slice(i, i + 10);

    // Note: generateEmbedding must be exposed publicly (not internal) to be
    // callable from a client script like this one.
    await Promise.all(
      batch.map(obj =>
        client.mutation(api.embeddings.generateEmbedding, { objectId: obj._id })
      )
    );

    console.log(`Progress: ${i + batch.length}/${objects.length}`);

    // Rate limiting: wait 1 second between batches
    await new Promise(resolve => setTimeout(resolve, 1000));
  }

  console.log("✓ All embeddings generated");
}
```
## Basic Semantic Search

### Query Implementation
```typescript
// convex/search.ts
import { query } from "./_generated/server";
import { v } from "convex/values";
import { OpenAI } from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Caveat: Convex queries cannot make network calls. In practice, generate the
// query embedding in an action (or pass it in as an argument) and keep only
// the ranking logic in the query; shown in one handler for readability.
export const semanticSearch = query({
  args: {
    query: v.string(),
    datasetId: v.optional(v.id("gpsDatasets")),
    limit: v.optional(v.number())
  },
  handler: async (ctx, args) => {
    // Generate query embedding
    const response = await openai.embeddings.create({
      model: "text-embedding-3-small",
      input: args.query
    });
    const queryEmbedding = response.data[0].embedding;

    // Fetch embeddings (optionally filtered by dataset)
    let embeddingsQuery = ctx.db.query("gpsObjectEmbeddings");
    if (args.datasetId) {
      embeddingsQuery = embeddingsQuery.withIndex("by_dataset", q =>
        q.eq("datasetId", args.datasetId)
      );
    }
    const allEmbeddings = await embeddingsQuery.collect();

    // Calculate cosine similarity
    const withScores = allEmbeddings.map(doc => ({
      ...doc,
      similarity: cosineSimilarity(queryEmbedding, doc.embedding)
    }));

    // Sort by similarity and take top results
    const topResults = withScores
      .sort((a, b) => b.similarity - a.similarity)
      .slice(0, args.limit ?? 10);

    // Fetch full GPS objects
    const objects = await Promise.all(
      topResults.map(r => ctx.db.get(r.objectId))
    );

    return topResults.map((result, i) => ({
      object: objects[i],
      similarity: result.similarity,
      matchedText: result.searchText.slice(0, 200) + "..." // Preview
    }));
  }
});

// Cosine similarity helper
function cosineSimilarity(a: number[], b: number[]): number {
  const dotProduct = a.reduce((sum, val, i) => sum + val * b[i], 0);
  const magnitudeA = Math.sqrt(a.reduce((sum, val) => sum + val * val, 0));
  const magnitudeB = Math.sqrt(b.reduce((sum, val) => sum + val * val, 0));
  return dotProduct / (magnitudeA * magnitudeB);
}
```
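The cosine similarity helper is easy to sanity-check in isolation; the expected boundary values are 1 for identical directions, 0 for orthogonal vectors, and -1 for opposite directions. The helper is repeated here so the snippet runs standalone:

```typescript
// Standalone sanity check for cosine similarity.
function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, val, i) => sum + val * b[i], 0);
  const magA = Math.sqrt(a.reduce((sum, val) => sum + val * val, 0));
  const magB = Math.sqrt(b.reduce((sum, val) => sum + val * val, 0));
  return dot / (magA * magB);
}

const identical = cosineSimilarity([1, 2, 3], [1, 2, 3]); // ≈ 1 (floating point)
const orthogonal = cosineSimilarity([1, 0], [0, 1]);      // → 0
const opposite = cosineSimilarity([1, 0], [-1, 0]);       // → -1
```

Note that raw OpenAI embeddings are already unit length, so the dot product alone would give the same ranking; the full formula is kept for generality.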
### Example Usage
```tsx
// In React component
import { useState } from "react";
import { useQuery } from "convex/react";
import { api } from "../convex/_generated/api";

function SearchComponent() {
  const [query, setQuery] = useState("");

  const results = useQuery(api.search.semanticSearch, {
    query,
    limit: 20
  });

  return (
    <div>
      <input
        type="text"
        value={query}
        onChange={(e) => setQuery(e.target.value)}
        placeholder="Find recordings with heavy bass near the border"
      />
      {results?.map(result => (
        <div key={result.object._id}>
          <h3>{result.object.name}</h3>
          <p>Similarity: {(result.similarity * 100).toFixed(1)}%</p>
          <p>{result.matchedText}</p>
        </div>
      ))}
    </div>
  );
}
```
## Hybrid Search: Semantic + Spatial + Filters

### Advanced Query with Multiple Dimensions
```typescript
// convex/search.ts
export const hybridSearch = query({
  args: {
    query: v.string(),
    datasetId: v.optional(v.id("gpsDatasets")),

    // Spatial filters
    nearLocation: v.optional(v.object({
      lat: v.number(),
      lng: v.number(),
      radiusKm: v.number()
    })),

    // Metadata filters
    tags: v.optional(v.array(v.string())),
    dateRange: v.optional(v.object({
      start: v.number(),
      end: v.number()
    })),

    // Ranking weights
    weights: v.optional(v.object({
      semantic: v.number(), // Default: 0.7
      spatial: v.number(),  // Default: 0.2
      recency: v.number()   // Default: 0.1
    })),

    limit: v.optional(v.number())
  },
  handler: async (ctx, args) => {
    const weights = args.weights ?? { semantic: 0.7, spatial: 0.2, recency: 0.1 };

    // 1. Generate query embedding
    const response = await openai.embeddings.create({
      model: "text-embedding-3-small",
      input: args.query
    });
    const queryEmbedding = response.data[0].embedding;

    // 2. Fetch embeddings with initial filters
    let embeddingsQuery = ctx.db.query("gpsObjectEmbeddings");
    if (args.datasetId) {
      embeddingsQuery = embeddingsQuery.withIndex("by_dataset", q =>
        q.eq("datasetId", args.datasetId)
      );
    }
    let embeddings = await embeddingsQuery.collect();

    // 3. Apply metadata filters (dateRange filtering would be applied here
    //    in the same way, against each object's timestamp)
    if (args.tags && args.tags.length > 0) {
      embeddings = embeddings.filter(emb =>
        args.tags!.some(tag => emb.metadata.tags.includes(tag))
      );
    }

    // 4. Calculate semantic similarity
    const withSemanticScores = embeddings.map(emb => ({
      ...emb,
      semanticScore: cosineSimilarity(queryEmbedding, emb.embedding)
    }));

    // 5. Calculate spatial proximity score (if location provided)
    const withSpatialScores = withSemanticScores.map(emb => {
      let spatialScore = 1.0; // Max score if no location filter

      if (args.nearLocation) {
        const distance = haversineDistance(
          args.nearLocation.lat,
          args.nearLocation.lng,
          emb.metadata.coordinates[1], // lat
          emb.metadata.coordinates[0]  // lng
        );

        // Normalize: 0 km = 1.0, radiusKm = 0.0
        spatialScore = Math.max(0, 1 - distance / args.nearLocation.radiusKm);
      }

      return { ...emb, spatialScore };
    });

    // 6. Calculate recency score
    const now = Date.now();
    const withRecencyScores = withSpatialScores.map(emb => {
      const ageMs = now - emb.createdAt;
      const ageDays = ageMs / (1000 * 60 * 60 * 24);
      // Normalize: 0 days = 1.0, 365 days = 0.0
      const recencyScore = Math.max(0, 1 - ageDays / 365);
      return { ...emb, recencyScore };
    });

    // 7. Calculate weighted composite score
    const withCompositeScores = withRecencyScores.map(emb => ({
      ...emb,
      compositeScore:
        emb.semanticScore * weights.semantic +
        emb.spatialScore * weights.spatial +
        emb.recencyScore * weights.recency
    }));

    // 8. Sort by composite score
    const topResults = withCompositeScores
      .sort((a, b) => b.compositeScore - a.compositeScore)
      .slice(0, args.limit ?? 10);

    // 9. Fetch full GPS objects
    const objects = await Promise.all(
      topResults.map(r => ctx.db.get(r.objectId))
    );

    return topResults.map((result, i) => ({
      object: objects[i],
      scores: {
        composite: result.compositeScore,
        semantic: result.semanticScore,
        spatial: result.spatialScore,
        recency: result.recencyScore
      },
      explanation: buildExplanation(result, args.query)
    }));
  }
});

// Helper: Build human-readable explanation
function buildExplanation(result: any, query: string): string {
  const parts = [];
  parts.push(`Semantic match: ${(result.semanticScore * 100).toFixed(0)}%`);
  if (result.spatialScore < 1.0) {
    parts.push(`Location relevance: ${(result.spatialScore * 100).toFixed(0)}%`);
  }
  parts.push(`Recency: ${(result.recencyScore * 100).toFixed(0)}%`);
  return parts.join(" • ");
}

// Haversine distance formula
function haversineDistance(
  lat1: number, lon1: number,
  lat2: number, lon2: number
): number {
  const R = 6371; // Earth's radius in km
  const dLat = toRadians(lat2 - lat1);
  const dLon = toRadians(lon2 - lon1);
  const a =
    Math.sin(dLat / 2) * Math.sin(dLat / 2) +
    Math.cos(toRadians(lat1)) * Math.cos(toRadians(lat2)) *
    Math.sin(dLon / 2) * Math.sin(dLon / 2);
  const c = 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
  return R * c;
}

function toRadians(degrees: number): number {
  return degrees * (Math.PI / 180);
}
```
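The haversine helper can be sanity-checked against a known quantity: one degree of latitude is roughly 111.2 km anywhere on Earth. The helper is repeated here so the snippet runs on its own:

```typescript
// Standalone sanity check for the haversine distance helper.
function toRadians(degrees: number): number {
  return degrees * (Math.PI / 180);
}

function haversineDistance(
  lat1: number, lon1: number,
  lat2: number, lon2: number
): number {
  const R = 6371; // Earth's radius in km
  const dLat = toRadians(lat2 - lat1);
  const dLon = toRadians(lon2 - lon1);
  const a =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRadians(lat1)) * Math.cos(toRadians(lat2)) * Math.sin(dLon / 2) ** 2;
  return R * 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
}

// One degree of latitude along the prime meridian:
const oneDegreeLat = haversineDistance(0, 0, 1, 0); // ≈ 111.19 km
```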
## Advanced Query Patterns

### Pattern 1: Natural Language with Extracted Filters
User query: "Find jazz recordings near the border from last month"
Query processing:
```typescript
// Use LLM to extract structured filters
const extractedFilters = await extractFilters(userQuery);
// Result:
// {
//   semantic_query: "jazz recordings",
//   location: { lat: 32.5332, lng: -117.0363, radiusKm: 5 },
//   date_range: { start: lastMonthStart, end: now },
//   tags: ["jazz"]
// }

const results = await hybridSearch({
  query: extractedFilters.semantic_query,
  nearLocation: extractedFilters.location,
  dateRange: extractedFilters.date_range,
  tags: extractedFilters.tags
});
```
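`extractFilters` is left abstract above; in practice it would be an LLM call with a structured-output schema. As a rough non-LLM stand-in, keyword heuristics can produce the same shape. The place table and tag list below are illustrative assumptions, and date handling is omitted:

```typescript
// Minimal heuristic stand-in for an LLM-based filter extractor.
const KNOWN_PLACES: Record<string, { lat: number; lng: number; radiusKm: number }> = {
  "the border": { lat: 32.5332, lng: -117.0363, radiusKm: 5 } // assumed coordinates
};
const KNOWN_TAGS = ["jazz", "ambient", "field"];

function extractFilters(query: string) {
  const lower = query.toLowerCase();

  // Tags: any known tag word present in the query
  const tags = KNOWN_TAGS.filter(t => lower.includes(t));

  // Location: look for "near <known place>"
  let location: { lat: number; lng: number; radiusKm: number } | undefined;
  for (const [name, loc] of Object.entries(KNOWN_PLACES)) {
    if (lower.includes(`near ${name}`)) location = loc;
  }

  // Strip recognized filter phrases; what remains is the semantic part.
  // Crude string surgery — an LLM extractor handles phrasing robustly.
  const semantic_query = lower
    .replace(/near the border/g, "")
    .replace(/from last month/g, "")
    .replace(/find/g, "")
    .trim()
    .replace(/\s+/g, " ");

  return { semantic_query, location, tags };
}

const parsed = extractFilters("Find jazz recordings near the border from last month");
// parsed.semantic_query → "jazz recordings", parsed.tags → ["jazz"]
```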
### Pattern 2: Follow-Up Refinement
Conversation:
User: "Find field recordings with heavy bass"
Agent: [Returns 10 results]
User: "Show only the ones near downtown Tijuana"
Agent: [Refines with spatial filter]
User: "From the last 3 months"
Agent: [Adds date range filter]
Implementation:
```typescript
// Maintain conversation state
const [conversationState, setConversationState] = useState({
  baseQuery: "",
  filters: {}
});

// Update filters incrementally
function refineSearch(refinement: string) {
  const newFilters = extractRefinement(refinement, conversationState);
  const mergedFilters = { ...conversationState.filters, ...newFilters };

  setConversationState({
    ...conversationState,
    filters: mergedFilters
  });

  // Re-run search with the merged filters (not the React state, which
  // hasn't updated yet at this point)
  hybridSearch({
    query: conversationState.baseQuery,
    ...mergedFilters
  });
}
```
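The incremental merge itself can be exercised as a pure function, independent of React. The types below are illustrative, matching the filter shapes used by `hybridSearch`:

```typescript
// Sketch of incremental filter refinement: each follow-up is parsed into a
// partial filter object and shallow-merged into the conversation state.
type SearchFilters = {
  nearLocation?: { lat: number; lng: number; radiusKm: number };
  dateRange?: { start: number; end: number };
  tags?: string[];
};

type ConversationState = { baseQuery: string; filters: SearchFilters };

function applyRefinement(state: ConversationState, partial: SearchFilters): ConversationState {
  return { ...state, filters: { ...state.filters, ...partial } };
}

let state: ConversationState = {
  baseQuery: "field recordings with heavy bass",
  filters: {}
};

// "Show only the ones near downtown Tijuana" (assumed coordinates)
state = applyRefinement(state, {
  nearLocation: { lat: 32.53, lng: -117.04, radiusKm: 3 }
});

// "From the last 3 months"
state = applyRefinement(state, {
  dateRange: { start: Date.now() - 90 * 86400e3, end: Date.now() }
});
// state now carries both filters while baseQuery is unchanged
```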
### Pattern 3: Similar Objects
User action: Clicks "Find similar" on a GPS object
Implementation:
```typescript
export const findSimilar = query({
  args: {
    objectId: v.id("gpsObjects"),
    limit: v.optional(v.number())
  },
  handler: async (ctx, args) => {
    // Get embedding for the object
    const embedding = await ctx.db
      .query("gpsObjectEmbeddings")
      .withIndex("by_object", q => q.eq("objectId", args.objectId))
      .first();

    if (!embedding) return [];

    // Find similar embeddings
    const allEmbeddings = await ctx.db
      .query("gpsObjectEmbeddings")
      .collect();

    const withScores = allEmbeddings
      .filter(emb => emb.objectId !== args.objectId) // Exclude self
      .map(emb => ({
        ...emb,
        similarity: cosineSimilarity(embedding.embedding, emb.embedding)
      }));

    const topResults = withScores
      .sort((a, b) => b.similarity - a.similarity)
      .slice(0, args.limit ?? 10);

    const objects = await Promise.all(
      topResults.map(r => ctx.db.get(r.objectId))
    );

    return topResults.map((r, i) => ({
      object: objects[i],
      similarity: r.similarity
    }));
  }
});
```
## Integration with Langchain.js

### RAG Pattern: Retrieve + Generate
Use case: Answer questions about your GPS dataset using LLM
```typescript
import { ChatOpenAI } from "@langchain/openai";
import { PromptTemplate } from "@langchain/core/prompts";
import { api } from "./_generated/api";

// Caveat: LLM calls require a Convex action; written query-style here for
// readability.
export const answerQuestion = query({
  args: {
    question: v.string(),
    datasetId: v.optional(v.id("gpsDatasets"))
  },
  handler: async (ctx, args) => {
    // 1. Retrieve relevant GPS objects
    const results = await ctx.runQuery(api.search.semanticSearch, {
      query: args.question,
      datasetId: args.datasetId,
      limit: 5
    });

    // 2. Build context from results
    const context = results
      .map(r => `
Name: ${r.object.name}
Description: ${r.object.description}
Properties: ${JSON.stringify(r.object.properties, null, 2)}
`)
      .join("\n\n---\n\n");

    // 3. Generate answer with LLM
    const llm = new ChatOpenAI({
      modelName: "gpt-4-turbo-preview",
      temperature: 0
    });

    const prompt = PromptTemplate.fromTemplate(`
You are an expert in analyzing GPS-tagged field recordings and geospatial data.

Context (GPS objects from database):
{context}

User question: {question}

Provide a detailed answer based ONLY on the context above.
If the context doesn't contain enough information, say so.
`);

    const chain = prompt.pipe(llm);
    const response = await chain.invoke({
      context,
      question: args.question
    });

    return {
      answer: response.content,
      sources: results.map(r => ({
        name: r.object.name,
        similarity: r.similarity
      }))
    };
  }
});
```
Example usage:
```typescript
const result = await answerQuestion({
  question: "What venues have the best acoustics for recording jazz?",
  datasetId: "music-venues-tijuana"
});

console.log(result.answer);
// "Based on the field recordings database, Tijuana Jazz Club shows
//  the best acoustics for jazz recording due to its intimate 150-person
//  capacity, sound-treated walls, and minimal ambient noise..."

console.log(result.sources);
// [
//   { name: "Tijuana Jazz Club - Friday Night Set", similarity: 0.89 },
//   { name: "ENTONO Live Music - Acoustic Session", similarity: 0.82 }
// ]
```
## Performance Optimization

### Caching Embeddings
```typescript
// Cache query embeddings for common searches
// (unbounded Map shown for brevity; cap its size or use an LRU in production)
const embeddingCache = new Map<string, number[]>();

async function getCachedEmbedding(text: string): Promise<number[]> {
  if (embeddingCache.has(text)) {
    return embeddingCache.get(text)!;
  }

  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text
  });

  const embedding = response.data[0].embedding;
  embeddingCache.set(text, embedding);
  return embedding;
}
```
### Pre-filtering Before Similarity Search
```typescript
// Bad: Calculate similarity for all 10,000 objects
const allEmbeddings = await ctx.db
  .query("gpsObjectEmbeddings")
  .collect(); // 10,000 objects

// Good: Pre-filter by tags or dataset
const filteredEmbeddings = await ctx.db
  .query("gpsObjectEmbeddings")
  .filter(q => q.eq(q.field("metadata.datasetType"), "audio_recordings"))
  .collect(); // 1,000 objects
```
### Approximate Nearest Neighbor (ANN)
For datasets >50k objects, consider using specialized vector databases:
Option 1: Pinecone
```typescript
import { PineconeClient } from "@pinecone-database/pinecone";

// Note: this is the legacy client API; recent SDK versions use
// `new Pinecone({ apiKey })` and `index.query({ vector, topK, ... })`.
const pinecone = new PineconeClient();
await pinecone.init({
  apiKey: process.env.PINECONE_API_KEY,
  environment: "us-west1-gcp"
});

const index = pinecone.Index("gps-objects");

// Query (much faster than Convex for large datasets)
const results = await index.query({
  queryRequest: {
    vector: queryEmbedding,
    topK: 10,
    includeMetadata: true
  }
});
```
Option 2: Keep Convex, optimize queries
- Use smaller embedding dimensions (384 instead of 1536)
- Implement hierarchical search (coarse → fine)
- Cache frequently accessed embeddings in memory
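On the smaller-dimensions point: the `text-embedding-3` models accept a `dimensions` parameter at request time, and the same effect can be approximated locally by truncating a stored vector and renormalizing it to unit length. A minimal sketch of the local approach:

```typescript
// Shrink an embedding by truncation + renormalization. Works reasonably for
// text-embedding-3 models, which were trained so that leading dimensions
// carry most of the signal; requesting `dimensions` from the API is cleaner.
function truncateEmbedding(embedding: number[], dims: number): number[] {
  const cut = embedding.slice(0, dims);
  const norm = Math.sqrt(cut.reduce((s, v) => s + v * v, 0));
  // Renormalize so cosine similarity remains well-behaved
  return cut.map(v => v / norm);
}

// Toy 4-dim "embedding" reduced to 2 dims:
const small = truncateEmbedding([3, 4, 12, 0], 2); // → [0.6, 0.8]
```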
## Real-World Examples

### Example 1: Audio Research
Query: "Find field recordings with prominent low-frequency content near industrial areas"
Result:
```jsonc
[
  {
    "object": {
      "name": "Border Crossing - Evening Rush",
      "coordinates": [-117.0363, 32.5421]
    },
    "scores": {
      "semantic": 0.87,  // Strong match: "low-frequency", "industrial"
      "spatial": 0.95,   // Very close to industrial zone
      "composite": 0.89
    },
    "explanation": "Strong semantic match (87%) • Near industrial zone (95%)"
  }
]
```
### Example 2: Infrastructure Planning
Query: "Sewer lines showing signs of deterioration in high-traffic areas"
Result:
```jsonc
[
  {
    "object": {
      "name": "Sewer Main TJ-SW-1234",
      "condition": "moderate_wear",
      "location": "Av. Revolución (high pedestrian traffic)"
    },
    "scores": {
      "semantic": 0.82,  // Match: "deterioration", "wear"
      "spatial": 0.78,   // Near high-traffic intersection
      "recency": 0.90,   // Recently inspected
      "composite": 0.83
    }
  }
]
```
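The composite values above follow from the default ranking weights (semantic 0.7, spatial 0.2, recency 0.1). For Example 1, a recency score of about 0.91 (not shown in that listing, so an assumption here) reproduces the 0.89 composite:

```typescript
// Recompute a composite score from component scores with default weights.
const weights = { semantic: 0.7, spatial: 0.2, recency: 0.1 };

function composite(s: { semantic: number; spatial: number; recency: number }): number {
  return (
    s.semantic * weights.semantic +
    s.spatial * weights.spatial +
    s.recency * weights.recency
  );
}

// Example 1's scores, with recency assumed at 0.91:
const example1 = composite({ semantic: 0.87, spatial: 0.95, recency: 0.91 });
// example1 ≈ 0.89
```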
## Monitoring and Debugging

### Query Analytics
```typescript
// Track search performance
queryAnalytics: defineTable({
  query: v.string(),
  filters: v.any(),
  resultCount: v.number(),
  avgSimilarity: v.number(),
  executionTimeMs: v.number(),
  timestamp: v.number()
}).index("by_timestamp", ["timestamp"]);
```
### Embedding Quality Metrics
```typescript
// Periodically check embedding quality
// (samplePairs and avg are small helpers, not shown here)
export const checkEmbeddingQuality = internalMutation({
  handler: async (ctx) => {
    // Sample 100 random pairs
    const embeddings = await ctx.db
      .query("gpsObjectEmbeddings")
      .collect();

    const samples = samplePairs(embeddings, 100);

    // Calculate average similarity for same-dataset vs different-dataset pairs
    const sameDatasetSim = samples
      .filter(([a, b]) => a.datasetId === b.datasetId)
      .map(([a, b]) => cosineSimilarity(a.embedding, b.embedding));

    const diffDatasetSim = samples
      .filter(([a, b]) => a.datasetId !== b.datasetId)
      .map(([a, b]) => cosineSimilarity(a.embedding, b.embedding));

    console.log("Embedding quality metrics:");
    console.log(`Same dataset avg: ${avg(sameDatasetSim)}`);
    console.log(`Different dataset avg: ${avg(diffDatasetSim)}`);
    console.log(`Separation ratio: ${avg(sameDatasetSim) / avg(diffDatasetSim)}`);
    // Good separation ratio: > 1.5
  }
});
```
## See Also
- GPS Dataset Architecture with Convex - Database schema
- Building a GPS Dataset Manager - React UI with search interface
- Vectorized Databases - Embeddings fundamentals
- Langchain with Convex - RAG implementation patterns
- Langchain Overview - Langchain.js concepts