
Semantic Search for GPS Objects

Updated: 2/3/2026

Problem Statement

Challenge: Traditional GPS searches rely on exact matches (coordinates, addresses, names) or rigid filters (tags, date ranges). How can you search for recordings using natural language and semantic meaning?

Example queries you want to support:

  • "Find recordings with heavy bass near the border"
  • "Venues with intimate atmosphere and live jazz"
  • "Water infrastructure showing signs of corrosion"
  • "Monitoring sites with elevated contamination readings"

Requirements:

  1. Natural language query understanding
  2. Semantic similarity matching (not just keyword search)
  3. Hybrid search (combine semantic + spatial + metadata filters)
  4. Fast performance (sub-second response for 10k+ objects)
  5. Integration with existing Convex architecture
  6. Support for follow-up questions and refinement

Solution: Langchain.js + OpenAI embeddings + Convex vector storage + hybrid query patterns.

Architecture

┌─────────────────────────────────────────────────────┐
│  User Query: "Find jazz recordings near border"     │
└─────────────────────────────────────────────────────┘
                    ↓
┌─────────────────────────────────────────────────────┐
│  Query Processing (Langchain.js)                    │
│  • Generate embedding via OpenAI                    │
│  • Extract filters (location, date, tags)           │
└─────────────────────────────────────────────────────┘
                    ↓
┌─────────────────────────────────────────────────────┐
│  Hybrid Search (Convex)                             │
│  • Vector similarity search (semantic)              │
│  • Spatial filtering (near border = bbox)           │
│  • Metadata filtering (tags, date range)            │
│  • Combine scores and rank results                  │
└─────────────────────────────────────────────────────┘
                    ↓
┌─────────────────────────────────────────────────────┐
│  Results + Explanation                              │
│  • Top matches with similarity scores               │
│  • Why each result matched                          │
│  • Map visualization with highlighted results       │
└─────────────────────────────────────────────────────┘

Embedding Strategy

What to Embed

For each GPS object, concatenate these fields into searchable text:

typescript
function buildSearchText(object: GPSObject): string {
  const parts = [
    object.name,
    object.description || "",
    object.tags.join(" "),
    JSON.stringify(object.properties)
  ];
  
  return parts.filter(p => p).join(" ");
}

Example for a field recording:

"Tijuana Jazz Club - Friday Night Set Live jazz trio performance, intimate atmosphere jazz live tijuana trio {recording_date: '2025-11-01T21:30:00Z', equipment: 'Zoom H6, Rode NTG3', weather: 'Clear, 18°C, light winds', notes: 'Live jazz trio, moderate audience noise'}"

Generating Embeddings

typescript
// convex/embeddings.ts
import { OpenAI } from "openai";
import { internalAction } from "./_generated/server";
import { internal } from "./_generated/api";
import { v } from "convex/values";

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY
});

// Convex only allows network calls (like the OpenAI request below) in
// actions, not mutations, so this runs as an internalAction and touches
// the database through internal queries/mutations.
export const generateEmbedding = internalAction({
  args: { objectId: v.id("gpsObjects") },
  handler: async (ctx, args) => {
    // Fetch GPS object and its dataset (the getById helpers are assumed
    // to be defined as internalQuery functions elsewhere in the codebase)
    const object = await ctx.runQuery(internal.gpsObjects.getById, {
      id: args.objectId
    });
    if (!object) return;
    
    const dataset = await ctx.runQuery(internal.gpsDatasets.getById, {
      id: object.datasetId
    });
    if (!dataset) return;
    
    // Build searchable text
    const searchText = [
      object.name,
      object.description || "",
      object.tags.join(" "),
      JSON.stringify(object.properties)
    ].join(" ");
    
    // Generate embedding
    const response = await openai.embeddings.create({
      model: "text-embedding-3-small",  // 1536 dimensions
      input: searchText
    });
    
    const embedding = response.data[0].embedding;
    
    // Extract coordinates for spatial queries
    let coordinates: [number, number] = [0, 0];
    if (object.geometry.type === "Point") {
      coordinates = object.geometry.coordinates as [number, number];
    } else if (object.geometry.type === "Polygon") {
      // Use centroid
      coordinates = calculateCentroid(object.geometry.coordinates);
    }
    
    // Store embedding (storeEmbedding is assumed to be an internalMutation
    // that inserts the row into gpsObjectEmbeddings)
    await ctx.runMutation(internal.embeddings.storeEmbedding, {
      objectId: args.objectId,
      datasetId: object.datasetId,
      embedding,
      searchText,
      metadata: {
        name: object.name,
        datasetType: dataset.datasetType,
        coordinates,
        tags: object.tags
      },
      createdAt: Date.now()
    });
    
    console.log(`✓ Generated embedding for: ${object.name}`);
  }
});
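The calculateCentroid helper used above is not defined in the snippet. A minimal sketch, assuming GeoJSON-style polygon coordinates (rings of [lng, lat] pairs, with the closing vertex repeating the first): averaging the outer ring's vertices gives a rough centroid, which is adequate for spatial indexing even though it is not an area-weighted centroid.

```typescript
// Approximate a polygon centroid by averaging the outer-ring vertices.
// GeoJSON rings repeat the first vertex at the end, so drop the closing
// vertex before averaging.
function calculateCentroid(rings: number[][][]): [number, number] {
  const ring = rings[0].slice(0, -1);  // outer ring, minus closing vertex
  let sumLng = 0;
  let sumLat = 0;
  for (const [lng, lat] of ring) {
    sumLng += lng;
    sumLat += lat;
  }
  return [sumLng / ring.length, sumLat / ring.length];
}
```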

Batch Embedding Generation

typescript
// scripts/generate-embeddings.ts
import { ConvexHttpClient } from "convex/browser";
import { api } from "../convex/_generated/api";

async function generateAllEmbeddings(datasetId: string) {
  const client = new ConvexHttpClient(process.env.CONVEX_URL!);
  
  // Get all objects without embeddings
  const objects = await client.query(api.gpsObjects.getObjectsWithoutEmbeddings, {
    datasetId
  });
  
  console.log(`Generating embeddings for ${objects.length} objects...`);
  
  // Process in batches of 10 (rate limiting)
  for (let i = 0; i < objects.length; i += 10) {
    const batch = objects.slice(i, i + 10);
    
    // Internal functions are not callable from clients, so this assumes
    // generateEmbedding is exposed as a public action in the api.
    await Promise.all(
      batch.map(obj =>
        client.action(api.embeddings.generateEmbedding, {
          objectId: obj._id
        })
      )
    );
    
    console.log(`Progress: ${i + batch.length}/${objects.length}`);
    
    // Rate limiting: wait 1 second between batches
    await new Promise(resolve => setTimeout(resolve, 1000));
  }
  
  console.log("✅ All embeddings generated");
}
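The fixed one-second pause does not help when a request inside a batch still hits a rate limit. A small retry helper with exponential backoff can wrap each call; this is a sketch, and the attempt count and delays are arbitrary choices.

```typescript
// Retry an async operation with exponential backoff: wait baseDelayMs,
// then 2x, then 4x, ... between attempts. Rethrows the last error if
// every attempt fails.
async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 1000
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Back off before the next attempt
      await new Promise(r => setTimeout(r, baseDelayMs * 2 ** attempt));
    }
  }
  throw lastError;
}
```

Usage: replace each client.action call in the batch loop with withRetry(() => client.action(...)).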

Basic Semantic Search

Query Implementation

typescript
// convex/search.ts
import { action, internalQuery } from "./_generated/server";
import { internal } from "./_generated/api";
import { v } from "convex/values";
import { OpenAI } from "openai";

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY
});

// The OpenAI call is a network request, which Convex only permits in
// actions. The action generates the query embedding, then hands off to
// the internal query below for the database-side ranking.
export const semanticSearch = action({
  args: {
    query: v.string(),
    datasetId: v.optional(v.id("gpsDatasets")),
    limit: v.optional(v.number())
  },
  handler: async (ctx, args) => {
    // Generate query embedding
    const response = await openai.embeddings.create({
      model: "text-embedding-3-small",
      input: args.query
    });
    const queryEmbedding = response.data[0].embedding;
    
    return await ctx.runQuery(internal.search.rankBySimilarity, {
      queryEmbedding,
      datasetId: args.datasetId,
      limit: args.limit
    });
  }
});

export const rankBySimilarity = internalQuery({
  args: {
    queryEmbedding: v.array(v.number()),
    datasetId: v.optional(v.id("gpsDatasets")),
    limit: v.optional(v.number())
  },
  handler: async (ctx, args) => {
    // Fetch embeddings (optionally filtered by dataset)
    let embeddingsQuery = ctx.db.query("gpsObjectEmbeddings");
    if (args.datasetId) {
      embeddingsQuery = embeddingsQuery.withIndex(
        "by_dataset",
        q => q.eq("datasetId", args.datasetId)
      );
    }
    const allEmbeddings = await embeddingsQuery.collect();
    
    // Calculate cosine similarity
    const withScores = allEmbeddings.map(doc => ({
      ...doc,
      similarity: cosineSimilarity(args.queryEmbedding, doc.embedding)
    }));
    
    // Sort by similarity and take top results
    const topResults = withScores
      .sort((a, b) => b.similarity - a.similarity)
      .slice(0, args.limit ?? 10);
    
    // Fetch full GPS objects
    const objects = await Promise.all(
      topResults.map(r => ctx.db.get(r.objectId))
    );
    
    return topResults.map((result, i) => ({
      object: objects[i],
      similarity: result.similarity,
      matchedText: result.searchText.slice(0, 200) + "..."  // Preview
    }));
  }
});

// Cosine similarity helper
function cosineSimilarity(a: number[], b: number[]): number {
  const dotProduct = a.reduce((sum, val, i) => sum + val * b[i], 0);
  const magnitudeA = Math.sqrt(a.reduce((sum, val) => sum + val * val, 0));
  const magnitudeB = Math.sqrt(b.reduce((sum, val) => sum + val * val, 0));
  return dotProduct / (magnitudeA * magnitudeB);
}
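OpenAI embeddings are returned normalized to length 1, so for vectors straight from the API the two magnitude computations are redundant and cosine similarity reduces to a plain dot product:

```typescript
// For unit-length vectors (as OpenAI embeddings are), the dot product
// equals cosine similarity and is cheaper to compute.
function dotProduct(a: number[], b: number[]): number {
  return a.reduce((sum, val, i) => sum + val * b[i], 0);
}
```

Keeping the full cosineSimilarity helper is still the safer default if embeddings from other sources might not be normalized.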

Example Usage

typescript
// In React component
import { useState } from "react";
import { useAction } from "convex/react";
import { api } from "../convex/_generated/api";

// semanticSearch performs a network call, so it is exposed as a Convex
// action and invoked imperatively with useAction rather than useQuery.
function SearchComponent() {
  const [query, setQuery] = useState("");
  const [results, setResults] = useState<any[] | null>(null);
  const search = useAction(api.search.semanticSearch);
  
  async function runSearch() {
    if (!query) return;
    setResults(await search({ query, limit: 20 }));
  }
  
  return (
    <div>
      <input
        type="text"
        value={query}
        onChange={(e) => setQuery(e.target.value)}
        onKeyDown={(e) => e.key === "Enter" && runSearch()}
        placeholder="Find recordings with heavy bass near the border"
      />
      
      {results?.map(result => (
        <div key={result.object._id}>
          <h3>{result.object.name}</h3>
          <p>Similarity: {(result.similarity * 100).toFixed(1)}%</p>
          <p>{result.matchedText}</p>
        </div>
      ))}
    </div>
  );
}

Hybrid Search: Semantic + Spatial + Filters

Advanced Query with Multiple Dimensions

typescript
// convex/search.ts (continued)
import { action, internalQuery } from "./_generated/server";
import { internal } from "./_generated/api";

// As with semanticSearch, the OpenAI call requires an action; the
// database-side filtering and scoring is delegated to the internal
// query below.
export const hybridSearch = action({
  args: {
    query: v.string(),
    datasetId: v.optional(v.id("gpsDatasets")),
    
    // Spatial filters
    nearLocation: v.optional(v.object({
      lat: v.number(),
      lng: v.number(),
      radiusKm: v.number()
    })),
    
    // Metadata filters
    tags: v.optional(v.array(v.string())),
    dateRange: v.optional(v.object({
      start: v.number(),
      end: v.number()
    })),
    
    // Ranking weights
    weights: v.optional(v.object({
      semantic: v.number(),   // Default: 0.7
      spatial: v.number(),    // Default: 0.2
      recency: v.number()     // Default: 0.1
    })),
    
    limit: v.optional(v.number())
  },
  handler: async (ctx, args) => {
    // 1. Generate query embedding
    const response = await openai.embeddings.create({
      model: "text-embedding-3-small",
      input: args.query
    });
    const queryEmbedding = response.data[0].embedding;
    
    // 2-9. Everything database-side happens in the internal query
    const { query: queryText, ...filters } = args;
    return await ctx.runQuery(internal.search.rankHybrid, {
      ...filters,
      queryText,
      queryEmbedding
    });
  }
});

export const rankHybrid = internalQuery({
  args: {
    queryText: v.string(),
    queryEmbedding: v.array(v.number()),
    datasetId: v.optional(v.id("gpsDatasets")),
    nearLocation: v.optional(v.object({
      lat: v.number(),
      lng: v.number(),
      radiusKm: v.number()
    })),
    tags: v.optional(v.array(v.string())),
    dateRange: v.optional(v.object({
      start: v.number(),
      end: v.number()
    })),
    weights: v.optional(v.object({
      semantic: v.number(),
      spatial: v.number(),
      recency: v.number()
    })),
    limit: v.optional(v.number())
  },
  handler: async (ctx, args) => {
    const queryEmbedding = args.queryEmbedding;
    const weights = args.weights ?? {
      semantic: 0.7,
      spatial: 0.2,
      recency: 0.1
    };
    
    // 2. Fetch embeddings with initial filters
    let embeddingsQuery = ctx.db.query("gpsObjectEmbeddings");
    if (args.datasetId) {
      embeddingsQuery = embeddingsQuery.withIndex(
        "by_dataset",
        q => q.eq("datasetId", args.datasetId)
      );
    }
    let embeddings = await embeddingsQuery.collect();
    
    // 3. Apply metadata filters
    if (args.tags && args.tags.length > 0) {
      embeddings = embeddings.filter(emb =>
        args.tags!.some(tag => emb.metadata.tags.includes(tag))
      );
    }
    if (args.dateRange) {
      // createdAt is the timestamp stored with each embedding
      embeddings = embeddings.filter(emb =>
        emb.createdAt >= args.dateRange!.start &&
        emb.createdAt <= args.dateRange!.end
      );
    }
    
    // 4. Calculate semantic similarity
    const withSemanticScores = embeddings.map(emb => ({
      ...emb,
      semanticScore: cosineSimilarity(queryEmbedding, emb.embedding)
    }));
    
    // 5. Calculate spatial proximity score (if location provided)
    const withSpatialScores = withSemanticScores.map(emb => {
      let spatialScore = 1.0;  // Max score if no location filter
      
      if (args.nearLocation) {
        const distance = haversineDistance(
          args.nearLocation.lat,
          args.nearLocation.lng,
          emb.metadata.coordinates[1],  // lat
          emb.metadata.coordinates[0]   // lng
        );
        
        // Normalize: 0 km = 1.0, radiusKm = 0.0
        spatialScore = Math.max(
          0,
          1 - (distance / args.nearLocation.radiusKm)
        );
      }
      
      return { ...emb, spatialScore };
    });
    
    // 6. Calculate recency score
    const now = Date.now();
    const withRecencyScores = withSpatialScores.map(emb => {
      const ageMs = now - emb.createdAt;
      const ageDays = ageMs / (1000 * 60 * 60 * 24);
      
      // Normalize: 0 days = 1.0, 365 days = 0.0
      const recencyScore = Math.max(0, 1 - (ageDays / 365));
      
      return { ...emb, recencyScore };
    });
    
    // 7. Calculate weighted composite score
    const withCompositeScores = withRecencyScores.map(emb => ({
      ...emb,
      compositeScore:
        emb.semanticScore * weights.semantic +
        emb.spatialScore * weights.spatial +
        emb.recencyScore * weights.recency
    }));
    
    // 8. Sort by composite score
    const topResults = withCompositeScores
      .sort((a, b) => b.compositeScore - a.compositeScore)
      .slice(0, args.limit ?? 10);
    
    // 9. Fetch full GPS objects
    const objects = await Promise.all(
      topResults.map(r => ctx.db.get(r.objectId))
    );
    
    return topResults.map((result, i) => ({
      object: objects[i],
      scores: {
        composite: result.compositeScore,
        semantic: result.semanticScore,
        spatial: result.spatialScore,
        recency: result.recencyScore
      },
      explanation: buildExplanation(result, args.queryText)
    }));
  }
});

// Helper: Build human-readable explanation
function buildExplanation(result: any, query: string): string {
  const parts = [];
  
  parts.push(
    `Semantic match: ${(result.semanticScore * 100).toFixed(0)}%`
  );
  
  if (result.spatialScore < 1.0) {
    parts.push(
      `Location relevance: ${(result.spatialScore * 100).toFixed(0)}%`
    );
  }
  
  parts.push(
    `Recency: ${(result.recencyScore * 100).toFixed(0)}%`
  );
  
  return parts.join(" • ");
}

// Haversine distance formula
function haversineDistance(
  lat1: number,
  lon1: number,
  lat2: number,
  lon2: number
): number {
  const R = 6371; // Earth's radius in km
  const dLat = toRadians(lat2 - lat1);
  const dLon = toRadians(lon2 - lon1);
  
  const a =
    Math.sin(dLat / 2) * Math.sin(dLat / 2) +
    Math.cos(toRadians(lat1)) *
      Math.cos(toRadians(lat2)) *
      Math.sin(dLon / 2) *
      Math.sin(dLon / 2);
  
  const c = 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
  return R * c;
}

function toRadians(degrees: number): number {
  return degrees * (Math.PI / 180);
}

Advanced Query Patterns

Pattern 1: Natural Language with Extracted Filters

User query: "Find jazz recordings near the border from last month"

Query processing:

typescript
// Use LLM to extract structured filters
const extractedFilters = await extractFilters(userQuery);
// Result:
// {
//   semantic_query: "jazz recordings",
//   location: { lat: 32.5332, lng: -117.0363, radiusKm: 5 },
//   date_range: { start: lastMonthStart, end: now },
//   tags: ["jazz"]
// }

const results = await hybridSearch({
  query: extractedFilters.semantic_query,
  nearLocation: extractedFilters.location,
  dateRange: extractedFilters.date_range,
  tags: extractedFilters.tags
});
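The extractFilters function above is assumed rather than shown. A production version would ask an LLM for structured output; the hypothetical rule-based stand-in below only recognizes a small tag vocabulary and the phrase "from last month", but it produces the same shape of result, which is what matters for wiring it into hybridSearch.

```typescript
// Hypothetical rule-based stand-in for extractFilters. Field names and
// the tag vocabulary are illustrative assumptions.
interface ExtractedFilters {
  semanticQuery: string;
  tags: string[];
  dateRange?: { start: number; end: number };
}

function extractFiltersNaive(query: string, now = Date.now()): ExtractedFilters {
  // Recognize known tags anywhere in the query
  const tags = ["jazz", "ambient", "industrial"].filter(tag =>
    query.toLowerCase().includes(tag)
  );
  
  // Recognize "from last month" and turn it into a date range
  let dateRange: { start: number; end: number } | undefined;
  let semanticQuery = query;
  if (/from last month/i.test(query)) {
    const dayMs = 24 * 60 * 60 * 1000;
    dateRange = { start: now - 30 * dayMs, end: now };
    semanticQuery = query.replace(/\s*from last month/i, "");
  }
  
  return { semanticQuery: semanticQuery.trim(), tags, dateRange };
}
```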

Pattern 2: Follow-Up Refinement

Conversation:

User: "Find field recordings with heavy bass"
Agent: [Returns 10 results]

User: "Show only the ones near downtown Tijuana"
Agent: [Refines with spatial filter]

User: "From the last 3 months"
Agent: [Adds date range filter]

Implementation:

typescript
// Maintain conversation state
const [conversationState, setConversationState] = useState({
  baseQuery: "",
  filters: {}
});

// Update filters incrementally
function refineSearch(refinement: string) {
  const newFilters = extractRefinement(refinement, conversationState);
  
  setConversationState({
    ...conversationState,
    filters: { ...conversationState.filters, ...newFilters }
  });
  
  // Re-run search with updated filters
  hybridSearch({
    query: conversationState.baseQuery,
    ...conversationState.filters
  });
}
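The extractRefinement helper is likewise assumed. A sketch that handles just the two refinements from the conversation above (a real implementation would use an LLM; the downtown Tijuana coordinates reuse the example values from Pattern 1, and the 3 km radius is an arbitrary choice):

```typescript
// Hypothetical extractRefinement: map a follow-up phrase onto partial
// filter updates that get merged into conversationState.filters.
function extractRefinement(refinement: string, now = Date.now()) {
  const updates: Record<string, unknown> = {};
  
  // "near downtown Tijuana" -> spatial filter (example coordinates)
  if (/downtown tijuana/i.test(refinement)) {
    updates.nearLocation = { lat: 32.5332, lng: -117.0363, radiusKm: 3 };
  }
  
  // "last N months" -> date range filter (30-day months)
  const months = refinement.match(/last (\d+) months?/i);
  if (months) {
    const ms = Number(months[1]) * 30 * 24 * 60 * 60 * 1000;
    updates.dateRange = { start: now - ms, end: now };
  }
  
  return updates;
}
```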

Pattern 3: Similar Objects

User action: Clicks "Find similar" on a GPS object

Implementation:

typescript
export const findSimilar = query({
  args: {
    objectId: v.id("gpsObjects"),
    limit: v.optional(v.number())
  },
  handler: async (ctx, args) => {
    // Get embedding for the object
    const embedding = await ctx.db
      .query("gpsObjectEmbeddings")
      .withIndex("by_object", q => q.eq("objectId", args.objectId))
      .first();
    
    if (!embedding) return [];
    
    // Find similar embeddings
    const allEmbeddings = await ctx.db
      .query("gpsObjectEmbeddings")
      .collect();
    
    const withScores = allEmbeddings
      .filter(emb => emb.objectId !== args.objectId)  // Exclude self
      .map(emb => ({
        ...emb,
        similarity: cosineSimilarity(embedding.embedding, emb.embedding)
      }));
    
    const topResults = withScores
      .sort((a, b) => b.similarity - a.similarity)
      .slice(0, args.limit ?? 10);
    
    const objects = await Promise.all(
      topResults.map(r => ctx.db.get(r.objectId))
    );
    
    return topResults.map((r, i) => ({
      object: objects[i],
      similarity: r.similarity
    }));
  }
});

Integration with Langchain.js

RAG Pattern: Retrieve + Generate

Use case: Answer questions about your GPS dataset using LLM

typescript
import { ChatOpenAI } from "@langchain/openai";
import { PromptTemplate } from "@langchain/core/prompts";
import { action } from "./_generated/server";
import { api } from "./_generated/api";
import { v } from "convex/values";

// Both the retrieval step and the LLM call are network requests, so
// this must be a Convex action.
export const answerQuestion = action({
  args: {
    question: v.string(),
    datasetId: v.optional(v.id("gpsDatasets"))
  },
  handler: async (ctx, args) => {
    // 1. Retrieve relevant GPS objects (semanticSearch calls the OpenAI
    //    embeddings API, so it is itself an action)
    const results = await ctx.runAction(api.search.semanticSearch, {
      query: args.question,
      datasetId: args.datasetId,
      limit: 5
    });
    
    // 2. Build context from results
    const context = results
      .map(r => `
        Name: ${r.object.name}
        Description: ${r.object.description}
        Properties: ${JSON.stringify(r.object.properties, null, 2)}
      `)
      .join("\n\n---\n\n");
    
    // 3. Generate answer with LLM
    const llm = new ChatOpenAI({
      model: "gpt-4-turbo-preview",
      temperature: 0
    });
    
    const prompt = PromptTemplate.fromTemplate(`
      You are an expert in analyzing GPS-tagged field recordings and geospatial data.
      
      Context (GPS objects from database):
      {context}
      
      User question: {question}
      
      Provide a detailed answer based ONLY on the context above. If the context doesn't contain enough information, say so.
    `);
    
    const chain = prompt.pipe(llm);
    const response = await chain.invoke({
      context,
      question: args.question
    });
    
    return {
      answer: response.content,
      sources: results.map(r => ({
        name: r.object.name,
        similarity: r.similarity
      }))
    };
  }
});

Example usage:

typescript
const result = await answerQuestion({
  question: "What venues have the best acoustics for recording jazz?",
  datasetId: "music-venues-tijuana"
});

console.log(result.answer);
// "Based on the field recordings database, Tijuana Jazz Club shows
//  the best acoustics for jazz recording due to its intimate 150-person
//  capacity, sound-treated walls, and minimal ambient noise..."

console.log(result.sources);
// [
//   { name: "Tijuana Jazz Club - Friday Night Set", similarity: 0.89 },
//   { name: "ENTONO Live Music - Acoustic Session", similarity: 0.82 }
// ]

Performance Optimization

Caching Embeddings

typescript
// Cache query embeddings for common searches
const embeddingCache = new Map<string, number[]>();

async function getCachedEmbedding(text: string): Promise<number[]> {
  if (embeddingCache.has(text)) {
    return embeddingCache.get(text)!;
  }
  
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text
  });
  
  const embedding = response.data[0].embedding;
  embeddingCache.set(text, embedding);
  
  return embedding;
}
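The module-level Map above grows without bound. A simple insertion-order cap keeps memory bounded (a FIFO eviction, not a true LRU, since reads do not refresh entries); it relies on the fact that Map iterates keys in insertion order, so the first key is the oldest.

```typescript
// Insert into a Map, evicting the oldest entry once the cap is reached.
// The cap of 500 is an arbitrary choice.
function setWithCap<K, V>(cache: Map<K, V>, key: K, value: V, cap = 500): void {
  if (!cache.has(key) && cache.size >= cap) {
    const oldest = cache.keys().next().value;  // first-inserted key
    if (oldest !== undefined) cache.delete(oldest);
  }
  cache.set(key, value);
}
```

In getCachedEmbedding, replace embeddingCache.set(text, embedding) with setWithCap(embeddingCache, text, embedding).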

Pre-filtering Before Similarity Search

typescript
// Bad: Calculate similarity for all 10,000 objects
const allEmbeddings = await ctx.db
  .query("gpsObjectEmbeddings")
  .collect();  // 10,000 objects

// Good: Pre-filter by tags or dataset
const filteredEmbeddings = await ctx.db
  .query("gpsObjectEmbeddings")
  .filter(q => q.eq(q.field("metadata.datasetType"), "audio_recordings"))
  .collect();  // 1,000 objects

Approximate Nearest Neighbor (ANN)

For datasets >50k objects, consider using specialized vector databases:

Option 1: Pinecone

typescript
import { Pinecone } from "@pinecone-database/pinecone";

// Current SDK style; older versions used PineconeClient with an
// explicit init/environment step.
const pinecone = new Pinecone({
  apiKey: process.env.PINECONE_API_KEY!
});

const index = pinecone.index("gps-objects");

// Query (approximate nearest neighbor, much faster than a full scan
// for large datasets)
const results = await index.query({
  vector: queryEmbedding,
  topK: 10,
  includeMetadata: true
});

Option 2: Keep Convex, optimize queries

  • Use smaller embedding dimensions (384 instead of 1536)
  • Implement hierarchical search (coarse β†’ fine)
  • Cache frequently accessed embeddings in memory

Real-World Examples

Example 1: Audio Research

Query: "Find field recordings with prominent low-frequency content near industrial areas"

Result:

json
[
  {
    "object": {
      "name": "Border Crossing - Evening Rush",
      "coordinates": [-117.0363, 32.5421]
    },
    "scores": {
      "semantic": 0.87,  // Strong match: "low-frequency", "industrial"
      "spatial": 0.95,   // Very close to industrial zone
      "composite": 0.89
    },
    "explanation": "Strong semantic match (87%) β€’ Near industrial zone (95%)"
  }
]

Example 2: Infrastructure Planning

Query: "Sewer lines showing signs of deterioration in high-traffic areas"

Result:

json
[
  {
    "object": {
      "name": "Sewer Main TJ-SW-1234",
      "condition": "moderate_wear",
      "location": "Av. RevoluciΓ³n (high pedestrian traffic)"
    },
    "scores": {
      "semantic": 0.82,  // Match: "deterioration", "wear"
      "spatial": 0.78,   // Near high-traffic intersection
      "recency": 0.90,   // Recently inspected
      "composite": 0.83
    }
  }
]

Monitoring and Debugging

Query Analytics

typescript
// convex/schema.ts: add inside defineSchema to track search performance
queryAnalytics: defineTable({
  query: v.string(),
  filters: v.any(),
  resultCount: v.number(),
  avgSimilarity: v.number(),
  executionTimeMs: v.number(),
  timestamp: v.number()
}).index("by_timestamp", ["timestamp"]);
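A hypothetical helper for producing rows in that shape from one search's results (the function name and field choices are illustrative; the row would be inserted by a mutation after each search):

```typescript
// Build one queryAnalytics row from a completed search.
function buildAnalyticsRow(
  query: string,
  filters: unknown,
  similarities: number[],  // similarity score of each returned result
  startedAt: number        // Date.now() captured before the search ran
) {
  const avgSimilarity = similarities.length
    ? similarities.reduce((s, x) => s + x, 0) / similarities.length
    : 0;
  return {
    query,
    filters,
    resultCount: similarities.length,
    avgSimilarity,
    executionTimeMs: Date.now() - startedAt,
    timestamp: Date.now()
  };
}
```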

Embedding Quality Metrics

typescript
// Periodically check embedding quality
export const checkEmbeddingQuality = internalMutation({
  handler: async (ctx) => {
    // Sample 100 random pairs
    const embeddings = await ctx.db
      .query("gpsObjectEmbeddings")
      .collect();
    
    const samples = samplePairs(embeddings, 100);
    
    // Calculate average similarity for same-dataset vs different-dataset
    const sameDatasetSim = samples
      .filter(([a, b]) => a.datasetId === b.datasetId)
      .map(([a, b]) => cosineSimilarity(a.embedding, b.embedding));
    
    const diffDatasetSim = samples
      .filter(([a, b]) => a.datasetId !== b.datasetId)
      .map(([a, b]) => cosineSimilarity(a.embedding, b.embedding));
    
    console.log("Embedding quality metrics:");
    console.log(`Same dataset avg: ${avg(sameDatasetSim)}`);
    console.log(`Different dataset avg: ${avg(diffDatasetSim)}`);
    console.log(`Separation ratio: ${avg(sameDatasetSim) / avg(diffDatasetSim)}`);
    // Good separation ratio: > 1.5
  }
});
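The avg and samplePairs helpers used above are assumed; minimal sketches:

```typescript
// Arithmetic mean of a list of numbers.
function avg(xs: number[]): number {
  return xs.reduce((s, x) => s + x, 0) / xs.length;
}

// Draw n random pairs of distinct elements (the same element may appear
// across different pairs, but never twice within one pair).
function samplePairs<T>(items: T[], n: number): [T, T][] {
  const pairs: [T, T][] = [];
  while (pairs.length < n && items.length >= 2) {
    const i = Math.floor(Math.random() * items.length);
    const j = Math.floor(Math.random() * items.length);
    if (i === j) continue;  // redraw instead of pairing an item with itself
    pairs.push([items[i], items[j]]);
  }
  return pairs;
}
```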

See Also