Database Theory - midimaze

# Database Theory A database is an organized collection of data that can be efficiently queried, updated, and managed. Understanding database fundamentals helps you build scalable, reliable applications. ## Core Concepts ### CRUD Operations Every database supports four fundamental operations: **Create** - Add new data ```typescript await db.insert("songs", { title: "Giant Steps", artist: "John Coltrane", year: 1960 }); ``` **Read** - Retrieve data ```typescript const song = await db.query("songs") .filter(q => q.eq(q.field("title"), "Giant Steps")) .first(); ``` **Update** - Modify existing data ```typescript await db.patch(songId, { year: 1959 }); ``` **Delete** - Remove data ```typescript await db.delete(songId); ``` ### Schemas A **schema** defines the structure of your data - what fields exist and what types they have. **Example schema (Convex):** ```typescript import { defineSchema, defineTable } from "convex/server"; import { v } from "convex/values"; export default defineSchema({ songs: defineTable({ title: v.string(), artist: v.string(), year: v.number(), duration: v.optional(v.number()), // Optional field tags: v.array(v.string()) // Array of strings }) }); ``` **Why schemas matter:** - **Type safety** - Catch errors at compile time - **Data integrity** - Ensure consistency - **Performance** - Database can optimize storage ### Queries Queries retrieve data based on conditions. **Simple query:** ```typescript // Get all jazz songs const jazzSongs = await db.query("songs") .filter(q => q.eq(q.field("genre"), "jazz")) .collect(); ``` **Complex query:** ```typescript // Get songs from 1960s by Coltrane, sorted by title const songs = await db.query("songs") .filter(q => q.and( q.eq(q.field("artist"), "John Coltrane"), q.gte(q.field("year"), 1960), q.lte(q.field("year"), 1969) ) ) .order("asc", "title") .collect(); ``` ### Indexes **Indexes** make queries fast by creating lookup tables. **Without index:** - Database scans every record → O(n) time - 10,000 songs → 10,000 checks **With index:** - Database jumps to match → O(log n) time - 10,000 songs → ~13 checks **Creating indexes (Convex):** ```typescript songs: defineTable({ title: v.string(), artist: v.string(), year: v.number() }) .index("by_artist", ["artist"]) .index("by_year", ["year"]) .index("by_artist_year", ["artist", "year"]) // Compound index ``` **Using indexes:** ```typescript // Fast - uses index const coltraneSONGS = await db.query("songs") .withIndex("by_artist", q => q.eq("artist", "John Coltrane")) .collect(); // Slow - no index const oldSongs = await db.query("songs") .filter(q => q.lt(q.field("year"), 1950)) // No index on year < 1950 .collect(); ``` ## Database Types ### Relational Databases **Structure:** Tables with rows and columns **Relationships:** Foreign keys link tables **Query language:** SQL **Examples:** PostgreSQL, MySQL, SQLite **Song database (relational):** ```sql -- Songs table CREATE TABLE songs ( id SERIAL PRIMARY KEY, title VARCHAR(255), artist_id INT REFERENCES artists(id), album_id INT REFERENCES albums(id) ); -- Artists table CREATE TABLE artists ( id SERIAL PRIMARY KEY, name VARCHAR(255) ); -- Query with JOIN SELECT songs.title, artists.name FROM songs JOIN artists ON songs.artist_id = artists.id WHERE artists.name = 'John Coltrane'; ``` **Strengths:** - Strong data integrity (ACID transactions) - Complex relationships - Mature tooling **Weaknesses:** - Rigid schema (migrations required) - Harder to scale horizontally ### NoSQL Databases **Structure:** Flexible documents, key-value pairs, graphs **Relationships:** Embedded or referenced **Examples:** MongoDB, Convex, Firebase **Song database (NoSQL/Convex):** ```typescript { _id: "abc123", title: "Giant Steps", artist: { name: "John Coltrane", born: 1926 }, album: "Giant Steps", tracks: [ { position: 1, title: "Giant Steps", duration: 285 }, { position: 2, title: "Cousin Mary", duration: 342 } ] } ``` **Strengths:** - Flexible schema - Easy horizontal scaling - Fast development **Weaknesses:** - Less rigid data integrity - Complex relationships harder to model ## Convex-Specific Concepts ### Real-Time Reactivity Convex automatically updates React components when data changes. ```typescript function SongList() { // Subscribes to changes const songs = useQuery(api.songs.list); // Automatically re-renders when songs added/updated/deleted return ( <ul> {songs?.map(song => <li key={song._id}>{song.title}</li>)} </ul> ); } ``` ### Queries, Mutations, Actions **Query** - Read data (can't modify) ```typescript export const get = query({ args: { id: v.id("songs") }, handler: async (ctx, args) => { return await ctx.db.get(args.id); } }); ``` **Mutation** - Write data (deterministic) ```typescript export const add = mutation({ args: { title: v.string(), artist: v.string() }, handler: async (ctx, args) => { return await ctx.db.insert("songs", args); } }); ``` **Action** - Call external APIs (non-deterministic) ```typescript export const fetchFromSpotify = action({ args: { songId: v.string() }, handler: async (ctx, args) => { const data = await fetch(`https://api.spotify.com/...`); // Can call mutations to save results await ctx.runMutation(api.songs.update, { ... }); } }); ``` ### Document IDs Convex uses typed IDs: `Id<"songs">` ```typescript // Type-safe ID const songId: Id<"songs"> = "k17abc123"; // Can't accidentally use wrong ID const artist = await db.get(songId); // Error: Expected Id<"artists"> ``` ## Common Patterns ### One-to-Many Relationships **Example:** One artist, many songs **Approach 1: Array of IDs** ```typescript artists: defineTable({ name: v.string(), songIds: v.array(v.id("songs")) }) ``` **Approach 2: Back-reference (better for Convex)** ```typescript songs: defineTable({ title: v.string(), artistId: v.id("artists") }).index("by_artist", ["artistId"]) // Query all songs by artist const songs = await db.query("songs") .withIndex("by_artist", q => q.eq("artistId", artistId)) .collect(); ``` ### Many-to-Many Relationships **Example:** Songs can have multiple tags, tags apply to multiple songs **Junction table:** ```typescript songTags: defineTable({ songId: v.id("songs"), tagId: v.id("tags") }) .index("by_song", ["songId"]) .index("by_tag", ["tagId"]) ``` ### Denormalization **Principle:** Copy data to avoid joins **Example:** Store artist name in song (instead of just ID) ```typescript songs: defineTable({ title: v.string(), artistId: v.id("artists"), artistName: v.string() // Denormalized - avoid extra query }) ``` **Trade-off:** - ✅ Faster queries (no join) - ❌ More storage - ❌ Must update in multiple places ## Best Practices ### Design for Your Queries **Bad:** Design schema first, then figure out queries **Good:** Design queries first, then schema to support them **Example:** - Need to show "Recent songs by artist"? - Create index: `by_artist_created` ### Use Indexes Wisely **Too few indexes:** Slow queries **Too many indexes:** Slow writes, wasted storage **Rule of thumb:** Index fields you filter/sort by frequently ### Pagination for Large Results **Bad:** ```typescript const allSongs = await db.query("songs").collect(); // 10,000 songs! ``` **Good:** ```typescript const page = await db.query("songs") .order("desc", "createdAt") .take(50); // First 50 only ``` ### Atomic Updates Use transactions to ensure data consistency: ```typescript // Transfer plays from one song to another export const transferPlays = mutation({ args: { fromSongId: v.id("songs"), toSongId: v.id("songs") }, handler: async (ctx, args) => { const fromSong = await ctx.db.get(args.fromSongId); const toSong = await ctx.db.get(args.toSongId); // Both updates succeed or both fail (atomic) await ctx.db.patch(args.fromSongId, { plays: 0 }); await ctx.db.patch(args.toSongId, { plays: toSong.plays + fromSong.plays }); } }); ``` ## Real-World Example: NNT Ecosystem **Requirements:** - Store songs with multiple annotation instances - Each annotation has start/end times, text, embeddings - Real-time updates when annotations added - Fast search by song, time range, semantic content **Schema:** ```typescript export default defineSchema({ songs: defineTable({ title: v.string(), artist: v.string(), appleMusicId: v.optional(v.string()), spotifyId: v.optional(v.string()) }), annotations: defineTable({ songId: v.id("songs"), startTime: v.number(), endTime: v.number(), text: v.string(), embedding: v.array(v.float64()), createdBy: v.string(), createdAt: v.number() }) .index("by_song", ["songId"]) .index("by_song_time", ["songId", "startTime"]), tscribeInstances: defineTable({ articleId: v.string(), songId: v.id("songs"), section: v.string(), startTime: v.number(), endTime: v.number() }).index("by_article", ["articleId"]) }); ``` **Query examples:** ```typescript // Get all annotations for a song, ordered by time const annotations = await db.query("annotations") .withIndex("by_song_time", q => q.eq("songId", songId)) .order("asc", "startTime") .collect(); // Get annotations overlapping specific time range const overlap = await db.query("annotations") .withIndex("by_song_time", q => q.eq("songId", songId) .gte("startTime", 100) .lte("endTime", 200) ) .collect(); ``` ## See Also - [[Getting started with Convex]] - Convex setup - [[Langchain with Convex]] - Embeddings + vector search - [[Data Storage Architecture]] - NNT-specific schema details - [[Database Query Patterns]] - Efficient Convex queries