Tracking Human vs AI Authorship in Knowledge Bases
The Attribution Problem
As knowledge bases grow with both human-written and AI-generated content, a critical question emerges: How do you track authorship?
This isn't just academic—it has practical implications for:
- Quality control: Human-authored content may need different review processes
- Workflow decisions: Bulk refactoring strategies differ for human vs AI content
- Schema enforcement: Human educational content often needs structural flexibility
- Editing patterns: Substantial human editing time indicates conceptual investment
- Monetization: Some content types (gear reviews) may warrant different strategies
The Naive Approach (Doesn't Work)
Idea: Add author: human or author: ai field
Why it fails:
- Binary choice: Most content is hybrid (AI draft + human editing)
- Terminal creation: If all content comes through terminal/OpenCode, field is always ambiguous
- Evolution: Content that starts AI-generated may become majority-human through edits
- Maintenance burden: Manually updating author field as content evolves
Better: Infer Authorship from Behavioral Signals
Signal 1: edited_seconds Field
What it tracks: Cumulative time file is open in Obsidian editing mode
Interpretation:
- edited_seconds: 0 → Likely pure AI, never manually edited
- edited_seconds: 50-100 → Light human editing (typos, formatting)
- edited_seconds: 200-500 → Substantial human contribution
- edited_seconds: 1000+ → Primarily human-authored or heavily reworked
Example query:
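```bash
# Find heavily human-edited articles (recursive search from the vault root;
# plain **/*.md only recurses when bash's globstar option is enabled)
grep -rE --include='*.md' '^edited_seconds: [5-9][0-9]{2}([^0-9]|$)' .   # 500-999 seconds
grep -rE --include='*.md' '^edited_seconds: [1-9][0-9]{3,}([^0-9]|$)' .  # 1000+ seconds
```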
Advantage: Captures authorship as a spectrum, not binary
Limitation: Plugin must be configured with timeout to avoid counting idle time
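The interpretation ranges above leave gaps (1-49, 101-199, 501-999). A minimal sketch of a classifier follows. The function name and the choice to close the gaps at 200 and 1000 are assumptions for illustration, not part of any plugin.

```bash
#!/usr/bin/env bash
# Map an edited_seconds value to a rough authorship label.
# Assumption: 1-199 counts as light editing and 200-999 as substantial,
# which closes the gaps between the documented ranges.
classify_edit_time() {
  local seconds=$1
  if ! [[ $seconds =~ ^[0-9]+$ ]]; then
    echo "unknown"
    return
  fi
  seconds=$((10#$seconds))   # force base 10 so values like 0050 are not read as octal
  if (( seconds == 0 )); then
    echo "likely-pure-ai"
  elif (( seconds < 200 )); then
    echo "light-human-editing"
  elif (( seconds < 1000 )); then
    echo "substantial-human-contribution"
  else
    echo "primarily-human"
  fi
}
```

For example, `classify_edit_time 250` prints `substantial-human-contribution`. Treat the label as a hint to combine with the other signals rather than a verdict.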
Signal 2: template_type Field (or Lack Thereof)
Pattern:
- AI-generated content: Always has template_type (follows patterns)
- Human-authored educational content: Often lacks strict template
Example:
```bash
# Find articles without template_type (likely human-authored)
grep -rL --include='*.md' '^template_type:' .
```
Why this works: Humans write more organically, AI follows templates
Limitation: As you retroactively add template_type, this signal weakens
Signal 3: Content Markers
Human-authored indicators:
- "module intro" sections (educational structure)
- Personal anecdotes
- First-person language
- Organic section flows
- Course-specific references
AI-generated indicators:
- Strict pattern adherence
- Formal tone throughout
- Predictable section order
- Generic examples
Query:
```bash
# Find module-based educational content
grep -rl --include='*.md' 'module intro' .
```
Signal 4: File Creation vs Modification Time Ratio
Git-based analysis:
```bash
# Files with many commits = human-evolved
git log --follow --oneline -- "file.md" | wc -l
```
Interpretation:
- 1-2 commits → Created and done (likely AI)
- 5+ commits → Iteratively refined (likely human)
- 10+ commits → Heavily curated (almost certainly human)
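The commit-count thresholds can be applied across a whole repository. The sketch below assumes the vault is a Git repository. The function names are hypothetical, and labelling counts of 3-4 (which the list above does not cover) as "unclear" is an assumption.

```bash
#!/usr/bin/env bash
# Label a commit count using the thresholds listed above.
# Assumption: 3-4 commits fall between the documented ranges, so they
# are reported as "unclear".
commit_bucket() {
  local count=$1
  if (( count >= 10 )); then
    echo "heavily-curated"
  elif (( count >= 5 )); then
    echo "iteratively-refined"
  elif (( count >= 3 )); then
    echo "unclear"
  else
    echo "created-and-done"
  fi
}

# Print "<count> <bucket> <path>" for every tracked Markdown file.
report_commit_counts() {
  local file count
  git ls-files -z -- '*.md' | while IFS= read -r -d '' file; do
    # $(( ... )) strips the padding some wc implementations add
    count=$(( $(git log --follow --oneline -- "$file" | wc -l) ))
    printf '%d %s %s\n' "$count" "$(commit_bucket "$count")" "$file"
  done
}
```

Running `report_commit_counts | sort -rn` from the repository root lists the most frequently revised files first. Commits made by an AI agent also count here, so a high number shows iteration, not necessarily a human author.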
Signal 5: schema_validated Date vs created Date
Pattern:
- Old files (created before schema standardization) → Human-authored
- New files (created after AGENTS.md + templates) → AI-generated
Query:
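```bash
# Files created before the schema was formalized (before Nov 8, 2025).
# ISO 8601 timestamps sort lexicographically, so a string comparison
# also catches files from earlier months and years.
find . -name '*.md' -exec awk '/^created:/ { if ($2 < "2025-11-08") print FILENAME; nextfile }' {} +
```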
Hybrid Authorship Framework
Three Categories
1. Pure AI (Template-Driven)
```yaml
created: 2025-11-08T15:00:00-0800
edited_seconds: 0
template_type: feature-guide
schema_validated: 2025-11-08
```
- Created through terminal
- Follows strict pattern
- Zero or minimal editing time
- Use case: Reference docs, feature guides, programmatic content
2. AI-Assisted (Hybrid)
```yaml
created: 2025-11-08T14:00:00-0800
edited_seconds: 250
template_type: workflow-guide
schema_validated: 2025-11-08
```
- AI generated initial draft
- Human edited substantially (250 seconds)
- Follows template but has customizations
- Use case: Workflow documentation, conceptual explainers
3. Human-Authored (Organic)
```yaml
created: 2025-10-15T10:30:00-0800
edited_seconds: 1840
# No template_type field
schema_validated: 2025-11-08  # Added retroactively
```
- Created before template system
- Or: Created manually in Obsidian
- Substantial editing time
- May not follow templates
- Use case: Educational modules, course materials, creative content
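The three categories can be inferred from front matter alone. The following is a rough sketch under stated assumptions: the helper names are hypothetical, fields are matched as plain "key: value" lines, and the 100- and 500-second cut-offs are illustrative choices rather than fixed rules.

```bash
#!/usr/bin/env bash
# Read a top-level "key: value" field from a file (first match wins).
get_field() {
  local key=$1 file=$2
  sed -n "s/^${key}:[[:space:]]*//p" "$file" | head -n 1
}

# Print pure-ai, ai-assisted, or human-authored for one file.
# Assumptions: a missing template_type means human-authored, and the
# 100 s / 500 s thresholds separate the three categories.
classify_authorship() {
  local file=$1 seconds template
  seconds=$(get_field edited_seconds "$file")
  seconds=${seconds%%[!0-9]*}   # keep leading digits, drop trailing comments
  seconds=${seconds:-0}
  template=$(get_field template_type "$file")
  if [[ -z $template ]] || (( 10#$seconds >= 500 )); then
    echo "human-authored"
  elif (( 10#$seconds >= 100 )); then
    echo "ai-assisted"
  else
    echo "pure-ai"
  fi
}
```

A file with `edited_seconds: 250` and a `template_type` is reported as `ai-assisted`, matching the hybrid example above.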
Schema Flexibility Strategies
Problem: Human Educational Content Doesn't Fit Templates
Scenario: You have "module intro" articles for courses with unique structures:
- Module 1: Basic Concepts
- Module 2: Intermediate Techniques
- Module 3: Advanced Applications
Each module has custom sections, pedagogical flow, embedded exercises.
If you enforce a strict schema, you break the educational structure.
Solution: Tiered Schema Enforcement
Tier 1: Universal Fields (Always Required)
Every file, regardless of authorship:
```yaml
created: [timestamp]
updated: [timestamp]
edited_seconds: [number]
slug: [random-string]
```
Why: Enables basic tracking and programmatic operations
Tier 2: AI-Generated Fields (Required for AI Content)
Files created through templates:
```yaml
template_type: [pattern-name]
schema_validated: [date]
```
Why: Enables pattern-based filtering and validation
Tier 3: Optional/Custom Fields (Content-Specific)
Gear guides:
```yaml
manufacturer: [brand]
model: [model-number]
category: [type]
```
Educational modules:
```yaml
course: [course-code]
module_number: [1-10]
prerequisites: [list]
```
Implementation: Validation Script with Tiers
```bash
#!/bin/bash
# validate-tiered-schema.sh
shopt -s globstar nullglob   # make **/*.md recurse; skip the loop if nothing matches

for file in **/*.md; do
  # Tier 1: Universal (always check every required field)
  for field in created updated edited_seconds slug; do
    if ! grep -q "^${field}:" "$file"; then
      echo "ERROR: $file missing required field '$field'"
    fi
  done

  # Tier 2: Check if this looks AI-generated
  if grep -q "^template_type:" "$file"; then
    # AI content - enforce full schema
    if ! grep -q "^schema_validated:" "$file"; then
      echo "WARNING: $file has template_type but no schema_validated"
    fi
  else
    # Human content - skip template enforcement
    echo "INFO: $file appears human-authored, skipping template validation"
  fi

  # Tier 3: Content-specific validation
  if grep -q "^template_type: gear-guide" "$file"; then
    if ! grep -q "^manufacturer:" "$file"; then
      echo "ERROR: Gear guide $file missing manufacturer field"
    fi
  fi
done
```
Practical Workflow Decisions
Scenario 1: Bulk Refactoring
Question: Should I update all articles to new schema?
Decision tree:
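```bash
# Check edited_seconds to gauge human investment
# ($file is the article under consideration)
edited_seconds=$(awk '/^edited_seconds:/ { print $2; exit }' "$file")
if [ "${edited_seconds:-0}" -gt 500 ]; then
  # High human investment - review manually before changes
  echo "Review this file individually"
else
  # Low editing - safe for bulk operations
  echo "Include in bulk update"
fi
```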
Scenario 2: Template Enforcement
Question: Should this article follow a template pattern?
Decision tree:
- Check for module intro or course markers → Skip template enforcement
- Check edited_seconds > 1000 → Likely custom structure, skip
- Check creation date < schema formalization → Legacy content, skip
- Otherwise → Encourage template adoption
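The decision tree translates into a short function. This is a sketch under assumptions: the function names are hypothetical, a `course:` field is treated as the "course marker", and the schema formalization date is taken to be 2025-11-08 as in the date query earlier in this article.

```bash
#!/usr/bin/env bash
SCHEMA_DATE="2025-11-08"   # assumed schema formalization date

# Read a top-level "key: value" field from a file (first match wins).
get_field() {
  local key=$1 file=$2
  sed -n "s/^${key}:[[:space:]]*//p" "$file" | head -n 1
}

# Print "skip: <reason>" or "enforce: <reason>" for one file,
# checking the rules in the same order as the decision tree.
template_decision() {
  local file=$1 seconds created
  if grep -qiE 'module intro|^course:' "$file"; then
    echo "skip: course or module markers"
    return
  fi
  seconds=$(get_field edited_seconds "$file")
  seconds=${seconds%%[!0-9]*}   # keep leading digits only
  if (( 10#${seconds:-0} > 1000 )); then
    echo "skip: heavy editing suggests custom structure"
    return
  fi
  created=$(get_field created "$file")
  # ISO dates compare correctly as strings; use only the YYYY-MM-DD part
  if [[ -n $created && ${created:0:10} < $SCHEMA_DATE ]]; then
    echo "skip: legacy content"
    return
  fi
  echo "enforce: encourage template adoption"
}
```

Because the rules run in order, a legacy course module is reported with the first matching reason only.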
Scenario 3: Schema Validation
Question: Should validation fail for non-conforming files?
Modes:
- Strict mode: Only AI-generated content (has template_type)
- Permissive mode: Human-authored content (no template_type or high edited_seconds)
```bash
# Validate in permissive mode
./validate-schema.sh --permissive
# Only fails on AI content missing required fields
# Warnings only for human content
```
The edited_seconds Problem
Issue: Plugin Counts Idle Time
Scenario: You open file for reference, leave it open for 20 minutes while doing other work. Plugin logs 1200 seconds.
Result: File appears heavily human-edited when it's not.
Solutions
1. Plugin Configuration (Ideal)
Check if plugin supports:
- Idle timeout (stop counting after 60 seconds of no input)
- Activity detection (only count when typing)
- Pause/resume controls
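To illustrate what an idle timeout does, the sketch below sums editing time from a list of input-event timestamps and ignores any gap longer than the timeout. The event log is hypothetical: whether a given plugin exposes such data is an assumption, and the function name is made up for this example.

```bash
#!/usr/bin/env bash
# Sum active editing time from input-event timestamps read on stdin
# (Unix seconds, one per line, ascending). Gaps longer than the idle
# timeout (default 60 s) are treated as idle and not counted.
active_seconds() {
  local timeout=${1:-60}
  awk -v timeout="$timeout" '
    NR > 1 { gap = $1 - prev; if (gap <= timeout) total += gap }
    { prev = $1 }
    END { print total + 0 }
  '
}
```

With events at 0, 10, 25, 1225, and 1230 seconds, `printf '%s\n' 0 10 25 1225 1230 | active_seconds 60` prints 30, whereas a naive open-to-close measurement would record 1230 seconds.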
2. Heuristic Filtering
```bash
# Find files with suspiciously round editing times (likely idle)
grep -rE --include='*.md' '^edited_seconds: [0-9]*00$' .   # Ends in 00
# Example: 300, 600, 1200 (likely sat open idle)
```
3. Manual Field
Add active_editing_time for known accurate values:
```yaml
edited_seconds: 1240        # Plugin value (includes idle)
active_editing_time: 450    # Manual estimate
```
4. Accept Imperfection
Recognize that edited_seconds is a rough signal, not a precise measurement. Use it in combination with other signals.
Monetization Implications
Content Type → Monetization Strategy
Gear Guides (AI + Human Review)
```yaml
template_type: gear-guide
manufacturer: Shure
model: SM7B
affiliate_eligible: true
```
- Standardized structure
- Easy to add affiliate links programmatically
- Bulk update all gear guides with affiliate template
Query:
```bash
# Find all gear guides for affiliate link insertion
grep -rl --include='*.md' '^template_type: gear-guide' .
```
Educational Modules (Pure Human)
```yaml
course: RAT-120
module_number: 3
monetization: course-paywall
```
- Custom educational content
- Behind paywall or course enrollment
- Not suitable for affiliate links
Reference Articles (AI-Generated)
```yaml
template_type: feature-guide
monetization: adsense
```
- High traffic potential
- Ad-supported model
- SEO-optimized
Programmatic Affiliate Link Insertion
Find gear guides without affiliate links:
```bash
# List gear guides that still lack an affiliate link.
# NUL-separated names keep paths with spaces intact.
grep -rl --null --include='*.md' '^template_type: gear-guide' . |
while IFS= read -r -d '' file; do
  if ! grep -q "affiliate_link:" "$file"; then
    echo "Add affiliate link to: $file"
    # Extract manufacturer + model, generate affiliate URL
    manufacturer=$(grep "^manufacturer:" "$file" | cut -d' ' -f2-)
    model=$(grep "^model:" "$file" | cut -d' ' -f2-)
    # Insert affiliate link section
  fi
done
```
Result: Bulk monetization of AI-standardized content, while preserving educational content integrity
Recommended Schema Updates
Add Optional authorship_signal Field
```yaml
authorship_signal: ai-generated  # or: ai-assisted, human-authored, hybrid
```
When to use:
- ai-generated: created via terminal, edited_seconds: 0-50
- ai-assisted: AI draft + substantial editing (edited_seconds: 100-500)
- human-authored: created in Obsidian or edited_seconds: 500+
- hybrid: multiple authors/iterations
Why optional: Can be inferred from other signals; explicit field helps with quick filtering
Add content_flexibility Flag
```yaml
content_flexibility: strict  # or: flexible
```
When to use:
- strict: Enforce template pattern, validate structure
- flexible: Educational/creative content, skip template validation
Benefits:
- Explicit signal for validation scripts
- Self-documenting exceptions
- Easy to query: grep "content_flexibility: flexible"
Migration Strategy
For Existing Human-Authored Content
Don't: Force into templates retroactively
Do:
- Add universal fields (slug, schema_validated)
- Set content_flexibility: flexible
- Note high edited_seconds as signal
- Skip template_type field
Example:
```yaml
---
created: 2025-10-15T10:30:00-0800
updated: 2025-11-08T15:00:00-0800
edited_seconds: 1840
slug: abc123xyz98
content_flexibility: flexible
schema_validated: 2025-11-08
# No template_type - human-authored educational content
---
```
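Setting the flag by hand does not scale to a large vault. One possible way to add it in bulk is sketched here. The function name is hypothetical, and it assumes front matter is delimited by `---` lines with the opening delimiter on the first line of the file.

```bash
#!/usr/bin/env bash
# Insert "content_flexibility: flexible" just before the closing '---' of
# the front matter. Files that already declare content_flexibility or
# template_type, or that have no front matter, are left unchanged.
mark_flexible() {
  local file=$1 tmp
  if grep -qE '^(content_flexibility|template_type):' "$file"; then
    return 0
  fi
  tmp=$(mktemp) || return 1
  awk '
    NR == 1 && $0 !~ /^---[[:space:]]*$/ { skip = 1 }   # no front matter
    !skip && !done && NR > 1 && /^---[[:space:]]*$/ {
      print "content_flexibility: flexible"
      done = 1
    }
    { print }
  ' "$file" > "$tmp" && cat "$tmp" > "$file"   # cat keeps the file permissions
  rm -f "$tmp"
}
```

Run it on a copy or a clean Git working tree first, for example `mark_flexible notes/module-1.md` (a made-up path), and review the diff before committing.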
For New AI-Generated Content
Always include:
```yaml
created: [timestamp]
updated: [timestamp]
edited_seconds: 0
slug: [random]
template_type: [pattern]
schema_validated: [date]
authorship_signal: ai-generated
content_flexibility: strict
```
Validation Rules
Universal Validation (All Files)
✅ created field present
✅ slug field present
✅ schema_validated present (even if retroactive)
AI Content Validation (has template_type)
✅ template_type value is valid pattern name
✅ Structure matches declared template
✅ All template-specific fields present
Human Content Validation (no template_type OR content_flexibility: flexible)
⚠️ Warnings only, no errors
ℹ️ Info messages about missing fields
✅ Still check universal fields
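A minimal sketch of how the severity split might look in a validator is shown below. The function name is hypothetical. It covers only the universal fields and does not check template names or structure.

```bash
#!/usr/bin/env bash
# Check the universal fields for one file.
# AI content (has template_type, not marked flexible) gets ERRORs and a
# non-zero return code; human content gets WARNINGs and returns 0.
validate_file() {
  local file=$1 field severity status=0
  if grep -q '^content_flexibility: flexible' "$file" ||
     ! grep -q '^template_type:' "$file"; then
    severity=WARNING
  else
    severity=ERROR
  fi
  for field in created slug schema_validated; do
    if ! grep -q "^${field}:" "$file"; then
      echo "$severity: $file missing '$field'"
      if [[ $severity == ERROR ]]; then
        status=1
      fi
    fi
  done
  return "$status"
}
```

A caller can loop over the vault and fail a build only when some `validate_file` call returns non-zero, so human-authored notes never block the run.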
Related
- Pattern-Driven Knowledge Bases with AI Assistance - Overall methodology
- Programmatic Schema Validation and Refactoring - Validation scripts
- Interrupting AI Conversations - When and Why It Works - Workflow integration