Tracking Human vs AI Authorship in Knowledge Bases

Updated 4/23/2026

The Attribution Problem

As knowledge bases grow with both human-written and AI-generated content, a critical question emerges: How do you track authorship?

This isn't just academic—it has practical implications for:

  • Quality control: Human-authored content may need different review processes
  • Workflow decisions: Bulk refactoring strategies differ for human vs AI content
  • Schema enforcement: Human educational content often needs structural flexibility
  • Editing patterns: Substantial human editing time indicates conceptual investment
  • Monetization: Some content types (gear reviews) may warrant different strategies

The Naive Approach (Doesn't Work)

Idea: Add author: human or author: ai field

Why it fails:

  1. Binary choice: Most content is hybrid (AI draft + human editing)
  2. Terminal creation: If all content is created through the terminal/OpenCode, the creation path says nothing about who wrote it, so the field stays ambiguous
  3. Evolution: Content that starts AI-generated may become majority-human through edits
  4. Maintenance burden: Manually updating author field as content evolves

Better: Infer Authorship from Behavioral Signals

Signal 1: edited_seconds Field

What it tracks: Cumulative time the file has been open in Obsidian's editing mode

Interpretation:

  • edited_seconds: 0 → Likely pure AI, never manually edited
  • edited_seconds: 50-100 → Light human editing (typos, formatting)
  • edited_seconds: 200-500 → Substantial human contribution
  • edited_seconds: 1000+ → Primarily human-authored or heavily reworked

Example query:

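bash
# Find heavily human-edited articles
# grep -r searches recursively, so this does not depend on bash's globstar option for **
grep -rE --include='*.md' '^edited_seconds: [5-9][0-9]{2}([^0-9]|$)' .  # 500-999 seconds
grep -rE --include='*.md' '^edited_seconds: [1-9][0-9]{3,}' .           # 1000+ seconds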

Advantage: Captures authorship as a spectrum, not binary

Limitation: Plugin must be configured with timeout to avoid counting idle time
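
The interpretation bands above can be turned into a small helper for scripts. This is a minimal sketch: the function name classify_edit_time and the band labels are made up for illustration, and because the article's ranges leave gaps, the edges used here (0, 1-199, 200-999, 1000+) are an assumption, not part of the original scheme.

```shell
# Map an edited_seconds value to a rough authorship band.
# Band edges are assumptions that close the gaps between the article's ranges:
#   0 -> pure-ai, 1-199 -> light-editing, 200-999 -> substantial, 1000+ -> primarily-human
classify_edit_time() {
  secs=${1:-0}
  if [ "$secs" -eq 0 ]; then
    echo "pure-ai"
  elif [ "$secs" -lt 200 ]; then
    echo "light-editing"
  elif [ "$secs" -lt 1000 ]; then
    echo "substantial"
  else
    echo "primarily-human"
  fi
}
```

Usage: classify_edit_time 250 prints substantial. Treat the result as one signal among several, not a verdict.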

Signal 2: template_type Field (or Lack Thereof)

Pattern:

  • AI-generated content: Always has template_type (follows patterns)
  • Human-authored educational content: Often lacks strict template

Example:

bash
# Find articles without template_type (likely human-authored)
grep -rL --include='*.md' '^template_type:' .

Why this works: Humans tend to write more organically, while AI follows templates

Limitation: As you retroactively add template_type, this signal weakens

Signal 3: Content Markers

Human-authored indicators:

  • -module intro- sections (educational structure)
  • Personal anecdotes
  • First-person language
  • Organic section flows
  • Course-specific references

AI-generated indicators:

  • Strict pattern adherence
  • Formal tone throughout
  • Predictable section order
  • Generic examples

Query:

bash
# Find module-based educational content
grep -rl --include='*.md' 'module intro' .

Signal 4: Commit Count in Git History

Git-based analysis:

bash
# Files with many commits = human-evolved
git log --follow --oneline -- "file.md" | wc -l

Interpretation:

  • 1-2 commits → Created and done (likely AI)
  • 5+ commits → Iteratively refined (likely human)
  • 10+ commits → Heavily curated (very likely human)
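
To apply these bands across a whole vault, the per-file count can be piped through a labeling step. A rough sketch follows; label_commit_counts is a hypothetical helper, and since the article does not say how to read 3-4 commits, those are left as unclassified.

```shell
# Read "COUNT PATH" lines on stdin and prefix each with a band label.
# Bands follow the interpretation above; 3-4 commits are not covered there,
# so they are reported as "unclassified" (an assumption).
label_commit_counts() {
  awk '{
    n = $1
    if (n >= 10)     band = "heavily-curated"
    else if (n >= 5) band = "iteratively-refined"
    else if (n <= 2) band = "created-and-done"
    else             band = "unclassified"
    print band "\t" $0
  }'
}

# Usage inside a git repository:
#   git ls-files '*.md' | while IFS= read -r f; do
#     printf '%s %s\n' "$(git log --follow --oneline -- "$f" | wc -l)" "$f"
#   done | label_commit_counts
```

Commit counts only reflect human iteration if humans are the ones committing; automated commits inflate the number.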

Signal 5: schema_validated Date vs created Date

Pattern:

  • Old files (created before schema standardization) → Human-authored
  • New files (created after AGENTS.md + templates) → AI-generated

Query:

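bash
# Files created before schema was formalized
# ISO 8601 timestamps sort as strings, so a string comparison also catches October and earlier
grep -rH --include='*.md' '^created:' . | awk -F'created: ' '$2 < "2025-11-08"'  # Before Nov 8, 2025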

Hybrid Authorship Framework

Three Categories

1. Pure AI (Template-Driven)

yaml
created: 2025-11-08T15:00:00-0800
edited_seconds: 0
template_type: feature-guide
schema_validated: 2025-11-08

  • Created through terminal
  • Follows strict pattern
  • Zero or minimal editing time
  • Use case: Reference docs, feature guides, programmatic content

2. AI-Assisted (Hybrid)

yaml
created: 2025-11-08T14:00:00-0800
edited_seconds: 250
template_type: workflow-guide
schema_validated: 2025-11-08

  • AI generated initial draft
  • Human edited substantially (250 seconds)
  • Follows template but has customizations
  • Use case: Workflow documentation, conceptual explainers

3. Human-Authored (Organic)

yaml
created: 2025-10-15T10:30:00-0800
edited_seconds: 1840
# No template_type field
schema_validated: 2025-11-08  # Added retroactively

  • Created before template system
  • Or: Created manually in Obsidian
  • Substantial editing time
  • May not follow templates
  • Use case: Educational modules, course materials, creative content

Schema Flexibility Strategies

Problem: Human Educational Content Doesn't Fit Templates

Scenario: You have -module intro- articles for courses with unique structures:

  • Module 1: Basic Concepts
  • Module 2: Intermediate Techniques
  • Module 3: Advanced Applications

Each module has custom sections, pedagogical flow, embedded exercises.

If you enforce strict schema: You break the educational structure

Solution: Tiered Schema Enforcement

Tier 1: Universal Fields (Always Required)

Every file, regardless of authorship:

yaml
created: [timestamp]
updated: [timestamp]
edited_seconds: [number]
slug: [random-string]

Why: Enables basic tracking and programmatic operations
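
Programmatic operations are more reliable when fields are read from the front matter only, since a plain grep also matches lines in the article body. Here is a minimal sketch of such a reader. The name fm_get is hypothetical, and it assumes each file opens with a YAML block delimited by --- lines and that values are simple key: value scalars.

```shell
# Print the value of one front-matter field from a Markdown file.
# Usage: fm_get FILE FIELD
# Assumes the file starts with a '---' delimited block of "key: value" lines.
fm_get() {
  awk -v key="$2" '
    NR == 1 && $0 == "---" { in_fm = 1; next }   # opening delimiter
    in_fm && $0 == "---"   { exit }               # closing delimiter: stop reading
    in_fm && index($0, key ":") == 1 {
      val = substr($0, length(key) + 2)
      sub(/^[ \t]+/, "", val)      # trim leading whitespace
      sub(/[ \t]+#.*$/, "", val)   # drop a trailing YAML comment
      sub(/[ \t]+$/, "", val)      # trim trailing whitespace
      print val
      exit
    }
  ' "$1"
}
```

Usage: fm_get article.md edited_seconds prints the number, or nothing if the field or the front matter is missing.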

Tier 2: AI-Generated Fields (Required for AI Content)

Files created through templates:

yaml
template_type: [pattern-name]
schema_validated: [date]

Why: Enables pattern-based filtering and validation

Tier 3: Optional/Custom Fields (Content-Specific)

Gear guides:

yaml
manufacturer: [brand]
model: [model-number]
category: [type]

Educational modules:

yaml
course: [course-code]
module_number: [1-10]
prerequisites: [list]

Implementation: Validation Script with Tiers

bash
#!/bin/bash
# validate-tiered-schema.sh
shopt -s globstar nullglob  # bash 4+: make ** recurse; skip the loop when nothing matches

for file in **/*.md; do
  # Tier 1: Universal (always check every required field)
  for field in created updated edited_seconds slug; do
    if ! grep -q "^$field:" "$file"; then
      echo "ERROR: $file missing required field '$field'"
    fi
  done

  # Tier 2: Check if this looks AI-generated
  if grep -q "^template_type:" "$file"; then
    # AI content - enforce full schema
    if ! grep -q "^schema_validated:" "$file"; then
      echo "WARNING: $file has template_type but no schema_validated"
    fi
  else
    # Human content - skip template enforcement
    echo "INFO: $file appears human-authored, skipping template validation"
  fi

  # Tier 3: Content-specific validation
  if grep -q "^template_type: gear-guide" "$file"; then
    if ! grep -q "^manufacturer:" "$file"; then
      echo "ERROR: Gear guide $file missing manufacturer field"
    fi
  fi
done

Practical Workflow Decisions

Scenario 1: Bulk Refactoring

Question: Should I update all articles to new schema?

Decision tree:

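bash
# Check edited_seconds to gauge human investment
file="$1"  # path to the article being considered
# Read the number from the file; [ a > b ] would redirect output instead of comparing
edited_seconds=$(sed -n 's/^edited_seconds:[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$file" | head -n 1)
if [ "${edited_seconds:-0}" -gt 500 ]; then
  # High human investment - review manually before changes
  echo "Review this file individually"
else
  # Low editing - safe for bulk operations
  echo "Include in bulk update"
fi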

Scenario 2: Template Enforcement

Question: Should this article follow a template pattern?

Decision tree:

  1. Check for module intro or course markers → Skip template enforcement
  2. Check edited_seconds > 1000 → Likely custom structure, skip
  3. Check creation date < schema formalization → Legacy content, skip
  4. Otherwise → Encourage template adoption
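
The four steps above can be sketched as a single function. This is an illustration, not a finished tool: template_decision is a made-up name, the cutoff date is taken from the "Before Nov 8, 2025" query earlier in the article, and treating a course: field as the "course marker" is an assumption.

```shell
# Decide whether to encourage template adoption for one file.
# Prints "skip: <reason>" or "encourage".
SCHEMA_DATE="2025-11-08"   # assumed schema formalization date

template_decision() {
  file=$1
  # 1. Module or course markers (a "course:" field is an assumed marker)
  if grep -q -e 'module intro' -e '^course:' "$file"; then
    echo "skip: educational module"; return
  fi
  # 2. Heavy editing time suggests a custom structure
  secs=$(sed -n 's/^edited_seconds:[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$file" | head -n 1)
  if [ "${secs:-0}" -gt 1000 ]; then
    echo "skip: custom structure likely"; return
  fi
  # 3. Created before schema formalization (ISO dates sort as strings)
  created=$(sed -n 's/^created:[[:space:]]*\([0-9-]*\).*/\1/p' "$file" | head -n 1)
  if [ -n "$created" ]; then
    earliest=$(printf '%s\n%s\n' "$created" "$SCHEMA_DATE" | LC_ALL=C sort | head -n 1)
    if [ "$earliest" != "$SCHEMA_DATE" ]; then
      echo "skip: legacy content"; return
    fi
  fi
  # 4. Otherwise
  echo "encourage"
}
```

The checks run in the same order as the list, so the first matching reason wins.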

Scenario 3: Schema Validation

Question: Should validation fail for non-conforming files?

Modes:

  • Strict mode: Applies to AI-generated content (has template_type); missing required fields fail validation
  • Permissive mode: Applies to human-authored content (no template_type or high edited_seconds); problems are reported as warnings only

bash
# Validate in permissive mode
./validate-schema.sh --permissive

# Only fails on AI content missing required fields
# Warnings only for human content

The edited_seconds Problem

Issue: Plugin Counts Idle Time

Scenario: You open a file for reference and leave it open for 20 minutes while doing other work. The plugin logs 1200 seconds.

Result: File appears heavily human-edited when it's not.

Solutions

1. Plugin Configuration (Ideal)

Check if the plugin supports:

  • Idle timeout (stop counting after 60 seconds of no input)
  • Activity detection (only count when typing)
  • Pause/resume controls
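
To make the idle-timeout idea concrete, here is a sketch of the arithmetic. It is not how any particular plugin works: active_seconds is a hypothetical function that takes input-event timestamps (epoch seconds, one per line, in ascending order) and counts each gap between events only up to the timeout.

```shell
# Sum active editing time from input-event timestamps on stdin.
# Usage: active_seconds [TIMEOUT]   (TIMEOUT defaults to 60 seconds)
# Each gap between consecutive events counts for at most TIMEOUT seconds,
# so counting stops once the editor has been idle that long.
active_seconds() {
  timeout=${1:-60}
  awk -v timeout="$timeout" '
    NR > 1 {
      gap = $1 - prev
      total += (gap < timeout) ? gap : timeout
    }
    { prev = $1 }
    END { print total + 0 }
  '
}
```

With events at 0, 10, 30, and 1230 seconds, the gaps are 10, 20, and 1200. The last gap is capped at 60, giving 90 active seconds instead of 1230.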

2. Heuristic Filtering

bash
# Find files with suspiciously round editing times (likely idle)
grep -rE --include='*.md' '^edited_seconds: [0-9]+00([^0-9]|$)' .  # Ends in 00
# Example: 300, 600, 1200 (likely sat open idle)

3. Manual Field

Add active_editing_time for known accurate values:

yaml
edited_seconds: 1240  # Plugin value (includes idle)
active_editing_time: 450  # Manual estimate

4. Accept Imperfection

Recognize that edited_seconds is a rough signal, not a precise measurement. Use it in combination with other signals.

Monetization Implications

Content Type → Monetization Strategy

Gear Guides (AI + Human Review)

yaml
template_type: gear-guide
manufacturer: Shure
model: SM7B
affiliate_eligible: true

  • Standardized structure
  • Easy to add affiliate links programmatically
  • Bulk update all gear guides with affiliate template

Query:

bash
# Find all gear guides for affiliate link insertion
grep -rl --include='*.md' '^template_type: gear-guide' .

Educational Modules (Pure Human)

yaml
course: RAT-120
module_number: 3
monetization: course-paywall

  • Custom educational content
  • Behind paywall or course enrollment
  • Not suitable for affiliate links

Reference Articles (AI-Generated)

yaml
template_type: feature-guide
monetization: adsense

  • High traffic potential
  • Ad-supported model
  • SEO-optimized

Find gear guides without affiliate links:

bash
# Read paths line by line so file names with spaces survive
grep -rl --include='*.md' '^template_type: gear-guide' . | while IFS= read -r file; do
  if ! grep -q "^affiliate_link:" "$file"; then
    echo "Add affiliate link to: $file"
    # Extract manufacturer + model, generate affiliate URL
    manufacturer=$(grep -m1 "^manufacturer:" "$file" | cut -d' ' -f2-)
    model=$(grep -m1 "^model:" "$file" | cut -d' ' -f2-)
    # Insert affiliate link section
  fi
done

Result: Bulk monetization of AI-standardized content, while preserving educational content integrity

Add Optional authorship_signal Field

yaml
authorship_signal: ai-generated  # or: ai-assisted, human-authored, hybrid

When to use:

  • ai-generated: created via terminal, edited_seconds: 0-50
  • ai-assisted: AI draft + substantial editing (edited_seconds: 100-500)
  • human-authored: created in Obsidian or edited_seconds: 500+
  • hybrid: multiple authors/iterations

Why optional: Can be inferred from other signals; explicit field helps with quick filtering

Add content_flexibility Flag

yaml
content_flexibility: strict  # or: flexible

When to use:

  • strict: Enforce template pattern, validate structure
  • flexible: Educational/creative content, skip template validation

Benefits:

  • Explicit signal for validation scripts
  • Self-documenting exceptions
  • Easy to query: grep "content_flexibility: flexible"

Migration Strategy

For Existing Human-Authored Content

Don't: Force into templates retroactively

Do:

  1. Add universal fields (slug, schema_validated)
  2. Set content_flexibility: flexible
  3. Note high edited_seconds as signal
  4. Skip template_type field

Example:

yaml
---
created: 2025-10-15T10:30:00-0800
updated: 2025-11-08T15:00:00-0800
edited_seconds: 1840
slug: abc123xyz98
content_flexibility: flexible
schema_validated: 2025-11-08
# No template_type - human-authored educational content
---
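
Step 2 of the list above can be scripted for legacy files. This is a sketch under stated assumptions: add_flexible_flag is a hypothetical helper, it expects front matter delimited by --- lines as in the example, and it writes the result to stdout rather than editing the file in place.

```shell
# Print FILE with "content_flexibility: flexible" added to its front matter.
# The output is identical to the input when the file already has the flag,
# has a template_type, or has no front matter at all.
add_flexible_flag() {
  awk '
    NR == 1 && $0 == "---" { in_fm = 1; print; next }
    in_fm && /^(template_type|content_flexibility):/ { skip = 1 }
    in_fm && $0 == "---" {
      if (!skip) print "content_flexibility: flexible"   # insert before the closing ---
      in_fm = 0
    }
    { print }
  ' "$1"
}
```

Usage: add_flexible_flag old-module.md > old-module.md.new, then review the diff before replacing the original.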

For New AI-Generated Content

Always include:

yaml
created: [timestamp]
updated: [timestamp]
edited_seconds: 0
slug: [random]
template_type: [pattern]
schema_validated: [date]
authorship_signal: ai-generated
content_flexibility: strict

Validation Rules

Universal Validation (All Files)

  • ✅ created field present
  • ✅ slug field present
  • ✅ schema_validated present (even if retroactive)

AI Content Validation (has template_type)

  • ✅ template_type value is valid pattern name
  • ✅ Structure matches declared template
  • ✅ All template-specific fields present
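
The first of these checks could look like the sketch below. The function name is_valid_template_type is hypothetical, and the allowed list contains only the pattern names that appear in this article (feature-guide, workflow-guide, gear-guide); a real vault would likely have more.

```shell
# Succeed (exit status 0) only for known template_type values.
# The list is an assumption built from the names used in this article.
is_valid_template_type() {
  case "$1" in
    feature-guide|workflow-guide|gear-guide) return 0 ;;
    *) return 1 ;;
  esac
}
```

Usage: is_valid_template_type "$value" || echo "ERROR: unknown template_type '$value'".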

Human Content Validation (no template_type OR content_flexibility: flexible)

  • ⚠️ Warnings only, no errors
  • ℹ️ Info messages about missing fields
  • ✅ Still check universal fields