Tracking Human vs AI Authorship in Knowledge Bases

The Attribution Problem

As knowledge bases grow with both human-written and AI-generated content, a critical question emerges: How do you track authorship?

This isn't just academic—it has practical implications for:

Quality control: Human-authored content may need different review processes
Workflow decisions: Bulk refactoring strategies differ for human vs AI content
Schema enforcement: Human educational content often needs structural flexibility
Editing patterns: Substantial human editing time indicates conceptual investment
Monetization: Some content types (gear reviews) may warrant different strategies

The Naive Approach (Doesn't Work)

Idea: Add `author: human` or `author: ai` field

Why it fails:

Binary choice: Most content is hybrid (AI draft + human editing)
Terminal creation: If all content is created through the terminal (OpenCode), the creation path cannot distinguish authors, so the field stays ambiguous
Evolution: Content that starts AI-generated may become majority-human through edits
Maintenance burden: Manually updating author field as content evolves

Better: Infer Authorship from Behavioral Signals

Signal 1: `edited_seconds` Field

What it tracks: Cumulative time file is open in Obsidian editing mode

Interpretation (rough bands, not hard thresholds):

edited_seconds: 0 → Likely pure AI, never manually edited
edited_seconds: 50-100 → Light human editing (typos, formatting)
edited_seconds: 200-500 → Substantial human contribution
edited_seconds: 1000+ → Primarily human-authored or heavily reworked

Example query:

bash
# Find heavily human-edited articles (recursive search, no reliance on bash globstar)
grep -rE --include='*.md' '^edited_seconds: [5-9][0-9]{2}([^0-9]|$)' .  # 500-999 seconds
grep -rE --include='*.md' '^edited_seconds: [1-9][0-9]{3,}([^0-9]|$)' .  # 1000+ seconds

Advantage: Captures authorship as a spectrum, not binary

Limitation: Plugin must be configured with timeout to avoid counting idle time
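The bands above can be wrapped in a small helper for use in scripts. This is a minimal sketch, not plugin behavior: the function name `classify_edited_seconds` and the band labels are assumptions made for illustration. Because the listed ranges leave gaps (1-49, 101-199, 501-999), the sketch assigns each gap to the lower band.

```bash
# Map an edited_seconds value to one of the rough bands listed above.
# Band labels and gap handling are illustrative assumptions.
classify_edited_seconds() {
  local seconds="$1"
  # Reject empty or non-numeric input instead of guessing.
  case "$seconds" in
    ''|*[!0-9]*) echo "unknown"; return 1 ;;
  esac
  if [ "$seconds" -eq 0 ]; then
    echo "likely-pure-ai"
  elif [ "$seconds" -lt 200 ]; then
    echo "light-human-editing"
  elif [ "$seconds" -lt 1000 ]; then
    echo "substantial-human-contribution"
  else
    echo "primarily-human"
  fi
}
```

For example, `classify_edited_seconds 250` prints `substantial-human-contribution`.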

Signal 2: `template_type` Field (or Lack Thereof)

Pattern:

AI-generated content: Always has template_type (follows patterns)
Human-authored educational content: Often lacks strict template

Example:

bash
# Find articles without template_type (likely human-authored)
grep -rL --include='*.md' '^template_type:' .

Why this works: Humans write more organically, AI follows templates

Limitation: As you retroactively add template_type, this signal weakens

Signal 3: Content Markers

Human-authored indicators:

-module intro- sections (educational structure)
Personal anecdotes
First-person language
Organic section flows
Course-specific references

AI-generated indicators:

Strict pattern adherence
Formal tone throughout
Predictable section order
Generic examples

Query:

bash
# Find module-based educational content
grep -rl --include='*.md' 'module intro' .

Signal 4: File Creation vs Modification Time Ratio

Git-based analysis (commit count as a proxy for how often a file was modified after creation):

bash
# Files with many commits = human-evolved
git log --follow --oneline -- "file.md" | wc -l

Interpretation:

1-2 commits → Created and done (likely AI)
5+ commits → Iteratively refined (likely human)
10+ commits → Heavily curated (almost certainly human)
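The commit-count thresholds can be sketched as a pair of shell functions. This is a hedged illustration: the names `classify_commit_count` and `commit_signal` and the output labels are hypothetical. Counts of 3-4 fall between the listed bands, so labeling them `unclear` is an assumption, as is the `no-history` label for files with no commits.

```bash
# Classify a commit count using the thresholds listed above.
# The "unclear" (3-4) and "no-history" (0) labels are assumptions.
classify_commit_count() {
  local count="$1"
  if [ "$count" -ge 10 ]; then
    echo "heavily-curated"
  elif [ "$count" -ge 5 ]; then
    echo "iteratively-refined"
  elif [ "$count" -ge 3 ]; then
    echo "unclear"
  elif [ "$count" -ge 1 ]; then
    echo "created-and-done"
  else
    echo "no-history"
  fi
}

# Count commits touching a file (following renames), then classify.
# Must be run inside the git repository that tracks the file.
commit_signal() {
  local file="$1" count
  count=$(git log --follow --oneline -- "$file" 2>/dev/null | wc -l | tr -d ' ')
  classify_commit_count "$count"
}
```

Calling `commit_signal "file.md"` prints one label per file, which makes it easy to loop over a vault and sort the results.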

Signal 5: `schema_validated` Date vs `created` Date

Pattern:

Old files (created before schema standardization) → Human-authored
New files (created after AGENTS.md + templates) → AI-generated

Query:

bash
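# Files created before the schema was formalized (Nov 8, 2025)
# ISO 8601 dates sort lexicographically, so a string comparison covers all earlier dates
grep -rH --include='*.md' '^created:' . \
  | awk -F'created: ' '$2 < "2025-11-08" { sub(/:$/, "", $1); print $1 }'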

Hybrid Authorship Framework

Three Categories

1. Pure AI (Template-Driven)

yaml
created: 2025-11-08T15:00:00-0800
edited_seconds: 0
template_type: feature-guide
schema_validated: 2025-11-08

Created through terminal
Follows strict pattern
Zero or minimal editing time
Use case: Reference docs, feature guides, programmatic content

2. AI-Assisted (Hybrid)

yaml
created: 2025-11-08T14:00:00-0800
edited_seconds: 250
template_type: workflow-guide
schema_validated: 2025-11-08

AI generated initial draft
Human edited substantially (250 seconds)
Follows template but has customizations
Use case: Workflow documentation, conceptual explainers

3. Human-Authored (Organic)

yaml
created: 2025-10-15T10:30:00-0800
edited_seconds: 1840
# No template_type field
schema_validated: 2025-11-08  # Added retroactively

Created before template system
Or: Created manually in Obsidian
Substantial editing time
May not follow templates
Use case: Educational modules, course materials, creative content

Schema Flexibility Strategies

Problem: Human Educational Content Doesn't Fit Templates

Scenario: You have -module intro- articles for courses with unique structures:

Module 1: Basic Concepts
Module 2: Intermediate Techniques
Module 3: Advanced Applications

Each module has custom sections, pedagogical flow, embedded exercises.

If you enforce strict schema: You break the educational structure

Solution: Tiered Schema Enforcement

Tier 1: Universal Fields (Always Required)

Every file, regardless of authorship:

yaml
created: [timestamp]
updated: [timestamp]
edited_seconds: [number]
slug: [random-string]

Why: Enables basic tracking and programmatic operations

Tier 2: AI-Generated Fields (Required for AI Content)

Files created through templates:

yaml
template_type: [pattern-name]
schema_validated: [date]

Why: Enables pattern-based filtering and validation

Tier 3: Optional/Custom Fields (Content-Specific)

Gear guides:

yaml
manufacturer: [brand]
model: [model-number]
category: [type]

Educational modules:

yaml
course: [course-code]
module_number: [1-10]
prerequisites: [list]

Implementation: Validation Script with Tiers

bash
#!/bin/bash
# validate-tiered-schema.sh
# Requires bash 4+ (globstar). Exits non-zero if any ERROR was reported.
shopt -s globstar nullglob  # make **/*.md recurse; skip the loop if nothing matches

status=0
for file in **/*.md; do
  # Tier 1: Universal fields (always check)
  for field in created updated edited_seconds slug; do
    if ! grep -q "^${field}:" "$file"; then
      echo "ERROR: $file missing required field '${field}'"
      status=1
    fi
  done

  # Tier 2: Check if this looks AI-generated
  if grep -q "^template_type:" "$file"; then
    # AI content - enforce full schema
    if ! grep -q "^schema_validated:" "$file"; then
      echo "WARNING: $file has template_type but no schema_validated"
    fi
  else
    # Human content - skip template enforcement
    echo "INFO: $file appears human-authored, skipping template validation"
  fi

  # Tier 3: Content-specific validation
  if grep -q "^template_type: gear-guide" "$file"; then
    if ! grep -q "^manufacturer:" "$file"; then
      echo "ERROR: Gear guide $file missing manufacturer field"
      status=1
    fi
  fi
done
exit "$status"

Practical Workflow Decisions

Scenario 1: Bulk Refactoring

Question: Should I update all articles to new schema?

Decision tree:

bash
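# Check edited_seconds to gauge human investment
file="$1"  # path to the article being considered
edited_seconds=$(sed -n 's/^edited_seconds:[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$file" | head -n 1)
if [ "${edited_seconds:-0}" -gt 500 ]; then
  # High human investment - review manually before changes
  echo "Review this file individually"
else
  # Low editing (or field missing) - safe for bulk operations
  echo "Include in bulk update"
fi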

Scenario 2: Template Enforcement

Question: Should this article follow a template pattern?

Decision tree:

Check for module intro or course markers → Skip template enforcement
Check edited_seconds > 1000 → Likely custom structure, skip
Check creation date < schema formalization → Legacy content, skip
Otherwise → Encourage template adoption
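The decision tree above can be expressed as a function that returns the first matching outcome. This is a sketch under stated assumptions: the names `frontmatter_value` and `template_decision`, the `SCHEMA_DATE` variable, and the output strings are hypothetical, the cutoff date follows the "Before Nov 8, 2025" query earlier in this article, and a `course:` field is treated as a course marker. A file with no `created` field is treated as not legacy.

```bash
# Schema formalization date (assumed from the "Before Nov 8, 2025" query).
SCHEMA_DATE="${SCHEMA_DATE:-2025-11-08}"

# Print the value of the first "key: value" line in a file (empty if absent).
frontmatter_value() {
  sed -n "s/^$1:[[:space:]]*//p" "$2" | head -n 1
}

# Walk the decision tree in order and print the first matching outcome.
template_decision() {
  local file="$1" seconds created cutoff
  if grep -qi 'module intro' "$file" || grep -q '^course:' "$file"; then
    echo "skip: educational module"
    return 0
  fi
  seconds=$(frontmatter_value edited_seconds "$file")
  seconds=${seconds%%[!0-9]*}          # drop trailing comments or whitespace
  [ -n "$seconds" ] || seconds=0
  if [ "$seconds" -gt 1000 ]; then
    echo "skip: heavily edited"
    return 0
  fi
  # Compare dates as YYYYMMDD integers.
  created=$(frontmatter_value created "$file" | cut -c1-10 | tr -d '-')
  cutoff=$(printf '%s' "$SCHEMA_DATE" | tr -d '-')
  case "$created" in ''|*[!0-9]*) created=99999999 ;; esac
  if [ "$created" -lt "$cutoff" ]; then
    echo "skip: legacy content"
    return 0
  fi
  echo "encourage template adoption"
}
```

The checks run in the same order as the list, so a course module that is also heavily edited is reported under the first rule only.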

Scenario 3: Schema Validation

Question: Should validation fail for non-conforming files?

Modes:

Strict mode: Applies to AI-generated content (has template_type); missing required fields fail validation
Permissive mode: Applies to human-authored content (no template_type or high edited_seconds); issues are reported as warnings only

bash
# Validate in permissive mode
./validate-schema.sh --permissive

# Only fails on AI content missing required fields
# Warnings only for human content

The `edited_seconds` Problem

Issue: Plugin Counts Idle Time

Scenario: You open file for reference, leave it open for 20 minutes while doing other work. Plugin logs 1200 seconds.

Result: File appears heavily human-edited when it's not.

Solutions

1. Plugin Configuration (Ideal)

Check if plugin supports:

Idle timeout (stop counting after 60 seconds of no input)
Activity detection (only count when typing)
Pause/resume controls

2. Heuristic Filtering

bash
# Find files with suspiciously round editing times (possibly idle)
grep -rE --include='*.md' '^edited_seconds: [0-9]*00$' .  # Ends in 00
# Example: 300, 600, 1200 (may have sat open idle)

3. Manual Field

Add active_editing_time for known accurate values:

yaml
edited_seconds: 1240  # Plugin value (includes idle)
active_editing_time: 450  # Manual estimate

4. Accept Imperfection

Recognize that edited_seconds is a rough signal, not a precise measurement. Use it in combination with other signals.

Monetization Implications

Content Type → Monetization Strategy

Gear Guides (AI + Human Review)

yaml
template_type: gear-guide
manufacturer: Shure
model: SM7B
affiliate_eligible: true

Standardized structure
Easy to add affiliate links programmatically
Bulk update all gear guides with affiliate template

Query:

bash
# Find all gear guides for affiliate link insertion
grep -rl --include='*.md' '^template_type: gear-guide' .

Educational Modules (Pure Human)

yaml
course: RAT-120
module_number: 3
monetization: course-paywall

Custom educational content
Behind paywall or course enrollment
Not suitable for affiliate links

Reference Articles (AI-Generated)

yaml
template_type: feature-guide
monetization: adsense

High traffic potential
Ad-supported model
SEO-optimized

Programmatic Affiliate Link Insertion

Find gear guides without affiliate links:

bash
# Read file names line by line so paths containing spaces are handled correctly
grep -rl --include='*.md' '^template_type: gear-guide' . | while IFS= read -r file; do
  if ! grep -q "affiliate_link:" "$file"; then
    echo "Add affiliate link to: $file"
    # Extract manufacturer + model, generate affiliate URL
    manufacturer=$(grep -m1 "^manufacturer:" "$file" | cut -d' ' -f2-)
    model=$(grep -m1 "^model:" "$file" | cut -d' ' -f2-)
    echo "  manufacturer: $manufacturer, model: $model"
    # Insert affiliate link section
  fi
done

Result: Bulk monetization of AI-standardized content, while preserving educational content integrity

Recommended Schema Updates

Add Optional `authorship_signal` Field

yaml
authorship_signal: ai-generated  # or: ai-assisted, human-authored, hybrid

When to use:

ai-generated: created via terminal, edited_seconds: 0-50
ai-assisted: AI draft + substantial editing (edited_seconds: 100-500)
human-authored: created in Obsidian or edited_seconds: 500+
hybrid: multiple authors/iterations

Why optional: Can be inferred from other signals; explicit field helps with quick filtering
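Since the field can be inferred, a script can suggest a value rather than requiring manual entry. The sketch below is illustrative only: the function names `suggest_authorship_signal` and `suggest_for_file` are hypothetical. It assumes that the presence of `template_type` stands in for "created via terminal", and that values of 51-499 seconds count as `ai-assisted`. It never suggests `hybrid`, because multiple authors or iterations cannot be detected from these two fields.

```bash
# Suggest an authorship_signal value.
#   $1: edited_seconds (non-negative integer)
#   $2: "yes" if the file has a template_type field, anything else otherwise
suggest_authorship_signal() {
  local seconds="$1" has_template="$2"
  if [ "$seconds" -ge 500 ] || [ "$has_template" != "yes" ]; then
    echo "human-authored"
  elif [ "$seconds" -le 50 ]; then
    echo "ai-generated"
  else
    echo "ai-assisted"
  fi
}

# Read both inputs from a file and print "path<TAB>suggestion".
suggest_for_file() {
  local file="$1" seconds has_template="no"
  seconds=$(sed -n 's/^edited_seconds:[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$file" | head -n 1)
  if grep -q '^template_type:' "$file"; then
    has_template="yes"
  fi
  printf '%s\t%s\n' "$file" "$(suggest_authorship_signal "${seconds:-0}" "$has_template")"
}
```

Applied to the three example frontmatter blocks in the Hybrid Authorship Framework, this yields `ai-generated`, `ai-assisted`, and `human-authored` respectively. Treat the output as a suggestion to review, not a value to write back blindly.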

Add `content_flexibility` Flag

yaml
content_flexibility: strict  # or: flexible

When to use:

strict: Enforce template pattern, validate structure
flexible: Educational/creative content, skip template validation

Benefits:

Explicit signal for validation scripts
Self-documenting exceptions
Easy to query: grep -rl --include='*.md' "content_flexibility: flexible" .

Migration Strategy

For Existing Human-Authored Content

Don't: Force into templates retroactively

Do:

Add universal fields (slug, schema_validated)
Set content_flexibility: flexible
Note high edited_seconds as signal
Skip template_type field

Example:

yaml
---
created: 2025-10-15T10:30:00-0800
updated: 2025-11-08T15:00:00-0800
edited_seconds: 1840
slug: abc123xyz98
content_flexibility: flexible
schema_validated: 2025-11-08
# No template_type - human-authored educational content
---
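Setting the flag on many legacy files by hand is tedious, so the step can be scripted. The following is a minimal sketch, assuming each file begins with a `---` line and that the frontmatter ends at the next `---`. The function name `mark_flexible` is hypothetical, and it adds only `content_flexibility`; `slug` and `schema_validated` would still need to be added separately.

```bash
# Add "content_flexibility: flexible" to a file's frontmatter if the field is absent.
# Assumes the first line is '---' and the frontmatter closes at the next '---'.
mark_flexible() {
  local file="$1" tmp
  # Nothing to do if the field is already declared.
  if grep -q '^content_flexibility:' "$file"; then
    return 0
  fi
  if [ "$(head -n 1 "$file")" != "---" ]; then
    echo "SKIP: $file has no frontmatter" >&2
    return 1
  fi
  tmp=$(mktemp) || return 1
  # Insert the field just before the second '---' line (the closing delimiter).
  if awk '
    $0 == "---" { delims++; if (delims == 2) print "content_flexibility: flexible" }
    { print }
  ' "$file" > "$tmp"; then
    cat "$tmp" > "$file"   # overwrite in place, keeping the original permissions
  fi
  rm -f "$tmp"
}
```

Running it twice on the same file is safe, since the second call finds the field and returns without changes. Run it on a committed working tree so the result can be reviewed with `git diff`.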

For New AI-Generated Content

Always include:

yaml
created: [timestamp]
updated: [timestamp]
edited_seconds: 0
slug: [random]
template_type: [pattern]
schema_validated: [date]
authorship_signal: ai-generated
content_flexibility: strict

Validation Rules

Universal Validation (All Files)

✅ created field present
✅ slug field present
✅ schema_validated present (even if retroactive)

AI Content Validation (has `template_type`)

✅ template_type value is valid pattern name
✅ Structure matches declared template
✅ All template-specific fields present

Human Content Validation (no `template_type` OR `content_flexibility: flexible`)

⚠️ Warnings only, no errors
ℹ️ Info messages about missing fields
✅ Still check universal fields
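The three rule sets can be combined into a single per-file check whose severity depends on the content type. This is a hedged sketch rather than a definitive validator: the function name `validate_file` and the `VALID_TEMPLATES` list (built from the template names that appear in this article) are assumptions. It does not attempt the "structure matches declared template" check, and the only template-specific field it verifies is `manufacturer` for gear guides.

```bash
# Template names seen in this article; extend to match your own patterns.
VALID_TEMPLATES="feature-guide workflow-guide gear-guide"

# Validate one file. Returns 1 only if an ERROR was reported.
validate_file() {
  local file="$1" status=0 field template level="ERROR"
  template=$(sed -n 's/^template_type:[[:space:]]*\([^[:space:]]*\).*/\1/p' "$file" | head -n 1)
  # Human or flexible content: report problems as warnings, never fail.
  if [ -z "$template" ] || grep -q '^content_flexibility:[[:space:]]*flexible' "$file"; then
    level="WARNING"
  fi
  # Universal fields are checked for every file.
  for field in created slug schema_validated; do
    if ! grep -q "^${field}:" "$file"; then
      echo "${level}: $file missing universal field '${field}'"
      if [ "$level" = "ERROR" ]; then status=1; fi
    fi
  done
  # Template rules apply only in strict mode.
  if [ "$level" = "ERROR" ]; then
    case " $VALID_TEMPLATES " in
      *" $template "*) ;;
      *) echo "ERROR: $file has unknown template_type '${template}'"; status=1 ;;
    esac
    if [ "$template" = "gear-guide" ] && ! grep -q '^manufacturer:' "$file"; then
      echo "ERROR: gear guide $file missing 'manufacturer'"
      status=1
    fi
  fi
  return "$status"
}
```

A caller can loop over the vault and accumulate the return codes, so a CI job fails only when strict (AI-generated) content is out of line.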

Related

Pattern-Driven Knowledge Bases with AI Assistance - Overall methodology
Programmatic Schema Validation and Refactoring - Validation scripts
Interrupting AI Conversations - When and Why It Works - Workflow integration
