AI Evaluation

Evaluation Methodology

A systematic evaluation of named entity extraction quality on U.S. government domain text, comparing a Claude LLM extractor against a spaCy statistical NER baseline.

Overview

This evaluation measures the quality of named entity extraction on U.S. government domain text. Two extractors are compared: a spaCy statistical NER pipeline (baseline) and a Claude LLM extractor (challenger) — both scored against a curated gold dataset using precision, recall, and F1 metrics.

The evaluation uses 113 articles across three government branches (legislative, executive, judicial) plus a CoNLL-2003 general-domain validation set. A fuzzy matching strategy with 6 priority levels prevents inflating error rates due to reasonable boundary differences between extractors and gold annotations.

Headline result — government-domain F1: **Claude 0.60** vs **spaCy 0.31**. Claude achieves roughly 2× spaCy's F1 on government-domain text, while spaCy leads on the general-domain CoNLL-2003 set (F1 0.905 vs 0.867).

Entity Taxonomy

Both extractors produce entities in a shared 7-type taxonomy designed for government domain text.

| Type | Description | Examples |
| --- | --- | --- |
| person | Named individuals | Elizabeth Warren, John Roberts |
| government_org | Government bodies, agencies, courts | Senate, EPA, Supreme Court |
| organization | Non-government organizations | Georgetown University, AP |
| location | Geographic entities | Washington, Pennsylvania |
| event | Named events | Civil War, inauguration |
| concept | Political groups, ideologies | Republican, Democratic |
| legislation | Laws, bills, executive orders | Affordable Care Act |

Gold Dataset Construction

The evaluation gold dataset was built through a 4-stage pipeline: structured facts → synthetic articles → automated entity derivation → human curation.

1. KB Facts — subject/predicate/object tuples drawn from the knowledge base (~3,400 facts)
2. Article Generation — an LLM generates realistic news articles from the facts (113 articles)
3. Automated Derivation — a script maps predicates to entity annotations with character offsets (601 entities)
4. Human Curation — articles manually reviewed, corrected, and enriched (64 reviewed)

Auto-enrichment: 190 entities automatically added across 88 articles (dateline locations, government organizations, multi-word cities).
| Branch | Articles | Curated | Entities |
| --- | --- | --- | --- |
| Legislative | 53 | 14 | 308 |
| Executive | 20 | 15 | 125 |
| Judicial | 15 | 10 | 81 |
| CoNLL-2003 | 25 | 25 | 87 |
| **Total** | **113** | **64** | **601** |
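The derivation stage records character offsets for each annotated entity. The real script's predicate-to-type mapping isn't shown; the offset-finding part can be sketched as:

```python
# Minimal sketch of the offset-annotation step. The real derivation script
# also maps KB predicates to entity types; this only locates surface forms.
def find_offsets(article: str, surface: str) -> list[tuple[int, int]]:
    """Character offsets (start, end) of every occurrence of a surface form."""
    spans, start = [], article.find(surface)
    while start != -1:
        spans.append((start, start + len(surface)))
        start = article.find(surface, start + 1)
    return spans
```

Offsets make the gold annotations unambiguous even when the same name appears several times in one article.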

Evaluation Metrics

Standard information retrieval metrics for named entity recognition. Each extracted entity is classified as a true positive, false positive, or false negative relative to the gold annotations.

Precision = TP / (TP + FP) — of the entities extracted, what fraction are correct?

Recall = TP / (TP + FN) — of the gold entities, what fraction were found?

F1 = 2 × P × R / (P + R) — harmonic mean balancing precision and recall.

TP = true positive (correctly extracted); FP = false positive (hallucinated); FN = false negative (missed).
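The three formulas above can be computed directly from the counts. A minimal sketch (note that TP may be fractional here, since the fuzzy scorer described below awards 0.5 credit for type mismatches):

```python
def prf1(tp: float, fp: float, fn: float) -> tuple[float, float, float]:
    """Precision, recall, and F1 from TP/FP/FN counts.

    tp may be fractional when partial (0.5) credit is awarded.
    Zero denominators yield 0.0 rather than raising.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```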

Fuzzy Matching Strategy

Strict exact-match evaluation penalizes extractors for reasonable boundary differences. Our scorer uses a 6-priority fuzzy matching system that awards full or partial credit for semantically correct extractions with minor text or type variations.

Why Fuzzy Matching?

| Extracted | Gold | Strict | Fuzzy |
| --- | --- | --- | --- |
| Banking Committee | Senate Banking Committee | FP + FN | TP (substring) |
| John Fettermann | John Fetterman | FP + FN | TP (Levenshtein ≥ 0.8) |
| EPA (organization) | EPA (government_org) | FP + FN | 0.5 TP (type mismatch) |

6-Priority Matching System

| Priority | Rule | Credit | Example |
| --- | --- | --- | --- |
| 1 | Exact text + type match | 1.0 TP | Senate → Senate (both government_org) |
| 2 | Exact text + type mismatch | 0.5 TP | EPA (organization) → EPA (government_org) |
| 3 | Substring containment + type match | 1.0 TP | Banking Committee → Senate Banking Committee |
| 4 | Substring containment + type mismatch | 0.5 TP | Banking Committee (organization) → Senate Banking Committee (government_org) |
| 5 | Levenshtein similarity ≥ 0.8 + type match | 1.0 TP | John Fettermann → John Fetterman |
| 6 | Levenshtein similarity ≥ 0.8 + type mismatch | 0.5 TP | John Fettermann (person) → John Fetterman (concept) |
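The priority cascade above can be sketched as a single scoring function. This is an illustrative reimplementation, not the project's scorer; the 0.8 threshold and 0.5 partial credit come from the table:

```python
def levenshtein_similarity(a: str, b: str) -> float:
    """Normalized Levenshtein similarity in [0, 1] via the standard DP."""
    if a == b:
        return 1.0
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
        prev = curr
    return 1.0 - prev[n] / max(m, n)

def match_credit(ext_text: str, ext_type: str,
                 gold_text: str, gold_type: str,
                 threshold: float = 0.8) -> float:
    """TP credit per the 6-priority scheme; 0.0 means no match."""
    e, g = ext_text.lower(), gold_text.lower()
    type_ok = ext_type == gold_type
    if e == g:                                      # priorities 1-2: exact text
        return 1.0 if type_ok else 0.5
    if e in g or g in e:                            # priorities 3-4: substring
        return 1.0 if type_ok else 0.5
    if levenshtein_similarity(e, g) >= threshold:   # priorities 5-6: fuzzy
        return 1.0 if type_ok else 0.5
    return 0.0
```

An unmatched extraction counts as an FP, and an unmatched gold entity as an FN, exactly as in the strict scheme.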

Results

Both extractors receive identical input (same article text) and are scored by the same scorer against the same gold annotations. This controlled comparison isolates extraction quality.

Aggregate Comparison

| Dataset | Extractor | Precision | Recall | F1 |
| --- | --- | --- | --- | --- |
| Legislative (53) | spaCy | 0.151 | 0.963 | 0.261 |
| Legislative (53) | Claude | 0.426 | 0.977 | 0.593 |
| Executive (20) | spaCy | 0.220 | 0.983 | 0.359 |
| Executive (20) | Claude | 0.432 | 1.000 | 0.603 |
| Judicial (15) | spaCy | 0.192 | 0.925 | 0.318 |
| Judicial (15) | Claude | 0.456 | 0.938 | 0.614 |
| CoNLL-2003 (25) | spaCy | 0.960 | 0.856 | 0.905 |
| CoNLL-2003 (25) | Claude | 0.789 | 0.963 | 0.867 |

Per-Entity-Type Performance

Legislative branch — largest dataset (53 articles)

| Entity Type | spaCy P | spaCy R | spaCy F1 | Claude P | Claude R | Claude F1 |
| --- | --- | --- | --- | --- | --- | --- |
| person | 0.20 | 1.00 | 0.33 | 0.56 | 1.00 | 0.72 |
| government_org | 0.18 | 1.00 | 0.31 | 0.48 | 1.00 | 0.65 |
| location | 0.13 | 0.88 | 0.23 | 0.42 | 0.92 | 0.58 |
| concept | 0.14 | 1.00 | 0.25 | 0.33 | 1.00 | 0.49 |
| organization | 0.03 | 1.00 | 0.06 | 0.03 | 1.00 | 0.06 |
| event | 0.00 | 0.00 | 0.00 | 0.25 | 1.00 | 0.40 |

Performance Patterns

spaCy's weakness is precision, not recall

The small en_core_web_sm model finds most entities (recall > 0.90) but generates a large number of false positives — 1,620 FPs across the 53 legislative articles.

Claude's advantage is disciplined extraction

Similar recall but 4× fewer false positives (399 vs 1,620 on legislative). This drives the F1 improvement from 0.261 to 0.593 — a 2.3× gain.

spaCy excels on CoNLL-2003

On general newswire (the domain spaCy was trained on), it achieves 0.905 F1 with near-perfect precision (0.960). Its weakness on government articles is therefore a domain-gap issue, not a general deficiency.

Cost / Quality Tradeoff

| Metric | spaCy | Claude (Sonnet) |
| --- | --- | --- |
| Cost per article | $0.00 | ~$0.004 |
| Cost for 50 articles | $0.00 | ~$0.20 |
| Avg F1 (government) | 0.31 | 0.60 |
| F1 improvement | — | +94% |

Limitations & Future Work

Known Limitations

  1. Gold dataset size: 113 articles with 64 curated is sufficient for methodology demonstration but small for production-grade evaluation.
  2. Synthetic articles: Generated articles may not fully represent real-world news complexity.
  3. Single annotator: Curation was performed by one person; inter-annotator agreement was not measured.
  4. Entity type ambiguity: The boundary between "organization" and "government_org" is subjective for some entities.

Future Work

EVAL-3: Cognitive Bias Evaluation

Extend the harness pattern to evaluate bias detection using an ontology-grounded approach.

Larger gold dataset

Expand to 200+ curated articles with multiple annotators for inter-annotator agreement.

Additional extractors

Add GPT-4, Gemini, and larger spaCy models for broader comparison.

Active learning

Use evaluation results to identify articles where annotation would most improve the dataset.

Prompt engineering

Iterate on Claude's extraction prompt to improve precision on "concept" and "organization" types.

Tools & Technologies

Technologies used across the evaluation pipeline — from dataset construction to scoring to visualization.

- Evaluation: Promptfoo
- NLP: spaCy
- AI: Claude API
- Languages: Python, TypeScript
- Validation: Pydantic
- Framework: Next.js
- Visualization: Recharts
- CI/CD: GitHub Actions