Evaluation Methodology
A systematic evaluation of named entity extraction quality on U.S. government domain text, comparing a Claude LLM extractor against a spaCy statistical NER baseline.
Overview
This evaluation measures the quality of named entity extraction on U.S. government domain text. Two extractors are compared: a spaCy statistical NER pipeline (baseline) and a Claude LLM extractor (challenger) — both scored against a curated gold dataset using precision, recall, and F1 metrics.
The evaluation uses 113 articles across three government branches (legislative, executive, judicial) plus a CoNLL-2003 general-domain validation set. A fuzzy matching strategy with 6 priority levels prevents reasonable boundary differences between extractors and gold annotations from inflating error rates.
| Headline result (government domain) | F1 |
|---|---|
| Claude | 0.60 |
| spaCy | 0.31 |
Entity Taxonomy
Both extractors produce entities in a shared 7-type taxonomy designed for government domain text.
| Type | Description |
|---|---|
| person | Named individuals |
| government_org | Government bodies, agencies, courts |
| organization | Non-government organizations |
| location | Geographic entities |
| event | Named events |
| concept | Political groups, ideologies |
| legislation | Laws, bills, executive orders |
Gold Dataset Construction
The evaluation gold dataset was built through a 4-stage pipeline: structured facts → synthetic articles → automated entity derivation → human curation.
1. **KB Facts** (~3,400 facts): subject/predicate/object tuples from the knowledge base.
2. **Article Generation** (113 articles): an LLM generates realistic news articles from the facts.
3. **Automated Derivation** (601 entities): a script maps predicates to entity annotations with character offsets.
4. **Human Curation** (64 reviewed): articles are manually reviewed, corrected, and enriched.

| Branch | Articles | Curated | Entities |
|---|---|---|---|
| Legislative | 53 | 14 | 308 |
| Executive | 20 | 15 | 125 |
| Judicial | 15 | 10 | 81 |
| CoNLL-2003 | 25 | 25 | 87 |
| Total | 113 | 64 | 601 |
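The automated derivation stage can be sketched as a predicate-to-type lookup plus a character-offset search over the generated article. This is a hypothetical sketch, not the pipeline's actual script: the names `Annotation`, `derive`, and `PREDICATE_TYPE` are illustrative, and the real predicate map is far larger.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Annotation:
    text: str
    entity_type: str
    start: int  # character offset into the article
    end: int

# Hypothetical predicate -> entity-type map; the real table covers many predicates.
PREDICATE_TYPE = {
    "sponsored_bill": "legislation",
    "member_of": "government_org",
}

def derive(article: str, obj_text: str, predicate: str) -> Optional[Annotation]:
    """Locate a fact's object string in the article and emit an offset annotation."""
    etype = PREDICATE_TYPE.get(predicate)
    start = article.find(obj_text)
    if etype is None or start < 0:
        return None  # predicate unmapped, or object text not present verbatim
    return Annotation(obj_text, etype, start, start + len(obj_text))
```

Annotations the script cannot ground verbatim in the article text are dropped, which is one reason the human-curation stage then corrects and enriches the derived set.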
Evaluation Metrics
Standard information retrieval metrics for named entity recognition. Each extracted entity is classified as a true positive, false positive, or false negative relative to the gold annotations.
- **Precision**: of entities extracted, what fraction are correct?
- **Recall**: of gold entities, what fraction were found?
- **F1**: the harmonic mean of precision and recall.
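These metrics follow directly from the match counts. A minimal sketch; note the TP count may be fractional, since the scorer awards 0.5 credit for type-mismatched matches:

```python
def prf1(tp: float, fp: float, fn: float) -> tuple[float, float, float]:
    """Precision, recall, and F1 from (possibly fractional) match counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

The zero-denominator guards matter in per-type breakdowns, where a type may have no extractions (precision undefined) or no gold entities (recall undefined).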
Fuzzy Matching Strategy
Strict exact-match evaluation penalizes extractors for reasonable boundary differences. Our scorer uses a 6-priority fuzzy matching system that awards full or partial credit for semantically correct extractions with minor text or type variations.
Why Fuzzy Matching?
| Extracted | Gold | Strict | Fuzzy |
|---|---|---|---|
| Banking Committee | Senate Banking Committee | FP + FN | TP (substring) |
| John Fettermann | John Fetterman | FP + FN | TP (Levenshtein ≥ 0.8) |
| EPA (organization) | EPA (government_org) | FP + FN | 0.5 TP (type mismatch) |
6-Priority Matching System
| Priority | Rule | Credit | Example |
|---|---|---|---|
| 1 | Exact text + type match | 1.0 TP | Senate → Senate (both government_org) |
| 2 | Exact text + type mismatch | 0.5 TP | EPA (organization) → EPA (government_org) (type differs) |
| 3 | Substring containment + type match | 1.0 TP | Banking Committee → Senate Banking Committee (substring) |
| 4 | Substring containment + type mismatch | 0.5 TP | Banking Committee (org) → Senate Banking Committee (gov_org) (substring + type differs) |
| 5 | Levenshtein similarity ≥ 0.8 + type match | 1.0 TP | John Fettermann → John Fetterman (fuzzy match) |
| 6 | Levenshtein similarity ≥ 0.8 + type mismatch | 0.5 TP | John Fettermann (person) → John Fetterman (concept) (fuzzy + type differs) |
Results
Both extractors receive identical input (same article text) and are scored by the same scorer against the same gold annotations. This controlled comparison isolates extraction quality.
Aggregate Comparison
| Dataset | Extractor | Precision | Recall | F1 |
|---|---|---|---|---|
| Legislative (53) | spaCy | 0.151 | 0.963 | 0.261 |
| Legislative (53) | Claude | 0.426 | 0.977 | 0.593 |
| Executive (20) | spaCy | 0.220 | 0.983 | 0.359 |
| Executive (20) | Claude | 0.432 | 1.000 | 0.603 |
| Judicial (15) | spaCy | 0.192 | 0.925 | 0.318 |
| Judicial (15) | Claude | 0.456 | 0.938 | 0.614 |
| CoNLL-2003 (25) | spaCy | 0.960 | 0.856 | 0.905 |
| CoNLL-2003 (25) | Claude | 0.789 | 0.963 | 0.867 |
Per-Entity-Type Performance
Legislative branch — largest dataset (53 articles)
| Entity Type | spaCy P | spaCy R | spaCy F1 | Claude P | Claude R | Claude F1 |
|---|---|---|---|---|---|---|
| person | 0.20 | 1.00 | 0.33 | 0.56 | 1.00 | 0.72 |
| government_org | 0.18 | 1.00 | 0.31 | 0.48 | 1.00 | 0.65 |
| location | 0.13 | 0.88 | 0.23 | 0.42 | 0.92 | 0.58 |
| concept | 0.14 | 1.00 | 0.25 | 0.33 | 1.00 | 0.49 |
| organization | 0.03 | 1.00 | 0.06 | 0.03 | 1.00 | 0.06 |
| event | 0.00 | 0.00 | 0.00 | 0.25 | 1.00 | 0.40 |
Performance Patterns
spaCy's weakness is precision, not recall
The small en_core_web_sm model finds most entities (recall > 0.90) but generates a flood of false positives: 1,620 FPs across the 53 legislative articles.
Claude's advantage is disciplined extraction
Similar recall but 4× fewer false positives (399 vs 1,620 on legislative). This drives the F1 improvement from 0.261 to 0.593 — a 2.3× gain.
spaCy excels on CoNLL-2003
On general newswire (the domain spaCy was trained on), it achieves 0.905 F1 with near-perfect precision (0.960). The government article weakness is a domain gap issue.
Cost / Quality Tradeoff
| Metric | spaCy | Claude (Sonnet) |
|---|---|---|
| Cost per article | $0.00 | ~$0.004 |
| Cost for 50 articles | $0.00 | ~$0.20 |
| Avg F1 (government) | 0.31 | 0.60 |
| F1 improvement | — | +94% |
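As a quick sanity check, the +94% figure follows directly from the two average government-domain F1 scores in the table:

```python
# Average government-domain F1 scores from the cost/quality table.
spacy_f1, claude_f1 = 0.31, 0.60
improvement = (claude_f1 - spacy_f1) / spacy_f1
print(f"F1 improvement: {improvement:+.0%}")  # F1 improvement: +94%
```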
Limitations & Future Work
Known Limitations
1. **Gold dataset size:** 113 articles (64 curated) is sufficient for a methodology demonstration but small for production-grade evaluation.
2. **Synthetic articles:** generated articles may not fully represent real-world news complexity.
3. **Single annotator:** curation was performed by one person; inter-annotator agreement was not measured.
4. **Entity type ambiguity:** the boundary between "organization" and "government_org" is subjective for some entities.
Future Work
- **EVAL-3, cognitive bias evaluation:** extend the harness pattern to evaluate bias detection using an ontology-grounded approach.
- **Larger gold dataset:** expand to 200+ curated articles with multiple annotators for inter-annotator agreement.
- **Additional extractors:** add GPT-4, Gemini, and larger spaCy models for broader comparison.
- **Active learning:** use evaluation results to identify the articles where additional annotation would most improve the dataset.
- **Prompt engineering:** iterate on Claude's extraction prompt to improve precision on the "concept" and "organization" types.
Tools & Technologies
Technologies used across the evaluation pipeline — from dataset construction to scoring to visualization.
Tool categories: Evaluation, NLP, AI, Language, Validation, Framework, Visualization, CI/CD.