Evaluation Methodology
A systematic evaluation of named entity extraction quality on U.S. government domain text, comparing a Claude LLM extractor against a spaCy statistical NER baseline.
Overview
This evaluation measures the quality of named entity extraction on U.S. government domain text. Two extractors are compared: a spaCy statistical NER pipeline (baseline) and a Claude LLM extractor (challenger) — both scored against a curated gold dataset using precision, recall, and F1 metrics.
The evaluation uses 113 articles across three government branches (legislative, executive, judicial) plus a CoNLL-2003 general-domain validation set. A fuzzy matching strategy with 6 priority levels prevents reasonable boundary differences between extractors and gold annotations from inflating error rates.
| Headline result (government domain) | F1 |
|---|---|
| Claude | 0.60 |
| spaCy | 0.31 |
Entity Taxonomy
Both extractors produce entities in a shared 7-type taxonomy designed for government domain text.
| Type | Description |
|---|---|
| person | Named individuals |
| government_org | Government bodies, agencies, courts |
| organization | Non-government organizations |
| location | Geographic entities |
| event | Named events |
| concept | Political groups, ideologies |
| legislation | Laws, bills, executive orders |
Gold Dataset Construction
The evaluation gold dataset was built through a 4-stage pipeline: structured facts → synthetic articles → automated entity derivation → human curation.
1. **KB Facts** (~3,400 facts): subject/predicate/object tuples from the knowledge base.
2. **Article Generation** (113 articles): an LLM generates realistic news articles from the facts.
3. **Automated Derivation** (601 entities): a script maps predicates to entity annotations with character offsets.
4. **Human Curation** (64 reviewed): articles are manually reviewed, corrected, and enriched.

| Branch | Articles | Curated | Entities |
|---|---|---|---|
| Legislative | 53 | 14 | 308 |
| Executive | 20 | 15 | 125 |
| Judicial | 15 | 10 | 81 |
| CoNLL-2003 | 25 | 25 | 87 |
| Total | 113 | 64 | 601 |
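The automated derivation stage can be sketched as a predicate-to-type lookup plus a character-offset search over the generated article. This is a hypothetical sketch, not the pipeline's actual script: the names `Annotation`, `derive`, and `PREDICATE_TYPE` are illustrative, and the real predicate map is far larger.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Annotation:
    text: str
    entity_type: str
    start: int  # character offset into the article
    end: int

# Hypothetical predicate -> entity-type map; the real table covers many predicates.
PREDICATE_TYPE = {
    "sponsored_bill": "legislation",
    "member_of": "government_org",
}

def derive(article: str, obj_text: str, predicate: str) -> Optional[Annotation]:
    """Locate a fact's object string in the article and emit an offset annotation."""
    etype = PREDICATE_TYPE.get(predicate)
    start = article.find(obj_text)
    if etype is None or start < 0:
        return None  # predicate unmapped, or object text not present verbatim
    return Annotation(obj_text, etype, start, start + len(obj_text))
```

Annotations the script cannot ground verbatim in the article text are dropped, which is one reason the human-curation stage then corrects and enriches the derived set.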
Evaluation Metrics
Standard information retrieval metrics for named entity recognition. Each extracted entity is classified as a true positive, false positive, or false negative relative to the gold annotations.
- **Precision**: of entities extracted, what fraction are correct?
- **Recall**: of gold entities, what fraction were found?
- **F1**: the harmonic mean of precision and recall.
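These metrics follow directly from the match counts. A minimal sketch; note the TP count may be fractional, since the scorer awards 0.5 credit for type-mismatched matches:

```python
def prf1(tp: float, fp: float, fn: float) -> tuple[float, float, float]:
    """Precision, recall, and F1 from (possibly fractional) match counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

The zero-denominator guards matter in per-type breakdowns, where a type may have no extractions (precision undefined) or no gold entities (recall undefined).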
Fuzzy Matching Strategy
Strict exact-match evaluation penalizes extractors for reasonable boundary differences. Our scorer uses a 6-priority fuzzy matching system that awards full or partial credit for semantically correct extractions with minor text or type variations.
Why Fuzzy Matching?
| Extracted | Gold | Strict | Fuzzy |
|---|---|---|---|
| Banking Committee | Senate Banking Committee | FP + FN | TP (substring) |
| John Fettermann | John Fetterman | FP + FN | TP (Levenshtein ≥ 0.8) |
| EPA (organization) | EPA (government_org) | FP + FN | 0.5 TP (type mismatch) |
6-Priority Matching System
| Priority | Rule | Credit | Example |
|---|---|---|---|
| 1 | Exact text + type match | 1.0 TP | Senate → Senate (both government_org) |
| 2 | Exact text + type mismatch | 0.5 TP | EPA (organization) → EPA (government_org) (type differs) |
| 3 | Substring containment + type match | 1.0 TP | Banking Committee → Senate Banking Committee (substring) |
| 4 | Substring containment + type mismatch | 0.5 TP | Banking Committee (org) → Senate Banking Committee (gov_org) (substring + type differs) |
| 5 | Levenshtein similarity ≥ 0.8 + type match | 1.0 TP | John Fettermann → John Fetterman (fuzzy match) |
| 6 | Levenshtein similarity ≥ 0.8 + type mismatch | 0.5 TP | John Fettermann (person) → John Fetterman (concept) (fuzzy + type differs) |
Results
Both extractors receive identical input (same article text) and are scored by the same scorer against the same gold annotations. This controlled comparison isolates extraction quality.
Aggregate Comparison
| Dataset | Extractor | Precision | Recall | F1 |
|---|---|---|---|---|
| Legislative (53) | spaCy | 0.151 | 0.963 | 0.261 |
| Legislative (53) | Claude | 0.426 | 0.977 | 0.593 |
| Executive (20) | spaCy | 0.220 | 0.983 | 0.359 |
| Executive (20) | Claude | 0.432 | 1.000 | 0.603 |
| Judicial (15) | spaCy | 0.192 | 0.925 | 0.318 |
| Judicial (15) | Claude | 0.456 | 0.938 | 0.614 |
| CoNLL-2003 (25) | spaCy | 0.960 | 0.856 | 0.905 |
| CoNLL-2003 (25) | Claude | 0.789 | 0.963 | 0.867 |
Per-Entity-Type Performance
Legislative branch — largest dataset (53 articles)
| Entity Type | spaCy P | spaCy R | spaCy F1 | Claude P | Claude R | Claude F1 |
|---|---|---|---|---|---|---|
| person | 0.20 | 1.00 | 0.33 | 0.56 | 1.00 | 0.72 |
| government_org | 0.18 | 1.00 | 0.31 | 0.48 | 1.00 | 0.65 |
| location | 0.13 | 0.88 | 0.23 | 0.42 | 0.92 | 0.58 |
| concept | 0.14 | 1.00 | 0.25 | 0.33 | 1.00 | 0.49 |
| organization | 0.03 | 1.00 | 0.06 | 0.03 | 1.00 | 0.06 |
| event | 0.00 | 0.00 | 0.00 | 0.25 | 1.00 | 0.40 |
Performance Patterns
spaCy's weakness is precision, not recall
The small en_core_web_sm model finds most entities (recall > 0.90) but generates a flood of false positives: 1,620 FPs across the 53 legislative articles.
Claude's advantage is disciplined extraction
Similar recall but 4× fewer false positives (399 vs 1,620 on legislative). This drives the F1 improvement from 0.261 to 0.593 — a 2.3× gain.
spaCy excels on CoNLL-2003
On general newswire (the domain spaCy was trained on), it achieves 0.905 F1 with near-perfect precision (0.960). The government article weakness is a domain gap issue.
Cost / Quality Tradeoff
| Metric | spaCy | Claude (Sonnet) |
|---|---|---|
| Cost per article | $0.00 | ~$0.004 |
| Cost for 50 articles | $0.00 | ~$0.20 |
| Avg F1 (government) | 0.31 | 0.60 |
| F1 improvement | — | +94% |
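As a quick sanity check, the +94% figure follows directly from the two average government-domain F1 scores in the table:

```python
# Average government-domain F1 scores from the cost/quality table.
spacy_f1, claude_f1 = 0.31, 0.60
improvement = (claude_f1 - spacy_f1) / spacy_f1
print(f"F1 improvement: {improvement:+.0%}")  # F1 improvement: +94%
```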
Limitations & Future Work
Known Limitations
1. **Gold dataset size:** 113 articles (64 curated) is sufficient for a methodology demonstration but small for production-grade evaluation.
2. **Synthetic articles:** generated articles may not fully represent real-world news complexity.
3. **Single annotator:** curation was performed by one person; inter-annotator agreement was not measured.
4. **Entity type ambiguity:** the boundary between "organization" and "government_org" is subjective for some entities.
Future Work
- **EVAL-3, cognitive bias evaluation:** extend the harness pattern to evaluate bias detection using an ontology-grounded approach.
- **Larger gold dataset:** expand to 200+ curated articles with multiple annotators for inter-annotator agreement.
- **Additional extractors:** add GPT-4, Gemini, and larger spaCy models for broader comparison.
- **Active learning:** use evaluation results to identify the articles where additional annotation would most improve the dataset.
- **Prompt engineering:** iterate on Claude's extraction prompt to improve precision on the "concept" and "organization" types.
Tools & Technologies
Technologies used across the evaluation pipeline — from dataset construction to scoring to visualization.
Tool categories: Evaluation, NLP, AI, Language, Validation, Framework, Visualization, CI/CD.