AI Evaluation

This section presents a systematic evaluation of entity extraction quality, comparing Claude (an LLM) against spaCy (a statistical NLP library) on U.S. government domain text. Explore the results, browse the gold dataset, and read the full methodology.

Explore

Model Comparison Results

Visual comparison of spaCy vs Claude with precision, recall, and F1 (P/R/F1) charts, a per-branch breakdown, and a per-entity-type analysis.

Gold Dataset Explorer

Browse the 113 evaluation articles with ground-truth entity annotations, perturbation labels, and difficulty ratings.

Evaluation Methodology

Entity taxonomy, gold dataset construction, fuzzy matching strategy, and the full evaluation design.
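
For readers unfamiliar with how P/R/F1 scores are typically derived for entity extraction, the following is a minimal sketch, not the evaluation's actual implementation. It assumes entities are (text, type) pairs and uses a hypothetical fuzzy rule: a prediction matches a gold entity when the types agree and the `difflib` similarity of the lowercased text meets a threshold. The function names, the 0.85 threshold, and the matching rule are all illustrative assumptions; the methodology page describes the real strategy.

```python
from difflib import SequenceMatcher


def is_match(pred, gold, threshold=0.85):
    """Hypothetical fuzzy rule: same entity type and similar surface text."""
    (p_text, p_type), (g_text, g_type) = pred, gold
    if p_type != g_type:
        return False
    ratio = SequenceMatcher(None, p_text.lower(), g_text.lower()).ratio()
    return ratio >= threshold


def score(predicted, gold, threshold=0.85):
    """Return (precision, recall, f1) for one article's entities."""
    unmatched = list(gold)  # gold entities not yet claimed by a prediction
    tp = 0
    for pred in predicted:
        for g in unmatched:
            if is_match(pred, g, threshold):
                unmatched.remove(g)  # each gold entity is matched at most once
                tp += 1
                break
    fp = len(predicted) - tp  # predictions with no gold counterpart
    fn = len(unmatched)       # gold entities that were never found
    precision = tp / (tp + fp) if predicted else 0.0
    recall = tp / (tp + fn) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Greedy first-match assignment keeps the sketch short; a real evaluation might instead pick the best-scoring pairing or also compare character offsets.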