How to Extract Entities From Documents (NER in 2026)
Named entity recognition (NER) is the foundation of every document-based knowledge graph. Get NER wrong and the entire downstream graph inherits the noise. This tutorial compares the three production-grade options in 2026 — classical spaCy pipelines, the zero-shot GLiNER family, and prompt-driven LLM extraction — then shows how to combine them for the best precision-recall trade-off.
Step 1: Decide which entity types you actually need
The OntoNotes 5 schema (18 types: PERSON, ORG, GPE, DATE, and so on) covers about 80% of generic use cases. Skip it when your domain needs something more specific: pharma needs `Drug`, `Disease`, `Gene`, `Trial`; legal needs `Statute`, `Court`, `Defendant`, `Doctrine`.
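As a sketch, the label sets below are the kind of thing the later steps will pass to GLiNER and the LLM. The names are illustrative, not a fixed taxonomy.

```python
# Illustrative domain label sets; adapt to your own documents.
ONTONOTES_SUBSET = ["PERSON", "ORG", "GPE", "DATE", "MONEY"]  # subset of the 18 OntoNotes types
PHARMA_LABELS = ["Drug", "Disease", "Gene", "Trial"]
LEGAL_LABELS = ["Statute", "Court", "Defendant", "Doctrine"]
```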
Step 2: Run the spaCy baseline
spaCy 3.7 with `en_core_web_trf` is still the quickest way to get a credible baseline: it hits roughly 89 F1 on OntoNotes with no tuning. The transformer pipeline does want a GPU for throughput; if you are CPU-only, `en_core_web_lg` processes thousands of tokens per second at the cost of a few F1 points.
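A minimal baseline looks like this, assuming the model is already downloaded (`python -m spacy download en_core_web_trf`); the sample sentence is made up for illustration.

```python
import spacy

nlp = spacy.load("en_core_web_trf")

text = "Pfizer's trial of tofacitinib was halted by the FDA in March 2023."
doc = nlp(text)

for ent in doc.ents:
    # Character offsets, label, and surface form for each detected entity.
    print(ent.start_char, ent.end_char, ent.label_, ent.text)
```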
Step 3: Use GLiNER for zero-shot custom types
GLiNER (Generalist and Lightweight model for NER, Zaratiana et al., NAACL 2024) lets you specify entity types at inference time without retraining. It hits ~80 F1 on out-of-domain types — remarkable for a 300MB model.
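A zero-shot call looks roughly like the sketch below (`pip install gliner`). The checkpoint name `urchade/gliner_medium-v2.1` and the pharma labels are assumptions; swap in whatever you actually use.

```python
from gliner import GLiNER

model = GLiNER.from_pretrained("urchade/gliner_medium-v2.1")

labels = ["Drug", "Disease", "Gene", "Trial"]
text = "The phase III trial tested semaglutide in patients with type 2 diabetes."

# Entity types are passed at inference time; no retraining needed.
entities = model.predict_entities(text, labels, threshold=0.5)
for ent in entities:
    print(ent["start"], ent["end"], ent["label"], ent["text"])
```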
Step 4: Use an LLM when you need cross-sentence context
Classical NER and GLiNER both operate on short, local windows of text. They miss anaphora ("the company announced..."), implicit entities, and entities mentioned only by role ("the defendant"). LLMs handle these natively because they see the full document context.
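One way to wire this up, sketched with the OpenAI Python SDK; the model name, prompt wording, and JSON shape are all assumptions to adapt to your provider.

```python
import json
from openai import OpenAI

client = OpenAI()

PROMPT = """Extract every entity from the document below.
Return JSON: {{"entities": [{{"text": "...", "label": "...", "evidence": "..."}}]}}.
Allowed labels: Drug, Disease, Gene, Trial.
Resolve pronouns and role mentions (e.g. "the sponsor") to the named entity.

Document:
{document}"""

def llm_extract(document: str) -> list[dict]:
    # Ask for a JSON object so the response parses deterministically.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: use whatever model your provider offers
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": PROMPT.format(document=document)}],
    )
    return json.loads(response.choices[0].message.content)["entities"]
```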
Step 5: Combine models for precision
Production pipelines rarely use a single model. The standard pattern is two-of-three voting: keep an entity if any two of (spaCy, GLiNER, LLM) agree.
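A minimal voting sketch, assuming each extractor's output has already been reduced to (surface form, label) pairs; in practice the offset alignment that gets you there is the hard part.

```python
from collections import Counter

def normalize(text: str) -> str:
    # Cheap surface-form normalization before comparing mentions.
    return text.lower().strip(" .,;:")

def vote(spacy_ents, gliner_ents, llm_ents, min_votes: int = 2):
    counts = Counter()
    for ents in (spacy_ents, gliner_ents, llm_ents):
        # Each model votes at most once per unique (text, label) pair.
        counts.update({(normalize(text), label) for text, label in ents})
    return [pair for pair, n in counts.items() if n >= min_votes]
```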
Common pitfalls
- Trusting confidence scores blindly. Across spaCy, GLiNER, and LLMs, scores are not calibrated — a 0.95 from one model is not a 0.95 from another.
- Tokenization mismatch. spaCy reports character offsets; many downstream tools assume token indices. Always validate offsets when chaining tools (see the sketch after this list).
- Forgetting to normalize surface forms (lowercase, strip trailing punctuation) before merging; "FDA," and "FDA" should count as the same mention.
- Over-fitting to OntoNotes. The schema is biased toward news text — your scientific or legal data will look weird through it.
- Skipping a held-out eval set. You cannot tell if a prompt change improved things without 100+ manually labeled sentences from your real data.
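Here is the offset sanity check referenced in the tokenization-mismatch pitfall, assuming entities carry character offsets into the original text.

```python
def validate_offsets(text: str, entities: list[dict]) -> list[dict]:
    """Return entities whose offsets no longer point at their surface form."""
    bad = []
    for ent in entities:
        if text[ent["start"]:ent["end"]] != ent["text"]:
            bad.append(ent)  # a silent shift after re-tokenization or cleanup
    return bad
```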
Frequently Asked Questions
Should I fine-tune a model or use zero-shot?
Use zero-shot (GLiNER or LLM) until you have at least 500 manually labeled examples per entity type. Past 500 examples, a fine-tuned spaCy or BERT model usually beats zero-shot on speed-per-dollar.
How do I evaluate NER quality?
Standard practice is span-level F1 with strict matching: a prediction counts only if both the span boundaries and the label match. Use `seqeval`, which reproduces the official CoNLL evaluation (the `conlleval` script).
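A toy example with `seqeval` (`pip install seqeval`); tags must be in a BIO/IOB2 scheme, and the sequences below are made up.

```python
from seqeval.metrics import classification_report, f1_score

y_true = [["B-Drug", "I-Drug", "O", "B-Disease"]]
y_pred = [["B-Drug", "I-Drug", "O", "O"]]

# Entity-level F1: a predicted span counts only if boundaries and label both match.
print(f1_score(y_true, y_pred))
print(classification_report(y_true, y_pred))
```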
What about non-English documents?
spaCy ships pipelines for 23 languages; GLiNER's `gliner_multi-v2.1` covers 11 languages zero-shot at ~75 F1. For Arabic, AraBERT-NER fine-tuned on ANERcorp hits ~89 F1.
How do I handle nested entities?
Classical NER assumes flat spans. Use a span-based model like SpERT or any LLM with a nested schema. GLiNER 2 supports nested spans natively as of v2.1.
How do I deal with abbreviations?
Resolve them in a post-processing pass. Tools like `scispaCy`'s `AbbreviationDetector` automate this for biomedical text and reach >95% precision on PubMed abstracts.
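A biomedical sketch with scispaCy, assuming `scispacy` and the `en_core_sci_sm` model are installed; the example sentence is illustrative.

```python
import spacy
from scispacy.abbreviation import AbbreviationDetector  # noqa: F401  (registers the pipe)

nlp = spacy.load("en_core_sci_sm")
nlp.add_pipe("abbreviation_detector")

doc = nlp("Spinal muscular atrophy (SMA) is treated with nusinersen. SMA onset varies.")
for abrv in doc._.abbreviations:
    # Each short form links back to its long form for post-hoc expansion.
    print(abrv.text, "->", abrv._.long_form)
```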
Source
Honnibal & Montani (2017) introduced spaCy and its transition-based approach to NER; the `trf` pipelines keep that transition-based decoder on top of a transformer encoder. The Stanford NER paper (Finkel, Grenager & Manning, ACL 2005) popularized CRF-based sequence tagging and reports on the CoNLL-2003 benchmark that later papers still measure against. [link]
Ready to Try KnodeGraph?
Start free with 3 graphs and 100 nodes. Upgrade to Pro for AI extraction, unlimited graphs, and 50K nodes.
Get Started Free