
How to Extract Entities From Documents (NER in 2026)

Named entity recognition (NER) is the foundation of every document-based knowledge graph. Get NER wrong and the entire downstream graph inherits the noise. This tutorial compares the three production-grade options in 2026 — classical spaCy pipelines, the zero-shot GLiNER family, and prompt-driven LLM extraction — then shows how to combine them for the best precision-recall trade-off.

Step 1: Decide which entity types you actually need

OntoNotes 5's 18 entity types (PERSON, ORG, GPE, DATE, and so on) cover most generic use cases. Skip the generic schema when your domain needs something specific: pharma needs `Drug`, `Disease`, `Gene`, `Trial`; legal needs `Statute`, `Court`, `Defendant`, `Doctrine`.
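One way to make that decision explicit is a plain label list per domain. This is just a sketch: the domain labels come from the examples above, and the generic list is a subset of OntoNotes 5 labels.

```python
# Entity schemas per domain. Domain labels are the examples from the text;
# the "generic" list is a subset of the OntoNotes 5 label set.
SCHEMAS = {
    "generic": ["PERSON", "ORG", "GPE", "DATE", "MONEY"],
    "pharma": ["Drug", "Disease", "Gene", "Trial"],
    "legal": ["Statute", "Court", "Defendant", "Doctrine"],
}

def labels_for(domain: str) -> list[str]:
    """Fall back to the generic schema for unknown domains."""
    return SCHEMAS.get(domain, SCHEMAS["generic"])
```

Keeping the schema in one place like this means the same label list can be fed to GLiNER and to the LLM prompt later, so all three models extract against identical types.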

Step 2: Run the spaCy baseline

spaCy 3.7 with `en_core_web_trf` is still the fastest way to get a credible baseline. It hits ~89 F1 on OntoNotes and runs on CPU without a GPU, though the transformer pipeline is noticeably slower there than spaCy's smaller non-transformer pipelines such as `en_core_web_lg`.
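A minimal sketch of spaCy's entity API. It substitutes a rule-based `EntityRuler` so the snippet runs without downloading a model; for the real baseline, replace the blank pipeline with `spacy.load("en_core_web_trf")` after `python -m spacy download en_core_web_trf`.

```python
import spacy

# Real baseline: nlp = spacy.load("en_core_web_trf")
# A rule-based EntityRuler stands in here so the sketch runs without a download.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "ORG", "pattern": "Acme Corp"},
    {"label": "GPE", "pattern": "Berlin"},
])

doc = nlp("Acme Corp opened a new office in Berlin.")
# ent.start_char / ent.end_char are character offsets into the raw text.
ents = [(ent.text, ent.label_, ent.start_char, ent.end_char) for ent in doc.ents]
```

Whatever model sits in the pipeline, the downstream code is identical: you always read `doc.ents` and carry the character offsets forward for merging.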

Step 3: Use GLiNER for zero-shot custom types

GLiNER (Generalist and Lightweight model for NER, Zaratiana et al., NAACL 2024) lets you specify entity types at inference time without retraining. It hits ~80 F1 on out-of-domain types — remarkable for a 300MB model.
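In code, the model call plus a score-filtering pass might look like this. The `GLiNER` call is left as a comment so the sketch runs without downloading weights, and the per-label thresholds are illustrative values, not tuned recommendations.

```python
# Zero-shot extraction (assumes `pip install gliner`; weights download on first use):
#
#   from gliner import GLiNER
#   model = GLiNER.from_pretrained("urchade/gliner_medium-v2.1")
#   preds = model.predict_entities(text, labels=["Drug", "Disease", "Gene", "Trial"])
#
# Predictions come back as dicts with "text", "label", "start", "end", "score" keys.

def filter_predictions(preds, thresholds, default=0.5):
    """Keep predictions whose score clears a per-label threshold."""
    return [p for p in preds if p["score"] >= thresholds.get(p["label"], default)]

preds = [
    {"text": "imatinib", "label": "Drug", "start": 0, "end": 8, "score": 0.91},
    {"text": "CML", "label": "Disease", "start": 21, "end": 24, "score": 0.42},
]
kept = filter_predictions(preds, {"Drug": 0.6, "Disease": 0.6})
```

Per-label thresholds matter because GLiNER's score distributions differ by type; tune them on your held-out set rather than using one global cutoff.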

Step 4: Use an LLM when you need cross-sentence context

Classical NER and GLiNER both operate sentence-by-sentence. They miss anaphora, implicit entities, and entities mentioned only by role. LLMs handle these natively because they see the full context.
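A common pattern is to ask the model for JSON and validate it strictly. This sketch fakes the response: the `call_llm` stub and the prompt wording are assumptions standing in for whichever chat-completion API you use, not a specific vendor's interface.

```python
import json

LABELS = {"PERSON", "ORG", "GPE"}

PROMPT = (
    "Extract all entities of types {labels} from the text below. "
    "Return a JSON list of objects with keys 'text' and 'label'. "
    "Resolve pronouns and role mentions ('the CEO') to the named entity.\n\n{text}"
)

def call_llm(prompt: str) -> str:
    """Stub standing in for any chat-completion API call."""
    return '[{"text": "Tim Cook", "label": "PERSON"}, {"text": "Apple", "label": "ORG"}]'

def extract(text: str) -> list[dict]:
    raw = call_llm(PROMPT.format(labels=sorted(LABELS), text=text))
    ents = json.loads(raw)
    # Discard anything outside the schema and anything not literally in the text.
    return [e for e in ents
            if e.get("label") in LABELS and e.get("text", "") in text]

ents = extract("Tim Cook said Apple will expand. He confirmed the plan.")
```

The validation step is the important part: LLMs occasionally hallucinate spans or invent labels, so reject anything that is not verbatim in the source text or not in your schema.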

Step 5: Combine models for precision

Production pipelines rarely rely on a single model. The standard pattern is two-of-three voting: keep an entity only if at least two of (spaCy, GLiNER, LLM) agree on both its span boundaries and its label.
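A sketch of the voting merge, assuming each model's output has already been converted to `(start, end, label)` character-offset tuples:

```python
from collections import Counter

def vote(spacy_ents, gliner_ents, llm_ents, min_votes=2):
    """Keep spans that at least `min_votes` of the three models agree on exactly.

    Exact-match voting means boundaries and label must all agree, so make sure
    all three inputs use the same character offsets before calling this.
    """
    counts = Counter()
    for ents in (spacy_ents, gliner_ents, llm_ents):
        counts.update(set(ents))  # one vote per model per span
    return sorted(span for span, n in counts.items() if n >= min_votes)

merged = vote(
    [(0, 8, "PERSON"), (14, 19, "ORG")],
    [(0, 8, "PERSON")],
    [(0, 8, "PERSON"), (14, 19, "ORG"), (25, 31, "GPE")],
)
```

Here the GPE span gets only one vote (LLM) and is dropped; the other two spans survive with two and three votes respectively.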

Common pitfalls

  • Trusting confidence scores blindly. Across spaCy, GLiNER, and LLMs, scores are not calibrated — a 0.95 from one model is not a 0.95 from another.
  • Tokenization mismatch. spaCy's offsets are character-based; some downstream tools assume word-based. Always validate offsets when chaining tools.
  • Forgetting to normalize surface forms (lowercase, strip surrounding punctuation) before merging mentions across models.
  • Over-fitting to OntoNotes. The schema is biased toward news text — your scientific or legal data will look weird through it.
  • Skipping a held-out eval set. You cannot tell if a prompt change improved things without 100+ manually labeled sentences from your real data.
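The normalization called out in the merge pitfall above can be as simple as a merge key. The key is for matching only; keep the original surface form for display.

```python
import string

def merge_key(mention: str) -> str:
    """Canonical form used only as a merge key; keep the original surface form too."""
    return mention.strip(string.punctuation + string.whitespace).lower()
```

So `"  Apple, "` and `"(Apple)"` both map to the key `"apple"` and merge into one node, while the graph still displays whichever surface form you prefer.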

Frequently Asked Questions

Should I fine-tune a model or use zero-shot?

Use zero-shot (GLiNER or LLM) until you have at least 500 manually labeled examples per entity type. Past 500 examples, a fine-tuned spaCy or BERT model usually beats zero-shot on speed-per-dollar.

How do I evaluate NER quality?

Standard practice is span-level F1 with strict matching: a prediction counts only if both the boundaries and the label match. Use `seqeval`, which reimplements the classic `conlleval` scoring used in the CoNLL shared tasks.
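`seqeval` is the standard tool, but to make the metric concrete, strict-match span F1 fits in a few lines of plain Python over `(start, end, label)` sets:

```python
def span_f1(gold: set, pred: set) -> float:
    """Strict-match span F1: a prediction counts only if (start, end, label) all match."""
    if not gold and not pred:
        return 1.0
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = {(0, 8, "PERSON"), (14, 19, "ORG")}
pred = {(0, 8, "PERSON"), (14, 20, "ORG")}  # off-by-one boundary -> no credit
score = span_f1(gold, pred)
```

Note how unforgiving strict matching is: the ORG prediction is one character off and scores zero, dragging F1 down to 0.5 despite being "almost right".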

What about non-English documents?

spaCy ships pipelines for 23 languages; GLiNER's `gliner_multi-v2.1` covers 11 languages zero-shot at ~75 F1. For Arabic, AraBERT-NER fine-tuned on ANERcorp hits ~89 F1.

How do I handle nested entities?

Classical NER assumes flat spans. Use a span-based model like SpERT or any LLM with a nested schema. GLiNER 2 supports nested spans natively as of v2.1.

How do I deal with abbreviations?

Resolve them in a post-processing pass. Tools like `scispaCy`'s `AbbreviationDetector` automate this for biomedical text and reach >95% precision on PubMed abstracts.
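`scispaCy`'s detector is the robust option; to show the underlying idea (it builds on the Schwartz-Hearst algorithm), here is a toy version that matches a parenthesized short form against the initials of the preceding words:

```python
import re

def find_abbreviations(text: str) -> dict[str, str]:
    """Map ABBR -> long form when the preceding words' initials spell the abbreviation.

    A toy sketch only; scispaCy's AbbreviationDetector handles the many
    cases this misses (mixed case, inner letters, hyphenation, ...).
    """
    pairs = {}
    for m in re.finditer(r"\(([A-Z]{2,})\)", text):
        abbr = m.group(1)
        words = text[:m.start()].split()
        candidate = words[-len(abbr):]
        if len(candidate) == len(abbr) and \
                "".join(w[0] for w in candidate).lower() == abbr.lower():
            pairs[abbr] = " ".join(candidate)
    return pairs

pairs = find_abbreviations("Named entity recognition (NER) powers the graph.")
```

Once you have the mapping, rewrite every later occurrence of the short form to its long form (or attach both to the same node) before entity merging.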

Source

Honnibal & Montani (2017) introduced spaCy's transition-based NER architecture. Finkel, Grenager & Manning (ACL 2005) described the Stanford NER system, which added non-local features to a CRF tagger; the CoNLL-2003 shared task defined the span-level F1 benchmark that every subsequent paper still reports against. [link]

Ready to Try KnodeGraph?

Start free with 3 graphs and 100 nodes. Upgrade to Pro for AI extraction, unlimited graphs, and 50K nodes.

Get Started Free