Named Entity Recognition Explained: From Stanford NER to BERT and LLMs
What NER actually does
Named Entity Recognition is two tasks bolted together: find spans of text that refer to a real-world entity (the 'recognition'), and classify each span by type (the 'naming'). Standard types are Person, Organisation, Location, Date, Money, and a Misc bucket. Domain-specific NER adds types like Drug, Gene, Statute, Regulation, or Vessel.
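To make that concrete, here is a minimal sketch using spaCy's off-the-shelf English pipeline. The example sentence and the assumption that `en_core_web_sm` is installed are illustrative; the exact labels you get back depend on the model you load.

```python
# pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Acme Corp hired Jane Smith in London on 3 March 2024.")

for ent in doc.ents:
    # ent.text is the recognised span, ent.label_ is the entity type
    print(ent.text, ent.label_, ent.start_char, ent.end_char)
# Typical output: 'Acme Corp' ORG, 'Jane Smith' PERSON, 'London' GPE, '3 March 2024' DATE
```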
If you are building a knowledge graph from documents, NER is the on-ramp. Every node in the graph that came from a document came from a NER hit; every edge is a relationship between two NER hits. The quality of the rest of the pipeline is bounded by NER's recall: an entity the model misses never becomes a node, and no downstream step can add it back.
Four eras: rules → CRFs → transformers → LLMs
The first generation of NER (1995-2005) was rule-based. Hand-written regex and gazetteers — lists of known company names, place names, and so on. It worked acceptably on news copy and broke instantly on new domains. Stanford NER, released in 2005 by Finkel et al., was the canonical example of the second generation: a Conditional Random Field trained on hand-labelled data. It was the dominant tool for a decade because it was fast, deterministic, and good enough on standard benchmarks like CoNLL-2003.
The third generation arrived with BERT (Devlin et al., 2018–2019). Fine-tuning a pre-trained transformer on labelled NER data lifted CoNLL-2003 F1 from the high 80s into the low 90s and — more importantly — cut the labelled-data requirement by an order of magnitude. By 2022 the field had moved to span-classification heads on top of larger language models.
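The fine-tuning step is less exotic than it sounds. Below is a minimal sketch with the Hugging Face libraries; the checkpoint, the hyperparameters, and the choice of CoNLL-2003 are illustrative stand-ins, not a tuned recipe.

```python
# Minimal token-classification fine-tuning sketch (illustrative, not a recipe).
from datasets import load_dataset
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

dataset = load_dataset("conll2003")                              # word-level tokens + BIO tags
labels = dataset["train"].features["ner_tags"].feature.names    # ['O', 'B-PER', 'I-PER', ...]

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained("bert-base-cased",
                                                        num_labels=len(labels))

def tokenize_and_align(batch):
    # Re-tokenise into subwords and label only the first subword of each word;
    # special tokens and continuation subwords get -100, which the loss ignores.
    enc = tokenizer(batch["tokens"], truncation=True, is_split_into_words=True)
    enc["labels"] = []
    for i, tags in enumerate(batch["ner_tags"]):
        previous, label_ids = None, []
        for word_id in enc.word_ids(i):
            label_ids.append(-100 if word_id is None or word_id == previous else tags[word_id])
            previous = word_id
        enc["labels"].append(label_ids)
    return enc

encoded = dataset.map(tokenize_and_align, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments("ner-finetune", learning_rate=2e-5, num_train_epochs=3),
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```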
The fourth generation, which we are in now, uses general-purpose LLMs in zero-shot or few-shot mode. Claude or GPT-4 can label entities in a document with no fine-tuning at all. The accuracy is not always better than a fine-tuned BERT — but the cost of bringing up a new domain is dramatically lower.
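In practice that means a single prompted call. The sketch below uses the Anthropic Python SDK; the prompt wording, the JSON schema, and the model name are assumptions you would tune for your own corpus, and it trusts the model to return bare JSON.

```python
# Zero-shot NER via one LLM call (prompt, schema, and model name are illustrative).
import json

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

PROMPT = """Extract the named entities from the text below.
Return a JSON list of objects with keys "text" and "type",
where "type" is one of PERSON, ORGANISATION, LOCATION, DATE, MONEY.
Return only the JSON, with no commentary.

Text: {document}"""

def llm_ner(document: str) -> list[dict]:
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",   # substitute whichever model fits your cost envelope
        max_tokens=1024,
        messages=[{"role": "user", "content": PROMPT.format(document=document)}],
    )
    return json.loads(response.content[0].text)  # assumes the model obeyed the JSON-only instruction

print(llm_ner("Acme Corp hired Jane Smith in London on 3 March 2024."))
```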
Where each approach still fails
Domain shift is the perpetual NER killer. A model trained on news will mislabel medical entities; one trained on biomedical text will get tripped up by financial filings. Cross-domain NER is still an open research problem, and most teams solve it by buying or labelling domain data.
Low-resource languages remain a real gap. Whisper-grade ASR exists for roughly 100 languages, but production-quality NER does not. Arabic dialectal NER lags Modern Standard Arabic by 5–10 F1 points; Urdu lags Hindi; many African languages have effectively no off-the-shelf NER at all. LLMs narrow the gap, but not as completely as advertised.
Nested entities — 'Bank of America Corporation' contains 'Bank of America', which in turn contains 'America', and all three spans might be relevant — still confuse most flat sequence labellers. Span-based architectures (DyGIE++, SpERT) handle them; classic CRF-style NER does not.
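The idea behind those span-based architectures is simple even if the models are not: enumerate every candidate span up to some maximum width and classify each one independently, so overlapping entities never compete for the same token labels. A toy sketch of that enumeration, with the learned classifier replaced by a placeholder:

```python
# Toy sketch of span-based NER: every span up to max_width is a candidate, and
# overlapping spans can all be entities (unlike BIO tagging). classify_span is
# a stand-in for a learned classifier over span representations.
from typing import Callable

def enumerate_spans(tokens: list[str], max_width: int = 5) -> list[tuple[int, int]]:
    return [(start, end)
            for start in range(len(tokens))
            for end in range(start + 1, min(start + max_width, len(tokens)) + 1)]

def span_ner(tokens: list[str], classify_span: Callable[[list[str]], str]):
    entities = []
    for start, end in enumerate_spans(tokens):
        label = classify_span(tokens[start:end])   # e.g. 'ORG', 'LOC', or 'O' for non-entity
        if label != "O":
            entities.append((start, end, label))
    return entities   # 'Bank of America' and 'Bank of America Corporation' can both appear
```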
Code-mixed text ('We're using تطبيق KnodeGraph for...', where تطبيق is Arabic for 'app') is where almost everything falls over. Production pipelines either route to a code-mix-specialised model or accept a 10–15% accuracy hit at the language boundary.
The modern stack in 2026
The pragmatic 2026 stack is layered: spaCy's industrial-strength CNN-based NER for fast, cheap pre-filtering on the bulk of the corpus; a fine-tuned transformer (a DeBERTa-v3 or one of the smaller LLaMA variants) for any domain where you have a few thousand labelled examples; and an LLM call as the long-tail fallback for ambiguous spans, or as the only step on small corpora where the per-document cost of an LLM call is acceptable.
Concrete numbers: on CoNLL-2003 English, fine-tuned DeBERTa is roughly 93–94 F1. spaCy's en_core_web_trf is 89–90. A Claude 3.5 zero-shot call is 88–90 with a well-written prompt. The 'right' choice is whatever fits your latency and cost envelope, not whatever has the highest leaderboard number.
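Wired together, the layers look roughly like the sketch below: a cheap spaCy pass over everything, with an escalation rule that sends hard cases to the LLM. The escalation heuristic shown here (no entities found, or mixed scripts in the text) and the `llm_ner` fallback from the earlier sketch are illustrative assumptions, not a recommendation.

```python
# Layered routing sketch: cheap pass first, escalate the hard cases.
import spacy

nlp = spacy.load("en_core_web_sm")   # fast CNN pipeline for the bulk of the corpus

def extract_entities(text: str) -> list[tuple[str, str]]:
    doc = nlp(text)
    ents = [(e.text, e.label_) for e in doc.ents]
    # Illustrative escalation rule: fall back to the LLM when the cheap pass
    # finds nothing, or when the text mixes scripts (where flat NER tends to fail).
    mixed_script = any("\u0600" <= ch <= "\u06FF" for ch in text)   # Arabic block, as one example
    if not ents or mixed_script:
        return [(e["text"], e["type"]) for e in llm_ner(text)]   # llm_ner from the earlier sketch
    return ents
```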
From NER to a knowledge graph
NER gives you nodes. To get edges — the actual graph — you need a second step, relation extraction, which is its own substantial topic. Relation extraction takes pairs of NER spans within a context window and classifies the relationship between them: PRESCRIBED, ACQUIRED, SUBSIDIARY_OF, AUTHOR_OF, and so on.
Most modern pipelines run NER and relation extraction together as a joint model — the same transformer encoder produces span representations and pairwise representations, and two heads classify entity types and relation types. The KnodeGraph extraction worker uses Claude as a single-shot joint extractor, which sidesteps the joint-model engineering at the cost of per-document API spend.
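A single-shot joint prompt looks something like the sketch below. The schema and wording are illustrative, not KnodeGraph's actual extraction prompt; the point is that one call returns both the nodes and the typed edges between them.

```python
# Joint entity + relation extraction in one LLM call (schema is illustrative).
import json

import anthropic

client = anthropic.Anthropic()

JOINT_PROMPT = """From the text below, extract:
1. "entities": a list of {{"id": int, "text": str, "type": str}}
2. "relations": a list of {{"head": entity id, "tail": entity id, "type": str}}
Use relation types such as ACQUIRED, SUBSIDIARY_OF, AUTHOR_OF where they apply.
Return only JSON.

Text: {document}"""

def joint_extract(document: str) -> dict:
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",
        max_tokens=2048,
        messages=[{"role": "user", "content": JOINT_PROMPT.format(document=document)}],
    )
    return json.loads(response.content[0].text)

graph = joint_extract("Seagen became a subsidiary of Pfizer in 2023.")
# graph["entities"] are the nodes; graph["relations"] are typed edges between them.
```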
Related reading
- PDF to knowledge graph — Where NER actually runs: on extracted text from documents.
- Knowledge graphs for legal research — Legal NER (parties, statutes, jurisdictions) is one of the highest-stakes domain applications.
- KnodeGraph vs Diffbot — Diffbot ships a black-box web-scale NER as a hosted service — useful contrast.
- Knowledge graphs for medical literature — BioBERT and PubMedBERT are the field-specific transformers most teams reach for in this domain.
Frequently Asked Questions
Should I fine-tune a transformer or just call an LLM?
Fine-tune if you process more than a few thousand documents per day and accuracy matters at the 1-2 F1 point level — the per-call cost difference accumulates fast at volume. Use an LLM if you process fewer documents, or if your domain shifts often enough that the labelling cost of fine-tuning would outweigh the inference cost of LLM calls.
Why is NER so much worse on Arabic and Hindi than English?
Two reasons. The labelled-data gap: CoNLL and OntoNotes have hundreds of thousands of tagged English entities; the equivalent Arabic and Hindi corpora are 5-10x smaller and skew toward formal news copy. And tokenisation is harder: Arabic clitics and Hindi compound noun phrases need language-aware tokenisers, and using English-trained tokenisers fragments words in ways the NER head cannot recover from.
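The fragmentation is easy to see for yourself. The snippet below compares how an English-only tokeniser and a multilingual one split an Arabic phrase; the checkpoints are just common examples, and the exact pieces you see will vary.

```python
# English-only vs multilingual subword tokenisation on Arabic text (illustrative checkpoints).
from transformers import AutoTokenizer

english = AutoTokenizer.from_pretrained("bert-base-cased")
multilingual = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

text = "شركة أرامكو السعودية"   # 'Saudi Aramco Company'

print(english.tokenize(text))       # largely [UNK] pieces: the NER head has almost nothing to label
print(multilingual.tokenize(text))  # recognisable subword pieces the head can actually tag
```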
What's the difference between NER and entity linking?
NER tells you 'this span is an Organisation'. Entity linking tells you 'this span is the specific Organisation with Wikidata ID Q123, distinct from all other Organisations with the same surface form'. Linking is harder, requires a target knowledge base, and is what turns NER output from a list into a graph. Tools like REL and BLINK are entity linkers; spaCy's NER is not.
Do I need NER if I'm using a Claude or GPT extraction prompt?
Implicitly, yes — the LLM is doing NER in its head as part of the extraction. Explicitly, no — you do not need to call a separate NER tool first. The benefit of doing it as a separate step is auditable spans (you can show a user the exact text that became a node); the benefit of skipping it is one fewer model in the pipeline. KnodeGraph's pipeline does it as one Claude call.
What's the smallest labelled dataset I need to fine-tune NER for a new domain?
Practical floor is 500-1000 labelled documents if you're starting from a strong pre-trained transformer. Below that, a well-prompted LLM with a domain-specific entity list usually beats fine-tuning. Above 5000 documents, fine-tuning starts to clearly win on cost-per-call.
Source
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. 'BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding'. NAACL-HLT 2019. [link]
Ready to Try KnodeGraph?
Start free with 3 graphs and 100 nodes. Upgrade to Pro for AI extraction, unlimited graphs, and 50K nodes.
Get Started Free