How to Build a Knowledge Graph From PDFs
PDFs are the world's default container for unstructured knowledge — research papers, contracts, regulatory filings, manuals — and they are also the worst format for downstream analysis. This tutorial walks through a working pipeline that turns a folder of PDFs into a queryable knowledge graph: extract text, recover layout, lift entities and relations, dedupe across documents, and load Cypher triples into a graph database. Every step uses tools you can install today.
Step 1: Pick a PDF parser that respects layout
The single biggest determinant of graph quality is whether your parser understands columns, tables, and reading order. A naive `pdftotext` dump on a two-column journal article scrambles paragraphs and ruins entity extraction. Use a layout-aware parser.
- PyMuPDF (a.k.a. `fitz`, version 1.24+) — the fastest option (Python bindings over the MuPDF C library), exposes text blocks with bounding boxes.
- pdfplumber 0.11 — slower, but excellent for table extraction.
- Unstructured.io 0.15 — chunks documents into typed elements (Title, NarrativeText, Table) which are perfect upstream of an LLM.
- For scanned PDFs, fall back to Tesseract 5.4 or PaddleOCR 2.8 with a layout model like LayoutLMv3.
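The reading-order problem is easy to see in miniature. Here is a sketch of the column-aware sort a naive dump skips, assuming PyMuPDF-style block tuples `(x0, y0, x1, y1, text, ...)` as returned by `page.get_text("blocks")`; the two-column split on the page midpoint is a simplification for illustration:

```python
def reading_order(blocks, page_width):
    """Sort PyMuPDF-style text blocks (x0, y0, x1, y1, text, ...) into
    two-column reading order: left column top-to-bottom, then right."""
    mid = page_width / 2
    left = [b for b in blocks if b[0] < mid]    # x0 left of page midpoint
    right = [b for b in blocks if b[0] >= mid]
    ordered = sorted(left, key=lambda b: b[1]) + sorted(right, key=lambda b: b[1])
    return [b[4] for b in ordered]              # b[4] is the block's text
```

A raw `pdftotext`-style dump would interleave these blocks by vertical position alone, splicing the right column into the middle of the left one — exactly the scrambling that ruins entity extraction.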
Step 2: Detect when OCR is required
Roughly 15-20% of PDFs in the wild are image-only scans with no text layer. Detect this cheaply: if `page.get_text()` returns under 50 characters but the page has images, route the page through OCR.
Modern OCR is no longer the bottleneck — Tesseract 5 with `--psm 6` and the right language data (e.g., `eng+fra+ara`) hits 96%+ accuracy on clean scans. Layout still suffers, which is why the next step matters.
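The detection heuristic is a few lines with PyMuPDF (`get_text()` and `get_images()` are real `Page` methods); the 50-character threshold is the one suggested above:

```python
def needs_ocr(page, min_chars=50):
    """Route a page to OCR when it has almost no text layer but does
    contain embedded images — i.e., it is probably a scan."""
    text = page.get_text().strip()
    return len(text) < min_chars and len(page.get_images(full=True)) > 0
```

Run this per page, not per document: hybrid PDFs (born-digital body with scanned appendices) are common, and routing only the image-only pages through Tesseract keeps throughput up.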
Step 3: Extract entities and relations from each chunk
You have two viable approaches. Option A: a frozen NER model like spaCy `en_core_web_trf` 3.7 plus a relation classifier. Option B: an LLM with a structured-output schema (Claude, GPT-4o, or a fine-tuned Llama). Option B wins on recall and on cross-sentence relations; option A wins on cost and latency for repetitive domains.
Always require an `evidence` field — a verbatim quote from the source — for every relation. This makes hallucinations trivial to spot during review and gives you a click-through link from graph edges back to the PDF.
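Because the `evidence` field must be a verbatim quote, you can reject hallucinated relations mechanically before they ever reach review. A minimal validator (the field names match the schema described above; the shape of `rel` is otherwise an assumption):

```python
def validate_relation(rel, chunk_text):
    """Reject a relation whose evidence quote is not found verbatim
    in the source chunk — a cheap, mechanical hallucination filter."""
    required = {"subject", "predicate", "object", "evidence"}
    if not required <= rel.keys():
        return False
    return rel["evidence"] in chunk_text
```

Anything that fails this check goes back to the LLM or straight to the reject pile; anything that passes carries its own click-through link to the PDF.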
Step 4: Dedupe entities across documents
Without deduplication you will end up with "World Health Organization", "WHO", and "W.H.O." as three separate nodes. Use a two-pass approach: normalize surface forms, then merge by embedding similarity within entity type.
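Both passes fit in a short sketch. The normalization pass is plain string cleanup; the merge pass assumes you supply an `embed(text) -> vector` function (e.g., any sentence encoder — the greedy single-pass clustering and the 0.9 threshold are illustrative choices, not a prescription):

```python
import re
from math import sqrt

def normalize(name):
    """Pass 1: canonical surface form — drop punctuation, collapse
    whitespace, lowercase. 'W.H.O.' -> 'who'."""
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", name)).strip().lower()

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def merge_entities(entities, embed, threshold=0.9):
    """Pass 2: greedy merge by embedding similarity, only within the
    same entity type. `entities` is [(surface_form, type)]; returns a
    surface_form -> canonical_name mapping."""
    clusters = []  # (type, canonical_name, vector)
    mapping = {}
    for name, etype in entities:
        vec = embed(normalize(name))
        for ctype, cname, cvec in clusters:
            if ctype == etype and cosine(vec, cvec) >= threshold:
                mapping[name] = cname
                break
        else:
            clusters.append((etype, name, vec))
            mapping[name] = name
    return mapping
```

Restricting merges to within a type matters: "Paris" the city and "Paris" the litigant in a legal filing should never collapse into one node no matter how similar their embeddings look.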
Step 5: Load triples into a graph database
Pick FalkorDB (Redis-compatible, what KnodeGraph uses internally) or Neo4j 5. Both speak Cypher. Use idempotent MERGE patterns so you can rerun extraction without creating duplicates.
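An idempotent loader keys every node on a normalized `id` (slug + type) and uses `MERGE` for both nodes and edges. One sketch, with two caveats baked in: relationship types cannot be query parameters in Cypher, so the type is interpolated and must come from a whitelist (the `ALLOWED_RELS` set here is a hypothetical example):

```python
import re

ALLOWED_RELS = {"FOUNDED_IN", "LOCATED_IN", "PART_OF"}  # hypothetical whitelist

def slug(name, etype):
    """Normalized node id: type prefix + slugified name, so MERGE
    matches the same entity on every rerun."""
    return etype.lower() + ":" + re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")

def merge_triple(subj, rel, obj, source_doc, evidence, etype="Entity"):
    """Build an idempotent Cypher MERGE (query, params) for one triple.
    Rerunning it against the same graph creates nothing new."""
    if rel not in ALLOWED_RELS:
        raise ValueError(f"unexpected relation type: {rel}")
    query = (
        "MERGE (s:Entity {id: $sid}) ON CREATE SET s.name = $sname "
        "MERGE (o:Entity {id: $oid}) ON CREATE SET o.name = $oname "
        f"MERGE (s)-[r:{rel}]->(o) "
        "SET r.source_doc = $doc, r.evidence = $evidence"
    )
    params = {
        "sid": slug(subj, etype), "sname": subj,
        "oid": slug(obj, etype), "oname": obj,
        "doc": source_doc, "evidence": evidence,
    }
    return query, params
```

Everything except the relationship type travels as a parameter, which keeps the query plan cacheable and closes the injection hole that string-built Cypher otherwise opens.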
Once loaded, you can answer real questions in tens of milliseconds against millions of edges using multi-hop Cypher traversals.
Common pitfalls
- Skipping a human-review staging step. Auto-committing extractions straight to the graph guarantees it fills with hallucinated relations within the first 50 documents.
- Forgetting to store provenance. Without `source_doc` and `evidence` on every edge you can never explain why a relationship is in the graph.
- Treating page numbers as authoritative. Some PDFs (e.g., legal exhibits) are reordered, so the printed page number disagrees with the file's page index — store both.
- Using a single huge chunk for extraction. LLMs lose recall above ~3000 tokens of dense prose — chunk by section and overlap by 200 tokens.
- Indexing on raw entity name. Always index on a normalized `id` (slug + type) to keep MERGE idempotent.
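The chunking pitfall above is worth spelling out. A sliding window over a token list, with the 200-token overlap already suggested, looks like this (tokenization itself is out of scope — any tokenizer that yields a list works):

```python
def chunk_with_overlap(tokens, max_tokens=3000, overlap=200):
    """Split a token list into windows of at most max_tokens, with
    `overlap` tokens repeated across each boundary so no relation
    straddling a cut point is lost."""
    chunks, start = [], 0
    step = max_tokens - overlap
    while start < len(tokens):
        chunks.append(tokens[start:start + max_tokens])
        start += step
    return chunks
```

Prefer cutting on section boundaries first and only falling back to this fixed window inside oversized sections; the overlap then catches cross-sentence relations that would otherwise be split between two prompts.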
Frequently Asked Questions
How long does it take to process 100 PDFs?
On a single worker, plan for 30-90 seconds per PDF including layout parsing, chunking, LLM extraction, and dedup — so 50-150 minutes for 100 documents. Parallelizing across four Celery workers cuts that to roughly 13-38 minutes.
Do I need a GPU?
No. The heavy NLP runs in a hosted LLM (Claude or GPT-4o), not locally. spaCy transformer models do benefit from a GPU but a CPU is fine if you batch sentences.
How do I keep the graph in sync when documents are updated?
Hash each PDF (sha256 of bytes) and store the hash on every node and edge created from it. When a new version arrives, run an `OPTIONAL MATCH ... DETACH DELETE` on the old hash before re-extracting.
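Concretely, the hash plus the purge query described above (property and label names here follow the provenance convention from earlier in the post):

```python
import hashlib

def doc_hash(pdf_bytes):
    """Content hash stored as `source_hash` on every node and edge
    extracted from this file."""
    return hashlib.sha256(pdf_bytes).hexdigest()

def purge_stale(old_hash):
    """(query, params) to remove everything from a superseded version
    before re-extracting. OPTIONAL MATCH makes it a no-op if the old
    version was never loaded; DETACH DELETE drops edges with the nodes."""
    return (
        "OPTIONAL MATCH (n {source_hash: $h}) DETACH DELETE n",
        {"h": old_hash},
    )
```

Because extraction is idempotent (MERGE on normalized ids), purge-then-reload is safe to rerun: a crash halfway through leaves the graph missing the document, never duplicated.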
Can I do this without writing code?
Yes — KnodeGraph wraps the entire pipeline in a UI. Upload a PDF, it runs PyMuPDF + Claude extraction + dedup + FalkorDB load, and surfaces the result in a Cytoscape.js canvas.
What about tables and figures?
Tables convert reasonably well with pdfplumber's `extract_tables()` plus a row-to-fact prompt to the LLM. Figures and diagrams are still an open problem — vision-capable models (e.g., Claude with image input) now read most flowcharts correctly.
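Before the LLM ever sees a table, you can flatten it deterministically. `extract_tables()` returns each table as a list of rows (lists of cell strings, with `None` for empty cells); assuming the first row is a header and the first column labels each row — a common but not universal layout — the row-to-fact conversion is mechanical:

```python
def table_to_facts(table):
    """Turn a pdfplumber-style table (list of rows; first row assumed
    to be the header, first column the row label) into
    (row_label, column, value) facts ready for an LLM prompt."""
    header, *rows = table
    facts = []
    for row in rows:
        label = row[0]
        for col, val in zip(header[1:], row[1:]):
            if val:  # skip empty cells (pdfplumber yields None)
                facts.append((label, col, val))
    return facts
```

Feeding these triples to the extraction prompt — rather than the raw table text — is what makes "row-to-fact" reliable: the LLM only has to name the predicate, not reconstruct the grid.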
Source
DocLayNet (Pfitzmann et al., IBM Research, 2022) released 80,863 manually annotated pages spanning six document categories, demonstrating that layout-aware models outperform pure text extraction by 23+ F1 points on table and figure recovery — which is why a layout-aware parser is non-negotiable for graph quality. [link]
Ready to Try KnodeGraph?
Start free with 3 graphs and 100 nodes. Upgrade to Pro for AI extraction, unlimited graphs, and 50K nodes.
Get Started Free