How to Build a Knowledge Graph From PDFs
PDFs are the world's default container for unstructured knowledge — research papers, contracts, regulatory filings, manuals — and they are also the worst format for downstream analysis. This tutorial walks through a working pipeline that turns a folder of PDFs into a queryable knowledge graph: extract text, recover layout, lift entities and relations, dedupe across documents, and load Cypher triples into a graph database. Every step uses tools you can install today.
Step 1: Pick a PDF parser that respects layout
The single biggest determinant of graph quality is whether your parser understands columns, tables, and reading order. A naive `pdftotext` dump on a two-column journal article scrambles paragraphs and ruins entity extraction. Use a layout-aware parser.
- PyMuPDF (a.k.a. `fitz`, version 1.24+) — the fastest option (Python bindings over the MuPDF C library), exposes text blocks with bounding boxes.
- pdfplumber 0.11 — slower, but excellent for table extraction.
- Unstructured.io 0.15 — chunks documents into typed elements (Title, NarrativeText, Table) which are perfect upstream of an LLM.
- For scanned PDFs, fall back to Tesseract 5.4 or PaddleOCR 2.8 with a layout model like LayoutLMv3.
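The reading-order problem is easy to see in miniature. Here is a sketch of the column-aware sort a naive dump skips, assuming PyMuPDF-style block tuples `(x0, y0, x1, y1, text, ...)` as returned by `page.get_text("blocks")`; the two-column split on the page midpoint is a simplification for illustration:

```python
def reading_order(blocks, page_width):
    """Sort PyMuPDF-style text blocks (x0, y0, x1, y1, text, ...) into
    two-column reading order: left column top-to-bottom, then right."""
    mid = page_width / 2
    left = [b for b in blocks if b[0] < mid]    # x0 left of page midpoint
    right = [b for b in blocks if b[0] >= mid]
    ordered = sorted(left, key=lambda b: b[1]) + sorted(right, key=lambda b: b[1])
    return [b[4] for b in ordered]              # b[4] is the block's text
```

A raw `pdftotext`-style dump would interleave these blocks by vertical position alone, splicing the right column into the middle of the left one — exactly the scrambling that ruins entity extraction.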
Step 2: Detect when OCR is required
Roughly 15-20% of PDFs in the wild are image-only scans with no text layer. Detect this cheaply: if `page.get_text()` returns under 50 characters but the page has images, route the page through OCR.
Modern OCR is no longer the bottleneck — Tesseract 5 with `--psm 6` and the right language data (e.g., `eng+fra+ara`) hits 96%+ accuracy on clean scans. Layout still suffers, which is why the next step matters.
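The detection heuristic is a few lines with PyMuPDF (`get_text()` and `get_images()` are real `Page` methods); the 50-character threshold is the one suggested above:

```python
def needs_ocr(page, min_chars=50):
    """Route a page to OCR when it has almost no text layer but does
    contain embedded images — i.e., it is probably a scan."""
    text = page.get_text().strip()
    return len(text) < min_chars and len(page.get_images(full=True)) > 0
```

Run this per page, not per document: hybrid PDFs (born-digital body with scanned appendices) are common, and routing only the image-only pages through Tesseract keeps throughput up.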
Step 3: Extract entities and relations from each chunk
You have two viable approaches. Option A: a frozen NER model like spaCy `en_core_web_trf` 3.7 plus a relation classifier. Option B: an LLM with a structured-output schema (Claude, GPT-4o, or a fine-tuned Llama). Option B wins on recall and on cross-sentence relations; option A wins on cost and latency for repetitive domains.
Always require an `evidence` field — a verbatim quote from the source — for every relation. This makes hallucinations trivial to spot during review and gives you a click-through link from graph edges back to the PDF.
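Because the `evidence` field must be a verbatim quote, you can reject hallucinated relations mechanically before they ever reach review. A minimal validator (the field names match the schema described above; the shape of `rel` is otherwise an assumption):

```python
def validate_relation(rel, chunk_text):
    """Reject a relation whose evidence quote is not found verbatim
    in the source chunk — a cheap, mechanical hallucination filter."""
    required = {"subject", "predicate", "object", "evidence"}
    if not required <= rel.keys():
        return False
    return rel["evidence"] in chunk_text
```

Anything that fails this check goes back to the LLM or straight to the reject pile; anything that passes carries its own click-through link to the PDF.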
Step 4: Dedupe entities across documents
Without deduplication you will end up with "World Health Organization", "WHO", and "W.H.O." as three separate nodes. Use a two-pass approach: normalize surface forms, then merge by embedding similarity within entity type.
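Both passes fit in a short sketch. The normalization pass is plain string cleanup; the merge pass assumes you supply an `embed(text) -> vector` function (e.g., any sentence encoder — the greedy single-pass clustering and the 0.9 threshold are illustrative choices, not a prescription):

```python
import re
from math import sqrt

def normalize(name):
    """Pass 1: canonical surface form — drop punctuation, collapse
    whitespace, lowercase. 'W.H.O.' -> 'who'."""
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", name)).strip().lower()

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def merge_entities(entities, embed, threshold=0.9):
    """Pass 2: greedy merge by embedding similarity, only within the
    same entity type. `entities` is [(surface_form, type)]; returns a
    surface_form -> canonical_name mapping."""
    clusters = []  # (type, canonical_name, vector)
    mapping = {}
    for name, etype in entities:
        vec = embed(normalize(name))
        for ctype, cname, cvec in clusters:
            if ctype == etype and cosine(vec, cvec) >= threshold:
                mapping[name] = cname
                break
        else:
            clusters.append((etype, name, vec))
            mapping[name] = name
    return mapping
```

Restricting merges to within a type matters: "Paris" the city and "Paris" the litigant in a legal filing should never collapse into one node no matter how similar their embeddings look.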
Step 5: Load triples into a graph database
Pick FalkorDB (Redis-compatible, what KnodeGraph uses internally) or Neo4j 5. Both speak Cypher. Use idempotent MERGE patterns so you can rerun extraction without creating duplicates.
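An idempotent loader keys every node on a normalized `id` (slug + type) and uses `MERGE` for both nodes and edges. One sketch, with two caveats baked in: relationship types cannot be query parameters in Cypher, so the type is interpolated and must come from a whitelist (the `ALLOWED_RELS` set here is a hypothetical example):

```python
import re

ALLOWED_RELS = {"FOUNDED_IN", "LOCATED_IN", "PART_OF"}  # hypothetical whitelist

def slug(name, etype):
    """Normalized node id: type prefix + slugified name, so MERGE
    matches the same entity on every rerun."""
    return etype.lower() + ":" + re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")

def merge_triple(subj, rel, obj, source_doc, evidence, etype="Entity"):
    """Build an idempotent Cypher MERGE (query, params) for one triple.
    Rerunning it against the same graph creates nothing new."""
    if rel not in ALLOWED_RELS:
        raise ValueError(f"unexpected relation type: {rel}")
    query = (
        "MERGE (s:Entity {id: $sid}) ON CREATE SET s.name = $sname "
        "MERGE (o:Entity {id: $oid}) ON CREATE SET o.name = $oname "
        f"MERGE (s)-[r:{rel}]->(o) "
        "SET r.source_doc = $doc, r.evidence = $evidence"
    )
    params = {
        "sid": slug(subj, etype), "sname": subj,
        "oid": slug(obj, etype), "oname": obj,
        "doc": source_doc, "evidence": evidence,
    }
    return query, params
```

Everything except the relationship type travels as a parameter, which keeps the query plan cacheable and closes the injection hole that string-built Cypher otherwise opens.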
Once loaded, you can answer real questions in tens of milliseconds against millions of edges using multi-hop Cypher traversals.
Common pitfalls
- Skipping a human-review staging step. Auto-committing extractions straight to the graph guarantees it fills with hallucinated relations within the first 50 documents.
- Forgetting to store provenance. Without `source_doc` and `evidence` on every edge you can never explain why a relationship is in the graph.
- Treating page numbers as authoritative. Some PDFs (e.g., legal exhibits) are reordered, so the printed page number disagrees with the file's page index — store both.
- Using a single huge chunk for extraction. LLMs lose recall above ~3000 tokens of dense prose — chunk by section and overlap by 200 tokens.
- Indexing on raw entity name. Always index on a normalized `id` (slug + type) to keep MERGE idempotent.
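The chunking pitfall above is worth spelling out. A sliding window over a token list, with the 200-token overlap already suggested, looks like this (tokenization itself is out of scope — any tokenizer that yields a list works):

```python
def chunk_with_overlap(tokens, max_tokens=3000, overlap=200):
    """Split a token list into windows of at most max_tokens, with
    `overlap` tokens repeated across each boundary so no relation
    straddling a cut point is lost."""
    chunks, start = [], 0
    step = max_tokens - overlap
    while start < len(tokens):
        chunks.append(tokens[start:start + max_tokens])
        start += step
    return chunks
```

Prefer cutting on section boundaries first and only falling back to this fixed window inside oversized sections; the overlap then catches cross-sentence relations that would otherwise be split between two prompts.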
Frequently Asked Questions
How long does it take to process 100 PDFs?
On a single worker, plan for 30-90 seconds per PDF including layout parsing, chunking, LLM extraction, and dedup — so 50-150 minutes for 100 documents. Parallelizing across four Celery workers cuts that to roughly 13-38 minutes.
Do I need a GPU?
No. The heavy NLP runs in a hosted LLM (Claude or GPT-4o), not locally. spaCy transformer models do benefit from a GPU but a CPU is fine if you batch sentences.
How do I keep the graph in sync when documents are updated?
Hash each PDF (sha256 of bytes) and store the hash on every node and edge created from it. When a new version arrives, run an `OPTIONAL MATCH ... DETACH DELETE` on the old hash before re-extracting.
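Concretely, the hash plus the purge query described above (property and label names here follow the provenance convention from earlier in the post):

```python
import hashlib

def doc_hash(pdf_bytes):
    """Content hash stored as `source_hash` on every node and edge
    extracted from this file."""
    return hashlib.sha256(pdf_bytes).hexdigest()

def purge_stale(old_hash):
    """(query, params) to remove everything from a superseded version
    before re-extracting. OPTIONAL MATCH makes it a no-op if the old
    version was never loaded; DETACH DELETE drops edges with the nodes."""
    return (
        "OPTIONAL MATCH (n {source_hash: $h}) DETACH DELETE n",
        {"h": old_hash},
    )
```

Because extraction is idempotent (MERGE on normalized ids), purge-then-reload is safe to rerun: a crash halfway through leaves the graph missing the document, never duplicated.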
Can I do this without writing code?
Yes — KnodeGraph wraps the entire pipeline in a UI. Upload a PDF, it runs PyMuPDF + Claude extraction + dedup + FalkorDB load, and surfaces the result in a Cytoscape.js canvas.
What about tables and figures?
Tables convert reasonably well with pdfplumber's `extract_tables()` plus a row-to-fact prompt to the LLM. Figures and diagrams are still an open problem — vision-capable models (e.g., Claude with image input) now read most flowcharts correctly.
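Before the LLM ever sees a table, you can flatten it deterministically. `extract_tables()` returns each table as a list of rows (lists of cell strings, with `None` for empty cells); assuming the first row is a header and the first column labels each row — a common but not universal layout — the row-to-fact conversion is mechanical:

```python
def table_to_facts(table):
    """Turn a pdfplumber-style table (list of rows; first row assumed
    to be the header, first column the row label) into
    (row_label, column, value) facts ready for an LLM prompt."""
    header, *rows = table
    facts = []
    for row in rows:
        label = row[0]
        for col, val in zip(header[1:], row[1:]):
            if val:  # skip empty cells (pdfplumber yields None)
                facts.append((label, col, val))
    return facts
```

Feeding these triples to the extraction prompt — rather than the raw table text — is what makes "row-to-fact" reliable: the LLM only has to name the predicate, not reconstruct the grid.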
Source
DocLayNet (Pfitzmann et al., IBM Research, 2022) released 80,863 manually annotated pages spanning six document categories, demonstrating that layout-aware models outperform pure text extraction by 23+ F1 points on table and figure recovery — which is why a layout-aware parser is non-negotiable for graph quality. [link]
Ready to Try KnodeGraph?
Start free with 3 graphs and 100 nodes. Upgrade to Pro for AI extraction, unlimited graphs, and 50K nodes.
Get Started Free