Turn Any PDF into a Knowledge Graph

PDFs are the lingua franca of serious documents — research papers, court filings, financial reports, government records, technical manuals. They're also the most painful format to extract structured information from. KnodeGraph reads PDFs (text-based or scanned with OCR), pulls out entities and relationships, and gives you an interactive graph instead of yet another folder of files no one rereads.

Why connect PDF to KnodeGraph

  • PDF turned 30 in 2023, and more than 2.5 trillion PDF files exist worldwide according to Adobe's 2023 PDF Day estimate. A large share of enterprise knowledge lives in PDFs.
  • Tesseract OCR (the open-source OCR engine KnodeGraph uses for scanned PDFs) achieves ~94% character accuracy on clean print scans, ~85% on smartphone-photographed pages — review the staging extractions accordingly.
  • Apache PDFBox + pdfminer.six handle text-based PDF extraction natively; KnodeGraph supports both, plus AcroForm field extraction for form-heavy filings.
  • A standard research paper averages ~7K words; a typical legal pleading runs 15–40 pages. Both sit well inside the Pro tier's limit of ~150 pages per upload (files are also capped at 5 MB).
  • Bookmarks, headings, and page numbers are preserved as document-structural nodes so you can trace any entity back to the page it came from.
  • Multi-language: KnodeGraph handles Arabic, Hebrew, Mandarin, and Russian text, including right-to-left and other non-Latin scripts, without separate pipelines, via Claude's native multilingual support.

How it works end-to-end

1. Upload PDFs

Drag in a folder, paste URLs to public PDFs, or use the API. Up to 50 PDFs per Pro batch, each up to 5 MB.

2. OCR or text-extract

Text-based PDFs (the kind made from Word or LaTeX) extract directly. Image-based scans run through OCR. KnodeGraph picks automatically based on each page's content.
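The per-page choice can be pictured with a small sketch. KnodeGraph's actual heuristic is not documented on this page; the rule below (treat a page as a scan when its embedded text layer has fewer than a threshold of non-whitespace characters) and the names `choose_route`, `plan_document`, and `MIN_TEXT_CHARS` are assumptions for illustration only.

```python
from typing import List

# Assumed threshold, not a KnodeGraph setting: pages whose text layer has
# fewer non-whitespace characters than this are treated as image-only.
MIN_TEXT_CHARS = 20


def choose_route(page_text: str, min_chars: int = MIN_TEXT_CHARS) -> str:
    """Return "text" if the page's embedded text layer looks usable, else "ocr"."""
    visible = sum(1 for ch in page_text if not ch.isspace())
    return "text" if visible >= min_chars else "ocr"


def plan_document(page_texts: List[str]) -> List[str]:
    """Pick a route independently for every page of one document."""
    return [choose_route(text) for text in page_texts]


pages = [
    "Section 1. The parties agree to the following terms and conditions.",
    "",          # scanned exhibit: no text layer at all
    " \n 3 \n",  # scanned page where only a stray page number was embedded
]
routes = plan_document(pages)
```

Deciding per page rather than per file matters for mixed documents, such as a digitally produced contract with scanned exhibits appended.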

3. Pick a template

Domain templates (Legal, Medical, Research, Business) constrain entity types so extraction matches your domain's vocabulary.
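As a rough mental model, a template can be thought of as an allow-list of entity types. The sketch below is hypothetical: the type names inside `TEMPLATES`, the `Entity` class, and `apply_template` are invented for illustration, and the real templates may constrain the extraction step itself rather than filter its output afterwards.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List

# Hypothetical template contents; the real type lists are not given on this page.
TEMPLATES: Dict[str, FrozenSet[str]] = {
    "Legal": frozenset({"Person", "Court", "Statute", "Case"}),
    "Medical": frozenset({"Drug", "Condition", "Procedure", "Patient"}),
}


@dataclass(frozen=True)
class Entity:
    name: str
    type: str


def apply_template(entities: List[Entity], template: str) -> List[Entity]:
    """Keep only the entities whose type the chosen template allows."""
    allowed = TEMPLATES[template]
    return [e for e in entities if e.type in allowed]


extracted = [
    Entity("Jane Roe", "Person"),
    Entity("42 U.S.C. § 1983", "Statute"),
    Entity("ibuprofen", "Drug"),
]
legal_only = apply_template(extracted, "Legal")
```

With the assumed Legal template, the drug mention is dropped while the person and the statute citation are kept.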

4. Review and curate

Staging UI shows the extracted entities and relationships per source page. Approve, reject, merge — your graph reflects only what you confirm.

5. Walk and export

Filter, search, and visualise in the browser. Export as PNG, SVG, JSON, or CSV when you're ready.

Why KnodeGraph is a good fit

  • Most PDF tools do search or summarisation; KnodeGraph adds structured entity-relationship extraction at the same price point.
  • Templates handle the domain-specific vocabulary (drug names, statute citations, financial instruments) that generic NLP tools mangle.
  • Provenance: every entity links back to the document and page it came from, so the graph is auditable.
  • 100+ language support with no additional setup — drop in a Spanish judgment alongside an English contract and they share the graph.
  • Visual editor lets a non-programmer fix extraction errors directly, instead of writing a Python regex.

Supported formats

  • Text-based PDFs (made from Word, LaTeX, or any digital source)
  • Scanned PDFs (run through Tesseract OCR — accuracy 85–94% depending on scan quality)
  • PDFs with AcroForm fields (form data extracted alongside body text)
  • Multi-page PDFs up to 150 pages per file on Pro
  • PDFs in 100+ languages including Arabic, Hebrew, Mandarin, Russian, and most European scripts

Limitations to know up front

  • Image-only PDFs depend on OCR quality — review staging carefully for scans of faxes, photographed pages, or low-DPI documents.
  • Tables in PDFs are notoriously hard to extract structurally; KnodeGraph captures table cells as text but doesn't preserve the row/column relationships. To keep them, export the table separately with a table-extraction tool before upload.
  • Encrypted or password-protected PDFs must be unlocked before upload.
  • Forms with handwritten input require OCR + manual review — handwriting accuracy is much lower than print.

Frequently Asked Questions

What if my PDF is just a scanned image?

KnodeGraph runs Tesseract OCR automatically for image-based PDFs. Accuracy is ~94% on clean print scans, ~85% on smartphone-photographed pages, lower on faxes. The staging review step exists for exactly this reason — you'll catch the OCR quirks before they pollute the graph. For mission-critical material, run a higher-end OCR pass first (ABBYY FineReader, Adobe Acrobat Pro) and upload the cleaned PDF.
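To see why character-level accuracy figures still call for review, it helps to translate them into whole-name accuracy. The sketch below assumes OCR errors are independent per character, which is a simplification (real errors cluster around smudges and unusual fonts); the function name is illustrative.

```python
def clean_string_probability(char_accuracy: float, length: int) -> float:
    """Chance that every character of a string is recognised correctly,
    assuming independent per-character errors (a simplifying assumption)."""
    if not 0.0 <= char_accuracy <= 1.0:
        raise ValueError("char_accuracy must be between 0 and 1")
    if length < 0:
        raise ValueError("length must be non-negative")
    return char_accuracy ** length


# A 10-character entity name at the two accuracy levels quoted above.
clean_scan = clean_string_probability(0.94, 10)
phone_photo = clean_string_probability(0.85, 10)
```

Under that assumption, a 10-character name survives intact roughly 54% of the time at 94% character accuracy and roughly 20% of the time at 85%, so misspelled entity names in staging are expected rather than exceptional.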

Can it handle large PDF batches — say a 200-document litigation set?

Pro tier supports 50 PDFs per batch and queues additional batches. A 200-document set is four batches and takes 4–6 hours end to end, including extraction. We have customers running weekly 1,000-document refreshes by scripting against the API; reach out if your volume is heavier than that.
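If you script large sets yourself, the batching step might look like the sketch below. The 50-file and 5 MB limits come from this page; reading "5 MB" as 5 × 1024 × 1024 bytes is an assumption, and the helper names are illustrative rather than part of any KnodeGraph client.

```python
from typing import Iterator, List, Sequence, Tuple

BATCH_SIZE = 50              # Pro limit: PDFs per batch
MAX_BYTES = 5 * 1024 * 1024  # assumes "5 MB" means 5 MiB


def split_by_size(
    files: Sequence[Tuple[str, int]], max_bytes: int = MAX_BYTES
) -> Tuple[List[str], List[str]]:
    """Separate (name, size_in_bytes) pairs into uploadable and oversized names."""
    ok = [name for name, size in files if size <= max_bytes]
    too_big = [name for name, size in files if size > max_bytes]
    return ok, too_big


def batches(names: Sequence[str], batch_size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive batches of at most batch_size names."""
    for start in range(0, len(names), batch_size):
        yield list(names[start:start + batch_size])


# A 200-document set plus one file that is too large to upload as-is.
files = [(f"doc_{i:03d}.pdf", 1_000_000) for i in range(200)]
files.append(("exhibit_scan.pdf", 12_000_000))
ok, too_big = split_by_size(files)
plan = list(batches(ok))
```

Here the 200 in-limit files become four batches of 50, and the oversized scan is set aside to be split or compressed before upload.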

What about tables in PDFs (financial statements, lab results)?

Tables are PDF's hardest data structure. KnodeGraph extracts the cell text but doesn't reliably preserve the table's row/column structure — that requires layout-aware tools (Tabula, Camelot) as a preprocessing step. For financial filings or lab reports, run a table-extraction tool first, then ingest the resulting CSV alongside the narrative PDF.
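A minimal sketch of that preprocessing step, using Camelot (installed separately, for example with `pip install camelot-py`), is shown below. The helper names and the output naming scheme are assumptions; Camelot works on text-based PDFs only, so scanned tables need an OCR pass first.

```python
from pathlib import Path
from typing import List


def csv_name(pdf_path: Path, index: int) -> str:
    """File name for the CSV holding the index-th table (1-based) of a PDF."""
    return f"{pdf_path.stem}_table_{index:02d}.csv"


def export_tables(pdf_path: Path, out_dir: Path) -> List[Path]:
    """Write every table Camelot detects in pdf_path to its own CSV file."""
    import camelot  # third-party; imported here so csv_name works without it

    out_dir.mkdir(parents=True, exist_ok=True)
    tables = camelot.read_pdf(str(pdf_path), pages="all")
    written: List[Path] = []
    for i in range(len(tables)):
        target = out_dir / csv_name(pdf_path, i + 1)
        # Each detected table exposes its cells as a pandas DataFrame.
        tables[i].df.to_csv(target, index=False, header=False)
        written.append(target)
    return written
```

Camelot's default `lattice` mode expects tables with ruled lines; for tables laid out with whitespace only, passing `flavor="stream"` to `camelot.read_pdf` is usually the better starting point. Check the resulting CSVs by eye before ingesting them.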

How do I trace an extracted entity back to its source page?

Every node and edge has a provenance attribute pointing to the document ID and page number. In the staging UI you can click any extracted item and see the highlighted source passage. After approval, the provenance link persists in the graph — invaluable for audit, fact-checking, or legal-grade work.
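One way to picture the provenance attribute is as a small record attached to each node. The field and class names below (`document_id`, `page`, `Provenance`, `Node`) are assumptions for illustration; the attribute names in an actual export may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Provenance:
    document_id: str  # which uploaded PDF the item came from
    page: int         # 1-based page number within that PDF


@dataclass(frozen=True)
class Node:
    label: str
    type: str
    provenance: Provenance


def nodes_from_page(nodes: List[Node], document_id: str, page: int) -> List[Node]:
    """Return every node extracted from one page of one document."""
    wanted = Provenance(document_id, page)
    return [n for n in nodes if n.provenance == wanted]


graph = [
    Node("Acme Corp", "Organization", Provenance("contract-001", 3)),
    Node("Jane Roe", "Person", Provenance("contract-001", 3)),
    Node("Jane Roe", "Person", Provenance("judgment-017", 12)),
]
page_three = nodes_from_page(graph, "contract-001", 3)
```

Keeping the document ID and page together on every node is what makes an audit question such as "what did we take from page 3 of this contract?" a simple filter.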

Is there an API for batch PDF ingestion?

Yes — Pro tier includes API access. POST a PDF (or a list of presigned URLs) to /api/v1/documents/ingest with a target graph and optional template ID. The job runs async; webhook callbacks fire when extraction is staged. Full docs at the /docs endpoint.
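A request for the presigned-URL variant might be assembled as in the sketch below, using only the Python standard library. Only the path `/api/v1/documents/ingest` comes from this page; the host name, the bearer-token header, and the JSON field names (`graph_id`, `template_id`, `urls`) are assumptions, so check them against the API docs before use.

```python
import json
import urllib.request
from typing import List, Optional

INGEST_PATH = "/api/v1/documents/ingest"  # path as given in the answer above


def build_ingest_request(
    base_url: str,
    api_key: str,
    graph_id: str,
    pdf_urls: List[str],
    template_id: Optional[str] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST request for batch ingestion."""
    # Field names are hypothetical placeholders, not the documented schema.
    payload = {"graph_id": graph_id, "urls": pdf_urls}
    if template_id is not None:
        payload["template_id"] = template_id
    return urllib.request.Request(
        url=base_url.rstrip("/") + INGEST_PATH,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_ingest_request(
    base_url="https://knodegraph.example",  # placeholder host
    api_key="YOUR_API_KEY",
    graph_id="litigation-set-a",
    pdf_urls=["https://files.example/brief.pdf"],
    template_id="legal",
)
```

Passing the request to `urllib.request.urlopen` would send it. Because the job runs asynchronously, the response should be treated as an acknowledgement, with results arriving through the webhook callback.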

Connect PDF to KnodeGraph

Start free with 3 graphs and 100 nodes. Upgrade to Pro for AI extraction, unlimited graphs, and 50K nodes.

Get Started Free