Turn Microsoft Word Documents into a Knowledge Graph

Word is the document format of record for legal, finance, government, and large-enterprise work: the .docx files that get emailed, redlined, and signed are usually where the binding decisions actually live. KnodeGraph parses .docx as the structured OOXML (Office Open XML) format it really is, pulling text, headings, comments, footnotes, embedded images, and track-changes annotations into a typed entity graph. The result is a queryable view across hundreds of contracts, briefs, or reports, auditable back to the source paragraph and carrying the structural styles (headings, titles, block quotes) that Word stored.

Why connect Microsoft Word to KnodeGraph

  • Microsoft 365 had 400+ million paid commercial seats in 2024 (Microsoft FY24 earnings) — Word is the dominant document authoring surface in regulated and enterprise sectors.
  • .docx files are ZIP archives containing OOXML — a well-specified XML schema (ISO/IEC 29500) that KnodeGraph parses directly via python-docx and lxml without conversion to PDF first.
  • The OOXML body is split across `word/document.xml` (main text), `word/comments.xml` (review comments), `word/footnotes.xml` and `word/endnotes.xml` (footnotes/endnotes), and `word/styles.xml` (style definitions); KnodeGraph reads each of these parts rather than only the main body.
  • Track-changes (revision marks) live as `<w:ins>` and `<w:del>` elements in the body XML; KnodeGraph captures both the proposed text and the reviewer attribution, so 'who proposed this clause' becomes a graph edge rather than lost markup.
  • Embedded images live in `word/media/` and are referenced by relationship IDs; KnodeGraph preserves the image references on the parent paragraph node and runs OCR on them when the --ocr flag is set on ingest.
  • Document section breaks (`<w:sectPr>`) preserve structural divisions — front matter, exhibits, schedules — as graph hierarchy rather than flat text.
  • Style information (Heading 1, 2, 3; Body; Block Quote; Title) drives heading-aware extraction — entity context inherits from the section heading, so 'Termination' clauses don't blur with 'Indemnity' clauses.
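To make the file layout described above concrete, here is a minimal sketch that inventories the parts of a .docx archive. It uses only the Python standard library (not the python-docx and lxml stack named above), and the `inventory` and `_toy_docx` helpers are hypothetical names for illustration, not KnodeGraph functions.

```python
import io
import zipfile

# Parts named in the list above; a real file may omit the optional ones.
KNOWN_PARTS = [
    "word/document.xml",
    "word/comments.xml",
    "word/footnotes.xml",
    "word/endnotes.xml",
    "word/styles.xml",
]

def inventory(docx_bytes):
    """Report which known parts are present and list embedded media files."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        names = set(zf.namelist())
    return {
        "parts": {part: part in names for part in KNOWN_PARTS},
        "media": sorted(n for n in names if n.startswith("word/media/")),
    }

def _toy_docx():
    """Build a tiny stand-in archive (not a valid Word file) for the demo."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", "<doc/>")
        zf.writestr("word/comments.xml", "<comments/>")
        zf.writestr("word/media/image1.png", b"\x89PNG")
    return buf.getvalue()

report = inventory(_toy_docx())
```

Running the same function over a real contract shows at a glance whether it carries comments, footnotes, or images worth extracting.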

How it works end-to-end

1. Drop in .docx files

Drag a folder of Word documents into KnodeGraph. Each .docx is a ZIP we unpack on the server; the OOXML inside is parsed structurally. You can ingest 50 .docx files per Pro batch, each up to 10 MB — that's typical for legal pleadings, M&A drafts, and government reports.
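If you assemble batches programmatically, a pre-flight check against the limits quoted above can save a failed upload. The sketch below is illustrative only: the `preflight` function is hypothetical, and reading "10 MB" as decimal megabytes is an assumption.

```python
MAX_FILES_PER_BATCH = 50             # Pro batch limit quoted above
MAX_FILE_BYTES = 10 * 1000 * 1000    # assumption: "10 MB" read as decimal megabytes

def preflight(batch):
    """Return a list of problems for a {filename: size_in_bytes} batch.

    An empty list means the batch fits the quoted limits.
    """
    problems = []
    if len(batch) > MAX_FILES_PER_BATCH:
        problems.append(
            f"{len(batch)} files exceeds the {MAX_FILES_PER_BATCH}-file batch limit"
        )
    for name, size in batch.items():
        if not name.lower().endswith(".docx"):
            problems.append(f"{name}: not a .docx file")
        elif size > MAX_FILE_BYTES:
            problems.append(f"{name}: {size} bytes exceeds the per-file limit")
    return problems

batch = {"msa_v3.docx": 2_400_000, "exhibit_a.doc": 90_000, "filing.docx": 14_000_000}
problems = preflight(batch)
```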

2. Heading-aware extraction

KnodeGraph uses the styles.xml definitions to identify Heading 1/2/3 paragraphs as section anchors. Extraction then runs section by section so entities inherit their parent heading as context. 'Effective Date' under a 'Term' heading becomes a different node from 'Effective Date' under an 'Insurance' heading — exactly the disambiguation legal review needs.
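The idea can be sketched with the standard library alone. This is not KnodeGraph's implementation: `heading_levels` and `sections` are hypothetical helpers, and the sample XML is a hand-written fragment rather than a full Word file. It resolves heading style IDs through styles.xml, then tags each body paragraph with its heading trail and paragraph index.

```python
import re
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

def heading_levels(styles_xml):
    """Map styleId -> heading level using style names such as 'heading 2'."""
    levels = {}
    for style in ET.fromstring(styles_xml).iterfind("w:style", NS):
        name = style.find("w:name", NS)
        value = name.get(f"{{{W}}}val", "") if name is not None else ""
        match = re.fullmatch(r"heading (\d)", value, re.IGNORECASE)
        if match:
            levels[style.get(f"{{{W}}}styleId")] = int(match.group(1))
    return levels

def sections(document_xml, levels):
    """Return (heading_path, paragraph_index, text) for each non-heading paragraph."""
    path, rows = [], []
    body = ET.fromstring(document_xml).find("w:body", NS)
    for idx, para in enumerate(body.iterfind("w:p", NS)):
        text = "".join(t.text or "" for t in para.iter(f"{{{W}}}t"))
        style = para.find("w:pPr/w:pStyle", NS)
        level = levels.get(style.get(f"{{{W}}}val")) if style is not None else None
        if level:
            path = path[: level - 1] + [text]   # replace the trail at this depth
        elif text:
            rows.append((tuple(path), idx, text))
    return rows

STYLES = (f'<w:styles xmlns:w="{W}"><w:style w:type="paragraph" w:styleId="Heading1">'
          '<w:name w:val="heading 1"/></w:style></w:styles>')
DOC = (f'<w:document xmlns:w="{W}"><w:body>'
       '<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>Term</w:t></w:r></w:p>'
       '<w:p><w:r><w:t>The Effective Date is 1 May.</w:t></w:r></w:p>'
       '<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>Insurance</w:t></w:r></w:p>'
       '<w:p><w:r><w:t>Cover starts on the Effective Date.</w:t></w:r></w:p>'
       '</w:body></w:document>')

rows = sections(DOC, heading_levels(STYLES))
```

Both paragraphs mention the Effective Date, but each row carries a different heading path, which is the context an extractor would need to keep the two apart.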

3. Pick a domain template

Use 'Contract & Agreement' (entities: party, defined_term, effective_date, governing_law, obligation, condition; relations: indemnifies, terminates_on, governed_by) for legal. Use 'Brief & Filing' (entities: court, party, statute_cited, holding, fact, claim) for litigation. Use 'Report & Memo' for general business prose. Templates encode each domain's vocabulary, so extraction produces the entity and relation types that domain expects.
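One way to picture a template is as a closed vocabulary that candidate triples are checked against. The sketch below is an assumption about shape, not KnodeGraph's actual template format; the `Template` class and its `allows` method are hypothetical, and only the entity and relation names come from the text above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Template:
    """Hypothetical template: a name plus the allowed entity and relation types."""
    name: str
    entities: frozenset
    relations: frozenset

    def allows(self, subject_type, relation, object_type):
        """True if both endpoint types and the relation are in the vocabulary."""
        return (
            subject_type in self.entities
            and relation in self.relations
            and object_type in self.entities
        )

CONTRACT = Template(
    name="Contract & Agreement",
    entities=frozenset({
        "party", "defined_term", "effective_date",
        "governing_law", "obligation", "condition",
    }),
    relations=frozenset({"indemnifies", "terminates_on", "governed_by"}),
)

ok = CONTRACT.allows("party", "indemnifies", "party")
rejected = CONTRACT.allows("party", "cites", "statute_cited")
```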

4. Review with track-changes context

Track-changes annotations are surfaced in the staging UI alongside the underlying entity, so a clause inserted by reviewer X on date Y carries that provenance into the graph. This is useful for negotiation timelines, for example 'show me every clause Counterparty pushed back on across the last three drafts'.
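For readers curious what that provenance looks like in the file, here is a minimal standard-library sketch that reads `<w:ins>` and `<w:del>` elements with their `w:author` and `w:date` attributes. The `revisions` function is a hypothetical helper and the XML is a hand-written fragment, not output from a real negotiation draft.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Inserted text sits in <w:t>; deleted text sits in <w:delText>.
KINDS = {
    f"{{{W}}}ins": ("insert", f"{{{W}}}t"),
    f"{{{W}}}del": ("delete", f"{{{W}}}delText"),
}

def revisions(document_xml):
    """Return tracked insertions and deletions in document order."""
    found = []
    for el in ET.fromstring(document_xml).iter():
        if el.tag not in KINDS:
            continue
        kind, text_tag = KINDS[el.tag]
        text = "".join(t.text or "" for t in el.iter(text_tag))
        if text:  # skip revision marks that carry no text (e.g. paragraph marks)
            found.append({
                "kind": kind,
                "author": el.get(f"{{{W}}}author"),
                "date": el.get(f"{{{W}}}date"),
                "text": text,
            })
    return found

DOC = (f'<w:document xmlns:w="{W}"><w:body><w:p>'
       '<w:r><w:t>Seller shall indemnify Buyer </w:t></w:r>'
       '<w:del w:id="1" w:author="Reviewer X" w:date="2024-03-01T10:00:00Z">'
       '<w:r><w:delText>fully</w:delText></w:r></w:del>'
       '<w:ins w:id="2" w:author="Reviewer X" w:date="2024-03-01T10:00:00Z">'
       '<w:r><w:t>up to the Cap</w:t></w:r></w:ins>'
       '</w:p></w:body></w:document>')

revs = revisions(DOC)
```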

5. Footnotes and comments as first-class nodes

Footnotes (often where citations live in legal writing) become 'cited_by' edges from the body paragraph to the cited source. Comments (review-pane annotations) become 'commented_on' edges with the author preserved. Both are queryable independently.
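A rough sketch of how such edges could be derived: body paragraphs hold `<w:footnoteReference>` and `<w:commentReference>` markers whose `w:id` points into footnotes.xml and comments.xml. The helper names and the edge dictionary shape below are hypothetical, and the XML fragments are hand-written for illustration.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

def q(local):
    """Qualify a local name with the WordprocessingML namespace."""
    return f"{{{W}}}{local}"

def load_notes(part_xml, tag):
    """Map note id -> author (None for footnotes) and text."""
    notes = {}
    for el in ET.fromstring(part_xml).iterfind(f"w:{tag}", NS):
        text = "".join(t.text or "" for t in el.iter(q("t")))
        notes[el.get(q("id"))] = {"author": el.get(q("author")), "text": text}
    return notes

def note_edges(document_xml, notes, ref_tag, label):
    """Emit one edge per reference marker, keyed by paragraph index."""
    body = ET.fromstring(document_xml).find("w:body", NS)
    edges = []
    for idx, para in enumerate(body.iterfind("w:p", NS)):
        for ref in para.iter(q(ref_tag)):
            note = notes.get(ref.get(q("id")))
            if note is not None:
                edges.append({"paragraph": idx, "edge": label, **note})
    return edges

DOC = (f'<w:document xmlns:w="{W}"><w:body><w:p>'
       '<w:r><w:t>Seller shall indemnify Buyer.</w:t></w:r>'
       '<w:r><w:footnoteReference w:id="2"/></w:r>'
       '<w:r><w:commentReference w:id="0"/></w:r>'
       '</w:p></w:body></w:document>')
FOOTNOTES = (f'<w:footnotes xmlns:w="{W}"><w:footnote w:id="2">'
             '<w:p><w:r><w:t>See Section 9.1.</w:t></w:r></w:p>'
             '</w:footnote></w:footnotes>')
COMMENTS = (f'<w:comments xmlns:w="{W}"><w:comment w:id="0" w:author="Reviewer X">'
            '<w:p><w:r><w:t>Cap this liability?</w:t></w:r></w:p>'
            '</w:comment></w:comments>')

edges = (
    note_edges(DOC, load_notes(FOOTNOTES, "footnote"), "footnoteReference", "cited_by")
    + note_edges(DOC, load_notes(COMMENTS, "comment"), "commentReference", "commented_on")
)
```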

6. Trace back to the source paragraph

Every extracted entity links back to the .docx file plus a paragraph index. Click any node and the staging viewer scrolls to the highlighted source paragraph — no jumping between Word and the graph.

Why KnodeGraph is a good fit

  • Native .docx parsing means no fidelity loss from converting through PDF first — track-changes, comments, and styles all survive.
  • Heading-aware extraction is meaningful for long structured documents where 'Section 4.2(a)' context matters.
  • Comments and track-changes preserved with their author and timestamp — negotiation history becomes a graph rather than buried markup.
  • Templates handle legal, medical, and financial vocabularies that generic NLP tools mangle (defined terms, statute citations, drug names, ticker symbols).
  • Self-hosted plan keeps privileged or commercially sensitive .docx content (M&A drafts, regulatory filings) inside your VPC — particularly important for outside-counsel and i-banking workflows.
  • Multi-language: Word documents in Arabic, Mandarin, German, Spanish, or Cyrillic-script languages are handled the same way as English ones, which is useful for international firms.

Supported formats

  • Modern Word documents (.docx — OOXML, ISO/IEC 29500)
  • Legacy Word documents (.doc) — not natively supported yet; convert to .docx first in Word or via online converters (native LibreOffice-headless ingest is on the roadmap)
  • Word-exported PDF (text-based — ingested via the PDF integration; loses some track-changes/comments fidelity vs native .docx)
  • Word documents with embedded images (image references preserved; OCR runs if --ocr flag is set on ingest)
  • Word documents with footnotes and endnotes (parsed as structural sub-nodes of their parent paragraph)
  • Word documents with comments and track-changes (parsed with author and timestamp metadata preserved)
  • Word documents with section breaks (`sectPr`) — section hierarchy preserved as graph structure

Limitations to know up front

  • .doc legacy format is not natively supported yet — convert to .docx first (Word's File → Save As, or any online .doc-to-.docx converter). Native LibreOffice-headless ingest is on the roadmap, but today the worker container does not ship with LibreOffice.
  • Embedded Excel objects (linked spreadsheets via OLE) capture as image placeholders; export the embedded data as a separate .xlsx and ingest alongside if numeric data matters.
  • Word macros (.docm files with VBA code) are ignored — KnodeGraph reads body text, not executable macro source. If a macro generates document content at open time, you'll need to run the macro first and save a static .docx for ingest.
  • Equations (Office Math, OOXML `m:oMath`) capture as text; complex mathematical notation may render imperfectly. For heavy mathematical content, ingest the rendered PDF instead.
  • Custom XML parts and content controls (used by some legal-document automation systems like HotDocs and Contract Express) read as raw XML — extraction works on the body text but doesn't interpret the automation logic.
  • Password-protected .docx files must be unlocked before ingest.

Frequently Asked Questions

How do you handle track-changes in negotiation drafts?

Track-changes annotations live as `<w:ins>` (insertions) and `<w:del>` (deletions) elements in the OOXML body, each with an author attribute and a timestamp. KnodeGraph captures both the original and the proposed text, plus the reviewer attribution and date. In the graph this becomes 'Reviewer X proposed clause Y on date Z' — queryable across multiple drafts of the same contract. Particularly useful for M&A drafts where you want to see every change Counterparty's lawyers made over a 6-week negotiation.

What about defined terms in contracts?

Defined terms (the capitalised words usually introduced as 'X means …' or 'X shall have the meaning set forth in …') extract as 'defined_term' entities under the Contract template. The definition itself becomes the entity description; subsequent uses of the term link back to the definition node. So you can answer 'show me every clause that uses the defined term Material Adverse Effect' across a corpus of 100 contracts in one query, an exercise that's traditionally a Ctrl+F nightmare in Word.
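As a rough illustration of the pattern (KnodeGraph's extraction is AI-driven under the template, so this regex heuristic is a deliberately simplified stand-in), the sketch below finds quoted terms introduced with 'means' or 'shall have the meaning' and then lists the other clauses that use them. The function names and the clause dictionary are hypothetical.

```python
import re

# Quoted, capitalised term followed by a defining phrase; straight or curly quotes.
DEFINITION = re.compile(
    r'[“"]([A-Z][^”"]+)[”"]\s+(?:means|shall have the meaning)\s+(.+?)(?:\.|$)'
)

def defined_terms(clauses):
    """Map each defined term to (defining_clause_id, definition_text)."""
    terms = {}
    for clause_id, text in clauses.items():
        for term, definition in DEFINITION.findall(text):
            terms[term] = (clause_id, definition)
    return terms

def uses(term, clauses, defined_in):
    """List clause ids that mention the term, excluding the defining clause."""
    return [
        clause_id
        for clause_id, text in clauses.items()
        if clause_id != defined_in and term in text
    ]

clauses = {
    "1.1": '"Material Adverse Effect" means any event that materially impairs the Business.',
    "7.2": "Buyer may terminate if a Material Adverse Effect occurs.",
    "9.1": "This Agreement is governed by the laws of New York.",
}

terms = defined_terms(clauses)
mae_uses = uses("Material Adverse Effect", clauses, terms["Material Adverse Effect"][0])
```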

Can it handle very long documents — say a 500-page regulatory filing?

Yes, with care. The 10 MB-per-file limit handles most regulatory filings as native .docx. For very long documents, the heading-aware extraction is what makes it tractable — KnodeGraph processes section by section using the Heading 1/2/3 anchors so the long structure becomes the graph structure rather than a flat blob. A 500-page filing with proper styling typically yields 300-800 entities and processes in 8-12 minutes. For files over 10 MB or with no heading styles, split by section first.

Does it preserve the formatting and styles?

KnodeGraph captures the structural styles (Heading 1, 2, 3; Title; Block Quote) as graph metadata because they carry semantic meaning. Inline formatting (bold, italic, font, colour) is generally not preserved: KnodeGraph is a knowledge graph, not a document viewer. If you need formatted output back, the original .docx is your source of truth; the graph is the queryable index over it. For lightly styled documents (where authors don't actually use Heading styles consistently), run a one-time normalisation pass in Word that applies the built-in Heading styles to section titles before ingest.

What about confidential or privileged Word documents?

Privileged legal work, M&A drafts, regulatory filings under embargo — none of those should go through hosted SaaS. The self-hosted plan deploys the entire KnodeGraph stack inside your own VPC or on-prem environment with your own Anthropic API key. The .docx never leaves your perimeter, and Anthropic operates under enterprise data-handling commitments when you bring your own key. Outside-counsel firms and in-house legal teams use this configuration; talk to us about the BAA-equivalent terms if your engagement requires them.

Will KnodeGraph edit my Word documents?

No. KnodeGraph is strictly read-only against your .docx files; it never writes back. The graph is built in KnodeGraph's own database (PostgreSQL + FalkorDB), with provenance back to the source paragraph. If you edit the original .docx in Word, the staged graph still reflects the version you ingested — re-ingest to refresh. There's no risk of KnodeGraph 'mangling' the source documents because it never opens them for write.

Connect Microsoft Word to KnodeGraph

Start free with 3 graphs and 100 nodes. Upgrade to Pro for AI extraction, unlimited graphs, and 50K nodes.
