Build a Knowledge Graph from Your Confluence Spaces
Confluence is where mid-to-large teams write things down properly: runbooks, RFCs, post-mortems, onboarding docs, project rooms. The catch: by year three of any active workspace, finding what's still relevant inside thousands of pages across dozens of spaces is its own full-time job. KnodeGraph ingests Confluence space exports (HTML or XML), walks the page hierarchy via page IDs, picks up the attachments bundled in the export, and extracts a typed graph of the people, projects, decisions, and dependencies actually inside the prose. The CQL search you've been writing becomes a graph traversal you can do in the browser.
Why connect Confluence to KnodeGraph
- Atlassian disclosed 75,000+ paying Confluence customers and tens of millions of users in their FY24 results — for any team on Jira, Confluence is usually the de-facto knowledge base.
- Confluence Cloud's REST API v2 (`/wiki/api/v2/`) exposes pages, spaces, attachments, comments, and labels with cursor-based pagination — a polling integration is in development.
- Today the supported path is Space Export (Space Settings → Content Tools → Export) which produces an HTML or XML archive containing all pages, attachments, and the space hierarchy.
- Each Confluence page has a stable numeric page ID — KnodeGraph preserves these as node attributes so deep links back to the original page survive across re-ingests.
- Atlassian Confluence Query Language (CQL) is powerful but unfamiliar — KnodeGraph's graph traversal answers most CQL-shaped questions visually, no query syntax to learn.
- Attachments (PDFs, .docx, images) link off the parent page in the export — KnodeGraph can ingest them inline so a runbook PDF and the page that references it share a graph.
- Atlassian Connect apps (the marketplace ecosystem) often add their own page metadata; KnodeGraph reads the standard fields and ignores app-specific extensions that vary by tenant.
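The cursor-based pagination mentioned in the list above can be sketched as follows. This is a minimal illustration, not KnodeGraph's integration, which is described as still in development. It assumes the v2 list response carries a `results` array plus a `_links.next` relative URL while more pages remain. `fetch_json` is a hypothetical callable you supply (for example a thin wrapper around an authenticated HTTP GET), so the loop can be exercised without network access.

```python
from typing import Callable, Iterator, Optional


def iter_pages(
    fetch_json: Callable[[str], dict],
    first_path: str = "/wiki/api/v2/pages?limit=100",
) -> Iterator[dict]:
    """Yield every item from a cursor-paginated listing.

    fetch_json is a caller-supplied (hypothetical) function that takes a
    relative URL and returns the decoded JSON body as a dict.
    """
    path: Optional[str] = first_path
    while path:
        body = fetch_json(path)
        # Assumed response shape: {"results": [...], "_links": {"next": "..."}}
        yield from body.get("results", [])
        # A missing "next" link means the last page has been reached.
        path = body.get("_links", {}).get("next")


# Offline demonstration with canned responses instead of real HTTP calls.
_FAKE = {
    "/wiki/api/v2/pages?limit=100": {
        "results": [{"id": "101"}, {"id": "102"}],
        "_links": {"next": "/wiki/api/v2/pages?cursor=abc"},
    },
    "/wiki/api/v2/pages?cursor=abc": {"results": [{"id": "103"}], "_links": {}},
}

page_ids = [p["id"] for p in iter_pages(_FAKE.__getitem__)]
```

Injecting the fetch function keeps authentication and retry policy out of the pagination loop, which is the part a polling integration would repeat on every run.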
How it works end-to-end
1. Export the relevant Confluence spaces
In Confluence Cloud: Space Settings → Content Tools → Export → HTML (recommended) or XML (full fidelity, larger file). For Confluence Server / Data Center the path is Space Tools → Content Tools → Export. You can include or exclude attachments, comments, and the space hierarchy. To cover several spaces, run one export per space; the export tool is per-space by design.
2. Upload the space archive
Drag the resulting ZIP into KnodeGraph. We walk the directory tree, parse each page's HTML, preserve the page ID and parent-page-ID hierarchy, and queue any attachments referenced by the pages for separate ingest. Page labels, last-modified timestamps, and author metadata are extracted as node attributes.
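To make the directory walk concrete, here is a minimal sketch of scanning an unzipped HTML export for pages and their IDs. It is not KnodeGraph's parser. It assumes page files are named with a trailing `_<numeric id>.html`, which may not match your export, and the names `TitleParser`, `PAGE_FILE`, and `scan_export` are hypothetical.

```python
import re
import tempfile
from html.parser import HTMLParser
from pathlib import Path


class TitleParser(HTMLParser):
    """Collect the text found inside <title> elements."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


# Assumption: page files end in "_<numeric id>.html"; adjust for your export.
PAGE_FILE = re.compile(r"_(\d+)\.html$")


def scan_export(root: Path) -> dict:
    """Map page ID -> title for every page-like HTML file under root."""
    pages = {}
    for path in sorted(root.rglob("*.html")):
        match = PAGE_FILE.search(path.name)
        if not match:
            continue  # e.g. index.html, which has no page ID in its name
        parser = TitleParser()
        parser.feed(path.read_text(encoding="utf-8"))
        pages[match.group(1)] = parser.title.strip()
    return pages


# Self-contained demo against a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    space = Path(tmp) / "ENG"
    space.mkdir()
    (space / "index.html").write_text("<title>Index</title>", encoding="utf-8")
    (space / "Deploy-Runbook_4711.html").write_text(
        "<html><head><title>ENG : Deploy Runbook</title></head></html>",
        encoding="utf-8",
    )
    found = scan_export(Path(tmp))
```

Keying the result by page ID rather than by file name is what lets later steps link nodes back to the source page.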
3. Pick a Confluence-aware template
Use 'RFC & Decision Log' (entities: rfc, decision, alternative, owner, deadline; relations: supersedes, blocks, depends_on, decided_by) for engineering spaces. Use 'Runbook & Operations' (entities: service, runbook, alert, on_call_rotation, incident) for ops spaces. Use 'Project Room' for a generic mixed-content space.
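One way to picture a template is as a small schema that constrains which extracted triples are accepted. The sketch below encodes the 'RFC & Decision Log' entity and relation names listed above. The `Template` class and its `allows` method are hypothetical and do not describe how KnodeGraph stores templates internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Template:
    name: str
    entities: frozenset
    relations: frozenset

    def allows(self, subject_type: str, relation: str, object_type: str) -> bool:
        """True when both entity types and the relation belong to the template."""
        return (
            subject_type in self.entities
            and object_type in self.entities
            and relation in self.relations
        )


# Entity and relation names are taken from the template description above.
RFC_DECISION_LOG = Template(
    name="RFC & Decision Log",
    entities=frozenset({"rfc", "decision", "alternative", "owner", "deadline"}),
    relations=frozenset({"supersedes", "blocks", "depends_on", "decided_by"}),
)

ok = RFC_DECISION_LOG.allows("decision", "decided_by", "owner")
rejected = RFC_DECISION_LOG.allows("decision", "decided_by", "service")
```

Here `service` is rejected because it belongs to the 'Runbook & Operations' template rather than this one.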
4. Review and curate
Confluence pages tend to be longer than chat or notes — extraction yields more entities per source. The staging UI groups by space and parent-page hierarchy so you can review section by section. Approved entities deduplicate by name across spaces, so a service mentioned in three different runbook spaces becomes one node.
5. Walk and trace
Every node retains a backlink to the original Confluence page ID, which renders as a clickable URL in the graph viewer. Find every page that references a deprecated service, every decision blocked on a person who left, every runbook missing an owner. CQL-shaped questions answered without writing CQL.
6. Refresh per space
Re-export and re-ingest the spaces that change most often (engineering RFCs typically; HR onboarding usually doesn't). KnodeGraph dedupes by page ID so re-imports merge cleanly into the existing graph rather than producing duplicates.
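The merge-on-re-import behaviour can be illustrated with a dictionary keyed by page ID. This is a sketch of the general idea only. The record fields and the choice to keep attributes that are absent from the new export are assumptions, and `merge_by_page_id` is a hypothetical helper rather than a KnodeGraph function.

```python
def merge_by_page_id(existing: dict, incoming: list) -> dict:
    """Return a new index in which incoming pages update entries by page ID."""
    merged = dict(existing)
    for page in incoming:
        current = merged.get(page["page_id"], {})
        # Fields in the new export win; fields it omits are carried over.
        merged[page["page_id"]] = {**current, **page}
    return merged


graph = {
    "4711": {"page_id": "4711", "title": "Deploy Runbook", "labels": ["ops"]},
}
re_export = [
    {"page_id": "4711", "title": "Deploy Runbook v2"},
    {"page_id": "4712", "title": "Rollback Runbook"},
]
graph = merge_by_page_id(graph, re_export)
```

Because the key is the stable page ID and not the title, a renamed page updates its existing node instead of creating a second one.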
Why KnodeGraph is a good fit
- Confluence's tree structure is great for browsing; KnodeGraph adds the graph structure for cross-space analysis.
- Page IDs preserved as node attributes mean every extracted entity is one click from the source, so provenance never breaks.
- Templates encode RFC, runbook, and project-room conventions so extraction matches how engineering spaces actually get written.
- Attachment ingest pulls PDFs and .docx files into the same project graph as the parent pages, so no second tool is needed for the document side.
- 100+ language support handles multilingual Confluence instances (common in EMEA and APAC orgs) without splitting tooling.
- Self-hosted plan keeps Confluence-internal strategy content inside your perimeter, which is particularly relevant for Confluence Data Center customers with on-prem requirements.
Supported formats
- Confluence Space Export — HTML format (recommended for most use cases)
- Confluence Space Export — XML format (full fidelity, larger files, preserves macros as raw XML)
- Page-level Export PDF (single-page export — ingested via the PDF integration if you only need one page)
- Attachments referenced from pages (.pdf, .docx, .xlsx — auto-ingested if 'include attachments' was selected at export time)
- Confluence labels (parsed as node tags rather than narrative text)
- Page hierarchy via page IDs and parent-page-IDs (preserved as graph structure)
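As an illustration of how page IDs and parent-page IDs become graph structure, the sketch below builds a parent-to-children index and walks a page's ancestor chain. The record layout (`page_id`, `parent_id`, `title`) and the sample pages are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical page records as they might look after parsing an export.
pages = [
    {"page_id": "1", "parent_id": None, "title": "Engineering"},
    {"page_id": "2", "parent_id": "1", "title": "RFCs"},
    {"page_id": "3", "parent_id": "2", "title": "RFC-12 Queue migration"},
    {"page_id": "4", "parent_id": "1", "title": "Runbooks"},
]


def build_children(pages: list) -> dict:
    """Index child page IDs under their parent page ID."""
    children = defaultdict(list)
    for page in pages:
        if page["parent_id"] is not None:
            children[page["parent_id"]].append(page["page_id"])
    return dict(children)


def ancestors(page_id: str, pages: list) -> list:
    """Return ancestor page IDs from nearest parent up to the root."""
    parent_of = {p["page_id"]: p["parent_id"] for p in pages}
    chain = []
    current = parent_of.get(page_id)
    # The membership check guards against a malformed, cyclic hierarchy.
    while current is not None and current not in chain:
        chain.append(current)
        current = parent_of.get(current)
    return chain


children = build_children(pages)
path = ancestors("3", pages)
```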
Limitations to know up front
- No live REST API integration today — the workflow is space-export-and-ingest. A polling integration against REST API v2 with the attachments endpoint is on the next-quarter roadmap.
- Confluence macros (`{toc}`, `{children}`, `{include}`, `{jira}`) render at view time, so the HTML export captures their rendered output rather than the macro source (the XML export keeps macros as raw XML). In most cases this is fine; Jira-issue-list macros lose their live link.
- Atlassian Connect app extensions (third-party marketplace apps that add their own page widgets) vary by tenant; KnodeGraph reads the standard fields and skips unrecognised app payloads.
- Restricted pages: anything you can't see in Confluence won't be in your space export. The export honours Confluence permissions on the actor who triggered it.
- Comments and inline comments are flattened into the page text and tagged with author; the order of replies within a thread is kept, but the visual nesting is lost.
- Page versions: only the latest version exports. Edit history isn't part of the standard space export and isn't ingested.
Frequently Asked Questions
Does KnodeGraph work with Confluence Cloud, Server, and Data Center?
Yes to all three. The space-export format is consistent across Cloud, Server, and Data Center, so the ingest path works identically. The only thing that differs is the menu path in the Confluence UI (Space Settings on Cloud, Space Tools on Server/DC). When live REST API integration ships, it'll target Cloud first (REST API v2), with Server/Data Center support following — the on-prem REST API has slightly different auth flows that need their own handling.
How do I handle very large Confluence instances — say 10,000+ pages?
Export per-space rather than trying to ingest everything at once. A 10K-page Confluence instance usually breaks down into 30-50 spaces by team or project; ingest the 5-10 that hold the strategic content (engineering RFCs, product decisions, ops runbooks) and skip the rest (HR, vacation calendars, social). On Pro tier, a single 1K-page space takes 15-25 minutes end to end. For multi-space ingests, use the KnodeGraph API to script the upload.
What about Jira integration — Confluence pages often link to Jira issues?
Jira-issue-list macros and inline Jira links capture as text plus URL in the export; KnodeGraph extracts the Jira issue keys (PROJ-1234) as identifiers and creates 'jira_issue' nodes referenced by the page. They're not live-synced with Jira state today, but they're queryable as a graph (every page that references a deprecated Jira project, every decision blocked on a Jira issue assigned to a former employee). Live Jira-API sync is a separate roadmap item.
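Extracting issue keys of the `PROJ-1234` shape can be sketched with a regular expression. This is an illustration under assumptions, not KnodeGraph's extractor: the pattern, the `extract_issue_keys` helper, and the optional `known_projects` filter are all hypothetical.

```python
import re

# Uppercase project key (two or more characters), a hyphen, then digits.
JIRA_KEY = re.compile(r"\b([A-Z][A-Z0-9]+)-(\d+)\b")


def extract_issue_keys(text: str, known_projects=None) -> list:
    """Return unique issue keys in order of first appearance.

    When known_projects is given, keys from other prefixes are dropped.
    """
    seen = []
    for match in JIRA_KEY.finditer(text):
        project, key = match.group(1), match.group(0)
        if known_projects is not None and project not in known_projects:
            continue
        if key not in seen:
            seen.append(key)
    return seen


sample = (
    "Decision blocked on PROJ-1234 and OPS-7. "
    "See https://example.atlassian.net/browse/PROJ-1234 (file is UTF-8)."
)
all_matches = extract_issue_keys(sample)
filtered = extract_issue_keys(sample, known_projects={"PROJ", "OPS"})
```

The unfiltered call also returns `UTF-8`, since that token has the same shape as an issue key. Checking candidates against a list of real project keys is one way to avoid creating spurious 'jira_issue' nodes.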
Can I keep extraction inside our Atlassian-cloud perimeter?
For SOC 2 and similar compliance regimes, use the self-hosted KnodeGraph plan deployed inside the same VPC where your Confluence Data Center or Atlassian Cloud egress flows. The whole stack runs locally with your own Anthropic API key, so ingestion, storage, and the graph itself stay inside your perimeter; the only outbound traffic is the extraction requests sent to the model API under your own account. Hosted SaaS is fine for general engineering documentation but not the right place for confidential strategy or M&A spaces.
How does this compare to Atlassian Intelligence?
Atlassian Intelligence (the AI features baked into Confluence Cloud) does in-page summarisation and natural-language search — it's an in-product assistant. KnodeGraph builds a structured cross-space graph of entities and relationships, so it answers questions Atlassian Intelligence can't, like 'show every decision in the past two quarters that depended on a service marked deprecated in our runbook space'. They're complementary; Atlassian Intelligence helps you read pages, KnodeGraph helps you analyse the corpus.
Connect Confluence to KnodeGraph
Start free with 3 graphs and 100 nodes. Upgrade to Pro for AI extraction, unlimited graphs, and 50K nodes.