Knowledge Graph Schema Design: A Practical Guide
Graph databases let you skip schema design at first. That gets you to a prototype in an afternoon and a 50-million-edge unqueryable graph in a year. This tutorial gives you the schema discipline that prevents that, with concrete patterns for Neo4j 5 and FalkorDB.
Step 1: Inventory your real-world entities
Before you write Cypher, list the kinds of things you want to track. Aim for 5-12 node types. More than 15 and you are probably modelling adjectives as nouns.
Step 2: Name relationships in active voice
The Neo4j convention is UPPER_SNAKE_CASE, active voice, source-to-target. `(:Author)-[:AUTHORED]->(:Paper)` reads correctly; `(:Author)-[:WRITTEN_BY]->(:Paper)` is backwards, because the arrow now says the author was written by the paper.
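A naming rule like this is easy to check mechanically. The sketch below is one possible linter, not something Neo4j provides: the function name `check_relationship_name` is hypothetical, and treating a trailing `_BY` as a sign of passive voice is a rough heuristic assumed for illustration.

```python
import re

# UPPER_SNAKE_CASE: uppercase words (digits allowed after the first letter)
# joined by single underscores.
UPPER_SNAKE = re.compile(r"^[A-Z][A-Z0-9]*(_[A-Z0-9]+)*$")


def check_relationship_name(name: str) -> list[str]:
    """Return style problems for a relationship type name (empty list if none)."""
    problems = []
    if not UPPER_SNAKE.match(name):
        problems.append("not UPPER_SNAKE_CASE")
    # Heuristic only: names ending in _BY are usually passive (WRITTEN_BY, OWNED_BY).
    if name.upper().endswith("_BY"):
        problems.append("looks passive; name it from source to target")
    return problems


for candidate in ["AUTHORED", "WRITTEN_BY", "relatedTo"]:
    print(candidate, check_relationship_name(candidate))
```

Running a check like this in CI against the list of relationship types catches drift before it reaches the database.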
Step 3: Decide what is a property versus what is a node
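If you regularly query 'find all X with property Y', Y has fewer than ~1000 distinct values, and those values are shared across many X, Y should probably be its own node. A paper's topic is a typical candidate: many papers share it, and you will want to traverse through it. A simple per-entity flag such as `status` can stay an indexed property.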
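One way to apply this rule of thumb is to measure it on a sample of your data before deciding. The sketch below counts distinct values for a property. The `promotion_report` name and the sample records are hypothetical, and the extra "values must repeat across records" check is an added assumption rather than part of the rule above.

```python
from collections import Counter


def promotion_report(records, prop, max_distinct=1000):
    """Summarise how a property's values are distributed across sample records."""
    values = [r[prop] for r in records if prop in r]
    counts = Counter(values)
    distinct = len(counts)
    # Assumption: a value must repeat across records to be worth its own node.
    shared = distinct < len(values)
    return {
        "property": prop,
        "distinct_values": distinct,
        "promote_to_node": shared and distinct < max_distinct,
    }


# Hypothetical sample data.
papers = [
    {"title": "A", "topic": "graphs"},
    {"title": "B", "topic": "graphs"},
    {"title": "C", "topic": "databases"},
]
print(promotion_report(papers, "topic"))  # values repeat: candidate node
print(promotion_report(papers, "title"))  # unique per record: keep as property
```

Treat the result as a prompt for discussion, not a verdict; a small sample can make any property look low-cardinality.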
Step 4: Add unique constraints and indexes early
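Constraints and indexes are the difference between a query that returns in 2 ms and one that returns in 4 seconds. In Neo4j, a uniqueness constraint also creates a backing index, so a single statement buys both integrity and fast lookups on the key. Add them on day one.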
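As a rough illustration, the helpers below build Neo4j 5 style statements for a uniqueness constraint and a plain index. The helper names, the naming scheme for constraints, and the `Author.orcid` and `Paper.year` examples are assumptions. FalkorDB manages constraints through its own commands, so treat the generated strings as Neo4j-specific.

```python
def unique_constraint(label: str, prop: str) -> str:
    """Build a Neo4j 5 uniqueness constraint; IF NOT EXISTS makes it safe to re-run."""
    name = f"{label.lower()}_{prop}_unique"
    return (
        f"CREATE CONSTRAINT {name} IF NOT EXISTS "
        f"FOR (n:{label}) REQUIRE n.{prop} IS UNIQUE"
    )


def property_index(label: str, prop: str) -> str:
    """Build a Neo4j 5 index statement for a property you filter on often."""
    name = f"{label.lower()}_{prop}_index"
    return f"CREATE INDEX {name} IF NOT EXISTS FOR (n:{label}) ON (n.{prop})"


# Labels and properties are interpolated directly, so they must come from
# your own schema definition, never from user input.
for statement in [unique_constraint("Author", "orcid"), property_index("Paper", "year")]:
    print(statement)
```

Keeping the statements idempotent means the same list can run on every deploy.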
Step 5: Plan for schema evolution
Your schema will change. Store a `schema_version` property on every node, prefer additive changes, and keep migrations in version control.
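A minimal sketch of what version-controlled migrations could look like follows. The `Migration` class, the `pending` helper, and both example migrations are hypothetical; executing the Cypher against a database is left out, so only the ordering logic is shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Migration:
    version: int
    description: str
    cypher: str


# Hypothetical, additive migrations kept in the repository.
MIGRATIONS = [
    Migration(
        1,
        "stamp existing papers with a schema version",
        "MATCH (p:Paper) WHERE p.schema_version IS NULL SET p.schema_version = 1",
    ),
    Migration(
        2,
        "add a language property with a default",
        "MATCH (p:Paper) WHERE p.schema_version = 1 "
        "SET p.language = 'unknown', p.schema_version = 2",
    ),
]


def pending(migrations, current_version):
    """Return migrations newer than the current version, in ascending order."""
    newer = [m for m in migrations if m.version > current_version]
    return sorted(newer, key=lambda m: m.version)


for m in pending(MIGRATIONS, current_version=1):
    print(m.version, m.description)
```

Because each migration filters on `schema_version`, re-running it after a partial failure only touches nodes that have not been upgraded yet.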
Step 6: Document the schema in code
Treat your schema like an API. Write down node types, relationship types, properties, and allowed values in a single file that your application reads at startup.
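One possible shape for such a file is a plain Python module, sketched below. The `SCHEMA` layout, the validator names, and the Author/Paper vocabulary are all assumptions chosen for illustration; a YAML or JSON file loaded at startup would serve the same purpose.

```python
SCHEMA = {
    "nodes": {
        "Author": {"name": str, "orcid": str},
        # A set means "one of these allowed values"; a type means "any value of that type".
        "Paper": {"title": str, "year": int, "status": {"draft", "published"}},
    },
    "relationships": {
        # relationship type -> (source label, target label)
        "AUTHORED": ("Author", "Paper"),
        "CITES": ("Paper", "Paper"),
    },
}


def validate_node(label, props, schema=SCHEMA):
    """Return a list of schema violations for one node (empty if it conforms)."""
    spec = schema["nodes"].get(label)
    if spec is None:
        return [f"unknown label: {label}"]
    errors = []
    for key, value in props.items():
        rule = spec.get(key)
        if rule is None:
            errors.append(f"unknown property: {label}.{key}")
        elif isinstance(rule, set):
            if value not in rule:
                errors.append(f"{label}.{key}: {value!r} not in {sorted(rule)}")
        elif not isinstance(value, rule):
            errors.append(f"{label}.{key}: expected {rule.__name__}")
    return errors


def validate_relationship(rel_type, source_label, target_label, schema=SCHEMA):
    """Check that a relationship type exists and connects the declared labels."""
    expected = schema["relationships"].get(rel_type)
    if expected is None:
        return [f"unknown relationship type: {rel_type}"]
    if expected != (source_label, target_label):
        return [f"{rel_type} must go from {expected[0]} to {expected[1]}"]
    return []


print(validate_node("Paper", {"title": "Knowledge Graphs", "status": "retracted"}))
print(validate_relationship("AUTHORED", "Paper", "Author"))
```

Calling these validators in the write path keeps the documented schema and the stored graph from drifting apart.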
Common pitfalls
- Modelling adjectives as labels. `:ActiveCustomer` should be a `:Customer` with `status: 'active'`.
- Generic `:RELATED_TO` edges. They make the graph one big mush.
- Storing arrays of strings as properties when they should be edges.
- Putting weight or count on nodes when it belongs on edges.
- Skipping uniqueness constraints. The day a duplicate slips in is the day every query returning that node returns it twice.
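To make the array pitfall concrete, here is a small sketch of turning a list-of-strings property into edges. The `tags` property, the `:Tag` label, and the `TAGGED_WITH` relationship are hypothetical names. The Cypher string shows roughly how the same conversion could be written in the database, and the Python function mirrors it in memory.

```python
# Cypher equivalent, shown for reference only (not executed here).
TAGS_TO_EDGES_CYPHER = """
MATCH (p:Paper) WHERE p.tags IS NOT NULL
UNWIND p.tags AS tag
MERGE (t:Tag {name: toLower(trim(tag))})
MERGE (p)-[:TAGGED_WITH]->(t)
"""


def tags_to_edges(node_id, tags, rel_type="TAGGED_WITH"):
    """Turn an array-of-strings property into (source, relationship, target) triples."""
    seen = set()
    edges = []
    for tag in tags:
        # Normalise so 'Graphs' and 'graphs ' end up on the same Tag node.
        key = tag.strip().lower()
        if key and key not in seen:
            seen.add(key)
            edges.append((node_id, rel_type, key))
    return edges


print(tags_to_edges("p1", ["Graphs", "graphs ", "NLP", ""]))
```

Once tags are nodes, "papers sharing a tag with this one" becomes a two-hop traversal instead of a scan over string arrays.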
Frequently Asked Questions
Should I use RDF or a property graph?
For most applications, property graphs win on developer ergonomics and query performance. RDF wins when you need W3C-standard interop with linked-data publishers.
How many node types is too many?
Past 30 distinct labels, query writing becomes painful and the planner starts making bad choices. Look for sub-types you can collapse into a property.
Can I have multiple labels on one node?
Yes. Use it when a node genuinely belongs to more than one type, such as a `:Person` who is both an `:Author` and a `:Reviewer`. Do not use it as a substitute for properties.
How do I model time-varying relationships?
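Add `valid_from` and `valid_to` properties on the edge. For relationships that change frequently, promote the relationship to a node of its own, for example an `:Employment` node sitting between a `:Person` and a `:Company`.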
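A short sketch of an "as of" lookup over edges with validity properties follows. Two conventions here are assumptions rather than requirements: the interval is half-open (`valid_from` inclusive, `valid_to` exclusive), and `valid_to = None` marks a relationship that is still current. The employment records are hypothetical.

```python
from datetime import date


def valid_at(edges, when):
    """Return edges whose [valid_from, valid_to) interval contains `when`."""
    return [
        e for e in edges
        if e["valid_from"] <= when and (e["valid_to"] is None or when < e["valid_to"])
    ]


# Hypothetical edge records; valid_to=None means "still current".
employment = [
    {"person": "ada", "company": "Acme",
     "valid_from": date(2018, 1, 1), "valid_to": date(2021, 6, 1)},
    {"person": "ada", "company": "Initech",
     "valid_from": date(2021, 6, 1), "valid_to": None},
]

print([e["company"] for e in valid_at(employment, date(2020, 3, 15))])
```

The half-open interval means a handover date belongs to exactly one edge, so a point-in-time query never returns both the old and the new relationship.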
Should I denormalize for performance?
Generally no. The exception is precomputing degree counts or PageRank scores onto nodes for ranking queries.
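As an illustration of the degree-count exception, the sketch below computes in-degree from an edge list, which is the number you would store on each node. The `citation_count` property and the `CITES` relationship are hypothetical, and the Cypher string is one plausible way to write the same precomputation in the database.

```python
from collections import Counter

# Cypher equivalent, shown for reference only (not executed here).
PRECOMPUTE_CITATIONS_CYPHER = """
MATCH (p:Paper)
OPTIONAL MATCH (p)<-[:CITES]-(citing:Paper)
WITH p, count(citing) AS n
SET p.citation_count = n
"""


def in_degree(edges):
    """Count incoming edges per target node from (source, target) pairs."""
    return Counter(target for _source, target in edges)


# Hypothetical citation edges: (citing paper, cited paper).
cites = [("a", "c"), ("b", "c"), ("c", "d")]
print(in_degree(cites).most_common())
```

A stored count goes stale as soon as edges change, so refresh it on a schedule or in the same transaction that writes the edge.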
Source
Hogan et al. (2021), 'Knowledge Graphs', ACM Computing Surveys vol 54, surveys 30+ production knowledge graphs and explicitly recommends a small typed vocabulary plus property graphs over RDF for industrial scale. The Neo4j Graph Data Modeling guide reinforces the same conclusions. [link]