§A Tree with Tables
There are plenty of reasons why you’d need to model a tree of data in your database.
Maybe you need to allow your users to organise some of their other data into groups, maybe you want to model inherited properties in your data, maybe you are trying to represent folders in your database, or a ton of other reasons.
In this post I’ll explore if there is any benefit in using the
extension for modelling the data you’d normally do with a regular parent/child relationship.
§Modelling a Tree
The usual way to do this is by adding a
parent_id column that references
the same table’s
id column. This way you can still use foreign keys.
Each row is a node in the tree:
With this we can represent a root row (
parent_id IS NULL)
or a child row (
parent_id IS NOT NULL)
Let’s create a tiny amount of test data:
Now we have the following tree:
─┬─1───3─┬─5 │ └─6───9 └─2───4─┬─7 └─8
§Querying a Tree
In Postgres we can query this structure with a recursive CTEs, allowing us to query the entire tree with a single query.
To get a row with id
4 and all of it’s descendants:
What if we need to find both descendants and ancestors?
1WITH RECURSIVE 2 mydescendants AS ( 3 SELECT * FROM mytree WHERE parent_id = 4 -- note that I changed this to not include id=4 itself 4 UNION ALL 5 SELECT mytree.* FROM mytree, mydescendants WHERE mytree.parent_id = mydescendants.id 6 ), 7 myancestors AS ( 8 SELECT * FROM mytree WHERE id = 4 9 UNION ALL 10 SELECT mytree.* FROM mytree, myancestors WHERE mytree.id = myancestors.parent_id 11 ) 12SELECT * FROM myancestors 13UNION ALL 14SELECT * FROM mydescendants; 15-- "id","parent_id" 16-- 4,2 17-- 2,NULL 18-- 7,4 19-- 8,4
We now both have the ancestor
2, the row itself
4, and the decendants
§Using the ltree Extension
Another way to model trees in postgres is to use the
ltree datatype is specifically designed for
representing labels of data stored in a hierarchical tree-like structure
“Labels” in this context is
a sequence of alphanumeric characters and underscores (for example, in C locale the characters A-Za-z0-9_ are allowed)
ltree data type provides us with a way to represent and search
for data stored in a tree-like structure, so maybe we can use this to model our
§Modelling a Tree (Again)
Let’s create some test data. We’ll use the same structure as we did previously.
We’ll use the
id column from before as our labels, but note that we could
use anything that is a valid label here.
Let’s take a look at our new table:
§Querying a Tree (Again)
Let’s query for the same things as earlier.
First, given the node 4 (which is now represented by the path
do we get that row and all of it’s descendants?
This returned the same rows as earlier.
What if we need to find both descendants and ancestors of the label
@> operator checks if the left argument is an ancestor of the right argument.
<@ operator checks if the left argument is a descendant of the right argument.
So the query above says “SELECT all columns FROM myltree WHERE path is a decendant of
2.4 OR path is an ancestor of
These queries are much simpler than before.
§What about Performance?
To detect any kind of performance difference, we’ll need to generate a much larger dataset to test on.
I’ll generate a dataset with 100 root rows, and add 5 child rows for all rows 7 levels deep.
I’m using these two queries to generate our test datasets:
1-- myltree dataset 2BEGIN; 3TRUNCATE TABLE myltree; 4CREATE TEMPORARY SEQUENCE ltreeid; 5INSERT INTO myltree (path) 6WITH RECURSIVE t(path, lvl) AS 7( 8 SELECT text2ltree(nextval('ltreeid')::text), 1 FROM generate_series(1,100)n 9 UNION ALL 10 SELECT ltree_addtext(path, nextval('ltreeid')::text), lvl+1 FROM t, generate_series(1,5)n 11 WHERE lvl < 7 12) 13SELECT path FROM t; 14DROP SEQUENCE ltreeid; 15COMMIT;
which inserted 1,953,100 rows in 32.4 seconds.
1-- mytree dataset 2BEGIN; 3TRUNCATE TABLE mytree; 4CREATE TEMPORARY SEQUENCE treeid; 5INSERT INTO mytree (id, parent_id) 6WITH RECURSIVE t(id, parent_id, lvl) AS 7( 8 SELECT nextval('treeid')::bigint, null::bigint, 1 FROM generate_series(1,100)n 9 UNION ALL 10 SELECT nextval('treeid')::bigint, t.id, lvl+1 FROM t, generate_series(1,5)n 11 WHERE lvl < 7 12) 13SELECT id, parent_id FROM t; 14DROP SEQUENCE treeid; 15COMMIT;
which again inserted 1,953,100 rows, this time in 77.2 seconds.
§Finding a specific row
(NOTE: I’m running these with
max_parallel_workers_per_gather = 0)
Let’s take a look at the last row we inserted, row 1953100:
We can see that both queries return a row where the parent is
so it looks like our datasets are similar.
What the two queries don’t agree on is the time it takes to find a row, 570 ms vs 2 ms.
If you want to take a sidetrack with me and figure out why this is, I’ll dig into indexing over an ltree in this post.
The summary is that when we do
WHERE path ~ '*.1953100', that condition
can’t use our primary key btree index to find the row, and we end up
having to scan the entire table.
Moving forward, I’ll be using a GIST index over
Of course, the ID of our ltree table isn’t just
this isn’t a fair comparison. The ID of the
myltree table is the full path,
so let’s try and use that:
So if we have the full ID, the performance of looking up a row is similar.
ltree allows us to write much simpler queries, but there are also
some additional features from using ltree.
§Example: Count LCA descendants
Let’s say we know the id of two rows:
And we are interested in finding the number of descendants contained by the longest common ancestor of those two rows.
ltree, that’s really easy:
§Example: Find all siblings:
1SELECT * FROM myltree 2WHERE subpath('100.600.3100.15600.65600.190600', 0, -1) @> path 3AND nlevel('100.600.3100.15600.65600.190600') = nlevel(path); 4-- "path" 5-- "100.600.3100.15600.65600.128100" 6-- "100.600.3100.15600.65600.190600" 7-- "100.600.3100.15600.65600.253100" 8-- "100.600.3100.15600.65600.315600" 9-- "100.600.3100.15600.65600.378100"
§Adding a row?
Adding a row to our
mytree table is pretty easy.
We need to know the parent we want to create a child for.
1INSERT INTO mytree (parent_id) VALUES (3);
Inserting into our
myltree table is mostly the same.
However we don’t automatically get a new ID to use for this node. We could make up a value, or we could create a sequence to use each time we create a new row.
Using a sequence is most like the parent/child relationship in
let’s assume we have a sequence available:
1CREATE SEQUENCE myltreeseq;
Inserting a new row then becomes
There is nothing to protect us against inserting a row where the parent no longer exists. If we wanted to ensure that, we’d need to write a trigger that blocks the insert if no parent exists in the table.
§Moving a row?
mytree table, moving a row so that it is parented to another row is trivial:
1UPDATE mytree SET parent_id = 3 WHERE id = 4;
The story is a bit different with an ltree.
Since all of our rows are independant of each other, if we need to move a row, we also need to move all of that rows descendants. Otherwise, their paths won’t change.
This is actually a feature of
ltree, but in our specific scenario, we are not
interested in having a subpath in our path that isn’t also a valid row on it’s own, so we
need to take some care to move everything correctly:
The obvious downside is that we need to update
n rows instead of 1.
For large trees this might get very expensive.
We like having a single integer id for our row.
We also like having a
parent_id with a foreign key to that row, such that
we can cascade deletes and ensure a parent exists on insert.
But what if we also like the features
Could we build a table where we could maintain the relationships between
the rows with our
parent_id column, but also have a
path column to query?
My first idea was a materialized view, but that since we can’t incrementally refresh those in postgres (yet), we’d have to rebuild the entire view every time we wanted to INSERT/UPDATE/DELETE.
My second idea was to write a trigger to maintain the ltree when a row changes. So that’s what I did:
1-- hybrid table: 2CREATE TABLE hybrid ( 3 id bigint PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY, 4 parent_id bigint REFERENCES hybrid(id) ON DELETE CASCADE, 5 path ltree 6); 7-- index on the path ltree 8CREATE INDEX path_gist_idx ON hybrid USING GIST (path gist_ltree_ops(siglen=124)); 9 10-- trigger to update path if parent_id changes 11CREATE FUNCTION trigger_set_path_from_parent() RETURNS TRIGGER AS $$ 12DECLARE 13 parentpath ltree; 14 currentnode ltree; 15BEGIN 16 -- On INSERT, nobody has seen this new row before, so we can safely just insert it 17 -- and construct the path from it's parent 18 IF TG_OP = 'INSERT' THEN 19 IF NEW.parent_id IS NULL THEN 20 NEW.path = text2ltree(NEW.id::text); 21 ELSE 22 SELECT ltree_addtext(path, NEW.id::text) INTO NEW.path FROM hybrid WHERE id = NEW.parent_id; 23 END IF; 24 -- If UPDATE, when a row's parent changes, we also need to also patch the path of all of the children. 25 ELSIF TG_OP = 'UPDATE' AND NEW.parent_id IS DISTINCT FROM OLD.parent_id THEN 26 -- we want to move the paths such that all descedants of NEW.id gets their path patched. 27 currentnode = text2ltree(NEW.id::text); 28 SELECT path into parentpath FROM hybrid WHERE id = NEW.parent_id; 29 IF index(parentpath, currentnode) > -1 THEN 30 RAISE EXCEPTION 'Circular reference. Tried to set parent to % for row %, but % is already in parent path %', 31 NEW.parent_id, NEW.id, NEW.id, parentpath; 32 ELSE 33 NEW.path = parentpath || currentnode; 34 END IF; 35 UPDATE hybrid 36 SET path = NEW.path || subpath(path, nlevel(OLD.path)) 37 WHERE path <@ OLD.path AND id <> OLD.id; 38 END IF; 39 RETURN NEW; 40END; 41$$ LANGUAGE plpgsql; 42 43CREATE TRIGGER set_tree_path 44 BEFORE INSERT OR UPDATE ON hybrid 45 FOR EACH ROW 46 EXECUTE FUNCTION trigger_set_path_from_parent();
Now, let’s see how long it takes to generate the example dataset from earlier.
Note that we now have a trigger that needs to run for each row, and a GIST index
that needs to be maintained for the
1-- hybrid dataset 2BEGIN; 3TRUNCATE TABLE hybrid; 4CREATE TEMPORARY SEQUENCE hybridid; 5INSERT INTO hybrid (id, parent_id) 6WITH RECURSIVE t(id, parent_id, lvl) AS 7( 8 SELECT nextval('hybridid')::bigint, null::bigint, 1 FROM generate_series(1,100)n 9 UNION ALL 10 SELECT nextval('hybridid')::bigint, t.id, lvl+1 FROM t, generate_series(1,5)n 11 WHERE lvl < 7 12) 13SELECT id, parent_id FROM t; 14DROP SEQUENCE hybridid; 15COMMIT; 16-- took: 229.2s
Ok, so it took quite a bit longer to insert those rows. Let’s test out some queries.
1EXPLAIN (VERBOSE, ANALYZE) 2SELECT * FROM hybrid WHERE path <@ '11.311.2811.7811.20311'; 3-- Bitmap Heap Scan on public.hybrid (cost=17.92..750.15 rows=195 width=78) (actual time=0.120..0.486 rows=31 loops=1) 4-- Output: id, parent_id, path 5-- Recheck Cond: (hybrid.path <@ '11.311.2811.7811.20311'::ltree) 6-- Heap Blocks: exact=31 7-- -> Bitmap Index Scan on path_gist_idx (cost=0.00..17.88 rows=195 width=0) (actual time=0.093..0.102 rows=31 loops=1) 8-- Index Cond: (hybrid.path <@ '11.311.2811.7811.20311'::ltree) 9-- Planning Time: 0.243 ms 10-- Execution Time: 0.896 ms
And the recursive CTE way:
1EXPLAIN (VERBOSE, ANALYZE) 2WITH RECURSIVE 3 mydescendants AS ( 4 SELECT * FROM hybrid WHERE id = 20311 5 UNION ALL 6 SELECT hybrid.* FROM hybrid, mydescendants WHERE hybrid.parent_id = mydescendants.id 7 ) 8SELECT * FROM mydescendants; 9-- CTE Scan on mydescendants (cost=533247.96..533257.98 rows=501 width=48) (actual time=0.079..84677.808 rows=31 loops=1) 10-- Output: mydescendants.id, mydescendants.parent_id, mydescendants.path 11-- CTE mydescendants 12-- -> Recursive Union (cost=0.43..533247.96 rows=501 width=78) (actual time=0.061..84677.223 rows=31 loops=1) 13-- -> Index Scan using hybrid_pkey on public.hybrid (cost=0.43..8.45 rows=1 width=78) (actual time=0.042..0.067 rows=1 loops=1) 14-- Output: hybrid.id, hybrid.parent_id, hybrid.path 15-- Index Cond: (hybrid.id = 20311) 16-- -> Hash Join (cost=0.33..53322.95 rows=50 width=78) (actual time=11531.735..28225.489 rows=10 loops=3) 17-- Output: hybrid_1.id, hybrid_1.parent_id, hybrid_1.path 18-- Hash Cond: (hybrid_1.parent_id = mydescendants_1.id) 19-- -> Seq Scan on public.hybrid hybrid_1 (cost=0.00..45998.00 rows=1953100 width=78) (actual time=0.017..14011.599 rows=1953100 loops=3) 20-- Output: hybrid_1.id, hybrid_1.parent_id, hybrid_1.path 21-- -> Hash (cost=0.20..0.20 rows=10 width=8) (actual time=85.103..85.110 rows=10 loops=3) 22-- Output: mydescendants_1.id 23-- Buckets: 1024 Batches: 1 Memory Usage: 9kB 24-- -> WorkTable Scan on mydescendants mydescendants_1 (cost=0.00..0.20 rows=10 width=8) (actual time=84.907..84.994 rows=10 loops=3) 25-- Output: mydescendants_1.id 26-- Planning Time: 0.812 ms 27-- JIT: 28-- Functions: 11 29-- Options: Inlining true, Optimization true, Expressions true, Deforming true 30-- Timing: Generation 3.194 ms, Inlining 66.380 ms, Optimization 110.841 ms, Emission 77.258 ms, Total 257.673 ms 31-- Execution Time: 84805.831 ms
Ouch. 84.8s is a little long to wait for 31 rows.
Looking closer, There is a
Seq Scan in there returning the majority of our rows.
That should have been an
Index Scan. I forgot to create an index on
1CREATE INDEX ON hybrid (parent_id);
And let’s try the above recursive query again
1EXPLAIN (VERBOSE, ANALYZE) 2WITH RECURSIVE 3 mydescendants AS ( 4 SELECT * FROM hybrid WHERE id = 20311 5 UNION ALL 6 SELECT hybrid.* FROM hybrid, mydescendants WHERE hybrid.parent_id = mydescendants.id 7 ) 8SELECT * FROM mydescendants; 9-- CTE Scan on mydescendants (cost=1928.42..1938.44 rows=501 width=48) (actual time=0.085..4.217 rows=31 loops=1) 10-- Output: mydescendants.id, mydescendants.parent_id, mydescendants.path 11-- CTE mydescendants 12-- -> Recursive Union (cost=0.43..1928.42 rows=501 width=78) (actual time=0.064..3.387 rows=31 loops=1) 13-- -> Index Scan using hybrid_pkey on public.hybrid (cost=0.43..8.45 rows=1 width=78) (actual time=0.045..0.064 rows=1 loops=1) 14-- Output: hybrid.id, hybrid.parent_id, hybrid.path 15-- Index Cond: (hybrid.id = 20311) 16-- -> Nested Loop (cost=0.43..191.00 rows=50 width=78) (actual time=0.476..0.858 rows=10 loops=3) 17-- Output: hybrid_1.id, hybrid_1.parent_id, hybrid_1.path 18-- -> WorkTable Scan on mydescendants mydescendants_1 (cost=0.00..0.20 rows=10 width=8) (actual time=0.009..0.109 rows=10 loops=3) 19-- Output: mydescendants_1.id, mydescendants_1.parent_id, mydescendants_1.path 20-- -> Index Scan using hybrid_parent_id_idx on public.hybrid hybrid_1 (cost=0.43..19.03 rows=5 width=78) (actual time=0.015..0.026 rows=1 loops=31) 21-- Output: hybrid_1.id, hybrid_1.parent_id, hybrid_1.path 22-- Index Cond: (hybrid_1.parent_id = mydescendants_1.id) 23-- Planning Time: 0.223 ms 24-- Execution Time: 4.796 ms
4.8ms, that seems more reasonable.
It’s still a few times slower than querying by
path, but probably not detrimentally so.
Depending on your table depth and width, these differences will change as well, so remember to test
with a dataset that reflects your domain.
Let’s try an update. We’ll move a single node with 780 descendants:
1EXPLAIN (VERBOSE, ANALYZE) 2UPDATE hybrid SET parent_id = 101 WHERE id = 1000; 3-- Update on public.hybrid (cost=0.43..8.45 rows=0 width=0) (actual time=54.660..54.685 rows=0 loops=1) 4-- -> Index Scan using hybrid_pkey on public.hybrid (cost=0.43..8.45 rows=1 width=14) (actual time=0.108..0.144 rows=1 loops=1) 5-- Output: '101'::bigint, ctid 6-- Index Cond: (hybrid.id = 1000) 7-- Planning Time: 0.074 ms 8-- Trigger RI_ConstraintTrigger_c_20050 for constraint hybrid_parent_id_fkey: time=0.287 calls=1 9-- Trigger set_tree_path: time=53.347 calls=1 10-- Execution Time: 55.168 ms
Contrast that with the same query on the table without the
path and trigger:
1EXPLAIN (VERBOSE, ANALYZE) 2UPDATE hybrid SET parent_id = 101 WHERE id = 1000; 3-- Update on public.mytree (cost=0.43..8.45 rows=0 width=0) (actual time=0.616..0.642 rows=0 loops=1) 4-- -> Index Scan using mytree_pkey on public.mytree (cost=0.43..8.45 rows=1 width=14) (actual time=0.175..0.195 rows=1 loops=1) 5-- Output: '101'::bigint, ctid 6-- Index Cond: (mytree.id = 1000) 7-- Planning Time: 0.558 ms 8-- Trigger RI_ConstraintTrigger_c_19896 for constraint mytree_parent_id_fkey: time=0.370 calls=1 9-- Execution Time: 1.122 ms
Our hybrid table takes 50 times longer to update in this example.
Finally, let’s take a look at the sizes of our relations:
|primary key index||42 MB|
|parent_id index||22 MB|
|primary key index||229 MB|
|path GIST index (siglen=124)||268 MB|
|primary key index||42 MB|
|parent_id index||22 MB|
|path GIST index (siglen=124)||250 MB|
ltree tables are significantly larger.
This was a very unscientific dive into
ltree, but I still think we can draw some conclusions from it.
ltreeis a really cool extension!
- If we want to exactly mirror the behavior of having a
ltreealone is probably not the solution.
- For a small set of read-heavy workloads, a hybrid approach might be worth investigating,
especially if you have queries where you can take advantage of the
- A pure table with a
parent_idand using recursive CTE queries is probably still the best option for most, the added complexity of a hybrid approach, having to deal with triggers, the possibility of bugs in your triggers, etc, is probably not worth it.
There are other usecases for
ltree where it would be a better fit.
A category tree where you can fit you categories directly into the constraint of what a label can contain, will probably be really neat.
Uses where a thing can be in many places at once is also a thing that ltree enables, like dependency trees, labels (obviously), etc.
Modelling tags/labels with
ltree is something I’d like to explore in the future.