§A Tree with Tables
There are plenty of reasons why you’d need to model a tree of data in your database.
Maybe you need to allow your users to organise some of their other data into groups, maybe you want to model inherited properties in your data, maybe you are trying to represent folders in your database, or a ton of other reasons.
In this post I’ll explore if there is any benefit in using the ltree
extension for modelling the data you’d normally do with a regular parent/child relationship.
§Modelling a Tree
The usual way to do this is by adding a parent_id
column that references
the same table’s id
column. This way you can still use foreign keys.
Each row is a node in the tree:
1CREATE TABLE mytree (
2 id bigint PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
3 parent_id bigint REFERENCES mytree(id) ON DELETE CASCADE
4);
With this we can represent a root row (parent_id IS NULL
)
or a child row (parent_id IS NOT NULL
)
Let’s create a tiny amount of test data:
1TRUNCATE mytree RESTART IDENTITY; -- if you already had some values in there
2INSERT INTO mytree (parent_id) VALUES (NULL), (NULL), (1), (2), (3), (3), (4), (4), (6);
Now we have the following tree:
─┬─1───3─┬─5
│ └─6───9
└─2───4─┬─7
└─8
§Querying a Tree
In Postgres we can query this structure with a recursive CTEs, allowing us to query the entire tree with a single query.
To get a row with id 4
and all of it’s descendants:
1WITH RECURSIVE
2 mydescendants AS (
3 SELECT * FROM mytree WHERE id = 4
4 UNION ALL
5 SELECT mytree.* FROM mytree, mydescendants WHERE mytree.parent_id = mydescendants.id
6 )
7SELECT * FROM mydescendants;
8-- "id","parent_id"
9-- 4,2
10-- 7,4
11-- 8,4
What if we need to find both descendants and ancestors?
No problem:
1WITH RECURSIVE
2 mydescendants AS (
3 SELECT * FROM mytree WHERE parent_id = 4 -- note that I changed this to not include id=4 itself
4 UNION ALL
5 SELECT mytree.* FROM mytree, mydescendants WHERE mytree.parent_id = mydescendants.id
6 ),
7 myancestors AS (
8 SELECT * FROM mytree WHERE id = 4
9 UNION ALL
10 SELECT mytree.* FROM mytree, myancestors WHERE mytree.id = myancestors.parent_id
11 )
12SELECT * FROM myancestors
13UNION ALL
14SELECT * FROM mydescendants;
15-- "id","parent_id"
16-- 4,2
17-- 2,NULL
18-- 7,4
19-- 8,4
We now both have the ancestor 2
, the row itself 4
, and the decendants 7
and 8
.
§Using the ltree Extension
Another way to model trees in postgres is to use the ltree
extension.
The ltree
datatype is specifically designed for
representing labels of data stored in a hierarchical tree-like structure
“Labels” in this context is
a sequence of alphanumeric characters and underscores (for example, in C locale the characters A-Za-z0-9_ are allowed)
The ltree
data type provides us with a way to represent and search
for data stored in a tree-like structure, so maybe we can use this to model our
tree?
Let’s explore.
§Modelling a Tree (Again)
Let’s create some test data. We’ll use the same structure as we did previously.
We’ll use the id
column from before as our labels, but note that we could
use anything that is a valid label here.
1INSERT INTO myltree (path)
2VALUES ('1'), ('2'), ('1.3'), ('2.4'), ('1.3.5'),
3 ('1.3.6'), ('2.4.7'), ('2.4.8'), ('1.3.6.9');
Let’s take a look at our new table:
1SELECT * FROM myltree;
2-- "path"
3-- "1"
4-- "2"
5-- "1.3"
6-- "2.4"
7-- "1.3.5"
8-- "1.3.6"
9-- "2.4.7"
10-- "2.4.8"
11-- "1.3.6.9"
§Querying a Tree (Again)
Let’s query for the same things as earlier.
First, given the node 4 (which is now represented by the path 2.4
), how
do we get that row and all of it’s descendants?
This returned the same rows as earlier.
What if we need to find both descendants and ancestors of the label '4'
?
Also easy:
1SELECT * FROM myltree WHERE path <@ '2.4' OR path @> '2.4';
2-- "path"
3-- "2"
4-- "2.4"
5-- "2.4.7"
6-- "2.4.8"
The @>
operator checks if the left argument is an ancestor of the right argument.
The <@
operator checks if the left argument is a descendant of the right argument.
So the query above says “SELECT all columns FROM myltree WHERE path is a decendant of 2.4
OR path is an ancestor of 2.4
”.
These queries are much simpler than before.
§What about Performance?
To detect any kind of performance difference, we’ll need to generate a much larger dataset to test on.
I’ll generate a dataset with 100 root rows, and add 5 child rows for all rows 7 levels deep.
I’m using these two queries to generate our test datasets:
1-- myltree dataset
2BEGIN;
3TRUNCATE TABLE myltree;
4CREATE TEMPORARY SEQUENCE ltreeid;
5INSERT INTO myltree (path)
6WITH RECURSIVE t(path, lvl) AS
7(
8 SELECT text2ltree(nextval('ltreeid')::text), 1 FROM generate_series(1,100)n
9 UNION ALL
10 SELECT ltree_addtext(path, nextval('ltreeid')::text), lvl+1 FROM t, generate_series(1,5)n
11 WHERE lvl < 7
12)
13SELECT path FROM t;
14DROP SEQUENCE ltreeid;
15COMMIT;
which inserted 1,953,100 rows in 32.4 seconds.
and
1-- mytree dataset
2BEGIN;
3TRUNCATE TABLE mytree;
4CREATE TEMPORARY SEQUENCE treeid;
5INSERT INTO mytree (id, parent_id)
6WITH RECURSIVE t(id, parent_id, lvl) AS
7(
8 SELECT nextval('treeid')::bigint, null::bigint, 1 FROM generate_series(1,100)n
9 UNION ALL
10 SELECT nextval('treeid')::bigint, t.id, lvl+1 FROM t, generate_series(1,5)n
11 WHERE lvl < 7
12)
13SELECT id, parent_id FROM t;
14DROP SEQUENCE treeid;
15COMMIT;
which again inserted 1,953,100 rows, this time in 77.2 seconds.
§Finding a specific row
(NOTE: I’m running these with max_parallel_workers_per_gather = 0
)
Let’s take a look at the last row we inserted, row 1953100:
1SELECT * FROM myltree WHERE path ~ '*.1953100';
2-- "path"
3-- "100.600.3100.15600.78100.390600.1953100"
4-- took: 570ms
We can see that both queries return a row where the parent is 390600
,
so it looks like our datasets are similar.
What the two queries don’t agree on is the time it takes to find a row, 570 ms vs 2 ms.
If you want to take a sidetrack with me and figure out why this is, I’ll dig into indexing over an ltree in this post.
The summary is that when we do WHERE path ~ '*.1953100'
, that condition
can’t use our primary key btree index to find the row, and we end up
having to scan the entire table.
Moving forward, I’ll be using a GIST index over path
with siglen=124
.
Of course, the ID of our ltree table isn’t just 1953100
, so
this isn’t a fair comparison. The ID of the myltree
table is the full path,
so let’s try and use that:
1SELECT * FROM myltree WHERE path = '100.600.3100.15600.78100.390600.1953100';
2-- "path"
3-- "100.600.3100.15600.78100.390600.1953100"
4-- took: 2ms
So if we have the full ID, the performance of looking up a row is similar.
ltree
allows us to write much simpler queries, but there are also
some additional features from using ltree.
§Example: Count LCA descendants
Let’s say we know the id of two rows:
100.600.3100.15600.65600.190600
100.600.3100.15600.78100.390600.1953100
And we are interested in finding the number of descendants contained by the longest common ancestor of those two rows.
With ltree
, that’s really easy:
1SELECT count(*)
2FROM myltree
3WHERE lca('100.600.3100.15600.65600.190600', '100.600.3100.15600.78100.390600.1953100') @> path;
4-- "count"
5-- 156
§Example: Find all siblings:
1SELECT * FROM myltree
2WHERE subpath('100.600.3100.15600.65600.190600', 0, -1) @> path
3AND nlevel('100.600.3100.15600.65600.190600') = nlevel(path);
4-- "path"
5-- "100.600.3100.15600.65600.128100"
6-- "100.600.3100.15600.65600.190600"
7-- "100.600.3100.15600.65600.253100"
8-- "100.600.3100.15600.65600.315600"
9-- "100.600.3100.15600.65600.378100"
§Adding a row?
Adding a row to our mytree
table is pretty easy.
We need to know the parent we want to create a child for.
1INSERT INTO mytree (parent_id) VALUES (3);
Inserting into our myltree
table is mostly the same.
However we don’t automatically get a new ID to use for this node. We could make up a value, or we could create a sequence to use each time we create a new row.
Using a sequence is most like the parent/child relationship in mytree
so
let’s assume we have a sequence available:
1CREATE SEQUENCE myltreeseq;
Inserting a new row then becomes
1-- '3'::ltree being the parent path
2INSERT INTO myltree (path) VALUES ('3'::ltree || text2ltree(nextval('myltreeseq')::text))
There is nothing to protect us against inserting a row where the parent no longer exists. If we wanted to ensure that, we’d need to write a trigger that blocks the insert if no parent exists in the table.
§Moving a row?
In our mytree
table, moving a row so that it is parented to another row is trivial:
1UPDATE mytree SET parent_id = 3 WHERE id = 4;
The story is a bit different with an ltree.
Since all of our rows are independant of each other, if we need to move a row,
we also need to move all of that rows descendants. Otherwise, their paths won’t change.
This is actually a feature of ltree
, but in our specific scenario, we are not
interested in having a subpath in our path that isn’t also a valid row on it’s own, so we
need to take some care to move everything correctly:
1-- assuming we want to move 2.4 so that it becomes 1.3.4:
2UPDATE myltree SET path = '1.3' || subpath(path, nlevel('2.4') - 1) WHERE path <@ '2.4';
The obvious downside is that we need to update n
rows instead of 1.
For large trees this might get very expensive.
§Hybrid approach
We like having a single integer id for our row.
We also like having a parent_id
with a foreign key to that row, such that
we can cascade deletes and ensure a parent exists on insert.
But what if we also like the features ltree
provides?
Could we build a table where we could maintain the relationships between
the rows with our parent_id
column, but also have a path
column to query?
My first idea was a materialized view, but that since we can’t incrementally refresh those in postgres (yet), we’d have to rebuild the entire view every time we wanted to INSERT/UPDATE/DELETE.
My second idea was to write a trigger to maintain the ltree when a row changes. So that’s what I did:
1-- hybrid table:
2CREATE TABLE hybrid (
3 id bigint PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
4 parent_id bigint REFERENCES hybrid(id) ON DELETE CASCADE,
5 path ltree
6);
7-- index on the path ltree
8CREATE INDEX path_gist_idx ON hybrid USING GIST (path gist_ltree_ops(siglen=124));
9
10-- trigger to update path if parent_id changes
11CREATE FUNCTION trigger_set_path_from_parent() RETURNS TRIGGER AS $$
12DECLARE
13 parentpath ltree;
14 currentnode ltree;
15BEGIN
16 -- On INSERT, nobody has seen this new row before, so we can safely just insert it
17 -- and construct the path from it's parent
18 IF TG_OP = 'INSERT' THEN
19 IF NEW.parent_id IS NULL THEN
20 NEW.path = text2ltree(NEW.id::text);
21 ELSE
22 SELECT ltree_addtext(path, NEW.id::text) INTO NEW.path FROM hybrid WHERE id = NEW.parent_id;
23 END IF;
24 -- If UPDATE, when a row's parent changes, we also need to also patch the path of all of the children.
25 ELSIF TG_OP = 'UPDATE' AND NEW.parent_id IS DISTINCT FROM OLD.parent_id THEN
26 -- we want to move the paths such that all descedants of NEW.id gets their path patched.
27 currentnode = text2ltree(NEW.id::text);
28 SELECT path into parentpath FROM hybrid WHERE id = NEW.parent_id;
29 IF index(parentpath, currentnode) > -1 THEN
30 RAISE EXCEPTION 'Circular reference. Tried to set parent to % for row %, but % is already in parent path %',
31 NEW.parent_id, NEW.id, NEW.id, parentpath;
32 ELSE
33 NEW.path = parentpath || currentnode;
34 END IF;
35 UPDATE hybrid
36 SET path = NEW.path || subpath(path, nlevel(OLD.path))
37 WHERE path <@ OLD.path AND id <> OLD.id;
38 END IF;
39 RETURN NEW;
40END;
41$$ LANGUAGE plpgsql;
42
43CREATE TRIGGER set_tree_path
44 BEFORE INSERT OR UPDATE ON hybrid
45 FOR EACH ROW
46 EXECUTE FUNCTION trigger_set_path_from_parent();
Now, let’s see how long it takes to generate the example dataset from earlier.
Note that we now have a trigger that needs to run for each row, and a GIST index
that needs to be maintained for the path
column:
1-- hybrid dataset
2BEGIN;
3TRUNCATE TABLE hybrid;
4CREATE TEMPORARY SEQUENCE hybridid;
5INSERT INTO hybrid (id, parent_id)
6WITH RECURSIVE t(id, parent_id, lvl) AS
7(
8 SELECT nextval('hybridid')::bigint, null::bigint, 1 FROM generate_series(1,100)n
9 UNION ALL
10 SELECT nextval('hybridid')::bigint, t.id, lvl+1 FROM t, generate_series(1,5)n
11 WHERE lvl < 7
12)
13SELECT id, parent_id FROM t;
14DROP SEQUENCE hybridid;
15COMMIT;
16-- took: 229.2s
Ok, so it took quite a bit longer to insert those rows. Let’s test out some queries.
1EXPLAIN (VERBOSE, ANALYZE)
2SELECT * FROM hybrid WHERE path <@ '11.311.2811.7811.20311';
3-- Bitmap Heap Scan on public.hybrid (cost=17.92..750.15 rows=195 width=78) (actual time=0.120..0.486 rows=31 loops=1)
4-- Output: id, parent_id, path
5-- Recheck Cond: (hybrid.path <@ '11.311.2811.7811.20311'::ltree)
6-- Heap Blocks: exact=31
7-- -> Bitmap Index Scan on path_gist_idx (cost=0.00..17.88 rows=195 width=0) (actual time=0.093..0.102 rows=31 loops=1)
8-- Index Cond: (hybrid.path <@ '11.311.2811.7811.20311'::ltree)
9-- Planning Time: 0.243 ms
10-- Execution Time: 0.896 ms
And the recursive CTE way:
1EXPLAIN (VERBOSE, ANALYZE)
2WITH RECURSIVE
3 mydescendants AS (
4 SELECT * FROM hybrid WHERE id = 20311
5 UNION ALL
6 SELECT hybrid.* FROM hybrid, mydescendants WHERE hybrid.parent_id = mydescendants.id
7 )
8SELECT * FROM mydescendants;
9-- CTE Scan on mydescendants (cost=533247.96..533257.98 rows=501 width=48) (actual time=0.079..84677.808 rows=31 loops=1)
10-- Output: mydescendants.id, mydescendants.parent_id, mydescendants.path
11-- CTE mydescendants
12-- -> Recursive Union (cost=0.43..533247.96 rows=501 width=78) (actual time=0.061..84677.223 rows=31 loops=1)
13-- -> Index Scan using hybrid_pkey on public.hybrid (cost=0.43..8.45 rows=1 width=78) (actual time=0.042..0.067 rows=1 loops=1)
14-- Output: hybrid.id, hybrid.parent_id, hybrid.path
15-- Index Cond: (hybrid.id = 20311)
16-- -> Hash Join (cost=0.33..53322.95 rows=50 width=78) (actual time=11531.735..28225.489 rows=10 loops=3)
17-- Output: hybrid_1.id, hybrid_1.parent_id, hybrid_1.path
18-- Hash Cond: (hybrid_1.parent_id = mydescendants_1.id)
19-- -> Seq Scan on public.hybrid hybrid_1 (cost=0.00..45998.00 rows=1953100 width=78) (actual time=0.017..14011.599 rows=1953100 loops=3)
20-- Output: hybrid_1.id, hybrid_1.parent_id, hybrid_1.path
21-- -> Hash (cost=0.20..0.20 rows=10 width=8) (actual time=85.103..85.110 rows=10 loops=3)
22-- Output: mydescendants_1.id
23-- Buckets: 1024 Batches: 1 Memory Usage: 9kB
24-- -> WorkTable Scan on mydescendants mydescendants_1 (cost=0.00..0.20 rows=10 width=8) (actual time=84.907..84.994 rows=10 loops=3)
25-- Output: mydescendants_1.id
26-- Planning Time: 0.812 ms
27-- JIT:
28-- Functions: 11
29-- Options: Inlining true, Optimization true, Expressions true, Deforming true
30-- Timing: Generation 3.194 ms, Inlining 66.380 ms, Optimization 110.841 ms, Emission 77.258 ms, Total 257.673 ms
31-- Execution Time: 84805.831 ms
Ouch. 84.8s is a little long to wait for 31 rows.
Looking closer, There is a Seq Scan
in there returning the majority of our rows.
That should have been an Index Scan
. I forgot to create an index on parent_id
.
1CREATE INDEX ON hybrid (parent_id);
And let’s try the above recursive query again
1EXPLAIN (VERBOSE, ANALYZE)
2WITH RECURSIVE
3 mydescendants AS (
4 SELECT * FROM hybrid WHERE id = 20311
5 UNION ALL
6 SELECT hybrid.* FROM hybrid, mydescendants WHERE hybrid.parent_id = mydescendants.id
7 )
8SELECT * FROM mydescendants;
9-- CTE Scan on mydescendants (cost=1928.42..1938.44 rows=501 width=48) (actual time=0.085..4.217 rows=31 loops=1)
10-- Output: mydescendants.id, mydescendants.parent_id, mydescendants.path
11-- CTE mydescendants
12-- -> Recursive Union (cost=0.43..1928.42 rows=501 width=78) (actual time=0.064..3.387 rows=31 loops=1)
13-- -> Index Scan using hybrid_pkey on public.hybrid (cost=0.43..8.45 rows=1 width=78) (actual time=0.045..0.064 rows=1 loops=1)
14-- Output: hybrid.id, hybrid.parent_id, hybrid.path
15-- Index Cond: (hybrid.id = 20311)
16-- -> Nested Loop (cost=0.43..191.00 rows=50 width=78) (actual time=0.476..0.858 rows=10 loops=3)
17-- Output: hybrid_1.id, hybrid_1.parent_id, hybrid_1.path
18-- -> WorkTable Scan on mydescendants mydescendants_1 (cost=0.00..0.20 rows=10 width=8) (actual time=0.009..0.109 rows=10 loops=3)
19-- Output: mydescendants_1.id, mydescendants_1.parent_id, mydescendants_1.path
20-- -> Index Scan using hybrid_parent_id_idx on public.hybrid hybrid_1 (cost=0.43..19.03 rows=5 width=78) (actual time=0.015..0.026 rows=1 loops=31)
21-- Output: hybrid_1.id, hybrid_1.parent_id, hybrid_1.path
22-- Index Cond: (hybrid_1.parent_id = mydescendants_1.id)
23-- Planning Time: 0.223 ms
24-- Execution Time: 4.796 ms
4.8ms, that seems more reasonable.
It’s still a few times slower than querying by path
, but probably not detrimentally so.
Depending on your table depth and width, these differences will change as well, so remember to test
with a dataset that reflects your domain.
Let’s try an update. We’ll move a single node with 780 descendants:
1EXPLAIN (VERBOSE, ANALYZE)
2UPDATE hybrid SET parent_id = 101 WHERE id = 1000;
3-- Update on public.hybrid (cost=0.43..8.45 rows=0 width=0) (actual time=54.660..54.685 rows=0 loops=1)
4-- -> Index Scan using hybrid_pkey on public.hybrid (cost=0.43..8.45 rows=1 width=14) (actual time=0.108..0.144 rows=1 loops=1)
5-- Output: '101'::bigint, ctid
6-- Index Cond: (hybrid.id = 1000)
7-- Planning Time: 0.074 ms
8-- Trigger RI_ConstraintTrigger_c_20050 for constraint hybrid_parent_id_fkey: time=0.287 calls=1
9-- Trigger set_tree_path: time=53.347 calls=1
10-- Execution Time: 55.168 ms
Contrast that with the same query on the table without the path
and trigger:
1EXPLAIN (VERBOSE, ANALYZE)
2UPDATE hybrid SET parent_id = 101 WHERE id = 1000;
3-- Update on public.mytree (cost=0.43..8.45 rows=0 width=0) (actual time=0.616..0.642 rows=0 loops=1)
4-- -> Index Scan using mytree_pkey on public.mytree (cost=0.43..8.45 rows=1 width=14) (actual time=0.175..0.195 rows=1 loops=1)
5-- Output: '101'::bigint, ctid
6-- Index Cond: (mytree.id = 1000)
7-- Planning Time: 0.558 ms
8-- Trigger RI_ConstraintTrigger_c_19896 for constraint mytree_parent_id_fkey: time=0.370 calls=1
9-- Execution Time: 1.122 ms
Our hybrid table takes 50 times longer to update in this example.
Finally, let’s take a look at the sizes of our relations:
§Regular Table:
Size | |
---|---|
table | 82 MB |
primary key index | 42 MB |
parent_id index | 22 MB |
§ltree Table:
Size | |
---|---|
table | 177 MB |
primary key index | 229 MB |
path GIST index (siglen=124) | 268 MB |
§Hybrid Table:
Size | |
---|---|
table | 207 MB |
primary key index | 42 MB |
parent_id index | 22 MB |
path GIST index (siglen=124) | 250 MB |
The ltree
tables are significantly larger.
§Conclusion
This was a very unscientific dive into ltree
, but I still think we can draw some conclusions from it.
ltree
is a really cool extension!- If we want to exactly mirror the behavior of having a
parent_id
column,ltree
alone is probably not the solution. - For a small set of read-heavy workloads, a hybrid approach might be worth investigating,
especially if you have queries where you can take advantage of the
ltree
properties. - A pure table with a
parent_id
and using recursive CTE queries is probably still the best option for most, the added complexity of a hybrid approach, having to deal with triggers, the possibility of bugs in your triggers, etc, is probably not worth it.
There are other usecases for ltree
where it would be a better fit.
A category tree where you can fit you categories directly into the constraint of what
a label can contain, will probably be really neat.
Uses where a thing can be in many places at once is also a thing that ltree enables, like dependency trees, labels (obviously), etc.
Modelling tags/labels with ltree
is something I’d like to explore in the future.