tudborg.com

§A Tree with Tables

There are plenty of reasons why you’d need to model a tree of data in your database.

Maybe you need to allow your users to organise some of their other data into groups, maybe you want to model inherited properties in your data, maybe you are trying to represent folders in your database, or a ton of other reasons.

In this post I’ll explore if there is any benefit in using the ltree extension for modelling the data you’d normally do with a regular parent/child relationship.

§Modelling a Tree

The usual way to do this is by adding a parent_id column that references the same table’s id column. This way you can still use foreign keys. Each row is a node in the tree:

1CREATE TABLE mytree (
2  id bigint PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
3  parent_id bigint REFERENCES mytree(id) ON DELETE CASCADE
4);

With this we can represent a root row (parent_id IS NULL) or a child row (parent_id IS NOT NULL)

Let’s create a tiny amount of test data:

1TRUNCATE mytree RESTART IDENTITY; -- if you already had some values in there
2INSERT INTO mytree (parent_id) VALUES (NULL), (NULL), (1), (2), (3), (3), (4), (4), (6);

Now we have the following tree:

─┬─1───3─┬─5
 │       └─6───9
 └─2───4─┬─7
         └─8

§Querying a Tree

In Postgres we can query this structure with a recursive CTEs, allowing us to query the entire tree with a single query.

To get a row with id 4 and all of it’s descendants:

 1WITH RECURSIVE
 2  mydescendants AS (
 3    SELECT * FROM mytree WHERE id = 4
 4    UNION ALL
 5    SELECT mytree.* FROM mytree, mydescendants WHERE mytree.parent_id = mydescendants.id
 6  )
 7SELECT * FROM mydescendants;
 8-- "id","parent_id"
 9-- 4,2
10-- 7,4
11-- 8,4

What if we need to find both descendants and ancestors?
No problem:

 1WITH RECURSIVE
 2  mydescendants AS (
 3    SELECT * FROM mytree WHERE parent_id = 4 -- note that I changed this to not include id=4 itself
 4    UNION ALL
 5    SELECT mytree.* FROM mytree, mydescendants WHERE mytree.parent_id = mydescendants.id
 6  ),
 7  myancestors AS (
 8    SELECT * FROM mytree WHERE id = 4
 9    UNION ALL
10    SELECT mytree.* FROM mytree, myancestors WHERE mytree.id = myancestors.parent_id
11  )
12SELECT * FROM myancestors
13UNION ALL
14SELECT * FROM mydescendants;
15-- "id","parent_id"
16-- 4,2
17-- 2,NULL
18-- 7,4
19-- 8,4

We now both have the ancestor 2, the row itself 4, and the decendants 7 and 8.

§Using the ltree Extension

Another way to model trees in postgres is to use the ltree extension. The ltree datatype is specifically designed for

representing labels of data stored in a hierarchical tree-like structure

“Labels” in this context is

a sequence of alphanumeric characters and underscores (for example, in C locale the characters A-Za-z0-9_ are allowed)

The ltree data type provides us with a way to represent and search for data stored in a tree-like structure, so maybe we can use this to model our tree?

Let’s explore.

§Modelling a Tree (Again)

1CREATE TABLE myltree (
2  path ltree PRIMARY KEY
3);

Let’s create some test data. We’ll use the same structure as we did previously.

We’ll use the id column from before as our labels, but note that we could use anything that is a valid label here.

1INSERT INTO myltree (path)
2VALUES ('1'), ('2'), ('1.3'), ('2.4'), ('1.3.5'),
3       ('1.3.6'), ('2.4.7'), ('2.4.8'), ('1.3.6.9');

Let’s take a look at our new table:

 1SELECT * FROM myltree;
 2-- "path"
 3-- "1"
 4-- "2"
 5-- "1.3"
 6-- "2.4"
 7-- "1.3.5"
 8-- "1.3.6"
 9-- "2.4.7"
10-- "2.4.8"
11-- "1.3.6.9"

§Querying a Tree (Again)

Let’s query for the same things as earlier.

First, given the node 4 (which is now represented by the path 2.4), how do we get that row and all of it’s descendants?

1SELECT * FROM myltree WHERE path <@ '2.4';
2-- "path"
3-- "2.4"
4-- "2.4.7"
5-- "2.4.8"

This returned the same rows as earlier.

What if we need to find both descendants and ancestors of the label '4'? Also easy:

1SELECT * FROM myltree WHERE path <@ '2.4' OR path @> '2.4';
2-- "path"
3-- "2"
4-- "2.4"
5-- "2.4.7"
6-- "2.4.8"

The @> operator checks if the left argument is an ancestor of the right argument.
The <@ operator checks if the left argument is a descendant of the right argument.

So the query above says “SELECT all columns FROM myltree WHERE path is a decendant of 2.4 OR path is an ancestor of 2.4”.
These queries are much simpler than before.

§What about Performance?

To detect any kind of performance difference, we’ll need to generate a much larger dataset to test on.

I’ll generate a dataset with 100 root rows, and add 5 child rows for all rows 7 levels deep.

I’m using these two queries to generate our test datasets:

 1-- myltree dataset
 2BEGIN;
 3TRUNCATE TABLE myltree;
 4CREATE TEMPORARY SEQUENCE ltreeid;
 5INSERT INTO myltree (path)
 6WITH RECURSIVE t(path, lvl) AS
 7(
 8  SELECT text2ltree(nextval('ltreeid')::text), 1 FROM generate_series(1,100)n
 9  UNION ALL
10  SELECT ltree_addtext(path, nextval('ltreeid')::text), lvl+1 FROM t, generate_series(1,5)n
11  WHERE lvl < 7
12)
13SELECT path FROM t;
14DROP SEQUENCE ltreeid;
15COMMIT;

which inserted 1,953,100 rows in 32.4 seconds.

and

 1-- mytree dataset
 2BEGIN;
 3TRUNCATE TABLE mytree;
 4CREATE TEMPORARY SEQUENCE treeid;
 5INSERT INTO mytree (id, parent_id)
 6WITH RECURSIVE t(id, parent_id, lvl) AS
 7(
 8  SELECT nextval('treeid')::bigint, null::bigint, 1 FROM generate_series(1,100)n
 9  UNION ALL
10  SELECT nextval('treeid')::bigint, t.id, lvl+1 FROM t, generate_series(1,5)n
11  WHERE lvl < 7
12)
13SELECT id, parent_id FROM t;
14DROP SEQUENCE treeid;
15COMMIT;

which again inserted 1,953,100 rows, this time in 77.2 seconds.

§Finding a specific row

(NOTE: I’m running these with max_parallel_workers_per_gather = 0)

Let’s take a look at the last row we inserted, row 1953100:

1SELECT * FROM mytree WHERE id = 1953100;
2-- "id","parent_id"
3-- 1953100,390600
4-- took: 2ms

1SELECT * FROM myltree WHERE path ~ '*.1953100';
2-- "path"
3-- "100.600.3100.15600.78100.390600.1953100"
4-- took: 570ms

We can see that both queries return a row where the parent is 390600, so it looks like our datasets are similar.

What the two queries don’t agree on is the time it takes to find a row, 570 ms vs 2 ms.

If you want to take a sidetrack with me and figure out why this is, I’ll dig into indexing over an ltree in this post.

The summary is that when we do WHERE path ~ '*.1953100', that condition can’t use our primary key btree index to find the row, and we end up having to scan the entire table.

Moving forward, I’ll be using a GIST index over path with siglen=124.

Of course, the ID of our ltree table isn’t just 1953100, so this isn’t a fair comparison. The ID of the myltree table is the full path, so let’s try and use that:

1SELECT * FROM myltree WHERE path = '100.600.3100.15600.78100.390600.1953100';
2-- "path"
3-- "100.600.3100.15600.78100.390600.1953100"
4-- took: 2ms

So if we have the full ID, the performance of looking up a row is similar.

ltree allows us to write much simpler queries, but there are also some additional features from using ltree.

§Example: Count LCA descendants

Let’s say we know the id of two rows:

100.600.3100.15600.65600.190600
100.600.3100.15600.78100.390600.1953100

And we are interested in finding the number of descendants contained by the longest common ancestor of those two rows.

With ltree, that’s really easy:

1SELECT count(*)
2FROM myltree
3WHERE lca('100.600.3100.15600.65600.190600', '100.600.3100.15600.78100.390600.1953100') @> path;
4-- "count"
5-- 156

§Example: Find all siblings:

1SELECT * FROM myltree
2WHERE subpath('100.600.3100.15600.65600.190600', 0, -1) @> path
3AND nlevel('100.600.3100.15600.65600.190600') = nlevel(path);
4-- "path"
5-- "100.600.3100.15600.65600.128100"
6-- "100.600.3100.15600.65600.190600"
7-- "100.600.3100.15600.65600.253100"
8-- "100.600.3100.15600.65600.315600"
9-- "100.600.3100.15600.65600.378100"

§Adding a row?

Adding a row to our mytree table is pretty easy. We need to know the parent we want to create a child for.

1INSERT INTO mytree (parent_id) VALUES (3);

Inserting into our myltree table is mostly the same.

However we don’t automatically get a new ID to use for this node. We could make up a value, or we could create a sequence to use each time we create a new row.

Using a sequence is most like the parent/child relationship in mytree so let’s assume we have a sequence available:

1CREATE SEQUENCE myltreeseq;

Inserting a new row then becomes

1-- '3'::ltree being the parent path
2INSERT INTO myltree (path) VALUES ('3'::ltree || text2ltree(nextval('myltreeseq')::text))

There is nothing to protect us against inserting a row where the parent no longer exists. If we wanted to ensure that, we’d need to write a trigger that blocks the insert if no parent exists in the table.

§Moving a row?

In our mytree table, moving a row so that it is parented to another row is trivial:

1UPDATE mytree SET parent_id = 3 WHERE id = 4;

The story is a bit different with an ltree.
Since all of our rows are independant of each other, if we need to move a row, we also need to move all of that rows descendants. Otherwise, their paths won’t change.

This is actually a feature of ltree, but in our specific scenario, we are not interested in having a subpath in our path that isn’t also a valid row on it’s own, so we need to take some care to move everything correctly:

1-- assuming we want to move 2.4 so that it becomes 1.3.4:
2UPDATE myltree SET path = '1.3' || subpath(path, nlevel('2.4') - 1) WHERE path <@ '2.4';

The obvious downside is that we need to update n rows instead of 1. For large trees this might get very expensive.

§Hybrid approach

We like having a single integer id for our row. We also like having a parent_id with a foreign key to that row, such that we can cascade deletes and ensure a parent exists on insert.

But what if we also like the features ltree provides? Could we build a table where we could maintain the relationships between the rows with our parent_id column, but also have a path column to query?

My first idea was a materialized view, but that since we can’t incrementally refresh those in postgres (yet), we’d have to rebuild the entire view every time we wanted to INSERT/UPDATE/DELETE.

My second idea was to write a trigger to maintain the ltree when a row changes. So that’s what I did:

 1-- hybrid table:
 2CREATE TABLE hybrid (
 3  id bigint PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
 4  parent_id bigint REFERENCES hybrid(id) ON DELETE CASCADE,
 5  path ltree
 6);
 7-- index on the path ltree
 8CREATE INDEX path_gist_idx ON hybrid USING GIST (path gist_ltree_ops(siglen=124));
 9
10-- trigger to update path if parent_id changes
11CREATE FUNCTION trigger_set_path_from_parent() RETURNS TRIGGER AS $$
12DECLARE
13  parentpath ltree;
14  currentnode ltree;
15BEGIN
16  -- On INSERT, nobody has seen this new row before, so we can safely just insert it
17  -- and construct the path from it's parent
18  IF TG_OP = 'INSERT' THEN
19    IF NEW.parent_id IS NULL THEN
20      NEW.path = text2ltree(NEW.id::text);
21    ELSE
22      SELECT ltree_addtext(path, NEW.id::text) INTO NEW.path FROM hybrid WHERE id = NEW.parent_id;
23    END IF;
24  -- If UPDATE, when a row's parent changes, we also need to also patch the path of all of the children.
25  ELSIF TG_OP = 'UPDATE' AND NEW.parent_id IS DISTINCT FROM OLD.parent_id THEN
26    -- we want to move the paths such that all descedants of NEW.id gets their path patched.
27    currentnode = text2ltree(NEW.id::text);
28    SELECT path into parentpath FROM hybrid WHERE id = NEW.parent_id;
29    IF index(parentpath, currentnode) > -1 THEN
30      RAISE EXCEPTION 'Circular reference. Tried to set parent to % for row %, but % is already in parent path %',
31        NEW.parent_id, NEW.id, NEW.id, parentpath;
32    ELSE
33      NEW.path = parentpath || currentnode;
34    END IF;
35    UPDATE hybrid
36      SET path = NEW.path || subpath(path, nlevel(OLD.path))
37      WHERE path <@ OLD.path AND id <> OLD.id;
38  END IF;
39  RETURN NEW;
40END;
41$$ LANGUAGE plpgsql;
42
43CREATE TRIGGER set_tree_path
44  BEFORE INSERT OR UPDATE ON hybrid
45  FOR EACH ROW
46  EXECUTE FUNCTION trigger_set_path_from_parent();

Now, let’s see how long it takes to generate the example dataset from earlier. Note that we now have a trigger that needs to run for each row, and a GIST index that needs to be maintained for the path column:

 1-- hybrid dataset
 2BEGIN;
 3TRUNCATE TABLE hybrid;
 4CREATE TEMPORARY SEQUENCE hybridid;
 5INSERT INTO hybrid (id, parent_id)
 6WITH RECURSIVE t(id, parent_id, lvl) AS
 7(
 8  SELECT nextval('hybridid')::bigint, null::bigint, 1 FROM generate_series(1,100)n
 9  UNION ALL
10  SELECT nextval('hybridid')::bigint, t.id, lvl+1 FROM t, generate_series(1,5)n
11  WHERE lvl < 7
12)
13SELECT id, parent_id FROM t;
14DROP SEQUENCE hybridid;
15COMMIT;
16-- took: 229.2s

Ok, so it took quite a bit longer to insert those rows. Let’s test out some queries.

 1EXPLAIN (VERBOSE, ANALYZE)
 2SELECT * FROM hybrid WHERE path <@ '11.311.2811.7811.20311';
 3-- Bitmap Heap Scan on public.hybrid  (cost=17.92..750.15 rows=195 width=78) (actual time=0.120..0.486 rows=31 loops=1)
 4--   Output: id, parent_id, path
 5--   Recheck Cond: (hybrid.path <@ '11.311.2811.7811.20311'::ltree)
 6--   Heap Blocks: exact=31
 7--   ->  Bitmap Index Scan on path_gist_idx  (cost=0.00..17.88 rows=195 width=0) (actual time=0.093..0.102 rows=31 loops=1)
 8--         Index Cond: (hybrid.path <@ '11.311.2811.7811.20311'::ltree)
 9-- Planning Time: 0.243 ms
10-- Execution Time: 0.896 ms

And the recursive CTE way:

 1EXPLAIN (VERBOSE, ANALYZE)
 2WITH RECURSIVE
 3  mydescendants AS (
 4    SELECT * FROM hybrid WHERE id = 20311
 5    UNION ALL
 6    SELECT hybrid.* FROM hybrid, mydescendants WHERE hybrid.parent_id = mydescendants.id
 7  )
 8SELECT * FROM mydescendants;
 9-- CTE Scan on mydescendants  (cost=533247.96..533257.98 rows=501 width=48) (actual time=0.079..84677.808 rows=31 loops=1)
10--   Output: mydescendants.id, mydescendants.parent_id, mydescendants.path
11--   CTE mydescendants
12--     ->  Recursive Union  (cost=0.43..533247.96 rows=501 width=78) (actual time=0.061..84677.223 rows=31 loops=1)
13--           ->  Index Scan using hybrid_pkey on public.hybrid  (cost=0.43..8.45 rows=1 width=78) (actual time=0.042..0.067 rows=1 loops=1)
14--                 Output: hybrid.id, hybrid.parent_id, hybrid.path
15--                 Index Cond: (hybrid.id = 20311)
16--           ->  Hash Join  (cost=0.33..53322.95 rows=50 width=78) (actual time=11531.735..28225.489 rows=10 loops=3)
17--                 Output: hybrid_1.id, hybrid_1.parent_id, hybrid_1.path
18--                 Hash Cond: (hybrid_1.parent_id = mydescendants_1.id)
19--                 ->  Seq Scan on public.hybrid hybrid_1  (cost=0.00..45998.00 rows=1953100 width=78) (actual time=0.017..14011.599 rows=1953100 loops=3)
20--                       Output: hybrid_1.id, hybrid_1.parent_id, hybrid_1.path
21--                 ->  Hash  (cost=0.20..0.20 rows=10 width=8) (actual time=85.103..85.110 rows=10 loops=3)
22--                       Output: mydescendants_1.id
23--                       Buckets: 1024  Batches: 1  Memory Usage: 9kB
24--                       ->  WorkTable Scan on mydescendants mydescendants_1  (cost=0.00..0.20 rows=10 width=8) (actual time=84.907..84.994 rows=10 loops=3)
25--                             Output: mydescendants_1.id
26-- Planning Time: 0.812 ms
27-- JIT:
28--   Functions: 11
29--   Options: Inlining true, Optimization true, Expressions true, Deforming true
30--   Timing: Generation 3.194 ms, Inlining 66.380 ms, Optimization 110.841 ms, Emission 77.258 ms, Total 257.673 ms
31-- Execution Time: 84805.831 ms

Ouch. 84.8s is a little long to wait for 31 rows. Looking closer, There is a Seq Scan in there returning the majority of our rows. That should have been an Index Scan. I forgot to create an index on parent_id.

1CREATE INDEX ON hybrid (parent_id);

And let’s try the above recursive query again

 1EXPLAIN (VERBOSE, ANALYZE)
 2WITH RECURSIVE
 3  mydescendants AS (
 4    SELECT * FROM hybrid WHERE id = 20311
 5    UNION ALL
 6    SELECT hybrid.* FROM hybrid, mydescendants WHERE hybrid.parent_id = mydescendants.id
 7  )
 8SELECT * FROM mydescendants;
 9-- CTE Scan on mydescendants  (cost=1928.42..1938.44 rows=501 width=48) (actual time=0.085..4.217 rows=31 loops=1)
10--   Output: mydescendants.id, mydescendants.parent_id, mydescendants.path
11--   CTE mydescendants
12--     ->  Recursive Union  (cost=0.43..1928.42 rows=501 width=78) (actual time=0.064..3.387 rows=31 loops=1)
13--           ->  Index Scan using hybrid_pkey on public.hybrid  (cost=0.43..8.45 rows=1 width=78) (actual time=0.045..0.064 rows=1 loops=1)
14--                 Output: hybrid.id, hybrid.parent_id, hybrid.path
15--                 Index Cond: (hybrid.id = 20311)
16--           ->  Nested Loop  (cost=0.43..191.00 rows=50 width=78) (actual time=0.476..0.858 rows=10 loops=3)
17--                 Output: hybrid_1.id, hybrid_1.parent_id, hybrid_1.path
18--                 ->  WorkTable Scan on mydescendants mydescendants_1  (cost=0.00..0.20 rows=10 width=8) (actual time=0.009..0.109 rows=10 loops=3)
19--                       Output: mydescendants_1.id, mydescendants_1.parent_id, mydescendants_1.path
20--                 ->  Index Scan using hybrid_parent_id_idx on public.hybrid hybrid_1  (cost=0.43..19.03 rows=5 width=78) (actual time=0.015..0.026 rows=1 loops=31)
21--                       Output: hybrid_1.id, hybrid_1.parent_id, hybrid_1.path
22--                       Index Cond: (hybrid_1.parent_id = mydescendants_1.id)
23-- Planning Time: 0.223 ms
24-- Execution Time: 4.796 ms

4.8ms, that seems more reasonable. It’s still a few times slower than querying by path, but probably not detrimentally so. Depending on your table depth and width, these differences will change as well, so remember to test with a dataset that reflects your domain.

Let’s try an update. We’ll move a single node with 780 descendants:

 1EXPLAIN (VERBOSE, ANALYZE)
 2UPDATE hybrid SET parent_id = 101 WHERE id = 1000;
 3-- Update on public.hybrid  (cost=0.43..8.45 rows=0 width=0) (actual time=54.660..54.685 rows=0 loops=1)
 4--   ->  Index Scan using hybrid_pkey on public.hybrid  (cost=0.43..8.45 rows=1 width=14) (actual time=0.108..0.144 rows=1 loops=1)
 5--         Output: '101'::bigint, ctid
 6--         Index Cond: (hybrid.id = 1000)
 7-- Planning Time: 0.074 ms
 8-- Trigger RI_ConstraintTrigger_c_20050 for constraint hybrid_parent_id_fkey: time=0.287 calls=1
 9-- Trigger set_tree_path: time=53.347 calls=1
10-- Execution Time: 55.168 ms

Contrast that with the same query on the table without the path and trigger:

1EXPLAIN (VERBOSE, ANALYZE)
2UPDATE hybrid SET parent_id = 101 WHERE id = 1000;
3-- Update on public.mytree  (cost=0.43..8.45 rows=0 width=0) (actual time=0.616..0.642 rows=0 loops=1)
4--   ->  Index Scan using mytree_pkey on public.mytree  (cost=0.43..8.45 rows=1 width=14) (actual time=0.175..0.195 rows=1 loops=1)
5--         Output: '101'::bigint, ctid
6--         Index Cond: (mytree.id = 1000)
7-- Planning Time: 0.558 ms
8-- Trigger RI_ConstraintTrigger_c_19896 for constraint mytree_parent_id_fkey: time=0.370 calls=1
9-- Execution Time: 1.122 ms

Our hybrid table takes 50 times longer to update in this example.

Finally, let’s take a look at the sizes of our relations:

§Regular Table:

	Size
table	82 MB
primary key index	42 MB
parent_id index	22 MB

§ltree Table:

	Size
table	177 MB
primary key index	229 MB
path GIST index (siglen=124)	268 MB

§Hybrid Table:

	Size
table	207 MB
primary key index	42 MB
parent_id index	22 MB
path GIST index (siglen=124)	250 MB

The ltree tables are significantly larger.

§Conclusion

This was a very unscientific dive into ltree, but I still think we can draw some conclusions from it.

ltree is a really cool extension!
If we want to exactly mirror the behavior of having a parent_id column, ltree alone is probably not the solution.
For a small set of read-heavy workloads, a hybrid approach might be worth investigating, especially if you have queries where you can take advantage of the ltree properties.
A pure table with a parent_id and using recursive CTE queries is probably still the best option for most, the added complexity of a hybrid approach, having to deal with triggers, the possibility of bugs in your triggers, etc, is probably not worth it.

There are other usecases for ltree where it would be a better fit.
A category tree where you can fit you categories directly into the constraint of what a label can contain, will probably be really neat.

Uses where a thing can be in many places at once is also a thing that ltree enables, like dependency trees, labels (obviously), etc.

Modelling tags/labels with ltree is something I’d like to explore in the future.

Hierarchical structures in PostgreSQL (with ltree)