Indexing ltree with GIST
§Sidetracked?
Hi.
This is a dive into GIST indexes on the ltree
datatype in PostgreSQL.
I assume you came from the parent post, but if not, you probably want to read that first for context.
Done? OK.
§GIST index on ltree
Looking up the ltree row was much slower than looking up the bigint row. Why is that?
Let’s ask the database:
EXPLAIN SELECT * FROM mytree WHERE id = 1953100;
-- Index Scan using mytree_pkey on mytree (cost=0.43..8.45 rows=1 width=16)
--   Index Cond: (id = 1953100)
Sure, using the primary key index to find the row by id seems reasonable enough.
EXPLAIN SELECT * FROM myltree WHERE path ~ '*.1953100';
-- Seq Scan on myltree (cost=0.00..47037.75 rows=195 width=62)
--   Filter: (path ~ '*.1953100'::lquery)
Oh! It turns out we are scanning through the entire table to find our one row (the Seq Scan part).
The path column does have a btree index (it is our primary key), but to use it we need to know the full path so we can use the = operator.
If we use the full path of the row, the query plan is practically identical to our mytree example, and runs in about 2 ms.
EXPLAIN (VERBOSE, ANALYZE) SELECT * FROM myltree WHERE path = '100.600.3100.15600.78100.390600.1953100';
-- Index Only Scan using myltree_pkey on public.myltree (cost=0.55..4.57 rows=1 width=62) (actual time=0.111..0.131 rows=1 loops=1)
--   Output: path
--   Index Cond: (myltree.path = '100.600.3100.15600.78100.390600.1953100'::ltree)
--   Heap Fetches: 0
-- Planning Time: 1.131 ms
-- Execution Time: 0.294 ms
If you treat the path
column as your id, then you will have the full path when
you query the table, and this won’t really be a problem.
ltree also allows using an index for this kind of query, but we must create a GIST index on path.
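A minimal version looks something like this (the index name matches the myltree_path_idx the planner reports below; with no siglen specified we get the default operator class settings):

-- Default GIST index on path, no siglen specified:
CREATE INDEX myltree_path_idx ON myltree USING GIST (path);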
Now our query from before, searching by the last label in path, uses our new index, but it’s still pretty slow!
EXPLAIN (VERBOSE, ANALYZE) SELECT * FROM myltree WHERE path ~ '*.1953100';
-- Bitmap Heap Scan on public.myltree (cost=17.92..746.05 rows=195 width=62) (actual time=460.318..460.344 rows=1 loops=1)
--   Output: path
--   Recheck Cond: (myltree.path ~ '*.1953100'::lquery)
--   Heap Blocks: exact=1
--   -> Bitmap Index Scan on myltree_path_idx (cost=0.00..17.88 rows=195 width=0) (actual time=460.291..460.297 rows=1 loops=1)
--        Index Cond: (myltree.path ~ '*.1953100'::lquery)
-- Planning Time: 0.209 ms
-- Execution Time: 460.521 ms
That didn’t improve much. Turns out that for our specific table,
Bitmap Index Scan
is not much faster than the sequential scan, but that’s with the default GIST index.
A GIST index is a lossy index, meaning it can produce false matches. Rows returned by a lossy index need to be checked against the actual table row before being returned as a result (Postgres does this for you; that is the Recheck Cond step in the plan above).
We can adjust the siglen parameter of the GIST index to control how precise our search will be.
A larger siglen means a more precise index, but also a larger index, so we need to balance
these against each other.
See also ltree Indexes and Text Search Indexes from the manual.
Let’s do an experiment. We’ll create a few different indexes and check their sizes and performance to validate the claim above that a larger siglen yields more precise results at the cost of a larger index:
-- Drop the default index from before so it doesn't compete:
DROP INDEX myltree_path_idx;

-- Then create our new ones:
CREATE INDEX myltree_path_siglen8_idx ON myltree USING GIST (path gist_ltree_ops(siglen=8));
CREATE INDEX myltree_path_siglen128_idx ON myltree USING GIST (path gist_ltree_ops(siglen=128));
CREATE INDEX myltree_path_siglen256_idx ON myltree USING GIST (path gist_ltree_ops(siglen=256));
CREATE INDEX myltree_path_siglen512_idx ON myltree USING GIST (path gist_ltree_ops(siglen=512));
CREATE INDEX myltree_path_siglen1024_idx ON myltree USING GIST (path gist_ltree_ops(siglen=1024));

-- and check their sizes:
SELECT
  pg_size_pretty(pg_relation_size('myltree')) AS myltree,
  pg_size_pretty(pg_relation_size('myltree_path_siglen8_idx')) AS myltree_path_siglen8_idx,
  pg_size_pretty(pg_relation_size('myltree_path_siglen128_idx')) AS myltree_path_siglen128_idx,
  pg_size_pretty(pg_relation_size('myltree_path_siglen256_idx')) AS myltree_path_siglen256_idx,
  pg_size_pretty(pg_relation_size('myltree_path_siglen512_idx')) AS myltree_path_siglen512_idx,
  pg_size_pretty(pg_relation_size('myltree_path_siglen1024_idx')) AS myltree_path_siglen1024_idx;

-- myltree: 177 MB
-- myltree_path_siglen8_idx: 275 MB
-- myltree_path_siglen128_idx: 283 MB
-- myltree_path_siglen256_idx: 289 MB
-- myltree_path_siglen512_idx: 321 MB
-- myltree_path_siglen1024_idx: 341 MB
Our table itself is 177 MB. The smallest index is 275 MB at a siglen of 8, and the indexes grow larger as we increase siglen, so the manual is definitely correct in that regard.
Are the larger indexes any faster though? Or maybe slower, because we have to look at more data?
Let’s test it. I wonder which index Postgres will actually choose to use.
Same query as before:
EXPLAIN (VERBOSE, ANALYZE) SELECT * FROM myltree WHERE path ~ '*.1953100';
-- Bitmap Heap Scan on public.myltree (cost=17.92..746.05 rows=195 width=62) (actual time=26.302..26.338 rows=1 loops=1)
--   Output: path
--   Recheck Cond: (myltree.path ~ '*.1953100'::lquery)
--   Heap Blocks: exact=1
--   -> Bitmap Index Scan on myltree_path_siglen256_idx (cost=0.00..17.88 rows=195 width=0) (actual time=26.274..26.283 rows=1 loops=1)
--        Index Cond: (myltree.path ~ '*.1953100'::lquery)
-- Planning Time: 0.200 ms
-- Execution Time: 26.520 ms
Much faster at 26 ms, and as we can see from Bitmap Index Scan on myltree_path_siglen256_idx, it chose the siglen=256 index. Not the smallest, not the largest.
What happens if we drop that index and try again?
DROP INDEX myltree_path_siglen256_idx;
EXPLAIN (VERBOSE, ANALYZE) SELECT * FROM myltree WHERE path ~ '*.1953100';
-- Bitmap Heap Scan on public.myltree (cost=17.92..746.05 rows=195 width=62) (actual time=48.796..48.832 rows=1 loops=1)
--   Output: path
--   Recheck Cond: (myltree.path ~ '*.1953100'::lquery)
--   Heap Blocks: exact=1
--   -> Bitmap Index Scan on myltree_path_siglen128_idx (cost=0.00..17.88 rows=195 width=0) (actual time=48.770..48.778 rows=1 loops=1)
--        Index Cond: (myltree.path ~ '*.1953100'::lquery)
-- Planning Time: 0.203 ms
-- Execution Time: 48.995 ms
It picks the siglen=128 index instead of siglen=256, and the execution time doubles. Half the siglen, double the execution time.
I’ve summarized the siglen, execution time, and index size for our query below:
| siglen | Execution time | Index size |
|--------|----------------|------------|
| 8      | 508 ms         | 275 MB     |
| 128    | 49 ms          | 283 MB     |
| 256    | 27 ms          | 289 MB     |
| 512    | 17 ms          | 321 MB     |
| 1024   | 7 ms           | 341 MB     |
As a sidenote, the planner actually chose the siglen=8 index over the siglen=512 and siglen=1024 indexes, but it did always choose some index over a sequential scan.
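If you want to reproduce the timings for the larger siglen indexes yourself, one approach (roughly what the numbers above suggest) is to keep only the index under test and re-run the query. A sketch, here isolating the siglen=1024 index:

-- Keep only the index under test; the siglen=256 index was already dropped earlier.
DROP INDEX myltree_path_siglen8_idx;
DROP INDEX myltree_path_siglen128_idx;
DROP INDEX myltree_path_siglen512_idx;

EXPLAIN (VERBOSE, ANALYZE) SELECT * FROM myltree WHERE path ~ '*.1953100';
-- The plan should now show a Bitmap Index Scan on myltree_path_siglen1024_idx.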