Indexing ltree with GIST
§Sidetracked?
Hi.
This is a dive into GIST indexes on the ltree
datatype in PostgreSQL.
I assume you came from the parent post, but if not, you probably want to read that first for context.
Done? OK.
§GIST index on ltree
Looking up the ltree row was much slower than looking up the bigint row. Why is that?
Let’s ask the database:
EXPLAIN SELECT * FROM mytree WHERE id = 1953100;
-- Index Scan using mytree_pkey on mytree (cost=0.43..8.45 rows=1 width=16)
--   Index Cond: (id = 1953100)
Sure, using the primary key index to find the row by id seems reasonable enough.
EXPLAIN SELECT * FROM myltree WHERE path ~ '*.1953100';
-- Seq Scan on myltree (cost=0.00..47037.75 rows=195 width=62)
--   Filter: (path ~ '*.1953100'::lquery)
Oh! It turns out we are scanning through the entire table to find our one row (the Seq Scan part).
The path column does have a btree index (it is our primary key), but to use it we need to know the full path so we can use the = operator.
If we use the full path of the row, the query plan is practically identical to our mytree example, and runs in about 2 ms.
EXPLAIN (VERBOSE, ANALYZE) SELECT * FROM myltree WHERE path = '100.600.3100.15600.78100.390600.1953100';
-- Index Only Scan using myltree_pkey on public.myltree (cost=0.55..4.57 rows=1 width=62) (actual time=0.111..0.131 rows=1 loops=1)
--   Output: path
--   Index Cond: (myltree.path = '100.600.3100.15600.78100.390600.1953100'::ltree)
--   Heap Fetches: 0
-- Planning Time: 1.131 ms
-- Execution Time: 0.294 ms
If you treat the path
column as your id, then you will have the full path when
you query the table, and this won’t really be a problem.
ltree also allows using an index for this kind of query, but we must create a GIST index on path.
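A minimal version looks something like this (the index name matches the myltree_path_idx the planner reports below; with no siglen specified we get the default operator class settings):

-- Default GIST index on path, no siglen specified:
CREATE INDEX myltree_path_idx ON myltree USING GIST (path);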
Now our query from before, searching by the last label in path, uses our new index, but it’s still pretty slow!
EXPLAIN (VERBOSE, ANALYZE) SELECT * FROM myltree WHERE path ~ '*.1953100';
-- Bitmap Heap Scan on public.myltree (cost=17.92..746.05 rows=195 width=62) (actual time=460.318..460.344 rows=1 loops=1)
--   Output: path
--   Recheck Cond: (myltree.path ~ '*.1953100'::lquery)
--   Heap Blocks: exact=1
--   -> Bitmap Index Scan on myltree_path_idx (cost=0.00..17.88 rows=195 width=0) (actual time=460.291..460.297 rows=1 loops=1)
--        Index Cond: (myltree.path ~ '*.1953100'::lquery)
-- Planning Time: 0.209 ms
-- Execution Time: 460.521 ms
That didn’t improve much. Turns out that for our specific table,
Bitmap Index Scan
is not much faster than the sequential scan, but that’s with the default GIST index.
A GIST index is a lossy index, meaning it can produce false matches. Rows returned by a lossy index need to be checked against the actual table row before being returned as a result (Postgres does this for you; that is the Recheck Cond step in the plan above).
We can adjust the siglen parameter of the GIST index to control how precise our search will be.
A larger siglen means a more precise index, but also a larger index, so we need to balance
these against each other.
See also ltree Indexes and Text Search Indexes from the manual.
Let’s do an experiment. We’ll create a few different indexes and check their sizes and performance to validate the claim above that a larger siglen yields more precise results at the cost of a larger index:
-- Drop the default index from before so it doesn't compete:
DROP INDEX myltree_path_idx;

-- Then create our new ones:
CREATE INDEX myltree_path_siglen8_idx ON myltree USING GIST (path gist_ltree_ops(siglen=8));
CREATE INDEX myltree_path_siglen128_idx ON myltree USING GIST (path gist_ltree_ops(siglen=128));
CREATE INDEX myltree_path_siglen256_idx ON myltree USING GIST (path gist_ltree_ops(siglen=256));
CREATE INDEX myltree_path_siglen512_idx ON myltree USING GIST (path gist_ltree_ops(siglen=512));
CREATE INDEX myltree_path_siglen1024_idx ON myltree USING GIST (path gist_ltree_ops(siglen=1024));

-- and check their sizes:
SELECT
  pg_size_pretty(pg_relation_size('myltree')) AS myltree,
  pg_size_pretty(pg_relation_size('myltree_path_siglen8_idx')) AS myltree_path_siglen8_idx,
  pg_size_pretty(pg_relation_size('myltree_path_siglen128_idx')) AS myltree_path_siglen128_idx,
  pg_size_pretty(pg_relation_size('myltree_path_siglen256_idx')) AS myltree_path_siglen256_idx,
  pg_size_pretty(pg_relation_size('myltree_path_siglen512_idx')) AS myltree_path_siglen512_idx,
  pg_size_pretty(pg_relation_size('myltree_path_siglen1024_idx')) AS myltree_path_siglen1024_idx;

-- myltree: 177 MB
-- myltree_path_siglen8_idx: 275 MB
-- myltree_path_siglen128_idx: 283 MB
-- myltree_path_siglen256_idx: 289 MB
-- myltree_path_siglen512_idx: 321 MB
-- myltree_path_siglen1024_idx: 341 MB
Our table itself is 177 MB. The smallest index is 275 MB at a siglen of 8, and the indexes grow larger as we increase siglen, so the manual is definitely correct in that regard.
Are the larger indexes any faster though? Or maybe slower, because we have to look at more data?
Let’s test it. I wonder which index Postgres will actually choose to use.
Same query as before:
EXPLAIN (VERBOSE, ANALYZE) SELECT * FROM myltree WHERE path ~ '*.1953100';
-- Bitmap Heap Scan on public.myltree (cost=17.92..746.05 rows=195 width=62) (actual time=26.302..26.338 rows=1 loops=1)
--   Output: path
--   Recheck Cond: (myltree.path ~ '*.1953100'::lquery)
--   Heap Blocks: exact=1
--   -> Bitmap Index Scan on myltree_path_siglen256_idx (cost=0.00..17.88 rows=195 width=0) (actual time=26.274..26.283 rows=1 loops=1)
--        Index Cond: (myltree.path ~ '*.1953100'::lquery)
-- Planning Time: 0.200 ms
-- Execution Time: 26.520 ms
Much faster at 26 ms, and as we can see from Bitmap Index Scan on myltree_path_siglen256_idx, it chose the siglen=256 index. Not the smallest, not the largest.
What happens if we drop that index and try again?
DROP INDEX myltree_path_siglen256_idx;
EXPLAIN (VERBOSE, ANALYZE) SELECT * FROM myltree WHERE path ~ '*.1953100';
-- Bitmap Heap Scan on public.myltree (cost=17.92..746.05 rows=195 width=62) (actual time=48.796..48.832 rows=1 loops=1)
--   Output: path
--   Recheck Cond: (myltree.path ~ '*.1953100'::lquery)
--   Heap Blocks: exact=1
--   -> Bitmap Index Scan on myltree_path_siglen128_idx (cost=0.00..17.88 rows=195 width=0) (actual time=48.770..48.778 rows=1 loops=1)
--        Index Cond: (myltree.path ~ '*.1953100'::lquery)
-- Planning Time: 0.203 ms
-- Execution Time: 48.995 ms
It picks the siglen=128 index instead of siglen=256, and the execution time doubles. Half the siglen, double the execution time.
I’ve summarized the siglen, execution time, and index size for our query below:
| siglen | Execution time | Index size |
|--------|----------------|------------|
| 8      | 508 ms         | 275 MB     |
| 128    | 49 ms          | 283 MB     |
| 256    | 27 ms          | 289 MB     |
| 512    | 17 ms          | 321 MB     |
| 1024   | 7 ms           | 341 MB     |
As a sidenote, the planner actually chose the siglen=8 index over the siglen=512 and siglen=1024 indexes, but it did always choose some index over a sequential scan.
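If you want to reproduce the timings for the larger siglen indexes yourself, one approach (roughly what the numbers above suggest) is to keep only the index under test and re-run the query. A sketch, here isolating the siglen=1024 index:

-- Keep only the index under test; the siglen=256 index was already dropped earlier.
DROP INDEX myltree_path_siglen8_idx;
DROP INDEX myltree_path_siglen128_idx;
DROP INDEX myltree_path_siglen512_idx;

EXPLAIN (VERBOSE, ANALYZE) SELECT * FROM myltree WHERE path ~ '*.1953100';
-- The plan should now show a Bitmap Index Scan on myltree_path_siglen1024_idx.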