Sunday, July 5, 2015

Postgres indexes

Recently, I had a situation where I needed to think about how I was using Postgres indexes. I had a simple Book table with the following schema:
>\d book

                  Table "shopping.book"
       Column        |          Type          | Modifiers 
---------------------+------------------------+-----------
 id                  | uuid                   | not null
 version             | bigint                 | not null
 amount_minor_units  | integer                | not null
 currency            | character varying(255) | not null
 author              | character varying(255) | not null
 publisher           | character varying(255) |    
The author and publisher columns were just string references to Author and Publisher entities that lived on another system. That meant classical foreign keys couldn't be used and these columns were just normal columns.
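For reference, a table along these lines could be created with DDL like the following. This is a sketch; the original CREATE TABLE statement isn't shown in the post, so the primary key and NOT NULL constraints are assumptions inferred from the \d output above.

```sql
-- Hypothetical DDL matching the \d output; constraints are assumptions.
CREATE TABLE shopping.book (
    id                 uuid NOT NULL PRIMARY KEY,
    version            bigint NOT NULL,
    amount_minor_units integer NOT NULL,
    currency           character varying(255) NOT NULL,
    author             character varying(255) NOT NULL,
    publisher          character varying(255)
);
```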

I needed to get an idea how the table would perform with lots of data, so first up some simple SQL to put in lots of test data:

> CREATE EXTENSION "uuid-ossp";
> insert into book (id, version, amount_minor_units, currency, author, publisher) 
select uuid_generate_v4(), 2, 22, 'USD', 'author' || x.id, 'publisher' || x.id from generate_series(1,1000000) AS x(id); 
This table was going to be hit lots of times with this simple query:
select * from book where author = 'Tony Biggins' and publisher='Books unlimited';
To get the explain plan, I did:
dublintech=> EXPLAIN (FORMAT JSON) select * from book where author = 'Tony Biggins' and publisher = 'Books unlimited';
                                                            QUERY PLAN                                                             
-----------------------------------------------------------------------------------------------------------------------------------
 [                                                                                                                                +
   {                                                                                                                              +
     "Plan": {                                                                                                                    +
       "Node Type": "Seq Scan",                                                                                                   +
       "Relation Name": "book",                                                                                                   +
       "Alias": "book",                                                                                                           +
       "Startup Cost": 0.00,                                                                                                      +
       "Total Cost": 123424.88,                                                                                                   +
       "Plan Rows": 1,                                                                                                            +
       "Plan Width": 127,                                                                                                         +
       "Filter": "(((author)::text = 'Tony Biggins'::text) AND ((publisher)::text = 'Books unlimited'::text))"+
     }                                                                                                                            +
   }                                                                                                                              +
 ]
(1 row)
As can be seen, Postgres is doing a Seq Scan, a.k.a. a table scan. I wanted to speed things up. There was only one index on the table, which was for the id. This was a conventional B-Tree index, useless for this query since id wasn't even in the where clause. Some of the options I was thinking about:
  • Create an index on author or publisher
  • Create an index on author and create an index on publisher
  • Create a combination index on both author and publisher.
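In SQL terms, the candidate indexes look something like this (the index names here are just my own, to be created in the experiments that follow):

```sql
-- Options 1 and 2: single-column B-Tree indexes
CREATE INDEX author_idx ON book (author);
CREATE INDEX publisher_idx ON book (publisher);
-- Option 3: combination (multi-column) index
CREATE INDEX author_publisher_idx ON book (author, publisher);
```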
Hmmm... let the investigations begin. Start by indexing just author.
dublintech=> create index author_idx on book(author);
dublintech=> EXPLAIN (FORMAT JSON) select * from book where publisher = 'publisher3' and author='author3';
                                  QUERY PLAN                                   
-------------------------------------------------------------------------------
 [                                                                            +
   {                                                                          +
     "Plan": {                                                                +
       "Node Type": "Index Scan",                                             +
       "Scan Direction": "Forward",                                           +
       "Index Name": "author_idx",                                  +
       "Relation Name": "book",                                               +
       "Alias": "book",                                                       +
       "Startup Cost": 0.42,                                                  +
       "Total Cost": 8.45,                                                    +
       "Plan Rows": 1,                                                        +
       "Plan Width": 127,                                                     +
       "Index Cond": "((author)::text = 'author3'::text)",+
       "Filter": "((publisher)::text = 'publisher3'::text)"                     +
     }                                                                        +
   }                                                                          +
 ]
(1 row)
As can be seen, Postgres performs an index scan and the total cost is much lower than for the same query using a table scan. What about the multiple-column index approach? Surely, since both columns are used in the query, it should be faster again, right?
dublintech=> create index author_publisher_idx on book(author, publisher);
CREATE INDEX
dublintech=> EXPLAIN (FORMAT JSON) select * from book where publisher = 'publisher3' and author='author3';
                                                        QUERY PLAN                                                         
---------------------------------------------------------------------------------------------------------------------------
 [                                                                                                                        +
   {                                                                                                                      +
     "Plan": {                                                                                                            +
       "Node Type": "Index Scan",                                                                                         +
       "Scan Direction": "Forward",                                                                                       +
       "Index Name": "author_publisher_idx",                                                                     +
       "Relation Name": "book",                                                                                           +
       "Alias": "book",                                                                                                   +
       "Startup Cost": 0.42,                                                                                              +
       "Total Cost": 8.45,                                                                                                +
       "Plan Rows": 1,                                                                                                    +
       "Plan Width": 127,                                                                                                 +
       "Index Cond": "(((author)::text = 'author3'::text) AND ((publisher)::text = 'publisher3'::text))"+
     }                                                                                                                    +
   }                                                                                                                      +
 ]
(1 row)
This time Postgres uses the multi-column index, but the query doesn't go any faster. But why? Recall how we populated the table.
insert into book (id, version, amount_minor_units, currency, author, publisher) 
select uuid_generate_v4(), 2, 22, 'USD', 'author' || x.id, 'publisher' || x.id from generate_series(1,1000000) AS x(id); 
There are lots of rows, but every row has a unique author value and a unique publisher value. That means the author index alone should perform just as well for this query. An analogy: you go into a music shop looking for a new set of loudspeakers that someone has told you to buy, with a particular cost and a particular power output (number of watts). When you enter the shop, you see the speakers are nicely ordered by cost, and no two sets of loudspeakers have the same cost. Think about it. Are you going to find the speakers any faster if you search by cost alone or by cost and power output?

Now, imagine the case if lots of the loudspeakers were the same cost. Then of course using both the cost and the power will be faster.

Now, let's take this point to the extreme with our test data. Suppose all the authors were the same. The author index becomes useless, and without the author / publisher combination index we would be back to a table scan.

-- drop combination index and just leave author index on table 
dublintech=> drop index author_publisher_idx;
DROP INDEX
dublintech=> update book set author='author3';
dublintech=> EXPLAIN (FORMAT JSON) select * from book where publisher = 'publisher3' and author='author3';
                                                      QUERY PLAN                                                       
-----------------------------------------------------------------------------------------------------------------------
 [                                                                                                                    +
   {                                                                                                                  +
     "Plan": {                                                                                                        +
       "Node Type": "Seq Scan",                                                                                       +
       "Relation Name": "book",                                                                                       +
       "Alias": "book",                                                                                               +
       "Startup Cost": 0.00,                                                                                          +
       "Total Cost": 153088.88,                                                                                       +
       "Plan Rows": 1,                                                                                                +
       "Plan Width": 123,                                                                                             +
       "Filter": "(((publisher)::text = 'publisher3'::text) AND ((author)::text = 'author3'::text))"+
     }                                                                                                                +
   }                                                                                                                  +
 ]
(1 row)
So we can conclude from this that single-column indexes can perform as well as combination indexes for combination searches when there is a huge degree of variance in the data of that single column. When there isn't, they won't perform as well, and a combination index should be used. Yes, I have tested by going to extremes, but that is the best way to make the principles clear. And please note: for the case of maximum variance, adding another index on the other column in the where clause, publisher, made no difference. This is as expected.
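Rather than guessing how much variance a column has, you can ask Postgres itself. After an ANALYZE, the pg_stats view exposes the planner's estimate of distinct values per column; an n_distinct of -1 means the planner believes every row has a distinct value, while a small positive number means low variance:

```sql
ANALYZE book;
-- n_distinct = -1: all values estimated unique (maximum variance);
-- small positive n_distinct: few distinct values (low variance).
SELECT attname, n_distinct
FROM pg_stats
WHERE tablename = 'book' AND attname IN ('author', 'publisher');
```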

Ok, let's stick with the case where there is massive variance in the data values of the column, and suppose the query only ever involves exact matching. In this case, all author values are guaranteed to be unique and you never have any interest in doing anything like less-than or greater-than comparisons. So why not use a hash index instead of a B-Tree index?

dublintech=> create index author_hash on book using hash (author);
dublintech=> EXPLAIN (FORMAT JSON) select * from book where publisher = 'publisher3' and author='author3';
                                  QUERY PLAN                                   
-------------------------------------------------------------------------------
 [                                                                            +
   {                                                                          +
     "Plan": {                                                                +
       "Node Type": "Index Scan",                                             +
       "Scan Direction": "NoMovement",                                        +
       "Index Name": "author_hash",                                 +
       "Relation Name": "book",                                               +
       "Alias": "book",                                                       +
       "Startup Cost": 0.00,                                                  +
       "Total Cost": 8.02,                                                    +
       "Plan Rows": 1,                                                        +
       "Plan Width": 127,                                                     +
       "Index Cond": "((author)::text = 'author3'::text)",+
       "Filter": "((publisher)::text = 'publisher3'::text)"                     +
     }                                                                        +
   }                                                                          +
 ]
(1 row)
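A couple of practical notes before wrapping up. First, in the Postgres versions of this era (pre-10), hash indexes were not WAL-logged, so they were not crash-safe and had to be rebuilt after a crash; that alone often ruled them out in production. Second, if disk footprint matters, you can compare what each index type costs on disk. A sketch, using the index names created above:

```sql
-- Compare on-disk size of the B-Tree and hash indexes on author.
SELECT pg_size_pretty(pg_relation_size('author_idx'))  AS btree_size,
       pg_size_pretty(pg_relation_size('author_hash')) AS hash_size;
```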
Interesting, we have gone faster again. Not a massive difference this time around, but an improvement nonetheless that could become more relevant with more data growth and / or when a more complex query requiring more computation is involved. We can safely conclude from this that if you are only interested in exact matches, the hash index beats the B-Tree index. Until the next time, take care of yourselves.

References:
  • http://www.postgresql.org/docs/9.1/static/using-explain.html
  • http://www.postgresql.org/docs/9.3/static/indexes-bitmap-scans.html