Rules of thumb

From PTAGISWiki

Jump to: navigation, search

Contents

Ingres Performance Tips / Rules of Thumb

Storage Structures

  I use BTREE for almost everything. 

  • I've found that unless you are just writing to a table that HEAP tables tend to do full table scans, even with indexes. 
  • ISAM works well but you can easily run into locking issues and contention due to overflow pages.  I've also seen much more sort activity that with BTREE.
  • I use HASH when I need to separate people for concurrency reasons (e.g., a trouble ticket table with 200 people accessing / updating the table at any one time), but have found that even for things that should be a direct match there are often scans on the HASH table and sorting due to the query plan selected.  Not a big deal, but because the HASH tables are generally larger this increases I/O.

If there is a good "primary key" that has a lot of leading-edge granularity I will often use this for the storage structure.  Sometimes it makes sense to use a "clustering key" to localize data and minimize I/O.  For example, at an insurance company years ago the policy table was 4+ GB and almost everything caused a table scan.  I modified this to BTREE on company (there were three companies), branch (each company had ~50-100 branches), and state.  There was a unique index on policy # that was used both for access and to maintain uniqueness.  This approach improved query performance by something like two orders of magnitude because we were doing more direct access and much less I/O.  

Constraints

  I tend to avoid these since they are implemented as secondary indexes and sometimes redundant to the storage structures and/or indexes selected.  

Indexing

1. I start with the columns referenced in the "where" clause of the query.  Whenever possible we want the query qualification criteria to be satisfied by an index.

2. I then look at columns referenced in the "select" portion of the query.  If an index can satisfy the query needs then it won't be necessary to visit the base table - saving time and improving performance.

3. I look at a mix of queries (using printqry output or tools that my old company has for query analysis) and try to determine how many indexes are needed. From a performance perspective 1-3 indexes is ideal, 4-7 will work OK for a production environment, and 8+ tends to cause performance problems.  Each index is a table that needs to be updated, and with 8+ indexes you will see 30%+ write overhead, but the bigger delay seems to be with the time the optimizer takes to select a good plan or timeout.

4. My rule of thumb is that an index should ideally be no more than 20% of the size of the base table.  This is because you quickly minimize the benefit of indexes - even when everything works properly, when the I/O on the index starts matching the I/O on the base table.

5. I avoid indexes that start out with the same columns as the base tables as they seem to be ignored unless they have a different storage structure than the base table.  When I do include those columns I make sure that they are in a different order.

6. I will try to have as much leading-edge granularity in the index as possible, because if you don't the index might be ignored and will definitely use more I/O to find what is needed.

7. I'm totally OK with overlap between columns in indexes.  Ingres will only use one index for a table, so I try to ensure that whatever index is selected there is sufficient data to optimize access and minimize I/O.

8. I will not use "non-key" columns in indexes except with HASH indexes.  I have found the overhead difference in BTREE indexes to be extremely minimal, and from a usability perspective it makes sense to have those extra columns visible from "help table" and not just from "help index".

Statistics

1. "Optimizedb -zk -zu200 -zr200" has been my friend for many years and almost always produces better stats than the default status due to increased histograms.

2. Whenever the "average count per value" is 20% or greater on a column I will remove stats on that column (statdump -zdl -rTable -aBadColumn).  Even if the stats are accurate it tends to skew the optimizer towards plans that involve table scans.

Validation

  The fastest way that I've found to do this is to get the printqry data and then 'grep "^Query [SR]" iiprtqry.log | more' and quickly scan the timestamps looking for gaps.  This becomes very easy (your eye / mind tends to notice the patterns amazingly well), which helps you identify potential problem queries.  If a QEP is not using the index that I feel it should I ask why?  Did it timeout?   Are the estimates reasonable?  (if not, look at the stats)  What was the query trying to do?  Why might the optimizer feel that the other approach was superior?  Usually after 2-3 iterations you have these things nailed and performing very efficiently.   I will also use DM420 on a production system to gather metrics before and after schema changes.  You need to make sure that you have a large enough window of activity for this to be fairly valid (say, morning, noon, and quitting time the day before the change, and then the same afterwards), and also have to make sure that the counters don't overflow and roll-over.

Personal tools