Wednesday, February 14, 2007

Not All SQL Server Indexes Are Created Equal





by Brad M. McGehee





If you have much experience with indexes, you are probably already familiar with the difference between clustered and non-clustered indexes. But this article is not about that. This article is about whether or not the SQL Server Query Optimizer will use your carefully crafted indexes. You may not be aware of this, but just because a column has an index doesn't mean the Query Optimizer will use it. As you can imagine, creating an index that will never be used is a waste of time, and in the worst cases, it can even reduce the performance of your application. Let's learn why.



To start out, let's look at a simple example. Let's assume we have an accounting database. In that database is a table called "orders". Among a number of different columns in this table, we are interested in two: "orderid" and "employeeid". This table has 150,000 rows, and there is a non-clustered index on the "employeeid" column. Now let's say we want to run the following query:
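For readers who want to recreate the setup, a minimal sketch of the table and index described above might look like this (the column list and the index name are my own illustration; the article does not give the actual DDL):

```sql
-- Hypothetical DDL matching the example: an orders table with a
-- non-clustered index on the employeeid column.
CREATE TABLE orders (
    orderid    INT NOT NULL PRIMARY KEY,
    employeeid INT NOT NULL
    -- ...other columns omitted...
);

CREATE NONCLUSTERED INDEX IX_orders_employeeid
    ON orders (employeeid);
```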



SELECT orderid FROM orders WHERE employeeid = 5



The first thing to notice about the query is that the "employeeid" column used in the WHERE clause has a non-clustered index on it. Because of this, you would most likely assume that when this query is run, the Query Optimizer would use the index to produce the requested results.



Unfortunately, you can't automatically make this assumption. Just because there is an available index doesn't necessarily mean that the Query Optimizer will use it. This is because the Query Optimizer always evaluates whether or not an index is useful before using it. If the Query Optimizer examines an index and finds it not useful, it will ignore it, and if necessary, it will perform a table scan to produce the requested results.



So what is a useful index? To answer this question, we need to understand that one of the biggest underlying goals of the Query Optimizer is to reduce the amount of I/O, and the corresponding amount of time, it takes to execute a query. In other words, the Query Optimizer evaluates many different ways a query can be executed and picks the one it thinks will produce the least amount of I/O. But what may be surprising is that using an available index does not always result in the least amount of I/O. In many cases, especially with non-clustered indexes, a table scan can produce less I/O than an available index.



Before the Query Optimizer uses an index, it evaluates the index to see if it is selective enough. What does this mean? Selectivity refers to the percentage of rows in a table that are returned by a query. A query is considered highly selective if it returns a very limited number of rows, and is considered to have low selectivity if it returns a high percentage of rows. Generally speaking, if a query returns less than 5% of the rows in a table, it is considered highly selective, and the index will most likely be used. If the query returns from 5% to 10% of the rows, the index may or may not be used. If the query returns more than 10% of the rows, the index most likely will not be used, and, assuming there are no other useful indexes for the query, a table scan will be performed.



Let's return to our example query:



SELECT orderid FROM orders WHERE employeeid = 5



Just by looking at the query, we don't know if the available index on employeeid will be used or not. Let's say we know that, of the 150,000 rows in the table, "employeeid = 5" is true for 5,000 of them. If we divide 5,000 by 150,000, we get 3.3%. Since 3.3% is less than 5%, the Query Optimizer will most likely use the available index. But what if "employeeid = 5" is true for 25,000 rows instead? In this case, we divide 25,000 by 150,000 and get 16.6%. Since 16.6% is greater than 5%, or even 10%, the Query Optimizer most likely will not use the index and will instead perform a table scan.
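The same arithmetic can be run directly against the table. This is a sketch, assuming the orders table from the example:

```sql
-- Compute the selectivity of the predicate "employeeid = 5" by hand:
-- (matching rows / total rows) * 100.
SELECT
    COUNT(*)                                         AS matching_rows,
    (SELECT COUNT(*) FROM orders)                    AS total_rows,
    100.0 * COUNT(*) / (SELECT COUNT(*) FROM orders) AS pct_of_table
FROM orders
WHERE employeeid = 5;
```

If pct_of_table comes back well under 5, the index is a strong candidate to be used; well above 10, expect a table scan.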



So how can a table scan use less I/O than using an index, such as the non-clustered index in our example? Non-clustered indexes are great if the index is highly selective, especially if you will be returning one record. But if many records will be returned, and the index is not very selective, it is very expensive in I/O to retrieve the data. The reason for this is that the Query Optimizer has to first go to the index to locate the data (using up I/O) and then it has to go to the table to retrieve it (more I/O). At some point, the Query Optimizer determines that it takes less I/O to scan an entire table than it does to go back and forth between the index and the table to retrieve the requested rows.



The example given above applies mostly to non-clustered indexes. If the available index is a clustered index, then the index may be used, even if there is low selectivity, because the index is the table and I/O operations can be very efficient.



So how does the Query Optimizer know if an available index is selective enough to be useful? It does this by maintaining index statistics on each index in every table. Index statistics are a histogram of values that are stored in the sysindexes table. These statistics are a sampling of the available rows that tells the Query Optimizer approximately how selective a particular index is.



Index statistics are created whenever an index is created or rebuilt, whenever the UPDATE STATISTICS command is run, and automatically by the Query Optimizer as the need arises. Index statistics are not maintained in real time, because that would put too much overhead on the server. But because index statistics are not real-time, they can get out of date, and sometimes the Query Optimizer can make a poor choice because the statistics it uses are not current.
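If you suspect stale statistics are behind a poor plan choice, you can refresh them manually. Both commands below are standard T-SQL; the table name is the one from our running example:

```sql
-- Refresh the statistics on every index of a single table...
UPDATE STATISTICS orders;

-- ...or refresh statistics for all tables in the current database.
EXEC sp_updatestats;
```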



But just because statistics are current doesn't mean that the Query Optimizer will use an available index. Remember, the Query Optimizer bases its decision on the selectivity of an index, and it uses the index statistics to determine that selectivity.

So if the Query Optimizer can check whether a particular index is useful, how can we do the same thing? Fortunately, there is a command that lets us examine an index and find out if it is selective enough to be used.



The reason we want to know whether an index is selective enough is that if it isn't, it won't be used. And if an index won't be used, there is no point in having it. Most likely, dropping unnecessary indexes can boost the performance of your application, because indexes, as you probably know, slow down INSERTs, UPDATEs, and DELETEs in a table due to the overhead of maintaining them. And if the table in question is subject to a high level of change, index maintenance can be the cause of some bottlenecks. So our goal is to ensure that the indexes we do have are selective enough to be useful. We don't want to maintain indexes that won't be used.



The command we will use to find the selectivity of an index is:



DBCC SHOW_STATISTICS (table_name, index_name)




When this command runs, it produces output similar to the following. This is a real result based on one of the databases I maintain.



Statistics for INDEX 'in_tran_idx'.



Updated              Rows     Rows Sampled  Steps  Density       Average key length
-------------------- -------- ------------- ------ ------------- ------------------
Feb 24 2001  3:36AM  7380901  7163688       300    2.2528611E-5  0.0



(1 row(s) affected)



All density Columns

------------------------

2.2528611E-5 in_tran_key



Steps

-----------

1

283

301

340

371

403

456

...
44510



(300 row(s) affected)




DBCC execution completed. If DBCC printed error messages, contact your system administrator.



This result includes a lot of information, most of which is beyond the scope of this article. What we want to focus on is the density value "2.2528611E-5" under the "All density" column heading.



Density refers to the average percentage of duplicate rows in an index. If an indexed column, such as employeeid, contains a lot of duplicate data, then the index is said to have high density. But if an indexed column contains mostly unique data, then the index is said to have low density.



Density is inversely related to selectivity. If density is high, then selectivity is low, which means an index may not be used. If density is low, then selectivity is high, and an index most likely will be used.
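Because density is roughly the average fraction of the table that shares a single key value, you can approximate it as 1 divided by the number of distinct key values. A sketch, again assuming the example orders table:

```sql
-- Approximate index density: 1 / (number of distinct key values).
-- With 150,000 rows and, say, only 30 distinct employees, the density
-- works out to about 0.033 -- a dense, poorly selective index.
SELECT 1.0 / COUNT(DISTINCT employeeid) AS approx_density
FROM orders;
```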



In the sample printout above, the density for the index is far less than 1%. In turn, this means the selectivity of the index is over 99%, which means the index is probably very useful to the Query Optimizer.



If you are an advanced DBA, you will probably have noticed that I have oversimplified this discussion. Even so, the point I want to make in this article is still very valid: not all indexes are equal. Just because an index is available to the Query Optimizer does not mean it will always be used.



For DBAs, this means you need to be wary of your tables' indexes. As time permits, you may want to run the DBCC SHOW_STATISTICS command and see how selective your indexes actually are. You may find that some of your indexes are not being used. If this is the case, you may want to consider removing them, which in turn may speed up your application.
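On SQL Server 2005 and later, there is also a more direct way to see whether an index is actually being used: the sys.dm_db_index_usage_stats dynamic management view. (It is not available on SQL Server 2000, where DBCC SHOW_STATISTICS remains the main tool, and its counters reset when the server restarts.) A sketch:

```sql
-- Indexes with many writes but few or no reads are drop candidates.
SELECT OBJECT_NAME(s.object_id) AS table_name,
       i.name                   AS index_name,
       s.user_seeks, s.user_scans, s.user_lookups,
       s.user_updates           AS write_cost
FROM sys.dm_db_index_usage_stats AS s
JOIN sys.indexes AS i
  ON i.object_id = s.object_id AND i.index_id = s.index_id
WHERE s.database_id = DB_ID()
ORDER BY s.user_updates DESC;
```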



For new DBAs, removing rather than adding indexes may seem a backward way to performance-tune your database. But the more you learn about how SQL Server works internally, the better you will understand the limits of using indexes to performance-tune your applications.



To read about this subject in much greater depth, see Chapter 15, "The Query Processor," in the book Inside Microsoft SQL Server 2000, by Kalen Delaney.
