You're invited to submit your Oracle-related questions to us at askdbspecialists01@dbspecialists.com. Include your name, telephone number and problem description. If your question is answered in The Specialist next month, you will receive a free copy of our Oracle Wait Events reference poster. (US shipping addresses only).
This month’s question comes to us from Carl in Chicago: I’ve heard that Oracle recommends using DBMS_STATS instead of the ANALYZE TABLE command, but I’m wondering does it make any real difference in the statistics obtained?
Terry Sutton of the Database Specialists team responds:The ANALYZE command and the DBMS_STATS package offer two different ways to collect statistics about the objects in your database. Oracle’s query optimizer uses these statistics when determining the most efficient way to perform a query. For instance, if the statistics say one table has only 20 rows, then Oracle may opt for a full table scan instead of an index lookup when accessing that table. If the statistics on another table say a certain column has very few distinct values, Oracle may choose not to use an index on that column.
The DBMS_STATS package was introduced in Oracle 8i, and for the last few years Oracle Corporation has been strongly advising customers to use DBMS_STATS instead of ANALYZE. DBMS_STATS does have some useful features not available with ANALYZE, such as parallel statistics collection, gathering statistics only when existing statistics are stale, gathering partition-level and subpartition-level statistics, among others. One particularly valuable feature, available starting in Oracle 9i, is the GATHER_SYSTEM_STATS procedure. This enables the optimizer to include CPU costs in its calculations. This feature is best described in Jonathan Lewis’s article Understanding System Statistics.
But in addition to the new features, DBMS_STATS can sometimes produce better information. The statistics gathered by DBMS_STATS can be more accurate than those collected by ANALYZE, as we found recently at a client site. The following query:
SELECT ... FROM file_history WHERE fname = :b1
was performing full table scans of a table with 1.7 million rows, even though there was an index on the FNAME column and the dba_indexes view showed that the index was believed to have 1.7 million distinct values. But when we looked at the dba_tab_columns view, we saw:
COLUMN_NAME NUM_DISTINCT LAST_ANALYZED SAMPLE_SIZE
------------ ------------ ------------------ -----------
FNAME 1458 18-DEC-04 04:09:49 9050
A query of the FILE_HISTORY table showed 1,741,405 rows and 1,741,405 distinct values in the FNAME column. So the optimizer was choosing an execution plan for the query on the basis that there were only 1458 distinct values for the column in a 1.7 million row table. From this the optimizer determined that a full scan was the appropriate execution path. The client had been gathering statistics using the ANALYZE command with a sample size of 10,000 rows. Inaccurate statistics led to an inferior execution plan.
Using ANALYZE with a larger sample size helped:
ANALYZE TABLE file_history ESTIMATE STATISTICS SAMPLE 5 PERCENT;
The column statistics in dba_tab_columns now were:
COLUMN_NAME NUM_DISTINCT LAST_ANALYZED SAMPLE_SIZE
------------ ------------ ------------------ -----------
FNAME 10827 18-DEC-04 16:40:22 83096
An even larger sample size helped more:
ANALYZE TABLE file_history ESTIMATE STATISTICS SAMPLE 10 PERCENT;
COLUMN_NAME NUM_DISTINCT LAST_ANALYZED SAMPLE_SIZE
------------ ------------ ------------------ -----------
FNAME 17002 18-DEC-04 16:53:36 159672
But even using the ANALYZE command with a 10% sample gave us a fairly inaccurate count of 17,002 distinct values. So next we tried collecting statistics with the DBMS_STATS package:
EXECUTE dbms_stats.gather_table_stats (ownname=>'PROD', -
tabname=>'FILE_HISTORY',estimate_percent=>5,cascade=>true)
COLUMN_NAME NUM_DISTINCT LAST_ANALYZED SAMPLE_SIZE
------------ ------------ ------------------ -----------
FNAME 1737740 18-DEC-04 17:01:53 86887
We see that collecting statistics with the DBMS_STATS package and a sample size of 5% in this case yielded an estimate of distinct values within 0.2% of the actual figure. This is certainly good enough for the optimizer, while the statistics collected by the ANALYZE command were off by a couple orders of magnitude.
Please note that there are a few statistics that the DBMS_STATS package does not gather. The values in the dba_tables view for CHAIN_CNT, AVG_SPACE, and EMPTY_BLOCKS are not updated by DBMS_STATS, so if you want these statistics to be updated you’ll still have to use ANALYZE. But generally these statistics do not need to be as current as those used by the optimizer, so an occasional ANALYZE can be performed if you want these statistics, while DBMS_STATS can be used more frequently to provide the best information for the query optimizer.
In summary, the DBMS_STATS package offers many features not available with the ANALYZE command. Furthermore, there are times when DBMS_STATS has been seen to collect far more accurate statistics. For these reasons, moving to DBMS_STATS seems like a good idea. However, if you are interested in statistics on chained rows, average space within each block below the high water mark, or number of blocks in allocated extents above the high water mark, you’ll need to use the ANALYZE command to collect this information.