Notes on histograms

By Brian Fitzgerald

This blog post is a worked example of Oracle histograms. In this case, a frequency histogram is demonstrated in the context of a single table query.

Setup

cr.histodemo.sql

Table ords. Intended column data distribution:

10000 rows
10000 unique ordid
1000 categories, evenly distributed
2 distinct status, skewed
99.99% COMPLETE
0.01% PENDING

One index on each column, ORDID, CATEGORYID, and STATUS.

Not null constraints are there to simplify the demonstration.

whenever oserror exit 1
@ sqlerc.sql
drop table ords purge;
@ sqlerx.sql

create table ords
(
ordid number not null,
categoryid number not null,
status varchar2(10) not null
);

insert into ords ( ordid, categoryid, status )
select
level,
mod( level, 1000 ),
case mod( level, 10000) when 0 then 'PENDING' else 'COMPLETE' end
from dual
connect by level <= 10000;

alter table ords add constraint ords primary key ( ordid );
create index category_idx on ords ( categoryid );
create index status_idx on ords ( status );

quit

output:

[oracle@stormking db12201 stats]$ sqlplus u/u @ cr.histodemo.sql
SQL*Plus: Release 12.2.0.1.0 Production on Sun Feb 11 11:37:31 2018
Copyright (c) 1982, 2016, Oracle. All rights reserved.
Last Successful login time: Sun Feb 11 2018 00:10:42 -05:00
Connected to:
Oracle Database 12c Enterprise Edition 
Release 12.2.0.1.0 - 64bit Production
Table dropped.
Table created.
10000 rows created.
Table altered.
Index created.
Index created.
Disconnected from Oracle Database 12c Enterprise Edition 
Release 12.2.0.1.0 - 64bit Production

Execution scripts

query.ords.sql

Thwo queries on table ords.

  1. equality predicate on STATUS only
  2. equality predicate on STATUS and CATEGORYID

The 10053 trace shows optimizer’s rationale.

set linesize 100
@ sqlerx.sql
column plan_table_output format a80

@ set.tracefile.identifier.sql status
@ trace.10053.on.sql

select ordid from ords
where status = 'PENDING'
;
select * from table(dbms_xplan.display_cursor( format=>'basic' ));
@ trace.10053.off.sql

@ set.tracefile.identifier.sql cat
@ trace.10053.on.sql

select ordid from ords
where categoryid = 0
and status = 'COMPLETE'
;
select * from table(dbms_xplan.display_cursor( format=>'basic' ));
@ trace.10053.off.sql

rpt.col.usage.sql

Script to show what types of workload has run on the columns. Database monitoring is flushed for the sake of displaying the col_usage report before gathering stats in this demonstration. Gather stats automatically flushes column usagem so alling flush_database_monitoring_info is generally not required.

define ownname=&&1
define tabname=&&2
whenever oserror exit 1
@ sqlerx.sql
@ l32k.p50k.sql

set verify off

set long 2000000000
exec dbms_stats.flush_database_monitoring_info;
column rpt format a160
select dbms_stats.report_col_usage('&&ownname','&&tabname') rpt from dual;

gather.table.stats.sql

Invalidation, in this case, is to force a 10053 trace for the sake of this demonstration. The 10046 trace and tkprof is for checking what queries on ords get run during gather stats. With the exception of no_invalidate, dbms_stats is run with the default options to show how histograms are gathered automatically during routine database maintenance. Estimate_percent defaults to auto_sample_size. Gathering or deleting a histogram can be forced by using option method_opt.

define ownname=&&1
define tabname=&&2
whenever oserror exit 1
@ sqlerx.sql
set verify off

@ set.tracefile.identifier.sql stats
@ trace.10046.on.sql

begin
 dbms_stats.gather_table_stats (
 ownname => '&&ownname',
 tabname => '&&tabname',
 no_invalidate => false
 );
end;
/
prompt done gather stats on &&ownname..&&tabname

@ trace.10046.off.sql
@ tkprof.sql

user.tab.statistics.sql

Script to show the existence or nonexistence of a histogram

define tabname=&&1
whenever oserror exit 1
@ sqlerx.sql
@ l32k.p50k.sql
@ columnformat.sql

select
table_name,
blocks,
num_rows
from user_tab_statistics
where
table_name = '&&tabname';

select
index_name,
leaf_blocks,
num_rows,
distinct_keys
from user_ind_statistics
where table_name = '&&tabname';

select
column_name,
num_distinct,
density,
histogram,
num_buckets
from user_tab_col_statistics
where table_name = '&&tabname';

run.histodemo.sql

The demonstration script

whenever oserror exit 1
@ spoolfile.uniq.sql ords

@ rpt.col.usage.sql U ORDS
@ gather.table.stats.sql U ORDS
@ user.tab.statistics.sql ORDS
@ query.ords.sql

@ spooloff.sql
quit

Execution run 1

Files

ords.db12201.20180211.142321.84.txt

run.histodemo.sql output

db12201_ora_20225_STATS.trc

10043 trace of gather table stats

db12201_ora_20225_STATS.tkp

tkprof of gather table stats

db12201_ora_20225_STATUS.trc

10053 trace of “where status = ‘PENDING'”

db12201_ora_20225_CAT.trc

10053 trace of “where categoryid = 0 and status = ‘COMPLETE'”

Highlights

  • Because table ords is newly created, the column usage report shows no workload
  • Gather stats ran four queries on table ords, namely a full table scan and one per index.
$ grep -i '"U"."ORDS" t' db12201_ora_20225_STATS.tkp
 "U"."ORDS" t /* ACL,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL*/
 "U"."ORDS" t where "ORDID" is not null
 "U"."ORDS" t where "CATEGORYID" is not null
 "U"."ORDS" t where "STATUS" is not null
  • No histogram was gathered
  • ORDID is most selective column, having the highest num_distinct and lowest density
  • CATEGORYID is medium-selective
  • STATUS is not, in general, selective, having only 2 distinct values
COLUMN_NAME NUM_DISTINCT DENSITY HISTOGRAM NUM_BUCKETS
------------------------------ ------------ ---------- -------------------
ORDID 10000 .0001 NONE 1
CATEGORYID 1000 .001 NONE 1
STATUS 2 .5 NONE 1
  • No histogram is present on status. Density is computed:

density = 1 / num_distinct = 1/2 = .5

  • “where status = ‘PENDING'” catches a full table scan
----------------------------------
| Id | Operation | Name |
----------------------------------
| 0 | SELECT STATEMENT | |
| 1 | TABLE ACCESS FULL| ORDS |
----------------------------------
  • “where categoryid = 0 and status = ‘COMPLETE'” catches an index range scan
------------------------------------------------------------
| Id | Operation | Name |
------------------------------------------------------------
| 0 | SELECT STATEMENT | |
| 1 | TABLE ACCESS BY INDEX ROWID BATCHED| ORDS |
| 2 | INDEX RANGE SCAN | CATEGORY_IDX |
-----------------------------------------------------------

Execution run 2

Files

ords.db12201.20180211.142525.84.txt

run.histodemo.sql output for run 2.

db12201_ora_20607_STATS.trc

db12201_ora_20607_STATS.tkp

Gather stats 10046 trace and tkprof.

db12201_ora_20607_STATUS.trc

db12201_ora_20607_CAT.trc

10053 trace again for the queries, “where status” and “where categoryid …
and status”.

Highlights

  • The column usage report shows that equality predicates have run on columns CATEGORYID and STATUS.
COLUMN USAGE REPORT FOR U.ORDS
..............................

1. CATEGORYID : EQ
2. STATUS : EQ
  • This time, gather stats ran six queries on table ords.
$ grep -i '"U"."ORDS" t' db12201_ora_20607_STATS.tkp | wc -l
6
  • gather table stats ran this query on column STATUS:
select 
/*+ no_parallel(t) no_parallel_index(t) dbms_stats 
cursor_sharing_exact use_weak_name_resl dynamic_sampling(0) 
no_monitoring xmlindex_sel_idx_tbl 
opt_param('optimizer_inmemory_aware' 'false') no_substrb_pad 
*/
 substrb(dump("STATUS",16,0,64),1,240) val,
 rowidtochar(rowid) rwid 
from "U"."ORDS" t 
where rowid in (
chartorowid('AAAWt1AAHAAAAFjAAA'),
chartorowid('AAAWt1AAHAAAAGCAAv')) 
order by "STATUS"

Clearly, the only purpose is to obtain the two unique STATUS values.

  • Gather stats ran this query on column CATEGORYID:
select 
substrb(dump(val,16,0,64),1,240) ep, 
freq, 
cdn, 
ndv,
(sum(pop) over()) popcnt,
(sum(pop*freq) over()) popfreq,
substrb(dump(max(val) over(),16,0,64),1,240) maxval,
substrb(dump(min(val) over(),16,0,64),1,240) minval
from
 (
select val, 
freq, 
(sum(freq) over()) cdn, 
(count(*) over()) ndv,
(case when freq > ((sum(freq) over())/254) then 1 else 0 end) pop
from (select
  /*+ no_parallel(t) no_parallel_index(t) dbms_stats
    cursor_sharing_exact use_weak_name_resl dynamic_sampling(0) 
    no_monitoring xmlindex_sel_idx_tbl 
    opt_param('optimizer_inmemory_aware' 'false') no_substrb_pad
 */
  "CATEGORYID" val, 
  count("CATEGORYID") freq 
from "U"."ORDS" t 
where "CATEGORYID" is not null 
group by "CATEGORYID"
)) order by val

Inference: dbms_stats is collecting information needed to construct a histogram on STATUS and CATEGORYID.

Notice the column name “ep”, which could refer to a histogram endpoint value.

“pop” = 1 if popular, 0 if non-popular.

  • dbms_stats stored a histogram for column status but not for categoryid
COLUMN_NAME NUM_DISTINCT DENSITY HISTOGRAM NUM_BUCKETS
------------------------------ ------------ ---------- ------- -----------
ORDID 10000 .0001 NONE 1
CATEGORYID 1000 .001 NONE 1
STATUS 2 .00005 FREQUENCY 2

Inference: dbms_stats found skew in status but not in categoryid

  • The reported density on column status has changed:
82c84
< STATUS 2 .5 NONE 1
---
> STATUS 2 .00005 FREQUENCY 2

According to 2007 note New Density calculation in 11g by Alberto Dell’Era, density = 0.5 / num_rows for frequency histograms. Dell’Era’s paper refers only to not null values. In other words, for frequency histograms, density, as stored in the catalog, should be

density = .5 / num_values = .5 / ( num_rows – num_nulls )

I have found this to be accurate to within 2.5e-15 for 99% out of the 30,000 frequency histograms in one example database, where the histograms were mostly gathered with Oracle 11.1.0.7 binaries. After gathering statistics with Oracle 12.1.0.2 binaries, dba_tab_col_statistics density is found to agree with formula .5 / ( num_rows – num_nulls ) to within 2.5e-15 for 99.95% of all frequency histograms.This formula is found to apply only for the cases where num_distinct >1. For the remaining 0.05%, the density is within 2% of the formula. All the discrepant densities are IOTs; however, not all IOTs are discrepant, so I must be missing some detail.

In this example, num_values = 10000 – 0 = 10000. Density = .5 / 10000 = 0.00005.

The endpoint values are, roughly, using cre_hist_funcs.sql, function hist_numtochar, by Martin Widlake:

ENDPOINT_NUMBER VAL
--------------- ----------
 9999 COMPLEP
 10000 PENDIN<

Comment: density is used to cost predicates on non-popular values.

“where status = ‘COMPLETE'” remains non-selective.

“where status =’PENDING'” is selective.

Compare run without histogram vs run with histogram

where status =’PENDING’

diff run1/db12201_ora_20225_STATUS.trc run2/db12201_ora_20607_STATUS.trc

Output here: 10053 diff: where status = ‘PENDING’

There several points of interest in the 10053 trace comparison

  • With the histogram, the selectivity of status=’PENDING’ is 1.0000e-4, which is exactly correct.
875,876c877,879
< AvgLen: 9 NDV: 2 Nulls: 0 Density: 0.500000
< Estimated selectivity: 0.500000 , col: #3
---
> AvgLen: 9 NDV: 2 Nulls: 0 Density: 0.000050
> Histogram: Freq #Bkts: 2 UncompBkts: 10000 EndPtVals: 2 ActualVal: no
> Estimated selectivity: 1.0000e-04 , endpoint value predicate, col: #3
  • The plan changes thus:
1016,1021c1028,1034
< -------------------------------------+-----------------------------------+
< | Id | Operation | Name | Rows | Bytes | Cost | Time |
< -------------------------------------+-----------------------------------+
< | 0 | SELECT STATEMENT | | | | 11 | |
< | 1 | TABLE ACCESS FULL | ORDS | 5000 | 63K | 11 | 00:00:01 |
< -------------------------------------+-----------------------------------+
---
> ---------------------------------------------------------+-----------------------------------+
> | Id | Operation | Name | Rows | Bytes | Cost | Time |
> ---------------------------------------------------------+-----------------------------------+
> | 0 | SELECT STATEMENT | | | | 2 | |
> | 1 | TABLE ACCESS BY INDEX ROWID BATCHED | ORDS | 1 | 13 | 2 | 00:00:01 |
> | 2 | INDEX RANGE SCAN | STATUS_IDX| 1 | | 1 | 00:00:01 |
> ---------------------------------------------------------+-----------------------------------+

where categoryid = 0 and status = ‘COMPLETE’

diff run1/db12201_ora_20225_CAT.trc run2/db12201_ora_20607_CAT.trc

output here: 10053 diff: where categoryid and status

  • With a histogram, the estimated cost of using an index to access a popular value is higher. All the more reason not to use STATUS_IDX
 917 Access Path: index (AllEqRange)
 918 Index: STATUS_IDX
919,921c922,924
< resc_io: 30.000000 resc_cpu: 2414493
< ix_sel: 0.500000 ix_sel_with_filters: 0.500000
< Cost: 30.064126 Resp: 30.064126 Degree: 1
---
> resc_io: 60.000000 resc_cpu: 4827696
> ix_sel: 0.999900 ix_sel_with_filters: 0.999900
> Cost: 60.128217 Resp: 60.128217 Degree: 1
  • The cost of CATEGORY_IDX is unchanged.
 904 ****** Costing Index CATEGORY_IDX
 905 SPD: Return code in qosdDSDirSetup: NOCTX, estType = INDEX_SCAN
 906 SPD: Return code in qosdDSDirSetup: NOCTX, estType = INDEX_FILTER
 907 Estimated selectivity: 0.001000 , col: #2
 908 Access Path: index (AllEqRange)
 909 Index: CATEGORY_IDX
 910 resc_io: 11.000000 resc_cpu: 83586
 911 ix_sel: 0.001000 ix_sel_with_filters: 0.001000
 912 Cost: 11.002220 Resp: 11.002220 Degree: 1
  • CATEGORY_IDX is more selective and less costly than STATUS_IDX. The optimizer chooses CATEGORY_IDX.

Need for histogram in an application

The need for a histogram depends on the application. In this example, no histogram is needed on columns ORDID and CATEGORYID because the values are evenly distributed. There is no skew.

If the application code base consists of only this query, then no histogram is needed on STATUS:

select ordid from ords
where categoryid = 0
and status = 'COMPLETE'

If the application, additionally, has this query, then a histogram is needed on STATUS.

select ordid from ords
where status = 'PENDING'

What about “where status = ‘COMPLETE'”?

The question sometimes arises, what about a query with just “where status = ‘COMPLETE'”. Such a query would access a high number of blocks and return a high number of rows, and would be quickly identified as a poor performer. It is not clear what business value would be served by such a query. Such queries appear less commonly in practical applications.

What about two plans for the same statement?

Histograms are sometimes touted for their usefulness in finding more than one plan for the same sql. This blog post does not identify any such examples.

The hypothetical situation we are describing is:

  • The SQL is the same.
  • The plans are different.
  • In one case, the predicate is on a popular value and in the other case, a rare value.
  • In the popular case, the query may return a high number of rows, or a plan step may return a high number of rows to its antecedent.
  • In the non-popular case, the query accesses few blocks and returns quickly.
  • The popular and non-popular cases have business value.
  • The slower response time of the popular case is acceptable.

Findings

  • Column usage in predicates, as recorded in col_usage$, leads to analyzing for, but not necessarily storage of, histograms.
  • 10053 trace files show the optimizer’s rationale for choosing its final execution plan.
  • In some cases, a histogram helps the optimizer choose a less costly plan.
  • In some cases, a histogram improves the accuracy of a cost computation, but does not affect the final choice of plan.

One thought on “Notes on histograms

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s