ric van dyke know your histograms
Know your Histograms
Histograms have had a history of being both a silver bullet and a lead weight for your queries. Sometimes they make queries run amazingly better, other times not so much. Also there is an impact to collecting stats with them. Understanding the different types of histograms is the first step to getting a handle on using them effectively.
Before we jump into the types of histograms, why even have one to begin with? When the optimizer calculates the expected number of rows (cardinality), it assumes an even distribution of the distinct values within a column. This gives it the basic formula for the simple predicate of COLUMN = VALUE as (number of rows in the table) * (1/number of district values). In a simple case, if the table has 100 rows and there are 4 distinct values then it will estimate that there will be 25 rows for each. ¼ = .25, so our equation would look like 100 * .25 = 25.
This is all well and good as long as each value returns something at least close to that. It’s when there is skew in the data that things go a miss. What if for the 4 distinct values in the simple example, 1 of them occurred 97 times and the other 3 only once? Now the calculation isn’t very accurate. (With a 100 row table it may not really matter, I’m using that here to keep the math simple. I’m sure you can see that if we make the table with hundreds of thousands (or millions) of rows then it make a lot more difference.)
It’s in these cases that a histogram can help. If you know you have a column with a reasonable number of distinct values and the data is not even close to even between those values and the column is used in a predicate, those are the columns you should consider creating a histogram on. Notice that it’s important that the column is used in a predicate. A column with heavy skew that is never used as a predicate, don’t bother. (I’m not going to try to define “heavy skew” with definitive values. That is a science onto itself. You can research Pearson’s formulas for exact definitions on this if you so desire. For this discussion, if a column has a few of its distinct values account for most of the occurrences, that is “heavy skew”.)
Now that we know why we might want a histogram, it’s time to discuss the types. There are now four types of histograms starting with Oracle 12c. Prior to 12c, there were only two. They are Frequency, Height Balanced, and the new ones, Top Frequency and Hybrid. First up I’ll discuss each type so you have an understanding on each one.
When I use the term histogram most folks think of the first type, a frequency histogram. Also we tend to think of a bar chart for its representation. And that works quite well since they do “look like that”. When a histogram is created it has a number of buckets. Each of these buckets will represent two dimensions of the data. Some number of distinct values (the width of the bucket) and a number representing the number of occurrences of that set of value (the height of the bucket). These concepts are easiest to explain with the frequency histogram.
In a frequency histogram there is only one distinct value for each bucket. Hence the width is the same for all buckets. Each one represents just one value. It’s the height that is different for each bucket. The height of each bucket will represent how many times that distinct value appears in the column.
This diagram show for the OBJECT_TYPE column on the table BIG_TAB there is some pretty serious skew. Two values very obviously occur more often in the data than any other value. There are 40 distinct values in OBJECT_TYPE but JAVA CLASS and SYNONYM dominant the occurrences. (Some actual values are not shown in the X axis for readability; all 40 values are in the histogram.)
Note: the Y axis is in logarithmic scale. The value on the X axis is the ENDPOINT_ACTUAL_VALUE column.
With a frequency histogram the optimizer knows exactly how many times a given value appears in a particular column. Of course this exact number is based on the last time statistics were collected, so it’s possible that they could be incorrect. Assuming that stats are kept up to date, even if incorrect they shouldn’t be different enough to make a difference. Clearly these types of histograms are the best.
But what happens when you have more values the buckets? Up thru Oracle 11, the maximum number of buckets is 255. There are certainly many columns that have more than 255 distinct values that we’d like to have a histogram on. This is where the height balanced histogram comes in.
With a height balanced histogram think of the height of each bucket being the same and now the width is different. An important fact to keep in mind here is that a histogram is always in sorted order. As an example each bucket represents 1000 rows. The first bucket might represent the first five distinct values in the column. Then the next two buckets only represent the next one.
Using the same OBJECT_TYPE column on the BIG_TAB table, this is a representation of the height balanced histogram. Now each bucket will all values less than or equal to the value and greater than the bucket previous. This was collected with 30 buckets but you can see that only 10 are shown. The number along the x axis now tells us how many buckets are represented. Once again we can see the values JAVA CLASS and SYNONYM dominate the occurrences.
Note: the value on the Y axis is the ENDPOINT_ACTUAL_VALUE column.
Of course the optimizer doesn’t have an exact count for the first five values here. It will make some calculations to come up with a number of values, which will be something like 1/5 of 1000. The optimizer’s math has gotten better at using height balanced histograms and today the estimations are much closer to reality then they have been in earlier releases. In Oracle 9 for example it was questionable if height balanced histogram were worthwhile in many cases. I find it a little amusing that the math used with these histograms is better than ever today and with the release of Oracle 12, these histograms will likely disappear.
Given the historical problems with height balanced histograms, the developers at Redwood Shores worked feverishly to find a better solution. With Oracle 12 there are really three new developments with histograms that directly attack this issue. First off there are now two new types of histograms, top frequency and hybrid. Also the number of buckets goes from 255 to 2,048.
First the number of buckets now is more in the range of being useful to create more frequency histograms then before. The limit of 255 was really quite low. The reason it has been 255 likely has to do with when histograms were first created the developers need to keep it on the small side. Histograms were first created in Oracle 8, back late 1990s. Given that a histogram is a statistic the developers at that time needed to create something that would be fast for the optimizer to scan and use while it created a plan. Keeping it to 255 maximum buckets would do the trick.
Systems today are much faster than back in the late 90s. A processor speed was measured in megahertz (millions of cycles per second) back then and today are measured in gigahertz (billions of cycles per second). With this increase the need to keep histograms so small has diminished. It still needs to be a compact structure so the optimizer can quickly find information from, a limit of some sort is appropriate, but 255 is really way too small.
On a cautionary note, you may want to stick with 255 as a max for a little while. Going over 255 is new and as always something new might have some issues. Even oracle is being cautious about it. If using the SIZE AUTO it will still use 255 as a max.
Top frequency is one of the two new histograms and as the name implies it will keep track of the “top” occurring values. Others that are statically insignificant will not be in the histogram. The common example is a value that only occurs once wouldn’t be in the histogram.
Notice in this version of the histogram for the same column, there are only 30 buckets. And the smallest value we see is for WINDOW at 288. Only the top 30 values are actually tracked in this histogram, column values like DESTINATION with only 64 occurrences aren’t in the histogram.
Note: the Y axis is in logarithmic scale. The value on the X axis is the ENDPOINT_ACTUAL_VALUE column.
The other new histogram is the hybrid. This, again as the name implies, is a mix of frequency and height balanced. A key problem with the height balanced histogram was the “almost popular” values. The optimizer only considered values that occupied more than one bucket to be popular. A value that didn’t quite fill more than one would often get misrepresented. In this case the buckets can both vary in height and width.
For this example I used a different column on the same table, this is on TIMESTAMP. Try as I did, Oracle wouldn’t create a hybrid on OBJECT_TYPE. The blue line here represents the width of the bucket (as in how many values are in this bucket). This is just like the height balanced. The red is a new column in the histogram views called ENDPOINT_REPEAT_COUNT. This new column means what it says, “How many times is the end point repeated in the bucket”. This way the optimizer has a much better idea of those “almost popular” values in a bucket. (The first value in this graph has a height and repeat count both of 1.)
Note: the X axis is in logarithmic scale. The value on the Y axis is the ENDPOINT_ACTUAL_VALUE column.
One requirement for the two new types of histograms, you must use the AUTOSAMPLESIZE for the estimate percent when gathering the statistics. Using this setting is a good idea anyway so hopefully you are already using this for your stats gathering. If you’re not, now is the time to make the change.
We don’t create histograms in the same way we create a table or index. Many of us talk about histograms like we talk about tables, as if they were a database object. But they are really a statistic on a column, a fairly complex statistics as compared to say the number of rows in a table. Hence we don’t “create” a histogram; we collect them when we collect statistics on the table.
The METHOD_OPT clause is used to collect histograms on columns. The basic syntax looks like this:
METHOD_OPT => 'FOR COLUMNS <COLUMN LIST> SIZE
The “FOR COLUMNS” can also be “FOR ALL COLUMNS” or “FOR ALL INDEXED COLUMNS”.
The column list can be comma separated or space separated. Then there is the size option which lets you set how many buckets are going to be used. Where N is an integer value with a default of 75, max 255 in 11 and lower, 2048 in 12. The other options for size are as follows.
SIZE REPEAT – Do what was done "last time". While collecting statistics, the collection mechanism looks in the data dictionary for what columns had stats and how many buckets were used and does it the same thing. This is likely a good option to use once you have the histograms you want on a table. Setting up the preference for the table to be “for all columns size repeat” insures that you always get the same histograms on a table.
SIZE AUTO – Oracle determines the columns on which to collect histograms based on data distribution and the workload of the columns. This is the default for tables as of version 10. The intent here is that oracle will figure out which columns should have histograms and which shouldn’t. In the big picture this is a good idea. However it can cause problems where histograms “come and go” from one collection to the next, which of course can lead to unstable performance.
In a database with 100’s and 1000’s of tables, using this option sure is a time saver for you having to try to figure out which columns should or shouldn’t have a histogram. But for your key tables in an application I’d recommend you take control and know exactly which columns you have histograms on. This is likely about half a dozen tables in most systems. Once you have all the tables set with histograms, set the preference for all the tables to the repeat option.
SIZE SKEWONLY – This is like AUTO, but this time Oracle determines the columns on which to collect histograms based on the data distribution of the columns only. It doesn’t look at column usage. This might be a good idea in a brand new system, where it’s unknown what the column usage will be. However this certainly could create histograms on columns that don’t get used and that is just a waste of time and space.
The options for the “FOR COLUMNS” also add some interesting twists to the collection. The “FOR ALL COLUMNS” with a particular size (using an integer) is likely a bad idea except when you want to get rid of all histograms, and then doing “FOR ALL COLUMNS SIZE 1” is excellent.
The “FOR ALL INDEXED COLUMNS” on the surface sounds like a good idea. The main reason we create histograms is to help the optimizer choose between an index scan and a full tables scan after all. However this could create histograms on columns that do not have skewed data, which can cause for problems. Having a histogram on a column that isn’t skewed can actually miss-lead the optimizer and have it make bad decisions.
Also would it be a good idea to create a histogram on a column that doesn’t have an index? At first this appears to be a rather silly question of course not, but you might. If the column has skew and is used as a predicate then it might help. Because with being able to see the skew in the column, the optimizer might make a better choice on a join order or join type. This also may drive better use on an index on the other side of the join. The key here is “it might”.
Histograms aren’t free. They cost time and resources to collect. Here is a simple example to illustrate the point. In this case I have a table BIG_TAB with nearly 3 million rows, which occupies over 48 thousand blocks. There are 19 columns in the table. When I collect stat with no histograms it took about 6 second every time. When I did it on all columns (which, of course, I just told you not to do), it took about 9 seconds each time. Adding 50% more time to collect with histograms then without, is not something insignificant.
Of course the reason we collect histograms is to help performance and yes they sure can! Here is a simple query to illustrate the point. Notice that without the histogram the optimizer estimates the number of rows to be returned is 1/NDV * number of rows. This is the selectivity calculation for an equality predicate. The object type column has 40 distinct values and the table has 2,868,512 rows, using the formula:
.025000 * 2868512 = 71,712.8
Notice the E-Rows column with the number of rows estimated just as we did in the formula. And then the A-Rows being the actual number of rows returned. This query did 48,239 LIOs (the buffers column) to get the results. No surprise that how big the table is.
In this example we have a histogram on the object type column and now the optimizer looks up the value in this histogram. Notice the E-Rows and then the A-Rows now match. This query did 186 LIOs (the buffers column) to get the results. This query using the index did about less than 1% the number of LIOs of the full scan. Obviously this is a better plan to get the 320 rows from this table.
Of course histograms don’t work for everything. Some times when they might not help (or at least not much are:
All (or most) predicates on the column use bind variables (prior to 11). Up to 11 bind peeking only took place at hard parse time, so whatever value was look at on the very first run would be used to look into the histogram. This might be good for that run but what about others? In 11 adaptive cursor sharing comes along to help. This is a big topic on its own, so maybe another time I’ll talk about it in detail.
The column data is uniformly distributed. This seems like it shouldn’t be an issue, but the reality is that it can be. The math changes once there is a histogram on a column that that might be enough to make bad choices.
Data distribution changes frequently and statistics aren't collected to match. Here we just can’t collect the statistics fast enough. By the time the histogram is collected, it’s already out of date. Unfortunately for this kind of case you are likely better off without a histogram and just have to deal with performance issues other ways. It might be a case where hints (which shouldn’t be used as a general rule) might be the best way to get thru the issue.
I hope this helps you understand histograms a little better. New ones are coming and all indications are that they are a good thing.
OTECH MAGAZINE #7 spring 2015 copyright otech magazine 2015