- Regular statistics rows, one or more per AMP
- Null rows, zero or one per AMP
- All-null rows, zero or one per AMP
- Average rows per value, one per AMP

The information taken from these rows is used to populate the statistical interval histograms used by the Optimizer to make its initial cardinality estimates (see Interval Histograms).

- Number of null rows
- Number of all-null rows
- Overall average of the average number of rows per value per AMP

You can report these statistics using a HELP STATISTICS request (see Teradata Vantage™ - SQL Data Definition Language Syntax and Examples, B035-1144). HELP STATISTICS reports only summary statistics. To report detailed statistics, use a SHOW STATISTICS request.

## Null and All-Null Statistics and Demographics

Null and all-null statistics provide the following cardinality information:

Statistic | Definition
---|---
NumNulls | Number of rows in the column set for which one or more of the columns are null.
NumAllNulls | Number of rows in a composite column set for which all of the columns are null.

Vantage collects the NumAllNulls statistic only for composite column sets, and counts a row only when every column in the composite set is null. In other words, the only time it is possible to collect a NumAllNulls value is when you collect multicolumn statistics or collect statistics on a composite index, and none of the columns in that set has a value.

Consider the following example of what is meant by null and all-null rows. Suppose you create a 3-column USI for table t_ex on columns b, c, and d. When you collect statistics on the USI, one row is found in which all 3 columns making up the USI are null. This is reported as an all-nulls instance in the statistics. The example is degenerate, because a table can never contain more than one row whose USI columns are all null, but it illustrates the concept even though it does not generalize.

The columns for this row would look something like the following, where nulls are represented by the QUESTION MARK character.

t_ex

a (PI) | b (USI) | c (USI) | d (USI) | e | f
---|---|---|---|---|---
355 | ? | ? | ? | 25000.00 | 375000.00

This contrasts with the case where statistics have been collected on a composite column set, and one or more, but not necessarily all, of the columns on which those statistics have been collected for a given row are null.

Assume the same table and index definitions as before. When you collect statistics on the USI, any row found where one or two, but not all, of the three columns making up the USI are null would be reported as an occurrence of a null column in the statistics.

All of the following rows in t_ex would be reported as null fields, because all have at least one null in one of the USI columns, while only the first row would be reported for the all-null fields statistic.

t_ex

a (PI) | b (USI) | c (USI) | d (USI) | e | f | |
---|---|---|---|---|---|---
355 | ? | ? | ? | 25000.00 | 375000.00 | <-- all-null fields
685 | ? | 155 | ? | 45000.00 | 495000.00 | <-- null fields
900 | 325 | ? | ? | 32000.00 | 400000.00 | |
275 | ? | ? | 760 | 55000.00 | 575000.00 | |
597 | ? | 254 | 893 | 10000.00 | 150000.00 | |
322 | 679 | ? | 204 | 75000.00 | 650000.00 | |
781 | 891 | 357 | ? | 32000.00 | 400000.00 | |
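The NumNulls and NumAllNulls counts for the rows above can be checked with a short sketch. This is plain Python illustrating the definitions, not Vantage internals; None stands in for SQL NULL.

```python
def null_demographics(rows):
    """Count null demographics for a composite column set.

    NumNulls counts rows where one or more columns are null;
    NumAllNulls counts rows where every column is null.
    """
    num_nulls = sum(1 for row in rows if any(col is None for col in row))
    num_all_nulls = sum(1 for row in rows if all(col is None for col in row))
    return num_nulls, num_all_nulls

# The (b, c, d) USI values from the t_ex rows above.
usi_values = [
    (None, None, None),   # all-null row
    (None, 155, None),
    (325, None, None),
    (None, None, 760),
    (None, 254, 893),
    (679, None, 204),
    (891, 357, None),
]

print(null_demographics(usi_values))  # → (7, 1)
```

All seven rows contribute to NumNulls because each has at least one null USI column, while only the first contributes to NumAllNulls.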

All-nulls describes the case where statistics have been collected on a composite column set, and all the columns on which those statistics have been collected for a given row are null. With this information, the Optimizer can more accurately estimate the true number of unique values in a set, thus enabling more accurate join costing.

For example, suppose you have table t1 with columns x1 and y1 and the following values.

t1

x1 | y1 | |
---|---|---
10 | 10 | |
20 | ? | <-- null fields
? | 30 | |
? | ? | <-- all-null fields
? | ? | |

Again, the all-null composite fields and the partially null composite fields are marked in the table.

If you could collect only the NumNulls statistic, then when you collected statistics on composite columns and one or more of the columns was null, the row would be counted as a null row. This is not an issue for single-table cardinality estimation, because a comparison against a null never evaluates to TRUE, so the row is treated as an unmatched row.

For example, if you could collect only the NumNulls statistic, the histogram on (x1, y1) would indicate that the number of nulls is 4, the number of unique values is 2, and the total number of rows is 5. If columns x1 and y1 are used as join columns, the Optimizer would evaluate the join costing by incorrectly assuming there are only 2 unique values in the data when there are actually 4. This can cause problems in scenarios such as redistribution costing and skew detection.

To circumvent this problem, Vantage computes the true number of distinct partial nulls and saves it in the histogram as NumPNulls.

The following equation calculates the true number of distinct values when partially null rows are present:

NumUniqueValues = NumValues + NumPNullValues + 1

Equation Element | Description
---|---
NumUniqueValues | the true number of distinct values, including the partially null rows.
NumValues | the number of distinct values among the rows that contain no nulls.
NumPNullValues | the number of true distinct values among the partially null rows.
1 | an adjustment for the estimation error when the cardinality of all nulls > 0.
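For the t1 example above, the arithmetic works out as 1 + 2 + 1 = 4. The sketch below is plain Python following the breakdown into non-null, partially null, and all-null groups described above; it is an illustration of the counting rule, not of Vantage internals.

```python
def true_unique_values(rows):
    """Count true distinct values: distinct fully non-null composites,
    plus distinct partially null composites, plus one if any all-null
    rows exist (the +1 adjustment)."""
    non_null = {r for r in rows if all(c is not None for c in r)}
    partial = {r for r in rows
               if any(c is None for c in r) and not all(c is None for c in r)}
    has_all_null = any(all(c is None for c in r) for r in rows)
    return len(non_null) + len(partial) + (1 if has_all_null else 0)

t1 = [(10, 10), (20, None), (None, 30), (None, None), (None, None)]
print(true_unique_values(t1))  # NumValues=1 + NumPNullValues=2 + 1 → 4
```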

To make this computation, Vantage injects a marker and uses the following aggregation query to collect statistics on (x1,y1). For the following request, the fieldID for null_marker is set to 2 and the fieldID for cnt is set to 3:

SELECT CASE WHEN x1 IS NULL AND y1 IS NULL THEN 1 WHEN x1 IS NULL OR y1 IS NULL THEN 2 ELSE 0 END AS null_marker, x1, y1, COUNT(*) AS cnt FROM t_coll_stats GROUP BY 1, 2, 3;
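The effect of the injected marker can be mimicked with a short sketch. This is plain Python, with a Counter standing in for the GROUP BY aggregation; it illustrates the grouping behavior, not Vantage's actual aggregation step.

```python
from collections import Counter

def null_marker(x1, y1):
    # Mirrors the CASE expression: 1 = all columns null,
    # 2 = some but not all columns null, 0 = no columns null.
    if x1 is None and y1 is None:
        return 1
    if x1 is None or y1 is None:
        return 2
    return 0

rows = [(10, 10), (20, None), (None, 30), (None, None), (None, None)]
groups = Counter((null_marker(x1, y1), x1, y1) for x1, y1 in rows)
# The two all-null rows collapse into one group with count 2, while each
# distinct partially null composite keeps its own group.
print(groups)
```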

## Sampled Statistics

To collect sampled statistics, Vantage generates a retrieve request to do the sampling, then passes the sampled rows to the subsequent aggregation step. The retrieve request can use TOP n operations to retrieve the rows if the table is hash distributed and is either hash ordered or a NoPI table. For PPI tables that are sorted on their RowKey value, Vantage can generate a true sample step.

## PARTITION Statistics

The aggregation step builds a detailed row with the partition number and the cardinality for queries such as the following:

SELECT PARTITION, COUNT(*) FROM t_coll_stats GROUP BY PARTITION;
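The detail rows built by that aggregation step pair each partition number with its cardinality. A minimal sketch, in plain Python with made-up partition numbers for illustration:

```python
from collections import Counter

# (partition_number, row) pairs standing in for rows of a PPI table.
rows = [(1, "r1"), (1, "r2"), (2, "r3"), (3, "r4"), (3, "r5"), (3, "r6")]

# One detail row per partition: (partition number, cardinality).
partition_stats = sorted(Counter(p for p, _ in rows).items())
print(partition_stats)  # → [(1, 2), (2, 1), (3, 3)]
```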

## UDF Statistics

You can collect statistics on deterministic UDFs, but not on nondeterministic UDFs.

Consider the following deterministic UDF:

CREATE FUNCTION months_between(date1 TIMESTAMP, date2 TIMESTAMP) RETURNS FLOAT LANGUAGE C NO SQL DETERMINISTIC SPECIFIC months_between_tt EXTERNAL NAME 'CS!months_between_tt!$PGMPATH$/months_between_tt.c' PARAMETER STYLE SQL;

The following statement collects statistics on UDF months_between():

COLLECT STATISTICS COLUMN months_between(BEGIN(policy_duration), END(policy_duration)) AS Stats_MthsBetweenBegAndEnd ON policy_types;
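The C source of months_between_tt is not shown above. As a rough model only, the sketch below implements one plausible semantics for such a function (whole months plus a 31-day fractional remainder); the body is an assumption for illustration, not the UDF's actual implementation.

```python
from datetime import datetime

def months_between(date1: datetime, date2: datetime) -> float:
    # Assumed semantics: whole-month difference plus a fractional part
    # based on a 31-day month. Deterministic: same inputs always give
    # the same output, which is what makes such a UDF eligible for
    # statistics collection.
    whole = (date1.year - date2.year) * 12 + (date1.month - date2.month)
    return whole + (date1.day - date2.day) / 31.0

print(months_between(datetime(2021, 3, 15), datetime(2021, 1, 15)))  # → 2.0
```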

- When Vantage loads statistics header information from DBC.StatsTbl for a table to handle a user query or another statistics-related statement, the system detects nonvalid expressions at the time the query is resolved. For statistics collected on an unresolvable expression, the system updates the validStats field to FALSE for the corresponding statistics in DBC.StatsTbl.

- When retrieving histograms, Vantage only retrieves those for valid statistics. This minimizes the overhead of parsing nonvalid statistics during query processing time.
- The SHOW STATISTICS and HELP STATISTICS statements report only valid statistics.
- A DROP STATISTICS request on a table forces Vantage to drop all statistics, both valid and nonvalid.
- A SHOW STATISTICS COLUMN request returns a warning message if it is made on nonvalid statistics.
- A DROP STATISTICS COLUMN request returns a warning message if it is made on nonvalid statistics.
- Recollecting statistics on a UDF revalidates the statistics if the UDF exists at the time the statistics are recollected.
- You can query the DBC.StatsV system view to determine nonvalid statistics by checking the validStats column.
- You cannot drop individual nonvalid statistics. To drop nonvalid statistics, you must first drop all statistics on the table and then recollect valid statistics.

- If you drop a UDF and then recreate it to return a different type, Vantage ignores the existing statistics for that function during optimization.

## Average AMP Rows Per Value Statistic

The average AMP rows per value statistic is an exact system-wide average, taken over all AMPs, of the number of rows per value on each individual AMP for a NUSI column set on which statistics have been collected. This statistic is computed only for NUSI columns and is used for nested join costing.
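A sketch of the arithmetic, in plain Python. Each dict maps a NUSI value to its row count on one AMP; this data layout is illustrative only, not how Vantage stores the demographics.

```python
def avg_amp_rows_per_value(amps):
    """System-wide average of each AMP's average rows per value."""
    # Per-AMP average: total rows on the AMP divided by distinct values
    # on the AMP. Empty AMPs contribute nothing.
    per_amp = [sum(counts.values()) / len(counts) for counts in amps if counts]
    return sum(per_amp) / len(per_amp)

# Two AMPs: the first averages 3 rows per value, the second 1.
amps = [{"a": 2, "b": 4}, {"a": 1, "c": 1}]
print(avg_amp_rows_per_value(amps))  # → 2.0
```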

## Example EXPLAIN Text for First Time Statistics Collection

The following EXPLAIN text reports the steps taken to build a histogram and collect statistics for the first time on column x1 from table t1 with embedded comments to clarify exactly what happens at critical steps in the process.

EXPLAIN COLLECT STATISTICS COLUMN x1 ON t1;

Explanation
-----------------------------------------------------------------------
1) First, we lock QSTATS.t1 for access.
2) Next, we do an all-AMPs SUM step to aggregate from QSTATS.t1 by way of an all-rows scan with no residual conditions, grouping by field1 (QSTATS.t1.x1). Aggregate Intermediate Results are computed locally, then placed in Spool 9. The size of Spool 9 is estimated with low confidence to be 2 rows (44 bytes). The estimated time for this step is 0.03 seconds.
3) We do an all-AMPs RETRIEVE step from Spool 9 (Last Use) by way of an all-rows scan into Spool 5 (all_amps), which is built locally on the AMPs. The size of Spool 5 is estimated with low confidence to be 2 rows (52 bytes). The estimated time for this step is 0.04 seconds.
4) Then we save the UPDATED STATISTICS from Spool 5 (Last Use) into Spool 3, which is built locally on the AMP derived from DBC.StatsTbl by way of the primary index.
/*Hash of t1*/
5) We lock DBC.StatsTbl for write on a RowHash.
/*Raise a retryable error if the generated stats id is already used*/
/*by some other collect stats on the same table.*/
/*The error step can also be done by USI on (TableId, StatsId)*/
6) We do a single-AMP ABORT test from DBC.StatsTbl by way of the primary index "DBC.StatsTbl.TableId = <t1 Id>" with a residual condition of ("DBC.StatsTbl.StatsId = <generated stats id>").
/*Insert the histogram row with generated stats id*/
7) We do a single-AMP MERGE into DBC.StatsTbl from Spool 3 (Last Use). The size is estimated with low confidence to be 1 row. The estimated time for this step is 0.31 seconds.
/*Update or insert the master record with the updated table level demographics*/
8) We do a single-AMP UPDATE from Spool 3 by way of the primary index "DBC.StatsTbl.TableId = <t1 Id>" with a residual condition of ("DBC.StatsTbl.StatsId = 0"). The size is estimated with low confidence to be 1 row. The estimated time for this step is 0.02 seconds. If the row cannot be found, then we do an INSERT into DBC.StatsTbl.
9) We spoil the parser's dictionary cache for the table.
10) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.
-> No rows are returned to the user as the result of statement 1.

- In step 7, Vantage inserts the histogram row with the generated StatsId.
- In step 8, Vantage updates the master record with the new table-level demographics. Master records always have a fixed StatsId of 0.

## Example EXPLAIN Text for Recollecting Statistics

The following EXPLAIN text reports the steps taken to recollect statistics on column x1 from table t1 with embedded comments to clarify exactly what happens at critical steps in the process:

EXPLAIN COLLECT STATISTICS COLUMN x1 ON t1;

Explanation
-----------------------------------------------------------------------
1) First, we lock QSTATS.t1 for access.
2) Next, we do an all-AMPs SUM step to aggregate from QSTATS.t1 by way of an all-rows scan with no residual conditions, grouping by field1 (QSTATS.t1.x1). Aggregate Intermediate Results are computed locally, then placed in Spool 9. The size of Spool 9 is estimated with low confidence to be 2 rows (44 bytes). The estimated time for this step is 0.03 seconds.
3) We do an all-AMPs RETRIEVE step from Spool 9 (Last Use) by way of an all-rows scan into Spool 5 (all_amps), which is built locally on the AMPs. The size of Spool 5 is estimated with low confidence to be 2 rows (52 bytes). The estimated time for this step is 0.04 seconds.
4) Then we save the UPDATED STATISTICS from Spool 5 (Last Use) into Spool 3, which is built locally on the AMP derived from DBC.StatsTbl by way of the primary index.
5) We lock DBC.StatsTbl for write on a RowHash.
/*Hash of (t1)*/
6) We do a single-AMP UPDATE from Spool 3 by way of the primary index "DBC.StatsTbl.TableId = <t1 Id>" with a residual condition of ("DBC.StatsTbl.StatsId = <existing stats id>"). The size is estimated with low confidence to be 1 row. The estimated time for this step is 0.02 seconds. If the row cannot be found, then we do an INSERT into DBC.StatsTbl.
7) We do a single-AMP UPDATE from Spool 3 by way of the primary index "DBC.StatsTbl.TableId = <t1 Id>" with a residual condition of ("DBC.StatsTbl.StatsId = 0"). The size is estimated with low confidence to be 1 row. The estimated time for this step is 0.02 seconds. If the row cannot be found, then we do an INSERT into DBC.StatsTbl.
8) We spoil the parser's dictionary cache for the table.
9) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.
-> No rows are returned to the user as the result of statement 1.

- In step 6, Vantage updates the histogram row with the existing StatsId.
- In step 7, Vantage updates the master record with the new table-level demographics. Master records always have a fixed StatsId of 0.