Performance Tips for Profiles
You can improve the performance of your profiles by following these tips:
- Do a sample profile first to make a determination of what metrics to use for what attributes.
- The majority of repository storage space is used in storing the frequency distribution results.
- Be judicious in which columns need frequency distributions.
- Frequency distribution of unique attributes will grow your repository quickly.
- Pattern frequency distribution will grow your repository VERY quickly if there is a large variability of patterns.
- Avoid frequency distribution on primary key. These measurements will consume more disk space, and they will typically will not be very helpful due to their high variability.
- Avoid both types of frequency distributions if they yield results in excess of 40% of the data.
- Primary/foreign key analysis requires some of the primary keys to have frequency distribution performed.
- Redundancy analysis requires some of the primary keys to have frequency distribution performed.
- Allocate 10-15% of the total repository space needed, Do not allocate repository space up front.
- Archive profile reports after they are no longer needed.
- For large tables, use sampling (such as profile every 5th row ).
- Estimate the space required for a single profile report. For most tables, we can assume the breakdown of “uniqueness” of values is as follows:
- N = total number of columns
- U = number of columns that contain 100% unique data
- 50% of N columns contain 40%-70% (55% avg.) unique data
- 25% of N columns contain 10% or less unique data
- 25% of N columns contain 90% or greater unique data
- With those assumptions we can work out a calculation that the estimated space will be 52.5 + U/N = X, where X is the percent of the size of the source table.
- Enter a higher commit interval value in the Profile Options dialog. To enter this value, open the properties dialog for the affected profile. Then select Tools > Profile Options. On the General tab, select the "Commit every N rows" checkbox. Profiles that contain data from Teradata databases are particularly sensitive to the commit interval value.
- Recognize that profiling generates frequency distribution tables. The tables are created dynamically as we as each table is analyzed. These tables grow, but they are necessary for creating the profile reports.