Column Level Profiling Details
These details are displayed when you select a column in the left pane.
Summary
- Completeness (%): The percentages of complete, null, and blank values detected in the column.
- Uniqueness: Displays the following statistics:
- Unique: Records whose value occurs only once in the data source.
- Non-unique: Records whose value occurs more than once in the data source.
- Distinct: All the different values present in your data source, each counted once, whether they are unique or non-unique.
- Frequency Analysis: The frequency distribution of the data values in a column of any type. It shows how often each data value is repeated.
For example, suppose your column contains these names:
Roger, Gigi, Gigi, Gigi, Garey, Elena, Brad, Brad
Here:
- Roger, Garey, and Elena are unique records.
- Gigi and Brad are non-unique records.
- Roger, Gigi, Garey, Brad, and Elena are distinct records.
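The classification above can be reproduced with a short Python sketch. This is only an illustration of the definitions, not the profiler's actual implementation; the variable names are hypothetical.

```python
from collections import Counter

# The example column from the text above.
names = ["Roger", "Gigi", "Gigi", "Gigi", "Garey", "Elena", "Brad", "Brad"]

# Frequency analysis: how many times each value occurs.
frequency = Counter(names)

# Unique values occur exactly once; non-unique values occur more than once.
unique = [value for value, count in frequency.items() if count == 1]
non_unique = [value for value, count in frequency.items() if count > 1]

# Distinct values are all values that appear, each listed once.
distinct = list(frequency)

print(unique)      # ['Roger', 'Garey', 'Elena']
print(non_unique)  # ['Gigi', 'Brad']
print(distinct)    # ['Roger', 'Gigi', 'Garey', 'Elena', 'Brad']
```

Note that the distinct count (5) equals the unique count (3) plus the non-unique count (2).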
Additional column-level details are displayed depending on the data type of the column.
Profiling Results for Numerical Data
If the column data type is numerical, the following details are displayed:
- Numerical Analysis: Displays the following statistics:
- Minimum: The smallest value in the column, for numerical or date data.
- Maximum: The largest value in the column, for numerical or date data.
- Standard Deviation: A measure of the dispersion of the data relative to its mean.
- Variance: The average of the squared deviations from the mean.
- Average: The mean of the numerical values present in the column.
- Percentile: The value below which a given percentage of observations falls. For example, if a score is at the 30th percentile, 30 percent of all the recorded scores are lower.
- Histogram: Represents the distribution of the data.
Note: The histogram is not shown if the minimum and maximum values for the numerical column are equal.
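As a rough illustration of these statistics, the following Python sketch computes them for a small, made-up list of scores. It assumes the population form of variance and standard deviation, which matches the definition given above; the formulas the profiler actually uses are not stated here, so treat the names and choices below as assumptions.

```python
import statistics

# Hypothetical column of numerical values.
scores = [10, 20, 30, 40, 50]

minimum = min(scores)
maximum = max(scores)
average = statistics.mean(scores)

# Population variance: the average of the squared deviations from the mean.
variance = statistics.pvariance(scores)
# Standard deviation: the square root of the variance.
std_dev = statistics.pstdev(scores)


def percentile_rank(values, score):
    """Return the percentage of values that are strictly lower than score."""
    below = sum(1 for v in values if v < score)
    return 100 * below / len(values)


# Mirrors the note above: no histogram when every value is the same.
show_histogram = minimum != maximum

print(minimum, maximum, average, variance, std_dev)
print(percentile_rank(scores, 40))  # 60.0
```

Here three of the five scores are lower than 40, so 40 sits at the 60th percentile under this definition.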
Profiling Results for String Data
If the column data type is a string, the following details are displayed:
- String Analysis: Displays the following statistics:
- Semantic Type: The detected semantic type. It is not displayed if no semantic type is detected in the data.
- Min Length: Minimum length of the string in the column.
- Max Length: Maximum length of the string in the column.
- Text Pattern: It shows whether a string contains a particular pattern of characters.
- String Length: The distribution of string lengths in the selected string field. String length is the number of characters in a string.
- Script Distribution: The scripts (alphabets) present in the selected string column. Characters shared by multiple scripts are categorized under common.
- Character Categories: Graphically displays the frequencies of the Latin character types detected in the selected string column. The categories are:
- Casing: Upper Case, Lower Case, and Mixed Case
- Character Data Types: Alphabetic, Numeric, and Alphanumeric
- Contains Spaces: Single Space, Multiple Spaces, and Trailing or Leading Spaces
- Special Character: Contains or Does not Contain.
Note: Only the special characters defined during the configuration of the Character Analysis rule are considered here.
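The following Python sketch shows one way the length statistics and two of the character categories could be derived. It is an approximation under stated assumptions: the sample values are made up, "Multiple Spaces" is taken to mean more than one space character anywhere in the string, and the "No Letters" and "No Spaces" fallbacks are hypothetical labels that do not appear in the product.

```python
from collections import Counter

# Hypothetical string column.
values = ["ACME", "acme corp", "Acme  Ltd", " acme"]

# String length is the number of characters in the string.
lengths = [len(v) for v in values]
min_length, max_length = min(lengths), max(lengths)
length_distribution = Counter(lengths)  # length -> number of strings


def casing(value):
    """Classify a string by the case of its alphabetic characters."""
    letters = [ch for ch in value if ch.isalpha()]
    if not letters:
        return "No Letters"  # hypothetical fallback label
    if all(ch.isupper() for ch in letters):
        return "Upper Case"
    if all(ch.islower() for ch in letters):
        return "Lower Case"
    return "Mixed Case"


def spacing(value):
    """Classify a string by the space characters it contains."""
    if value != value.strip(" "):
        return "Trailing or Leading Spaces"
    count = value.count(" ")
    if count == 0:
        return "No Spaces"  # hypothetical fallback label
    if count == 1:
        return "Single Space"
    return "Multiple Spaces"  # assumption: more than one space in total


for v in values:
    print(repr(v), len(v), casing(v), spacing(v))
```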
Semantic
These details are displayed when you select the Semantic tab.
- Semantic Type: Displays the list of semantic types detected in the selected column.
- Confidence: Displays the confidence level for the detected semantic type, expressed as the percentage likelihood that the data in the column is of that type. For example, 98% confidence in the phone number type means that there is a 98% likelihood that the column contains phone numbers.
- Validity: Displays the percentage of valid data for the semantic type.
Note: Validity is shown only if you select the respective semantic rule under semantic analysis during profile creation. For details, refer to Semantic Analysis.
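To make the idea of a validity percentage concrete, here is a minimal Python sketch. It assumes, purely for illustration, that a value is valid when it fully matches a simple pattern; the pattern, the function name, and the sample data are hypothetical, and the product's own semantic rules are likely more elaborate.

```python
import re

# Hypothetical rule: a phone number written as 3 digits, 3 digits, 4 digits.
PHONE_PATTERN = re.compile(r"\d{3}-\d{3}-\d{4}")


def validity_percent(values, pattern):
    """Return the percentage of values that fully match the pattern."""
    if not values:
        return 0.0
    valid = sum(1 for v in values if pattern.fullmatch(v))
    return 100 * valid / len(values)


# Hypothetical column data.
phones = ["555-123-4567", "555-987-6543", "not a phone", "555-000-1111"]

print(validity_percent(phones, PHONE_PATTERN))  # 75.0
```

Three of the four sample values match the pattern, so the validity under this assumed rule is 75%.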