Column Level Profiling Details

These details are displayed when you select a column in the left pane.

Summary

  • Completeness (%): The percentages of complete, null, and blank values detected in the column.
  • Uniqueness: These statistics are displayed here:
    • Unique: Records with no duplicates in the data source.
    • Non-unique: Records having duplicates in the data source.
    • Distinct: All the different records present in your data source, each counted once, whether unique or non-unique.
  • Frequency Analysis: The frequency distribution of data values for any type of column. It shows how many times each data value is repeated.
For example, suppose your column contains these names:
Roger
Gigi
Gigi
Gigi
Garey
Elena
Brad
Brad

Here:

  • Roger, Garey, and Elena are unique records.
  • Gigi and Brad are non-unique records.
  • Roger, Gigi, Garey, Brad, and Elena are distinct records.
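
The classification above can be reproduced with a short Python sketch. It only illustrates the definitions, using the names from the example, and is not the product's implementation; the variable names are chosen here for illustration.

```python
from collections import Counter

# The example column from the text above.
names = ["Roger", "Gigi", "Gigi", "Gigi", "Garey", "Elena", "Brad", "Brad"]

# Frequency analysis: how many times each value occurs.
frequency = Counter(names)

# Unique records occur exactly once; non-unique records occur more than once.
unique = [value for value, count in frequency.items() if count == 1]
non_unique = [value for value, count in frequency.items() if count > 1]

# Distinct records are all the different values, each counted once.
distinct = list(frequency)

print(unique)      # ['Roger', 'Garey', 'Elena']
print(non_unique)  # ['Gigi', 'Brad']
print(distinct)    # ['Roger', 'Gigi', 'Garey', 'Elena', 'Brad']
```

Note that the distinct count (5) equals the unique count (3) plus the non-unique count (2).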

Further, the column-level details are displayed based on the data type in the column.

Profiling Results for Numerical Data

If the column data type is numerical, the following details are displayed:

  • Numerical Analysis: These statistics are displayed here:
    • Minimum: The smallest value in numerical or date data.
    • Maximum: The largest value in numerical or date data.
    • Standard Deviation: A statistic that measures the dispersion of the data relative to its mean.
    • Variance: The average of the squared deviations from the mean.
    • Average: The mean of the numerical values present in the column data.
  • Percentile: A value that indicates where an observation falls within a range of other observations. For example, if a score is at the 30th percentile, 30 percent of all recorded scores are lower.
  • Histogram: Represents the distribution of data.
    Note: The histogram is not shown if the minimum and maximum values for the numerical column are equal.
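
As a rough illustration of these statistics, the following Python sketch computes them for a hypothetical list of scores. It assumes population variance and standard deviation (matching the "average of the squared deviations" definition above) and equal-width histogram bins; the product may compute these differently, and the function names are invented for this example.

```python
import statistics

# Hypothetical column values, used only for illustration.
scores = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]

minimum = min(scores)
maximum = max(scores)
average = statistics.mean(scores)
# Population variance: the average of the squared deviations from the mean.
variance = statistics.pvariance(scores)
# Population standard deviation: the square root of the variance.
std_dev = statistics.pstdev(scores)


def percentile_rank(values, score):
    """Return the percentage of values that are strictly lower than score."""
    lower = sum(1 for v in values if v < score)
    return 100 * lower / len(values)


def histogram(values, bins=5):
    """Count values per equal-width bin; return None when min equals max."""
    low, high = min(values), max(values)
    if low == high:
        return None  # there is no range to divide into bins
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value is placed in the last bin.
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    return counts


print(average, variance)            # 77.5 206.25
print(percentile_rank(scores, 70))  # 30.0
print(histogram(scores))            # [2, 2, 2, 2, 2]
print(histogram([4, 4, 4]))         # None
```

The last call shows why a histogram cannot be drawn when the minimum and maximum are equal: the bin width would be zero.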

Profiling Results for String Data

If the column data type is a string, the following details are displayed:

  • String Analysis: These statistics are displayed here:
    • Semantic Type: The detected semantic type. It is not displayed if no semantic type is detected in the data.
    • Min Length: Minimum length of the string in the column.
    • Max Length: Maximum length of the string in the column.
  • Text Pattern: It shows whether a string contains a particular pattern of characters.
  • String Length: The distribution of string lengths in the selected string field. String length is the number of characters in a string.
  • Script Distribution: The scripts (alphabets) present in the selected string column. Characters that are shared across scripts are categorized under Common.
  • Character Categories: Graphically displays the frequencies of Latin character types detected in the selected string column. The various categories are:
    • Casing: Upper Case, Lower Case, and Mixed Case
    • Character Data Types: Alphabetic, Numeric, and Alphanumeric
    • Contains Spaces: Single Space, Multiple Spaces, and Trailing or Leading Spaces
    • Special Character: Contains or Does not Contain.
      Note: Only the special characters defined when configuring the Character Analysis rule are considered here.
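
The string statistics above can be sketched in Python as follows. This is an illustration under stated assumptions, not the product's logic: the sample values are hypothetical, the pattern notation (A for upper-case letters, a for lower-case letters, 9 for digits) is one common convention chosen here, and the "No Letters" label is invented for values without any letters.

```python
from collections import Counter

# Hypothetical string column, used only for illustration.
values = ["Alice", "BOB", "carol 42", " dave", "E-5"]

# Min Length, Max Length, and the String Length distribution.
lengths = [len(v) for v in values]
min_length, max_length = min(lengths), max(lengths)
length_distribution = Counter(lengths)


def text_pattern(value):
    """Mask a string: digits become 9, letters become A or a, others are kept."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isupper():
            out.append("A")
        elif ch.islower():
            out.append("a")
        else:
            out.append(ch)
    return "".join(out)


def casing(value):
    """Classify the casing of the letters in a string."""
    letters = [ch for ch in value if ch.isalpha()]
    if not letters:
        return "No Letters"  # label invented for this sketch
    if all(ch.isupper() for ch in letters):
        return "Upper Case"
    if all(ch.islower() for ch in letters):
        return "Lower Case"
    return "Mixed Case"


def has_leading_or_trailing_space(value):
    """Return True when the string starts or ends with a space."""
    return value != value.strip(" ")


print(min_length, max_length)        # 3 8
print(text_pattern("carol 42"))      # aaaaa 99
print(casing("Alice"), casing("BOB"))  # Mixed Case Upper Case
print(has_leading_or_trailing_space(" dave"))  # True
```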

Semantic

These details are displayed when you select the Semantic tab.

  • Semantic Type: Displays the list of detected semantic types in the selected column.
  • Confidence: Displays the confidence level for the detected semantic type, expressed as a percentage. It indicates how likely it is that the data in the column is of that semantic type.
For example, 98% confidence in the phone number semantic type means that there is a 98% likelihood that the data contains phone numbers.
  • Validity: Displays the percentage of valid data for the semantic type.
    Note: Validity is shown only if you selected the respective semantic rule under semantic analysis during profile creation. For details, refer to Semantic Analysis.
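
To make the idea of validity concrete, here is a minimal Python sketch that computes the percentage of values matching a validation rule. Everything in it is an assumption for illustration: the simplified phone number pattern, the sample values, and the function name are hypothetical, and the product's semantic rules are not described in this section.

```python
import re

# Hypothetical, simplified rule for a phone number such as 555-010-1234.
PHONE_RULE = re.compile(r"\d{3}-\d{3}-\d{4}")


def validity_percent(values, rule):
    """Return the percentage of non-empty values that fully match the rule."""
    non_empty = [v for v in values if v]
    if not non_empty:
        return 0.0
    valid = sum(1 for v in non_empty if rule.fullmatch(v))
    return 100 * valid / len(non_empty)


# Hypothetical column values.
phones = ["555-010-1234", "555-010-9999", "not a phone", "555-0100"]
print(validity_percent(phones, PHONE_RULE))  # 50.0
```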