Profiling Rules

Profiling rules perform different types of analysis on your data. When setting up a profile, choose the profiling rules that perform the types of data analysis you are interested in.

This section describes the profiling rules supported in Metadata Insights.

Character Analysis

This rule identifies patterns, scripts, and character types in string fields. To configure this rule when creating a profile, click the Configure icon and add, edit, or delete special characters from the prepopulated list. When you enable this rule, you will get this information for the string fields in your data:

  • Frequency: The most-used phrases in the selected string column.
  • Scripts Distribution: The different scripts identified in the selected string column and their count.
  • String Lengths: The distribution of string lengths in the selected string column.
  • Character Categories: The types of characters in the selected column, such as letter, punctuation, and number.
  • Text Patterns: Converts data in the string column to pattern and displays the pattern, its count, and percentage occurrence. The pattern is determined using this rule:
    • Latin upper-case characters are replaced with "A"
    • Latin lower-case characters are replaced with "a"
    • Digits are replaced with "9"
    • Control characters are replaced with

Custom Pattern Analysis

This rule identifies any pattern in the string column. You can configure as many regular expressions as you want to match your data against. To configure this rule when creating a profile, click the Configure icon , and enter these details:

  1. Key: Name of the pattern to be identified
  2. Value: The regular expression for the pattern

    For example, if you want to identify email addresses in string columns, enter the regular expression ^(.+)@(.+)$ in the Value field and email in the Key field.

  3. To add another expression, click the Add icon and add details of the next key value pair. You can add as many expressions as you would like to match the data against.

When you enable this rule, you will get this information:

  • Validity: The values that matched at least one of the regular expression patterns in the rule.
  • Pattern Match Distribution: The distribution of records that matched the regular expressions.

Internal Rule

Determines statistics, such as completeness, uniqueness, frequency and outliers in the data set.

Semantic Analysis

  1. User Defined: This rule detects the user defined semantic types in the Define Glossary and uses it to profile the data set. The user-defined semantic types can find out data based on these parameters:
    1. If you defined an expression in the semantic type, it finds out its match in the data set and gets the result on the Data Profiling Results page.
    2. If you specified the data to be looked for, the user-defined semantic type looks for it in the data set and presents it on the Data Profiling Results page. For example, if Single, Married is specified in the Exists In option of the semantic type, it will fetch all the records that have these value.
  2. Credit Card Validation Analysis: Select this rule to detect and validate credit card numbers and identify credit card numbers as JCB, VISA, Diners Club (DINERS), MasterCard, Discover, or American Express (AMEX). If you select this rule, the Data Profiling Results page displays an additional Credit Card Summary tab showing these details:
    1. Validity: The valid and invalid credit card numbers.
    2. Credit Card Distribution: Category-wise distribution of the detected credit cards
  3. Date Analysis: This rule detects and validates dates in string columns. It also identifies date patterns in the columns and their distribution. This analysis can be useful in detecting date entries in erroneous columns, for example in email data. If you select this rule, Data Profiling Results page displays an additional Date Summary tab for the string columns that have dates. This tab shows these details:
    1. Validity: The valid and invalid values.
    2. Date Patterns: The date patterns detected in the selected columns, their total count, and percentage of that pattern in the data set.
  4. Email Analysis: This rule detects and validates the email addresses and determines the distribution of email domains in the selected data column. If you select this rule, Data Profiling Results page displays an additional Email Summary tab showing these details:
    1. Validity: The valid and invalid values.
    2. Domain Distribution: The top ten email domains in the selected data column.
  5. Phone Number Analysis: Select this rule to detect and validate phone numbers and identify phone numbers as fixed line numbers, mobile numbers, or any other type of number. This rule also gives the distribution of the phone numbers by country and region. You need to configure this rule to define the default country to use when a phone number does not have a country code. If you select this rule, Data Profiling Results page displays an additional Phone Number Summary tab showing these details:
    1. Validity: The valid and invalid phone numbers.
    2. Phone Number Types: The types of phone numbers, such as mobile, land line, fixed line, VOIP, Pager, voice mail, or toll-free.
    3. Phone Numbers by Country: The country-wise distribution of the detected phone numbers.
    4. Phone Numbers by Region: The region-wise distribution of the detected phone numbers
  6. Vehicle Identification Number (VIN) Analysis: Select this rule to detect and validate vehicle identification numbers. This rule also gives the distribution of Vehicle Identification Numbers by country. If you select this rule, the Data Profiling Results page displays an additional VIN Summary tab showing these details:
    1. Validity: The valid and invalid vehicle identification numbers.
    2. VIN Country Distribution: Country-wise distribution of the detected vehicle identification numbers.
  7. Social Security Number (SSN) Analysis: Select this rule to detect and validate social security numbers. If you select this rule, the Data Profiling Results page displays an additional SSN Summary tab showing the valid and invalid social security numbers.
  8. International Bank Account Number (IBAN) Analysis: Select this rule to detect and validate international bank account numbers. This rule also gives the distribution of International Bank Account Numbers by country. If you select this rule, the Data Profiling Results page displays an additional IBAN Summary tab showing these details:
    1. Validity: The valid and invalid international bank account numbers.
    2. IBAN Country Distribution: Country-wise distribution of the detected international bank account numbers.
  9. Semantic Analysis: Select this rule to detect semantic types, such as first name, city, country, ISO country code 2 and 3, last name (family name), and states. This rule can help find values in incorrect columns, such as city names in a Country column. If you select this rule, the Data Profiling Results page displays an additional Semantic Type tab showing the detected semantic types and their frequency.
  10. U.S. Address Analysis: This rule determines the quality of your address data using the U.S. database of Universal Addressing Module. To run this rule, you need to:
    1. Install the Universal Addressing Module U.S. database and define it as a resource in Management Console. For more information about adding this database resource, the Administration Guide.
    2. Configure the U.S. Address Analysis rule by clicking the Configure button and entering this information:
      • US Address Coder database: Select the Universal Addressing Module database resource configured in Management Console.
      • AddressLine1 field to AddressLine5 field: Map these fields to the columns of the table you are analyzing. You do not necessarily need to enter column names in all the fields. However, the more specific you are the better the matching score will be.
      • Map the columns from your table to the City, Country, USUrbanName, FirmName field, PostalCode, and StateProvince fields.

        If you select this rule, the Data Profiling Results page shows the Address Summary tab.

        • The legend below the chart shows the match score for the data along with the color coding.
        • Point anywhere in the graph area to view the match score. The scores are in ranges (0, 1-25, 26-50, 51-80, 81-99, and 100), with zero representing no match of the data to the database. The graph also shows the percentage of matching records detected (color coded).
        • Click the area in the graph to view the data that matched or did not match to the database.
  11. International Address Analysis: This rule determines the quality of your address data using the database of the Global Address Validation Module. To run this rule, you need to:
    1. Install the Global Address Validation database and define it as a resource in Management Console. For more information, see the Administration Guide.
    2. Configure the International Address Analysis rule by clicking the Configure button , and entering this information:
      • Addressing Engine Database: Select the Global Address Validation database resource configured in the Management Console.
      • Confidence Threshold: Enter the value of confidence threshold to detect fields falling below the specified value. The default value of this field is 80.
      • Table List: Select the table on which you would like to run this rule.
      • AddressLine1 field and Country field: Map these fields to the columns in the table you are analyzing.
      • Map the columns from your table to the LastLine, City, CitySubdivision, PostalCode, State, StateSubdivision, and FirmName fields. You do not necessarily need to enter column names in all the fields. However, the more specific you are the better the matching score will be.
        If you select this rule, the Data Profiling Results page shows and Address Summary tab, which displays:
        • International Address Confidence Distribution: The match score for the data. The score is color coded. Point anywhere in the donut chart to view the score range (0, 1-25, 26-50, 51-80, 81-99, and 100), with zero representing no match of the data in the database. Click the area in the chart to preview the matching or non-matching data.
        • International Address Precision: This distribution of validation levels of addresses, such as state, house, postal code, city, city sub-division, and street.
        • International Address Mismatched Fields Distribution: The distribution of mismatched fields of addresses, such as City Subdivision, State Province, Street Name, and Postal Code are displayed here. The confidence threshold of this address is less than the value defined by you while configuring the International Address Analysis rule.