Configuring Basic Options

  1. Leave Standardize input fields checked to standardize the numeric columns so that they have zero mean and unit variance.
    If you do not standardize, the results may include components dominated by variables whose variances appear larger than those of other attributes only because of their scale, not because of their true contribution.
  2. Check Score input data to add a column for the model prediction (score) to the input data.
  3. Select a Link function from the drop-down list. The link function specifies the link between the random and systematic components, that is, how the expected value of the response relates to the linear predictor of explanatory variables.
    Link function   Description
    Identity        Sometimes used for binomial data to yield a linear probability model, although it can predict invalid "probabilities" less than zero or greater than one. Formula: g(p) = p
    Inverse         Relates the reciprocal of the expected response to the linear predictor. Formula: g(μ_i) = 1/μ_i
    Log             Used for counts of occurrences in a fixed amount of time or space. Formula: g(μ_i) = log(μ_i)


  4. Specify how to handle missing data by checking Skip or Impute means. Impute means substitutes the mean value for any missing data.
  5. Specify a value between 1 and 100 as the Percentage for training data. The input data is randomly split into training and test data samples according to this value.
  6. Enter 100 minus the value you entered in step 5 as the Percentage for test data. For example, if you entered 70 in step 5, enter 30.
  7. Enter a number as the Seed for sampling to ensure that the data is split into training and test data the same way each time you run the dataflow. Uncheck this field to get a different random split each time you run the dataflow.
  8. Click OK to save the model and configuration or continue to the next tab.
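
The options in steps 1 through 7 correspond to common preprocessing operations. The following Python sketch illustrates what each option typically does. It is not the product's implementation: the helper names (impute_means, standardize, split_rows, LINKS) and the sample values are hypothetical, and the use of the population standard deviation and of the column mean for imputation are assumptions.

```python
import math
import random
import statistics


def impute_means(column):
    """Replace missing (None) entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in column]


def standardize(column):
    """Rescale a numeric column to zero mean and unit variance."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)  # assumption: population standard deviation
    return [(v - mean) / std for v in column]


def split_rows(rows, train_pct, seed=None):
    """Randomly split rows into training and test samples."""
    rng = random.Random(seed)  # seed=None gives a different split on each run
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_pct / 100)
    return shuffled[:n_train], shuffled[n_train:]


# Link functions g(mu) as listed in step 3.
LINKS = {
    "Identity": lambda mu: mu,
    "Inverse": lambda mu: 1.0 / mu,
    "Log": math.log,
}

income = [40.0, None, 55.0, 70.0]          # hypothetical column with a gap
filled = impute_means(income)               # the gap becomes the mean of the rest
scaled = standardize(filled)                # zero mean, unit variance
train, test = split_rows(range(10), train_pct=70, seed=42)
```

Calling split_rows again with the same seed returns the same training and test samples, which is the behavior the Seed for sampling option is meant to guarantee.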