Configuring Advanced Options

  1. Leave Ignore constant fields checked to skip fields that have the same value for each record.
  2. Leave Seed for algorithm checked and enter a seed number so that the data is split into test and training data the same way each time you run the dataflow. Uncheck this field to get a random split each time you run the flow.
  3. Select an initialization mode from the Init drop-down list.
    Initialization mode  Description
    Furthest             Initializes the first centroid randomly, then initializes the second centroid to be the data point farthest from it. The aim is to start with centroids that are well spread out from each other.
    Plus-Plus            Uses k-means++ initialization to choose the cluster centers before proceeding with the standard k-means optimization iterations. With this initialization, the algorithm is guaranteed, in expectation, to find a solution that is O(log k) competitive with the optimal k-means solution.
    Random               Chooses K of the N observations at random as the initial cluster centers, so that each observation has an equal chance of being chosen. This is the default initialization mode.
  4. Leave Seed for N fold checked and enter a seed number so that the data is split into test and training data the same way each time you run the dataflow. Uncheck this field to get a random split each time you run the flow.
  5. Check N fold and enter the number of folds if you are performing cross-validation.
  6. Check Fold assignment and select an option from the drop-down list if you are performing cross-validation.
    Fold assignment  Description
    Auto             Allows the algorithm to automatically choose an option; currently it uses Random. This is the default.
    Modulo           Evenly splits the dataset into the folds and does not depend on the seed.
    Note: This field is applicable only if you entered a value in N fold.
  7. Check Maximum iterations and enter the number of training iterations that should take place.
  8. Click OK to save the model and configuration or continue to the next tab.
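To make the three initialization modes in step 3 more concrete, the following is a minimal sketch in plain Python of how each one could pick starting centroids. It is not the product's implementation. The function names (init_random, init_furthest, init_plus_plus) and the sample points are hypothetical, and extending Furthest beyond the second centroid by taking the point farthest from its nearest chosen centroid is an assumption made for illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def init_random(points, k, rng):
    # Random: pick k distinct observations, each equally likely.
    return rng.sample(points, k)


def init_furthest(points, k, rng):
    # Furthest: random first centroid, then the point farthest from it.
    # Assumption: later centroids are the points farthest from their
    # nearest already-chosen centroid.
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids


def init_plus_plus(points, k, rng):
    # k-means++: random first centroid, then each further centroid is
    # sampled with probability proportional to its squared distance to
    # the nearest centroid chosen so far.
    # Note: requires at least k distinct points, otherwise all weights
    # become zero and random.choices raises ValueError.
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        weights = [min(sq_dist(p, c) for c in centroids) for p in points]
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids


# Hypothetical sample data: two tight pairs and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.1, 10.0), (0.0, 10.0)]
for init in (init_random, init_furthest, init_plus_plus):
    # A fixed seed makes each mode repeatable from run to run.
    print(init.__name__, init(points, 3, random.Random(42)))
```

Passing a seeded random.Random instance plays the same role as entering a seed number in the dialog: the same seed gives the same starting centroids, while an unseeded generator gives a different start on each run.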
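The difference between a seed-independent Modulo fold assignment and a seeded random assignment (steps 4 to 6) can be sketched as follows. This is an illustrative example under stated assumptions, not the product's code: modulo_folds and random_folds are hypothetical helper names, and assigning row i to fold i mod N is the usual reading of "Modulo" rather than something the steps above spell out.

```python
import random


def modulo_folds(n_rows, n_folds):
    # Row i goes to fold i % n_folds. No randomness is involved,
    # so the result does not depend on any seed.
    return [i % n_folds for i in range(n_rows)]


def random_folds(n_rows, n_folds, seed=None):
    # Each row gets a fold drawn uniformly at random.
    # A fixed seed reproduces the same assignment on every run;
    # seed=None gives a different assignment each time.
    rng = random.Random(seed)
    return [rng.randrange(n_folds) for _ in range(n_rows)]


print(modulo_folds(7, 3))
print(random_folds(7, 3, seed=123))
print(random_folds(7, 3, seed=123))  # same seed, same assignment
```

With Modulo the fold sizes differ by at most one row, whereas a random assignment only balances the folds on average.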