Configuring Advanced Options

  1. Leave Ignore constant fields checked to skip fields that have the same value for each record.
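    For reference, the check this option performs can be sketched in a few lines of Python; the field names and records below are purely illustrative, not the stage's internal implementation.
```python
# Minimal sketch of constant-field detection: a field whose value is
# identical for every record carries no information for splitting.
records = [
    {"region": "EU", "status": "active", "amount": 10.0},
    {"region": "EU", "status": "closed", "amount": 12.5},
    {"region": "EU", "status": "active", "amount": 9.1},
]

fields = records[0].keys()
constant_fields = [f for f in fields if len({r[f] for r in records}) <= 1]
print(constant_fields)  # ['region'] -- skipped when the option is checked
```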
  2. Check Balance classes to balance the class distribution by either undersampling the majority classes or oversampling the minority classes.
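    The sketch below illustrates the oversampling side of class balancing; the labels, counts, and seed are illustrative only and do not reflect the exact sampling scheme the stage uses.
```python
import random
from collections import defaultdict

# Minimal sketch of class balancing: oversample minority classes (or,
# symmetrically, undersample majority classes) so each label is equally
# represented in the training data.
random.seed(0)
labels = ["yes"] * 90 + ["no"] * 10          # imbalanced 90/10 distribution

by_class = defaultdict(list)
for i, y in enumerate(labels):
    by_class[y].append(i)

target = max(len(ix) for ix in by_class.values())   # oversample up to the majority size
balanced = []
for y, ix in by_class.items():
    balanced += ix + random.choices(ix, k=target - len(ix))

print({y: sum(labels[i] == y for i in balanced) for y in by_class})  # {'yes': 90, 'no': 90}
```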
  3. Select a Histogram type to specify the type of histogram used to find optimal split points (see the sketch after this list).
    Auto: Buckets are binned from minimum to maximum in steps of (max-min)/N.
    QuantilesGlobal: Buckets have equal population. This computes nbins quantiles for each numeric (non-binary) column, then refines/pads each bucket (between two quantiles) uniformly (and randomly for remainders) into a total of nbins_top_level bins.
    Random: The algorithm samples N-1 points from minimum to maximum and uses the sorted list of those points to find the best split.
    RoundRobin: The algorithm cycles through all histogram types (one per tree).
    UniformAdaptive: Each feature is binned into buckets of equal step size (not population). This is the quickest method but can lead to less accurate splits if the distribution is highly skewed.
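    The following sketch contrasts equal-width binning (the Auto/UniformAdaptive style) with equal-population binning (the QuantilesGlobal style) on a skewed column; the data, seed, and bin count are illustrative only.
```python
import numpy as np

# Minimal sketch: equal step size vs. equal population for N bins.
rng = np.random.default_rng(42)
x = rng.exponential(scale=2.0, size=1000)    # highly skewed feature
N = 10

# Equal step size: edges from min to max in steps of (max - min) / N.
uniform_edges = np.linspace(x.min(), x.max(), N + 1)

# Equal population: edges at the empirical quantiles.
quantile_edges = np.quantile(x, np.linspace(0, 1, N + 1))

print(np.histogram(x, bins=uniform_edges)[0])    # very uneven counts
print(np.histogram(x, bins=quantile_edges)[0])   # roughly 100 observations per bucket
```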
  4. Select a Categorical encoding (see the sketch after this list).
    Auto: Automatically performs Enum encoding.
    Binary: Converts categories to integers, then to binary, and assigns each binary digit a separate column. This encodes the data in fewer dimensions but with some distortion of the distances. Note: No more than 32 columns can exist per categorical feature.
    Eigen: Uses k columns per categorical feature, keeping only the projections of the one-hot-encoded matrix onto the k-dimensional eigen space.
    Enum: Uses one column per categorical feature; each category is mapped to an integer label.
    OneHotExplicit: Creates one column per category, with "1" or "0" in each cell indicating whether the row contains that column's category.
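    The sketch below shows the three simplest encodings applied to a single categorical feature; the category values are illustrative only.
```python
# Minimal sketch of Enum, Binary, and OneHotExplicit encodings.
colors = ["red", "green", "blue", "green"]
levels = sorted(set(colors))                 # ['blue', 'green', 'red']

# Enum: one column per feature, each category mapped to an integer label.
enum = [levels.index(c) for c in colors]                     # [2, 1, 0, 1]

# Binary: the integer label written in binary, one column per binary digit.
width = max(1, (len(levels) - 1).bit_length())
binary = [[int(b) for b in format(levels.index(c), f"0{width}b")] for c in colors]

# OneHotExplicit: one column per category, 1 where the row has that category.
one_hot = [[int(c == lvl) for lvl in levels] for c in colors]

print(enum, binary, one_hot, sep="\n")
```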
  5. Leave Seed for algorithm and N fold checked and enter a seed number to ensure that the data is split into test and training data the same way each time you run the dataflow. Uncheck this field to get a random split each time you run the flow.
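    The sketch below shows why a fixed seed yields a repeatable split; the 80/20 ratio, row count, and seed value are illustrative only.
```python
import numpy as np

# Minimal sketch: a seeded random generator shuffles rows identically on every run.
def split_indices(n_rows, train_fraction=0.8, seed=1234):
    rng = np.random.default_rng(seed)        # seeded generator -> repeatable shuffle
    order = rng.permutation(n_rows)
    cut = int(n_rows * train_fraction)
    return order[:cut], order[cut:]          # train indices, test indices

a = split_indices(100, seed=1234)
b = split_indices(100, seed=1234)
print(np.array_equal(a[0], b[0]))            # True: same seed, same split
print(np.array_equal(a[0], split_indices(100, seed=None)[0]))  # typically False
```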
  6. If you are performing cross-validation, check N fold and enter the number of folds.
  7. If you are performing cross-validation, check Fold assignment and select an option from the dropdown list (see the sketch after this list).
    Auto: Allows the algorithm to automatically choose an option; currently it uses Random.
    Modulo: Evenly splits the dataset into the folds and does not depend on the seed.
    Random: Randomly splits the data into nfolds pieces; best for large datasets.
    Stratified: Stratifies the folds based on the response variable for classification problems. Evenly distributes observations from the different classes to all sets when splitting a dataset into train and test data. This can be useful if there are many classes and the dataset is relatively small.
    This field is applicable only if you entered a value in N fold and did not select a Fold field.
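    The sketch below illustrates the three explicit fold-assignment schemes for nfolds = 3; the row count, labels, and seed are illustrative only and do not reproduce the engine's exact assignments.
```python
import random
from collections import defaultdict

random.seed(42)
nfolds = 3
labels = ["a", "a", "b", "a", "b", "a", "a", "b", "a", "b", "a", "a"]
n = len(labels)

# Modulo: row i goes to fold i % nfolds; independent of the seed.
modulo = [i % nfolds for i in range(n)]

# Random: each row is assigned to a uniformly chosen fold.
rand = [random.randrange(nfolds) for _ in range(n)]

# Stratified: assign within each class so every fold sees every class.
stratified = [0] * n
by_class = defaultdict(list)
for i, y in enumerate(labels):
    by_class[y].append(i)
for rows in by_class.values():
    random.shuffle(rows)
    for j, i in enumerate(rows):
        stratified[i] = j % nfolds

print(modulo, rand, stratified, sep="\n")
```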
  8. If you are performing cross-validation, check Fold field and select the field that contains the cross-validation fold index assignment from the dropdown list.
    This field is applicable only if you did not enter values for N fold and Fold assignment.
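    The sketch below shows how a user-supplied fold column groups rows into cross-validation folds; the field name "fold_idx" and the row values are purely illustrative.
```python
from collections import defaultdict

rows = [
    {"id": 1, "fold_idx": 0}, {"id": 2, "fold_idx": 1},
    {"id": 3, "fold_idx": 2}, {"id": 4, "fold_idx": 0},
    {"id": 5, "fold_idx": 1}, {"id": 6, "fold_idx": 2},
]

folds = defaultdict(list)
for r in rows:
    folds[r["fold_idx"]].append(r["id"])

# Each fold in turn serves as the holdout set; the rest are training data.
for k, holdout in sorted(folds.items()):
    train = [r["id"] for r in rows if r["fold_idx"] != k]
    print(f"fold {k}: holdout={holdout}, train={train}")
```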
  9. Check Stopping rounds to end training when the Stopping metric does not improve for the specified number of training rounds, and enter the number of unsuccessful training rounds that must occur before training stops (see the sketch after step 11). To disable this feature, specify 0.
    The metric is computed on the validation data (if provided); otherwise, training data is used.
  10. Select a Stopping metric to determine when to quit creating new trees (see the sketch after this list).
    AUC: Area under the ROC curve. Note: Applicable only to binomial models.
    Auto: Defaults to deviance.
    Lifttopgroup: Lift in the top 1% of predictions.
    Logloss: Logarithmic loss.
    Meanperclasserror: The average of the per-class misclassification rates.
    Misclassification: The value of (1 - (correct predictions/total predictions)) * 100.
    MSE: Mean squared error; incorporates both the variance and the bias of the predictor.
    RMSE: Root mean square error; the square root of MSE. Measures the differences between values predicted by a model or an estimator and the values actually observed.
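    The sketch below shows how a few of the listed metrics are computed from predictions; the labels and predicted probabilities are illustrative only.
```python
import math

actual = [1, 0, 1, 1, 0]
prob   = [0.9, 0.2, 0.4, 0.8, 0.3]           # predicted probability of class 1
pred   = [int(p >= 0.5) for p in prob]       # hard class predictions

correct = sum(a == p for a, p in zip(actual, pred))
misclassification = (1 - correct / len(actual)) * 100         # percent wrong

mse = sum((a - p) ** 2 for a, p in zip(actual, prob)) / len(actual)
rmse = math.sqrt(mse)                                          # square root of MSE

logloss = -sum(a * math.log(p) + (1 - a) * math.log(1 - p)
               for a, p in zip(actual, prob)) / len(actual)

print(misclassification, mse, rmse, logloss)
```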
  11. Check Stopping tolerance and enter a value to specify the relative tolerance for metric-based stopping; training ends if the improvement in the stopping metric is less than this value.
    This field is enabled only if you checked Stopping rounds.
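    The sketch below is a simplified illustration of how Stopping rounds and Stopping tolerance interact; the metric history is made up, and the engine's exact rule (for example, how it averages recent rounds) may differ.
```python
# Minimal sketch of metric-based early stopping: stop once the stopping
# metric fails to improve by the relative tolerance for stopping_rounds
# consecutive rounds (lower is better for a loss metric such as logloss).
def should_stop(history, stopping_rounds, stopping_tolerance):
    if stopping_rounds == 0 or len(history) <= stopping_rounds:
        return False                          # 0 disables the feature
    best_before = min(history[:-stopping_rounds])
    recent_best = min(history[-stopping_rounds:])
    # Relative improvement achieved over the last `stopping_rounds` rounds.
    improvement = (best_before - recent_best) / abs(best_before)
    return improvement < stopping_tolerance

logloss_per_round = [0.70, 0.55, 0.48, 0.4799, 0.4799, 0.4801]
print(should_stop(logloss_per_round, stopping_rounds=3, stopping_tolerance=1e-3))  # True
```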
  12. Check Minimum split improvement and enter a value to specify the minimum relative improvement in squared error reduction required for a split to occur.
    When properly tuned, this option can help reduce overfitting. Optimal values are in the range 1e-10 to 1e-3. This field is enabled only if you checked Stopping rounds.
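    The sketch below illustrates the kind of acceptance check this threshold implies for a candidate split; the node values and threshold are illustrative only, not the engine's internal algorithm.
```python
# Minimal sketch: keep a candidate split only if the relative reduction in
# squared error meets the configured minimum split improvement.
def sse(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def split_accepted(node, left, right, min_split_improvement):
    before = sse(node)
    after = sse(left) + sse(right)
    relative_improvement = (before - after) / before
    return relative_improvement >= min_split_improvement

node = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8]
left, right = node[:3], node[3:]
print(split_accepted(node, left, right, min_split_improvement=1e-5))  # True
```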
  13. Click OK to save the model and configuration or continue to the next tab.