Configuring Advanced Options

  1. Leave Ignore constant fields checked to skip fields that have the same value for each record.
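    For reference, the check this option performs can be sketched in a few lines of Python; the field names and records below are purely illustrative, not the stage's internal implementation.
```python
# Minimal sketch of constant-field detection: a field whose value is
# identical for every record carries no information for splitting.
records = [
    {"region": "EU", "status": "active", "amount": 10.0},
    {"region": "EU", "status": "closed", "amount": 12.5},
    {"region": "EU", "status": "active", "amount": 9.1},
]

fields = records[0].keys()
constant_fields = [f for f in fields if len({r[f] for r in records}) <= 1]
print(constant_fields)  # ['region'] -- skipped when the option is checked
```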
  2. Check Balance classes to balance the class distribution by either undersampling the majority classes or oversampling the minority classes.
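    The sketch below illustrates the oversampling side of class balancing; the labels, counts, and seed are illustrative only and do not reflect the exact sampling scheme the stage uses.
```python
import random
from collections import defaultdict

# Minimal sketch of class balancing: oversample minority classes (or,
# symmetrically, undersample majority classes) so each label is equally
# represented in the training data.
random.seed(0)
labels = ["yes"] * 90 + ["no"] * 10          # imbalanced 90/10 distribution

by_class = defaultdict(list)
for i, y in enumerate(labels):
    by_class[y].append(i)

target = max(len(ix) for ix in by_class.values())   # oversample up to the majority size
balanced = []
for y, ix in by_class.items():
    balanced += ix + random.choices(ix, k=target - len(ix))

print({y: sum(labels[i] == y for i in balanced) for y in by_class})  # {'yes': 90, 'no': 90}
```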
  3. Select a Histogram type to specify the type of histogram used to find optimal split points (see the sketch after this list).
    Auto: Buckets are binned from minimum to maximum in steps of (max-min)/N.
    QuantilesGlobal: Buckets have equal population. This computes nbins quantiles for each numeric (non-binary) column, then refines/pads each bucket (between two quantiles) uniformly (and randomly for remainders) into a total of nbins_top_level bins.
    Random: The algorithm samples N-1 points from minimum to maximum and uses the sorted list of those points to find the best split.
    RoundRobin: The algorithm cycles through all histogram types (one per tree).
    UniformAdaptive: Each feature is binned into buckets of equal step size (not population). This is the quickest method but can lead to less accurate splits if the distribution is highly skewed.
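    The following sketch contrasts equal-width binning (the Auto/UniformAdaptive style) with equal-population binning (the QuantilesGlobal style) on a skewed column; the data, seed, and bin count are illustrative only.
```python
import numpy as np

# Minimal sketch: equal step size vs. equal population for N bins.
rng = np.random.default_rng(42)
x = rng.exponential(scale=2.0, size=1000)    # highly skewed feature
N = 10

# Equal step size: edges from min to max in steps of (max - min) / N.
uniform_edges = np.linspace(x.min(), x.max(), N + 1)

# Equal population: edges at the empirical quantiles.
quantile_edges = np.quantile(x, np.linspace(0, 1, N + 1))

print(np.histogram(x, bins=uniform_edges)[0])    # very uneven counts
print(np.histogram(x, bins=quantile_edges)[0])   # roughly 100 observations per bucket
```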
  4. Select a Categorical encoding (see the sketch after this list).
    Auto: Automatically performs Enum encoding.
    Binary: Converts categories to integers, then to binary, and assigns each binary digit a separate column. This encodes the data in fewer dimensions but with some distortion of the distances. Note: No more than 32 columns can exist per categorical feature.
    Eigen: Uses k columns per categorical feature, keeping only the projections of the one-hot-encoded matrix onto the k-dimensional eigen space.
    Enum: Uses one column per categorical feature; each category is mapped to an integer label.
    OneHotExplicit: Creates one column per category, with "1" or "0" in each cell indicating whether the row contains that column's category.
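    The sketch below shows the three simplest encodings applied to a single categorical feature; the category values are illustrative only.
```python
# Minimal sketch of Enum, Binary, and OneHotExplicit encodings.
colors = ["red", "green", "blue", "green"]
levels = sorted(set(colors))                 # ['blue', 'green', 'red']

# Enum: one column per feature, each category mapped to an integer label.
enum = [levels.index(c) for c in colors]                     # [2, 1, 0, 1]

# Binary: the integer label written in binary, one column per binary digit.
width = max(1, (len(levels) - 1).bit_length())
binary = [[int(b) for b in format(levels.index(c), f"0{width}b")] for c in colors]

# OneHotExplicit: one column per category, 1 where the row has that category.
one_hot = [[int(c == lvl) for lvl in levels] for c in colors]

print(enum, binary, one_hot, sep="\n")
```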
  5. Leave Seed for algorithm and N fold checked and enter a seed number to ensure that the data is split into test and training data the same way each time you run the dataflow. Uncheck this field to get a random split each time you run the flow.
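    The sketch below shows why a fixed seed yields a repeatable split; the 80/20 ratio, row count, and seed value are illustrative only.
```python
import numpy as np

# Minimal sketch: a seeded random generator shuffles rows identically on every run.
def split_indices(n_rows, train_fraction=0.8, seed=1234):
    rng = np.random.default_rng(seed)        # seeded generator -> repeatable shuffle
    order = rng.permutation(n_rows)
    cut = int(n_rows * train_fraction)
    return order[:cut], order[cut:]          # train indices, test indices

a = split_indices(100, seed=1234)
b = split_indices(100, seed=1234)
print(np.array_equal(a[0], b[0]))            # True: same seed, same split
print(np.array_equal(a[0], split_indices(100, seed=None)[0]))  # typically False
```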
  6. If you are performing cross-validation, check N fold and enter the number of folds.
  7. If you are performing cross-validation, check Fold assignment and select an option from the dropdown list (see the sketch after this list).
    Auto: Allows the algorithm to automatically choose an option; currently it uses Random.
    Modulo: Evenly splits the dataset into the folds and does not depend on the seed.
    Random: Randomly splits the data into nfolds pieces; best for large datasets.
    Stratified: Stratifies the folds based on the response variable for classification problems. Evenly distributes observations from the different classes to all sets when splitting a dataset into train and test data. This can be useful if there are many classes and the dataset is relatively small.
    This field is applicable only if you entered a value in N fold and did not select a Fold field.
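    The sketch below illustrates the three explicit fold-assignment schemes for nfolds = 3; the row count, labels, and seed are illustrative only and do not reproduce the engine's exact assignments.
```python
import random
from collections import defaultdict

random.seed(42)
nfolds = 3
labels = ["a", "a", "b", "a", "b", "a", "a", "b", "a", "b", "a", "a"]
n = len(labels)

# Modulo: row i goes to fold i % nfolds; independent of the seed.
modulo = [i % nfolds for i in range(n)]

# Random: each row is assigned to a uniformly chosen fold.
rand = [random.randrange(nfolds) for _ in range(n)]

# Stratified: assign within each class so every fold sees every class.
stratified = [0] * n
by_class = defaultdict(list)
for i, y in enumerate(labels):
    by_class[y].append(i)
for rows in by_class.values():
    random.shuffle(rows)
    for j, i in enumerate(rows):
        stratified[i] = j % nfolds

print(modulo, rand, stratified, sep="\n")
```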
  8. If you are performing cross-validation, check Fold field and select the field that contains the cross-validation fold index assignment from the dropdown list.
    This field is applicable only if you did not enter values for N fold and Fold assignment.
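    The sketch below shows how a user-supplied fold column groups rows into cross-validation folds; the field name "fold_idx" and the row values are purely illustrative.
```python
from collections import defaultdict

rows = [
    {"id": 1, "fold_idx": 0}, {"id": 2, "fold_idx": 1},
    {"id": 3, "fold_idx": 2}, {"id": 4, "fold_idx": 0},
    {"id": 5, "fold_idx": 1}, {"id": 6, "fold_idx": 2},
]

folds = defaultdict(list)
for r in rows:
    folds[r["fold_idx"]].append(r["id"])

# Each fold in turn serves as the holdout set; the rest are training data.
for k, holdout in sorted(folds.items()):
    train = [r["id"] for r in rows if r["fold_idx"] != k]
    print(f"fold {k}: holdout={holdout}, train={train}")
```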
  9. Check Stopping rounds to end training when the Stopping metric does not improve for the specified number of training rounds, and enter the number of unsuccessful training rounds that must occur before training stops (see the sketch after step 11). To disable this feature, specify 0.
    The metric is computed on the validation data (if provided); otherwise, training data is used.
  10. Select a Stopping metric to determine when to quit creating new trees (see the sketch after this list).
    AUC: Area under the ROC curve. Note: Applicable only to binomial models.
    Auto: Defaults to deviance.
    Lifttopgroup: Lift in the top 1% of predictions.
    Logloss: Logarithmic loss.
    Meanperclasserror: The average of the per-class misclassification rates.
    Misclassification: The value of (1 - (correct predictions/total predictions)) * 100.
    MSE: Mean squared error; incorporates both the variance and the bias of the predictor.
    RMSE: Root mean square error; the square root of MSE. Measures the differences between values predicted by a model or an estimator and the values actually observed.
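    The sketch below shows how a few of the listed metrics are computed from predictions; the labels and predicted probabilities are illustrative only.
```python
import math

actual = [1, 0, 1, 1, 0]
prob   = [0.9, 0.2, 0.4, 0.8, 0.3]           # predicted probability of class 1
pred   = [int(p >= 0.5) for p in prob]       # hard class predictions

correct = sum(a == p for a, p in zip(actual, pred))
misclassification = (1 - correct / len(actual)) * 100         # percent wrong

mse = sum((a - p) ** 2 for a, p in zip(actual, prob)) / len(actual)
rmse = math.sqrt(mse)                                          # square root of MSE

logloss = -sum(a * math.log(p) + (1 - a) * math.log(1 - p)
               for a, p in zip(actual, prob)) / len(actual)

print(misclassification, mse, rmse, logloss)
```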
  11. Check Stopping tolerance and enter a value to specify the relative tolerance for metric-based stopping; training ends if the improvement in the stopping metric is less than this value.
    This field is enabled only if you checked Stopping rounds.
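    The sketch below is a simplified illustration of how Stopping rounds and Stopping tolerance interact; the metric history is made up, and the engine's exact rule (for example, how it averages recent rounds) may differ.
```python
# Minimal sketch of metric-based early stopping: stop once the stopping
# metric fails to improve by the relative tolerance for stopping_rounds
# consecutive rounds (lower is better for a loss metric such as logloss).
def should_stop(history, stopping_rounds, stopping_tolerance):
    if stopping_rounds == 0 or len(history) <= stopping_rounds:
        return False                          # 0 disables the feature
    best_before = min(history[:-stopping_rounds])
    recent_best = min(history[-stopping_rounds:])
    # Relative improvement achieved over the last `stopping_rounds` rounds.
    improvement = (best_before - recent_best) / abs(best_before)
    return improvement < stopping_tolerance

logloss_per_round = [0.70, 0.55, 0.48, 0.4799, 0.4799, 0.4801]
print(should_stop(logloss_per_round, stopping_rounds=3, stopping_tolerance=1e-3))  # True
```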
  12. Check Minimum split improvement and enter a value to specify the minimum relative improvement in squared error reduction required for a split to occur.
    When properly tuned, this option can help reduce overfitting. Optimal values are in the range 1e-10 to 1e-3. This field is enabled only if you checked Stopping rounds.
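    The sketch below illustrates the kind of acceptance check this threshold implies for a candidate split; the node values and threshold are illustrative only, not the engine's internal algorithm.
```python
# Minimal sketch: keep a candidate split only if the relative reduction in
# squared error meets the configured minimum split improvement.
def sse(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def split_accepted(node, left, right, min_split_improvement):
    before = sse(node)
    after = sse(left) + sse(right)
    relative_improvement = (before - after) / before
    return relative_improvement >= min_split_improvement

node = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8]
left, right = node[:3], node[3:]
print(split_accepted(node, left, right, min_split_improvement=1e-5))  # True
```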
  13. Click OK to save the model and configuration or continue to the next tab.