Configuring Basic Options
- Leave Standardize input fields checked to standardize the numeric columns to have zero mean and unit variance. If you do not use standardization, the results may include clusters dominated by variables that appear to have larger variances than other attributes as a matter of scale rather than true contribution.
- Check Estimate number of clusters to have the K-Means algorithm attempt to determine the number of clusters that your model will contain. Even though you designate the number of desired clusters on the Model Properties tab, the routine may discover in its processing that a different number of clusters is more appropriate given the data.
- Specify a value between 1 and 100 as the Percentage for training data. The input data is randomly split into training and test data samples, and this value sets the share used for training.
- Enter 100 minus the value you entered for Percentage for training data as the Percentage for test data, so that the two percentages sum to 100.
- Enter a number as the Seed for sampling to ensure that the data is split into training and test data the same way each time you run the dataflow. Uncheck this field to get a different random split each time you run the dataflow.
- Click OK to save the model and configuration or continue to the next tab.
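
To make the standardization and seeded sampling options above more concrete, here is a minimal Python sketch of the general techniques. It is an illustration only and not the product's implementation: the function names `standardize` and `split_rows`, the sample data, and the use of the population standard deviation are all assumptions made for the example.

```python
import random
import statistics


def standardize(column):
    """Rescale a numeric column to zero mean and unit variance.

    Assumption: the population standard deviation is used here.
    """
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    if std == 0:
        # A constant column carries no spread; map every value to 0.
        return [0.0 for _ in column]
    return [(x - mean) / std for x in column]


def split_rows(rows, train_percent, seed=None):
    """Randomly split rows into training and test samples.

    The test share is implicitly 100 - train_percent. Passing a seed makes
    the split repeatable; seed=None gives a different split on each run.
    """
    if not 1 <= train_percent <= 100:
        raise ValueError("train_percent must be between 1 and 100")
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_percent / 100)
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical columns on very different scales: income in dollars, age in years.
income = [32000.0, 58000.0, 41000.0, 99000.0, 75000.0]
age = [25.0, 47.0, 33.0, 61.0, 52.0]

# After standardization both columns contribute on a comparable scale.
income_z = standardize(income)
age_z = standardize(age)

rows = list(zip(income_z, age_z))

# The same seed yields the same 80/20 split on every run.
train, test = split_rows(rows, train_percent=80, seed=42)
train_again, test_again = split_rows(rows, train_percent=80, seed=42)
```

Without the standardization step, the income column would dominate any distance calculation simply because its values are thousands of times larger than the age values. Calling `split_rows` with `seed=None` corresponds to unchecking Seed for sampling.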