Configuring Options

This step involves creating a Training Options file that contains information about your model and the options to apply when training it. The file must be in XML format with UTF-8 encoding and must include the following header and the required training features:

Header in the Training Options File

The header specifies the name, type, and description of the model, the paths of the input and test files, and the training algorithm.

  • modelName: Name of the model
  • modelType: The type of model (which is TC, meaning text categorization in this case)
  • modelDescription: Description of the model
  • inputFilePath: Location of the input file used for training the model
  • testFilePath: Location of the test file
    Note: The test file measures the effectiveness of a model. It shows how the custom model behaves with various training parameters. As a best practice, use different input and test files when training or evaluating your custom models.
  • algorithm: The machine learning algorithm used for training the model (default is MaxEnt)
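
As an illustration, the header elements listed above can be generated with a short script. The following Python sketch uses the standard xml.etree.ElementTree module. It assumes the trainingOptions root element used in the sample file later in this section, and the description and file paths are placeholder values, not product defaults.

```python
import xml.etree.ElementTree as ET

# Header fields in the order listed above. All values are placeholders
# chosen for illustration; the file paths are hypothetical.
HEADER_FIELDS = {
    "modelName": "modelone",
    "modelType": "TC",
    "modelDescription": "Sample text categorization model",
    "inputFilePath": "C:/data/train_input.csv",
    "testFilePath": "C:/data/train_test.csv",
    "algorithm": "MaxEnt",
}


def build_header(fields):
    """Create a trainingOptions element holding one child per header field."""
    root = ET.Element("trainingOptions")
    for name, value in fields.items():
        ET.SubElement(root, name).text = value
    return root


root = build_header(HEADER_FIELDS)
if hasattr(ET, "indent"):  # pretty-printing helper, available in Python 3.9+
    ET.indent(root)
header_xml = ET.tostring(root, encoding="unicode")
print(header_xml)
```

To save the result as a UTF-8 encoded file, ET.ElementTree(root).write(path, encoding="UTF-8", xml_declaration=True) can be used. The training features still have to be added before the file is complete.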

Training Features

These are the training features you can use to create a new category.
Note: If you use multiple features, they can be placed in any order within the file.
  • Linguistic feature: To specify the language properties
    • Stemming: Reduces words to their stem, or root. For example, "insurer", "insured", and "insures" can all be reduced to the root "insure".
      <trainingFeature>
      		<featureName>Stemming</featureName>
      </trainingFeature>
  • Keyword features: To define the list of keywords
    • IgnoreWords: Also known as stop words, this feature filters out common words that have no effect on categorization, such as "the", "and", and "but". Separate the words with commas only, with no spaces. You can also use the Append key with this feature; when it is set to "True", the words are added to the existing list of stop words.
      <trainingFeature>
      		<featureName>IgnoreWords</featureName>
      		<featureParams>
      			<entry>
      				<key>WordList</key>
      				<value>
      					and,the,for,with,still,tri,rep,cust,keep,get,req,call
      				</value>
      			</entry>
      			<entry>
      				<key>Append</key>
      				<value>True</value>
      			</entry>
      		</featureParams>
      </trainingFeature>
    • CategoryKeywords: Identifies a category for keywords that are grouped into multiple custom lists. For example, the Weekdays list in CategoryKeywords contains the keywords Monday, Tuesday, Wednesday, Thursday, and Friday.

      The optional CaseSensitive key specifies whether the match should be case sensitive; its default value is true.

      <trainingFeature>
      	<featureName>CategoryKeywords</featureName>
      	<featureParams>
      		<entry>
      			<key>Weekdays</key>
      			<!-- List of weekdays -->
      			<value>Monday,Tuesday,Wednesday,Thursday,Friday</value>
      		</entry>
      		<entry>
      			<key>WeekendDays</key>
      			<!-- List of weekend days -->
      			<value>Saturday,Sunday</value>
      		</entry>
      		<entry>
      			<key>CaseSensitive</key>
      			<value>True</value>
      		</entry>
      	</featureParams>
      </trainingFeature>
    • KeyWords: Searches for words that you have specified as belonging to a custom list, such as DaysOfWeek or Month. The optional CaseSensitive key specifies whether the match should be case sensitive; its default value is true.
      <trainingFeature>
      	<featureName>KeyWords</featureName>
      	<featureParams>
      		<entry>
      			<key>KeyWordList</key>
      			<value>Monday,Tuesday</value>
      		</entry>
      		<entry>
      			<key>CaseSensitive</key>
      			<value>False</value>
      		</entry>
      	</featureParams>
      </trainingFeature>
  • Lexical feature: To specify the lexeme properties
    • NGram: Searches for a portion of a longer string, with "n" representing the number of words to look for. For example, if you are looking for the phrase "to be or not to be", you might search for a unigram of "to" or "be", a bigram of "to be" or "or not", or a trigram of "to be or" or "not to be".
      <trainingFeature>
      		<featureName>NGram</featureName>
      		<featureParams>
      			<entry>
      				<key>Count</key>
      				<value>3</value>
      			</entry>
      		</featureParams>
      </trainingFeature>
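
The unigram, bigram, and trigram examples above can be reproduced with a few lines of code. The following Python sketch only illustrates what a word n-gram is. It assumes simple whitespace tokenization, and the helper name word_ngrams is hypothetical; how the product tokenizes text, and whether a Count of 3 also includes shorter n-grams, is not stated here.

```python
def word_ngrams(text, n):
    """Return every run of n consecutive words in text, in order."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


phrase = "to be or not to be"
unigrams = word_ngrams(phrase, 1)
bigrams = word_ngrams(phrase, 2)
trigrams = word_ngrams(phrase, 3)

print(bigrams)   # ['to be', 'be or', 'or not', 'not to', 'to be']
print(trigrams)  # ['to be or', 'be or not', 'or not to', 'not to be']
```
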
A sample training options file:
<trainingOptions>
	<modelName>modelone</modelName>
	<modelType>TC</modelType>
	<modelDescription>modelOne</modelDescription>
	<inputFilePath>C:/SpectrumIE/textclassification/train_Input.csv</inputFilePath>
	<testFilePath>C:/SpectrumIE/textclassification/train_Test.txt</testFilePath>
	<algorithm>SVM</algorithm>
	
	<trainingFeatures>

	<!-- Keyword features -->	
	<trainingFeature>
		<featureName>IgnoreWords</featureName>
		<featureParams>
			<entry>
				<key>WordList</key>
				<value>
					and,the,for,with,still,tri,rep,cust,keep,get,req,call
				</value>
			</entry>
			<entry>
				<key>Append</key>
				<value>True</value>
			</entry>
		</featureParams>
	</trainingFeature>

	<trainingFeature>
		<featureName>CategoryKeywords</featureName>
		<featureParams>
			<entry>
				<key>Category1</key>
				<value>CategoryKeyword1,CategoryKeyword2</value>
			</entry>
			<entry>
				<key>Category2</key>
				<value>CategoryKeyword3,CategoryKeyword4</value>
			</entry>
		</featureParams>
	</trainingFeature>

	<trainingFeature>
		<featureName>KeyWords</featureName>
		<featureParams>
			<entry>
				<key>KeyWordList</key>
				<value>
					jam,misfeed,install,help,mechanical,failure,pc,connection
				</value>
			</entry>
		</featureParams>
	</trainingFeature>

	<!-- Linguistic feature -->
	<trainingFeature>
		<featureName>Stemming</featureName>
	</trainingFeature>
	
	<!-- Lexical feature -->
	<trainingFeature>
		<featureName>NGram</featureName>
		<featureParams>
			<entry>
				<key>Count</key>
				<value>3</value>
			</entry>
		</featureParams>
	</trainingFeature>

	</trainingFeatures>
</trainingOptions>
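
Because a single malformed tag makes the whole file invalid XML, it can help to check the file before training. The sketch below is one possible pre-check using Python's standard xml.etree.ElementTree module. The function name check_training_options is hypothetical and not part of the product, and the check covers only XML well-formedness, not whether the product accepts the options.

```python
import xml.etree.ElementTree as ET


def check_training_options(xml_text):
    """Parse xml_text and return (feature_names, error).

    error is None when the text is well-formed XML; otherwise it holds
    the parser message and feature_names is empty.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [], str(exc)
    names = [el.text.strip() for el in root.iter("featureName") if el.text]
    return names, None


# A minimal well-formed document and a fragment with a broken closing tag.
good = (
    "<trainingOptions><trainingFeatures><trainingFeature>"
    "<featureName>Stemming</featureName>"
    "</trainingFeature></trainingFeatures></trainingOptions>"
)
broken = "<entry><key>Category1/key><value>a,b</value></entry>"

good_names, good_error = check_training_options(good)
broken_names, broken_error = check_training_options(broken)
print(good_names, good_error)
print(broken_names, broken_error)
```

For a file on disk, the text can be read with open(path, encoding="utf-8").read() and passed to the same function.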