Configuring Options for Custom Entities

Configuring options for custom entities involves creating a Training Options file that describes your model and the options to apply when training it. This file must be in XML format with UTF-8 encoding and must include the following header and the required training features:

Header in the Training Options File

The header specifies details of the model, the paths of the input and test files, the keyword used to annotate the custom entities, and the language of the text.

  • modelName: Name of the custom model
  • modelType: The type of the custom model (which is CustomEntity).
  • modelDescription: Description of the custom model
  • inputFilePath: Path of the tagged file used for training the model (input file)
  • testFilePath: Path of the file used for testing the model
  • magicWord: Keyword used to annotate the custom entities
  • language: The language used in the text.
    Note: English is supported. Dutch, French, German, and Spanish are in the beta phase.
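
As a minimal sketch, the header elements listed above could appear at the top of the Training Options file as follows. The element values are reused from the complete sample in this section. The XML declaration on the first line is an assumption: it is the standard way to state UTF-8 encoding in an XML file, but this section does not say whether the product requires it.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<trainingOptions>
	<!-- Header: describes the model and the files used to train and test it -->
	<modelName>CustomModel</modelName>
	<modelType>CustomEntity</modelType>
	<modelDescription>CustomDiagnosesModel</modelDescription>
	<inputFilePath>C:/SpectrumIE/custom_model/Custom_Input.csv</inputFilePath>
	<testFilePath>C:/SpectrumIE/custom_model/Custom_Test.txt</testFilePath>
	<magicWord>DIAGNOSIS</magicWord>
	<language>English</language>

	<!-- The trainingFeatures element with the selected training features follows the header -->
</trainingOptions>
```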

Training Features

You can use these training features to create the custom entities.

  • Linguistic features: To specify the language properties
    • POSTagger: Tags words to identify parts of speech, such as nouns, pronouns, adjectives, and verbs.
      <trainingFeature>             
         <featureName>POSTagger</featureName>
      </trainingFeature>
  • Orthographic features: To specify the structural properties
    • CaseIdentifier: Identifies whether the custom entities are all uppercase, all lowercase, or mixed case.
      <trainingFeature>
      	<featureName>CaseIdentifier</featureName>
      </trainingFeature>
    • NumericIdentifier: Identifies whether the custom entities are numeric or alphanumeric.
      <trainingFeature>
      	<featureName>NumericIdentifier</featureName>
      </trainingFeature>
    • 1st2ndIdentifier: Identifies whether the custom entities are ordinals, such as 1st, 2nd, and 3rd.
      <trainingFeature>
      	<featureName>1st2ndIdentifier</featureName>
      </trainingFeature>
    • PatternMatcher: Matches words against one or more patterns using regular expressions. When multiple expressions are provided, the JoinCondition parameter specifies whether a word must match all expressions (AND) or any expression (OR, the default).
      <trainingFeature>
      	<featureName>PatternMatcher</featureName>
      		<featureParams>
      			<entry>
      				<key>RegEx1</key>
      				<value>b[aeiou]t</value>
      			</entry>
      			<entry>
      				<key>RegEx2</key>
      				<value>b[xyz]t</value>
      			</entry>
      			<entry>
      				<key>JoinCondition</key>
      				<value>AND</value>
      			</entry>
      		</featureParams>
      </trainingFeature>
  • Keyword features: To define the list of keywords
    • CategoryKeywords: Identifies the category of a keyword when keywords are organized into multiple custom lists. For example, the Weekdays category in the CategoryKeywords list contains the keywords Monday, Tuesday, Wednesday, Thursday, and Friday.

      You can optionally specify whether the match should be case sensitive. When this feature is used, the default is true.

      <trainingFeature>
      	<featureName>CategoryKeywords</featureName>
      	<featureParams>
      		<entry>
      			<key>Weekdays</key>
      			<!-- List of weekdays -->
      			<value>Monday,Tuesday,Wednesday,Thursday,Friday</value>
      		</entry>
      		<entry>
      			<key>WeekendDays</key>
      			<!-- List of weekend days -->
      			<value>Saturday,Sunday</value>
      		</entry>
      		<entry>
      			<key>CaseSensitive</key>
      			<value>True</value>
      		</entry>
      	</featureParams>
      </trainingFeature>
    • KeyWords: Searches for words that you have specified as belonging to a custom list, such as DaysOfWeek or Month. You can optionally specify whether the match should be case sensitive. When this feature is used, the default is true.
      <trainingFeature>
      	<featureName>KeyWords</featureName>
      	<featureParams>
      		<entry>
      			<key>KeyWordList</key>
      			<value>Monday,Tuesday</value>
      		</entry>
      		<entry>
      			<key>CaseSensitive</key>
      			<value>False</value>
      		</entry>
      	</featureParams>
      </trainingFeature>
    • Substring: Extracts part of a string as specified in the parameters. Can also be used for prefix and suffix extractions.
      • StartLocation: Location from which the substring is extracted, either Left or Right. Default is Left.
      • StartPosition: Start position for the substring. Default is 0.
      • EndPosition: End position for the substring. Default is 3.
      • MinLength: Minimum length of word to which this feature should apply. Default is 3.
      <trainingFeature>
      	<featureName>Substring</featureName>
      		<featureParams>
      			<entry>
      				<key>StartLocation</key>
      				<!-- Left is the default location -->
      				<value>Left</value>
      			</entry>
      			<entry>
      				<key>StartPosition</key>
      				<value>1</value>
      			</entry>
      			<entry>
      				<key>EndPosition</key>
      				<value>4</value>
      			</entry>
      			<entry>
      				<key>MinLength</key>
      				<!-- 3 is the default minimum length -->
      				<value>3</value>
      			</entry>
      		</featureParams>
      </trainingFeature>
  • Lexical features: To specify the lexeme properties
    • FeatureWindow: Specifies the window for feature generation, that is, the number of preceding and succeeding tokens used to create the feature set.
      <trainingFeature>
      	<featureName>FeatureWindow</featureName>
      	<featureParams>
      		<!-- Number of preceding tokens used to create the feature set. Default is 3 -->
      		<entry>
      			<key>Before</key>
      			<value>1</value>
      		</entry>
      		<!-- Number of succeeding tokens used to create the feature set. Default is 3 -->
      		<entry>
      			<key>After</key>
      			<value>2</value>
      		</entry>
      	</featureParams>
      </trainingFeature>
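
The PatternMatcher and Substring parameters described above can be varied as in the following sketch. The first feature uses the OR join condition, so a word matching either expression qualifies. The second feature illustrates suffix extraction by setting StartLocation to Right. How positions are counted when StartLocation is Right is not stated in this section, so the assumption here is that they count from the end of the word; the MinLength value of 5 is purely illustrative.

```xml
<!-- Match words that satisfy either regular expression (OR is the stated default) -->
<trainingFeature>
	<featureName>PatternMatcher</featureName>
	<featureParams>
		<entry>
			<key>RegEx1</key>
			<value>b[aeiou]t</value>
		</entry>
		<entry>
			<key>RegEx2</key>
			<value>b[xyz]t</value>
		</entry>
		<entry>
			<key>JoinCondition</key>
			<value>OR</value>
		</entry>
	</featureParams>
</trainingFeature>

<!-- Suffix extraction. Assumption: with Right, positions count from the end of the word -->
<trainingFeature>
	<featureName>Substring</featureName>
	<featureParams>
		<entry>
			<key>StartLocation</key>
			<value>Right</value>
		</entry>
		<entry>
			<key>StartPosition</key>
			<value>0</value>
		</entry>
		<entry>
			<key>EndPosition</key>
			<value>3</value>
		</entry>
		<entry>
			<!-- Illustrative value: skip words shorter than five characters -->
			<key>MinLength</key>
			<value>5</value>
		</entry>
	</featureParams>
</trainingFeature>
```

Verify the suffix behavior against a small test file before relying on it, since the position semantics for Right are assumed rather than documented here.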
      
A complete sample training options file for custom entities is shown below:
<trainingOptions>
	<modelName>CustomModel</modelName>
	<modelType>CustomEntity</modelType>
	<modelDescription>CustomDiagnosesModel</modelDescription>
	<inputFilePath>C:/SpectrumIE/custom_model/Custom_Input.csv</inputFilePath>
	<testFilePath>C:/SpectrumIE/custom_model/Custom_Test.txt</testFilePath>
	<magicWord>DIAGNOSIS</magicWord>
	<language>English</language>

	<trainingFeatures>

		<!-- Lexical features -->
		<trainingFeature>
			<featureName>FeatureWindow</featureName>
			<featureParams>
				<entry>
					<key>Before</key>
					<value>1</value>
				</entry>
				<entry>
					<key>After</key>
					<value>2</value>
				</entry>
			</featureParams>
		</trainingFeature>

		<!-- Orthographic features -->
		<trainingFeature>
			<featureName>CaseIdentifier</featureName>
		</trainingFeature>

		<trainingFeature>
			<featureName>NumericIdentifier</featureName>
		</trainingFeature>
	</trainingFeatures>
</trainingOptions>