Tokenize

%Tokenize([token set],[preserved set]); 

This command is optional. If it is not specified, the default token set is [\s], the regular expression character class for white space characters such as a space, tab, or line break.

Defines the characters that are used to tokenize a field and sets the characters to preserve.

[token set] is a list of characters used to automatically tokenize a field. Tokenizing refers to breaking up a field using delimiters.

Example

%Tokenize([-\s],[-]);

Tokenizes on white space and dashes, preserving the dash as a token.

Note: %Tokenize follows the Java RegEx syntax rules. Use the backslash character "\" to force Open Parser to treat the hyphen and other metacharacters as ordinary characters. For example, the hyphen character (-) can specify either a literal hyphen or a range of characters. If you set the value of %Tokenize to [(-)], Open Parser interprets that as the range of characters between the open parenthesis "(" and the close parenthesis ")". See Command Metacharacters for a complete list of reserved characters.
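
The difference between a hyphen used as a range operator and an escaped hyphen can be checked directly in Java, whose RegEx syntax %Tokenize follows. The following is a minimal sketch using only java.util.regex; the class name HyphenInCharacterClass and the helper matchesChar are hypothetical names chosen for illustration and are not part of Open Parser. Note that a Java string literal needs a doubled backslash ("\\-") to produce the single backslash that the regular expression sees.

```java
import java.util.regex.Pattern;

class HyphenInCharacterClass {
    // Hypothetical helper: returns true if the single character c
    // is matched by the given character class expression.
    static boolean matchesChar(String charClass, char c) {
        return Pattern.matches(charClass, String.valueOf(c));
    }

    public static void main(String[] args) {
        // [(-)] is a range from '(' to ')', so the hyphen itself is not in the set.
        System.out.println(matchesChar("[(-)]", '-'));   // false
        System.out.println(matchesChar("[(-)]", '('));   // true

        // Escaping the hyphen makes it a literal member of the set.
        System.out.println(matchesChar("[(\\-)]", '-')); // true
    }
}
```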

[preserved set] is a regular expression character set that defines which characters in the token set are retained and appear in the list of tokens. For example, if the token set is space and hyphen, and the preserved set is hyphen, "before-after this" is broken into four tokens: 'before', '-', 'after', and 'this'.
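
The interaction between the token set and the preserved set can be approximated in plain Java. This is a sketch under stated assumptions, not Open Parser's actual implementation: the class TokenizeSketch and the method tokenize are hypothetical, and the sketch assumes that each delimiter character is handled individually and that empty tokens between adjacent delimiters are dropped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenizeSketch {
    // Hypothetical helper, not part of Open Parser: splits input at every
    // character matching tokenSet and keeps, as tokens of their own, those
    // delimiters that also match preservedSet.
    static List<String> tokenize(String input, String tokenSet, String preservedSet) {
        Pattern delimiter = Pattern.compile(tokenSet);
        Pattern preserved = Pattern.compile(preservedSet);
        List<String> tokens = new ArrayList<>();
        Matcher m = delimiter.matcher(input);
        int start = 0;
        while (m.find()) {
            // Text between the previous delimiter and this one is a token.
            if (m.start() > start) {
                tokens.add(input.substring(start, m.start()));
            }
            // A delimiter in the preserved set is kept as a token.
            String delim = m.group();
            if (preserved.matcher(delim).matches()) {
                tokens.add(delim);
            }
            start = m.end();
        }
        // Trailing text after the last delimiter.
        if (start < input.length()) {
            tokens.add(input.substring(start));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Mirrors %Tokenize([-\s],[-]); applied to "before-after this".
        System.out.println(tokenize("before-after this", "[-\\s]", "[-]"));
        // prints [before, -, after, this]
    }
}
```

The white space delimiter is discarded because it is not in the preserved set, while the hyphen is emitted as its own token.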

To use this command:

  1. Position the cursor where you want the command inserted.
  2. Double-click %Tokenize in the Commands list.
  3. Click the Token Set arrow to select a RegEx value or type values in the Token Set text box.

    There are several predefined RegEx tags that you can use to define the token set. For more information, see Defining a Culture-Specific Parsing Grammar.

  4. Optionally, select the Characters to preserve check box.
  5. Click the Token set characters to preserve arrow and select a value or type values in the text box.
  6. Click OK.