Working with UTF-8 Data
This topic explains how Trillium handles UTF-8 encoding and presents solutions to issues that can arise when processing UTF-8 data.
What is UTF-8?
UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode. UTF-8 can represent every character in the Unicode character set and is compatible with ASCII.
Each character in UTF-8 is represented by one to four bytes. ASCII characters take one byte. Most accented Latin characters, along with Greek, Cyrillic, Hebrew, and Arabic characters, take two bytes. Three bytes are required for the rest of the Basic Multilingual Plane, which contains virtually all characters in common use, including the CJK (Chinese, Japanese, and Korean) characters. Four bytes are used for characters in the other planes of Unicode, such as various historic scripts.
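The byte counts described above can be checked with a short Python sketch. It is illustrative only and is not part of Trillium; the helper name utf8_byte_length and the sample characters are chosen here for the example.

```python
def utf8_byte_length(char: str) -> int:
    """Return the number of bytes one character occupies in UTF-8."""
    return len(char.encode("utf-8"))


# Sample characters, written as escapes so the source stays ASCII-only.
SAMPLES = {
    "A": "ASCII letter",
    "\u00e9": "Latin small e with acute",
    "\u03b1": "Greek small alpha",
    "\u0639": "Arabic letter ain",
    "\u4e2d": "CJK ideograph (Basic Multilingual Plane)",
    "\U00010348": "Gothic letter hwair (outside the BMP)",
}

for char, description in SAMPLES.items():
    print(f"U+{ord(char):04X} {description}: {utf8_byte_length(char)} byte(s)")
```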
Note: | For delimited files, null characters in UTF-8 will be converted to spaces. |
Multiplication Rule
Because UTF-8 is a variable-length encoding, Trillium treats it differently from other encodings. The Control Center converts UTF-8 data to UCS2, but when converting from UCS2 back to UTF-8 during an export, it multiplies the character length by 3 to obtain the byte length for each UTF-8 attribute in the ddx.
For example, if you have a UTF-8 attribute called Line 1 that is 10 characters long in the Control Center, Line 1 becomes 30 bytes (10 x 3) in the ddx when exported to batch or real-time. This is because TS Quality processes data in fixed-length format and, to avoid data truncation, must allow for the possibility that each UTF-8 character requires three bytes.
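The arithmetic behind the Line 1 example can be sketched in Python as follows. This is a simplified model of the rule as described in this topic, not Trillium code; the function names are hypothetical.

```python
def exported_byte_length(char_length: int, factor: int = 3) -> int:
    """Byte length reserved for a UTF-8 attribute after export."""
    return char_length * factor


def fits_without_truncation(value: str, char_length: int, factor: int = 3) -> bool:
    """True if the UTF-8 bytes of value fit in the reserved byte length."""
    return len(value.encode("utf-8")) <= exported_byte_length(char_length, factor)


line_1_chars = 10
cjk_value = "\u6f22" * line_1_chars  # ten characters of three bytes each

print(exported_byte_length(line_1_chars))                         # 30
print(fits_without_truncation(cjk_value, line_1_chars))           # True
print(fits_without_truncation(cjk_value, line_1_chars, factor=1)) # False
```

With a factor of 3, ten three-byte characters exactly fill the 30 reserved bytes. With a smaller factor the same value would not fit, which is the truncation the rule guards against.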
Guidelines
Note the following guidelines:
■ The multiplication rule is applied to the user-defined UTF-8 attributes for all country projects.
■ The rule is also used when you save a file in the external Schema Editor.
■ The rule is not applied to the standard TSQ output attributes with UTF-8 encoding in an SAP or ZZ project.
■ The rule is not applied to the attribute length in the input schema file(s) when the input data is fixed length. See Enabling UTF-8 Multiplication Rule for All Files and Projects for a workaround.
■ The rule is not applied to legacy projects created prior to V13.0. See Enabling UTF-8 Multiplication Rule for All Files and Projects for a workaround.
While the multiplication rule ensures that UTF-8 data is exported without truncation, it may negatively affect your project in the following ways:
■ Tripling the character length can use more disk space than necessary and slow performance.
■ Processes based on the character position or number of characters, such as schema redefinitions and attribute scans, may fail because the position may become out-of-sync with the ddx after multiplication.
■ An issue can also occur when a UTF-8 attribute references a non-UTF-8 attribute, or vice versa, in a process (for example, the Relationship Linker link file setting).
Solutions to the Multiplication Issues
Depending on your data and processes, you can use the following procedures to solve the issues arising from the multiplication rule.
Multiplication Factor
If you know that you have only single-byte or double-byte UTF-8 characters, you can reduce the multiplication factor from 3 to either 1 or 2 by editing the configuration file.
Note: | You should only do this if you are sure that the data will not be truncated. |
► To change the multiplication factor
1. | Close the Control Center. |
2. | Go to the Server installation directory and look for the etc folder. For example, if you selected the default directory during the Server installation, the etc folder is in C:/Program Files/Trillium Software/MBSW/16. Within that folder is the config.txt file. |
3. | Open the config.txt file in a text editor. |
4. | Locate the following settings: |
key public {
value default_encoding ascii
value utf8_length 3
value str_base 1
value_business_group_limit 0
value_use_fixed_point_maths 1
}
5. | Using the following table, change the default value of ‘3’ for the utf8_length parameter to ‘1’ or ‘2’. |
Value | Description |
---|---|
1 | No multiplication is applied. Use this setting when the data includes only single-byte UTF-8 (ASCII) characters. |
2 | Multiply the UTF-8 data by 2. Use this setting when the data includes double-byte UTF-8 characters (accented Latin, Greek, Arabic, etc.). |
3 | Multiply the UTF-8 data by 3. Use this setting when the data includes 3-byte UTF-8 characters (Asian characters and others). |
Note: | If any other value (for example ‘4’) is specified, it defaults to ‘3’. |
6. | Save the file. |
7. | Restart the Control Center. |
Next time you export the project, Trillium will multiply the character length for the UTF-8 attributes by the number specified in the configuration file.
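Before lowering the factor, it may help to measure the widest character that actually occurs in your data. The Python sketch below assumes you can load a representative sample of attribute values as strings; the function name is hypothetical and the check is not a Trillium feature.

```python
def widest_utf8_char(values) -> int:
    """Return the largest number of UTF-8 bytes used by any single character.

    An empty sample returns 1.
    """
    return max(
        (len(ch.encode("utf-8")) for value in values for ch in value),
        default=1,
    )


sample = ["Main Street", "Caf\u00e9 Central", "\u6771\u4eac"]
print(widest_utf8_char(sample))  # 3
```

A result of 1, 2, or 3 corresponds to the values in the table above. A result of 4 means the sample contains characters outside the Basic Multilingual Plane, a case this topic does not describe a setting for. The result is only as reliable as the sample, so keep the default of 3 when in doubt.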
Schema Redefinitions
An issue arises when UTF-8 attributes are redefined as sub-attributes of the source attribute and the source attribute’s value is copied to each sub-attribute based on offset and number of characters. While this works within the Control Center, once the data is exported and the character lengths are multiplied, the redefinition will no longer work.
Example
You have an 8-character UTF-8 input attribute, Line 1. It is redefined into two UTF-8 sub-attributes in the Schema Editor: Sub 1 (offset = 0, width = 6 characters) and Sub 2 (offset = 6, width = 2 characters). The attributes have the following values.
Line 1: ABCDEFGH
Sub 1: ABCDEF
Sub 2: GH
When exported, the lengths of all three attributes are multiplied by 3: Line 1 is now 24 bytes (8 x 3), Sub 1 is 18 bytes (6 x 3), and Sub 2 is 6 bytes (2 x 3). Because all 8 characters fit within the 18 bytes of Sub 1, both redefined values end up in Sub 1 and there is no data in Sub 2.
Line 1: ABCDEFGH
Sub 1: ABCDEFGH
Sub 2:
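The effect can be reproduced with a small Python model of a fixed-length record. This sketch assumes that the sub-attribute offsets scale with the multiplied widths and that unused bytes are padded with spaces; both are assumptions made for illustration, and the function name is hypothetical.

```python
def split_exported_record(value, parent_chars, sub_widths, factor=3):
    """Split a space-padded fixed-length record into sub-attribute values.

    parent_chars and sub_widths are character lengths as defined in the
    schema; each is multiplied by factor to get the exported byte length.
    """
    record = value.encode("utf-8").ljust(parent_chars * factor, b" ")
    parts = []
    offset = 0
    for width_chars in sub_widths:
        width_bytes = width_chars * factor
        chunk = record[offset:offset + width_bytes]
        # errors="replace" guards against a slice that splits a multi-byte character
        parts.append(chunk.decode("utf-8", errors="replace").rstrip(" "))
        offset += width_bytes
    return parts


print(split_exported_record("ABCDEFGH", 8, [6, 2]))            # ['ABCDEFGH', '']
print(split_exported_record("ABCDEFGH", 8, [6, 2], factor=1))  # ['ABCDEF', 'GH']
```

With a factor of 3 the whole value lands in Sub 1, as in the example. With a factor of 1 the byte positions match the character positions again and the split works as intended.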
There are two ways to solve the issue:
■ Using the Schema Editor, convert any redefined UTF-8 attributes to UCS2 in the first process (Transformer) and then convert them back to UTF-8 in the last process in the flow. See Modifying Attributes for the procedure to change the encoding of an attribute. This method is recommended.
■ If there is no risk of data truncation, change the multiplication factor to 1 in the configuration file. To change the multiplication factor, see the procedure above.
Relationship Linker Link Files
Another possible issue is the link file setting in the Relationship Linker. If you use a UTF-8 attribute as the "Attribute to write to Link file" (LINK_SOURCE_ID, Relationship Linker > Process > Advanced > Linking), it works within the Control Center, but you see an error when you run it after an export. The error message looks like this:
14041E ERROR: DDL file: <C:\project1\batch\ddl/e62_us_srtforrl_p7.ddx> DDL field(length): <FROM_LINK>(<24>) is less than Settings file: <C:\project1\batch\settings\e63_usrellink_p8.stx> parameter: <LINK_SOURCE_ID> DDL field(length): <INPUT_LINE_01>(<72>). Occurred in CMatcher::InitMatcher - (CMatcher::InitPrmVals)
This is because the length of the link source attribute is multiplied but FROM_LINK and TO_LINK, the attributes that store the value of the link source attribute, are fixed as NOTRANS and not multiplied. To avoid this issue, make sure to use a non-UTF-8 attribute for the "Attribute to write to Link file" (LINK_SOURCE_ID) setting.
Enabling UTF-8 Multiplication Rule for All Files and Projects
Generally, the UTF-8 multiplication rule applies only to delimited files, not to fixed-length files.
► To enable the UTF-8 multiplication rule for all types of files and projects
1. | Close the Control Center. |
2. | Go to the Server installation directory and look for the etc folder. For example, if you selected the default directory during the Server installation, the etc folder is in C:/Program Files/Trillium Software/MBSW/16. Within that folder is the config.txt file. |
3. | Open the config.txt file in a text editor. |
4. | Locate the following section of the settings: |
key public {
value max_string_size 32767
value utf8_length 3
value str_base 1
value_business_group_limit 0
value_use_fixed_point_maths 1
}
5. | Add the following setting, with the value "1", at the end of the section to enable the UTF-8 multiplication rule: |
value multiply_utf8_all 1
6. | The section should now look like this: |
key public {
value max_string_size 32767
value utf8_length 3
value str_base 1
value_business_group_limit 0
value_use_fixed_point_maths 1
value multiply_utf8_all 1
}
7. | Save the file. |
8. | Stop and restart the Scheduler. |
9. | Restart the Control Center. |
Next time you export the project, Trillium will apply the UTF-8 multiplication rule to the UTF-8 attributes.
Note: | If you do not specify this setting or specify "value multiply_utf8_all 0", the existing limitations to the rule are applied. |
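If you need to make this change on several servers, the manual edit in steps 4 to 6 could be scripted. The Python sketch below works on the text of config.txt and assumes the section layout shown above (a "key public {" line, one setting per line, and a closing "}" on its own line). The function name is hypothetical; back up the file and close the Control Center before applying anything like this to a real installation.

```python
def enable_multiply_utf8_all(config_text: str) -> str:
    """Return config text with 'value multiply_utf8_all 1' as the last
    setting in the 'key public' section."""
    output = []
    in_public = False
    for line in config_text.splitlines():
        stripped = line.strip()
        if stripped.lower().startswith("key public"):
            in_public = True
        elif in_public and stripped.startswith("value multiply_utf8_all"):
            continue  # drop any existing setting; it is re-added below
        elif in_public and stripped == "}":
            output.append("value multiply_utf8_all 1")
            in_public = False
        output.append(line)
    return "\n".join(output) + "\n"


original = """key public {
value max_string_size 32767
value utf8_length 3
value str_base 1
}
"""
print(enable_multiply_utf8_all(original))
```

Because an existing multiply_utf8_all line is removed before the new one is added, running the function twice produces the same result as running it once.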