10.14.4.2 LDML Syntax Supported in MySQL
This section describes the LDML syntax that MySQL recognizes. This is a subset of the syntax described in the LDML specification available at http://www.unicode.org/reports/tr35/, which should be consulted for further information. MySQL recognizes a large enough subset of the syntax that, in many cases, it is possible to download a collation definition from the Unicode Common Locale Data Repository and paste the relevant part (that is, the part between the <rules> and </rules> tags) into the MySQL Index.xml file. The rules described here are all supported except that character sorting occurs only at the primary level. Rules that specify differences at secondary or higher sort levels are recognized (and thus can be included in collation definitions) but are treated as equality at the primary level.
The MySQL server generates diagnostics when it finds problems while parsing the Index.xml file. See Section 10.14.4.3, “Diagnostics During Index.xml Parsing”.
Characters named in LDML rules can be written literally or in \unnnn format, where nnnn is the hexadecimal Unicode code point value. For example, á can be written literally or as \u00E1. Within hexadecimal values, the digits A through F are not case-sensitive; \u00E1 and \u00e1 are equivalent. For UCA 4.0.0 collations, hexadecimal notation can be used only for characters in the Basic Multilingual Plane, not for characters outside the BMP range of 0000 to FFFF. For UCA 5.2.0 collations, hexadecimal notation can be used for any character.
The Index.xml file itself should be written using UTF-8 encoding.
LDML has reset rules and shift rules to specify character ordering. Orderings are given as a set of rules that begin with a reset rule that establishes an anchor point, followed by shift rules that indicate how characters sort relative to the anchor point.
A <reset> rule does not specify any ordering in and of itself. Instead, it “resets” the ordering for subsequent shift rules to cause them to be taken in relation to a given character. The character that a reset rule names can be written either literally or in \unnnn format.
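As a minimal sketch of the two equivalent forms, assuming the letter 'A' (U+0041) as the anchor, a choice made here purely for illustration:

```xml
<!-- Illustrative anchor: the letter A, written literally -->
<reset>A</reset>
<!-- The same anchor written as a hexadecimal code point -->
<reset>\u0041</reset>
```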
The <p>, <s>, and <t> shift rules define primary, secondary, and tertiary differences of a character from another character:
Use primary differences to distinguish separate letters.
Use secondary differences to distinguish accent variations.
Use tertiary differences to distinguish lettercase variations.
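A primary shift rule is written with the <p> element. As with any character named in an LDML rule, the character it names can be written either literally or in \unnnn format.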
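A short sketch of a primary shift rule in both notations. The character 'G' (U+0047) is an assumption chosen for illustration:

```xml
<!-- Primary shift rule for an illustrative character, written literally -->
<p>G</p>
<!-- The same rule with the character given as a code point -->
<p>\u0047</p>
```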
The <i> shift rule indicates that one character sorts identically to another. For example, an <i> rule for 'b' causes 'b' to sort the same as the character named in the preceding reset rule.
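A sketch of an identity rule, assuming 'a' as the reset character (the anchor is an illustrative choice):

```xml
<!-- Anchor on 'a' (assumed for illustration) -->
<reset>a</reset>
<!-- 'b' sorts identically to the anchor -->
<i>b</i>
```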
Abbreviated shift syntax specifies multiple shift rules using a single pair of tags. The following table shows the correspondence between abbreviated syntax rules and the equivalent nonabbreviated rules.
Table 10.5 Abbreviated Shift Syntax
Abbreviated Syntax Nonabbreviated Syntax
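As a hedged illustration of this correspondence, the following sketch uses the abbreviated element names defined by the LDML specification (<pc>, <sc>, <tc>, <ic>). The characters 'xyz' are placeholders chosen for this example. Each abbreviated rule is followed by a comment giving its nonabbreviated equivalent:

```xml
<pc>xyz</pc>
<!-- equivalent to: <p>x</p><p>y</p><p>z</p> -->
<sc>xyz</sc>
<!-- equivalent to: <s>x</s><s>y</s><s>z</s> -->
<tc>xyz</tc>
<!-- equivalent to: <t>x</t><t>y</t><t>z</t> -->
<ic>xyz</ic>
<!-- equivalent to: <i>x</i><i>y</i><i>z</i> -->
```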
An expansion is a reset rule that establishes an anchor point for a multiple-character sequence. MySQL supports expansions 2 to 6 characters long. For example, a reset rule that names a sequence of three characters, followed by a primary shift rule for 'z', puts 'z' greater at the primary level than that sequence.
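A minimal sketch of an expansion, assuming 'abc' as the three-character anchor sequence (an illustrative choice):

```xml
<!-- Expansion: the anchor is the three-character sequence 'abc' (assumed) -->
<reset>abc</reset>
<!-- 'z' sorts greater than 'abc' at the primary level -->
<p>z</p>
```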
A contraction is a shift rule that sorts a multiple-character sequence. MySQL supports contractions 2 to 6 characters long. For example, a reset rule for a single character, followed by a primary shift rule for 'xyz', puts the sequence of three characters 'xyz' greater at the primary level than that character.
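A minimal sketch of a contraction, assuming 'a' as the single-character anchor (an illustrative choice):

```xml
<!-- Anchor on a single character (assumed to be 'a') -->
<reset>a</reset>
<!-- Contraction: the sequence 'xyz' sorts greater than 'a' at the primary level -->
<p>xyz</p>
```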
Long expansions and long contractions can be used together. For example, a reset rule for one sequence of three characters, followed by a primary shift rule for 'xyz', puts the sequence 'xyz' greater at the primary level than the reset sequence.
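A sketch that combines both, again assuming 'abc' as the anchor sequence for illustration:

```xml
<!-- Long expansion: three-character anchor (assumed to be 'abc') -->
<reset>abc</reset>
<!-- Long contraction: 'xyz' sorts greater than 'abc' at the primary level -->
<p>xyz</p>
```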
Normal expansion syntax uses <x> plus <extend> elements to specify an expansion. For example, rules of this form can put the character 'k' greater at the secondary level than the sequence 'ch'. That is, 'k' behaves as if it expands to a character after 'c' followed by 'h'.
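The 'k' and 'ch' case described above might be written as follows. This is a sketch based on the LDML <x> and <extend> elements, with the reset placed on 'c' and the remaining 'h' supplied by <extend>:

```xml
<!-- Anchor on 'c', the first character of the sequence 'ch' -->
<reset>c</reset>
<!-- 'k' differs from 'c' at the secondary level and extends with 'h' -->
<x><s>k</s><extend>h</extend></x>
```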
This syntax permits long sequences. For example, the sequence 'ccs' can be sorted greater at the tertiary level than another multiple-character sequence.
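A sketch of the long-sequence form, assuming 'cscs' as the sequence that 'ccs' is compared against. That sequence is an illustrative assumption, built here from a reset on 'cs' extended by 'cs':

```xml
<!-- Anchor on the two-character sequence 'cs' (assumed) -->
<reset>cs</reset>
<!-- 'ccs' differs at the tertiary level and extends with 'cs', giving 'cscs' -->
<x><t>ccs</t><extend>cs</extend></x>
```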
The LDML specification describes normal expansion syntax as “tricky.” See that specification for details.
Previous context syntax uses <x> plus <context> elements to specify that the context before a character affects how it sorts. For example, rules of this form can put '-' greater at the secondary level than 'a', but only when '-' occurs after a specified context character.
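A minimal sketch of a previous-context rule, assuming 'b' as the context character (an illustrative choice):

```xml
<!-- Anchor on 'a' -->
<reset>a</reset>
<!-- '-' sorts greater than 'a' at the secondary level, but only after 'b' (assumed context) -->
<x><context>b</context><s>-</s></x>
```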
Previous context syntax can include the <extend> element. For example, such rules can put 'def' greater at the primary level than 'aghi', but only when 'def' occurs after a specified context sequence.
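A sketch that combines <context> with <extend>. The context sequence 'abc' is an assumption chosen for illustration. The comparison target 'aghi' comes from the reset on 'a' extended by 'ghi':

```xml
<!-- Anchor on 'a' -->
<reset>a</reset>
<!-- After the context 'abc' (assumed), 'def' sorts greater than 'a' + 'ghi' at the primary level -->
<x><context>abc</context><p>def</p><extend>ghi</extend></x>
```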
Reset rules permit a before attribute. Normally, shift rules after a reset rule indicate characters that sort after the reset character. Shift rules after a reset rule that has the before attribute indicate characters that sort before the reset character. The following rules put the character 'b' immediately before 'a' at the primary level:
<reset before="primary">a</reset>
<p>b</p>
The before attribute values specify the sort level by name or the equivalent numeric value:
<reset before="primary">
<reset before="1">
<reset before="secondary">
<reset before="2">
<reset before="tertiary">
<reset before="3">
A reset rule can name a logical reset position rather than a literal character:
<first_tertiary_ignorable/>
<last_tertiary_ignorable/>
<first_secondary_ignorable/>
<last_secondary_ignorable/>
<first_primary_ignorable/>
<last_primary_ignorable/>
<first_variable/>
<last_variable/>
<first_non_ignorable/>
<last_non_ignorable/>
<first_trailing/>
<last_trailing/>
For example, a reset rule that names <last_non_ignorable/>, followed by a primary shift rule for 'z', puts 'z' greater at the primary level than nonignorable characters that have a Default Unicode Collation Element Table (DUCET) entry and that are not CJK.
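A sketch of a reset to a logical position, assuming the <last_non_ignorable/> position (the last nonignorable, non-CJK character) as the anchor:

```xml
<!-- Anchor on a logical position rather than a literal character -->
<reset><last_non_ignorable/></reset>
<!-- 'z' sorts greater than that position at the primary level -->
<p>z</p>
```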
Logical positions have the code points shown in the following table.
Table 10.6 Logical Reset Position Code Points
Logical Position Unicode 4.0.0 Code Point Unicode 5.2.0 Code Point
The <collation> element permits a shift-after-method attribute that affects character weight calculation for shift rules. The attribute has these permitted values:
simple: Calculate character weights as for reset rules that do not have a before attribute. This is the default if the attribute is not given.
expand: Use expansions for shifts after reset rules.
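Suppose that '0' and '1' have weights of 0E29 and 0E2A and we want to put all basic Latin letters between '0' and '1' by using a reset rule for '0' followed by shift rules for the letters.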
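Rules for this scenario might look like the following sketch. Using the abbreviated <pc> form for the 26 letters is an assumption made here for brevity. Separate <p> rules would express the same ordering:

```xml
<!-- Anchor on the digit '0' -->
<reset>0</reset>
<!-- Each letter sorts after the previous one at the primary level -->
<pc>abcdefghijklmnopqrstuvwxyz</pc>
```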
For simple shift mode, weights are calculated as follows:
'a' has weight 0E29+1
'b' has weight 0E29+2
'c' has weight 0E29+3
...
However, there are not enough vacant positions to put 26 characters between '0' and '1'. The result is that digits and letters are intermixed.
To solve this, use shift-after-method="expand". Then weights are calculated like this:
'a' has weight [0E29][233D+1]
'b' has weight [0E29][233D+2]
'c' has weight [0E29][233D+3]
...
233D is the UCA 4.0.0 weight for character 0xA48C, which is the last nonignorable character (a sort of the greatest character in the collation, excluding CJK). UCA 5.2.0 is similar but uses 3ACA, the weight of its own last nonignorable character.
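As a sketch of where the attribute goes, the following fragment sets shift-after-method on a <collation> element. The id and name values are placeholders, and the reset on '0' with an abbreviated <pc> rule is an assumed rendering of the digits-and-letters scenario:

```xml
<!-- id and name are placeholders, not a real collation -->
<collation id="nnn" name="utf8_xxx_ci" shift-after-method="expand">
  <rules>
    <reset>0</reset>
    <pc>abcdefghijklmnopqrstuvwxyz</pc>
  </rules>
</collation>
```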
MySQL-Specific LDML Extensions
An extension to LDML rules permits the <collation> element to include an optional version attribute that indicates the UCA version on which the collation is based. If the version attribute is omitted, its default value is 4.0.0. For example, this specification indicates a collation that is based on UCA 5.2.0:
<collation id="nnn" name="utf8_xxx_ci" version="5.2.0">
...
</collation>