For the ucs2_unicode_ci and utf8_unicode_ci collations,
MySQL uses version 4.0.0 of the Unicode Collation Algorithm
(UCA). It was superseded by UCA version 4.1.0, and now
UCA version 5.2.0 exists. We should keep up with
standard requirements.
Background: BUG#11369 was a complaint about a change that occurred
due to a Unicode committee vote in 2004 (IBM and Oracle in favour,
Microsoft and Sybase against). The version-4.1 UCA changes include:
- Support for new Unicode 4.1.0 characters
- Change in weights for existing characters
0C6 LATIN CAPITAL LETTER AE
01FC LATIN CAPITAL LETTER AE WITH ACUTE
01E2 LATIN CAPITAL LETTER AE WITH MACRON
00D8 LATIN CAPITAL LETTER O WITH STROKE
01FE LATIN CAPITAL LETTER O WITH STROKE AND ACUTE
0110 LATIN CAPITAL LETTER D WITH STROKE
0141 LATIN CAPITAL LETTER L WITH STROKE
013F LATIN CAPITAL LETTER L WITH MIDDLE DOT
00D0 LATIN CAPITAL LETTER ETH
0126 LATIN CAPITAL LETTER H WITH STROKE
0600 Arabic number sign and similar characters
- Addition of contractions for Thai
For example, the weight change means that "L with stroke"
was formerly greater than "L" (as in Polish), but now it's
equal to "L" (as in any non-Polish directory where a Polish
name might appear). This change would affect the
Unicode collations in MySQL -- ucs2_unicode_ci, utf8_unicode_ci,
utf16_unicode_ci, utf8mb3_unicode_ci -- and all collations
which are tailorings of the Unicode collations,
except the locale-specific collations ucs2_polish_ci,
utf8_polish_ci, utf16_polish_ci, utf8mb3_polish_ci.
The general collations -- ucs2_general_ci, utf8_general_ci,
utf16_general_ci, utf8mb3_general_ci -- will not be regarded
as tailorings of the Unicode collations. They will not be
affected by the change in weights for existing characters.
They may be affected by the support for new characters,
but the effects do not matter, just do what's convenient.
Collation names
===============
We had a choice between changing current collations
(making upgrades harder) or adding new collations (making
many new collations which are incompatible with current ones).
Due to bad experience with "Sharp S", we now believe that
we should always try to avoid changing current collations.
The proposal is: add new collations.
New collation names will use this name style notation:
- utf8_unicode_500_ci
- utf8_polish_500_ci
- utf8_czech_500_ci
and so on, where "500" is Unicode version.
Language collations
===================
WL#2673 will add only _unicode_500_ci collations.
Language collations will be implemented in a separate WL.
utf8_general_ci never
utf8_bin never
utf8_unicode_500_ci now
utf8_icelandic_500_ci later
utf8_latvian_500_ci later
utf8_romanian_500_ci later
utf8_slovenian_500_ci later
utf8_polish_500_ci later
utf8_estonian_500_ci later
utf8_spanish_500_ci later
utf8_swedish_500_ci later
utf8_turkish_500_ci later
utf8_czech_500_ci later
utf8_danish_500_ci later
utf8_lithuanian_500_ci later
utf8_slovak_500_ci later
utf8_spanish2_500_ci later
utf8_roman_500_ci later (grumble grumble ...)
utf8_persian_500_ci later
utf8_esperanto_500_ci later
utf8_hungarian_500_ci later
utf8_sinhala_500_ci later
Contractions
============
As of version 6.0, MySQL supports only
contractions consisting of two ASCII letters a-zA-Z.
This WL will also make it possible to use contractions
between any two letters (i.e. two characters that Unicode
describes as letters). This will make possible
to add some missing collations, for example Croatian
which requires "D" to be a contraction.
utf8mb3
=======
We won't add new collations to utf8mb3. It is considered
as obsolete and will likely be removed on some later MySQL major
version.
non-BMP characters
==================
The existing implementations of UCA collations sort all
non-BMP characters all together. The new UCA collations
will use real weights for non-BMP characters,
from Unicode-5.0.0 allkeys.txt.
LDML
====
LDML will recognize an optional UCA version attribute
when creating a user defined collation:
This definition will create a 5.0.0 based collation:
<collation id="xxx" name="utf8_name_ci" version="5.0.0">
</collation>
These definitions will create 4.0.0 based collations:
<collation id="xxx" name="utf8_name_ci" version="4.0.0">
</collation>
<collation id="xxx" name="utf8_name_ci">
</collation>
Case folding in the new collations
==================================
LOWER() and UPPER() functions should also follow the new 5.0.0 rules.
For example, Unicode-5.0.0 introduces these characters:
2C00;GLAGOLITIC CAPITAL LETTER AZU;Lu;0;L;;;;;N;;;;2C30;
2C30;GLAGOLITIC SMALL LETTER AZU;Ll;0;L;;;;;N;;;2C00;;2C00
Source: http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
These code points were unassigned in Unicode-4.0.0.
Therefore in MySQL 6.0 we get:
mysql> select hex(lower(_ucs2 0x2c00 collate ucs2_unicode_ci));
+--------------------+
| hex(lower(0x2c00)) |
+--------------------+
| 2C00 |
+--------------------+
1 row in set (0.00 sec)
But in the new version we expect to get
mysql> select hex(lower(_ucs2 0x2c00 collate ucs2_unicode_500_ci));
+--------------------+
| hex(lower(0x2c00)) |
+--------------------+
| 2C30 |
+--------------------+
1 row in set (0.00 sec)
Indeed collation affects folding, as already happens with Turkish I.
For any Unicode collation, the results for LOWER() should match
the recommendation for simple case folding in
http://unicode.org/Public/UNIDATA/CaseFolding.txt
Case folding in _general_ci and _bin collations
===============================================
There are a few choices what to do with case folding
in the old Unicode collations _general_ci and _bin
collations.
1. Don't change case folding in the old collations.
WL#2673 will add new _general_500_ci and _500_bin collations,
with Unicode-5.0.0 case folding rules.
2. WL#2673 will change case folding rules in the existing
_general_ci and _bin collations to conform to Unicode-5.0.0 rules.
3. WL#2673 will neither change the existing case folding rules
in _general_ci and _bin collations, nor add new _general_500_ci
and _500_bin collations with Unicode-5.0.0 folding rules.
This will mean that we don't have "fresh" _general_ci and _bin
collations.
We'll go the third way.
Applying the algorithm in non-Unicode character sets
====================================================
We could apply UCA rules to non-Unicode character sets.
For example "latin1_swedish_500_ci" would use the UCA
algorithm plus Swedish tailoring, and incidentally fix
some problems that exist in latin1_swedish_ci.
See also the comments on BUG#46659.
This would require a separate worklog task.
Simple collation
----------------
A simple collation has one-to-one mappings for
each of the possible byte values, with no allowance
for ignorables or expansions. So most 8-bit
collations will differ from the Unicode collations:
(1) The ignorables will have weights less than space.
(2) The expansions will stop after the first letter.
But for all characters a simple collation has the
same ordering as a Unicode collation, with UCA rules.
Expansions actually depend on character type:
(a) For a letter, take the first letter of the
expansion, and the weight is equal to that.
For example, SHARP S expands to 'SS'.
The DUCET UCA 5.2 weight of 'S' is 1410.
We want SHARP S to be equal to 'S'.
Therefore the weight of SHARP S is 1410.
All this means for a simple collation is;
S = SHARP S. Letters subject to
exspansion, if there's no tailoring, include:
LATIN SMALL LETTER SHARP S, LATIN SMALL LIGATURE OE,
LATIN CAPITAL LIGATURE OE, LATIN CAPITAL LETTER AE,
LATIN SMALL LETTER AE.
(b) For a non-letter, take the first letter of the
expansion, and the weight is just after that.
For example TRADE MARK SIGN expands to 'TM'.
Stop after the first letter of expansion, 'T'.
The DUCET UCA 5.2 weight of 'T' is 1433.
We want TRADE MARK SIGN to be just after 'T'.
Therefore the weight of TRADE MARK SIGN is 1434.
All this means for a simple collation is:
T < TRADE MARK SIGN < U. Non-letters subject to
exspansion, if there's no tailoring, include:
HORIZONTAL ELLIPSIS, VULGAR FRACTION ONE QUARTER,
VULGAR FRACTION ONE HALF, VULGAR FRACTION THREE QUARTERS,
TRADE MARK SIGN.
The ignorable characters, which for simple collations will
always be the control characters, will be sorted before space, thus:
Characters which are ignorable should be in their code
point order before 'space'. (These characters are 0x00 to 0x1f,
0x7f, 0x81, 0x8d, and 0x8f,). Mr Barkov's proposition is:
8-bit characters from 0x00 to 0x1f -- including 0x00! -- will have
weights equal to encodings, and weights for 0x7f,81,8d,8f,90,9d
will be 0x20, 0x21, 0x22, 0x23, 0x24, 0x25 respectively;
however, we want characters 0x9a 0x9b 0x9c 0x9d to equal space.
So, in a non-tailored simple collation, these are the code points
and weights for the characters as far as SPACE:
8bit ucs2 weight name
00 0000 ! 0000 (control) NULL
01 0001 ! 0001 (control)
02 0002 ! 0002 (control)
03 0003 ! 0003 (control)
04 0004 ! 0004 (control)
05 0005 ! 0005 (control)
06 0006 ! 0006 (control)
07 0007 ! 0007 (control)
08 0008 ! 0008 (control) BACKSPACE
0e 000e !*000e (control)
0f 000f !*010f (control)
10 0010 ! 0010 (control)
11 0011 ! 0011 (control)
12 0012 ! 0012 (control)
13 0013 ! 0013 (control)
14 0014 ! 0014 (control)
15 0015 ! 0015 (control)
16 0016 ! 0016 (control)
17 0017 ! 0017 (control)
18 0018 ! 0018 (control)
19 0019 !*0019 (control)
1a 001a !*001a (control)
1b 001b !*001b (control)
1c 001c !*001c (control)
1d 001d !*001d (control)
1e 001e !*001e (control)
1f 001f !*001f (control)
7f 007F !*0020 DELETE
81 0081 !*0021 (control)
8d 008d ! 0022 PARTIAL LINE FEED
8f 008f ! 0023 SINGLE SHIFT THREE
90 0090 ! 0024 DEVICE CONTROL STRING
9d 009d ! 0025 OPERATING SYSTEM COMMAND
09 0009 !*0201 (control) HORIZONTAL TABULATION
0a 000a !*0202 (control) LINE FEED
0b 000b !*0203 (control) VERTICAL TABULATION
0c 000c !*0204 (control) FORM FEED
0d 000d !*0205 (control) CARRIAGE RETURN
20 0020 *020a SPACE
a0 00a0 *020A NO-BREAK SPACE
These rules are arbitrary. If somebody wants to object,
because they want more uniqueness or because they have
some intuition about what makes sense or whatever, they
should object now. It's too late to object about any
collation once it's in a GA version of MySQL.
References
==========
- See the example in BUG#11369, and the correspondence in dev-bugs
or dev-public with subject heading
"BUG#11369 [Ver->Doc]: Unicode ucs2_unicode_ci collation
isn't following Unicode guidelin".
- See Bar's comments about UCA in this docs-private email:
[ mysql internal intranet ] /secure/mailarchive/mail.php?folder=101&mail=6639
- This task is on the roadmap for version 6.x
[ mysql internal intranet 'inside' ] /wiki/MySQL_Server_Future.
- PS: Now Unicode 5 exists. The announcement about 5.0
http://www.unicode.org/versions/Unicode5.0.0/
doesn't indicate that UCA has changed for existing
characters. However, some primary-key weights differ,
as these three lines (from Unicode 4.00, 4.1.0, 5.0.0)
show (for LATIN SMALL LETTER L WITH STROKE):
ftp://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt
0142 ; [.0F36.0020.0002.0142]
ftp://www.unicode.org/Public/UCA/4.1.0/allkeys.txt
0142 ; [.1077.0020.0002.006C][.0000.008D.0002.0335]
ftp://www.unicode.org/Public/UCA/5.0.0/allkeys.txt
0142 ; [.10E5.0020.0002.006C][.0000.008D.0002.0335]
- A question about "L with stroke" appeared on a MySQL forum:
http://forums.mysql.com/read.php?103,115664,115664#msg-115664
- And BUG#27877 incorrect german order in utf8_general_ci
http://bugs.mysql.com/bug.php?id=27877
said utf8_general_ci should be fixed: "SHARP S" should be
changed to be equal to "s", like the manual says.
- WL#4475 Unicode 5.1.
- Thread in dev-private email "Re: WL#1213 and Glagolitic Azu"
[ mysql internal intranet ] /secure/mailarchive/mail.php?folder=4&mail=20614
You must be logged in to tag this worklog