WL#896: Primary, Secondary and Tertiary Sorts

Affects: Server-6.x — Status: Assigned — Priority: Low

The Unicode Collation Algorithm has guidelines for sorting  
characters. An introductory article is:  
"Collations"
http://web.archive.org/web/20030401193640/www.dbazine.com/gulutzen1.html
Our authoritative text is:  
"Unicode Technical Standard #10 Unicode Collation Algorithm"  
http://www.unicode.org/reports/tr10/  
which we'll call "The UCA document".  
  
Trying to summarize in one sentence:  
For sort keys and indexes, instead of storing the characters  
themselves, store the character weights, so that collation  
will be possible (approximately) as the UCA document requires.  
  
Alexander Barkov (Bar) and Sergei Golubchik believed this task
(WL#896) could be in 5.0, which was the original plan. Clearly
it will be later.  
  
This has some possible interest for a Japanese customer  
[ name deleted, see Progress Reports ] which suggested a patch
for MySQL 4.1, but (as Sergei points out) they can use their own
patch with 4.1 while we work toward WL#896.  

Sergei, Bar, and I (Peter Gulutzan) agree that the Japanese-customer
patch (also known as utf8_general_cs)  
is good for their particular purpose but not for our usual  
customers. They do two passes: sort according to the MySQL  
official collation (which I suppose is sjis_japanese_ci),  
then for all values which are "equal" in that collation,  
compare using memcmp. This means that a level-3 difference  
trumps a level-2 difference, e.g. 'A' < 'ae'. The  
Japanese-customer solution is good -- these are intelligent
people and the solution works -- but we merely say that our
plan must fit other, non-Japanese, situations as well.

WL#896 is a prerequisite for new Hungarian collations
(WL#2993) and standard Japanese (WL#2555).


The "allkeys.txt" file discussed in the HLS is here:
http://www.unicode.org/Public/UCA/latest/allkeys.txt

See also:
BUG#34130 incorrect french order in utf8_unicode_ci
The name of the new collation should be      
ucs2_unicode_w3 etc., not ucs2_unicode_cs etc.
and not utf8_general_cs etc. The rationale for this is:
The collation is not case sensitive at level-1, and the
suffix _cs should be reserved for case sensitive at
level-1. The suggestion _w3 means "weight 3".
This is vaguely like the mimer convention,
SWEDISH_3 means "Swedish with level 3":
http://developer.mimer.com/features/feature_21.htm.

Two new collations will be ucs2_unicode_w3
and utf8_unicode_w3. They will follow exactly the
allkeys.txt guideline for level-1. They should
follow allkeys.txt for level-3. The implementor
can decide what version of allkeys.txt to use.

UNDECIDED ISSUE: in allkeys.txt, lower case precedes
upper case. Sergei finds this "non-intuitive" but
does not oppose. Bar was "surprised" and therefore
coded with upper case first, which I guess means he
opposes. (Bar points that the Japanese-customer patch would
also put capital before small.) I (Peter Gulutzan) am
"shocked" but say we must follow allkeys.txt, and behaviour
should not even be optional.
(Update: 2006-03-09: we'll go with "upper case first".)

DEFERRED ISSUE: in allkeys.txt there is no specification for:
NULL when NULL sorts low, as it always does now
NULL when NULL sorts high, as it might someday (WL#913)
End-of-array-element
Bar regards the decision about this as a separate task.

It is possible to define any column with a _w3
collation. However, for conditional expressions and
distinct and grouping, there is no
difference between ucs2_unicode_w3 and ucs2_unicode_ci.
You only see a difference with ORDER BY expression,
where expression -- either because a column is defined
with a _w3 collation or because there is a COLLATE
clause -- has a _w3 collation.

There is no two-level collation, and there is no
four-level collation outside Japanese.
Maybe there should someday be a one-level collation,
it might be faster but might need more space.
(One-level collations are the norm now, but there
is a distinction between a one-level index of
characters and a one-level index of weights.)

A _w3 collation is always compatible with a _ci
collation, because we will always use the _ci collation.
For example, one can concatenate a _w3 value
with a _ci value, the result will have a _ci collation.

Bar will try compression tricks to keep the sort key
length reasonable, but we realize
that lots of memory is necessary. One suggestion is:
make an initial assumption "multiplier = 6", try to
make keys, if overflow occurs say "multiplier++" and
try again or else flag the keys that will need special
handling after the sort pass. Details are not part of
this description but should be in low-level architecture
or in a separate worklog task. Another suggestion is:
since the later levels are usually only one or two
bits, and the first level is less than 16 bits, we
could try to store all levels in one 16-bit word,
and do multiple passes with appropriate ANDs / ORs.

UNDECIDED ISSUE: The new collations are NO PAD.
I have stated this without justification. and I
do not believe that there is agreement about it.
(Update: 2006-03-09: we'll go with "PAD SPACE".)

I and Bar agree that we should follow the UCA document
"1.6 Interleaved Levels" for "ORDER BY a,b" by
comparing all level-1 weights of a and b, then
comparing all level-2 weights of a and b.

Supersets
---------

Originally there was an idea:

- If column1 is defined as CHARACTER SET latin1, then
ORDER BY column1 COLLATE utf8_unicode_w3
is legal.

In other words, we want a collation of any superset
to be applicable to a subset. E.g.: a latin1_xxx
collation to 'ascii' character set, and so on.
However, I back away from this proposition.
It's a separate task, I suppose.

w3a
----

In _w3 collations the level-2 and level-3
weights are applicable only for ORDER BY.

Variant collations which have the suffix _w3a
("weight 3 always" or "weight 3 strong") are applicable for all
operations, including UNIQUE, > and = and <
comparisons (but not LIKE), GROUP BY, ORDER BY.

The UCA document allows the option, saying:
" For comparison, most of the time a three-level strength
will need to be used. In some cases, a larger number of
levels will be needed, while in others - especially in
searching - fewer levels will be desired."
...
"Users of search algorithms should be allowed to modify
the comparison strength, thus excluding differences at
less significant levels. This is especially useful for
searching, but can also apply to comparison."

Bar points out that latin2_czech_ci already works
like a _w3a collation, and the code that he already
has done is for a _w3a collation, and this would be
compatible with the Japanese-customer patch.

The standard Japanese collation (WL#2555) is _w4a.

Alternate weighting
-------------------

There will be no alternate weighting of punctuation
characters. I throw this in because mimer has an
option to give punctuation alternate weighting,
which seems to mean that it's ignored till level 4.

With respect to this, Bar says:
"At the first stage [no alternate weighting].
But in the future, why not?"

WEIGHT_STRING
-------------

The WEIGHT_STRING function will return a binary string
of weights. Comparable functions are Oracle's NLSSORT
and Microsoft's tertiary_weights.

This is now a separate completed task, WL#3716 WEIGHT_STRING function.
See also WL#3804 WEIGHT_STRING deferred features.

By default, there will be no compression, we will not
implement any of the suggestion in the UCA description
http://www.unicode.org/reports/tr10/tr10-12.html
section 6.1 "Reducing sort key lengths".

Indexing
--------

I would be happy if we did WL#896 with no indexes,
but apparently Bar has already done something, and
I suppose that the Japanese customer won't be happy unless we
have a solution with indexing.

I asked a few questions, of varying quality.

PETER: "If we have a partial index, are all levels partial?"

SERG: "Why is it a question ? It's an index on LEFT(col, N).
Definition of a character does not depend on collation, so
for WL#896 it doesn't matter whether an index is partial
or not. Thus - not relevant."

BAR: "Yes. I agree with Serg: Just think of a partial
index as of LEFT(x, N)."

[added by Sergei at 2007-02-14:
As Bar pointer out later, partial index can not be an
index on LEFT(col, N) - as it could produce an order
different from the ordering by col. We cannot allow that,
optimizer could not use such an index. So, a partial index
must have first N weight of a string, not N characters]

The final revised answer to the question is in the email
"Re: WL#3664 - strnxfrm() and UNIQUE keys" from Bar, 2007-02-15:
"
1. all levels for non-prefix non-unique keys
2. only "compare" levels for prefix non-unique keys
3. only "compare" levels for non-prefix unique keys
4. only "compare" levels for prefix unique keys

For the case N1 engines will silently force strnxfrm to use all levels.
For the cases N2-N4, engines will ask collations which levels to use.
"

PETER: "Will the optimizer give priority to this
[i.e. _w3] index type?"

SERG: "optimizer is not part of WL#896
not relevant."

BAR: "We cannot have indexes with different collations on
the same column now. But I think we should add this
possibility. It should be done this way: If we have
a _ci and _w3 index on the same column, then _w3
index should be picked up in the optimizer for ORDER BY,
but a _ci index for GROUP/DISTINCT/WHERE optimization.
What do you think?"

PETER: "Can the index be used as a covering index?"

SERG: "not relevant for WL#896"

BAR: "What does it mean?"

PETER: "It would mean that, if you could derive the
original character value from the weight (which is
not always guaranteed by the UCA document), then you
could answer SELECT x_w3 FROM t WHERE x_w3>'a';
using the index alone, without reading the data file."

PETER: "Is LIKE 'A_' going to use the index?"

BAR: "LIKE uses first level weights only.
If we have only _w3 index, then yes, it is fine
to use this index for LIKE comparison as well.
If we have a choice between _ci and _w3, then _ci should
be used. Do you agree?"

PETER: "Is a UNIQUE index compared for level-1 only?"

SERG: "As I understand - yes, of course."

BAR: "Yes."

[Note added 2007-01-29] the new answer is "Yes, except for
Hungarian and Japanese".

PETER: "If final results for "column='x'" won't be in order by
row number (assuming there is no ORDER BY clause),
will that affect optimizer merging / some applications?"

SERG: "Pardon me ?"

BAR: "As we introduce a new collation, we cannot break anything :)"

PETER: "I withdraw the question."

Trailing spaces
---------------

Trailing spaces (U+0020) will be represented at level-1.
Trailing spaces will not be represented at level-2 or level-3.
Trailing spaces will be represented at level-4.


Prefix keys
-----------
It was decided not to implement prefix keys for the
multi-level collations in 6.1:
- They require significant implementation efforts
- Functional keys will eventually replace prefix keys:
CREATE INDEX i1 ON t1 (LEFT(c1))


Email archives
--------------

The contents of this task come from some emails
exchanged in November 2004 between Alexander Barkov,
Sergei Golubchik, and Peter Gulutzan, threads
"MySQL 4.1 for [name of another Japanese customer deleted,
see progress reports]" and "MySQL 4.1 case sensitive patch for
[name of another Japanese customer deleted, see progress reports]".
Earlier discussions are in scrum-tasks, scrum-daily, and support archives:

Thread "Character sets" [January 2004]
https://intranet.mysql.com/secure/mailarchive/mail.php?folder=76&mail=1080

Thread "Bar's proposed scrum todo for February" [February 2004]
https://intranet.mysql.com/secure/mailarchive/mail.php?folder=78&mail=898

Thread "[req] utf8_general_cs patch" [July 2004]
https://intranet.mysql.com/secure/mailarchive/mail.php?folder=12&mail=71247

"What the @!#$% is the TERTIARY_WEIGHTS() function for?" (Microsoft blogger)
https://blogs.msdn.com/michkap/archive/2006/06/02/615571.aspx
Detailed design discussinos have taken place via
email; no further LLD needed.

You must be logged in to tag this worklog

No Comments yet

Votes

Not yet rated.
You must be logged in to vote.

Watches

1 members are watching this worklog
You must be logged in to track this worklog.

Provide Feedback

Please note:
HTML will be purified, but we allow for a number of HTML tags so that you have the flexibility to decorate your comment text to some extent. The comments allow the following HTML tags:

strong, b, em, blockquote, a, code, pre

To put code into your comment, simply encapsulate your code with
[code language="XXX"][/code], where XXX is any common language, for instance "PHP", "SQL", "C", etc.



You must be logged in to comment