WL#4164: Two-byte collation IDs

Affects: Server-5.5
Status: Complete
Priority: Medium

After adding UTF32, UTF16 and a new version of UTF8 (with MB4 support),
we have used up almost all charset+collation IDs which can fit into
1 byte. We need to switch to two-byte IDs.

Character set + collation encoding
==================================

It would be convenient to encode the character set ID and the collation ID
separately within these two bytes. This would make connectors easier to
maintain, because usually only the character set is important on the client
side and the collation does not really matter. Adding new collations to the
server then won't cause an urgent need to recompile all connectors to
understand a new charset/collation pair: it will be enough to put the new
collation ID into the same range as all other collations for the same
character set.

Charset+collation IDs can be encoded as:

Proposal 1:
- 7 bits to encode character set (128 character sets)
- 9 bits to encode collation (512 collations)

Proposal 2:
- 6 bits to encode character set (64 character sets)
- 10 bits to encode collation (1024 collations)

Proposal 3:
- 8 bits to encode character set (256 character sets)
- 8 bits to encode collation (256 collations)

Proposal 4:
Or it can be a floating encoding:
- 32 character sets with 1024 collations (32768 charset+collation pairs total)
- 64 character sets with 512 collations (32768 charset+collation pairs total)

Floating encoding is preferable, because many character sets have only a
limited number of collations: 8-bit character sets, East Asian character
sets (Big5, GB2312, SJIS, UJIS, CP932), and some others. Only a few
character sets can have many collations, namely the Unicode character sets:
UTF8, UCS2, UTF16, UTF32.

User-defined collations
=======================

A special ID range, 1024..2047, is defined for user-defined collations,
which can be added by editing PREFIX/share/charsets/*.xml files.
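The fixed-split proposals (1 to 3) and the 1024..2047 range above can be illustrated with a minimal sketch. The helper names `pack_id`, `unpack_id` and `is_user_defined` are hypothetical, and placing the character set in the high bits is an assumption: the worklog does not specify a bit layout.

```python
# Hypothetical helpers for illustration only; not taken from the server code.
# Assumption: the character set number occupies the HIGH bits of the 16-bit ID.

USER_DEFINED_FIRST = 1024  # range stated in the worklog: 1024..2047
USER_DEFINED_LAST = 2047


def pack_id(charset, collation, charset_bits):
    """Pack a charset number and a collation number into one 16-bit ID.

    charset_bits is 7, 6 or 8 for Proposals 1, 2 and 3 respectively.
    """
    collation_bits = 16 - charset_bits
    if not 0 <= charset < (1 << charset_bits):
        raise ValueError("charset number does not fit")
    if not 0 <= collation < (1 << collation_bits):
        raise ValueError("collation number does not fit")
    return (charset << collation_bits) | collation


def unpack_id(packed, charset_bits):
    """Split a 16-bit ID back into (charset, collation)."""
    collation_bits = 16 - charset_bits
    return packed >> collation_bits, packed & ((1 << collation_bits) - 1)


def is_user_defined(collation_id):
    """True if the ID lies in the range reserved for user-defined collations."""
    return USER_DEFINED_FIRST <= collation_id <= USER_DEFINED_LAST
```

With a layout like this, a connector that only cares about the character set could call `unpack_id` and ignore the second element, which is the maintenance benefit the section describes.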
A user-defined ID range should guarantee that IDs for built-in collations
don't conflict with IDs for user-defined collations.

The affected code parts
=======================

- FRM files
- client-server protocol
- connectors
- other parts (TODO: list all other parts here)

Binary log and replication already use 2 bytes per charset+collation ID.
Most likely they won't need any changes.

Old client compatibility
========================

Upgrading the client part is usually painful because it requires
recompiling user applications. One should be able to upgrade the server
without having to upgrade the client. So old clients should be able to
understand a server with the new charset+collation ID encoding, at least
in the old single-byte ID range.

To reach this goal, the new server can still send old IDs using a single
byte, and send only new IDs using two bytes. A special prefix can designate
a two-byte sequence. The NULL byte 0x00 can work as this prefix.
For example:

  0xAB            - old ID 171, one byte total
  0x00 0x01 0x01  - new ID 257, three bytes total

Other considerations during upgrade:

The version number byte of the .frm file will change.
The implementor will investigate whether the bytes' previous values
were 0 or were undefined.
We want to know whether there are other items which might need
these spare bytes soon because their ranges are almost completely
used, e.g. flags; maybe the connectors people can tell us.

Time estimate by Bar
====================

FRM and protocol should not take much time. Maybe one week.
But I'm afraid that some of the engines will be affected
if they use only one byte for IDs.

Related code parts:
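The prefix scheme described under "Old client compatibility" can be sketched as follows. This is an illustration under stated assumptions, not the server's implementation: the function names are hypothetical, the two ID bytes are assumed to be big-endian (the worklog's 0x01 0x01 example does not settle the byte order), and ID 0 is assumed to be unused because 0x00 serves as the prefix.

```python
# Hypothetical sketch of the 0x00-prefix encoding; names are illustrative.
PREFIX = 0x00  # marks "a two-byte ID follows"


def encode_collation_id(cid):
    """Encode an ID as one byte (old range) or prefix + two bytes (new range)."""
    if 1 <= cid <= 0xFF:
        return bytes([cid])  # old single-byte ID, unchanged on the wire
    if 0x100 <= cid <= 0xFFFF:
        # Assumption: big-endian order for the two ID bytes.
        return bytes([PREFIX]) + cid.to_bytes(2, "big")
    raise ValueError("ID out of range for this sketch")


def decode_collation_id(buf, pos=0):
    """Decode one ID starting at buf[pos]; return (id, bytes_consumed)."""
    first = buf[pos]
    if first != PREFIX:
        return first, 1
    if len(buf) < pos + 3:
        raise ValueError("truncated two-byte ID")
    return int.from_bytes(buf[pos + 1:pos + 3], "big"), 3


# The two examples from the worklog text:
assert encode_collation_id(171) == b"\xab"
assert encode_collation_id(257) == b"\x00\x01\x01"
```

In this sketch an old client that reads only the first byte still sees the familiar value for every old ID, and sees 0x00 for any new ID.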