WL#3090: Japanese Character Set Adjustments
Affects: Server-5.4 — Status: Assigned — Priority: Medium
For conversion between one Japanese character set and
another Japanese character set, use a JIS table (based
on JIS-X-0201 + JIS-X-0208 + JIS-X-0213 code points)
rather than a Unicode table. This approach will be faster.
Speed up conversions between Japanese character sets.
Since ujis + eucjpms + sjis + cp932 are all based on JIS
(Japanese Industrial Standard) repertoires, conversion
should be possible with algorithms or JIS-table lookups,
without requiring Unicode-table lookups.
This affects CAST(), CONVERT(), and any automatic
conversion due to assignment. It does not mean you can
compare sjis and cp932 strings without explicit conversion.
The preferred plan is in sections "Main Proposal: JIS Table ...".
The possible alternative plans are in sections "Another Proposal ...".
The implementor will pick only one plan, after testing all plans.
Pick Pairs
----------
The possible pairs are:
ujis to eucjpms
ujis to sjis
ujis to cp932
eucjpms to ujis
eucjpms to sjis
eucjpms to cp932
sjis to ujis
sjis to eucjpms
sjis to cp932
cp932 to ujis
cp932 to eucjpms
cp932 to sjis
This worklog description only has sections for ujis to sjis
and sjis to cp932. However, given those pairs, the rest are
straightforward. The implementor should make some effort to
handle all pairs.
WL#1820 mentions 4 more Japanese character sets, and there
might be 4 more due to JIS X 0213:2004. So in theory there could
someday be 12 character sets and 132 possible pairs (12 x 11).
However, we will make no effort here to allow for possible
future character sets.
Requirement
-----------
For all characters, the results must be the same as
the results we get currently for conversion via Unicode.
And if a conversion is impossible then there will be
a warning or error, just as there is now in version 5.1.
This requirement can be discussed. If we have to bend it
for the sake of efficiency, we need to know how bad that
will be.
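One way to check this requirement is an exhaustive comparison of the existing via-Unicode conversion against the new direct conversion. The following is only a sketch: count_mismatches(), convert_fn and the two toy converters are hypothetical names invented for illustration, and the toy converters do not contain real mapping data.

```c
#include <stdint.h>

/* Hypothetical converter signature: source code in, destination code out. */
typedef uint32_t (*convert_fn)(uint32_t code);

/* Count the codes in [first, last] for which the two converters disagree. */
static unsigned long count_mismatches(convert_fn reference,
                                      convert_fn candidate,
                                      uint32_t first, uint32_t last)
{
  unsigned long mismatches = 0;
  uint32_t code;
  if (first > last)
    return 0;
  for (code = first; ; code++)
  {
    if (reference(code) != candidate(code))
      mismatches++;
    if (code == last)            /* stop without overflowing the counter */
      break;
  }
  return mismatches;
}

/* Toy converters for demonstration only, NOT real character set data. */
static uint32_t toy_reference(uint32_t code) { return code; }
static uint32_t toy_candidate(uint32_t code)
{
  return code == 0x8160 ? 0x3F : code;   /* one deliberate difference */
}
```

In a real test the reference would wrap the current mb_wc + wc_mb path and the candidate would wrap the new path; a non-zero count would mean the requirement is bent.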
Main Proposal: JIS table
------------------------
The current loop in strings/ctype-*.c looks like this:
while (!EOF)
{
  cs1->cset->mb_wc(&code); // scan Unicode character from src
  cs2->cset->wc_mb(&code); // put Unicode character to dst
}
The proposed loop looks like this:
while (!EOF)
{
  cs1->cset->mb_jis(&code); // scan JIS character from src
  cs2->cset->jis_mb(&code); // put JIS character to dst
}
That is, each Japanese character set handler will have a new
function mb_jis "scan a character from an sjis/ujis/eucjpms/cp932
string and return its JIS code", and a new function jis_mb "put a
character with the given JIS code into an sjis/ujis/eucjpms/cp932
string".
The "JIS code" is in a table which contains values defined
by the various JIS standards. For example: 0xdf from JIS-X-0201,
0x2121 from JIS-X-0208, 0x???? from JIS-X-0213.
Since we avoid JIS-to-Unicode and Unicode-to-JIS table lookups,
conversion is about twice as fast, according to some early tests.
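As an illustration of the "scan" and "put" halves for the JIS-X-0208 block, here is a minimal sketch. The names sketch_sjis_to_jis() and sketch_jis_to_ujis() are hypothetical and are not the functions in jp.txt. The sketch uses the standard Shift_JIS arithmetic, handles only double-byte characters, and performs no assigned/unassigned checks.

```c
#include <stdint.h>

/* "Scan" half: decode a double-byte sjis sequence into a JIS-X-0208 code
   such as 0x2121. Returns 0 if the bytes are not a well-formed pair. */
static uint16_t sketch_sjis_to_jis(uint8_t s1, uint8_t s2)
{
  unsigned lead, j1, j2;
  if (!((s1 >= 0x81 && s1 <= 0x9F) || (s1 >= 0xE0 && s1 <= 0xEF)))
    return 0;
  if (!((s2 >= 0x40 && s2 <= 0x7E) || (s2 >= 0x80 && s2 <= 0xFC)))
    return 0;
  lead = (s1 <= 0x9F) ? s1 - 0x81u : s1 - 0xC1u;   /* 0..46 */
  if (s2 >= 0x9F)
  {
    j1 = lead * 2 + 0x22;                          /* even JIS row */
    j2 = s2 - 0x7Eu;
  }
  else
  {
    j1 = lead * 2 + 0x21;                          /* odd JIS row */
    j2 = s2 - (s2 >= 0x80 ? 0x20u : 0x1Fu);
  }
  return (uint16_t) ((j1 << 8) | j2);
}

/* "Put" half: write a JIS-X-0208 code as two ujis bytes
   by setting the high bit of each byte. */
static void sketch_jis_to_ujis(uint16_t jis, uint8_t out[2])
{
  out[0] = (uint8_t) ((jis >> 8) | 0x80);
  out[1] = (uint8_t) ((jis & 0xFF) | 0x80);
}
```

Neither function consults a Unicode table; the JIS code is the only intermediary.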
The functions that should become faster are:
sql/strfunc.cc strconvert().
sql/sql_string.cc copy_and_convert().
sql/sql_string.cc well_formed_copy_nchars().
Main Proposal: JIS Table: Examples
----------------------------------
For example, when you convert from sjis to ujis:
1a. my_mb_wc_sjis() scans an SJIS representation of JIS-X-0208 code
1b. my_mb_wc_sjis() converts JIS-X-0208 code (in SJIS form) to Unicode
using func_sjis_uni_onechar(), which is slow (uses table lookups)
Then the found Unicode character code is returned.
then
2a. my_wc_mb_euc_jp() gets a Unicode code and converts it to JIS-X-0208
using my_uni_jisx0208_onechar(), which is slow (uses table lookups)
2b. my_wc_mb_euc_jp() puts the found JIS-X-0208 character
in the result string.
The slowest steps here are func_sjis_uni_onechar() and
my_uni_jisx0208_onechar(). I.e. conversion from JIS-X-0208 to Unicode,
and then conversion from Unicode back to JIS-X-0208.
If we use JIS-X-0208 instead of Unicode as intermediary, then
these two slow steps are not necessary.
An example file, jp.txt, attached to this worklog task,
demonstrates what sjis_jis() and jis_cp932() could look like.
Another proposal: Big Unicode Table
-----------------------------------
With a JIS table we can avoid Unicode intermediary lookups,
and thus save time. But there is another way to save time:
make the Unicode intermediary lookups faster. The point is
that there are many "if" statements in the Unicode lookups,
because we only have mappings for certain characters (the
other characters are either invalid or deducible). If we
expanded the table so that it included all possible characters
rather than just certain characters, we'd be able to eliminate
the "if"s and just do one table-lookup statement.
Specifically:
In functions like func_sjis_uni_onechar() or my_uni_jisx0208_onechar(),
replace "if"s like these:
...
if ((code>=0x00A1)&&(code<=0x00DF))
return(tab_cp932_uni0[code-0x00A1]);
if ((code>=0x8140)&&(code<=0x84BE))
return(tab_cp932_uni1[code-0x8140]);
if ((code>=0x8740)&&(code<=0x879C))
return(tab_cp932_uni2[code-0x8740]);
...
with a single "array operation" like this:
...
code= tab_cp932_uni[code - something];
...
The implementor will test whether "another proposal"
is at least as fast as "...". If so, there will be
no need to add mb_jis() or jis_mb() for each character set.
This will be a very big table when support is added for
WL#1213 Supplementary Characters.
The "performance point of view" mainly depends on how lucky
we are with caching with this huge table.
But the invalid SJIS/UJIS values should be very rare (indeed
they shouldn't exist at all in a clean database). Therefore
they will never be looked up. Therefore, although the total
table size is much larger if we allow for all invalid values,
the amount that's actually used in lookups (and therefore the
amount that's cached) is not larger at all.
So it seems to Peter that Alexander Barkov's "Big Unicode Table
proposal" is always going to be faster with realistic data.
(He's also assuming that invalid values are clumped together,
rather than distributed evenly in the table, but that too
seems realistic to him.)
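The difference between the two lookup styles can be sketched with toy data. Everything below is hypothetical: the table names, the code ranges and the values are invented for illustration and are NOT real cp932 mappings.

```c
#include <stdint.h>

/* Toy segment tables (hypothetical values). */
static const uint16_t toy_seg0[] = { 0x1000, 0x1001, 0x1002 }; /* 0x10..0x12 */
static const uint16_t toy_seg1[] = { 0x2000, 0x2001 };         /* 0x20..0x21 */

/* Current style: one range test ("if") per segment. */
static uint16_t toy_lookup_segmented(unsigned code)
{
  if (code >= 0x10 && code <= 0x12)
    return toy_seg0[code - 0x10];
  if (code >= 0x20 && code <= 0x21)
    return toy_seg1[code - 0x20];
  return 0;                      /* no mapping */
}

/* Proposed style: one flat table covering 0x10..0x21.
   The 13 gap codes 0x13..0x1F are stored explicitly as 0. */
static const uint16_t toy_flat[0x22 - 0x10] = {
  0x1000, 0x1001, 0x1002,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0x2000, 0x2001
};

/* Single array operation. The caller must guarantee 0x10 <= code <= 0x21,
   for example because well-formedness was already checked while scanning. */
static uint16_t toy_lookup_flat(unsigned code)
{
  return toy_flat[code - 0x10];
}
```

The flat table is larger, but the gap entries are only touched when invalid codes are looked up, which is the caching argument made above.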
Another Proposal: Algorithm for ujis/eucjpms to sjis/cp932
----------------------------------------------------------
The algorithm for moving from an EUC encoding (ujis or
eucjpms) to an SJIS encoding (sjis or cp932) is well known.
There is a description in Wikipedia:
http://en.wikipedia.org/wiki/Shift_JIS
It's possible because, although the encodings are different,
the underlying JIS "code points" are the same.
But the result can be an unassigned / reserved
SJIS character, that is, well-formed but invalid.
Example: _ujis aaaa = JIS 2a2a = _sjis 85a8.
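The arithmetic behind this example can be sketched as follows. The function names are hypothetical, and the sketch assumes the input is a well-formed [A1-FE][A1-FE] ujis pair (the JIS-X-0208 block only); it does no validity checking of the result.

```c
#include <stdint.h>

/* Strip the high bit of each EUC byte to get the JIS-X-0208 code. */
static uint16_t sketch_ujis_to_jis(uint8_t e1, uint8_t e2)
{
  return (uint16_t) (((e1 & 0x7Fu) << 8) | (e2 & 0x7Fu));
}

/* Standard JIS-X-0208 to Shift_JIS arithmetic, no table lookups.
   Assumes both JIS bytes are in the range 0x21..0x7E. */
static uint16_t sketch_jis_to_sjis(uint16_t jis)
{
  unsigned j1 = jis >> 8;
  unsigned j2 = jis & 0xFFu;
  /* Two JIS rows share one sjis lead byte. */
  unsigned s1 = ((j1 + 1) >> 1) + (j1 <= 0x5E ? 0x70u : 0xB0u);
  unsigned s2;
  if (j1 & 1)
    s2 = j2 + (j2 >= 0x60 ? 0x20u : 0x1Fu);  /* odd row: 0x40..0x9E, skips 0x7F */
  else
    s2 = j2 + 0x7Eu;                         /* even row: 0x9F..0xFC */
  return (uint16_t) ((s1 << 8) | s2);
}
```

Applied to the example, ujis AAAA strips to JIS 2A2A, which the arithmetic maps to sjis 85A8, a well-formed code in an unassigned row.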
The only ways around this difficulty are:
(1) ignore it, assume that if the ujis value was
acceptable then the sjis value must be good too
(2) use a lookup table with one bit for each JIS
value, with 0 = valid or 1 = invalid, so the
table is only 1/16 as large
(3) strip the UJIS value to get the JIS value, but
then do a lookup from the JIS value to the SJIS
value.
(4) add more "if" statements, for example "if the
first byte of the SJIS result is 0x85, it's bad".
The implementor will test to see whether the algorithm
is faster than table lookup. If so, we will then have
to choose one of the above "ways around this difficulty".
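Option (2) could look roughly like the sketch below. All names are hypothetical. For lack of per-character data, the sketch marks only whole rows as invalid (rows 9..15 and 85..94, the rows this worklog lists as unassigned for sjis); a real table would also need bits for unassigned cells inside assigned rows.

```c
#include <stdint.h>
#include <string.h>

#define JIS_ROWS  94
#define JIS_CELLS 94

/* One bit per JIS-X-0208 value, 0 = valid, 1 = invalid: 1105 bytes. */
static uint8_t invalid_bits[(JIS_ROWS * JIS_CELLS + 7) / 8];

/* Mark every cell of one row (1..94) as invalid. */
static void mark_row_invalid(unsigned row)
{
  unsigned cell;
  for (cell = 0; cell < JIS_CELLS; cell++)
  {
    unsigned idx = (row - 1) * JIS_CELLS + cell;
    invalid_bits[idx >> 3] |= (uint8_t) (1u << (idx & 7));
  }
}

/* Fill the bitmap from the row ranges listed as unassigned. */
static void init_invalid_bits(void)
{
  unsigned row;
  memset(invalid_bits, 0, sizeof(invalid_bits));
  for (row = 9; row <= 15; row++)
    mark_row_invalid(row);
  for (row = 85; row <= 94; row++)
    mark_row_invalid(row);
}

/* Test one JIS code; both bytes must be in 0x21..0x7E. */
static int jis_is_invalid(uint16_t jis)
{
  unsigned idx = ((jis >> 8) - 0x21u) * JIS_CELLS + ((jis & 0xFFu) - 0x21u);
  return (invalid_bits[idx >> 3] >> (idx & 7)) & 1;
}
```

After init_invalid_bits(), the JIS 2A2A value from the example above is reported as invalid because it lies in row 10.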
Another proposal: sjis to cp932
-------------------------------
The idea here never became a clear proposal.
We may remove this section after 2009-12-31.
Since cp932 is merely the Microsoft variant of sjis,
many characters are the same in both character sets,
and therefore need no conversion. For example,
_sjis 0x8ec7 = _cp932 0x8ec7.
Effectively the sjis-to-cp932 conversion can
happen, for some character strings, by just renaming.
There are 4408 sjis characters which currently cannot be converted
to cp932. This happens for three reasons:
1. The character is illegal in sjis. MySQL should never have accepted it.
Bar suggested that MySQL should start rejecting such characters,
but Peter resisted, saying that's a change in behaviour, a new task.
2. The character is legal in some sjis variant, but is not in
the sjis-to-Unicode table.
3. The character is legal in sjis, and is in the sjis-to-unicode table,
but sjis-to-unicode value differs from cp932-to-unicode value.
Example: 815F, 8160, 8161, 817C, 8191, 8192, 81CA.
There are seven characters in this category; the differences are
halfwidth versus fullwidth etc., and they are mentioned in the MySQL
Reference Manual:
http://dev.mysql.com/doc/refman/5.1/en/charset-asian-sets.html
Specifically for 81CA (NOT SIGN) see the FAQ for the manual:
http://dev.mysql.com/doc/refman/5.1/en/faqs-cjk.html
We still need a table. But it can be only a small table, containing
only the characters which cause conversion difficulty.
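A minimal sketch of such a small table, covering only the seven category-3 codes listed above. The names are hypothetical, and a complete table would also need entries for categories 1 and 2.

```c
#include <stdint.h>
#include <stddef.h>

/* The seven sjis codes whose Unicode mapping differs from cp932. */
static const uint16_t sjis_cp932_exceptions[] = {
  0x815F, 0x8160, 0x8161, 0x817C, 0x8191, 0x8192, 0x81CA
};

/* Return 1 if the sjis code needs special handling,
   0 if it can be carried over to cp932 unchanged ("renaming"). */
static int sjis_code_differs_in_cp932(uint16_t code)
{
  size_t i;
  size_t n = sizeof(sjis_cp932_exceptions) / sizeof(sjis_cp932_exceptions[0]);
  for (i = 0; i < n; i++)
  {
    if (sjis_cp932_exceptions[i] == code)
      return 1;
  }
  return 0;
}
```

With only seven entries a linear scan is enough; a larger exception list would probably want a sorted array with binary search.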
There is a test for the 4408 characters which cannot be
converted from sjis to cp932, in this email thread:
[ mysql intranet ] /secure/mailarchive/mail.php?folder=4&mail=30324
Feedback
--------
We asked for feedback re WL#3090 requirements several months ago.
What we got (excluding comments from Dean + Susanne) was this:
"Yoshinori says: Must have the part for sjis-cp932 conversion"
"[Bar says] Not important. Postpone for a higher 6.x or 7.x"
and Shinobu Matsuzuka rated the task "P2" with no further comment.
Cancelled subtasks
------------------
This section is obsolete and may be removed.
It concerns subtasks which were part of the original
proposal, which we decided against for reasons given here.
We may remove this section after 2009-12-31.
1. Change --skip-character-set-client-handshake
using my.cnf.
CANCELLED. See progress notes for original description.
Shuichi didn't get feedback from Japanese community.
2. The following should be an error, not a warning:
mysql> create table tj (s1 char(10) character set sjis);
Query OK, 0 rows affected (0.52 sec)
mysql> set sql_mode=ansi;
Query OK, 0 rows affected (0.03 sec)
mysql> insert into tj values (0x8080);
Query OK, 1 row affected, 1 warning (0.00 sec)
(The above causes a warning 1366 "Incorrect
string value ..." which is an error if strict.)
What they really want is that we accept the
junk character, but I can't think of a good
way, and they indicated that "at least" an
error should be there.
MOVED. WL#5083 Error for character set conversion failure.
4. Allow Shift_JIS as an alias for sjis
Allow EUC-JP as an alias for eucjp
CANCELLED. See progress notes for original description.
Shuichi didn't get feedback from Japanese community.
References
----------
There was previous discussion of this worklog task
in email thread "Feedback and Requests from Japanese
users community" with participants: shuichi, pgulutzan,
bar, and others.
There is also a dev-private thread "Re: WL#3090 Japanese
Character Set Adjustments".
Character set components for ujis
---------------------------------
0. ASCII [00-7F] ASCII 127 characters
1. JIS-X-0208 [A1-FE][A1-FE] 8836 characters
2. Half-Width-Kana [8E][A1-DF] 63 characters
3. JIS-X-0212 [8F][A1-FE][A1-FE] 8836 characters
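The byte ranges above are enough to classify a ujis character by its first byte. The following sketch uses hypothetical names and checks only well-formedness, not whether the code point is assigned.

```c
#include <stddef.h>

typedef enum
{
  UJIS_BAD, UJIS_ASCII, UJIS_JISX0208, UJIS_KANA, UJIS_JISX0212
} ujis_component;

static int in_a1_fe(unsigned char b) { return b >= 0xA1 && b <= 0xFE; }

/* Classify the character at s (len bytes available).
   Stores its byte length in *nbytes (0 for a malformed sequence). */
static ujis_component ujis_classify(const unsigned char *s, size_t len,
                                    size_t *nbytes)
{
  *nbytes = 0;
  if (len == 0)
    return UJIS_BAD;
  if (s[0] <= 0x7F)                                   /* 0. ASCII */
  {
    *nbytes = 1;
    return UJIS_ASCII;
  }
  if (s[0] == 0x8E && len >= 2 && s[1] >= 0xA1 && s[1] <= 0xDF)
  {
    *nbytes = 2;                                      /* 2. Half-Width-Kana */
    return UJIS_KANA;
  }
  if (s[0] == 0x8F && len >= 3 && in_a1_fe(s[1]) && in_a1_fe(s[2]))
  {
    *nbytes = 3;                                      /* 3. JIS-X-0212 */
    return UJIS_JISX0212;
  }
  if (in_a1_fe(s[0]) && len >= 2 && in_a1_fe(s[1]))
  {
    *nbytes = 2;                                      /* 1. JIS-X-0208 */
    return UJIS_JISX0208;
  }
  return UJIS_BAD;
}
```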
Character set components for eucjpms
------------------------------------
0. ASCII [00-7F] ASCII 127 characters
1. JIS-X-0208 [A1-FE][A1-FE] 8836 characters
2. Half-Width-Kana [8E][A1-DF] 63 characters
3. JIS-X-0212 [8F][A1-FE][A1-FE] 8836 characters
Character set components for sjis
---------------------------------
0. ASCII [00-7F] 127 characters
1. JIS-X-0208 [81-9F,E0-EF][40-7E,80-FC] 8836 characters
[81-84][40-7E,80-FC] JIS-X-0208 Rows 1..8 (8 rows)
[85-87][40-7E,80-FC] JIS-X-0208 Rows 9..14 (6 rows) UNASSIGNED
[88][40-7E,80-9E] JIS-X-0208 Row 15..15 (1 rows) UNASSIGNED
[88][9F-FC] JIS-X-0208 Row 16..16 (1 rows)
[89-8F][40-7E,80-FC] JIS-X-0208 Rows 17..30 (14 rows)
[90-9F][40-7E,80-FC] JIS-X-0208 Rows 31..62 (32 rows)
[E0-EA][40-7E,80-FC] JIS-X-0208 Rows 63..84 (22 rows)
[EB-EF][40-7E,80-FC] JIS-X-0208 Rows 85..94 (10 rows) UNASSIGNED
1a. JIS-X-0208 Extension [F0-FC][40-7E,80-FC] 2444 characters
[F0-FC][40-7E,80-FC] JIS-X-0208 Rows 95..120 (26 rows) UNASSIGNED
2. Half-Width-Kana [A1-DF] 63 characters
3. JIS-X-0212 -----
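The row numbers in this table follow from the byte values: each lead byte covers two rows, and the trail byte selects the first or second of them. A small sketch with a hypothetical function name:

```c
/* Return the JIS row (1..120) of a double-byte sjis code,
   or 0 if the byte pair is not well-formed. */
static int sjis_row(unsigned char s1, unsigned char s2)
{
  unsigned lead;
  if (s1 >= 0x81 && s1 <= 0x9F)
    lead = s1 - 0x81u;               /* lead bytes 0x81..0x9F -> 0..30 */
  else if (s1 >= 0xE0 && s1 <= 0xFC)
    lead = s1 - 0xC1u;               /* lead bytes 0xE0..0xFC -> 31..59 */
  else
    return 0;
  if (!((s2 >= 0x40 && s2 <= 0x7E) || (s2 >= 0x80 && s2 <= 0xFC)))
    return 0;
  /* Trail bytes 0x40..0x9E are the odd row, 0x9F..0xFC the even row. */
  return (int) (lead * 2 + 1 + (s2 >= 0x9F ? 1 : 0));
}
```

For example, lead byte 0x88 gives row 15 for trail bytes up to 0x9E and row 16 from 0x9F on, which is the split shown in the table.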
Character set components for cp932
----------------------------------
0. ASCII [00-7F] 127 characters
1. JIS-X-0208 [81-9F,E0-EF][40-7E,80-FC] 8836 characters
[81-84][40-7E,80-FC] JIS-X-0208 Rows 1..8 (8 rows)
[85-86][40-7E,80-FC] JIS-X-0208 Rows 9..12 (4 rows) UNASSIGNED
[87][40-7E,80-9E] JIS-X-0208 Rows 13..13 (1 rows) NEC special characters
[87][9F-FC] JIS-X-0208 Rows 14..14 (1 rows) UNASSIGNED
[88][40-7E,80-9E] JIS-X-0208 Row 15..15 (1 rows) UNASSIGNED
[88][9F-FC] JIS-X-0208 Row 16..16 (1 rows)
[89-8F][40-7E,80-FC] JIS-X-0208 Rows 17..30 (14 rows)
[90-9F][40-7E,80-FC] JIS-X-0208 Rows 31..62 (32 rows)
[E0-EA][40-7E,80-FC] JIS-X-0208 Rows 63..84 (22 rows)
[EB-EC][40-7E,80-FC] JIS-X-0208 Rows 85..88 (4 rows) UNASSIGNED
[ED-EE][40-7E,80-FC] JIS-X-0208 Rows 89..92 (4 rows) NEC selection of IBM
[EF][40-7E,80-FC] JIS-X-0208 Rows 93..94 (2 rows) UNASSIGNED
1a. JIS-X-0208 Extension [F0-FC][40-7E,80-FC] 2444 characters
[F0-F9][40-7E,80-FC] JIS-X-0208 Rows 95..114 (20 rows) UNASSIGNED
[FA][40-7E,80-FC] JIS-X-0208 Rows 115..116 (2 rows) IBM extensions
[FB][40-7E,80-FC] JIS-X-0208 Rows 117..118 (2 rows) IBM extensions
[FC][40-7E,80-9E] JIS-X-0208 Rows 119..119 (1 rows) IBM extensions
2. Half-Width-Kana [A1-DF] 63 characters
3. JIS-X-0212 -----
Character set components for JIS-X-0208
---------------------------------------
JIS-X-0208 will only be used as an intermediary encoding.
[21-28][21-7E] - Rows 1..8 8 rows
[29-2F][21-7E] - Rows 9..15, 7 rows, Unassigned
[30-74][21-7E] - Rows 16..84, 69 rows
[75-7E][21-7E] - Rows 85..94, 10 rows, Unassigned
"SJIS vs CP932" and "UJIS vs EUCJPMS"
-------------------------------------
+------+------+---------+------+
| SJIS | UJIS | EUCJPMS | UCS2 |
+------+------+---------+------+
| 815F | 5C | 5C | 005C | REVERSE SOLIDUS
| 8160 | A1C1 | 3F | 301C | WAVE DASH
| 8161 | A1C2 | 3F | 2016 | DOUBLE VERTICAL LINE
| 817C | A1DD | 3F | 2212 | MINUS SIGN
| 8191 | A1F1 | 3F | 00A2 | CENT SIGN
| 8192 | A1F2 | 3F | 00A3 | POUND SIGN
| 81CA | A2CC | 3F | 00AC | NOT SIGN
+------+------+---------+------+
+-------+------+---------+------+
| cp932 | UJIS | EUCJPMS | UCS2 |
+-------+------+---------+------+
| 815F | 3F | A1C0 | FF3C | FULLWIDTH REVERSE SOLIDUS
| 8160 | 3F | A1C1 | FF5E | FULLWIDTH TILDE
| 8161 | 3F | A1C2 | 2225 | PARALLEL TO
| 817C | 3F | A1DD | FF0D | FULLWIDTH HYPHEN-MINUS
| 8191 | 3F | A1F1 | FFE0 | FULLWIDTH CENT SIGN
| 8192 | 3F | A1F2 | FFE1 | FULLWIDTH POUND SIGN
| 81CA | 3F | A2CC | FFE2 | FULLWIDTH NOT SIGN
+-------+------+---------+------+
ujis vs eucjpms vs cp932 for JIS-X-0212 character block
-------------------------------------------------------
The JIS-X-0212 block differs between ujis and eucjpms in 108 characters.
The script to fetch the differing characters:
mysql> drop table t1; create table t1 (a char(1)); insert into t1 values
('0'),('1'),('2'),('3'),('4'),('5'),('6'),('7'),('8'),('9'),('A'),('B'),('C'),('D'),('E'),('F');
select hex(@a:=unhex(concat('8f',t11.a, t12.a, t13.a, t14.a))) as c,
hex(convert(convert(@a using ujis) using ucs2)) as ujis_ucs2,
hex(convert(convert(@a using eucjpms) using ucs2)) as eucjpms_ucs2,
hex(@b:=convert(convert(@a using eucjpms) using cp932)) as eucjpms_cp932,
hex(convert(@b using eucjpms)) as cp932_eucjpms from t1 t11, t1 t12, t1 t13, t1
t14 where t11.a>='a' and t13.a>='a' having ujis_ucs2 != eucjpms_ucs2 order by
t11.a,t12.a,t13.a,t14.a;
Query OK, 0 rows affected (0.00 sec)
- TILDE vs FULLWIDTH TILDE
+--------+-----------+--------------+---------------+---------------+
| c | ujis_ucs2 | eucjpms_ucs2 | eucjpms_cp932 | cp932_eucjpms |
+--------+-----------+--------------+---------------+---------------+
| 8FA2B7 | 007E | FF5E | 8160 | A1C1 |
+--------+-----------+--------------+---------------+---------------+
ujis 8FA2B7 == U+007E TILDE
eucjpms 8FA2B7 == U+FF5E FULLWIDTH TILDE
Note, roundtrip eucjpms->ucs2->eucjpms does not work.
eucjpms encodes "U+FF5E FULLWIDTH TILDE" twice:
In JIS-X-0208 block, as 0xA1C1.
In JIS-X-0212 block, as 0x8FA2B7
cp932 encodes U+FF5E once, in JIS-X-0208 block.
- BROKEN BAR vs FULLWIDTH BROKEN BAR
+--------+-----------+--------------+---------------+---------------+
| c | ujis_ucs2 | eucjpms_ucs2 | eucjpms_cp932 | cp932_eucjpms |
+--------+-----------+--------------+---------------+---------------+
| 8FA2C3 | 00A6 | FFE4 | FA55 | 8FA2C3 |
+--------+-----------+--------------+---------------+---------------+
ujis 8FA2C3 == U+00A6 BROKEN BAR
eucjpms 8FA2C3 == U+FFE4 FULLWIDTH BROKEN BAR
cp932 encodes U+FFE4 twice:
EEFA (in NEC selection of IBM characters, rows 89..92)
FA55 (in IBM Extensions, rows 115..116)
- IBM Extensions Rows (as in cp932, Rows 115..116)
SMALL ROMAN NUMERAL ONE .. SMALL ROMAN NUMERAL TEN
+--------+-----------+--------------+---------------+---------------+
| c | ujis_ucs2 | eucjpms_ucs2 | eucjpms_cp932 | cp932_eucjpms |
+--------+-----------+--------------+---------------+---------------+
| 8FF3F3 | 003F | 2170 | FA40 | 8FF3F3 |
| 8FF3F4 | 003F | 2171 | FA41 | 8FF3F4 |
| 8FF3F5 | 003F | 2172 | FA42 | 8FF3F5 |
| 8FF3F6 | 003F | 2173 | FA43 | 8FF3F6 |
| 8FF3F7 | 003F | 2174 | FA44 | 8FF3F7 |
| 8FF3F8 | 003F | 2175 | FA45 | 8FF3F8 |
| 8FF3F9 | 003F | 2176 | FA46 | 8FF3F9 |
| 8FF3FA | 003F | 2177 | FA47 | 8FF3FA |
| 8FF3FB | 003F | 2178 | FA48 | 8FF3FB |
| 8FF3FC | 003F | 2179 | FA49 | 8FF3FC |
+--------+-----------+--------------+---------------+---------------+
eucjpms uses only JIS-X-0212 block for SMALL ROMAN NUMERAL.
cp932 encodes U+2170 SMALL ROMAN NUMERAL twice:
0xEEEF = U+2170 (in NEC selection of IBM characters, rows 89..92)
0xFA40 = U+2170 (in IBM Extensions, rows 115..116)
- NEC special characters (as in cp932, row 13)
ROMAN NUMERAL ONE .. ROMAN NUMERAL TEN
+--------+-----------+--------------+---------------+---------------+
| c | ujis_ucs2 | eucjpms_ucs2 | eucjpms_cp932 | cp932_eucjpms |
+--------+-----------+--------------+---------------+---------------+
| 8FF3FD | 003F | 2160 | 8754 | ADB5 |
| 8FF3FE | 003F | 2161 | 8755 | ADB6 |
| 8FF4A1 | 003F | 2162 | 8756 | ADB7 |
| 8FF4A2 | 003F | 2163 | 8757 | ADB8 |
| 8FF4A3 | 003F | 2164 | 8758 | ADB9 |
| 8FF4A4 | 003F | 2165 | 8759 | ADBA |
| 8FF4A5 | 003F | 2166 | 875A | ADBB |
| 8FF4A6 | 003F | 2167 | 875B | ADBC |
| 8FF4A7 | 003F | 2168 | 875C | ADBD |
| 8FF4A8 | 003F | 2169 | 875D | ADBE |
+--------+-----------+--------------+---------------+---------------+
eucjpms encodes ROMAN NUMERAL twice:
ADB5 = U+2160 ROMAN NUMERAL ONE (in JIS-X-0208 block)
8FF3FD = U+2160 ROMAN NUMERAL ONE (in JIS-X-0212 block)
cp932 encodes these characters twice, too:
8754 = U+2160 ROMAN NUMERAL ONE (in NEC special characters, row 13)
FA4A = U+2160 ROMAN NUMERAL ONE (in IBM Extensions, rows 115..116)
iconv converts EUC-JP-MS 0xADB5 to cp932 0x8754
iconv converts EUC-JP-MS 0xFA4A to cp932 0x8754
iconv converts cp932 0x8754 to ADB5
iconv converts cp932 0xFA4A to ADB5
- IBM Extensions (as in cp932, Rows 115..116)
+--------+-----------+--------------+---------------+---------------+
| c | ujis_ucs2 | eucjpms_ucs2 | eucjpms_cp932 | cp932_eucjpms |
+--------+-----------+--------------+---------------+---------------+
| 8FF4A9 | 003F | FF07 | FA56 | 8FF4A9 |
| 8FF4AA | 003F | FF02 | FA57 | 8FF4AA |
+--------+-----------+--------------+---------------+---------------+
eucjpms encodes U+FF02 and U+FF07 only once, in JIS-X-0212 block.
cp932 encodes these characters twice:
EEFC = U+FF02 FULLWIDTH QUOTATION MARK (NEC selected IBM chars, rows 89..92)
EEFB = U+FF07 FULLWIDTH APOSTROPHE (NEC selected IBM chars, rows 89..92)
FA57 = U+FF02 FULLWIDTH QUOTATION MARK (IBM Extensions, Rows 115..116)
FA56 = U+FF07 FULLWIDTH APOSTROPHE (IBM Extensions, Rows 115..116)
- NEC special characters (as in cp932, Row 13)
+--------+-----------+--------------+---------------+---------------+
| c | ujis_ucs2 | eucjpms_ucs2 | eucjpms_cp932 | cp932_eucjpms |
+--------+-----------+--------------+---------------+---------------+
| 8FF4AB | 003F | 3231 | 878A | ADEA |
| 8FF4AC | 003F | 2116 | 8782 | ADE2 |
| 8FF4AD | 003F | 2121 | 8784 | ADE4 |
+--------+-----------+--------------+---------------+---------------+
eucjpms encodes U+3231, U+2116 and U+2121 twice:
8FF4AB = U+3231 (in JIS-X-0212 block)
ADEA = U+3231 (In JIS-X-0208 block)
cp932 encodes these characters twice, too:
878A = U+3231 PARENTHESIZED IDEOGRAPH STOCK (NEC special characters, row 13)
FA58 = U+3231 PARENTHESIZED IDEOGRAPH STOCK (IBM Extensions, Rows 115..116)
8782 = U+2116 NUMERO SIGN (NEC special characters, row 13)
FA59 = U+2116 NUMERO SIGN (IBM Extensions, Rows 115..116)
8784 = U+2121 TELEPHONE SIGN (NEC special characters, row 13)
FA5A = U+2121 TELEPHONE SIGN (IBM Extensions, Rows 115..116)
- IBM Extensions (as in cp932, Rows 115..116)
and
NEC selected IBM characters (as in cp932, rows 89..92)
+--------+-----------+--------------+---------------+---------------+
| c | ujis_ucs2 | eucjpms_ucs2 | eucjpms_cp932 | cp932_eucjpms |
+--------+-----------+--------------+---------------+---------------+
| 8FF4AE | 003F | 70BB | FA62 | 8FF4AE | CJK UNIF
| 8FF4AF | 003F | 4EFC | FA6A | 8FF4AF | CJK UNIF
| 8FF4B0 | 003F | 50F4 | FA7C | 8FF4B0 | CJK UNIF
| 8FF4B1 | 003F | 51EC | FA83 | 8FF4B1 | CJK UNIF
| 8FF4B2 | 003F | 5307 | FA8A | 8FF4B2 | CJK UNIF
| 8FF4B3 | 003F | 5324 | FA8B | 8FF4B3 | CJK UNIF
| 8FF4B4 | 003F | FA0E | FA90 | 8FF4B4 | CJK COMPAT
| 8FF4B5 | 003F | 548A | FA92 | 8FF4B5 | CJK UNIF
| 8FF4B6 | 003F | 5759 | FA96 | 8FF4B6 | CJK UNIF
| 8FF4B7 | 003F | FA0F | FA9B | 8FF4B7 | CJK COMPAT
| 8FF4B8 | 003F | FA10 | FA9C | 8FF4B8 | CJK COMPAT
| 8FF4B9 | 003F | 589E | FA9D | 8FF4B9 | CJK UNIF
| 8FF4BA | 003F | 5BEC | FAAA | 8FF4BA | CJK UNIF
| 8FF4BB | 003F | 5CF5 | FAAE | 8FF4BB | CJK UNIF
| 8FF4BC | 003F | 5D53 | FAB0 | 8FF4BC | CJK UNIF
| 8FF4BD | 003F | FA11 | FAB1 | 8FF4BD | CJK COMPAT
| 8FF4BE | 003F | 5FB7 | FABA | 8FF4BE | CJK UNIF
| 8FF4BF | 003F | 6085 | FABD | 8FF4BF | CJK UNIF
| 8FF4C0 | 003F | 6120 | FAC1 | 8FF4C0 | CJK UNIF
| 8FF4C1 | 003F | 654E | FACD | 8FF4C1 | CJK UNIF
| 8FF4C2 | 003F | 663B | FAD0 | 8FF4C2 | CJK UNIF
| 8FF4C3 | 003F | 6665 | FAD5 | 8FF4C3 | CJK UNIF
| 8FF4C4 | 003F | FA12 | FAD8 | 8FF4C4 | CJK COMPAT
| 8FF4C5 | 003F | F929 | FAE0 | 8FF4C5 | CJK COMPAT
| 8FF4C6 | 003F | 6801 | FAE5 | 8FF4C6 | CJK UNIF
| 8FF4C7 | 003F | FA13 | FAE8 | 8FF4C7 | CJK COMPAT
| 8FF4C8 | 003F | FA14 | FAEA | 8FF4C8 | CJK COMPAT
| 8FF4C9 | 003F | 6A6B | FAEE | 8FF4C9 | CJK UNIF
| 8FF4CA | 003F | 6AE2 | FAF2 | 8FF4CA | CJK UNIF
+--------+-----------+--------------+---------------+---------------+
eucjpms encodes these characters only once, in JIS-X-0212 block.
cp932 encodes these characters twice:
ED46 = U+70BB CJK (in NEC selection of IBM characters, rows 89..92)
FA62 = U+70BB CJK (in IBM Extensions, Rows 115-116)
See CP932-NEC-IBM.txt in attachment.
- IBM Extensions (as in cp932, Rows 117..118)
and
NEC selected IBM characters (as in cp932, rows 89..92)
+--------+-----------+--------------+---------------+---------------+
| c | ujis_ucs2 | eucjpms_ucs2 | eucjpms_cp932 | cp932_eucjpms |
+--------+-----------+--------------+---------------+---------------+
| 8FF4CB | 003F | 6DF8 | FB43 | 8FF4CB | CJK UNIF
| 8FF4CC | 003F | 6DF2 | FB44 | 8FF4CC | CJK UNIF
| 8FF4CD | 003F | 7028 | FB50 | 8FF4CD | CJK UNIF
| 8FF4CE | 003F | FA15 | FB58 | 8FF4CE | CJK COMPAT
| 8FF4CF | 003F | FA16 | FB5E | 8FF4CF | CJK COMPAT
| 8FF4D0 | 003F | 7501 | FB6E | 8FF4D0 | CJK UNIF
| 8FF4D1 | 003F | 7682 | FB70 | 8FF4D1 | CJK UNIF
| 8FF4D2 | 003F | 769E | FB72 | 8FF4D2 | CJK UNIF
| 8FF4D3 | 003F | FA17 | FB75 | 8FF4D3 | CJK COMPAT
| 8FF4D4 | 003F | 7930 | FB7C | 8FF4D4 | CJK UNIF
| 8FF4D5 | 003F | FA18 | FB7D | 8FF4D5 | CJK COMPAT
| 8FF4D6 | 003F | FA19 | FB7E | 8FF4D6 | CJK COMPAT
| 8FF4D7 | 003F | FA1A | FB80 | 8FF4D7 | CJK COMPAT
| 8FF4D8 | 003F | FA1B | FB82 | 8FF4D8 | CJK COMPAT
| 8FF4D9 | 003F | 7AE7 | FB85 | 8FF4D9 | CJK UNIF
| 8FF4DA | 003F | FA1C | FB86 | 8FF4DA | CJK COMPAT
| 8FF4DB | 003F | FA1D | FB89 | 8FF4DB | CJK COMPAT
| 8FF4DC | 003F | 7DA0 | FB8D | 8FF4DC | CJK UNIF
| 8FF4DD | 003F | 7DD6 | FB8E | 8FF4DD | CJK UNIF
| 8FF4DE | 003F | FA1E | FB92 | 8FF4DE | CJK COMPAT
| 8FF4DF | 003F | 8362 | FB94 | 8FF4DF | CJK UNIF
| 8FF4E0 | 003F | FA1F | FB9D | 8FF4E0 | CJK COMPAT
| 8FF4E1 | 003F | 85B0 | FB9E | 8FF4E1 | CJK UNIF
| 8FF4E2 | 003F | FA20 | FB9F | 8FF4E2 | CJK COMPAT
| 8FF4E3 | 003F | FA21 | FBA0 | 8FF4E3 | CJK COMPAT
| 8FF4E4 | 003F | 8807 | FBA1 | 8FF4E4 | CJK UNIF
| 8FF4E5 | 003F | FA22 | FBA9 | 8FF4E5 | CJK COMPAT
| 8FF4E6 | 003F | 8B7F | FBAC | 8FF4E6 | CJK UNIF
| 8FF4E7 | 003F | 8CF4 | FBAE | 8FF4E7 | CJK UNIF
| 8FF4E8 | 003F | 8D76 | FBB0 | 8FF4E8 | CJK UNIF
| 8FF4E9 | 003F | FA23 | FBB1 | 8FF4E9 | CJK COMPAT
| 8FF4EA | 003F | FA24 | FBB3 | 8FF4EA | CJK COMPAT
| 8FF4EB | 003F | FA25 | FBB4 | 8FF4EB | CJK COMPAT
| 8FF4EC | 003F | 90DE | FBB6 | 8FF4EC | CJK UNIF
| 8FF4ED | 003F | FA26 | FBB7 | 8FF4ED | CJK COMPAT
| 8FF4EE | 003F | 9115 | FBB8 | 8FF4EE | CJK UNIF
| 8FF4EF | 003F | FA27 | FBD3 | 8FF4EF | CJK COMPAT
| 8FF4F0 | 003F | FA28 | FBDA | 8FF4F0 | CJK COMPAT
| 8FF4F1 | 003F | 9592 | FBE8 | 8FF4F1 | CJK UNIF
| 8FF4F2 | 003F | F9DC | FBE9 | 8FF4F2 | CJK COMPAT
| 8FF4F3 | 003F | FA29 | FBEA | 8FF4F3 | CJK COMPAT
| 8FF4F4 | 003F | 973B | FBEE | 8FF4F4 | CJK UNIF
| 8FF4F5 | 003F | 974D | FBF0 | 8FF4F5 | CJK UNIF
| 8FF4F6 | 003F | 9751 | FBF2 | 8FF4F6 | CJK UNIF
| 8FF4F7 | 003F | FA2A | FBF6 | 8FF4F7 | CJK COMPAT
| 8FF4F8 | 003F | FA2B | FBF7 | 8FF4F8 | CJK COMPAT
| 8FF4F9 | 003F | FA2C | FBF9 | 8FF4F9 | CJK COMPAT
| 8FF4FA | 003F | 999E | FBFA | 8FF4FA | CJK UNIF
| 8FF4FB | 003F | 9AD9 | FBFC | 8FF4FB | CJK UNIF
+--------+-----------+--------------+---------------+---------------+
eucjpms encodes these characters only once, in JIS-X-0212 block.
cp932 encodes these characters twice:
EDE4 = U+6DF8 CJK (in NEC selection of IBM characters, rows 89..92)
FB43 = U+6DF8 CJK (IBM Extensions, rows 117..118)
See CP932-NEC-IBM.txt in attachment.
- IBM Extensions (as in cp932, Row 119)
and
NEC selected IBM characters (as in cp932, rows 89..92)
+--------+-----------+--------------+---------------+---------------+
| c | ujis_ucs2 | eucjpms_ucs2 | eucjpms_cp932 | cp932_eucjpms |
+--------+-----------+--------------+---------------+---------------+
| 8FF4FC | 003F | 9B72 | FC42 | 8FF4FC |
| 8FF4FD | 003F | FA2D | FC49 | 8FF4FD |
| 8FF4FE | 003F | 9ED1 | FC4B | 8FF4FE |
+--------+-----------+--------------+---------------+---------------+
108 rows in set, 1140 warnings (0.04 sec)
eucjpms encodes these characters only once, in JIS-X-0212 block.
cp932 encodes these characters twice:
EEE3 = U+9B72 CJK UNIFIED IDEOGRAPH (NEC selected IBM characters, rows 89..92)
FC42 = U+9B72 CJK UNIFIED IDEOGRAPH (IBM Extensions, row 119)
EEEA = U+FA2D CJK COMPATIBILITY IDEOGRAPH (NEC selected IBM characters, rows 89..92)
FC49 = U+FA2D CJK COMPATIBILITY IDEOGRAPH (IBM Extensions, row 119)
EEEC = U+9ED1 CJK UNIFIED IDEOGRAPH (NEC selected IBM characters, rows 89..92)
FC4B = U+9ED1 CJK UNIFIED IDEOGRAPH (IBM Extensions, row 119)