terça-feira, 10 de fevereiro de 2009

MySQL and UTF8: troubles?

2004/11/28




MySQL and UTF8: troubles?






It is unbelievable how hard to get MySQL to work with UTF8 through JDBC. So I though I jot down some notes to help me n future ...



Easy as pie: convert LATIN1 to UTF8


Or more like going from first circle of hell (LATIN1) to almost nine (UTF8) ...


First of all if you use any collation with _ci at the end
(such as utf8_general_ci, or latin1_general_ci that is not the same as utf_bin!!!)
and you have strings as key then do not be surprised when "Music" is the same as "music" (I was!).


That broke lot of things in Is2.
So I had to fix it.
First I noticed that ominous
'latin1_swedish_ci' is used.



The first impulse was to change it to something more suitable for Java like say UTF-8?! Not good idea ...
First I was happy to notice that MySQL 4.1 finally has UTF8 on the

list of supported character sets and collation.
To set your own default server wide modify my.ini


set-variable=default-character-set=utf8
set-variable=default-collation=utf8_bin
set-variable=collation_server=utf8_bin



However doing this turned out has some side effects.
Server default may change some table columns char sets and collation and that may break apps that used databases assuming _ci for columns and of course it happened ...


One sad effect of it is you get this really useless:
General error message from server: "Illegal mix of collations (latin1_general_ci,IMPLICIT) and (utf8_bin,IMPLICIT) for operation '='"; nested exception is: General error message from server: "Illegal mix of collations (latin1_general_ci,IMPLICIT) and (utf8_bin,IMPLICIT) for operation '='"
.
More on defaults in notes at the end of this page.


JDBC is root of all evil



So I got this fixed after recreating all tables (what fun). Then I was hit but something even more fun and what looks like serious
violations of abstractions ("Leaky Abstractions" anybody).
If you have one of latin1 or similar 8bit encodings for VARCHAR or MEDIUMTEXT
they are seamlessly mapped to String. So far so good.
But try "utf8" and getObject will hit you with byte[] (and lovely ClassCastException: [B). I tried different versions of Connector/J
including 3.1.4-beta. No luck.

And of course
documentation is silent on this issue
and no luck chekcing mailing list archives or
MySQL Connectors/JDBC forum
...
Solution?
The only way around is bite the bullet and convert byte array yourself:


private static String convertToString(Object v)
throws IceException
{
if(v == null) {
return null;
}
if(v instanceof String) {
return v.toString();
}
byte[] barr = (byte[]) v;
try {
return new String(barr, "UTF-8");
} catch (UnsupportedEncodingException e) {
throw new IceException(
"could not convert UTF-8 string ", e);
}
}



Also ti seems to help to use in JDBC URL
useUnicode=true and
characterEncoding=utf8
leading to

really long JDBC URLs ...


There is also small bug is in
MySql Control Center 0.9.4-beta
- if you have utf8_bin for MEDIUMTEXT it tries to display it as image. As typically MEDIUMTEXT
is not used as key in database (haha) so it is safe to make it utf8_general_ci to make MySql CC happy ...




Additional notes


All this nonsense about default charset and collating per server, database, and table is useless. It matters only when database or table is created. That is it!


The database character set and collation are used as default values if the table character set and collation are not specified in CREATE TABLE statements. They have no other purpose. ---

doc



For table columns you need to be ultra careful - they may or may not be specified. And if not specified changing server default (like moving your data to other server) can have dire consequences ...



The table character set and collation are used as default values if the column character set and collation are not specified in individual column definitions. The table character set and collation are MySQL extensions; there are no such things in standard SQL.
---

doc






So if yo happen to modify charsets and collation for some tables here is what worked for me: dump to file database and recreate tables ... doing ALTER for each column is just too painful!



CREATE DATABASE is2db
DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_bin

USE is2db;

ALTER DATABASE is2db
DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_bin

ALTER TABLE entity
DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_bin

ALTER TABLE entity MODIFY xml mediumtext
CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL;




And some handy debug commands


SHOW VARIABLES;
SHOW CHARACTER SET;
SHOW COLLATION LIKE 'latin1%';





NOTE:

full text indexes are case sensitive from version 4.1/
if you have latin_bin collating or similiar used -
BEWARE!!!!!



Notes on making full text indexing table to work:


ALTER TABLE lookup_fulltext
DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci

ALTER TABLE lookup_fulltext MODIFY title varchar(255)
CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL
ALTER TABLE lookup_fulltext MODIFY content mediumtext
CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL
ALTER TABLE lookup_fulltext MODIFY meta mediumtext
CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL

ALTER TABLE lookup_fulltext MODIFY space_id varchar(16)
CHARACTER SET utf8 COLLATE utf8_bin NOT NULL;
ALTER TABLE lookup_fulltext MODIFY id varchar(255)
CHARACTER SET utf8 COLLATE utf8_bin NOT NULL;
ALTER TABLE lookup_fulltext MODIFY status varchar(8)
CHARACTER SET utf8 COLLATE utf8_bin NOT NULL;
ALTER TABLE lookup_fulltext MODIFY author varchar(16)
CHARACTER SET utf8 COLLATE utf8_bin NOT NULL;

Nenhum comentário: