Dealing with language-specific characters is always tedious work. Last week I spent a few hours trying to find out why Zend_Search_Lucene doesn't find words that include Czech-specific letters. It seemed to work for some words and not for others. To be honest, I still don't know what was wrong.
My first idea was really simple: convert all non-English characters to their closest unaccented form, such as 'á' to 'a', 'č' to 'c', etc. If I only had to convert text from Czech to the standard English alphabet, it would be quite easy using strtr, but what if someone tried to use German, Spanish or French?
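For the Czech-only case, a minimal sketch of the strtr approach might look like the following. The function name unaccent_czech and the map itself are illustrative assumptions, not code from any library, and the map covers Czech letters only.

```php
<?php
// Hypothetical helper: replaces Czech accented letters with plain ASCII ones.
function unaccent_czech($text)
{
    // Illustrative map for Czech only; any other language needs its own entries.
    static $map = array(
        'á' => 'a', 'č' => 'c', 'ď' => 'd', 'é' => 'e', 'ě' => 'e',
        'í' => 'i', 'ň' => 'n', 'ó' => 'o', 'ř' => 'r', 'š' => 's',
        'ť' => 't', 'ú' => 'u', 'ů' => 'u', 'ý' => 'y', 'ž' => 'z',
        'Á' => 'A', 'Č' => 'C', 'Ď' => 'D', 'É' => 'E', 'Ě' => 'E',
        'Í' => 'I', 'Ň' => 'N', 'Ó' => 'O', 'Ř' => 'R', 'Š' => 'S',
        'Ť' => 'T', 'Ú' => 'U', 'Ů' => 'U', 'Ý' => 'Y', 'Ž' => 'Z',
    );

    // The array form of strtr() replaces whole keys, so multi-byte
    // UTF-8 letters are handled correctly.
    return strtr($text, $map);
}

echo unaccent_czech('Babí léto definitivně skončilo'), "\n";
// prints "Babi leto definitivne skoncilo"
```

The weakness is visible right away: a German 'ü' or a Spanish 'ñ' passes through unchanged until someone adds it to the map.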
When I started using Doctrine I was quite curious how their "slug generator" works, because it converts strings written in Spanish, French, Czech, German and other languages into a URL-friendly form: no accents or language-specific characters, only English letters, numbers and dashes. I knew there must be some magic inside. Just to be clear what I mean, it turns this
Babí léto definitivně skončilo, zatáhne se a na horách začne sněžit
into this
babi-leto-definitivne-skoncilo-zatahne-se-a-na-horach-zacne-snezit
That's perfect, so how does Doctrine do this?
Doctrine 1.2, like many other frameworks, comes with some self-contained components. One of them is the Doctrine_Inflector class. This class contains only static functions related to text manipulation. In this case the most important are Doctrine_Inflector::unaccent and Doctrine_Inflector::urlize. I think their names are self-explanatory. The best thing is that these functions have been tested by programmers all over the world, so if there was a bug, it has probably already been fixed.
So what did I do with Zend_Search_Lucene? Before indexing any text I always converted it like this:

    echo Doctrine_Inflector::unaccent('Babí léto definitivně skončilo, zatáhne se a na horách začne sněžit');
    // prints "Babi leto definitivne skoncilo, zatahne se a na horach zacne snezit"

    // and by the way, this is how urlize works
    echo Doctrine_Inflector::urlize('Babí léto definitivně skončilo, zatáhne se a na horách začne sněžit');
    // prints "babi-leto-definitivne-skoncilo-zatahne-se-a-na-horach-zacne-snezit"

and I do exactly the same when a user enters a search query. I'm not sure if this is the best practice, but I always try to avoid problems with character encodings when storing anything outside the database.
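The important part is that the indexed text and the query go through the same conversion. Here is a rough sketch of that idea using a toy in-memory array instead of Zend_Search_Lucene. The names normalize_for_search and toy_search are hypothetical, and the small map is only a stand-in for Doctrine_Inflector::unaccent.

```php
<?php
// Hypothetical stand-in for Doctrine_Inflector::unaccent(),
// covering only a handful of Czech letters.
function normalize_for_search($text)
{
    static $map = array(
        'á' => 'a', 'č' => 'c', 'é' => 'e', 'ě' => 'e', 'í' => 'i',
        'ř' => 'r', 'š' => 's', 'ů' => 'u', 'ý' => 'y', 'ž' => 'z',
    );
    return strtr($text, $map);
}

// Toy substring search over an array of document id => normalized text.
// This is NOT how Zend_Search_Lucene works; it only shows the symmetry.
function toy_search(array $index, $query)
{
    // The query goes through the same conversion as the indexed text.
    $needle = normalize_for_search($query);
    if ($needle === '') {
        return array();
    }
    $hits = array();
    foreach ($index as $id => $text) {
        if (strpos($text, $needle) !== false) {
            $hits[] = $id;
        }
    }
    return $hits;
}

// Normalize once at indexing time.
$index = array(
    1 => normalize_for_search('Na horách začne sněžit'),
    2 => normalize_for_search('Babí léto skončilo'),
);

print_r(toy_search($index, 'sněžit')); // accented query finds document 1
print_r(toy_search($index, 'snezit')); // unaccented query finds it too
```

A side effect of this approach is that a user who types the query without accents still gets the same results.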
Great, but where's the source code?
If you want to use Doctrine_Inflector as I did, you don't have to download the whole Doctrine ORM. You can just extract the functions you want from the Doctrine_Inflector source code.
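To give an idea of how little code a standalone version needs, here is a simplified sketch of an unaccent and urlize pair. It is not the actual Doctrine_Inflector source: the names my_unaccent and my_urlize are hypothetical, and the character map is deliberately tiny, whereas the real class covers far more characters.

```php
<?php
// Simplified stand-in for an extracted unaccent function.
// The map is illustrative and covers only a few Czech letters.
function my_unaccent($text)
{
    static $map = array(
        'á' => 'a', 'č' => 'c', 'é' => 'e', 'ě' => 'e', 'í' => 'i',
        'ř' => 'r', 'š' => 's', 'ů' => 'u', 'ý' => 'y', 'ž' => 'z',
    );
    return strtr($text, $map);
}

// Simplified slug generator: unaccent, lowercase, dashes between words.
function my_urlize($text)
{
    $text = strtolower(my_unaccent($text));
    // Collapse every run of characters other than a-z and 0-9 into one dash.
    $text = preg_replace('/[^a-z0-9]+/', '-', $text);
    // Drop dashes left over at either end.
    return trim($text, '-');
}

echo my_urlize('Babí léto definitivně skončilo, zatáhne se a na horách začne sněžit'), "\n";
// prints "babi-leto-definitivne-skoncilo-zatahne-se-a-na-horach-zacne-snezit"
```

Any letter missing from the map ends up as a dash in the slug, which is a good reason to copy the full map from the real source rather than writing your own.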