Handling non-English characters

Although the American Standard Code for Information Interchange (ASCII) works for most of us, it defines only 128 characters, and even the 8-bit character sets that extend it allow just 256 characters to describe everything available to print. That range, 0 to 255, is used because it is what fits in a "byte" - eight ones and zeroes in computing terminology. Languages such as Russian, Korean, and Japanese have characters outside that range, which means you need more than 256 characters in total, and therefore more than one byte of space for some of them - you need multibyte characters.
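To see the difference between bytes and characters, here is a minimal sketch that assumes the text is stored as UTF-8 (an assumption; other multibyte encodings give different byte counts). The hex escapes spell out the exact bytes of each character, so the result does not depend on how the script file itself is saved.

```php
<?php
// Assumption: UTF-8 encoding. Each string below holds exactly one character.
$latin    = "A";             // U+0041, one byte in UTF-8
$cyrillic = "\xD0\xAF";      // Я (U+042F), two bytes in UTF-8
$kanji    = "\xE6\x97\xA5";  // 日 (U+65E5), three bytes in UTF-8

// strlen() reports the number of bytes, not the number of characters.
echo strlen($latin), "\n";    // 1
echo strlen($cyrillic), "\n"; // 2
echo strlen($kanji), "\n";    // 3
```

All three strings are one character long to a human reader, yet only the first is one byte long.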

Dealing with these complex characters is a little different to working with normal characters, because functions like substr() and strtoupper() expect precisely one byte per character, and can corrupt a multibyte string. Instead, you should use the multibyte equivalents of these functions from the mbstring extension, such as mb_strtoupper() instead of strtoupper(), mb_ereg() rather than ereg(), and mb_strlen() rather than strlen(). The parameters required for these functions are the same as their originals, except that most accept an optional extra parameter to force a specific encoding.
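As a rough illustration of the single-byte and multibyte functions side by side, the sketch below assumes the mbstring extension is loaded and that the text is UTF-8. The sample word, the Russian "привет", is written as hex escapes so the example works regardless of the file's own encoding, and the encoding is passed explicitly through the optional last parameter.

```php
<?php
// Assumptions: mbstring is loaded and the text is UTF-8.
// "привет" (6 characters) written out as its 12 UTF-8 bytes.
$word = "\xD0\xBF\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82";

echo strlen($word), "\n";             // 12 (bytes)
echo mb_strlen($word, 'UTF-8'), "\n"; // 6 (characters)

// substr() cuts after 3 bytes, slicing the second character in half.
$broken = substr($word, 0, 3);

// mb_substr() cuts after 3 characters, giving "при".
$intact = mb_substr($word, 0, 3, 'UTF-8');

// mb_check_encoding() confirms which result is still valid UTF-8.
var_dump(mb_check_encoding($broken, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($intact, 'UTF-8')); // bool(true)

// mb_strtoupper() understands Cyrillic and produces "ПРИВЕТ".
$upper = mb_strtoupper($word, 'UTF-8');
echo $upper, "\n";
```

Passing the encoding each time is a little verbose, but it avoids relying on whatever default encoding the server happens to be configured with.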

So, working with multibyte strings is easy for the most part, with one exception: what do you do with an existing script you'd like to multibyte enable? To cope with that scenario, there's a special php.ini setting: mbstring.func_overload. By default this is set to 0, which means functions behave as you would expect them to. If you set it to 1, calling the mail() function gets silently rerouted to the mb_send_mail() function. If you set it to 2, the common string functions, such as strlen(), strpos(), substr(), and strtoupper(), get rerouted to their multibyte partners. If you set it to 4, the "ereg" functions get rerouted. You can combine these together as you please by simply adding them - for example, for "mail" and "str" rerouting you add 1 and 2, giving 3, so you set mbstring.func_overload to 3 to overload these two. To overload everything, set it to 7: 1 ("mail") + 2 ("str") + 4 ("ereg"). Note that the ereg functions were removed in PHP 7.0, and mbstring.func_overload itself was deprecated in PHP 7.2 and removed in PHP 8.0, so this shortcut only applies to older installations.
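The way those values add up can be sketched in a few lines of PHP. The constant names and the overloadGroups() helper are hypothetical, made up purely for illustration; php.ini itself only takes the final number.

```php
<?php
// Hypothetical names for the three values described above.
const OVERLOAD_MAIL   = 1; // mail() is rerouted to mb_send_mail()
const OVERLOAD_STRING = 2; // the "str" group is rerouted
const OVERLOAD_REGEX  = 4; // the "ereg" group is rerouted

// Each value is a separate bit, so adding them and OR-ing them agree.
$mailAndString = OVERLOAD_MAIL | OVERLOAD_STRING;
$everything    = OVERLOAD_MAIL | OVERLOAD_STRING | OVERLOAD_REGEX;

echo $mailAndString, "\n"; // 3
echo $everything, "\n";    // 7

// In php.ini the first combination would be written as:
//   mbstring.func_overload = 3

// Hypothetical helper: turn a setting value back into the groups it enables.
function overloadGroups($value)
{
    $groups = array();
    if ($value & OVERLOAD_MAIL) {
        $groups[] = 'mail';
    }
    if ($value & OVERLOAD_STRING) {
        $groups[] = 'str';
    }
    if ($value & OVERLOAD_REGEX) {
        $groups[] = 'ereg';
    }
    return $groups;
}

echo implode(', ', overloadGroups(3)), "\n"; // mail, str
```

Because each group owns its own bit, any combination from 0 to 7 maps to exactly one set of rerouted functions.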

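PHP 6, which was under development almost as long as Perl 6, was meant to bring with it full support for Unicode, but it was abandoned without ever being released, so the mbstring functions remain the way to work with multibyte text.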

 
