Chapter 20. Internationalization and Localization

As mentioned in “Text”, strings in PHP are sequences of bytes. A byte can hold one of 256 possible values. This means that representing text that uses only English characters (the US-ASCII character set) is straightforward in PHP, but you must take extra steps to ensure that text containing other kinds of characters is processed properly.

The Unicode standard catalogs the many thousands of characters that computers can represent and assigns each one a number, called a code point. In addition to letters such as ä, ñ, ž, λ, ד, د, and ド, the standard also includes a variety of symbols and icons. The UTF-8 encoding defines which bytes represent each character. Each US-ASCII character is represented by just one byte, but other characters require two, three, or four bytes.
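To make those byte counts concrete, here is a small sketch that prints how many bytes a few characters occupy, along with the bytes themselves in hexadecimal. It assumes the file containing the code is saved as UTF-8; the emoji is an extra sample added for illustration.

```php
<?php
// Assumes this file is saved as UTF-8, so each literal holds UTF-8 bytes.
$samples = array('a', 'ñ', 'ド', '😀');

foreach ($samples as $char) {
    // strlen() counts bytes; bin2hex() shows the bytes themselves
    print "$char uses " . strlen($char) . " byte(s): " . bin2hex($char) . "\n";
}
```

With the file saved as UTF-8, the four samples should report one, two, three, and four bytes, respectively.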

You probably don’t have to do anything special to ensure your PHP installation uses UTF-8 for text processing. The default_charset configuration directive controls what encoding is used, and its default value is UTF-8. If you are having problems, check that default_charset is still set to UTF-8.
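If you want to check the setting from inside a program, a sketch along these lines should do it. It uses only the built-in ini_get() function; the wording of the messages is made up for this illustration.

```php
<?php
// ini_get() returns the current value of a configuration directive
$charset = ini_get('default_charset');

// Compare without regard to case, since "utf-8" and "UTF-8" mean the same
if (strcasecmp($charset, 'UTF-8') === 0) {
    print "default_charset is UTF-8.\n";
} else {
    print "default_charset is '$charset'; consider setting it to UTF-8.\n";
}
```

If the value turns out to be something else and you can't edit php.ini, calling ini_set('default_charset', 'UTF-8') near the top of your program is one possible workaround.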

This chapter tours the basics of successfully working with multibyte UTF-8 characters in your PHP programs. The next section, “Manipulating Text”, explains basic text manipulations, such as calculating length and extracting substrings. “Sorting and Comparing” shows how to sort and compare strings in ways that respect different languages’ rules for the proper order of characters. “Localizing Output” provides examples of how to use PHP’s message formatting features so your program can display information in a user’s preferred language.

The code in this chapter relies on PHP functions in the mbstring and intl extensions. The functions in “Manipulating Text” whose names begin with mb_ require the mbstring extension. The Collator and MessageFormatter classes referenced in “Sorting and Comparing” and “Localizing Output” require the intl extension. The intl extension in turn relies on the third-party ICU library. If these extensions aren’t available, ask your system administrator or hosting provider to install them, or follow the instructions in Appendix A.
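Before running the examples, you can verify that both extensions are present. This sketch relies on the built-in extension_loaded() function; the variable names and messages are only illustrative.

```php
<?php
// extension_loaded() reports whether a named extension is available
$required = array('mbstring', 'intl');
$missing = array();

foreach ($required as $extension) {
    if (! extension_loaded($extension)) {
        $missing[] = $extension;
    }
}

if (count($missing) > 0) {
    print "Missing extensions: " . implode(', ', $missing) . "\n";
} else {
    print "mbstring and intl are both available.\n";
}
```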

Manipulating Text

Since the strlen() function counts bytes, not characters, it overstates the length of any string that contains a character requiring more than one byte. To count the characters in a string, independent of how many bytes each character requires, use mb_strlen(), as shown in Example 20-1.

Example 20-1. Measuring string length
$english = "cheese";
$greek = "τυρί";

print "strlen() says " . strlen($english) . " for $english and " .
    strlen($greek) . " for $greek.\n";

print "mb_strlen() says " . mb_strlen($english) . " for $english and " .
    mb_strlen($greek) . " for $greek.\n";

Since each of the Greek characters requires two bytes, the output of Example 20-1 is:

strlen() says 6 for cheese and 8 for τυρί.
mb_strlen() says 6 for cheese and 4 for τυρί.
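The mb_ functions accept an optional encoding argument. Passing it explicitly, as in this sketch, keeps the result from depending on how a particular server is configured. The '8bit' encoding name is one that mbstring understands to mean "treat every byte as a character."

```php
<?php
$greek = "τυρί";

// Name the encoding explicitly instead of relying on configuration
print mb_strlen($greek, 'UTF-8') . "\n";   // counts characters

// With '8bit', every byte counts as one character, just like strlen()
print mb_strlen($greek, '8bit') . "\n";    // counts bytes
```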

Operations that depend on string positions, such as finding substrings, must also be done in a character-aware instead of byte-aware way when multibyte characters are used. Example 2-12 used substr() to extract the first 30 bytes of a user-submitted message. To extract the first 30 characters, use mb_substr() instead, as shown in Example 20-2.

Example 20-2. Extracting a substring
$message = "In Russia, I like to eat каша and drink квас.";

print "substr() says: " . substr($message, 0, 30) . "\n";
print "mb_substr() says: " . mb_substr($message, 0, 30) . "\n";

Example 20-2 prints:

substr() says: In Russia, I like to eat ка�
mb_substr() says: In Russia, I like to eat каша 

The line of output from substr() is totally bungled! Each Cyrillic character requires more than one byte, and 30 bytes into the string is midway through the byte sequence for a particular character. The output from mb_substr() stops properly on the correct character boundary.
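Finding a substring has the same byte-versus-character split. In this sketch, which reuses the message from Example 20-2, strpos() reports a byte offset and mb_strpos() reports a character offset. The two differ by four because each of the four letters in каша takes two bytes.

```php
<?php
$message = "In Russia, I like to eat каша and drink квас.";

// strpos() returns a byte offset; mb_strpos() returns a character offset
$bytePos = strpos($message, 'квас');
$charPos = mb_strpos($message, 'квас', 0, 'UTF-8');

print "strpos() says: $bytePos\n";        // 44
print "mb_strpos() says: $charPos\n";     // 40

// A character offset belongs with mb_substr(), not substr()
print mb_substr($message, $charPos, 4, 'UTF-8') . "\n";
```

Mixing the two kinds of offsets, such as passing the result of strpos() to mb_substr(), is an easy way to end up at the wrong spot in the string.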

Changing case is another operation that has to understand characters rather than bytes, since letters outside the English alphabet have uppercase and lowercase forms too. The mb_strtolower() and mb_strtoupper() functions provide character-aware versions of strtolower() and strtoupper(). Example 20-3 shows these functions at work.

Example 20-3. Changing case
$english = "Please stop shouting.";
$danish = "Venligst stoppe råben.";
$vietnamese = "Hãy dừng la hét.";

print "strtolower() says: \n";
print "   " . strtolower($english) . "\n";
print "   " . strtolower($danish) . "\n";
print "   " . strtolower($vietnamese) . "\n";

print "mb_strtolower() says: \n";
print "   " . mb_strtolower($english) . "\n";
print "   " . mb_strtolower($danish) . "\n";
print "   " . mb_strtolower($vietnamese) . "\n";

print "strtoupper() says: \n";
print "   " . strtoupper($english) . "\n";
print "   " . strtoupper($danish) . "\n";
print "   " . strtoupper($vietnamese) . "\n";

print "mb_strtoupper() says: \n";
print "   " . mb_strtoupper($english) . "\n";
print "   " . mb_strtoupper($danish) . "\n";
print "   " . mb_strtoupper($vietnamese) . "\n";

Example 20-3 prints:

strtolower() says: 
   please stop shouting.
   venligst stoppe r�ben.
   h�y dừng la h�t.
mb_strtolower() says: 
   please stop shouting.
   venligst stoppe råben.
   hãy dừng la hét.
strtoupper() says: 
   PLEASE STOP SHOUTING.
   VENLIGST STOPPE RåBEN.
   HãY DừNG LA HéT.
mb_strtoupper() says: 
   PLEASE STOP SHOUTING.
   VENLIGST STOPPE RÅBEN.
   HÃY DỪNG LA HÉT.

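Because strtoupper() and strtolower() work on individual bytes, they don’t replace whole multibyte characters with the correct equivalents the way mb_strtoupper() and mb_strtolower() do. Exactly what they do to those bytes depends on your PHP version and locale settings. As of PHP 8.2, they change only ASCII letters and leave all other bytes alone, so accented letters come through uncorrupted but also unconverted.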
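The mbstring extension also offers mb_convert_case(), which handles title case in addition to upper and lower. The following sketch shows it alongside one common use of case conversion: comparing two strings while ignoring case. The sample strings are borrowed from Example 20-3.

```php
<?php
$vietnamese = "hãy dừng la hét.";

// MB_CASE_TITLE capitalizes the first letter of each word
print mb_convert_case($vietnamese, MB_CASE_TITLE, 'UTF-8') . "\n";

// To compare while ignoring case, fold both sides to the same case first
$a = "RÅBEN";
$b = "råben";
if (mb_strtolower($a, 'UTF-8') === mb_strtolower($b, 'UTF-8')) {
    print "$a and $b match when case is ignored.\n";
}
```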

Sorting and Comparing

PHP’s built-in text sorting and comparison functions also operate byte by byte. The resulting order matches alphabetical order only for unaccented English letters of the same case. Turn to the Collator class to do these operations in a character-aware manner.

First, construct a Collator object, passing its constructor a locale string. This string references a particular country and language and tells the Collator what rules to use. There are lots of finicky details about what can go into a locale string, but usually it’s a two-letter language code, then _, then a two-letter country code. For example, en_US for US English, fr_BE for Belgian French, or ko_KR for Korean as used in South Korea. Both a language code and a country code are provided to allow for the different ways a language may be used in different countries.
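To see how a locale string breaks down, you can hand it to the intl extension’s Locale class. This chapter doesn’t otherwise cover that class, so treat the following as a side sketch rather than something the later examples need. The human-readable names come from ICU’s data, so they may vary slightly between versions.

```php
<?php
$locales = array('en_US', 'fr_BE', 'ko_KR');

foreach ($locales as $locale) {
    // Split the locale string into its language and country (region) parts
    $language = Locale::getPrimaryLanguage($locale);
    $region = Locale::getRegion($locale);

    // Ask ICU for the English names of each part
    $languageName = Locale::getDisplayLanguage($locale, 'en');
    $regionName = Locale::getDisplayRegion($locale, 'en');

    print "$locale: language $language ($languageName), " .
          "country $region ($regionName)\n";
}
```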

The Collator’s sort() method does the same thing as the built-in sort() function, but in a language-aware way: it sorts array values in place. Example 20-4 shows how this method works.

Example 20-4. Sorting arrays
// US English
$en = new Collator('en_US');
// Danish
$da = new Collator('da_DK');

$words = array('absent','åben','zero');

print "Before sorting: " . implode(', ', $words) . "\n";

$en->sort($words);
print "en_US sorting: " . implode(', ', $words) . "\n";

$da->sort($words);
print "da_DK sorting: " . implode(', ', $words) . "\n";

In Example 20-4, the US English rules put the Danish word åben before the English word absent, but in Danish, the å character sorts at the end of the alphabet, so åben goes at the end of the array.

The Collator class also has an asort() method that parallels the built-in asort() function. In addition, the compare() method works like strcmp(). It returns -1 if the first string sorts before the second, 0 if they are equal, and 1 if the first string sorts after the second.
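Here is a short sketch of both methods, using the same two locales and some of the same words as Example 20-4. The array keys are invented just to show that asort() preserves them.

```php
<?php
$en = new Collator('en_US');
$da = new Collator('da_DK');

// compare() applies each locale's rules to the same pair of words
print "en_US: " . $en->compare('åben', 'zero') . "\n";   // -1
print "da_DK: " . $da->compare('åben', 'zero') . "\n";   // 1

// asort() sorts by value but keeps each value attached to its key
$words = array('first' => 'zero', 'second' => 'åben', 'third' => 'absent');
$da->asort($words);
foreach ($words as $key => $word) {
    print "$key => $word\n";
}
```

Under the Danish rules, the loop should list absent, then zero, then åben, each still paired with its original key.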

Localizing Output

An application used by people all over the world not only has to handle different character sets properly, but also has to produce messages in different languages. One person’s “Click here” is another’s “Cliquez ici” or “اضغط هنا”. The MessageFormatter class helps you generate messages that are appropriately localized for different places.

First, you need to build a message catalog. This is a list of translated messages for each of the locales you support. They could be simple strings such as Click here, or they may contain markers for values to be interpolated, such as My favorite food is {0}, in which {0} should be replaced with a word.

In a big application, you may have hundreds of different items in your message catalog for each locale. To explain how MessageFormatter works, Example 20-5 shows a few entries in a sample catalog.

Example 20-5. Defining a message catalog
$messages = array();
$messages['en_US'] = array('FAVORITE_FOODS' => 'My favorite food is {0}',
                           'COOKIE' => 'cookie',
                           'SQUASH' => 'squash');
$messages['en_GB'] = array('FAVORITE_FOODS' => 'My favourite food is {0}',
                           'COOKIE' => 'biscuit',
                           'SQUASH' => 'marrow');

The keys in the $messages array are locale strings. The values are the messages appropriately translated for each locale, indexed by a key that is used to refer to the message later.
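A real catalog is rarely complete for every locale, so it can help to wrap lookups in a function that falls back to a default locale. The lookup_message() function below is a hypothetical helper written for this illustration, not part of PHP or the intl extension, and the choice of en_US as the fallback is an assumption.

```php
<?php
$messages = array();
$messages['en_US'] = array('FAVORITE_FOODS' => 'My favorite food is {0}',
                           'COOKIE' => 'cookie');
$messages['en_GB'] = array('COOKIE' => 'biscuit');

// Hypothetical helper: return the message for $locale, or fall back to
// $fallback when that locale has no entry for $key.
function lookup_message(array $messages, $locale, $key, $fallback = 'en_US') {
    if (isset($messages[$locale][$key])) {
        return $messages[$locale][$key];
    }
    if (isset($messages[$fallback][$key])) {
        return $messages[$fallback][$key];
    }
    // No translation anywhere: return the key so the gap is visible
    return $key;
}

print lookup_message($messages, 'en_GB', 'COOKIE') . "\n";
print lookup_message($messages, 'en_GB', 'FAVORITE_FOODS') . "\n";
```

The first lookup finds the en_GB entry and prints biscuit. The second finds nothing under en_GB and falls back to the en_US message format.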

To create a locale-specific message, create a new MessageFormatter object by providing a locale and a message format to its constructor, as shown in Example 20-6.

Example 20-6. Formatting a message
$fmtfavs = new MessageFormatter('en_GB', $messages['en_GB']['FAVORITE_FOODS']);
$fmtcookie = new MessageFormatter('en_GB', $messages['en_GB']['COOKIE']);

// This returns "biscuit"
$cookie = $fmtcookie->format(array());

// This prints the sentence with "biscuit" substituted
print $fmtfavs->format(array($cookie));

Example 20-6 prints:

My favourite food is biscuit

When a message format has curly braces, the elements in the array passed as an argument to format() are substituted for the curly braces.
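A message format can contain more than one numbered marker. In the following sketch, the message format and food names are made up for illustration. It also shows the static formatMessage() method, which builds a formatter and formats a message in a single call.

```php
<?php
$format = 'My favourite foods are {0} and {1}';
$fmt = new MessageFormatter('en_GB', $format);

// {0} takes the first array element, {1} the second
print $fmt->format(array('biscuits', 'marrow')) . "\n";

// formatMessage() takes the locale, the format, and the values together
print MessageFormatter::formatMessage('en_GB', $format,
                                      array('chips', 'crisps')) . "\n";
```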

In Example 20-6, we had to do most of the work to figure out the right en_GB strings to use, so MessageFormatter didn’t add much. It really helps, though, when you need locale-specific formatting of numbers and other data. Example 20-7 shows how MessageFormatter can properly handle numbers and money amounts in different locales.

Example 20-7. Formatting numbers in a message
$msg = "The cost is {0,number,currency}.";

$fmtUS = new MessageFormatter('en_US', $msg);
$fmtGB = new MessageFormatter('en_GB', $msg);

print $fmtUS->format(array(4.21)) . "\n";
print $fmtGB->format(array(4.21)) . "\n";

Example 20-7 prints:

The cost is $4.21.
The cost is £4.21.

MessageFormatter relies on the powerful ICU library. To produce proper output, it draws on ICU’s internal database of currency symbols, number formats, and other rules about how different places and languages organize information.
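Currency is only one of the number styles available. As a further sketch, the integer style below shows how digit grouping changes from locale to locale. The message text and the choice of locales are assumptions made for this illustration, and the exact separators depend on the version of ICU installed.

```php
<?php
$msg = "The population is {0,number,integer}.";

$locales = array('en_US', 'de_DE', 'fr_FR');
foreach ($locales as $locale) {
    // Same message format and value, different locale rules
    $fmt = new MessageFormatter($locale, $msg);
    print "$locale: " . $fmt->format(array(1234567)) . "\n";
}
```

You should see commas grouping the digits for en_US and periods for de_DE. The fr_FR output typically uses a kind of space.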

The MessageFormatter class can do lots more than what’s described here, such as format text properly for singular and plural, handle languages where the gender of a word affects how it’s written, and format dates and times. If you want to learn more, check out the ICU User Guide to Formatting and Parsing.
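As a taste of the singular and plural handling, here is a minimal sketch using ICU’s plural syntax. The message itself is invented for the example. Inside the plural block, one and other name the plural categories that English uses, and # stands for the number being formatted.

```php
<?php
// Pick a sub-message based on the number; # is replaced by the number
$msg = "I ate {0,plural,one{# cookie}other{# cookies}}.";
$fmt = new MessageFormatter('en_US', $msg);

print $fmt->format(array(1)) . "\n";    // I ate 1 cookie.
print $fmt->format(array(12)) . "\n";   // I ate 12 cookies.
```

Other languages have different plural categories, which is why the rule lives in the message format for each locale rather than in your PHP code.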

Chapter Summary

This chapter covered:

  • Understanding why some characters need more than one byte to represent them
  • Measuring string length in characters instead of bytes
  • Extracting substrings by character position
  • Safely changing the case of characters
  • Sorting text in a locale-aware manner
  • Comparing strings in a locale-aware manner
  • Localizing output for different locales