Thursday, May 5, 2011

Character Encoding

Character encoding is something that's important but can be very confusing to people. Let's go back to the beginning... Originally, nearly all character encoding was done in ASCII: 7 bits to represent 128 characters. This was fine until software became more popular and demand to support more and more international languages grew. Cue Unicode to solve this problem. The latest version (version 6.0) of Unicode includes support for over 109,000 different characters! Wow! But here's where it gets confusing: Unicode can be implemented in different character encoding mechanisms, for example UTF-8, UTF-16 and UTF-32.

So what's the difference? Well, UTF-8 uses one byte per character by default, unless it has to use more to encode a character. UTF-16 uses 2 (unless it has to use more) and UTF-32 always uses 4 bytes.
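You can see those widths directly from Java. A quick sketch (the class name is just for illustration) encoding a single ASCII character with each scheme:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingWidths {
    public static void main(String[] args) {
        String s = "A"; // a plain ASCII character
        // UTF-8: one byte is enough
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);      // 1
        // UTF-16 (big-endian, no byte-order mark): two bytes
        System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length);   // 2
        // UTF-32: always four bytes per character
        System.out.println(s.getBytes(Charset.forName("UTF-32BE")).length); // 4
    }
}
```

Note the big-endian variants (UTF_16BE, UTF-32BE) are used so the counts aren't inflated by a byte-order mark.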

Ok so you're thinking, what's this all about? Well, UTF-8 is trying to be smart: use as few bytes as possible and thereby minimise footprint. This makes it beneficial to things like web pages, which have a lot of simple text in markup. UTF-8 also represents ASCII characters exactly as they are. Recall ASCII uses 7 bits per character. UTF-8 maps those characters exactly how they appear in ASCII and then uses the leftover bit (remember, 8 bits in a byte :-)) to indicate whether this character is part of a multi-byte sequence or just a single byte. Cool.
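The above is easy to verify: a minimal sketch showing that 'A' comes out as its plain ASCII value with the high bit clear, while a euro sign becomes a multi-byte sequence flagged by the leading bits of each byte:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Bytes {
    public static void main(String[] args) {
        // 'A' encodes to the single byte 65 -- exactly its ASCII value,
        // with the high bit clear to mark a one-byte character
        byte[] a = "A".getBytes(StandardCharsets.UTF_8);
        System.out.println(a.length + " byte, value " + a[0]); // 1 byte, value 65

        // The euro sign (U+20AC) needs three bytes; the leading bits
        // (1110..., then 10... continuation bytes) flag the multi-byte sequence
        byte[] euro = "\u20AC".getBytes(StandardCharsets.UTF_8);
        for (byte b : euro) {
            System.out.println(Integer.toBinaryString(b & 0xFF));
        }
        // prints 11100010, 10000010, 10101100
    }
}
```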

This is all very nice until you get to a situation where you need to support a more complex range of characters. Suppose your architecture has a lot of text in an Asian language, for example - none of which are ASCII characters. This can still be supported by UTF-8, but UTF-8 may end up using 3 bytes to encode a character in cases where UTF-16 will do it in 2. In this case, UTF-16 will give you a smaller footprint.
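Here's a quick demonstration of that footprint difference, using two CJK characters (U+6F22 U+5B57, the word "kanji") as an illustrative sample:

```java
import java.nio.charset.StandardCharsets;

public class CjkFootprint {
    public static void main(String[] args) {
        // Two CJK characters -- an illustrative sample of non-ASCII text
        String text = "\u6F22\u5B57";
        // UTF-8 spends three bytes on each character in this range...
        System.out.println(text.getBytes(StandardCharsets.UTF_8).length);    // 6
        // ...while UTF-16 (big-endian, no byte-order mark) needs only two
        System.out.println(text.getBytes(StandardCharsets.UTF_16BE).length); // 4
    }
}
```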

So when to use UTF-32? Well, UTF-16 can be variable as well: each character is either 2 or 4 bytes, but usually 2, for similar reasons to why UTF-8 is variable. UTF-32 is never variable and is always 4 bytes per character. This means when you get your file size you can work out exactly how many characters there are, or when you know the number of characters you can work out exactly what size your file is. This characteristic might suit some applications.
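A small sketch of that size-to-character-count arithmetic, mixing an ASCII, a euro and a CJK character so the fixed width is obvious:

```java
import java.nio.charset.Charset;

public class Utf32Size {
    public static void main(String[] args) {
        Charset utf32 = Charset.forName("UTF-32BE"); // always 4 bytes per character
        String text = "A\u20AC\u6F22"; // 3 characters: ASCII, euro, CJK
        int size = text.getBytes(utf32).length;
        System.out.println(size);     // 12 bytes
        System.out.println(size / 4); // 3 -- character count recovered from size alone
    }
}
```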

Ok, so you're back in Java land putting together your web architecture. Which character encoding should you use? Well, you gotta use Unicode or else you'll run into problems when something as simple as a euro symbol comes along. You should use UTF-8 by default as this will give you the smallest footprint, unless you are supporting a huge number of unusual characters, which is unlikely. Finally, you gotta be consistent. You gotta make everything UTF-8. This includes:

  • All HTML pages

  • All JSPs and JavaScript files

    <%@ page pageEncoding="UTF-8" contentType="text/html; charset=UTF-8" %>


  • Your JVM (i.e. your app / web server)
    Check with your vendor
  • Your database.
    Check with your vendor
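The same consistency rule applies to any Java I/O your architecture does. A minimal sketch (file name and content are just for illustration): name the charset explicitly on both the write and the read, because the no-argument stream overloads fall back to the platform default, which may not be UTF-8.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class ExplicitUtf8 {
    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("encoding-demo", ".txt");
        // Write with an explicit UTF-8 charset, never the platform default
        try (Writer out = new OutputStreamWriter(new FileOutputStream(f), StandardCharsets.UTF_8)) {
            out.write("price: \u20AC100"); // the euro sign that breaks non-Unicode setups
        }
        // Read it back with the same explicit charset
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(f), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine()); // price: €100
        }
        f.delete();
    }
}
```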

Otherwise you may see a euro symbol on your web page but you'll store it as gobbledygook in your DB.

Have fun.
