The Ironism

The lair of Lars J. Nilsson. Contains random musings on beer, writing and this thing we call life.

January 2007


Musing on Java characters


I’ve been running across a certain problem among Java programmers for a long time now, and today I saw it again. Enough. Consider the following made-up statement about a fictional factory class:

This factory constructs objects from a Reader. But I need to use it from an InputStream. Why isn’t there a kind of ReaderInputStream, or why can’t the factory just overload its parse methods to take a raw InputStream as well?

Obviously I’m generalising here, but behind statements like this often lurks a common problem: characters are not the same as bytes.

Think of a character as a logical thing, not a binary one. There is no single fixed binary form for a given character; there are many. Consider a word in a spoken language: you can change language and the meaning of the word stays the same. The logical “word” is unchanged although you just switched from, say, English “car” to Swedish “bil”. The letters are different, but the logical meaning of the word stays the same.

And so it is with Java characters: although every character has one logical meaning, you can express it in bytes in many ways. And just as the languages above have names, “English” and “Swedish”, the different ways of representing a character have names too. They are called character encodings. For example, inside the Java virtual machine every char is a 2-byte UTF-16 code unit, UTF-16 being one of the encodings of Unicode (a few characters need two such units). I’m sure you recognise the name “Unicode”. You have probably also seen names like “us-ascii” and “iso-8859-1”.
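To make that concrete, here is a small sketch of my own that encodes one logical character, the Swedish letter å, in three encodings. The class and method names are made up for illustration, and StandardCharsets assumes Java 7 or later.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class EncodingDemo {
    // Encode the same string three ways and report how many bytes each takes.
    static int[] byteCounts(String s) {
        return new int[] {
            s.getBytes(StandardCharsets.ISO_8859_1).length,
            s.getBytes(StandardCharsets.UTF_8).length,
            s.getBytes(StandardCharsets.UTF_16BE).length
        };
    }

    public static void main(String[] args) {
        String s = "\u00e5"; // the letter å: one logical character

        // Same character, three different binary forms.
        System.out.println(Arrays.toString(s.getBytes(StandardCharsets.ISO_8859_1))); // [-27]
        System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_8)));      // [-61, -91]
        System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_16BE)));   // [0, -27]
    }
}
```

Java bytes are signed, which is why the values print as negative numbers; -27 is 0xE5.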

From this it follows that if you have a binary source and want to read it as characters, you must first know what character encoding the binary source is in. If you don’t know that, it is like reading a word from a book without knowing what language to interpret it in. For example: should you interpret the word “gun” as a kind of pistol or as a woman’s first name? If it is written in Swedish it is a woman’s name, but in English it is a pistol.
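The same thing in code, as a rough sketch with made-up names: two bytes, read once as UTF-8 and once as ISO-8859-1, give two different texts.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodingDemo {
    // Interpret raw bytes as characters using the given encoding.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        byte[] raw = { (byte) 0xC3, (byte) 0xA5 };

        String asUtf8 = decode(raw, StandardCharsets.UTF_8);        // the single letter å
        String asLatin1 = decode(raw, StandardCharsets.ISO_8859_1); // two characters, Ã and ¥

        System.out.println(asUtf8.length());   // 1
        System.out.println(asLatin1.length()); // 2
    }
}
```

The bytes never changed. Only the “language” I chose to read them in did.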

A side note here: it is originally the same word. The Vikings brought it to England, but in Old Norse it meant “war”. So a woman in Sweden named “Gun” is actually named “Warrior”.

So back to the fictional factory above. You can’t just overload a parse method that takes a Reader with one that takes only an InputStream, unless you also tell the factory what character encoding it should use. For example:
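Here is a minimal sketch of such an overload. The names are mine and hypothetical: ThingFactory stands in for the fictional factory, and the “object” it builds is simply the text it read. The only point is that the InputStream version has to take a Charset as well.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical factory, for illustration only.
class ThingFactory {
    // The existing character-based entry point.
    static String parse(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    // The overload: bytes plus the encoding needed to turn them into characters.
    static String parse(InputStream in, Charset charset) throws IOException {
        return parse(new InputStreamReader(in, charset));
    }

    public static void main(String[] args) throws IOException {
        byte[] raw = "bil".getBytes(StandardCharsets.UTF_8);
        System.out.println(parse(new ByteArrayInputStream(raw), StandardCharsets.UTF_8)); // bil
    }
}
```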

But then of course, you could easily do this instead:
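That is, leave the factory alone and wrap the stream yourself. A sketch, where parse is a made-up stand-in for the factory’s Reader-based method:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

class CallerSideDemo {
    // Stand-in for the fictional factory's Reader-based parse method.
    static String parse(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) {
            sb.append((char) c);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream("bil".getBytes(StandardCharsets.US_ASCII));

        // The caller names the encoding at the moment bytes become characters.
        Reader reader = new InputStreamReader(in, StandardCharsets.US_ASCII);

        System.out.println(parse(reader)); // bil
    }
}
```

InputStreamReader is exactly the bridge from bytes to characters, and it is where the encoding gets stated.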

And conversely, when you write characters to a binary destination, you can’t just write them: first you have to know what binary form to use, i.e. what character encoding.
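The writing direction looks much the same. In this sketch (made-up names again, and try-with-resources assumes Java 7 or later) an OutputStreamWriter does the encoding, and the same text comes out as different bytes depending on the encoding chosen.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class WritingDemo {
    // Write text to an in-memory byte sink using an explicit encoding.
    static byte[] write(String text, Charset charset) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        } // closing the writer flushes the encoded bytes into out
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String text = "\u00e5"; // the letter å
        System.out.println(write(text, StandardCharsets.ISO_8859_1).length); // 1
        System.out.println(write(text, StandardCharsets.UTF_8).length);      // 2
    }
}
```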

And so it goes. There are many similar problems you can find when programming. For example, one programmer I worked with wanted to convert a String to HEX and back again… See the problem? HEX is a representation of binary data, but the characters in a String are logical entities. First you have to turn the string into bytes using a character encoding, and then you can transform the bytes into HEX. It is a two-step process, not one.

Had enough? Then repeat after me: characters are not the same as bytes!
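For the record, the String-to-HEX round trip as a two-step sketch. HexDemo, toHex and fromHex are names I made up; the thing to notice is that both directions need a Charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class HexDemo {
    // Step 1: characters to bytes (needs an encoding). Step 2: bytes to hex digits.
    static String toHex(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // The reverse: hex digits to bytes, then bytes to characters (same encoding).
    // Assumes an even number of valid hex digits.
    static String fromHex(String hex, Charset charset) {
        byte[] bytes = new byte[hex.length() / 2];
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        System.out.println(toHex("\u00e5", StandardCharsets.UTF_8));      // c3a5
        System.out.println(toHex("\u00e5", StandardCharsets.ISO_8859_1)); // e5
        System.out.println(fromHex("62696c", StandardCharsets.UTF_8));    // bil
    }
}
```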

The proprietor of this blog. Lunchtime poet, former opera singer, computer programmer. But not always in that order. Ask me again tomorrow.
