Originally Posted by javasymbrew
String utfstr = new String(bytearr,"UTF-8");
That second line will behave the same on all devices that support UTF-8 as an encoding. That should include all MIDP-2 devices. (UTF-8 was not a requirement for MIDP-1.)
The first line, however:
byte[] bytearr = somestr.getBytes();
will behave differently on different devices. This is because no encoding has been specified, and so the platform default encoding will be used. On MIDP, this encoding can be read from the "microedition.encoding" system property.
Remember that this encoding is the default encoding. It is not necessarily the only encoding supported by the device. One device might support several different encodings for converting between characters and bytes.
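For example (a small sketch in standard Java SE; the main-method harness is SE-style, but the String methods used here exist in MIDP too, and which encodings a given device actually supports will vary):

```java
// Encoding the same String with two different encodings gives
// different byte sequences - the encoding is a choice, not a
// property of the String itself.
import java.io.UnsupportedEncodingException;

public class EncodingDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String s = "caf\u00e9";                   // "café" - the é is outside ASCII

        byte[] latin1 = s.getBytes("ISO-8859-1"); // one byte per character
        byte[] utf8   = s.getBytes("UTF-8");      // the é becomes two bytes

        System.out.println(latin1.length);        // 4
        System.out.println(utf8.length);          // 5

        byte[] def = s.getBytes();                // platform default - varies by device
        System.out.println(def.length);
    }
}
```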
Originally Posted by javasymbrew
Yes, it must support the characters. By that, I mean it must have a glyph in its font corresponding to those characters.
But: this has nothing to do with the encoding. A device can use ISO-8859-1 as its default encoding, and still be able to display Russian, Greek and Arabic.
Within Java, characters ("char") are 16-bit. They are Unicode characters. Fonts are Unicode fonts. (However, just because they are Unicode fonts does not mean they will contain every character. Typically, Chinese, Japanese and Korean characters are left out, to save space - except, obviously, on phones sold in those countries.) Just as in Windows, any character that is not present in the font will display as a small square (or similar, depending on the device).
So long as you are working with characters, there is no "encoding". All characters are Unicode.
Encodings only appear when you want to convert characters into bytes, or bytes into characters. When you do this, you are squeezing 16-bit data into 8-bit units (or the reverse). There are many different ways to represent characters in 8-bit units. The "encoding" tells the Java API which representation to use.
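To see why the choice matters, here is a sketch in standard Java (the same String constructor is in the MIDP API) showing what happens when the bytes are decoded with a different encoding than they were encoded with:

```java
// Encode with UTF-8, then decode with the wrong encoding:
// the same bytes turn into different (garbage) characters.
import java.io.UnsupportedEncodingException;

public class EncodingMismatch {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String original = "\u00e9";                // é, a single char

        byte[] utf8 = original.getBytes("UTF-8");  // two bytes: 0xC3 0xA9

        // Decode with the right encoding: we get the é back.
        String good = new String(utf8, "UTF-8");
        System.out.println(good.equals(original)); // true

        // Decode the same bytes as ISO-8859-1: two garbage chars.
        String bad = new String(utf8, "ISO-8859-1");
        System.out.println(bad);                   // "Ã©"
        System.out.println(bad.length());          // 2
    }
}
```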
There are several different techniques for converting characters to bytes:
1. One byte per character. Each character is encoded as a single byte. Since a char is 16 bits (with 65536 different values), and a byte is 8 bits (with 256 values), not all Unicode characters can be converted. We must choose which 256 characters we are going to use. Any character in the String that cannot be encoded in the set we choose, will be lost (or will become "?" or something).
A look-up table like this is used to convert the characters into bytes (or the bytes into characters, if we're converting the other way). Note that this table is just a piece of data in the Java API. It is not part of the font.
2. Two bytes per character. Since a char is 16 bits, we can just split it into two bytes. We can order the bytes most-significant first, or least-significant first (commonly known as "big-endian" or "little-endian"). This format represents all Unicode characters. It is often hard to view on non-Unicode systems, giving a characteristic "double-spaced" look.
3. Variable bytes per character. Some encodings use a different number of bytes to encode different characters. Typically, ASCII characters (0 - 127) are encoded in a single byte, while other characters require two or three. UTF-8 is an example of this, and allows all 65536 characters to be encoded. Chinese- and Japanese-specific encodings also often follow this pattern (but without encoding all Unicode characters). Examples are GB2312 and Big5.
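The three schemes can be compared directly. A sketch in standard Java (this assumes the platform supports the ISO-8859-1, UTF-16BE and UTF-8 encodings, which a given MIDP device may or may not):

```java
// One String, three encoding schemes, three different byte counts.
import java.io.UnsupportedEncodingException;

public class EncodingSchemes {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String s = "A\u00e9";   // 'A' (ASCII) and 'é' (non-ASCII, but in Latin-1)

        // 1. One byte per character: both chars are in ISO-8859-1's 256-entry table.
        System.out.println(s.getBytes("ISO-8859-1").length);  // 2

        // 2. Two bytes per character: each 16-bit char is split in two,
        //    most-significant byte first ("big-endian").
        System.out.println(s.getBytes("UTF-16BE").length);    // 4

        // 3. Variable bytes per character: 'A' takes one byte,
        //    'é' takes two under UTF-8.
        System.out.println(s.getBytes("UTF-8").length);       // 3
    }
}
```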
The encoding is simply a choice of which of these schemes to use. Usually, chars are converted to bytes in order to exchange data with some other system (like a server), and the choice of encoding will depend on whatever is at the other end of the connection.
The "platform default encoding" is just the encoding that will be used automatically if you don't specify one.
Fonts and encodings are completely unrelated.