Solving character encoding problems

Background

Getting computers to present text correctly has been a challenge since the early days of computing. If you're an English speaker you've probably never reflected on this, but the rest of us, who have so-called "foreign characters" like å ä ö in our languages, have seen a never-ending line of misinterpretations of them such as {, ?, Σ and lately Ã¥.

In this tutorial we'll explain why this happens and how you can adjust jAlbum to get your text presented right.

The basic problem is that computers actually don't deal natively with text. Computers deal with numbers, the numbers 0 and 1 to be specific. These numbers, named "bits", are handled in groups of 8 called a "byte". Each byte can represent a number in the range 0-255 where 0 is coded as 00000000 (8 zeroes) and 255 coded as 11111111 (8 ones). To represent larger numbers, bytes are combined. By combining two bytes (16 bits) one can represent numbers in the range 0-65535 and combining 4 bytes (32 bits) gives a number range of 0-4294967295. So far so good.
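The ranges above follow directly from the number of bits available. As a minimal sketch (the helper name max_unsigned is made up for this illustration), the arithmetic looks like this in Python:

```python
def max_unsigned(num_bytes: int) -> int:
    """Largest number that fits in num_bytes bytes (8 bits each)."""
    return 2 ** (8 * num_bytes) - 1

for n in (1, 2, 4):
    print(n, "byte(s): 0 to", max_unsigned(n))

# The same values written out as bits: 8 zeroes and 8 ones.
print(format(0, "08b"), format(255, "08b"))
```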


Problem

Computers store text as a sequence of numbers where each character has a unique number according to an agreed upon "character encoding standard". The problem is that there are many standards and different standards can assign different numbers to the same character. "ä" is for instance stored as 228 in the popular ISO-8859-1 standard but stored as the two-byte number 50084 in the UTF-8 standard (written out as c3a4 in hexadecimal notation). If a UTF-8 encoded "ä" is interpreted according to the ISO-8859-1 standard, each of the two bytes is read as a character of its own and it shows up as the character pair "Ã¤".
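If you want to see this for yourself, the following short Python sketch reproduces the numbers mentioned above and the misinterpretation. It only uses Python's built-in encode and decode; the variable names are just for illustration.

```python
text = "ä"

latin1_bytes = text.encode("iso-8859-1")  # one byte
utf8_bytes = text.encode("utf-8")         # two bytes

print(list(latin1_bytes))                  # [228]
print(utf8_bytes.hex())                    # c3a4
print(int.from_bytes(utf8_bytes, "big"))   # 50084

# Reading UTF-8 bytes as if they were ISO-8859-1 gives two wrong characters.
garbled = utf8_bytes.decode("iso-8859-1")
print(garbled)                             # Ã¤

# As long as no bytes were lost, the mistake can be reversed.
repaired = garbled.encode("iso-8859-1").decode("utf-8")
print(repaired)                            # ä
```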


Solutions

This problem can be handled in two ways:

  • The world agrees to one single standard
  • Before sending text to another computer (actually numbers, remember?), the sending computer first tells what standard is being used

The first approach has historically been impossible to pursue as each standard has favored one region of the world over another (Supporting Swedish, but not Russian and vice versa for instance). Today there is a global standard called "Unicode/UTF-8" that handles any possible character (and symbol) in any language. I'd honestly say that the only problem with Unicode/UTF-8 is that it's not used everywhere.

On the Internet, the web and email use the second approach: before a text document is sent, the sending computer states which character encoding standard is used (this statement is itself written using the ASCII standard, by the way). Web servers use so-called "HTTP headers" for this, and the value is usually a web server setting. Problems occur if the document text is written using one standard but the HTTP header states another. To demystify HTTP headers, here's an example of a reply from a web server where the standard to use is ISO-8859-1:

HTTP/1.1 200 OK
Date: Mon, 19 Jan 2015 11:00:55 GMT
Server: Apache
Last-Modified: Sun, 18 Jan 2015 18:26:22 GMT
Content-Length: 4259
Content-Type: text/html; charset=ISO-8859-1


Note: Use your browser to determine the content type: point the browser to an HTML page on your server, open the developer tools (F12) and look for "charset" in the Content-Type response header.
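The charset in a Content-Type header like the one in the example reply can also be read programmatically. Here is a minimal sketch using Python's standard email.message module, which understands this header syntax. The function name charset_from_content_type is made up for this example, and the header value is simply the one from the reply above.

```python
from email.message import Message

def charset_from_content_type(header_value: str):
    """Return the charset named in a Content-Type value, or None if absent."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset in lower case.
    return msg.get_content_charset()

print(charset_from_content_type("text/html; charset=ISO-8859-1"))  # iso-8859-1
print(charset_from_content_type("text/html"))                      # None
```

A result of None means the server doesn't state any encoding at all, in which case the receiving program has to guess.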

Note: If you inspect the source code of an HTML document you may also see that the character set used is stated in a so-called "meta tag". Browsers give the HTTP header precedence over the meta tag when the two disagree, so don't be confused by this. Ensure that the encoding standard of the web server matches the encoding used in your documents and you'll be fine.

Configuring jAlbum

jAlbum defaults to using the Unicode/UTF-8 standard. If your web server is configured to use another standard, either change your web server to Unicode/UTF-8 and update the existing documents, or change jAlbum to match the standard used by your web server. We recommend moving to Unicode/UTF-8 as it doesn't discriminate against any language, and its use is growing every year. If you decide to change jAlbum's encoding instead, just go to Settings->Advanced->General, untick "Write UTF-8" and adjust the "Encoding" setting to match that of your web server. When done, make and upload the pages again.

FTP and broken links

So far we've talked about getting text to show up correctly in web pages, but these encoding issues can also cause trouble when clicking links to get from one web page to another if the link or target web page contains foreign characters in its file name. The easiest workaround is to stay away from foreign characters in file and folder names, but if you want to use them, ensure that the server you upload to interprets the file names correctly when you transfer the files. Most FTP servers correctly tell FTP clients whether they expect file names to be encoded as Unicode/UTF-8 or not, but some FTP servers report that they don't support Unicode/UTF-8 even though they expect file names to be encoded like that. To solve that problem, jAlbum has a "Force UTF-8" setting that can be checked under Upload/Manage->Advanced.
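To illustrate why such links break, here is a small Python sketch using the standard urllib.parse module. The file name is a made-up example. In a link, bytes outside ASCII are written as %XX, so the same name produces two different links depending on the encoding, and a server that assumes the other encoding will look for a file that doesn't exist.

```python
from urllib.parse import quote, unquote

name = "räksmörgås.jpg"  # hypothetical file name

as_utf8 = quote(name, encoding="utf-8")
as_latin1 = quote(name, encoding="iso-8859-1")

print(as_utf8)    # r%C3%A4ksm%C3%B6rg%C3%A5s.jpg
print(as_latin1)  # r%E4ksm%F6rg%E5s.jpg

# A UTF-8 link read as ISO-8859-1 no longer names the original file.
print(unquote(as_utf8, encoding="iso-8859-1"))
```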

Special characters

When naming files for use over the web, also take care never to use characters that are reserved for special purposes. These include #$%&*"\/:;?=|. If you for instance use a slash "/" or hash "#" in a file name, you will break links to such files, as the slash is used to separate files and folders and the hash is used to point out sections of a file. By default, jAlbum prevents you from using any special characters in file names.
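If you prepare file names outside jAlbum, a check along these lines can catch problem names before upload. This is only a sketch based on the character list above, not jAlbum's own check; the name RESERVED and the function reserved_chars_in are invented for the example.

```python
# The reserved characters listed above.
RESERVED = set('#$%&*"\\/:;?=|')

def reserved_chars_in(file_name: str) -> list:
    """Return the reserved characters found in file_name, sorted."""
    return sorted(set(file_name) & RESERVED)

print(reserved_chars_in("beach.jpg"))       # []
print(reserved_chars_in("summer#1.jpg"))    # ['#']
```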

Unicode and UTF-8

Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. UTF-8 is a so-called "implementation of Unicode" and the best one to use, I'd say. There are also other Unicode implementations like UTF-16 and UTF-32. These implementations are just different ways to store, at a binary level, the unique number each character has been assigned by the Unicode standard. UTF-16 uses 16 bits for each of the first 65536 characters, which cover most text in use today, and a pair of 16-bit units (32 bits, a so-called "surrogate pair") for every character beyond that. UTF-32 always uses 32 bits per character, which is far more than needed for the 1114112 numbers the Unicode standard allows.

Perhaps this sounds great, but there are four downsides to using UTF-16 and UTF-32:

  • As each character consumes 2 or 4 bytes, ordinary English text consumes 2-4 times the space compared to older standards
  • It's not backwards-compatible with the old and widespread ASCII standard
  • UTF-16 comes in two variants depending on the order of the bytes in each pair (and UTF-32 likewise for its groups of four), which causes misinterpretation when the reader assumes the wrong order
  • Text in UTF-16 and UTF-32 can be misinterpreted if the interpreting computer is out of sync when reading groups of 2 or 4 bytes at a time (reading byte pairs AABBCC like AB BC etc instead of AA BB CC)
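The size and byte order points in the list above can be checked with a few lines of Python. This is a small sketch using Python's built-in codecs; "utf-16-le" and "utf-16-be" are the codec names for the two byte orders, and the sample characters are arbitrary.

```python
# Bytes needed per character in each encoding.
for ch in ("a", "ä", "€"):
    sizes = {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}
    print(ch, sizes)

# The two UTF-16 variants store the same character with the bytes swapped.
little = "ä".encode("utf-16-le")
big = "ä".encode("utf-16-be")
print(little.hex(), big.hex())  # e400 00e4

# Reading with the wrong byte order yields a different character.
wrong = little.decode("utf-16-be")
print(wrong == "ä")             # False
```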

UTF-8 solves all these issues! It's a variable length system that's backwards compatible with ASCII. This means that ordinary English text is stored exactly as in the ASCII standard. This also makes UTF-8 compact for most text. Foreign characters consume 2-4 bytes depending on the character to encode. UTF-8 uses a clever binary scheme in which the first byte of a character can always be told apart from the bytes that follow it, so a receiving computer that starts reading in the middle of a character can skip ahead to the start of the next one instead of staying out of sync. Finally, this scheme also makes it possible for computers to identify UTF-8 encoded text automatically with good reliability.
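Here is a rough Python sketch of these properties. The helper names (is_continuation, resync, looks_like_utf8) are invented for the illustration. It relies on one fact about the UTF-8 scheme: every byte that is not the first byte of a character has the bit pattern 10xxxxxx.

```python
def is_continuation(byte: int) -> bool:
    """True for UTF-8 bytes of the form 10xxxxxx (never the start of a character)."""
    return (byte & 0b11000000) == 0b10000000

def resync(data: bytes, start: int) -> int:
    """Move forward from start to the next byte that begins a character."""
    while start < len(data) and is_continuation(data[start]):
        start += 1
    return start

def looks_like_utf8(data: bytes) -> bool:
    """Check whether data is valid UTF-8 by trying to decode it."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# ASCII text is byte-for-byte identical in UTF-8.
print("hello".encode("ascii") == "hello".encode("utf-8"))  # True

# Starting in the middle of "ä" (index 2), the reader skips ahead to "b".
data = "aäb".encode("utf-8")
print(data[resync(data, 2):].decode("utf-8"))              # b

# An ISO-8859-1 encoded "ä" is not valid UTF-8, so it can be told apart.
print(looks_like_utf8("ä".encode("utf-8")))                # True
print(looks_like_utf8("ä".encode("iso-8859-1")))           # False
```

Note that the detection only works one way: pure ASCII text passes the check too, and a successful decode shows that the bytes are valid UTF-8, not that the author intended them as such.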