rwebdb 07-May-2010

I thought I understood the basics of character encodings and why Unicode is important for the web (here’s a nice explanation). But I was wrong. Very, very WRONG.

This was made evident repeatedly in the past few days as I explored more of the IPEDS CSV data. Some of the data got bounced by the Perl program I wrote to convert CSV into XML. It was easy enough to identify and examine the offending data rows. All of them involved “special” characters.

I naively assumed the CSV would be encoded in UTF-8. Poor assumption. My next guess was Latin-1 (specifically, ISO-8859-1). This almost worked, except for a few characters like the “right single quotation mark” and the “left double quotation mark.” A bit of sleuthing led to Microsoft Word “smart quotes” and the Windows-1252 (CP1252) character encoding, which puts those punctuation marks at byte values that ISO-8859-1 reserves for control codes. Revising the Perl program to read the CSV data file as CP1252 fixed the problem.
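The trial and error above can be sketched in a few lines. This is a Python illustration, not the Perl program mentioned in this post; the sample bytes and the `decode_field` helper are made up for the example, and it assumes the only two candidate encodings are UTF-8 and CP1252.

```python
# Hypothetical sample: a CSV field with Word "smart quotes",
# stored as the single bytes CP1252 uses for them (0x93, 0x92, 0x94).
raw = b"\x93Provost\x92s Office\x94"


def decode_field(data):
    """Try strict UTF-8 first; fall back to CP1252 if the bytes are not valid UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("cp1252")


# Latin-1 never raises, because every byte maps to some code point,
# but 0x80-0x9F come out as invisible C1 control characters.
as_latin1 = raw.decode("latin-1")
as_cp1252 = decode_field(raw)

print([hex(ord(ch)) for ch in as_latin1 if ord(ch) > 0x7F])
print(as_cp1252)
```

Trying UTF-8 first is a reasonable order because text in the other encodings that contains bytes above 0x7F is rarely valid UTF-8 by accident. Note that a few byte values (0x81, for example) are undefined in CP1252, so the fallback can still raise on truly foreign data.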

The problems I encountered were not all that difficult in the grand scheme of character encoding issues. I just needed to do a heap of learning before I understood them.

Here are several links to things I found useful:
1. Related to HTML5, Mark Pilgrim asks: “how does your browser actually determine the character encoding of the stream of bytes that a web server sends?”
2. Alan Wood provides an extensive multilingual series of web pages for testing which Unicode characters are (or are not) displayed, given your choice of browser, browser options, and fonts.
3. Joel Spolsky on The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
4. The Unicode Consortium provides considerable online data and resources about Unicode. In particular, check out the link called Cross Mapping Tables. Comparing the ISO-8859-1 and the CP1252 encodings for a few characters especially helped me.
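The ISO-8859-1 versus CP1252 comparison from the last link can also be reproduced locally. The following is a rough Python sketch that relies on Python's built-in codecs rather than the Unicode Consortium's mapping files; the `compare` helper and the choice of sample bytes are my own.

```python
import unicodedata


def compare(byte):
    """Decode one byte value under both encodings and return the two characters."""
    raw = bytes([byte])
    return raw.decode("latin-1"), raw.decode("cp1252")


def describe(ch):
    """Code point plus Unicode name; control characters have no name."""
    return "U+%04X %s" % (ord(ch), unicodedata.name(ch, "<control>"))


# 0x91-0x94 are the smart-quote bytes; 0xE9 is included as a byte
# on which the two encodings agree.
for byte in (0x91, 0x92, 0x93, 0x94, 0xE9):
    latin1, cp1252 = compare(byte)
    print("0x%02X  latin-1: %-42s cp1252: %s" % (byte, describe(latin1), describe(cp1252)))
```

The two encodings only disagree in the range 0x80 to 0x9F, which is why data that is mostly plain ASCII plus a few accented letters can look fine under either one until a smart quote shows up.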