ASCII, UTF-8, and Latin-1 Coding Nuances

I got thrown off quite a few times when there are textual fields covering names of people or companies across the globe. The error message I frequently received is:

‘ascii’ codec can’t encode character u’\u2019′ in position 16: ordinal not in range(128)

Various encoding system applied causes the issues, hence it’s needed to dig deeper into the mechanism behind to solve this problem.

I am referencing a great explanation from the stackoverflow founder Joel Spolsky, writing his post of this knowledge., where he quipped that “plain text = ascii = characters are 8 bits” is not only wrong, it’s hopelessly wrong”, it’s time, “it does not make sense to have a string without knowing what encoding it uses. ”

We all know some basic computer architectural theory that the computer can only process binary input 1 or 0, representing switch on or off status. In an 8 binary system, a letter A can be mapped to  0100 0001. Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world’s writing systems, the latest version of Unicode contains a repertoire of over 136,000 characters covering 139 modern and historic scripts, as well as multiple symbol sets. The Unicode standard defines UTF-8UTF-16, and UTF-32, and several other encodings are in use. The most commonly used encodings are UTF-8, UTF-16 and UCS-2, a precursor of UTF-16. UTF-8, dominantly used by websites (over 90%), uses one byte for the first 128 code points, and up to 4 bytes for other characters. The first 128 Unicode code points are the ASCII characters, so an ASCII text is a UTF-8 text. (from wiki).

UTF-8 was another system for storing your string of Unicode code points, those magic U+ numbers, in memory using 8-bit bytes. In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes.

Unicode was meant to  – a brave effort – create a single character set that included every reasonable writing system on the planet. But in Unicode, the letter A is platonic ideal. Every platonic letter in every alphabet is assigned a magic number by the Unicode consortium which is written like this: U+0639. this magic number is called a code point. so a string “Hello”, in Unicode’s code points is U+0048 U+0065 U+006C U+006C U+006F.

When there’s no equivalent for the Unicode code point you’re trying to represent in the encoding system UTF-8, usually, you will get a little question mark: ? or, been thrown an error as displayed previously.

Some popular encodings of English text are Windows-1252 (the Windows 9x standard for Western European languages) and ISO-8859-1, aka Latin-1 (also useful for any Western European language). To address this issue or error, the following python lines will help,

ppu[‘Individual Name’] = ppu[‘Individual Name’].str.decode(‘latin-1’).str.encode(‘utf’)
ppu[‘title’] = ppu[‘title’].str.decode(‘latin-1’).str.encode(‘utf’)

import codecs
def changeencode(data, cols):
for col in cols:
data[col] = data[col].str.decode(‘iso-8859-1’).str.encode(‘utf-8’)
return data

#encoding error function
def changeencode(data, cols):
for col in cols:
data[col] = data[col].str.decode(‘latin-1’).str.encode(‘utf-8’)
return data

Note, on Windows, many editors assume the default ANSI encoding (CP1252 on US Windows) instead of UTF-8 if there is no byte order mark (BOM) at the start of the file.
pet_df.to_csv(‘checkingst.csv’, encoding=’utf-8-sig’)



Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.