Unicode character dump in Python

Sometimes you just need to see what characters are lurking inside a Unicode-encoded text file. Your garden-variety dump utility (like the venerable od on UNIX systems; Windows doesn’t seem to ship a standard equivalent) only shows you the raw bytes, so you have to head over to unicode.org to find out what they mean. But first you need to decode the UTF-8 to get the actual code points, or grok UTF-16 LE or BE, and so on. It’s fun, but it’s not for everyone.

The udump utility shows you a nice list of character names, together with their byte offsets in the file. Currently it only handles UTF-8, so each offset is calculated from the UTF-8 encoded lengths of the preceding characters.
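The offset arithmetic can be seen directly in Python; here is a standalone sketch (not part of udump) showing how many bytes each character of the sample file’s start occupies in UTF-8:

```python
# Each character's file offset advances by the length of its UTF-8
# encoding; ASCII characters take one byte, the euro sign takes three.
for ch in u"0 \u20ac":  # "0 €"
    print("U+%06X takes %d byte(s)" % (ord(ch), len(ch.encode("utf-8"))))
# U+000030 takes 1 byte(s)
# U+000020 takes 1 byte(s)
# U+0020AC takes 3 byte(s)
```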

Here is an example of the udump display:

$ python udump.py testfile2
Using Unicode 5.2.0 data
Read 15 characters
00000000: U+000030 DIGIT ZERO
00000001: U+000020 SPACE
00000002: U+0020AC EURO SIGN
00000005: U+00003A COLON
00000006: U+000020 SPACE
00000007: U+00006E LATIN SMALL LETTER N
00000008: U+00006F LATIN SMALL LETTER O
00000009: U+000074 LATIN SMALL LETTER T
0000000A: U+000020 SPACE
0000000B: U+000062 LATIN SMALL LETTER B
0000000C: U+000061 LATIN SMALL LETTER A
0000000D: U+000064 LATIN SMALL LETTER D
0000000E: U+000021 EXCLAMATION MARK
0000000F: U+000020 SPACE
00000010: U+00000A (unnamed character)

That also serves as a usage example. As you can see, udump is a Python script; you need Python 2.3 or later to use it.

The file in the example had this content:

0 €: not bad!

Here is the complete source code for udump:

import sys
import codecs
import unicodedata

# Decode the whole input file from UTF-8 into a Unicode string.
inputFile = codecs.open(sys.argv[1], "r", "utf-8")
fileData = inputFile.read()
inputFile.close()

print "Using Unicode %s data" % unicodedata.unidata_version
print "Read %d characters" % len(fileData)

# Walk the characters, printing each one's byte offset, code point and
# Unicode name, then advance the offset by the character's UTF-8 length.
offset = 0
for ch in fileData:
    utf8ch = ch.encode("utf-8")
    print "%08X: U+%06X %s" % (offset, ord(ch), unicodedata.name(ch, "(unnamed character)"))
    offset += len(utf8ch)

Currently udump only handles UTF-8, and does not know about surrogate pairs (characters beyond the Basic Multilingual Plane), because Python didn’t support them well back in 2006 (it might by now). Here is a list of improvement ideas that quickly come to mind:

  • Show names of control characters.
  • Support other Unicode encodings besides UTF-8.
  • Support surrogate characters too.
  • Better error handling.
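The first idea could be sketched with a small fallback table. Unicode assigns control characters no formal name, so unicodedata.name() returns the default for them; the alias strings below are my own choices, not anything udump ships with:

```python
import unicodedata

# Hand-made aliases for a few common control characters (an assumption,
# not an official Unicode lookup).
CONTROL_ALIASES = {
    0x09: "CHARACTER TABULATION (TAB)",
    0x0A: "LINE FEED (LF)",
    0x0D: "CARRIAGE RETURN (CR)",
}

def char_name(ch):
    # Prefer the official Unicode name; fall back to our alias table.
    name = unicodedata.name(ch, None)
    if name is None:
        name = CONTROL_ALIASES.get(ord(ch), "(unnamed character)")
    return name

print(char_name(u"\n"))  # LINE FEED (LF)
print(char_name(u"A"))   # LATIN CAPITAL LETTER A
```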

More on the subject matter:

Also, while looking for something else entirely I discovered John Walker’s unum utility. It is a handy Unicode and HTML entity lookup tool; highly recommended.

udump is provided with NO WARRANTY for any purpose whatsoever. Share and enjoy. Give feedback on Twitter if you care. This might end up on GitHub. Unicode is a trademark of Unicode, Inc.

(This article was originally published in a slightly different format on the author’s personal website in 2006.)

UPDATE 2013-02-21: udump just saved my day.

The following code was lifted from a PDF e-book:

if (sqlite3_prepare_v2(database, 
    sqlStatement, −1, &compiledStatement, NULL) == SQLITE_OK) {

The clang compiler reports: “Parse issue: Expected expression”.

After much head-scratching, the -1 part seemed to be the only remaining suspect.
Copy and paste it into a file and run udump on it:

% python udump.py minusone.txt
Using Unicode 5.2.0 data
Read 3 characters
00000000: U+002212 MINUS SIGN
00000003: U+000031 DIGIT ONE
00000004: U+00000A (unnamed character)

Cue minor epiphany.

Explanation: the publishing system for the e-book (or some other link in the production chain) transformed the plain ASCII hyphen-minus expected by the C compiler into a character that looks good in print but confuses the compiler. Even with a monospaced font in a programming editor you couldn’t really tell the two apart by visual inspection.
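The two characters are impossible to tell apart at a glance, but trivially distinguishable in code; a quick check along these lines makes the difference obvious:

```python
import unicodedata

# MINUS SIGN (from the e-book) next to the plain ASCII character
# that C actually expects.
for ch in u"\u2212-":
    print("U+%04X %s" % (ord(ch), unicodedata.name(ch)))
# U+2212 MINUS SIGN
# U+002D HYPHEN-MINUS
```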