Conifer Productions

From ideas to apps. From mobile to global.

Unicode character dump in Python

Sometimes you just need to see what characters are lurking inside a Unicode encoded text file. Your garden variety dump utility (like the venerable od in UNIX systems and the Windows standard hex dump (though I don’t think there is one) only shows you the plain bytes, so you have to head over to to find out what they mean. But first you need to decode UTF-8 to get the actual code points, or grok UTF-16 LE or BE, and so on. It’s fun, but it’s not for everyone.

The udump utility shows you a nice list of character names, together with their offsets in the file. Currently it only handles UTF-8, so the offset is calculated based on the UTF-8 length of the character.

Read more →