Unicode character dump in Python

Sometimes you just need to see what characters are lurking inside a Unicode-encoded text file. Your garden-variety dump utility (like the venerable od on UNIX systems, or the standard Windows hex dump, though I don’t think there is one) only shows you the raw bytes, so you have to head over to unicode.org to find out what they mean. But first you need to decode UTF-8 to get the actual code points, or grok UTF-16 LE or BE, and so on. It’s fun, but it’s not for everyone.

The udump utility shows you a nice list of character names, together with their offsets in the file. Currently it only handles UTF-8, so each offset is calculated from the UTF-8 encoded length of the characters that come before it.
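To give a feel for the idea, here is a minimal sketch of that kind of dump. It is not the actual udump source: the function name udump_lines and the output format are made up for illustration, and it assumes the whole input is valid UTF-8 and fits in memory. Character names come from the standard unicodedata module, and the offset advances by the number of bytes each character takes in UTF-8.

```python
import unicodedata


def udump_lines(data: bytes):
    """Yield one 'offset: U+XXXX NAME' line per character of UTF-8 data.

    Hypothetical helper for illustration, not the real udump code.
    """
    offset = 0
    for ch in data.decode("utf-8"):
        # Control characters and unassigned code points have no name,
        # so fall back to a placeholder instead of raising ValueError.
        name = unicodedata.name(ch, "<unnamed>")
        yield f"{offset:08x}: U+{ord(ch):04X} {name}"
        # Advance by the encoded length of this character (1 to 4 bytes).
        offset += len(ch.encode("utf-8"))


# Example: 'n', 'e with acute' (2 bytes), euro sign (3 bytes).
sample = "n\u00e9\u20ac".encode("utf-8")
for line in udump_lines(sample):
    print(line)
```

To dump a real file you would read it in binary mode and pass the bytes in, for example udump_lines(open(path, "rb").read()), where path is whatever file you want to inspect.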
