Book review: Data Science at the Command Line

No matter how handy graphical user interfaces are, the good old command line remains a useful tool for performing various low-level data manipulation and system administration tasks. It is the fallback when you need to do something for which there is no graphical control. Being much more expressive and open-ended than a predefined set of controls, the command shell is the ultimate control environment for your computer.

Data science has become one of the most intensely practised computer applications, so it is no wonder that it also benefits greatly from the hands-on control approach of the command line shell. Data scientist Jeroen Janssens has had the foresight to combine the fundamental operations of data science and the most suitable command line tools into a book that collects many useful practices, tips and tricks for processing and preparing data, called “Data Science at the Command Line” (O’Reilly, 2014).


At its highest abstraction levels, data science involves using models and machine learning to extract patterns from data and extrapolate results from data sets that are often much larger than will fit in memory at any one time. At a lower level, it involves multiple file formats and just plain hard work to get the data into a fit shape to be analysed, and this is where the command line comes in.

There is only so much you can do with canned tools like text editors, but a world of possibilities opens up when you can chain simple commands together, forming pipelines of data where one command’s output becomes another one’s input. You can also redirect input from a file to a command, and output from a command to a file.
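To make the pipeline idea concrete, here is a small Python sketch that feeds text through a chain of commands, each one consuming the output of the previous one. It is only a simplified stand-in for the shell’s | operator: it runs the stages one after another instead of concurrently, the pipeline function name is made up for this sketch, and it assumes the standard sort and uniq commands are on your PATH.

```python
import subprocess

def pipeline(text, *commands):
    """Pass text through each command in turn, like cmd1 | cmd2 | ... in a shell."""
    data = text
    for command in commands:
        # Each stage reads the previous stage's output on its standard input.
        result = subprocess.run(
            command, input=data, capture_output=True, text=True, check=True
        )
        data = result.stdout
    return data

if __name__ == "__main__":
    # Sort the lines, then collapse adjacent duplicates.
    print(pipeline("pear\napple\npear\n", ["sort"], ["uniq"]), end="")
```

In a shell, the same chain would simply be written as sort | uniq.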

Linux and macOS installations come with various command shells, but rather than rely on the defaults, Janssens shows you how to use a set of tools called the Data Science Toolbox. It uses VirtualBox or Vagrant to set up a self-contained GNU/Linux environment with Python, R and various other tools of the trade on your local machine, without disturbing the host operating system too much.

With real-life examples, Janssens shows you how to use classic Linux command line tools like cut, grep, tr, uniq and sort to your advantage. You will also learn how to get data from the Internet, from databases and even Microsoft Excel spreadsheets, where most of the world’s operational data lies hidden from plain sight.

From this book I learned completely new and interesting ways to work with CSV (Comma Separated Value) files, and it introduced me to the excellent csvkit, with its collection of power tools to cut, merge and reorder columns in CSV files, perform SQL-style queries on the lines, and grep through them.
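To give a feel for what cutting and reordering CSV columns means, here is a sketch that uses only Python’s standard csv module. It is not csvkit’s API; the cut_columns function and the sample data are invented for illustration.

```python
import csv
import io

def cut_columns(csv_text, columns):
    """Return CSV text containing only the named columns, in the given order."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    # extrasaction="ignore" silently drops the columns we did not ask for.
    writer = csv.DictWriter(
        out, fieldnames=columns, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return out.getvalue()

if __name__ == "__main__":
    sample = "id,name,city\n1,Ann,Oslo\n2,Bob,Rome\n"
    print(cut_columns(sample, ["city", "id"]), end="")
```

Going through a proper CSV parser instead of splitting on commas means that quoted fields containing commas survive intact.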

Among other things you get information on Drake, described as “make for data” – which, if you’re familiar with the classic software development tool make (and of course you are) should whet your appetite. There is also a chapter about how to make your data pipelines run faster by parallelising them and running commands on remote machines.

Scrubbing the data is less than half the fun, but usually more than half of the work in data science. You will learn to write executable scripts in Python and R with their comprehensive data science and statistics libraries, and learn to explore your data using visualisations that consist of statistical diagrams like bar charts and box plots. So the command line is not just text; even though the images are generated using commands, they are obviously shown in a GUI window.

Finally, there is a chapter on modelling data using both supervised and unsupervised learning methods, which serves as a cursory introduction to machine learning, although you are referred to more comprehensive texts on the algorithms involved.

At the back of the book there is a handy reference for all the commands discussed in the book, which include many of the old UNIX stalwarts found in Linux, but also newer tools like jq for processing JSON.

If you need to do data preparation for a data science project, you owe it to yourself to become good friends with the command line, and utilise the many tools described in Janssens’ book in your daily work. Even if you don’t “automate all the things”, you will benefit from the pipeline approach to data processing.

Buy the e-book at the O’Reilly web shop:
Data Science at the Command Line

The book also has a website, where you can preview some of its content.

For the history and philosophy of the command line, you should read Neal Stephenson’s In the Beginning Was the Command Line.

Semi-Autonomous, Programmable Drones Incoming

Drones, or Unmanned Aerial Vehicles (UAVs), be they quadcopters or other types of flyer, will become more “intelligent”, either individually or by forming swarms, as a TED Talk by Vijay Kumar of the University of Pennsylvania shows.

My interest in drones lies not in flying them live myself, because I’m a lousy pilot and don’t play games much anyway, but in making them follow a predetermined route and return to the starting point – for example, surveying an object or estate, or even carrying cargo between waypoints. The gorgeous aerial shots you get with many drones these days are great, of course, but I’ll let others play the director, and instead concentrate on the programming.

I recently got a Parrot AR.Drone 2.0 Elite Edition, mostly because it was the cheapest quadcopter that has an SDK, allowing you to create your own applications on top of it, or extend and customise some sample applications. (AR.Drone 2.0 SDK)

I did some web searches on the programmability of the AR.Drone, and it seems that the biggest craze has faded a little bit. Many of the libraries for Python and Node.js are not seeing as active development as I would have thought, and groups like NodeCopter are not too active either.

It also seems that some active members have moved on to do greater things, like Fleye, a personal flying robot – the result of work by Laurent Eschenauer and Dimitri Arendt.

The Fleye Kickstarter campaign is still ongoing, with delivery scheduled for September 2016.

Eschenauer is the author of the Node.js library ardrone-autonomy, which itself is based on node-ar-drone by Felix Geisendörfer.

There is also the python-ardrone library for Python, which I would prefer over Node.js.

I have tested both node-ar-drone and python-ardrone quickly with the AR.Drone 2.0, and it is an exhilarating experience to see your quadcopter come to life and rise up to hover, just by entering a few commands in the Node or Python REPL. (Just make sure you can quickly call the land() function, especially if you are experimenting indoors.)

There are also some Clojure libraries for controlling the AR.Drone, such as clj-drone and turboshrimp, but I’m not sure if I would want to add the JVM to the mix.

My inspiration for programming drones was actually sparked by the O’Reilly Programming Newsletter, which featured a recent article by Greg on The Yhat Blog titled “Building a (semi) Autonomous Drone with Python”. It had a lot of tips on how to get started with this kind of activity, and on extending it to involve computer vision using OpenCV.

I intend to develop some applications that fly the AR.Drone automatically along the perimeter of a large object, such as a house, or along some predetermined line, like the side of a field. I hope to document some of the results in this blog.

If you’re interested in programming semi-autonomous drones, drop me a line with any ideas, tips, questions, or collaborations.

LCD-like banners in Python

Back in 1998 or so, I wrote a CD player application for Microsoft Windows in Borland Delphi. It was for a magazine tutorial article, and I wanted a cool LCD-like display to show track elapsed and remaining time. There was a good one available for Delphi, called LCDLabel, written by Peter Czidlina (if you’re reading this, thanks once more for your cooperation).

I’ve thought about doing a modern version of the LCD display component several times over the years, and I even got pretty far with one for OS X in 2010, but then abandoned it because of other projects. A few years ago I did some experiments with the LCD font file and wrote a small Python app to test it.

My most recent idea involving simulated LCD displays is to create a custom component for iOS and OS X in Swift. For that, I dug up the most recent Python project and tried to nail down the LCD font file format, so that I could later use it in Swift. I decided to use JSON.

The LCD font consists of character matrices, typically 5 columns by 7 rows, each describing a character on the LCD panel. The value of a matrix cell is one if the dot should be on, and zero if it should be off. I decided to store each cell value as an integer, even if it is a bit wasteful – but it is easy to maintain, and if you squint a bit, you can see the shape of the LCD character.

So the digit zero would be represented as a 2-dimensional matrix like this:
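The matrix itself is missing from this page. As a sketch, this is the shape it would take when written as a Python list of rows. The dot pattern is inferred from the banner output further down rather than copied from the original font file, and the last row is an assumption, since that row is not visible in the output. The render helper is made up here just to show the squinting trick.

```python
# 5 columns by 7 rows; 1 = dot on, 0 = dot off.
ZERO = [
    [0, 1, 1, 1, 0],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 1, 1],
    [1, 0, 1, 0, 1],
    [1, 1, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [0, 1, 1, 1, 0],  # assumed: this row is not visible in the banner output
]

def render(matrix):
    """Draw a matrix with X for lit dots and . for dark ones."""
    return "\n".join(
        "".join("X" if cell else "." for cell in row) for row in matrix
    )

if __name__ == "__main__":
    print(render(ZERO))
```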


The font can contain as many characters as you like, but you need to identify them somehow. In JSON, you can do this with one-character strings as keys, each written as the Unicode escape for the character’s code point. So, with some additional useful information, a font with just the numeric digits 0, 1, and 2 would be represented in JSON like this:

{
  "name": "Hitachi",
  "columncount": 5,
  "rowcount": 7,
  "characters": {
    "\u0030": [ ... ],
    "\u0031": [ ... ],
    "\u0032": [ ... ]
  }
}

(Each [ ... ] stands for that character’s matrix of rows, omitted here.)

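Once parsed, those escaped keys are ordinary one-character strings: "\u0030" in the file becomes the key "0" in Python. Here is a minimal sketch of loading a font in this layout and checking it against its declared dimensions. The miniature 3 by 2 "Demo" font and the check_font helper are invented for illustration and are not part of the real font file.

```python
import json

# Hypothetical miniature font (3 columns by 2 rows), only to show the structure.
FONT_JSON = r'''
{
  "name": "Demo",
  "columncount": 3,
  "rowcount": 2,
  "characters": {
    "\u0030": [[1, 1, 1],
               [1, 0, 1]]
  }
}
'''

font = json.loads(FONT_JSON)

# The escaped key "\u0030" is the plain string "0" after parsing.
matrix = font["characters"]["0"]

def check_font(font):
    """Return True if every matrix matches the declared row and column counts."""
    return all(
        len(m) == font["rowcount"]
        and all(len(row) == font["columncount"] for row in m)
        for m in font["characters"].values()
    )
```

A check like this catches a row with a missing cell before it turns into a crooked character on the display.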
With the font coming along nicely, I wrote a Python script to exercise it, by printing a banner-like message:

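import json

def banner(message):
    # Collect the dot matrix of every character in the message.
    # (Loop body reconstructed; it assumes the module-level 'characters'
    # dict that is loaded from the font file below.)
    mats = []
    for ch in message:
        mats.append(characters[ch])

    output = ''
    num_rows = len(mats[0])
    num_cols = len(mats[0][0])

    # Build the banner one dot row at a time, across all characters.
    for r in range(num_rows):
        for m in mats:
            for c in range(num_cols):
                if m[r][c] == 1:
                    output += 'X'
                else:
                    output += '.'
            output += ' '
        output += '\n'
    return output

with open('lcd-font-hitachi.json') as json_file:
    font_data = json.load(json_file)
    characters = font_data['characters']

# Assumed call; the original invocation is not shown.
print(banner('012'))
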

Running this Python script would print out a banner like this one:

.XXX. ..X.. .XXX.
X...X .XX.. X...X
X..XX ..X.. ....X
X.X.X ..X.. ...X.
XX..X ..X.. ..X..
X...X ..X.. .X...
.XXX. .XXX. XXXXX

By adding characters to the JSON font file it becomes possible to print text messages instead of just numbers:

X...X ..... .XX.. .XX.. ..... ..X..
X...X ..... ..X.. ..X.. ..... ..X..
X...X .XXX. ..X.. ..X.. .XXX. ..X..
XXXXX X...X ..X.. ..X.. X...X ..X..
X...X XXXXX ..X.. ..X.. X...X ..X..
X...X X.... ..X.. ..X.. X...X .....
X...X .XXX. .XXX. .XXX. .XXX. ..X..

But I think that a custom control for iOS in Swift would see most use in games or applications displaying numeric parameters like volume level, geographical coordinates or RPM.

If you want to learn Python, here is a good book:

Learning Python
by Mark Lutz

Functional programming without feeling stupid

If you follow software design trends (yes, they exist), you may have noticed an increasing amount of buzz about functional programming, and particularly the Clojure language. While functional programming is hard to define, almost everyone mentions pure functions, the lack of side effects and state, and easy parallelisation. As for Clojure, it is all about (a kind of) Lisp running on the Java Virtual Machine (and .NET, and transformed to JavaScript).

I’m somewhat convinced that functional programming is at least worth knowing about and trying out, even if you don’t expect to fully convert. It has been said that learning about the functional paradigm makes you a better programmer in your current imperative language. Functional languages reduce accidental complexity, and that helps you focus.

“Whoop de doo, what does it all mean, Basil?”

If you have a background in imperative languages, you will have an interesting time if and when you start digging into functional programming, because whatever else it is, it’s different. And I’m not talking about syntax only, but most of what you do. If you need to add an item to a list, you construct a new list with the new item appended to the previous list (no, it is not as inefficient as it sounds, because there is great stuff under the hood to handle that). This is because immutability is one of the cornerstones of functional programming. If you can’t change something after it is created, there is no state to mess up. You program with values, not stateful objects.
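As a tiny illustration of programming with values, here is how appending looks with Python’s immutable tuples. This is only a sketch of the idea: unlike the persistent data structures in a language like Clojure, a Python tuple is copied in full on every append, and the append function is made up for this example.

```python
def append(items, item):
    """Return a new tuple with item at the end; the original is left untouched."""
    return items + (item,)

original = (1, 2, 3)
extended = append(original, 4)

if __name__ == "__main__":
    print(original)
    print(extended)
```

Because original can never change, any code still holding a reference to it sees exactly the same value as before.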

I see I’m getting myself tricked into presenting a definition of functional programming, when that has been done better elsewhere. For pointers, see Michael Fogus’ 10 Technical Papers Every Programmer Should Read (At Least Twice), including the classic “Why Functional Programming Matters” by John Hughes. But I actually wanted to talk about something else.

Continue reading

Thinking of Learning Python? Start here!

Python is one of the friendliest general-purpose programming languages out there. It is free to use, well supported and used by many big companies. Since its introduction in 1991, it may not have taken the world by storm, but has gained a huge share of programmers’ interest. As of this writing (November 2014), Python is number 8 on the TIOBE Index.

Recently I have been studying bioinformatics, and in the course of my studies I have met many people who are learning to program for the first time, and doing it with Python. Others have a little bit of programming experience, but not in Python. Luckily Python is an excellent language for both groups, because it is clean and easy to learn, but it can still be powerful and expressive.

Beginners, step this way

Learning programming is not easy, but some of the things you need to understand are the same no matter what programming language you study. That is why I recommend Think Python by Allen Downey to all beginners. I’ve been programming for close to 30 years now, and I think that this book is one of the most accessible introductions to programming in general, and Python in particular. The subtitle of the book is “How to think like a computer scientist”, which essentially means “problem solving”. You need to be able to take apart what you are trying to achieve, and then find ways to make the computer do what you mean.


Think Python is free to download from Green Tea Press in PDF format. However, if you want a printed book, you can buy one from O’Reilly.

Seasoned experts, check this out

I first learned Python in the early 2000s, when the language was still relatively unknown, but already had a lot of users. Since I learn best from a good book, I spent some time looking for one about Python, and quickly found Learning Python by Mark Lutz. At the time it was not a lean book anymore: the 2nd edition, which covers Python 2.3, already came up to almost 600 pages. Still, it is an easygoing book which has only gotten better with time.


In recent years I’ve gone strictly e-book only, because I don’t have the shelf space for all the books I want or need, and e-books are also a lot cheaper. My whole programming library fits on my iPad, so it is with me wherever I go. New editions of a popular book like Learning Python typically accumulate more material over the years; the latest, 5th edition covers both Python 2.7 and 3.3, and comes up to (count ’em) 1540 pages. That might already be a little too much for a “learning” book, but there you have it.

To each their own

As a summary:

  • Absolute beginners in programming who want or need to learn Python, get Think Python by Allen Downey.
  • Those who already know a little bit about programming, and want to learn Python,
    get Learning Python by Mark Lutz.

This post contains links to the O’Reilly webstore. If you follow the links and buy a book, I will get a minuscule commission. However, I was using both of these books professionally before I became an O’Reilly affiliate, and I want people to know about them and benefit from them.

Unicode character dump in Python

Sometimes you just need to see what characters are lurking inside a Unicode encoded text file. Your garden variety dump utility (like the venerable od in UNIX systems, or the Windows standard hex dump, though I don’t think there is one) only shows you the plain bytes, so you have to look them up elsewhere to find out what they mean. But first you need to decode UTF-8 to get the actual code points, or grok UTF-16 LE or BE, and so on. It’s fun, but it’s not for everyone.

The udump utility shows you a nice list of character names, together with their offsets in the file. Currently it only handles UTF-8, so the offset is calculated based on the UTF-8 length of the character.
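The core of such a tool fits in a few lines. The following is a minimal sketch of the idea, not the actual udump source: the dump function and its output layout are assumptions, and it relies on the standard unicodedata module for the character names.

```python
import unicodedata

def dump(data):
    """Yield (byte_offset, code_point, name) for each character in UTF-8 bytes."""
    offset = 0
    for ch in data.decode("utf-8"):
        # Control characters have no name, so fall back to a placeholder.
        name = unicodedata.name(ch, "<unnamed>")
        yield offset, ord(ch), name
        # Advance by the number of bytes this character occupies in UTF-8.
        offset += len(ch.encode("utf-8"))

if __name__ == "__main__":
    for offset, code_point, name in dump("n\u00e9\u20ac".encode("utf-8")):
        print(f"{offset:08d}  U+{code_point:04X}  {name}")
```

Tracking the offset in bytes rather than in characters is what lets you match a suspicious character against a plain hex dump of the same file.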

Continue reading