Book review: Data Science at the Command Line

No matter how handy graphical user interfaces are, the good old command line remains a useful tool for performing various low-level data manipulation and system administration tasks. It is the fallback when you need to do something that has no way of graphical control. Being much more expressive and open-ended than a predefined set of controls, the command shell is the ultimate control environment for your computer.

Data science has become one of the most intensely practised computer applications, so it is no wonder that it also benefits greatly from the hands-on control approach of the command line shell. Data scientist Jeroen Janssens has had the foresight to combine the fundamental operations of data science and the most suitable command line tools into a book that collects many useful practices, tips and tricks for processing and preparing data, called “Data Science at the Command Line” (O’Reilly, 2014).

Data Science at the Command Line

At its highest abstraction levels, data science involves using models and machine learning to extract patterns from data and extrapolate results from data sets that are often much larger than fits in memory at any one time. At a lower level, it involves multiple file formats and just plain hard work to get the data in a fit shape to be analysed, and this is where the command line comes in.

There is only so much you can do with canned tools like text editors, but a world of possibilities opens for you when you have the power can chain simple commands together, forming pipelines of data where one command’s output becomes another one’s input. You can also redirect input from a file to a command, and from a command to a file.

Even though Linux and macOS installations have various command shells, apart from the defaults, Janssens shows you how to use a set of tools called the Data Science Toolbox, which actually uses VirtualBox or Vagrant to plant a self contained GNU/Linux environment with Python, R and various other tools of the trade on your local machine, without disturbing the host operating system too much.

With real-life examples, Janssens shows you how to use classic Linux command line tools like cut, grep, tr, uniq and sort to your advantage. You will also learn how to get data from the Internet, from databases and even Microsoft Excel spreadsheets, where most of the world’s operational data lies hidden from plain sight.

From this book I learned completely new and interesting ways to work with CSV (Comma Separated Value) files, and it introduced me to the excellent csvkit, with its collection of power tools to cut, merge and reorder columns in CSV files, perform SQL-style queries on the lines, and grep through them.

Among other things you get information on Drake, described as “make for data” – which, if you’re familiar with the classic software development tool make (and of course you are) should whet your appetite. There is also a chapter about how to make your data pipelines run faster by parallelising them and running commands on remote machines.

Scrubbing the data is less than half the fun, but usually more than half of the work in data science. You will learn to write executable scripts in Python and R with their comprehensive data science and statistics libraries, and learn to explore your data using visualisations that consist of statistical diagrams like bar charts and box plots. So the command line is not just text; even though the images are generated using commands, they are obviously shown in a GUI window.

Finally, there is a chapter on modelling data using both supervised and unsupervised learning methods, which serves as a cursory introduction to machine learning, although you are referred to more comprehensive texts on the algorithms involved.

At the back of the book there is a handy reference for all the commands discussed in the book, which include many of the old UNIX stalwarts found in Linux, but also newer tools like jq for processing JSON.

If you need to do data preparation for a data science project, you owe it to yourself to become good friends with the command line, and utilise the many tools described in Janssens’ book in your daily work. Even if you don’t “automate all the things“, you will benefit from the pipeline approach to data processing.

Buy the e-book at the O’Reiily web shop:
Data Science at the Command Line

The book also has a website, datascienceatthecommandline.com, where you can preview some of its content.

For the history and philosophy of the command line, you should read Neal Stephenson’s In the Beginning Was the Command Line.

C# and F# on the Mac with Mono

Mono is the open source .NET runtime for Windows, Linux, and OS X. It consists of the Mono runtime environment, libraries, and C# and F# compilers. Recently Mono has gained extra popularity due to Microsoft’s purchase of Xamarin, the makers of a cross-platform toolkit of the same name.

If you just want to create command-line .NET applications on the Mac, and don’t need Xamarin.Forms or the mobile tools, you can just install Mono and start hacking away.

The Mono Project home page advises you to download and install Mono as a Mac package, but you also do a a Homebrew-based installation. If you don’t yet have Homebrew (“the missing package manager for OS X”), install it by following the instructions on its home page.

Once you have Homebrew installed, you can install Mono:

$ brew install mono
==> Downloading https://homebrew.bintray.com/bottles/mono-4.2.3.4.yosemite.bottl
######################################################################## 100.0%
==> Pouring mono-4.2.3.4.yosemite.bottle.tar.gz
==> Caveats
To use the assemblies from other formulae you need to set:
export MONO_GAC_PREFIX="/usr/local"
Note that the 'mono' formula now includes F#. If you have
the 'fsharp' formula installed, remove it with 'brew uninstall fsharp'.
==> Summary
🍺 /usr/local/Cellar/mono/4.2.3.4: 1,280 files, 205.2M

You can probably pick up that I’m still using OS X Yosemite on this machine, but there shouldn’t be any difference with El Capitan. If you upgraded from Yosemite to El Capitan, and had Homebrew installed, you may have run into an issue with the OS X security restrictions – read the solution.

C# support for Visual Studio Code

Microsoft Visual Studio Code, or VSCode for short, is a relatively new programmer’s text editor, but already quite mature. Typically I use it for Python, Clojure and JavaScript. Now I wanted to use it to edit C# source files on the Mac, but surprisingly it does not have C# syntax highlighting support out of the box. You need to install an extension and restart VSCode.

Hello, .NET world!

Just a simple C# source file to get you started:

using System;

namespace Hello
{
    class Hello
    {
        static void Main(string[] args)
        {
            Console.WriteLine("Hello, .NET world!");
        }
    }
}

Save it as Hello.cs. Compile with:

mcs Hello.cs

Here, mcs is the Mono C# compiler. You should get a file named Hello.exe, but you can’t execute it directly. Instead, use the Mono runtime:

mono Hello.exe

You should see the greeting printed out by Console.WriteLine.

Why C#?

I’m dusting off the C# tools on my Mac because I envision that C# and .NET will become more important on OS X because of the Xamarin acquisition. I like C#, sometimes better than Java, and have programmed many applications for Windows Phone with it.

Why F#?

F# intrigues me as a language that embraces many of the good things about functional programming, but lets you leverage the .NET ecosystem. I’ve started to learn F# in earnest several times during the last few years, but have not made a concentrated attempt yet. Hopefully soon.

Semi-Autonomous, Programmable Drones Incoming

Drones, or Unmanned Aerial Vehicles (UAVs), be they quadcopters or other type of flyer, will become more “intelligent” as themselves or by forming swarms, as this TED Talk by Vijay Kumar at the University of Pennsylvania shows:

My interest in drones lies not in flying them myself live, because I’m a lousy pilot and don’t play games much anyway, but in making them follow a predetermined route and return back to the starting point – for example, surveying an object or estate, or even carrying cargo between waypoints. The gorgeous aerial shots you get with many drones these days are great, of course, but I’ll let others play the director, and instead concentrate on the programming.

I recently got a Parrot AR.Drone 2.0 Elite Edition, mostly because it was the cheapest quadcopter that has an SDK, allowing you to create your own applications on top of it, or extend and customise some sample applications. (AR.Drone 2.0 SDK)

I did some web searches on the programmability of the AR.Drone, and it seems that the biggest craze has faded a little bit. Many of the libraries for Python and Node.js are not seeing as active development as I would have thought, and groups like NodeCopter are not too active either.

It also seems that some active members have moved on to do greater things, like Fleye, a personal flying robot – the result of work by Laurent Eschenauer and Dimitri Arendt:

The Fleye Kickstarter campaign is still ongoing, with delivery scheduled for September 2016.

Ecshenauer is the author of the Node.js library ardrone-autonomy, which itself is based on node-ar-drone by Felix Geisendörfer.

There is also the python-ardrone library for Python, which I would prefer over Node.js.

I have tested both node-ar-drone and python-ardrone quickly with the AR.Drone 2.0, and it is an exhilarating experience to see your quadcopter come to life and rise up to hover, just by entering a few commands in the Node or Python REPL. (Just make sure you can quickly call the land() function, especially if you are experimenting indoors.)

There are also some Clojure libraries for controlling the AR.Drone, such as clj-drone and turboshrimp, but I’m not sure if I would want to add JVM to the mix.

My inspiration for programming drones actually got sparked by the O’Reilly Programming Newsletter, which featured a recent article by Greg on The Yhat Blog titled “Building a (semi) Autonomous Drone with Python“. It had a lot of tips about how to start with this kind of activity, and extending it to involve computer vision using OpenCV.

I intend to develop some applications that fly the AR.Drone automatically along the perimeter of a large object, such as a house, or along some predetermined line, like the side of a field. I hope to document some of the results in this blog.

If you’re interested in programming semi-autonomous drones, drop me a line with any ideas, tips, questions, or collaborations.

Learning Clojure

About one year ago I wrote a multi-part tutorial on Clojure programming, describing how I wrote a small utility called ucdump (available on GitHub).

Here are links to all the parts:

However, Carin Meier’s Living Clojure is excellent in many ways. Get it from O’Reilly (we’re an affiliate):
Living Clojure

My little tutorial started with part zero, in which I lamented how functional programming is made to appear unlearnable by mere mortals, and it kind of snowballed from there. Hope you like it and/or find it useful!

 

Functional programming without feeling stupid, part 4: Logic

In the previous parts of “Functional programming without feeling stupid” we have slowly been building ucdump, a utility program for listing the Unicode codepoints and character names of characters in a string. In actual use, the string will be read from a UTF-8 encoded text file.

We don’t know yet how to read a text file in Clojure (well, you may know, but I only have a foggy idea), so we have been working with a single string. This is what we have so far:

(def test-str 
  "Na\u00EFve r\u00E9sum\u00E9s... for 0 \u20AC? Not bad!")
(def test-ch { :offset 0 :character \u20ac })
(def short-test-str "Na\u00EFve")

(defn character-name [x]
  (java.lang.Character/getName (int x)))

(defn character-line [pair]
  (let [ch (:character pair)]
    (format "%08d: U+%06X %s"
      (:offset pair) (int ch)
      (character-name ch))))
    
(defn character-lines [s]
  (let [offsets (repeat (count s) 0)
        pairs (map #(into {} {:offset %1 :character %2}) 
          offsets s)]
    (map character-line pairs)))

I’ve reformatted the code a bit to keep the lines short. You can copy and paste all of that in the Clojure REPL, and start looking at some strings in a new way:

user=> (character-lines "résumé")
("00000000: U+000072 LATIN SMALL LETTER R" 
"00000000: U+0000E9 LATIN SMALL LETTER E WITH ACUTE" 
"00000000: U+000073 LATIN SMALL LETTER S" 
"00000000: U+000075 LATIN SMALL LETTER U" 
"00000000: U+00006D LATIN SMALL LETTER M" 
"00000000: U+0000E9 LATIN SMALL LETTER E WITH ACUTE")

But we are still missing the actual offsets. Let’s fix that now.

Continue reading

Functional programming without feeling stupid, part 3: Higher-order functions

Welcome to the third installment of “Functional programming without feeling stupid”! I originally started to describe my own learnings about FP in general, and Clojure in particular, and soon found myself writing a kind of Clojure tutorial or introduction. It may not be as comprehensive as others out there, and I still don’t think of it as a tutorial — it’s more like a description of a process, and the documented evolution of a tool.

I wanted to use Clojure “in anger”, and found out that I was learning new and interesting stuff quickly. I wanted to share what I’ve learned in the hope that others may find it useful.

Some of the stuff I have done and described here might not be the most optimal, but I see nothing obviously wrong with my approach. Maybe you do; if that is the case, tell me about it in the comments, or contact me otherwise. But please be nice and constructive, because…

…in Part 0 I wrote about how some people may feel put off by the air of “smarter than thou” that sometimes floats around functional programming. I’m hoping to present the subject in a friendly way, because much of the techniques are not obvious to someone (like me) conditioned with a couple of decades of imperative, object-oriented programming. Not nearly as funny as Learn You a Haskell For Great Good, and not as zany as Clojure for the Brave and True — just friendly, and hopefully lucid.

xkcd 1270: Functional
xkcd 1270: Functional. Licensed under Creative Commons Attribution-Non-Commercial License. This is a company blog, so it is kind of commercial by definition. Is that a problem?

In Part 1 we played around with the Clojure REPL, and in Part 2 we started making definitions and actually got some useful results. In this third part we’re going to take a look at Clojure functions and how to use them, and create our own — because that’s what functional programming is all about.

Continue reading

Git with the program – use version control

If you are programming, and you are still not using any form of version control, you really have no excuse. There are many benefits to being able to keep track of your code and try out various branches, even if you are the only programmer in the project. If you are collaborating with someone, it soon becomes nearly impossible (or at least very time-consuming) to deal with various versions and changes.

Of all the version control systems I’ve tried over the years (CVS, Subversion, a little bit of Mercurial, and Git) it seems that Git has “won” in a sense. There is a sizable open-source community born around GitHub (and Bitbucket) for which Git works very well indeed. Also many programming tools have built-in or plug-in support for Git, so you don’t even have to use command-line tools for managing your source code repositories if you don’t want to.

For open-source development, GitHub is the obvious choice. If you’re doing closed source, or you think your code isn’t ready for public scrutiny, Bitbucket gives you unlimited private repositories. I’m currently using GitHub to collaborate on some private repositories, which you can get with a paid plan, and Bitbucket for my closed-source app projects.

In a spirited attempt to really learn to use the tools of my trade, I wanted to take some time to better learn Git for version control (and also dive deeper into Xcode, but that is another story).

Earlier I’ve occasionally been using the fine tome Version Control with Git, 2nd Edition* by Jon Loeliger and Matthew McCullough to learn the basics, but I wanted to really dive in. I’ve already mastered the very basics, and have also used remote repositories with both GitHub and BitBucket, but there is a lot more to learn to be able to really take advantage of Git.

Version Control with Git

* Disclaimer: I’m an O’Reilly affiliate, and the links above take you to the O’Reilly online bookstore, in the hope that you purchase something, so that I will get a small commission.

Continue reading