How effective is my research programming workflow? The Philip Test – Part 2
This is the second in my series of posts examining how well I fulfil each of the items on the Philip Test. The first part, with an explanation of exactly what this is, is available here. This time we’re moving on to the next three items in the list:
4. Are your scripts, data sets and notes backed up on another computer?
Let’s take these one at a time. My scripts are nearly always backed up. The exact method varies: sometimes it is just by using Dropbox, but I try to use proper source control (with Git and Github) as much as possible. The time that this falls apart is when I’ve been developing some code for a while, and just somehow ‘forgot’ to put it in source control at the start, and then never realise! This is particularly frustrating when I want to look at the history of a project later on and find one huge commit at the beginning with a commit message saying “First commit, forgot to do this earlier – oops”.
Of course, Git by itself doesn’t count as a backup: you need to actually push the commits to some sort of remote repository to get a proper backup. I try to keep as much of my code open as possible, and make it public on Github (see the list of my repositories), but I can’t do this with all of my code – particularly for collaborative projects when I don’t have the permission of the other contributors, or when the license for parts of the code is unknown. For these I tend to either have private repositories on Github (I have five of these free as part of a deal I got), or to just push to a git remote on my Rackspace server.
Notes are fairly straightforward: electronic notes are synchronised through Dropbox (for my LaTeX notes), and through Simplenote for my other ASCII notes. My paper notes aren’t backed up anywhere – so I hope I don’t lose my notebook!
Data is the difficult part of this, as the data I use is very large. Depending on what I’m processing, individual image files can range from under 100 MB to 30-40 GB for a single image (the latter is for airborne images which have absolutely huge amounts of data in them). Once you start gathering together a lot of images for whatever you’re working on, and then combine these with the results of your analyses (which will often be the same size as the input images, or possibly even larger), you end up using a huge amount of space. It’s difficult enough finding somewhere to store this data – let alone somewhere to back it up! At the moment, my computer at work has a total of 4.5 TB of storage, through both internal and external hard drives, plus access to around 1 TB of networked storage for backup – but I’m having to think about buying another external hard drive soon as I’m running out of space.
One major issue in this area is that university IT services haven’t yet caught up with ‘the data revolution’, and don’t realise that anyone needs more than a few GB of storage space – something that really needs to change! In fact, data management by itself is becoming a big part of my workload: downloading data, putting it in sensible folder structures, converting data, removing old datasets etc. takes a huge amount of time. (It doesn’t help that I’m scared of deleting anything in case I need it in future!)
5. Can you quickly identify errors and inconsistencies in your raw datasets?
Hmm, I’d probably say “most of the time”. The problem with working on satellite images is that often the only sensible way to identify errors and inconsistencies is to view the images – which is fun (I like actually looking at the images, rather than always working with the raw numbers), but time-consuming. As for non-image data, I find a quick look at the data after importing, and using some simple code to sanity-check the data (such as np.all(data > 0) to check that all of the data have positive values) works well.
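To make the “simple code” idea a bit more concrete, here is a minimal sketch of the kind of check I mean, extended slightly beyond the one-liner above. The column names and values are made up purely for illustration, and sanity_check is a hypothetical helper rather than anything from a library. One thing worth knowing is that a comparison such as data > 0 is False for missing values, so np.all(data > 0) will fail on a column that contains NaNs even if every real measurement is positive; counting the missing values separately makes that visible.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: the column names and values are invented
# for illustration and are not from any real dataset.
data = pd.DataFrame({
    "aot": [0.12, 0.30, np.nan, 0.25, -0.05],
    "temperature": [14.2, 15.1, 15.0, 14.8, 14.9],
})


def sanity_check(df):
    """Return a per-column summary of simple, common problems."""
    return pd.DataFrame({
        "n_missing": df.isna().sum(),          # NaN values in each column
        "n_non_positive": (df <= 0).sum(),     # values that are zero or negative
        "min": df.min(),
        "max": df.max(),
    })


report = sanity_check(data)
print(report)

# The one-line check from the text, applied to a single column.
all_positive = bool(np.all(data["temperature"] > 0))
print("All temperatures positive:", all_positive)
```

A summary table like this is quicker to scan than the raw numbers, and the minimum and maximum often reveal fill values or unit mix-ups straight away.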
The key tools that allow me to do this really easily are Python (particularly with numpy and pandas), ENVI for looking at satellite images (unfortunately I haven’t found any open-source tools that I am quite as productive with), and text editors for reading files. I often use Excel for looking at raw CSV data, although I hate how much Excel pesters me about how “not all features are supported in this file format” – I’d really like a nice simple ‘CSV file viewer’, if anyone knows of one?
6. Can you write scripts to acquire and merge together data from different sources and in different formats?
Yes – but only because I have access to such brilliant libraries.
One thing I end up doing a lot of is merging time series – trying to calculate the closest measurement from a satellite to some sort of ground measurement. I’ve done this in a couple of different ways: sometimes using xts in R and sometimes with Pandas in Python. To be honest, there isn’t much to choose between them, and I tend to use Python now as most of my other code is written in Python.
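As a rough illustration of the Python route, here is a minimal sketch of matching each satellite measurement to the closest ground measurement in time. It uses pandas’ merge_asof with direction="nearest", which is one way of doing this and not necessarily the approach used in any particular project of mine. The data, the column names and the 30-minute tolerance are all assumptions made up for the example.

```python
import pandas as pd

# Hypothetical satellite overpass measurements (values invented for illustration).
satellite = pd.DataFrame({
    "time": pd.to_datetime([
        "2013-06-01 10:30", "2013-06-02 10:25", "2013-06-03 10:35",
    ]),
    "sat_value": [0.21, 0.35, 0.18],
})

# Hypothetical ground measurements taken at irregular times.
ground = pd.DataFrame({
    "time": pd.to_datetime([
        "2013-06-01 10:00", "2013-06-01 10:40",
        "2013-06-02 10:20", "2013-06-03 14:00",
    ]),
    "ground_value": [0.19, 0.22, 0.33, 0.40],
})

# Keep a copy of the ground timestamp so we can see which row was matched.
ground = ground.assign(ground_time=ground["time"])

# merge_asof needs both tables sorted by the key column.
matched = pd.merge_asof(
    satellite.sort_values("time"),
    ground.sort_values("time"),
    on="time",
    direction="nearest",               # closest in time, before or after
    tolerance=pd.Timedelta("30min"),   # assumed cut-off: ignore matches further away
)
print(matched)
```

The tolerance matters: without it, every satellite measurement gets paired with something, even if the nearest ground measurement was taken hours away. With it, rows that have no sufficiently close match end up with missing values, which can then be dropped or inspected.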
More to come in the next installment…