How effective is my research programming workflow? The Philip Test – Part 3
7. Do you use version control for your scripts?
Yes, almost always. I've found this a lot easier since I started using Git – starting version control with Git simply requires running "git init" – whereas with SVN you had to configure a new repository and do all sorts of admin work before you could start. Of course, if you want to host the Git repository remotely then you'll need to do some admin, but all of that can be done later, once you've started recording changes to your code. This really helps me use version control from the beginning of a project – there's less to interrupt me when I'm in the mood to write code rather than fiddle around with admin.
Occasionally I still forget to start using version control at the beginning of a project, and I always kick myself when I don’t. This tends to be when I’m playing around with something that doesn’t seem important, and won’t have any future…but it’s amazing how many of those things go on to become very important (and sometimes get released as open-source projects of their own). I’m trying to remember to run git init as soon as I create a new folder to write code in – regardless of what I’m writing or how important I think it’ll be.
8. If you show analysis results to a colleague and they offer a suggestion for improvement, can you adjust your script, re-run it, and produce updated results within an hour?
Sometimes, but often not – mainly because most of my analyses take more than an hour to run! However, I can often make the alterations to the script or data and start it running within the hour – and colleagues are often surprised at this. For example, someone I'm currently working with was very surprised that I could re-run my analyses to produce maximum values rather than mean values really easily – in fact, within about five minutes! Of course, all I needed to do was change the mean() function in my code to the max() function and re-run: simple!
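A quick sketch of why that change was so easy (this is an illustrative example, not my actual analysis code – summarise_images and the sample data are hypothetical): if the aggregation function is passed in as a parameter, switching from mean to max is a one-word change.

```python
import numpy as np

# Hypothetical analysis step: summarise each image's pixel values.
# Switching the summary statistic is just a change of the aggregation
# function (np.mean -> np.max), so the re-run takes minutes, not hours.

def summarise_images(images, aggregate=np.mean):
    """Apply an aggregation function to each image array."""
    return [float(aggregate(img)) for img in images]

images = [np.array([[1.0, 2.0], [3.0, 4.0]]),
          np.array([[10.0, 20.0], [30.0, 40.0]])]

means = summarise_images(images)           # [2.5, 25.0]
maxima = summarise_images(images, np.max)  # [4.0, 40.0]
```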
I’ve already talked a bit on this blog about ProjectTemplate – a library for R that I really like for doing reproducible research – and its benefits are directly relevant to this question. It defines separate folders for library functions and analysis functions, and it caches the results of analyses so they don’t have to be re-run unless something has changed. That really helps me know what to change, and lets me produce new results quickly. I’d recommend all R programmers check it out – it’s really handy.
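ProjectTemplate itself is an R package, but the caching idea behind it is language-agnostic. Here's a minimal sketch of the same pattern in Python (the cached helper and the pickle-file path are my own illustration, nothing to do with ProjectTemplate's actual implementation): expensive results are saved to disk the first time, and later runs just reload them.

```python
import os
import pickle

def cached(path, compute):
    """Return a previously cached result if one exists on disk;
    otherwise run compute(), cache the result, and return it."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result

# First call runs the (stand-in) analysis and writes the cache file;
# subsequent calls just reload it, so only changed analyses re-run.
result = cached("analysis1.pkl", lambda: sum(range(1000)))
```

To invalidate the cache after changing an analysis, you simply delete the corresponding file – ProjectTemplate manages this bookkeeping for you in R.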
9. Do you use assert statements and test cases to sanity check the outputs of your analyses?
Rarely – except in libraries that I develop, such as Py6S. I recently wrote a post on how I’d set up continuous integration and automated testing for Py6S, and I am strongly aware of how important it is to ensure that Py6S is producing accurate results, and that changes I make to improve (or fix!) one part of the library don’t cause problems elsewhere. However, I don’t tend to do the same for non-library code. Why?
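To illustrate the kind of sanity check I mean, here's a minimal unittest sketch. The reflectance_to_radiance function is a made-up toy, not part of the real Py6S API – the point is the shape of the tests: check known outputs, and check that relationships between outputs hold.

```python
import unittest

def reflectance_to_radiance(reflectance, solar_irradiance):
    """Toy conversion used purely to illustrate a unit test."""
    return reflectance * solar_irradiance / 3.14159

class SanityChecks(unittest.TestCase):
    def test_zero_reflectance_gives_zero_radiance(self):
        # A physically obvious boundary case.
        self.assertEqual(reflectance_to_radiance(0.0, 1850.0), 0.0)

    def test_radiance_scales_with_reflectance(self):
        # Doubling reflectance should double radiance.
        low = reflectance_to_radiance(0.1, 1850.0)
        high = reflectance_to_radiance(0.2, 1850.0)
        self.assertAlmostEqual(high, 2 * low)

# Run with: python -m unittest <module name>
```

Hooked into a continuous integration service, tests like these run on every commit, which is how you catch a fix in one part of a library breaking another.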
Well, I’m not really sure. I can definitely see the benefits, but I’m not sure whether they outweigh the time I’d have to put into creating the tests. Assert statements would be easier to add, but I’m not sure whether I should start using them in my code. Currently I use exception handling in Python to catch errors, print error messages and – crucially – set all of the results affected by that error to something appropriate (normally NaN). This has the benefit that the code continues to run: if one of the 100 images that I’m processing has something strange about it, my entire processing chain doesn’t stop – the error gets recorded for that image, and the rest of the images still get processed. I actually think that’s better – but if someone has an argument as to why I should be using assert statements then I’d love to hear it.
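Here's a sketch of the catch-and-continue pattern I'm describing (the file names and the per-image analysis are hypothetical stand-ins): one bad image gets a NaN and a logged error, and the rest of the batch still gets processed.

```python
import numpy as np

def process_image(values):
    """Stand-in for the real per-image analysis; raises on bad input."""
    if len(values) == 0:
        raise ValueError("empty image")
    return float(np.mean(values))

def process_all(images):
    """Process every image; on error, record NaN and carry on, so one
    strange image doesn't stop the whole processing chain."""
    results = []
    for name, values in images:
        try:
            results.append(process_image(values))
        except ValueError as err:
            print(f"Error processing {name}: {err}")
            results.append(float("nan"))
    return results

results = process_all([("img_a.tif", [1.0, 2.0, 3.0]),
                       ("img_b.tif", []),          # the "strange" image
                       ("img_c.tif", [4.0, 6.0])])
# results -> [2.0, nan, 5.0]
```

An assert, by contrast, would have killed the whole run at the second image – which is exactly the trade-off in question.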