Robin's Blog

How effective is my research programming workflow? The Philip Test – Part 4

10. Can you re-generate any intermediate data set from the original raw data by running a series of scripts?

It depends which of my projects you’re talking about. For some of my nicely self-contained projects this is very easy – everything is encapsulated in a script or a series of scripts, and you can go from raw data, through all of the intermediate datasets, to the final results very easily. The methods by which this is done vary, and include a set of Python scripts or the ProjectTemplate package in R. Since learning more about reproducible research, I try to ‘build in’ reproducibility from the very beginning of my research projects. However, I’ve found it very difficult to add reproducibility to a project retrospectively – if I start a project without considering it then I’m in trouble. Unfortunately, a good proportion of my PhD is in that category, so not everything in the PhD is reproducible. However, the main algorithm that I’m developing is fully source-controlled, relatively well documented and reproducible. Thank goodness!
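To make the ‘series of scripts’ idea concrete, here is a minimal sketch of a driver that goes from a raw file to two intermediate datasets. It is not taken from any of my actual projects: the file names (01_clean.csv, 02_summary.csv), the site and value columns, and the cleaning rule are all assumptions chosen purely for illustration.

```python
import csv
import tempfile
from pathlib import Path


def clean(raw_path, out_path):
    """Stage 1: drop rows with a missing value (hypothetical cleaning rule)."""
    with raw_path.open(newline="") as src, out_path.open("w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if row["value"] != "":
                writer.writerow(row)


def summarise(clean_path, out_path):
    """Stage 2: compute the mean value per site from the cleaned data."""
    values_by_site = {}
    with clean_path.open(newline="") as src:
        for row in csv.DictReader(src):
            values_by_site.setdefault(row["site"], []).append(float(row["value"]))
    with out_path.open("w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(["site", "mean_value"])
        for site in sorted(values_by_site):
            values = values_by_site[site]
            writer.writerow([site, sum(values) / len(values)])


def run_pipeline(raw_path, work_dir):
    """Rebuild every intermediate dataset from the raw file, in order."""
    work_dir.mkdir(parents=True, exist_ok=True)
    cleaned = work_dir / "01_clean.csv"
    summary = work_dir / "02_summary.csv"
    clean(raw_path, cleaned)
    summarise(cleaned, summary)
    return summary


if __name__ == "__main__":
    # Demonstration on a tiny made-up raw file in a temporary directory.
    with tempfile.TemporaryDirectory() as tmp:
        raw = Path(tmp) / "raw.csv"
        raw.write_text("site,value\nA,1.0\nA,3.0\nB,\nB,4.0\n")
        print(run_pipeline(raw, Path(tmp) / "intermediate").read_text())
```

The point of the design is that the raw file is never modified, and each stage reads only the output of the stage before it, so deleting the whole intermediate directory and re-running the driver should give you the same files back.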

11. Can you re-generate all of the figures and tables in your research paper by running a single command?

The answer here is basically the same as above: for some of my projects definitely yes, for others, definitely no. Again, there seems to be a pattern that smaller, more self-contained projects are more reproducible – and not all figures/tables of my PhD thesis can be reproduced – but generally you’ve got a relatively good chance. At the moment I don’t use things like Makefiles, and don’t write documents with Sweave, knitr or equivalents – so to reproduce a figure or table you’ll often have to find a specific Python file and run it (e.g. create_boxplot.py, or plot_fig1.py), but it should still produce the right results.
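Short of a Makefile, one way to get closer to ‘a single command’ would be a small driver that finds every figure script and runs it. The following is only a sketch of that idea, not something I currently use: it assumes the figure scripts all live in one directory, follow the plot_*.py / create_*.py naming seen above, and can each be run with no arguments. The function names are made up for the example.

```python
import subprocess
import sys
from pathlib import Path


def find_figure_scripts(script_dir, patterns=("plot_*.py", "create_*.py")):
    """Return the figure scripts in script_dir, sorted by file name."""
    scripts = set()
    for pattern in patterns:
        scripts.update(script_dir.glob(pattern))
    return sorted(scripts)


def regenerate_all(script_dir):
    """Run each figure script in turn, stopping at the first failure."""
    ran = []
    for script in find_figure_scripts(script_dir):
        # Use the current interpreter, and run from the script's own
        # directory so that relative data paths inside it still resolve.
        subprocess.run([sys.executable, script.name], cwd=script_dir, check=True)
        ran.append(script)
    return ran
```

Calling regenerate_all(Path("figures")) from a tiny wrapper script would then rebuild everything in one go (the figures directory name is, again, just an assumption). Unlike a Makefile, this re-runs every script each time rather than only those whose inputs have changed, which is fine for small projects but slow for large ones.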

12. If you got hit by a bus, can one of your lab-mates resume your research where you left off with less than a week of delay?

Not really – it would be difficult, even for my supervisor or someone who knew a lot about what I was doing, to take over my work. My “bus factor” is definitely 1 (although I hope that the bus factor for Py6S is fractionally greater than 1). Someone who had a good knowledge of Python programming, including numpy, scipy, pandas and GDAL, would have a good chance at taking over one of my better-documented and more-reproducible smaller projects – but I think someone would struggle to pick up my PhD. In many ways though, that’s kinda the point of a PhD – you’re meant to end up being the World Expert in your very specific area of research, which would make it very difficult for anyone to pick up anyone else’s PhD project.

For one of my other projects, it may take a while to get familiar with it – but it should be perfectly possible to take my code, along with drafts of papers and/or other documentation I’ve written, and continue the research. In many ways that is the whole point of reproducible research: aiming to develop research that someone else can easily reproduce and extend. The only difference is that usually the research is reproduced/extended after you’ve completed it, whereas if you get hit by a bus then it’ll never have been completed in the first place!


Categorised as: Academic, Programming

