This is the second in my series of posts examining how well I fulfil each of the items on the Philip Test. The first part, with an explanation of exactly what the Philip Test is, is available here. This time we’re moving on to the next three items in the list:
4. Are your scripts, data sets and notes backed up on another computer?
Let’s take these one at a time. My scripts are nearly always backed up. The exact method varies: sometimes it is just by using Dropbox, but I try to use proper source control (with Git and Github) as much as possible. Where this falls apart is when I’ve been developing some code for a while, having somehow ‘forgotten’ to put it in source control at the start – and I don’t realise until much later! This is particularly frustrating when I want to look at the history of a project later on and find one huge commit at the beginning with a commit message saying “First commit, forgot to do this earlier – oops”.
Of course, Git by itself doesn’t count as a backup: you need to actually push the commits to some sort of remote repository to get a proper backup. I try to keep as much of my code open as possible, and make it public on Github (see the list of my repositories), but I can’t do this with all of my code – particularly for collaborative projects where I don’t have the permission of the other contributors, or where the license for parts of the code is unknown. For these I tend to either have private repositories on Github (I have five of these free as part of a deal I got), or to just push to a git remote on my Rackspace server.
Notes are fairly straightforward: electronic notes are synchronised through Dropbox (for my LaTeX notes) and through Simplenote for my other ASCII notes. My paper notes aren’t backed up anywhere – so I hope I don’t lose my notebook!
Data is the difficult part of this, as the data I use is very large. Depending on what I’m processing, individual image files can range from under 100MB to 30-40GB for a single image (the latter is for airborne images which have absolutely huge amounts of data in them). Once you start gathering together a lot of images for whatever you’re working on, and then combine these with the results of your analyses (which will often be the same size as the input images, or possibly even larger), you end up using a huge amount of space. It’s difficult enough finding somewhere to store this data – let alone somewhere to back it up! At the moment, my computer at work has a total of 4.5TB of storage, through both internal and external hard drives, plus access to around 1TB of networked storage for backup – but I’m having to think about buying another external hard drive soon as I’m running out of space.
One major issue in this area is that university IT services haven’t yet caught up with ‘the data revolution’, and don’t realise that anyone might need more than a few GB of storage space – something that really needs to change! In fact, data management by itself is becoming a big part of my workload: downloading data, putting it in sensible folder structures, converting it, removing old datasets and so on takes a huge amount of time. (It doesn’t help that I’m scared of deleting anything in case I need it in future!)
5. Can you quickly identify errors and inconsistencies in your raw datasets?
Hmm, I’d probably say “most of the time”. The problem with working on satellite images is that often the only sensible way to identify errors and inconsistencies is to view the images – which is fun (I like actually looking at the images, rather than always working with the raw numbers), but time-consuming. As for non-image data, I find that a quick look at the data after importing it, plus some simple code to sanity-check it (such as np.all(data > 0) to check that all of the values are positive), works well.
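These checks are only a line or two each. For instance, something like this (the filename and column name here are just illustrative):

```python
import numpy as np
import pandas as pd

# Load the data and eyeball the first few rows and the summary statistics
data = pd.read_csv('measurements.csv')
print(data.head())
print(data.describe())

# Simple sanity checks: no missing values, and everything in a plausible range
assert not data.isnull().any().any(), 'Missing values present'
assert np.all(data['reflectance'] > 0), 'Non-positive values present'
assert np.all(data['reflectance'] <= 1), 'Reflectance values above 1'
```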
The key tools that allow me to do this really easily are Python (particularly with numpy and pandas); ENVI, for looking at satellite images (unfortunately I haven’t found any open-source tools that I am quite as productive with); and text editors for reading files. I often use Excel for looking at raw CSV data, although I hate how much Excel pesters me about how “not all features are supported in this file format” – I’d really like a nice simple ‘CSV file viewer’, if anyone knows of one?
6. Can you write scripts to acquire and merge together data from different sources and in different formats?
Yes – but only because I have access to such brilliant libraries.
One thing I end up doing a lot of is merging time series – trying to calculate the closest measurement from a satellite to some sort of ground measurement. I’ve done this in a couple of different ways: sometimes using xts in R and sometimes with Pandas in Python. To be honest, there isn’t much to choose between them, and I tend to use Python now as most of my other code is written in Python.
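In modern versions of pandas, the whole ‘nearest measurement in time’ merge is a single function call. A minimal sketch (the column names, times and tolerance are made up):

```python
import pandas as pd

# Hypothetical satellite overpass measurements
satellite = pd.DataFrame({
    'time': pd.to_datetime(['2014-01-01 10:03', '2014-01-02 10:41']),
    'sat_aot': [0.21, 0.35],
})

# Hypothetical ground measurements
ground = pd.DataFrame({
    'time': pd.to_datetime(['2014-01-01 10:00', '2014-01-01 11:00',
                            '2014-01-02 10:30']),
    'ground_aot': [0.20, 0.22, 0.33],
})

# For each satellite measurement, find the nearest ground measurement in time,
# ignoring anything more than 30 minutes away (both frames must be sorted by time)
merged = pd.merge_asof(satellite, ground, on='time', direction='nearest',
                       tolerance=pd.Timedelta('30 minutes'))
print(merged)
```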
GDAL/OGR is an essential tool for me to access spatial data through Python code – and, depending on the application, I often use the nicer interfaces that are provided by fiona, rasterio and RIOS.
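To give a flavour of why these interfaces are nicer: reading a band of an image into a numpy array with rasterio is only a couple of lines (the filename is hypothetical):

```python
import rasterio

# Open a satellite image and read its first band as a numpy array
with rasterio.open('scene.tif') as src:
    band = src.read(1)
    print(src.crs, src.bounds, band.shape)
```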
For a while now I’ve been frustrated by an error that I get whenever I’m using git on Windows. When I try and run certain git commands – such as git log or git diff – I get the following message:
Git error message stating “WARNING: terminal not fully functional”
If you press return to continue, the command runs OK (albeit without nice colours). I’d just put up with this for ages, but decided to fix it today – and the fix is really simple. Basically, the problem is that the TERM environment variable is set to something strange, and git can’t cope with it – so it warns you that there might be trouble.
So, to fix this, you just need to make sure that the TERM environment variable is set to “xterm”, which is the standard graphical terminal on Linux machines. There are a number of ways that you can do this: for a single Command Prompt session, run set TERM=xterm before running your git commands; to make the change permanent, run setx TERM xterm, or add the variable through Control Panel → System → Advanced system settings → Environment Variables.
Summary: Rackspace are great: easy-to-use control panel, helpful support, fast servers and I got it all for free to host my open-source projects! Upgrading servers isn’t as easy as it could be, but that’s a very minor problem overall.
After thinking about it for a while, I took advantage of Jesse Noller’s offer that I found on Twitter
and emailed requesting a free Rackspace account to host Py6S, PyProSAIL and RTWTools (along with my other Open-Source projects). To be honest, I wasn’t sure that my projects would qualify – they are relatively niche (although useful within my field) – but Jesse replied quickly and said he’d get things set up for me. I was amazed when they offered $2000 per month in expenses for servers, storage and so on – that seemed like an amazing amount of money to me!
Setup was nice and easy – I created an account and then got a phone call from a lovely chap to confirm I was who I said I was. Even though I forgot to put the proper country code on my phone number when registering, they obviously realised I was from the UK and got through to me – and called at a sensible time of day for me (the middle of the night for them!). Anyway, as soon as I’d had the phone call and confirmed with Jesse, I could start getting things set up.
I hadn’t really used cloud servers before, so wasn’t sure exactly what to expect, but there were helpful guides on their website. I created a Linux server, followed their guide to set it up securely (turning off root access via SSH, changing SSH port, setting up a firewall etc) and got apache working. It was great to have root access on a webserver (previously I’d had shared hosting through Dreamhost, and had been frustrated at being unable to do some things) – I could configure anything to my heart’s content – although I was aware that all of the security would also be down to me!
Anyway, I then created a Windows server to allow me to easily test my software on a Windows machine that doesn’t matter (my work machine is too important for me to risk screwing up by testing weird configurations of my code on it). This machine was costing a fair amount per hour to host, so I assumed I could start it up and shut it down at will, like I’d heard you could do on Amazon’s cloud, but I then found out that this didn’t seem to be the case. If you want to shut down a server so you’re not paying for it, you have to image the server, delete the server, and then recreate it using the image – possible, but a bit of a pain. That’s not a major problem for me, as I’m getting it all for free, but it might be a bit of a frustration for people who are paying for it!
After playing around with my Linux server a bit, I got my software installed and tried to run it. The underlying model (6S, which I didn’t write) kept crashing and I had no idea why. I contacted Rackspace Support, who were very helpful – even though the software I was trying to run was nothing to do with them – and suggested that I tried upgrading to a better spec server. This was a bit of a pain (I had to image the server, delete it, and then create a new one from the image), but I’ve now upgraded to a nice fancy server which is able to run my code in parallel (makes running the tests far faster!).
The server now hosts all sorts of things, including:
An RStudio web-interface (for playing around with possible Py6S/R integration)
An IPython Notebook server (whenever needed, with authentication of course) for doing manual testing of Py6S on the server remotely without needing to SSH
Private Git repositories for various pieces of code that will be open-sourced, but aren’t quite ready to reveal to the world yet
Backups of previous binary versions of RTWTools
And more… (I’m sure I’ve forgotten things)
I also have another server running at the moment with a heavily-secured IPython Notebook interface running on it, for use in some Py6S teaching that I will be doing shortly.
So, overall the experience has been great. Once I’d got the server set up I’ve barely had to touch the Rackspace Control Panel – the server has Just Worked™, with no downtime or problems at all. So – thanks Rackspace!
Philip Guo, who writes a wonderful blog on his views and experiences of academia – including a lot of interesting programming stuff – came up with a research programming version of The Joel Test last summer, and since then I’ve been thinking of writing a series commenting on how well I fulfil each of the items on the test.
For those of you who haven’t come across The Joel Test – it’s a list of simple Yes/No questions you can answer to measure the quality of your software team. Questions include things like: Do you use source control? and Can you make a build in one step? Philip came up with a similar set of questions for research programmers (that is, people who program as part of their research work – which includes researchers in a very wide range of fields these days).
So, starting with the first few questions:
1. Do you have reliable ways of taking, organizing, and reflecting on notes as you’re working?
2. Do you have reliable to-do lists for your projects?
These are probably the most important questions on the list, but they cover something that I’ve often struggled to do well. I often wish for a system like the prototype that Philip developed as part of his PhD (see here), but that wasn’t suitable for use in production. Instead, I tend to make do with a few different systems for different parts of my work.
I have a large ‘logbook’ for my main PhD work (it’s even got the logo of my University on it, and some pages at the back with information on Intellectual Property law), which I try and use as much as possible. This includes comments on how things are working, notes from meetings with my supervisors, To Do lists and so on. When I want to keep electronic notes on my PhD I tend to keep long notes in LaTeX documents (I can write LaTeX documents almost effortlessly now), like my PhD status document: a frequently-updated LaTeX document containing the planned structure of my PhD thesis, with the status of each piece of work and planned completion dates. I often keep shorter notes in Simplenote – a lovely simple web-based ASCII text note system, which synchronises with Notational Velocity for OS X and ResophNotes for Windows.
I also keep my Research Ideas list in Trello – and try and keep it updated as often as possible.
3. Do you write scripts to automate repetitive tasks?
Yes – to an extreme extent, because I’ve been burnt too many times.
I now get scared when I have to do proper analysis through a non-scriptable method – because I know that I’ll have to repeat it at some point, and I know it’ll take a lot of work. I’ve just finished the analysis for a project which is entirely reproducible, apart from a key stage in the middle where I have to export the data as a CSV file and manually classify each row into various categories. That scares me, because when it all needs redoing for whatever reason, the rest of the analysis can be run with a click of a button, but this bit will require a lot of work.
In that example – as in most cases – the manual work probably could be automated, but it’d take so much effort that it (probably) wouldn’t be worth it. It still scares me though…
Looking at this more positively, I find that I’m quite unusual in exactly how much I automate. I know quite a few people who will automate some particularly frustrating repetitive tasks, such as renaming files or downloading data from a FTP site, but I try and do as much of my analysis as possible in code. This really shines through when I need to do the analysis again: I can click a button and go and have a break while the code runs, whereas my colleagues have to sit there clicking around in the GUI to produce their results.
In terms of the tools that I use to do this, they vary depending on what I’m trying to do:
For specific tasks in certain pieces of software, I’ll often use the software’s own scripting interface. For example, if there is already an ENVI function that does exactly what I want to do – and if it is something relatively complex that it would take a lot of effort to implement myself – I’ll write some IDL code to automate running the process in ENVI. I’d do the same for ArcGIS, and even for Microsoft Office.
For filesystem-related tasks, such as organising folder hierarchies and moving, copying and renaming files, I tend to either use unix commands (it’s amazing what can be done with a few commands like find, mv and grep combined together), simple bash scripts (though I am by no means an expert), or write my own Python code if it is a bit more complex (there’s a small example of this just after this list).
For most other things I write Python code – and I tend to find this the most flexible way due to the ‘batteries included’ approach of the standard library (and the Python Module of the Week website to find out how best to use it) and the wide range of other libraries that I can interface with. I’ll be posting soon on my most frequently-used Python modules – so look out for that post.
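As a tiny (made-up) example of the filesystem side of this – filing Landsat scenes into per-year folders based on the year embedded in the scene ID:

```python
from pathlib import Path
import shutil

# File each scene into a folder named after its year, e.g.
# LC81750832014032LGN00.tif -> by_year/2014/
# (the naming scheme and character positions here are just for illustration)
for f in sorted(Path('raw_data').glob('*.tif')):
    year = f.stem[9:13]  # the acquisition year sits at this position in the ID
    dest = Path('by_year') / year
    dest.mkdir(parents=True, exist_ok=True)
    shutil.move(str(f), str(dest / f.name))
```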
So, that’s the first three items on the Philip Test – stay tuned for the next three.
Today I got sent a file by a colleague in OSM format. I’d never come across the format before, but I did a quick check and found that OGR could read it (like pretty much every vector GIS format under the sun). So, I ran a quick OGR command to import it.
Oh dear. The data that I was given was meant to be polygons covering village areas in India, but when I imported it I just got all of the vertices of the polygons. I looked around for a while for the best way to convert this in QGIS, but I gave up when I found that the attribute table didn’t seem to have any information showing in which order the nodes should be joined to create polygons (without that information the number of possible polygons is huge, and nothing automated will be able to do it).
Luckily, when I opened the OSM file in a text editor I found that it was XML – and fairly sensible XML at that. Basically, the format was this:
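Something like this, heavily simplified, with made-up IDs, coordinates and names:

```xml
<osm>
  <node id="1" lat="19.076" lon="72.877"/>
  <node id="2" lat="19.077" lon="72.879"/>
  <node id="3" lat="19.075" lon="72.880"/>
  <way id="101">
    <nd ref="1"/>
    <nd ref="2"/>
    <nd ref="3"/>
    <tag k="name" v="Some Village"/>
  </way>
</osm>
```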
Under a main <osm> tag, there seemed to be a number of <node> elements, each of which had a latitude and longitude value, and then a number of <way> elements, each of which defined a polygon by referencing the nodes by their ID and then adding a few tags with useful information. Great, I thought, I can write some code to process this!
So, that’s what I did, and for those of you who don’t want to see the full explanation, the code is available here.
I used Python, with the ElementTree built-in library for parsing the XML, plus the Shapely and Fiona libraries for dealing with the polygon geometry and writing the Shapefile respectively. The code is fairly self-explanatory, and is shown below, but basically accomplishes the following tasks:
Iterate through all of the <node> elements, and store the lat/lon values in a dictionary, with the node ID used as the key.
For each <way>, iterate through all of the <nd> elements within it and use the ID to extract the appropriate lat/lon values from the dictionary we created earlier.
Take this list of co-ordinates and create a Shapely polygon with it, storing it in a dictionary with the name of the village (extracted from the <tag> element) used as the key.
Iterate through this dictionary of polygons, writing them to a shapefile.
After all of that, we get this lovely output (overlain with the points shown above).
The code definitely isn’t wonderful, but it does the job (and is relatively well commented, so you should be able to modify it to fit your needs). In outline – simplified here, with the input/output filenames and the ‘name’ tag key as assumptions – it looks like this:
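```python
import xml.etree.ElementTree as ET

import fiona
from fiona.crs import from_epsg
from shapely.geometry import Polygon, mapping

tree = ET.parse('villages.osm')  # the input filename is just an example
root = tree.getroot()

# 1. Store each node's (lon, lat) pair in a dictionary, keyed by node ID
nodes = {}
for node in root.findall('node'):
    nodes[node.get('id')] = (float(node.get('lon')), float(node.get('lat')))

# 2 & 3. For each way, look up the referenced nodes and build a Shapely
# polygon, keyed by the village name taken from the way's 'name' tag
polygons = {}
for way in root.findall('way'):
    coords = [nodes[nd.get('ref')] for nd in way.findall('nd')]
    name = None
    for tag in way.findall('tag'):
        if tag.get('k') == 'name':
            name = tag.get('v')
    if name is not None and len(coords) >= 3:
        polygons[name] = Polygon(coords)

# 4. Write the polygons out to a shapefile with a single 'name' attribute
schema = {'geometry': 'Polygon', 'properties': {'name': 'str'}}
with fiona.open('villages.shp', 'w', driver='ESRI Shapefile',
                crs=from_epsg(4326), schema=schema) as out:
    for name, poly in polygons.items():
        out.write({'geometry': mapping(poly), 'properties': {'name': name}})
```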
I’ve spent a long time over the last few days struggling with a problem with a Flask webapp that I’ve been developing. The app worked fine on my local computer, but when I tried to deploy it to my web server and run it via WSGI it seemed to ‘just hang’.
That is – when I visited the URL my browser said Waiting…, but there was nothing in the Apache error log or access log, it didn’t seem to have even registered that I was trying to visit the page. I had no idea what was going on, and after a long period of debugging, I found that removing some of the module imports stopped it hanging. In this case, it was removing the import for Py6S (my Python interface to the 6S Radiative Transfer Model) and matplotlib.
I had no idea why this fixed the problem, and after a lot of searching I found that it was all caused by my Apache WSGI configuration. Documenting this here will hopefully help others with the problem – and also remind me what to do next time that I want to deploy a WSGI app.
Basically, you’ll have a directory structure along these lines: a folder for the app (Web6S in this example), containing the main file for the app with the same name (web6s.py), and a similarly-named WSGI file (web6s.wsgi).
The WSGI file has the following contents:
import sys
sys.path.insert(0, '/path/to/Web6S')  # wherever the Web6S folder lives
from web6s import app as application
which alters the Python path to add the directory that the app is in, and then imports the app object, calling it application (which is what WSGI requires).
That’s fairly simple – the harder bit is the Apache configuration. I’d suggest creating a virtualhost for this, to keep the configuration separate from your other bits of Apache configuration.
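My configuration is along these lines (a sketch rather than my exact file: the server name and paths are placeholders, and the access directives are the old Apache 2.2 style):

```apache
<VirtualHost *:80>
    ServerName web6s.example.com

    # Run the app in its own daemon process rather than embedded in Apache
    WSGIDaemonProcess web6s threads=5
    WSGIScriptAlias / /path/to/Web6S/web6s.wsgi

    <Directory /path/to/Web6S>
        WSGIProcessGroup web6s
        # Force the main Python interpreter: C-extension modules such as
        # matplotlib don't always work in mod_wsgi sub-interpreters, which
        # is exactly the sort of thing that causes silent hangs on import
        WSGIApplicationGroup %{GLOBAL}
        Order allow,deny
        Allow from all
    </Directory>
</VirtualHost>
```

The WSGIApplicationGroup %{GLOBAL} line is the bit most often recommended for this symptom – it forces the app to run in the main Python interpreter, which keeps C-extension-heavy modules like matplotlib happy.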
Probably not all of these bits are essential, but this is what seems to work for me. I didn’t have bits like the WSGIDaemonProcess line before, but adding them fixed it (I can’t guarantee that this is correct, but it works for me).
So, adding those lines into the correct files and restarting Apache should make it work – Good Luck!
As a Fellow of the Software Sustainability Institute I’m always trying to make my software more sustainable – and one element of this is ensuring that my software works correctly. Crashes might annoy users (generally not a good plan if you want your software to be well-used), but a far worse problem is your software producing subtly-incorrect results – which may not be noticed until papers have been published, sensors designed and large research projects started. Definitely not something I want to happen with Py6S!
So, for a while now I’ve been writing various tests for Py6S. You can find them all in the tests folder of the Py6S code, and they can be run by simply running nosetests in the root of the Py6S source code tree. Adding these tests was definitely an improvement, but there were two problems:
I kept forgetting to run the tests after I’d changed things. Then, just before a release, I’d remember to run the tests (hopefully!), find all sorts of problems, and have to try and work out how I’d introduced them.
These tests were mostly regression tests. That means that I ran something in Py6S at the Python console, and then created a test to ensure that the same code would produce the same output that I’d just got. This is useful – as it protects against ‘regressions’, where changes to one part of the code also break things elsewhere – but it doesn’t test that Py6S itself actually produces the right answers. After all, it might have been wrong all along, and a regression test wouldn’t pick that up!
So, I decided to have a big push on testing Py6S and try and fix both of these problems.
Firstly, I set up a Continuous Integration server called Jenkins on my nice shiny Rackspace cloud server. Continuous Integration tools like this are often used to compile software after every commit to the source control system, to ensure that there aren’t any compiler errors that stop everything working – and then to run the test suite on the software. Of course, as Py6S is written in Python it doesn’t need compiling – but using Jenkins is a good way to ensure that the tests are run every time a modification to the code is committed. So now I simply alter the code, commit it, push to Github and Jenkins will automatically run all of the tests and send me an email if anything has broken. Jenkins even provides a public status page that shows that the Py6S build is currently passing all of the tests, and even provides graphs of test failures over time (shown below – hopefully automatically updating from the Jenkins server) and test coverage.
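For reference, the Jenkins ‘build step’ for a pure-Python project like this is typically just a shell command along these lines (the flags come from nose’s xunit and coverage plugins – adjust the package name to suit):

nosetests --with-xunit --with-coverage --cover-package=Py6S

Jenkins then picks up the xunit XML and coverage output to draw its graphs.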
Using Jenkins to provide test coverage reports (which show which lines of code were executed during the tests, and therefore which lines of code haven’t been tested at all) showed me that quite a lot of important bits of Py6S weren’t being tested at all (even with regression tests), and of course I still had the problem that the majority of my tests were just regression tests.
I wasn’t sure what to do about this, as I couldn’t replicate all that 6S does by hand and check that it is giving the right results (even if I knew enough to do this, it’d be very time-consuming), so how could I do anything other than regression tests? Suddenly, the answer came to me: replicate the examples! The underlying 6S model comes with example input files, and the expected outputs for those input files. All I needed to do was to implement Py6S code to replicate the same parameterisation as used in the input files, and check that the outputs were the same. Of course, I wouldn’t do this for every parameter in the output files (again, it’d take a long time to manually set up all of the tests – although it may be worth doing sometime) – but a few of the most-used parameters should give me high confidence that Py6S is giving the same results as 6S.
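One of these example-based tests ends up looking roughly like this – a sketch, with the parameterisation elided, a placeholder expected value rather than one from a real 6S output file, and assuming the underlying 6S executable is installed:

```python
import unittest

from Py6S import SixS


class ExampleFileTests(unittest.TestCase):
    def test_example_input_file(self):
        s = SixS()
        # ... parameterise s here to match one of the 6S example input files ...
        s.run()
        # The 'right answer' comes from the expected output file distributed
        # with 6S itself, so this checks correctness against the underlying
        # model rather than just guarding against regressions
        # (0.123 is a placeholder, not a real value)
        self.assertAlmostEqual(s.outputs.apparent_radiance, 0.123, places=3)
```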
So, that’s what I did. The code is available on Github, and all of these example-based tests pass (well, they do now – as part of writing the tests I found various bugs in Py6S which I fixed).
Overall, I have a far higher confidence now that Py6S is producing correct results, and using Continuous Integration through Jenkins means that I get notified by email as soon as anything breaks.
This is just a quick Public Service Announcement, to let you know that my two main pieces of software have got fancy new websites. Py6S (my Python interface to the 6S Radiative Transfer Model) and RTWTools (my set of extensions for ENVI) are now hosted at:
In summary, what we’re doing is creating a Batch Processing Sequence, but we won’t actually be adding any commands to the sequence – we’ll just be configuring the output options.
Go to the menu Advanced -> Document Processing -> Batch Processing
Click the New Sequence button
Give the sequence a sensible name
You’ll now be in the dialog where you can configure the sequence – and this is where things get slightly counter-intuitive. Rather than selecting any commands at the top of the dialog, ignore everything else and click the Output Options button.
If you just want to save the files with the same filename, but with the appropriate extension for the new filetype then all you need to change is the bit at the very bottom of the dialog. Simply select Export files to alternate format and then choose the format you want from the dropdown box. If you want to do anything fancy with the filename then have a look at the other options: they’re pretty self-explanatory.
Click OK multiple times to get back to the dialog listing all of the sequences.
Select your new sequence in the list and click Run Sequence. You’ll be asked to select the files you want to convert, and then confirm a summary of the sequence (it will be blank as there are no command steps, so just click OK) and then the conversion will be done.
Over the last few months I’ve helped a number of people setup academic websites, and various other people have asked me whether it’s worth a PhD student, Early Career Researcher or other academic creating a website, especially given that it does take a bit of time to do it well. My unequivocal answer is YES!
In emails to these people I’ve given a brief rundown of what websites I have created, what content they have, and – most importantly – what ‘good things’ have resulted from having these websites. Again, using one of Matt Might’s tips for academic blogging, I thought I’d ‘Reply to Public’ by just posting it as a blog post and pointing people here. So, here goes…
Summary: My various websites have been very useful to me. They get my name out there, and have got me paid work (contracting for various people during my PhD), as well as a huge range of opportunities.
I run three main websites: my main academic website, my blog, and my FreeGISData site. First off, these aren’t wonderful, and there are a lot of things that can be improved (particularly the design), but it’s the principle which is important. So, let’s look at those in turn:
This is my ‘academic homepage’ and the link that I put on my business card and give out to people in email signatures etc. It has the standard ‘About Me’ stuff talking about what field I work in, what my research is about, and provides links to my various other sites. One key part of the site is the individual pages for pieces of software that I’ve written, such as Py6S, AutoZotBib and RTWTools, providing brief explanations and links to download them. Again, these are the links that I give out to people who are interested in my software (note that they are simple links that make sense – for example, http://www.rtwilson.com/academic/py6s – it goes where it says on the tin!).
A key page for any academic’s website is the publications list, shown above. Mine uses a graphical approach, breaking up the boring lines of text with thumbnails of the first pages of the articles, which works particularly well for my conference posters.
Importantly, each paper/poster is linked to a full PDF and – even more importantly – the page has metadata in it allowing the papers on the page to be automatically added to citation managers such as Zotero and Mendeley. This was actually harder to do than I thought it would be, so I wrote a Python module to do it all for you – bib2coins. It takes a list of publications formatted as a BibTeX file and converts them to the COinS metadata that the citation managers can understand. I’ve now got a fully automated system that will allow me to add a new publication to my BibTeX file and will then automatically update my CV and the publications page and upload both to my website (hopefully I’ll get a chance to blog about how that works).
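The core idea is that each publication gets an empty <span> whose title attribute encodes an OpenURL ContextObject. A rough sketch of the approach – not bib2coins itself, leaning on the third-party bibtexparser library and handling only a few journal-article fields:

```python
import urllib.parse

import bibtexparser

with open('publications.bib') as f:
    db = bibtexparser.load(f)

for entry in db.entries:
    # Build a minimal OpenURL ContextObject describing a journal article
    fields = {
        'ctx_ver': 'Z39.88-2004',
        'rft_val_fmt': 'info:ofi/fmt:kev:mtx:journal',
        'rft.atitle': entry.get('title', ''),
        'rft.jtitle': entry.get('journal', ''),
        'rft.date': entry.get('year', ''),
        'rft.au': entry.get('author', ''),
    }
    # Citation managers such as Zotero look for spans with class Z3988
    print('<span class="Z3988" title="%s"></span>' % urllib.parse.urlencode(fields))
```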
That reminds me of the other important item on an academic website: a full academic CV. You never know when someone might want to employ you – possibly as a temporary contractor – and the CV will help.
I don’t post on a regular schedule – I just post when I feel I have something interesting and useful to say. In fact, I have quite a few draft posts and ideas for posts that I haven’t found time to turn into full posts yet. Overall it doesn’t take a huge amount of time, but quite a few people visit the blog and find it useful.
This is my simplest site – it is just a long list of links – but it’s my most popular.
I started collecting a set of links to freely available GIS datasets (land cover, climate – all sorts of things) early in my PhD, and then decided that rather than just keeping the list on my computer, I’d put it on the web so that other people can use it. Other people really liked it, and I started to search for other datasets to add to the list. Loads of people have contacted me to suggest more datasets to add to the list – and there are now over 400 links there, in a huge range of categories.
Overall the websites get a fair number of visitors – not absolutely huge, but not too bad either. Approximate average and maximum monthly visits for each of the sites are listed below:
[Table: approximate average and maximum monthly visits for the academic site, the blog and the Free GIS Data site.]
This is the key bit: things that have happened because of my website. The general gist is that it has been really helpful, and has led to money, jobs, and free stuff – amongst other things. So, on with the list:
Jobs: Directly through one of my blog posts, I’ve been contracted to do some processing of Landsat footprint shapefiles by some people in the US. The link to my website on my business card also helped me get some work with a university in London for a few weeks, which was very beneficial for me.
Contacts: A huge range of people have contacted me because of my websites and blog posts – people at universities, in industry, the press, and larger organisations. I’ve had the UN contact me to ask whether it was ok to put my Free GIS Data list in a collection of useful sites to acquire data for disaster management (funnily enough, I said yes!), and I’ve been invited to do a keynote speech at a conference in Florida about the importance of freely available geographic data (although sadly this then fell through due to budget cuts at their organisation).
Free books: After reviewing a couple of books that I’d bought with my own money, I was contacted by a publisher and asked to review one of their books. This continued, and I have now reviewed books from three or four publishers – and one of these reviews was even published in a journal.
Money: I have a ‘donate’ link on my Free GIS Data site. It doesn’t get much at all, but every so often someone donates a fiver. I also have a couple of (small) ads on my blog and my Free GIS Data site, run through the Google AdSense platform. So far I’ve made about £50 from these ads – not much, but enough to cover the hosting costs (or buy myself some nice treats!).
I’m sure there are many more good things that have happened (when my wife reads this she will probably remind me of some that I’ve forgotten) – but even just the ones above were well worth the cost of setting up the sites!