Robin's Blog

Encouraging citation of software – introducing CITATION files

Summary: Put a plaintext file named CITATION in the root directory of your code, and put information in it about how to cite your software. Go on, do it now – it’ll only take two minutes!

Software is very important in science – but good software takes time and effort that could be used to do other work instead. I believe that it is important to do this work – but to make it worthwhile, people need to get credit for their work, and in academia that means citations. However, it is often very difficult to find out how to cite a piece of software – sometimes it is hidden away somewhere in the manual or on the web-page, but often it requires sending an email to the author asking them how they want it cited. The effort that this requires means that many people don’t bother to cite the software they use, and thus the authors don’t get the credit that they need. We need to change this, so that software – which underlies a huge amount of important scientific work – gets the recognition it deserves.

As with many things relating to software sustainability in science, the R project does this very well: if you want to find out how to cite the R software itself you simply run the command:

> citation()

If you want to find out how to cite a package you simply run:

> citation('packagename')

For example:

> citation('ggplot2')

To cite ggplot2 in publications, please use:

  H. Wickham. ggplot2: elegant graphics for data analysis. Springer New York,
  2009.

A BibTeX entry for LaTeX users is

  @Book{,
    author = {Hadley Wickham},
    title = {ggplot2: elegant graphics for data analysis},
    publisher = {Springer New York},
    year = {2009},
    isbn = {978-0-387-98140-6},
    url = {},
  }

In this case the citation was given by the author of the package, in R code, in a file called (surprise, surprise) CITATION inside the package directory. R can even intelligently make up a citation if the author hasn’t provided one (and will do this far better if you use the person class in your DESCRIPTION file). Note also that the function provides a nice handy BibTeX entry for those who use LaTeX – making it even easier to use the citation, and thus reducing the effort involved in citing software properly.

I think the R approach is wonderful, but the exact methods are rather specific to R (it is all based around a citEntry object or a bibentry object and the CITATION file contains actual R code). That’s fine for R code, but what about Python code, Ruby code, C code, Perl code and so on… I’d like to suggest a simpler, slightly more flexible approach for use broadly across scientific software:

Create a CITATION file in the root directory of your project and put something in there that tells people how to cite it

In most cases this will probably be some plain-text which gives the citation, and possibly a BibTeX entry for it, but it could be some sort of code (in the language your project uses) which will print out an appropriate citation when run (and, of course, R users should stick to the standard way of writing CITATION files for R packages – this proposal is really for users of other languages). This CITATION file will be one of a number of ALL CAPITALS files in your project’s directory – it will go alongside your README file (you do have one, don’t you?), and your LICENCE file (you must have one – see my explanation for why) and possibly also your INSTALL file and so on.
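
For example, a CITATION file might contain nothing more than something like the following (the tool name, author and details below are entirely made up, purely for illustration):

To cite ExampleTool in publications, please use:

  Bloggs, J. (2013). ExampleTool: a tool for doing examples, version 1.2.
  http://example.com/exampletool

A BibTeX entry for LaTeX users is:

  @Misc{exampletool,
    author = {Joe Bloggs},
    title = {ExampleTool: a tool for doing examples},
    year = {2013},
    url = {http://example.com/exampletool}
  }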

I know this approach isn’t perfect (machine-readability of citations is a problem using this method, but then again machine readability of citations is a big problem generally…) but I think it is a start and hopefully it’ll reduce the effort required to cite software, and thus encourage software citation.

So, go on – go and write a few CITATION files now!

Beware: Latitude/Longitude co-ordinates in ENVI may not be in WGS-84

Summary: When you use the Pixel Locator or Cursor Location/Value tool in ENVI, the latitude and longitude co-ordinates given are based on the datum that the image is in, not necessarily WGS-84.

This may be obvious to some people, but it wasn’t to me – and I thought that if I got confused then some other people probably would too – hence this article. I often use the Pixel Locator dialog box in ENVI to find a specific location in the image by entering the X and Y pixel location (referred to in ENVI as Samples and Lines) or the map co-ordinates of whatever co-ordinate system the image is projected into (for example, the Ordnance Survey National Grid):


Alternatively, you can click the button with the arrows on it, and enter a location in latitude and longitude:


Very handily, all of these values are updated as you move around the image manually – so this dialog can also be used as an equivalent of the Cursor Location/Value window – as shown below.


This all sounds fine – but the problem is that the latitude/longitude values that are shown are calculated using the datum that is defined for the image. You can see this in the image below, where the latitude and longitude value is displayed, and the datum is listed above it:


This seems like a sensible thing to do, but it makes comparison with latitude and longitude co-ordinates from other sources very difficult – as nearly all other latitude and longitude co-ordinates are provided in the WGS-84 datum. Locations from GPS systems are always given in WGS-84, and locations on web maps, in GIS systems and from most other sources of latitude and longitude co-ordinates are very frequently in WGS-84 too.

So, this raises two questions:

What effect does this have?

Well, I haven’t yet done a full investigation of this (sometime I will sit down and write some code to do some proper testing), but when I found the problem it was causing offsets of around 100-200m – which can be quite significant in many applications.

What can we do about this?

There is a way to tell ENVI to use WGS-84 as the datum for calculating its latitude and longitude values, but it is a bit fiddly. Basically, when you’re in the Pixel Locator dialog before you switch to the latitude/longitude display, click the Change Proj… button and then click the Datum button to select a new datum. Then switch to the latitude/longitude display, and you’ll find that the datum is listed as WGS-84, and the values will be correct in the WGS-84 datum:


Unfortunately, this means that the map co-ordinate values (for example, the Ordnance Survey Grid References) will now be wrong, and you’ll need to switch back to the original datum to get those to be correct. Also, I can’t find a way to get the Cursor Location/Value dialog to display WGS-84 latitudes and longitudes.

I actually found this while designing a lecture and practical session for MSc students at the University of Southampton on using the ENVI API in IDL. While writing a code example for the students on finding the pixel value at a specific lat/lon co-ordinate, I found that my results came out between ten and twenty pixels away from the place I thought they would (I was using 10m SPOT imagery), and I got different results from my IDL code and from ENVI – even though my IDL code was using the ENVI API! Luckily I managed to find out what the problem was – and could therefore explain it to my students, and hopefully stop them running into the same problem as me. It may be that I’m being silly here, and everyone naturally realises what ENVI does, and thinks it is the right thing to do – but it definitely confused me, so maybe it will have helped you too.

van Heuklon Ozone model implementation in Python

As part of my PhD I wanted to use a simple model which would give me an estimation of the atmospheric ozone amount given a location and time of year. Such a model was created by van Heuklon in 1979, and was described – as a single, fairly simple formula – in a delightfully short paper (unfortunately not freely available online, but see the reference below). The map below shows the modelled ozone amounts across the world for 15th April (click to enlarge):


The formula is built up from a baseline ozone amount (235 matm-cm) to which various components are added, some of which vary with latitude, longitude or time of year. The model isn’t intended to be accurate or high-resolution, but provides me with a rough estimate of the ozone amount at a given location at a given time. Accurate measurements of ozone amounts are available from satellites, but I would need data from many different times of year to take into account the seasonal variation in ozone amounts, and the satellite data would take up a lot of space – hence using this simple model.

The code for this model is available on Github and has two implementations: get_ozone_conc and old_get_ozone_conc. The latter uses standard Python code with if statements to set the different values in the formula depending on the input latitudes and longitudes, which means it can only process one latitude/longitude combination at once. get_ozone_conc, on the other hand, uses somewhat more complicated numpy code which allows you to pass it arrays of latitudes and longitudes and it will return an array of results. This is what was used to produce the map above – the code for which is also in the repository.
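
As a rough illustration of the numpy approach (this is not the actual van Heuklon formula or its coefficients – just the general pattern of replacing if statements with element-wise operations), something like the following works on single values and whole arrays alike:

import numpy as np

def toy_ozone_model(lat, lon):
    """Illustrative only: vectorised selection of latitude-dependent terms."""
    lat = np.asarray(lat, dtype=float)
    lon = np.asarray(lon, dtype=float)

    # Baseline ozone amount of 235 matm-cm, as in van Heuklon's model
    result = 235.0

    # np.where replaces an if statement, picking a different (made-up)
    # amplitude for each hemisphere, element-by-element
    amplitude = np.where(lat >= 0, 150.0, 100.0)

    # Made-up latitude- and longitude-dependent terms added to the baseline
    result = result + amplitude * np.sin(np.radians(lat)) ** 2
    result = result + 20.0 * np.sin(np.radians(lon + 20.0))

    return result

# Works for a single location or for a whole grid of locations at once
lats, lons = np.meshgrid(np.arange(-90, 91), np.arange(-180, 181))
ozone_grid = toy_ozone_model(lats, lons)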

This code was created for use as part of my PhD, but has been released into the public domain (as CC0) in case anyone else finds it useful.

Also, for those of you using Py6S (my Python interface to the 6S atmospheric Radiative Transfer Model), I’m sure you will be pleased to know that this model will shortly be integrated into Py6S to allow you to easily specify ozone amounts using it.


van Heuklon, T. K. (1979). Estimating atmospheric ozone for solar radiation models. Solar Energy, 22(1), 63-68

Please specify a license for your software (and even your code samples!)

Another one of the things that came out of the Collaborations Workshop 2013 was the importance of licensing any software or code that you release. The whole point of releasing code is so that other people can use it – but no-one can use it properly unless you tell them what license you have released it under.

I’ve run into various problems with un-licensed code during the last few years, which makes it all the more embarrassing that I didn’t have proper licenses specified for all of my GitHub repositories – I didn’t even have proper licenses specified for some of the packages that I have released on the Python Package Index!

Anyway, I’ve been through and set licenses for all of my repositories, and I’ve explained the procedure I went through for each repository below, which consists of two simple steps:

  1. Choose a license – this is the thing that often scares people, and it can be fairly difficult. Most of the online guides to choosing an open-source license are fairly complicated, and unfortunately not much can be done about that, because some of the licenses themselves are fairly complicated. There are a few key things to keep in mind:
    • Make sure that any license you choose is OSI-approved, or it won’t properly count as ‘open source’
    • Think about whether you want to force everyone who uses your code to release anything they use it in as open-source (as in the GPL), any modifications they make to it as open-source (as in the LGPL), or whether people can do whatever they want with it.
    • Think about whether you want to let anyone – including yourself – use the code in commercial software. If so, depending on your views on the previous question, you may want to release under a license like the LGPL or BSD rather than the GPL, which requires any software that incorporates your code to be released under the GPL as well – effectively ruling out use in most proprietary commercial software.
    • Do you just really not care what happens to the code – anyone can do anything they want, in any way they want, and don’t have to bear you in mind in the slightest? If so, formally releasing the code into the public domain through something like CC0 may be a good way to go – although I wouldn’t recommend this for large bits of code/software as it is not specifically designed with software in mind – it is more appropriate for small code samples and examples.
  2. Tell people what the license is – in as many places as possible – now you’ve gone to the trouble of choosing a license you want people to be able to easily find out what license the code is released under. To achieve that, do as many of the items below as possible (ordered roughly in order of importance):
    • Place a license file in your source directory with a filename of LICENSE or COPYING (depending on the license – generally GPL-style licenses ask you to use a file called COPYING, whereas many other licenses suggest a file called LICENSE). Put a full ASCII text copy of the license into this file.
    • State in the README file (you do have a README file, don’t you?) what license the code is released under.
    • State in the documentation (preferably somewhere fairly obvious, such as above the Table of Contents, or on the Introduction page) what license the code is released under.
    • Add headers to the top of your source code files stating the license that the code is released under, and the author(s). Many licenses provide a suggested short version of the license to put here, but if not, just write a simple comment stating the license and telling readers to look in the LICENSE or COPYING file for more details.
    • If your code has been uploaded to some sort of package repository, such as CRAN, CPAN, the Python Package Index or Rubygems then see if there is some metadata you can fill in to say what license you are using – as this will mean people can search for your software by the type of license it uses (for example, if they’re looking to use it commercially). For example, when writing the setup.py file ready to upload a Python module to the Python Package Index you can fill in various categories – known as trove classifiers (see the short sketch after this list) – and a number of these are for licenses, such as
      License :: OSI Approved :: GNU General Public License (GPL)
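
As a minimal sketch of how this fits into a setup.py (the package name and details here are placeholders, not a real project):

from setuptools import setup

setup(
    name='examplepackage',   # placeholder name
    version='0.1',
    packages=['examplepackage'],
    license='GPL',           # also state the license in the metadata
    classifiers=[
        'License :: OSI Approved :: GNU General Public License (GPL)',
    ],
)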

That’s it – it really is as simple as that. All you need to do now is to commit your changes to your version control system (you are using version control, aren’t you?) and anyone who finds your code will know what license it is under, and therefore what they are allowed to do with it.

However, there is one more thing to consider: what about example code that you post online (on your blog, on StackOverflow etc) – what should you do about that? I’m not a lawyer, but the consensus seems to be that if there isn’t an explicit license attached to the code then it is very difficult to decide what people are allowed to do with it – they’re almost certainly not allowed to use it commercially, and there may – or may not – be uses that count as ‘fair use’. To save everyone a lot of bother I’ve decided to release all of my small code samples into the public domain as CC0. The way I’ll do this is generally by a link in the text surrounding the sample (“As seen in the following example (released under CC0)”) or as a comment at the top of the code. I’ve also taken a number of old Github repositories that I wrote as part of various courses at university (for example, my parallel programming exercises) and released those as CC0, by simply specifying this in the README files.

Hopefully this post will have been useful to you – let me know what you think in the comments below. Is the idea to use CC0 for small code samples a good idea? I think so personally, but I’d be interested to hear alternative points of view.

A simple way to improve sustainability, reproducibility and releasability of your code and data

This is the first of a number of posts based upon discussions I had while at the Collaborations Workshop 2013 (#CollabW13 on Twitter) in Oxford, UK. During one of the sessions I described a simple technique that I try and use to increase the sustainability, reproducibility and releasability of the code that I write, the data I collect and the results of my work – and people thought this idea was great, and that I should blog about it…

So, what is this wonderful technique:

On Friday afternoon (when you’re bored and can’t be bothered to do any hard work…) spend an hour or two cleaning up and documenting your work from that week

It’s a very simple idea, but it really does work – and it also gives you something to do during the last afternoon of the week when you’re feeling tired and can’t really be bothered. If you can’t commit to doing it every week, you can try every fortnight or month – and I even go as far as adding it as an event to my calendar, to try and stop people arranging meetings with me then!

So, what sort of things can you do during this time?

  • Document your code: Depending on the project and the intended use of the documentation, this can be anything from adding some better comments to your code, to documenting individual functions/methods (for example, using docstrings in Python – see the short example after this list) or writing high-level documentation of the whole system.
  • Refactor your code: Refactoring is a “disciplined technique for restructuring an existing body of code, altering its internal structure without changing its external behaviour” – that is, basically tidying up, cleaning up, and possibly redesigning your code. If you’re anything like me, the code you write when actually doing science isn’t very neat, nice or well-designed – because you’re focused on the science at that point. These few hours on a Friday are your time to focus on the code for a bit…
  • Generalise your code to create a library: There are probably a number of things in your code that could be used in many other programs – things like reading certain file formats, performing statistical operations or applying algorithms to array data. This is a perfect time to take these algorithms and ‘decouple’ them from their immediate application in this code so that they can be easily used in other programs you may write in the future. Ideally, this generalised functionality can be packaged into a library and distributed for others to use. My Py6S library was created in this way: I took code that I was using to run the 6S model, generalised it into a library, documented it well, released it – and it has taken on a life of its own!
  • Add README files to your folders: If you’re anything like me, you’ll have loads of folders containing data and code – some of which isn’t particularly well named and may not have metadata. One of the easiest (and most effective) ways to deal with this is to create simple README files in each folder explaining what is in the folder, where it came from, what format it’s in – basically anything that you think you’ll want to know about it in a year’s time if you come back to it. I can say from experience just how useful having these README files is!
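
As a tiny example of the sort of docstring mentioned above (the function itself is purely illustrative), even a few lines make a function much easier to come back to later:

def ndvi(red, nir):
    """Calculate the Normalised Difference Vegetation Index.

    Arguments:
        red -- array of reflectances in the red band
        nir -- array of reflectances in the near-infrared band

    Returns an array of NDVI values in the range -1 to 1.
    """
    return (nir - red) / (nir + red)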

The key benefit of all of these is that it makes it so much easier to come back to your research later on, and it also makes it so much easier for you to share your research, make it reproducible and allow others to build upon it – and the great thing is that it doesn’t even take that much work. Thirty seconds writing a few notes in a README file could easily save you a week of work in a year’s time, and extending your code into a library would allow you to re-use it in other projects without much extra work.

Another similar idea that was mentioned by someone else at the Collaborations Workshop was for Research Councils to force people to add extra time to the end of their grants to do this sort of thing – although personally, I think it is a far better idea to do this as you go along. Trying to add documentation to some code that you wrote two years ago is often quite challenging…

So, there it is – a simple way to use up some time at the end of the week (when you can’t really be bothered to do anything ‘new’) which will significantly improve the sustainability, reproducibility and releasability of your code and data. Try it out, and let us know how you do in the comments below!

Review: Programming ArcGIS 10.1 with Python Cookbook

Summary: A useful guide to automating ArcGIS using Python, which is fully up-to-date with the latest version of ArcGIS. Definitely provides “quick answers to common problems”, but it may take more effort to get a deep understanding of the methods used. Good breadth of coverage – but notably lacks raster examples – and well explained throughout. I have recommended it to colleagues who use ArcGIS and want to start off with Python.

Reference: Pimpler, E., 2013, Programming ArcGIS 10.1 with Python Cookbook, Packt Publishing, 508pp. Amazon Link, Publisher’s Link (with sample chapter and Table of Contents)

I wrote this review in consultation with my wife, Olivia, who has been an ArcGIS user for a few years, but is fairly new to Python programming (she’s done a bit in the last few months, but nothing within ArcGIS). I am more experienced with Python – having released a number of packages to the Python Package Index, and having used Python with ArcGIS in various contexts with both ArcGIS 9.3 and 10.1 – but I thought it would be useful to get an opinion from both a more- and less-experienced user.

We both found the book useful, and learnt a lot from it, but weren’t entirely convinced by the cookbook format. In general, if you like cookbook-style books then this is likely to appeal to you, but if they annoy you then you won’t like it. The benefit of the cookbook format is that it is very easy to pick up the book, find the relevant recipe for what you’re trying to achieve, turn to it and do it immediately, as each recipe is almost entirely self-contained. Of course, the disadvantage is that learning in this way tends to lead to superficial knowledge rather than deep understanding: you may know what to do to achieve a certain result, but not why you’re doing it, and thus cannot adapt the recipes easily to other situations. Another disadvantage is that the recipes are often very repetitive – for example, each recipe in this book starts by saying “Import the ArcPy module using import arcpy”, and often “import arcpy.da” – something which could be mentioned once and then referred to in future recipes.

Moving away from the cookbook format, which is very much a matter of personal taste, the book covers a wide range of tasks that you can script using Python in ArcGIS, including interacting with the map window, laying out and printing maps, running geoprocessing operations and accessing the raw data in each layer. The ordering of the book may seem strange if you are interested in automating processing operations, rather than automating layout and map style operations, as the geoprocessing part of the book doesn’t start until chapter 6. However, when you get to this section it is very good, and chapters 6 to 9 provide almost all of the information needed to automate geoprocessing tasks, either using the built-in ArcGIS geoprocessing tools, or by writing your own geoprocessing algorithms using the raw spatial and attribute data.

I was impressed to see coverage of creating ArcGIS Add-ins using Python in chapter 11, as many ArcGIS experts are not aware of the possibility of creating add-ins using Python. These add-ins can provide very user-friendly ways to interact with processes which have been automated in Python, so it is good to see them covered in this book. I was also pleased to see a chapter on error handling which covers both generic Python error-handling techniques (such as the try-except block) and ArcPy-specific methods, including how to get error messages from ArcGIS into Python. However, there is a notable absence of any recipes involving raster data – the only exception is in chapter 10, where the use of the Describe() function to find out information about images is explained. ArcGIS 10.0 introduced various functions to operate on rasters – including returning the raster data as a numpy array, and allowing updates of raster data from numpy arrays – the coverage of which would really have made the book complete.
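
For anyone curious about what that raster-to-numpy functionality looks like, a minimal sketch is below (the file paths are placeholders, and the processing step is deliberately trivial):

import arcpy

# Read a raster into a numpy array (the paths here are placeholders)
arr = arcpy.RasterToNumPyArray("C:/data/input.tif")

# Do some numpy-based processing - here just a simple rescaling
arr = arr * 2.0

# Convert the array back to a raster and save it
out_raster = arcpy.NumPyArrayToRaster(arr)
out_raster.save("C:/data/output.tif")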

In terms of ancillary information in the book: there is a useful twenty-page introduction to the Python language, as well as useful appendices covering automation of Python scripts (using the Windows command line and Task Scheduler) and “Five Things Every GIS Programmer Should Know How to Do with Python” (including downloading via FTP, sending emails, and reading various types of files).

The book is generally well-written and well-formatted, and things are explained well – although the quality of the images used is sometimes lacking (with JPEG compression artefacts clearly visible). Overall, the book is a very useful addition to a GIS library for people who are new to automating ArcGIS using Python, and particularly those who want to find out quickly how to automate a particular operation.

John Snow’s Cholera data in more formats

In honour of the bicentenary of John Snow’s birth – and because I was asked to by someone via email – I have now released my digitisation of John Snow’s Cholera data in a couple of other formats: as KML files and as Google Fusion Tables.

To save you reading my previous blog posts on the subject, I’ll give a brief overview of my data. John Snow produced a famous map in 1854 showing the deaths caused by a cholera outbreak in Soho, London, and the locations of water pumps in the area. By doing this he found there was a significant clustering of the deaths around a certain pump – and removing the handle of the pump stopped the outbreak. This is a bit of a simplification (see Wikipedia or the John Snow Society for more details), but generally covers what happened.

Anyway, I digitised John Snow’s original data and georeferenced it to the Ordnance Survey co-ordinate system, so that I could overlay it on modern maps of that area, as below (using the OS OpenData StreetView data, containing Ordnance Survey data © Crown copyright and database right 2013):

Locations of deaths from Snow's analysis shown on a modern OS Map

while still being able to overlay it on John Snow’s original map:

Snow's original map with the vector data for pumps and deaths overlainAnyway, the data that is available is:

  • Cholera Death locations (Vector) with attribute data giving the number of deaths at each point
  • Pump locations (Vector)
  • John Snow’s original map georeferenced to the Ordnance Survey National Grid (Raster)
  • Current Ordnance Survey maps of the area (from those released under OS OpenData; contains Ordnance Survey data © Crown copyright and database right 2013) (Raster)

These are available for download/use in a number of formats:

  • A zip file with the Vector data as Shapefiles and the Raster data as TIFF images
    (this is the original data provided for download by me – and is probably what you want for importing into a GIS system)
  • A zip file with the Vector data as KML files and the Raster data as TIFF images
    (suitable for importing into Google Earth and other products that use KML files)
  • Links to Google Fusion Tables with the vector data already imported
    Cholera Deaths
    Deaths and Pumps together (this dataset has both pump and death points in the same table: pump points have Count values of -999, death points have Count values > 0 which give the number of Cholera deaths at that location)

The latter is particularly cool, I think, as it allows you to very easily overlay the data on modern Google Maps data, and should allow some interesting ‘mashups’ to be created. All of the tables are set to be shared publicly, so you should be able to copy them (using the Copy Table command in the File menu) and play around with them as much as you want! If you click the Example Map tab then you’ll see a very rudimentary map I’ve created using the data on top of Google Maps (see below) – I’m sure you’ll be able to do far better visualisations than that.

Example map from Google Fusion Tables

The folks at CartoDB have also used this data in one of their tutorials which shows you how to import the data to CartoDB and create a styled map to show the deaths with different sized markers – yet another way you can use the ‘first real GIS data’ in today’s modern web-based GIS tools.

So, enjoy – and please let me know (via the comments below) what you create!

Update: There was a problem with the KML files and Google Fusion Tables that I uploaded yesterday, caused by an incorrect co-ordinate transformation between the Ordnance Survey grid references and latitude/longitude. This has now been fixed and the downloads and tables have been updated. Sorry about this.

Resources for Sustainable Software and Reproducible Research in Remote Sensing presentation

I recently did a presentation entitled Sustainable Software and Reproducible Research in Remote Sensing at the Wavelength 2013 conference. The slides are available at SpeakerDeck, and shown below.

The rest of this page is a rather rough categorised list of various links and resources that may be useful to you after seeing the presentation (either in person at the conference, or online).

My blog posts

I’ve looked at some of these issues before on this blog. The most relevant posts are:

Papers referenced in the presentation

  • Saleska, Scott R., et al. “Amazon forests green-up during 2005 drought.”Science 318.5850 (2007): 612-612. PDF
  • Samanta, Arindam, et al. “Amazon forests did not green-up during the 2005 drought.” Geophysical Research Letters 37.5 (2010): L05401. PDF
  • Irish, Richard R., et al. “Characterization of the Landsat-7 ETM+ automated cloud-cover assessment (ACCA) algorithm.” Photogrammetric engineering and remote sensing 72.10 (2006): 1179. PDF
  • Tang, Jiakui, et al. “Aerosol optical thickness determination by exploiting the synergy of TERRA and AQUA MODIS.” Remote Sensing of Environment 94.3 (2005): 327-334. PDF
  • Aruliah, D. A., et al. “Best Practices for Scientific Computing.” arXiv preprint arXiv:1210.0530 (2012). PDF

Sustainable Coding

This is basically just good coding practices. Various resources are relevant:

  • Software Carpentry – look under Lessons for online lessons on various things (automated testing, version control, documentation, debugging, program structure etc) and Boot Camps to see if there is a training day coming up near you.
  • Code Complete (2nd Edition) – a big book with loads of information on good coding styles etc. Your library is likely to have a copy
  • Version Control by Example – a great book by Eric Sink (disclosure: I proof-read it for him and gave various bits of feedback, and therefore got a free copy) which covers the concepts of version control in general, and then shows how they apply to various real-world systems such as Git, Mercurial and Subversion.
  • Science Code Manifesto – a manifesto for how software and code should be treated in science.


  • ArcGIS has a number of tools and functions that make it easier to do reproducible research.

    • Model Builder is a great way to get started with automating your ArcGIS analyses without having to learn to program. It allows you to build an automatic workflow through drag-and-drop, and is really handy for automating things you do frequently, for documenting your processes and more. It can even export your models to Python code, so you can start bridging the gap between Model Builder and full Python programming. For more information look at the documentation, or Google for tutorials. (Similar tools exist in a number of other pieces of software, including Idrisi – but unfortunately not ENVI – although if you have the ArcGIS ENVI toolbox installed you can use many ENVI tools from ArcGIS).

      Example of ArcGIS Model Builder

    • My ArcGIS Provenance tool is not yet working well enough to be released, but until it is, you can look at your history manually (it’s hard to understand that way, but better than nothing). See my blog post about ArcGIS logs here XXX
  • ENVI
    • ENVI can be scripted in IDL, but this is generally harder than scripting ArcGIS in Python – details are on the ENVI website in the ENVI Programmers Guide. At the time of writing (March 2013), I would strongly suggest using the ENVI Classic IDL API, rather than the modern API for ENVI 5.0 onwards, as the latter is not really finished yet.
    • ENVI also has a log system (turn it on under File->XXX). The logs it produces are easier to read by hand than those created by ArcGIS, but harder for a computer to read…
  • If you are developing some sort of method/algorithm which works on satellite data to produce an output, publishing the results generated from a freely-available input image (eg. MODIS, Landsat etc) allows others to verify their implementations when they try to reimplement your method.
  • If you are collecting field spectra, please try and collect metadata along with it – fulfilling the requirements of: Hüni, Andreas, et al. “Metadata of spectral data collections.” 5th EARSeL Workshop on Imaging Spectroscopy, Bruges, Belgium. 2007. PDF

General Reproducible Research


All of the tools below are available to use today – so have a go, try them out, and see what happens. The only tool which isn’t available is something called Burrito. It is described in an article by its author who wrote it as part of his PhD. It was only a prototype and requires some rather unusual system settings (for example, Linux with an entirely different filesystem) – but it really shows the sort of things we should aim for in terms of reproducible research tools.

  • ProjectTemplate – the R package used for my example in the presentation. Also see my blog post.
  • Gloo – same idea as ProjectTemplate, but for Python
  • Repro – a build-tool-based approach for automating scientific computation
  • Sumatra – tracking and documentation tool for scientific computation
  • VCR – Verifiable Computational Research, a tool to allow you to store and verify computer-based research.

Useful sites

Lots of people online have come up with useful tutorials, guides and discussions about reproducible research – some of them say that they are about other fields, but lots of the ideas are easily transferable between fields.

General Sustainability

  • The Software Sustainability Institute do lots of work on software sustainability, but also cover data sustainability, and work with Software Carpentry
  • Software Carpentry have various online lessons, and also run boot camps, all of which teach good scientific software development practices.
  • The Open Data Handbook has a good section on open file formats.

Converting latitude/longitude co-ordinates to Landsat WRS-2 paths/rows

As part of some work I was doing for my PhD, I needed to automatically find what Landsat scene path and row would contain a pixel with a certain latitude/longitude co-ordinate (this was basically so I could automatically download the relevant Landsat image and do some processing on it).

There is an online converter (provided by the USGS) which allows you to input latitude/longitude and get back path/row, but it doesn’t have an API of any kind, and I couldn’t see an easy way to automatically submit requests to the form, as they’ve (very sensibly really) included methods to stop people doing that.

Summary: So, I wrote my own code to do the conversion. It’s available on Github and depends on GDAL/OGR and Shapely. See the README file on Github for instructions.

Very helpfully, the USGS have made shapefiles of the WRS-2 scene boundaries available to download, so all my code needs to do is load the shapefile, do some ‘Point in Polygon’ operations to find out which polygon(s) the point is in, and extract the ‘path’ and ‘row’ attributes from those polygons. This is fairly easy to do in Python using the wonderful OGR library to access shapefiles, and Shapely to do the ‘Point in Polygon’ operations. When writing this I used the Python Geospatial Development book as a good reference for how to combine GDAL and Shapely in this way.

The code is available at GitHub, with instructions on how to install it (basically, download the shapefiles from the USGS – I’m deliberately not re-distributing them here – and download the Python script), and it is very easy to use:

>>> from get_wrs import ConvertToWRS
>>> conv = ConvertToWRS()
>>> conv.get_wrs(50.14, -1.7)
[{'path': 202, 'row': 25}]
>>> conv.get_wrs(50.14, -1.43)
[{'path': 201, 'row': 25}, {'path': 202, 'row': 25}]


  • Instantiating the class may take a while, as it has to load in the shapefile
  • The get_wrs method may return multiple results, as Landsat scenes overlap at all four edges.

Please let me know (via a comment on this post) if you find this code useful.

For reference, a sketch of how the conversion works is shown below (the full, well-commented code is in the GitHub repository):
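
This is an illustrative reconstruction rather than the exact code from the repository – in particular, the shapefile filename and the PATH/ROW attribute field names are assumptions, so check them against the USGS data you download:

from osgeo import ogr
import shapely.wkt
from shapely.geometry import Point

class ConvertToWRS:
    def __init__(self, shapefile="wrs2_descending.shp"):
        # Load every scene outline from the WRS-2 shapefile into memory,
        # keeping its path/row attributes alongside the geometry
        self.polygons = []
        dataset = ogr.Open(shapefile)
        layer = dataset.GetLayer(0)
        for feature in layer:
            geom = shapely.wkt.loads(feature.GetGeometryRef().ExportToWkt())
            path = feature.GetField("PATH")
            row = feature.GetField("ROW")
            self.polygons.append((geom, path, row))

    def get_wrs(self, lat, lon):
        # Point-in-polygon test against every scene outline; a point can
        # fall inside more than one scene because neighbouring scenes overlap
        point = Point(lon, lat)
        return [{'path': path, 'row': row}
                for geom, path, row in self.polygons
                if geom.contains(point)]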

Can I atmospherically-correct my images with Py6S?

I’m a big fan of Matt Might’s blog, and thought I’d implement one of his tips for blogging as an academic – namely Reply to Public. I’ve had a number of emails from Py6S users asking me questions about how to atmospherically-correct entire satellite images with Py6S – so I thought I’d respond online, so that in the future people either find this response without needing to email me (ideally), or at the very least, I can point them to this blog post when they do email me.

Unfortunately, the simple answer to the question is: Py6S cannot (yet!) atmospherically correct satellite images – at least, not in a sensible manner – so I’d suggest using other atmospheric correction software such as ATCOR or FLAASH. The longer answer is below…

If you read the Py6S documentation you’ll find that there is an atmospheric correction option which you can use to take an at-sensor radiance (often called a Top of Atmosphere radiance when you’re dealing with satellite sensors) and atmospherically-correct it to get a corrected surface reflectance (or radiance) value. You could write some code to do that – it might look a bit like this:

from Py6S import *
# Create a SixS object called s
# (used as the standard name by convention)
s = SixS()

# Set the atmospheric conditions as usual
s.aero_profile = AeroProfile.PredefinedType(AeroProfile.Maritime)
s.aot550 = 0.05
s.atmos_profile = AtmosProfile.UserWaterAndOzone(2, 0.318)

# Set the wavelength (0.55um is used here as an example)
s.wavelength = Wavelength(0.55)

# Set the altitudes: target at sea level, sensor on a satellite
s.altitudes.set_target_sea_level()
s.altitudes.set_sensor_satellite_level()

# Set the geometry
s.geometry = Geometry.User()
s.geometry.solar_z = 35.8
s.geometry.solar_a = 149.4
s.geometry.view_z = 5.1
s.geometry.view_a = 106.8

# Turn on atmospheric correction mode and set it to do the
# correction assuming a Lambertian surface with a TOA
# radiance of 137.5 W/m^2
s.atmos_corr = AtmosCorr.AtmosCorrLambertianFromRadiance(137.5)

# Run the model
s.run()

# Print the result of the atmospheric correction
# (assuming Lambertian reflectance)
# This is the ground-reflectance for the given radiance,
# under the given atmospheric conditions
print s.outputs.atmos_corrected_reflectance_lambertian

This works fine, and so you might think that all you need to do to correct a satellite image is to take the code above, and run it for each pixel of the image using a for loop, as in the pseudo-code below:

# Load in the image (eg. using GDAL)

for pixel in image:
	reflectance = run_py6s(various, parameters, here)

	# Store reflectance in the output image

# Save output image

Unfortunately, that won’t work very well. Or rather, it will work – it’ll just take a long time! Let’s do some maths to work out roughly how long:

  • As a rough approximation, it takes Py6S around two seconds to atmospherically correct a single pixel in a single band (the exact time depends on how complicated the parameterisation is – for example, a full manual AERONET-based aerosol parameterisation will take longer than a standard aerosol model)
  • A full Landsat scene is roughly 5500 x 6000 pixels (170km x 185km), which is around 33 million pixels
  • 33 million pixels x 2 seconds per pixel = 66 million seconds, or 763 days!
  • If I started running this now – on one band of a Landsat scene – it’d be finished around April 2015. Even if I used a simple parameterisation that only took one second per pixel, it’d still take over a year!

Now, I suspect you’re thinking that Py6S is really awful because it takes such a long time. Well, unfortunately it’s not down to Py6S (if it was, I could try and improve it!). In fact, the Python bit of Py6S adds very little time – the vast majority of the time is taken by 6S itself.

So, maybe it’s 6S that is rubbish. Well, again, unfortunately not, or we could switch to something that is better. All other Radiative Transfer Models, such as MODTRAN, SCIATRAN etc, suffer from the same problem.

So, how on earth do people manage to correct satellite images? Obviously people have managed to do it – and there is commercial software available to do it. Well, they use something called a Lookup Table (LUT). This involves running the Radiative Transfer Model many times to produce a table of results, and then looking up the result you want in this table. Creating the table takes a long time (but nowhere near as long as 763 days!) and it often takes a lot of space to store the resulting table (for example, the Lookup Table used in ATCOR is a multi-Gigabyte file), but once you’ve got the table you can correct a pixel in a tiny fraction of a second – rather than one or two seconds – which means the correction of an image is a lot quicker.

A lookup table is a good example of two standard trade-offs in computing:

  • Setup time vs Run time: Often algorithms can either be designed to have a very short (or no) setup time, but then take a long time to run; or they can be designed to have a long setup phase, and then run very quickly from then onwards. Often the decision of which one to focus on depends on how often you’re going to run your process – if you want to run it multiple times (like an atmospheric correction algorithm) then it is normally better to have a long setup time (which you only have to do once!) and then a short run time.
  • Memory vs Time: Again, algorithms often trade speed against memory – a fast algorithm normally requires more memory, and vice-versa. In this case, a Lookup Table trades memory for speed – we have to store the Lookup Table in memory (or on disk), but once we have access to it, we can do corrections very fast.

I haven’t written any code to use Py6S to create a lookup table and then use that to correct an image (although I will almost certainly write some code to do this sometime during my PhD) – but it wouldn’t be too difficult to do. The procedure would be something like the following (a rough code sketch is given after the two lists below):

Setup Phase:

  1. Come up with a list of parameter values for all of the important parameters – choosing a number of parameters across their realistic range (there is a trade-off between accuracy and speed here, as always). For example maybe: AOT = [0.1, 0.2, 0.3, 0.4…]; solar_z = [10, 20, 30, 40, 50…]; radiance = [10, 20, 30, 40, 50, 60, 70] etc.
  2. Write some code to loop over all of these lists and run every possible combination of them
  3. Store the result of all of these runs of Py6S (ie. the atmospherically-corrected reflectance) and the parameters used to get the result in a big table of some sort.

Correction Phase:

Loop through every pixel in the image, and for each pixel:

  1. Get the parameters (AOT, angles, radiance etc) for that pixel
  2. Choose the nearest parameters that were used to create results in the table (for ALL of the parameters you used)
  3. Interpolate between these parameters as needed (eg. for an AOT of 0.15, get the results for an AOT of 0.1 and an AOT of 0.2 and interpolate) to get an interpolated ground reflectance.
  4. Store this reflectance in another image.
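
A very rough sketch of the idea is below. This is hugely simplified – only two parameters (AOT and radiance) are varied, the parameter values are arbitrary, and a real lookup table would cover many more dimensions – but it shows the shape of the setup and correction phases:

import numpy as np
from scipy.interpolate import RegularGridInterpolator
from Py6S import *

# Setup phase: run Py6S for every combination of the chosen parameter
# values and store the corrected reflectances in a table
aot_values = np.array([0.05, 0.1, 0.2, 0.3, 0.4])
radiance_values = np.array([20.0, 60.0, 100.0, 140.0, 180.0])

lut = np.zeros((len(aot_values), len(radiance_values)))

for i, aot in enumerate(aot_values):
    for j, radiance in enumerate(radiance_values):
        s = SixS()
        s.aot550 = aot
        s.wavelength = Wavelength(0.55)
        s.atmos_corr = AtmosCorr.AtmosCorrLambertianFromRadiance(radiance)
        s.run()
        lut[i, j] = s.outputs.atmos_corrected_reflectance_lambertian

# Correction phase: interpolate within the table instead of running 6S
# again - this takes a tiny fraction of a second per pixel
interpolator = RegularGridInterpolator((aot_values, radiance_values), lut)

# For example, a pixel with an AOT of 0.15 and a TOA radiance of 137.5
reflectance = interpolator([[0.15, 137.5]])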

This basic procedure has been used in almost all atmospheric correction tools for the last thirty years, and is described in various papers including Fraser et al. (1992) and Liang et al. (1997). Another way to speed-up the computation even more is by introducing parallel computation. If we can assume that each pixel’s atmospheric correction is entirely independent from every other pixel’s (which may not be the case in the real-world, but is generally assumed by most atmospheric correction algorithms) then we can split the pixels between a number of processors and thus correct many pixels in parallel – which gets an even greater speedup as we can do this both when generating the LUT, and when correcting the image.

Unfortunately, actually implementing this is a bit more complicated than I’ve explained here and various decisions have to be taken (including what ranges of parameters to use, how to interpolate, how best to store the multi-dimensional lookup table and more…) – but it is definitely on my ‘to do’ list as part of my PhD, and as part of Py6S.

So, in conclusion: Py6S can’t do atmospheric correction of satellite imagery in a sensible time at the moment, but should be able to within the next few years. In the meantime, I suggest using other atmospheric correction software (most of which, unfortunately, is commercial software). You may think that not being able to do this makes Py6S useless…but, as my next post will show, Py6S is still very useful for remote-sensing research.


Fallah-Adl, H., et al. (1996). Fast algorithms for removing atmospheric effects from satellite images. IEEE Computational Science & Engineering, 3(2), 66-77.

Fraser, R. S., Ferrare, R. A., Kaufman, Y. J., Markham, B. L., & Mattoo, S. (1992). Algorithm for atmospheric corrections of aircraft and satellite imagery. International Journal of Remote Sensing, 13(3), 541-557.

Liang, S., et al. (1997). An operational atmospheric correction algorithm for Landsat Thematic Mapper imagery over the land. Journal of Geophysical Research, 102(D14), 17173-17.