Category Archives: GIS
Beware: Latitude/Longitude co-ordinates in ENVI may not be in WGS-84
Summary: When you use the Pixel Locator or Cursor Location/Value tool in ENVI, the latitude and longitude co-ordinates given are based on the datum that the image is in, not necessarily WGS-84.
This may be obvious to some people, but it wasn’t to me – and I thought that if I got confused then some other people probably would too – hence this article. I often use the Pixel Locator dialog box in ENVI to find a specific location in the image by entering the X and Y pixel location (referred to in ENVI as Samples and Lines) or the map co-ordinates of whatever co-ordinate system the image is projected into (for example, the Ordnance Survey National Grid):
Alternatively, you can click the button with the arrows on it, and enter a location in latitude and longitude:
Very handily, all of these values are updated as you move around the image manually – so this dialog can also be used as an equivalent of the Pixel Location/Value window – as shown below.
This all sounds fine – but the problem is that the latitude/longitude values that are shown are calculated using the datum that is defined for the image. You can see this in the image below, where the latitude and longitude value is displayed, and the datum is listed above it:
This seems like a sensible thing to do, but it makes comparison with latitude and longitude co-ordinates from other sources very difficult – as nearly all other latitude and longitude co-ordinates are provided in the WGS-84 datum. Locations from GPS systems are always in WGS-84, and locations on web maps, in GIS systems and most other sources of latitude and longitude co-ordinate systems are very frequently in WGS-84.
So, this raises two questions:
What effect does this have?
Well, I haven’t yet done a full investigation of this (sometime I will sit down and write some code to do some proper testing), but when I found the problem it was causing offsets of around 100-200m – which can be quite significant in many applications.
What can we do about this?
There is a way to tell ENVI to use WGS-84 as the datum for calculating its latitude and longitude values, but it is a bit fiddly. Basically, when you’re in the Pixel Locator dialog before you switch to the latitude/longitude display, click the Change Proj… button and then select click the Datum button to select a new datum. Then switch to the latitude/longitude display, and you’ll find that the datum is listed as WGS-84, and the values will be correct in the WGS-84 datum:
Unfortunately, this means that the map co-ordinate values (for example, the Ordnance Survey Grid References) will now be wrong, and you’ll need to switch back to the original datum to get those to be correct. Also, I can’t find a way to get the Cursor Location/Value dialog to display WGS-84 latitudes and longitudes.
I actually found this while designing a lecture and practical session for MSc students at the University of Southampton on using the ENVI API in IDL. While writing a code example for the students on finding the pixel value at a specific lat/lon co-ordinate, I found that my results came up between ten and twenty pixels away from the place I thought they would (I was using 10m SPOT imagery), and I got different results from my IDL code and ENVI – even though my IDL code was using the ENVI API! Luckily I managed to find out what the problem was – and could therefore explain it to my students, and hopefully stop them running in to the same problem as me. It may be that I’m being silly here, and everyone naturally realises what ENVI does, and thinks it is the right thing to do – but it definitely confused me, so maybe it will helped you.
van Heuklon Ozone model implementation in Python
As part of my PhD I wanted to use a simple model which would give me an estimation of the atmospheric ozone amount given a location and time of year. A simple model to do this was created by van Heuklon in 1979, and was described in a delightfully simple paper (unfortunately not freely available online, but see the reference below) using a simple formula. The map below shows the modelled ozone amounts across the world for 15th April (click to enlarge):
The formula was built up from a baseline ozone amount (235 matm-cm) which then has various components added to it, some of which varied based upon latitude, longitude or time of year. The model isn’t intended to be accurate or high-resolution, but provides me with a rough estimate of the ozone amount at a given location at a given time. Accurate measurements of ozone amounts are available from satellites, but I would need to get data from many different times of year to be able to take into account the seasonal variation in ozone amounts, and the satellite data would take up a lot of space – hence using this simple model.
The code for this model is available on Github and has two implementations: get_ozone_conc and old_get_ozone_conc. The latter uses standard Python code with if statements to set the different values in the formula depending on the input latitudes and longitudes, which means it can only process one latitude/longitude combination at once. get_ozone_conc, on the other hand, uses somewhat more complicated numpy code which allows you to pass it arrays of latitudes and longitudes and it will return an array of results. This is what was used to produce the map above – the code for which is also in the repository.
This code was created for use as part of my PhD, but has been released into the public domain (as CC0) in case anyone else finds it useful.
Also, for those of you using Py6S (my Python interface to the 6S atmospheric Radiative Transfer Model), I’m sure you will be pleased to know that this model will shortly be integrated into Py6S to allow you to easily specify ozone amounts using it.
References:
van Heuklon, T. K. (1979). Estimating atmospheric ozone for solar radiation models. Solar Energy, 22(1), 63-68
A simple way to improve sustainability, reproducibility and releasability of your code and data
This is a first of a number of posts based upon discussions I had while at the Collaborations Workshop 2013 (#CollabW13 on twitter) in Oxford, UK. During one of the sessions I described a simple technique that I try and use to increase the sustainability, reproducibility and releasability of code that I write, data I collect and the results of my work – and people thought this idea was great, and that I should blog about it…
So, what is this wonderful technique:
On Friday afternoon (when you’re bored and can’t be bothered to do any hard work…) spend an hour or two cleaning up and documenting your work from that week
It’s a very simple idea, but it really does work – and it also gives you something to do during the last afternoon of the week when you’re feeling tired and can’t really be bothered. If you can’t commit to doing it every week, you can try every fortnight or month – and I even go as far as adding it as an event to my calendar, to try and stop people arranging meetings with me then!
So, what sort of things can you do during this time?
- Document your code: Depending on the project and the intended use of the documentation, this can be anything from adding some better comments to your code, to documenting individual functions/methods (for example, using docstrings in Python) or writing high-level documentation of the whole system.
- Refactor your code: Refactoring is a “disciplined technique for restructuring an existing body of code, altering its internal structure without changing its external behaviour” – that is, basically tidying up, cleaning up, and possibly redesigning you’re code. If you’re anything like me, the code you write when actually doing science isn’t very neat, nice or well-designed – because you’re focused on the science at that point. These few hours on a Friday are your time to focus on the code for a bit…
- Generalise your code to create a library: There are probably a number of things in your code that could be used in many other programs – things like reading certain file formats, performing statistical operations or applying algorithms to array data. This is a perfect time to take these algorithms and ‘decouple’ them from their immediate application in this code so that they can be easily used in other programs you may write in the future. Ideally, this generalised functionality can be packaged into a library and distributed for others to use. My Py6S library was created in this way: I took code that I was using to run the 6S model, generalised it into a library, documented it well, released it – and it has taken on a life of its own!
- Add README files to your folders: If you’re anything like me, you’ll have loads of folders containing data and code – some of which isn’t particularly well named and may not have metadata. One of the easiest (and most effective) ways to deal with this is to create simple README files in each folder explaining what is in the folder, where it came from, what format it’s in – basically anything that you think you’ll want to know about it in a year’s time if you come back to it. I can say from experience just how useful having these README files is!
The key benefit of all of these is that it makes it so much easier to come back to your research later on, and it also makes it so much easier for you to share your research, make it reproducible and allow others to build upon it – and the great thing is that it doesn’t even take that much work. Thirty seconds writing a few notes in a README file could easily save you a week of work in a year’s time, and extending your code into a library would allow you to re-use it in other projects without much extra work.
Another similar idea that was mentioned by someone else at the Collaborations Workshop was for Research Councils to force people to add extra time to the end of their grants to do this sort of thing – although personally, I think it is a far better idea to do this as you go along. Trying to add documentation to some code that you wrote two years ago is often quite challenging…
So, there it is – a simple way to use up some time at the end of the week (when you can’t really be bothered to do anything ‘new’) which will significantly improve the sustainability, reproducibility and releasability of your code and data. Try it out, and let us know how you do in the comments below!
Review: Programming ArcGIS 10.1 with Python Cookbook
Summary: A useful guide to automating ArcGIS using Python, which is fully up-to-date with the latest version of ArcGIS. Definitely provides “quick answers to common problems”, but it may take more effort to get a deep understanding of the methods used. Good breadth of coverage – but notably lacks raster examples – and well explained throughout. I have recommended to colleagues using ArcGIS who want to start off with Python.

Reference: Pimpler, E., 2013, Programming ArcGIS 10.1 with Python Cookbook, Packt Publishing, 508pp Amazon Link, Publishers Link (with sample chapter and Table of Contents)
I wrote this review in consultation with my wife, Olivia, who has been an ArcGIS user for a few years, but is fairly new to Python programming (she’s done a bit in the last few months, but nothing within ArcGIS). I am more experienced with Python – having released a number of packages to the Python Package Index, and having used Python with ArcGIS in various contexts with both ArcGIS 9.3 and 10.1, but I thought it would be useful to get an opinion from both a more- and less-experienced user.
We both found the book useful, and learnt a lot from it, but weren’t entirely convinced by the cookbook format. In general, if you like cookbook-style books then this is likely to appeal to you, but if they annoy you then you won’t like it. The benefits of the cookbook format are that it is very easy to pick up the book, find the relevant recipe for what you’re trying to achieve, turn to it and do it immediately, as each recipe is almost entirely self-contained. Of course, the disadvantage is learning in this way lends to lead to superficial knowledge rather than deep understanding: you may know what to do to achieve a certain result, but not why you’re doing it, and thus cannot adapt the recipes easily to other situations. Another disadvantage is that the recipes are often very repetitive – for example, each recipe in this book starts with the saying “Import the ArcPy module using import arcpy”, and often “import arcpy.da” – something which could be mentioned once and then referred to in future recipes.
Moving away from the cookbook format, which is very much a matter of personal taste, the book covers a wide range of tasks that you can scripts using Python in ArcGIS including interacting with the map window, laying out and printing maps, running geoprocessing operations and accessing the raw data in each layer. The ordering of the book may seem strange to you if you are interested in automating processing operations, rather than automating layout and map style operations, as the geoprocessing part of the book doesn’t start until chapter 6. However, when you get to this section it is very good, and chapters 6 to 9 provide almost all of the information needed to automate geoprocessing tools either using the built-in ArcGIS geoprocessing tools, or by writing your own geoprocessing algorithms using the raw spatial and attribute data.
I was impressed to see coverage of creating ArcGIS Addins using Python in chapter 11, as many ArcGIS experts are not aware of the possibility of creating addins using Python. These addins can provide very user-friendly ways to interact with processes which have been automated in Python, so it is good to see they were covered in this book. I was also pleased to see a chapter on error handling which covers both generic Python error-handling techniques (such as the try-except block) and ArcPy-specific methods, including how to get error messages from ArcGIS into Python. However, there is a noteable absence of any recipes involving raster data – the only exception is in chapter 10 when the use of the Describe() function to find out information about images is explained. However, ArcGIS 10.0 introduced various functions to operate on rasters – including returning the raster data as a numpy array, and allowing updates of raster data from numpy arrays – the coverage of which would really have made the book complete.
In terms of ancilliary information in the book: there is a useful twenty page introduction to the Python language, as well as useful appendices covering automation of Python scripts (using the Windows command-line, and Task Scheduler) and “Five Things Every GIS Programmer Should Know How to Do with Python” (including downloading via FTP, sending emails, and reading various types of files).
The book is generally well-written and formated, and things are explained well – although the quality of the images used is lacking sometimes (with JPEG compression artefacts clearly visible). Overall, the book is a very useful addition to a GIS library for people who are new to automating ArcGIS using Python, and particularly those who want to find out quickly how to automate a particular operation.
John Snow’s Cholera data in more formats
In honour of the bicentenary of John Snow’s birth – and because I was asked to by someone via email – I have now released my digitisation of John Snow’s Cholera data in a few other formats: KML and as Google Fusion Tables.
To save you reading my previous blog posts on the subject, I’ll give a brief overview of my data. John Snow produced a famous map in 1854 showing the deaths caused by a cholera outbreak in Soho, London, and the locations of water pumps in the area. By doing this he found there was a significant clustering of the deaths around a certain pump – and removing the handle of the pump stopped the outbreak. This is a bit of a simplification (see Wikipedia or the John Snow Society for more details), but generally covers what happened.
Anyway, I digitised John Snow’s original data and georeferenced it to the Ordnance Survey co-ordinate system, so that I could overlay it on modern maps of that area, as below (using the OS OpenData StreetView data, containing Ordnance Survey data © Crown copyright and database right 2013):
while still being able to overlay it on John Snow’s original map:
Anyway, the data that is available is:
- Cholera Death locations (Vector) with attribute data giving the number of deaths at each point
- Pump locations (Vector)
- John Snow’s original map georeferenced to the Ordnance Survey National Grid (Raster)
- Current Ordnance Survey maps of the area (from those released under OS OpenData; Contains Ordnance Survey data © Crown copyright and database right 2013; Raster)
These are available for download/use in a number of formats:
- A zip file with the Vector data as Shapefiles and the Raster data as TIFF images
(this is the original data provided for download by me – and is probably what you want for importing into a GIS system)
Download - A zip file with the Vector data as KML files and the Raster data as TIFF images
(suitable for importing into Google Earth and other products that use KML files)
Download - Links to Google Fusion Tables with the vector data already imported
Cholera Deaths
Pumps
Deaths and Pumps together (this dataset has both pump and death points in the same table: pump points have Count values of -999, death points have Count values > 0 which give the number of Cholera deaths at that location)
The latter is particularly cool, I think as it allows you to very easily overlay the data on modern Google Maps data, and should allow some interesting ‘mashups’ to be created. All of the tables are set to be shared publically, so you could be able to copy them (using the Copy Table command in the File menu) and play around with them as much as you want! If you click the Example Map tab then you’ll see a very rudimentary map I’ve created using the data on top of Google Maps (see below) – I’m sure you’ll be able to do far better visualisations than that.
The folks at CartoDB have also used this data in one of their tutorials which shows you how to import the data to CartoDB and create a styled map to show the deaths with different sized markers – yet another way you can use the ‘first real GIS data’ in today’s modern web-based GIS tools.
So, enjoy – and please let me know (via the comments below) what you create!
Update: There was a problem with the KML files and Google Fusion Tables that I uploaded yesterday, caused by an incorrect co-ordinate transformation between the Ordnance Survey grid references and latitude/longitude. This has now been fixed and the downloads and tables have been updated. Sorry about this.
Converting latitude/longitude co-ordinates to Landsat WRS-2 paths/rows
As part of some work I was doing for my PhD, I needed to automatically find what Landsat scene path and row would contain a pixel with a certain latitude/longitude co-ordinate (this was basically so I could automatically download the relevant Landsat image and do some processing on it).
There is an online converter (provided by the USGS) which allows you to input latitude/longitude and get back path/row, but it doesn’t have an API of any kind, and I couldn’t see an easy to way to automatically submit requests to the form, as they’ve (very sensibly really) included methods to stop people doing that.
Summary: So, I wrote my own code to do the conversion. It’s available on Github and depends on GDAL/OGR and Shapely. See the README file on Github for instructions.
Very helpfully, the USGS have made shapefiles of the WRS-2 scene boundaries available to download, so all my code needs to do is load the shapefile, do some ‘Point in Polygon’ operations to find out which polygon(s) the point is in, and extract the ‘path’ and ‘row’ attributes from those polygons. This is fairly easy to do in Python using the wonderful OGR library to access shapefiles, and Shapely to do the ‘Point in Polygon’ operations. When writing this I used the Python Geospatial Development book as a good reference for how to combine GDAL and Shapely in this way.
The code is available at GitHub, with instructions on how to install it (basically, download the shapefiles from the USGS – I’m deliberately not re-distributing them here – and download the Python script), and it is very easy to use:
>>> import get_wrs
>>> conv = ConvertToWRS()
>>> conv.get_wrs(50.14, -1.7)
[{'path': 202, 'row': 25}]
>>> conv.get_wrs(50.14, -1.43)
[{'path': 201, 'row': 25}, {'path': 202, 'row': 25}]
Beware:
- Instantiating the class may take a while, as it has to load in the shapefile
- The get_wrs method may return multiple results, as Landsat scenes overlap at all four edges.
Please let me know (via a comment on this post) if you find this code useful.
For reference, the code is shown below, and is well-commented to show how it works:
Major update to Free GIS Data site
The New Year is a time of new beginnings – and so it is rather appropriate to launch the complete redesign of my Free GIS Data list today.

As you can see from the screenshot above, it looks far nicer than before – but it is also far easier to navigate. The dropdown menus at the top allow easy access to all of the categories (and also, incidentally, make it easier for me to add new categories as needed). I’ve also added quite a few new datasets, and tidied up some of the descriptions and categorisations of links that were already there.
Doing the re-design was around an evening’s work – mainly because I used the wonderful Bootstrap framework for the page. There were a few niggles I had to sort out (particularly regarding the header bar sticking at the top when you scroll down, and what that does for within-page anchors) but generally it was a fairly painless experience.
So, I hope the list is useful to you during 2013 and beyond!
Validating the validation?
So, I’ve been pondering an interesting scientific dilemma recently: how do you validate a validation technique? That is, if you’re using a certain procedure to validate some data (that is, check how correct/accurate it is), how can you validate the validation procedure itself?
This has come up in my work recently in relation to validating Aerosol Optical Thickness (AOT) data from satellites. Validation is normally performed by comparing satellite-derived data to data from the Aerosol Robotic Network (AERONET). However, this is challenging for a number of reasons: the main one being that AERONET measurements are point measurements of the AOT above the measurement site, whereas the satellite will measure AOT averaged over a large pixel area (10 x 10km for MODIS). A number of validation methods have been proposed to deal with this, and they are all based on spatial and temporal averaging of the data, to bring it into a comparable state.

Example MODIS AOT data for Europe
The big question then becomes:
If there are a number of different methods for validating this data, then how do we choose which one is best?
This is a big problem because the results of the validation will be used to infer the accuracy and uncertainty of the dataset – and if we use a validation that gives misleading results then we will have a misleading opinion of the dataset. This gets even more difficult when you need to choose parameter values for use in these validation procedures. For example, if the validation involves temporal or spatial averaging then what period/distance should the averaging be done over? 30 minutes? 1 hour? 5 hours?
So, how should we deal with this?
Unfortunately, I don’t know. I know that I have a problem with some of the current methods (for example, trying a wide range of parameter values and choosing the ones that give you the best validation results – that is, the lowest error) as some of them really do seem to be ‘cheating the system’. This is something that I’m hoping to come back to over time – I already have some preliminary ideas on how to ‘fix’ the AOT validation system – but I think it is a problem which won’t go away.
Updated Snow GIS data
A while back I released a GIS dataset containing Snow’s Cholera analysis data in modern GIS formats, and georeferenced to the British National Grid (see my previous post). Unfortunately, there was an error in some of the attributes of the Cholera Deaths shapefile which caused issues when using the data.
This error has now been fixed, and the updated data are available to download here: SnowGIS_v2.zip. The link on the previous post has also been updated.
Please accept my apologies for this error.
I signed the Science Code Manifesto – and you should too!
I’ve just signed the Science Code Manifesto because I firmly believe in what it says. Ok well, that probably doesn’t tell you much – generally I tend to believe in things that I sign – but I’d like to tell you why I signed it, and why I think it’s really important.
A lot of my PhD life is spent writing code (a lot of my life outside of my PhD is also spent writing code, but that’s another story). When I tell people this quite a few of them are surprised that I’m not doing a computer science PhD – because surely they’re the only ones who spend their time writing code? Well…no! A lot of scientists spend a lot of time writing code for research in almost every subject.
Why do they do that? Well, nearly every research project involves at least one of the following activities:
- Data processing
- Plotting graphs
- Calculating statistics
- Running models
- Building new models, simulations and so on
All of these activities can easily be done through code, and in fact it’s often far more efficient to do them through code than by other methods. However, mistakes can be made in code, and people will often want to check the results of other people’s papers (that is, to ensure reproducibility – a key factor in science) – but to do that they need the code. That is what the first tenet of the Science Code Manifesto says: “All source code written specifically to process data for a published paper must be available to the reviewers and readers of the paper”. That means that as a reader (or reviewer) I can read the code (to check it looks like it does what it’s meant to do), and run the code (to check it actually does what its meant to do). It also means that if I have the code to do the processing, plus the input data, I can generate the output data that they got, and check it against the results in the paper. I was reading a paper today which examined aerosol optical depth variations across Europe. They had really high resolution data, and I’d love to have seen a map of the distribution across the UK in detail, but it wasn’t included in the paper (they had a lower-resolution map of the whole of Europe instead). If I’d had access to the code (and the data) then I could have generated the data myself, plotted a map over the whole of Europe (to check that it looked the same as their published map) and then zoomed in on the UK to examine it in more detail.
Scientific papers are designed to be built upon. As Newton said, “If I have seen further it is only by standing on the shoulders of giants” – as scientists we all need to stand on the shoulders of those who came before us (giants or not). If you have the code that other scientists have used to produce the results in their paper, it is likely that you might want to modify it (for example, to fix some errors you’ve found), extend it (to make it applicable to your particular study area), and share it or its modifications with your colleagues. You won’t be able to do this unless you know what license the code was originally released under – hence the second tenet of “The copyright ownership and license of any released source code must be clearly stated”.
The next two tenets are very important as they place scientific code at the same level as scientific papers, books and other more ‘traditional’ academic outputs. They state that Researchers who use or adapt science source code in their research must credit the code’s creators in resulting publications and Software contributions must be included in systems of scientific assessment, credit, and recognition. This is important because if we believe that scientific code is important (which I, and the 846 people who have signed the manifesto so far believe) then we need to recognise it. This means two things: firstly citing it, so that we give the proper attribution to the authors, and let people see how it is being used; and secondly giving credit for writing code when we assess how good researchers are. This is something that varies significantly by department and research area – but it is something which I think should be standard across all fields. If you write a good piece of scientific software (not a 10 line Python script in a random file somewhere, but something which is properly released, useful, documented and sustainable) then you should be given credit for it, just as if you had written a paper or a journal article! As a number of people have commented before: a scientific paper which describes a new algorithm is not the scientific work itself – it is just an advert for the work. The real scientific work, and scientific product, is the code that implements the algorithm.
Finally, the manifesto touches on the subject of software sustainability – something that I will (hopefully) be doing a lot more work on in the near future. This refers to the practice of sustaining software so that it can continue to be used (and, ideally, continue to be improved) in the future. Software is a funny thing – it is susceptible to rotting away, just like organic material. This is known as software decay and is primarily caused by the rapid progress made in technology: it may be that the ‘latest, greatest’ technology that you used to write your software in 2012 can’t be run in 2020, or 2025, but the job the software does may still be very important. I think (hope) that all of my code will be able to run for the foreseeable future as I’ve written it in fairly standard programming languages (such as Python and R), but this may not be the case – for example, libraries can easily break as standards evolve, and if the author is no longer maintaining their libraries then they may not get fixed. This can be a big issue, and leads on to the other part of sustaining software: that of generating a community around the software, which will help sustain it in the years to come. The manifesto is actually fairly limited in what it says: Source code must remain available, linked to related materials, for the useful lifetime of the publication, but I feel that it a lot of the other things I’ve raised in this paragraph are also relevant.
So, there we go. That’s why I signed the manifesto – now have a think about it, and if you agree go and sign it too!



