Category Archives: GIS
1. Reproducibility is key to science – if a result can’t be reproduced then it can’t be verified (that is, if the experiment can’t be run again to check whether it gives the same result, then no-one can verify your work, and no-one can falsify it if it is wrong), and therefore (according to philosophers of science such as Popper) it isn’t really science.
2. It’s actually really useful for me as a researcher. Have you ever come back to a project six months after stopping work on it (possibly because you’d submitted a paper on it, and had to wait ages for the reviewers’ comments) and found it almost impossible to work out how to produce a certain graph or table, or which data was used to produce a certain result? Making your work reproducible by other scientists also means it will be reproducible by you when you’ve forgotten all about how it worked!
Basically, reproducibility in scientific research these days means code. You could write a long Word document describing exactly how you processed all of your data (good luck keeping it up to date) and then work through all of those steps again by hand, but in most of my work I use code in a variety of languages (Python, R and IDL mostly) to do the processing for me.
The beauty of this (aside from not spending ages clicking around in frustrating dialog boxes) is that doing your research through code gets it a long way towards being reproducible without any extra work on your part. You created your original outputs through code, so you can reproduce them just by running the code again! Simple, isn’t it?
Well, unfortunately, it’s not quite that simple. Do you know exactly which bit of data you used with that code? Did you pre-process the data before using it in your code? Did you do some processing on the data that the code produced before putting it into a table or graph in your paper? Will you remember these things in six months or six years if you need to reproduce that bit of work yourself (or, more scarily, if someone emails you to say that they think your results were wrong…)? Unfortunately, I think that’s unlikely.
Anyway, to get to the point of this post: I have recently been using an R package called ProjectTemplate which has really helped me make my research properly reproducible. This package generates a standard folder structure for your R code, and provides a range of useful functionality for automatically loading data and caching the results of computations. I’ve been using this for a report that I wrote recently for my PhD supervisors (something which may turn into a paper sometime – hopefully), and it’s been great.
I’m not going to give a full overview of all of the functionality of ProjectTemplate here, but I’ll show you a bit about how I use it. Firstly, here is my folder structure for this project:
As you can see there are quite a few folders here with code in:
- data: scripts for loading in data (or data files themselves, which will be loaded automatically if they are of certain file types)
- lib: functions that will be used throughout the project
- munge: scripts to manipulate or ‘munge’ the data before use in the rest of the project (i.e. pre-processing)
- src: scripts used to actually do the processing and analysis of the data
There are also quite a few folders that I use that I haven’t shown expanded above:
- cache: stores cached R objects (eg. results from time-consuming processing steps)
- graph: stores graphical outputs from the analysis
So, what we’ve got here is a place to put re-usable functions (basically functions that could eventually go into a separate R package – eg. for reading specific data file formats), a place to put pre-processing code, and a place to put the actual scientific analysis code. You’ll see there are loads of other folders that I haven’t mentioned because I don’t really use them, but I suspect I will in new projects in the future.
The beauty of this folder structure is that the folder containing everything above can simply be zipped up, given to someone else, and they can run it themselves. How do they do that? Simple: change the working directory to the folder and run ProjectTemplate’s load.project() function.
That will load all of the libraries needed, load the data (from files or the cache), pre-process it (or load the results from the cache) and get everything to the stage where you can run any of the files in src. Great!
The brilliant thing is that each of my scripts in the src folder will produce one or two outputs for my report. All of my graphs are saved as PDFs into the graphs folder, ready to be included directly into a LaTeX document, and my tables are produced as R data frames and then converted to LaTeX tables using the xtable package.
So, what’s the practical upshot of this? Well, if I come back to this in six months I can re-run any analysis from my report by typing a couple of lines of code and then running the relevant file. It also meant that when I realised half-way through writing up that I had accidentally produced all of my results (about 5 graphs and 4 big tables) from some incorrect input data (basically I had data for half of Europe rather than just the UK, which makes a big difference to the ranges of atmospheric data!), it took me about 30 minutes to regenerate everything: I simply changed a line of code where the data was imported, ran the pre-processing again (which took about 20 of those 30 minutes!) and then ran each file to produce the required graph PDFs and xtable LaTeX code.
Hopefully this will have made sense to you all – stay tuned for some more reproducible research posts in the near future.
While processing some data at work today I had an issue where I had a raster dataset in ArcGIS, where all cells with invalid data had been set to 9999. Of course, this caused a lot of issues for the statistics on the dataset – basically they were all nonsense – so I needed to fix it. I couldn’t seem to get reclass to work properly with floating point data, so I posted a question to the GIS StackExchange site. Within a few minutes I had a couple of answers: one suggesting a possible way using the interface, and one suggesting a method using the ArcPy API.
However, I managed to find a way to do this easily using the approach recommended for ArcPy in the response from Aragon (thanks!), but using the GUI interface. To do this, follow the instructions below:
1. Run the Spatial Analyst -> Conditional -> Set Null tool
2. Set the input conditional raster to be the raster dataset which you want to change
3. Set the expression to VALUE = 9999 where 9999 is the value that you want to replace with NoData.
4. Set the Input false raster or constant value to the same raster dataset that you select in step 2.
5. Choose a sensible output raster. The dialog should look roughly like this (click for more detail):
6. Click OK
It’s done! Your new raster should have all of the values of 9999 replaced with NoData, and your statistics should work fine.
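For anyone who prefers to script this rather than click through the dialog, the same operation can be done with the Set Null tool in ArcPy. This is just a rough sketch (the file paths are placeholders, and you’ll need a Spatial Analyst licence):

```python
# Rough ArcPy equivalent of the GUI steps above; paths are placeholders.
import arcpy
from arcpy.sa import SetNull

arcpy.CheckOutExtension("Spatial")  # Set Null needs the Spatial Analyst extension

in_raster = r"C:\data\my_raster.tif"          # raster containing 9999 'invalid' cells
out_raster = r"C:\data\my_raster_clean.tif"   # output with those cells set to NoData

# Where the condition is true (VALUE = 9999) the output is NoData;
# everywhere else the value from the 'false' raster (the original) is kept.
result = SetNull(in_raster, in_raster, "VALUE = 9999")
result.save(out_raster)
```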
In 1854 there was a massive cholera outbreak in Soho, London – in three days over 120 people died from the disease. Famously, John Snow plotted the locations of the deaths on a map and found they clustered around a pump in Broad Street – he suggested that the pump be taken out of service – thus helping to end the epidemic. This then helped him formulate his theory of the spread of cholera by dirty water.
This analysis is famous as it is often considered to be:
- The first epidemiological analysis of disease – trying to understand the spread of cases by factors in the environment
- The first geographical analysis of disease data – plotting points on a map and looking for relationships
So, I decided to recreate Snow’s data as a modern GIS dataset – it’s available to download as SnowGIS.zip. All of the data is provided in TIFF (with TFW) or SHP formats, ready for loading into ArcGIS, QGIS, R, or anything else. There is a README in the zip file, but read on below for more details of what’s included.
To create the data I took a copy of Snow’s original map, georeferenced it to the Ordnance Survey National Grid, warped it to fit correctly, and then digitised the locations of the deaths and the pumps. This allows the data that Snow collected to be overlaid on a modern OS map (click for a larger copy):
The pumps are shown in blue, and the size of the red circles indicates the number of deaths at that location. Of course, the data can be overlaid on the original map created by Snow (so you can check I digitised it properly!):
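If you want to knock up a quick version of that map yourself, something like the following geopandas snippet should work. Note that this is just an illustrative sketch: the shapefile and column names here are guesses, so check the README in the zip for the real ones.

```python
# Illustrative only: file and column names are guesses - see the README in SnowGIS.zip
import geopandas as gpd
import matplotlib.pyplot as plt

deaths = gpd.read_file("SnowGIS/Cholera_Deaths.shp")   # assumed file name
pumps = gpd.read_file("SnowGIS/Pumps.shp")             # assumed file name

# Scale the red circles by the number of deaths at each location (assumed 'Count' column)
ax = deaths.plot(markersize=deaths["Count"] * 5, color="red", alpha=0.6)
pumps.plot(ax=ax, color="blue", marker="^")
plt.show()
```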
Here are a few ideas for things you could do with the data:
- How about performing some sort of statistical cluster analysis on the deaths data? Does it identify the correct pump as the source?
- What if the data were only provided in aggregated form? Lots of healthcare data is provided in that way today because of privacy concerns – but if the data had been provided aggregated to (for example) census output areas or a standard grid, would the right pump have been identified?
When you try and reproject a dataset in ArcGIS (for example, by using the Project Raster tool) you will see a dialog a bit like the one below:
The highlighted field wants you to specify a Geographic Transformation. Although it says that it is optional, it often isn’t (I think whether it is optional depends on the type of transformation you’re trying to do). I’ve often found that there are a number of items available in the dropdown box and I have absolutely no idea which one to choose!
For example, when converting from OSGB to WGS84 there are the following options:
How on earth should you choose one of these? Until now I had been choosing semi-randomly – picking one, seeing if the result looked good, and if not trying another. However, the other day I found a list of these transformations, available to download from the ArcGIS documentation pages. This document (there is a version for each release of ArcGIS) lists all of the transformations available in ArcGIS together with their theoretical accuracies. So, for example, we can find out that OSGB_1936_To_WGS_1984_2 is meant to cover England only, and OSGB_1936_To_WGS_1984_4 is for Scotland. The accuracies seem to be around 20m for each transformation, although OSGB_1936_To_WGS_1984_5 (for Wales) has an accuracy of 35m.
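Incidentally, the same choice crops up if you do the reprojection from a script rather than the GUI. Here’s a hedged ArcPy sketch – the paths are placeholders, and the transformation name is the England-only one from the PDF:

```python
# Hedged sketch of reprojecting a raster from OSGB to WGS84 with an explicit
# geographic transformation; file paths are placeholders.
import arcpy

arcpy.ProjectRaster_management(
    r"C:\data\osgb_raster.tif",       # input raster (British National Grid)
    r"C:\data\wgs84_raster.tif",      # output raster
    arcpy.SpatialReference(4326),     # target coordinate system: WGS 1984
    "NEAREST",                        # resampling type
    "",                               # cell size (leave as default)
    "OSGB_1936_To_WGS_1984_2",        # geographic transformation (England only)
)
```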
I can’t believe I’d never come across this resource before – it allows me to actually make intelligent decisions about which transformation to use. I’d strongly suggest you get familiar with this document.
(I’d like to thank the GIS.SE users who helped me with a question I asked related to this problem)
I often find myself using Julian days as a simple method to represent dates in my code. It’s nice and easy, because every day is simply an integer (the number of days since the beginning of the year) and any time during the day can be represented as a floating point number (the fraction of that day which has elapsed at that point). Furthermore, lots of satellite imagery is provided with the dates specified as Julian days – for example, MODIS data filenames include the year and then the Julian day.
It’s fairly easy to convert a standard date (eg. 24th March) to the Julian day in most programming languages (there will either be a library function to do it, or it’s fairly simple to write one yourself) – but it’s not that easy to do in your head. So, I have created a Julian Day calendar poster:
It’s a very simple table that lets you choose the month and day, and get the Julian Day number. Just make sure you remember the note at the top – add one to the number (after February) if it’s a leap year!
It’s available for download below:
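And if you’d rather do the conversion in code than look it up on the poster, here’s a small Python illustration (nothing to do with the poster itself):

```python
# Converting between calendar dates and 'Julian day' (day-of-year) numbers
from datetime import datetime

d = datetime(2012, 3, 24)
print(d.timetuple().tm_yday)   # 84 (2012 is a leap year; it would be 83 otherwise)

# MODIS-style filenames encode the date as YYYYDDD, which %j can parse directly
d2 = datetime.strptime("2012084", "%Y%j")
print(d2.date())               # 2012-03-24
```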
If you are creating maps then for goodness sake use sensible colours!
I was helping some undergraduates with some work the other day, and they decided to use the following colour scheme for representing river depth:
- Deep water: Red
- Medium-depth water: Bright green
- Shallow water: Pink
A far more sensible colour scheme would have been something like:
- Deep water: Dark blue
- Medium-depth water: Medium-blue
- Shallow water: Light blue
Isn’t it hard work to come up with nice colour schemes for all of your maps? Nope, not at all – ColorBrewer has done it already! If you haven’t used this website before I urge you to do so: it provides a number of carefully-chosen colour schemes designed for various different purposes. For representing river depth you’d probably want to use one of the blue Sequential schemes, but there are also Diverging schemes for data that goes off in two directions, as well as schemes for representing Qualitative data (categories that have no explicit ordering). What’s more, you can tell it to only show schemes that are colour-blind-friendly, photocopier-safe etc., and it’ll produce a preview for you with various map styles (labels, cities, coastlines etc.). All in all it’s very impressive, and very useful.
Plugins and extensions are available for a number of pieces of software to allow ColorBrewer colours to be easily used. These include an ArcGIS plugin (see the bottom answer for how to install with ArcGIS 10), R package, Python module and IDL routines.
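As a quick illustration of how little effort this takes in Python: matplotlib ships many of the ColorBrewer palettes as named colormaps, so a sensible sequential scheme for something like river depth is one argument away (the data below is made up, obviously):

```python
# Using the ColorBrewer 'Blues' sequential scheme via matplotlib's named colormaps
import numpy as np
import matplotlib.pyplot as plt

depth = np.random.rand(50, 50) * 10            # fake river depth grid, in metres
plt.imshow(depth, cmap="Blues")                 # ColorBrewer sequential 'Blues'
plt.colorbar(label="Depth (m)")
plt.title("River depth (synthetic data)")
plt.show()
```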
As part of the OS Open Data initiative the Ordnance Survey has released a free version of their 1:50,000 scale gazetteer. This lists all of the names shown on the 1:50,000 scale OS maps, linked to information such as their location (in both Ordnance Survey grid references and WGS84 latitude/longitude pairs) and type (city, town, water feature etc). I’ve had a little play with this data to do some statistics on it and see if I can find out anything interesting. Hopefully you’ll agree that at least some of the stuff below is interesting.
My main statistical analysis was to count the frequency of each name in the gazetteer, and then extract the ten most commonly-used names for each of the feature types listed in the database.
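For the curious, the analysis itself is nothing clever – something along these lines, although the file and column names here are invented rather than the gazetteer’s real schema:

```python
# Sketch of the frequency analysis; assumes the gazetteer has been exported to a
# CSV with 'name' and 'feature_type' columns (illustrative names, not the real schema)
import csv
from collections import Counter, defaultdict

counts = defaultdict(Counter)
with open("gazetteer.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        counts[row["feature_type"]][row["name"]] += 1

# Print the ten most commonly-used names for each feature type
for feature_type, names in sorted(counts.items()):
    print(feature_type, names.most_common(10))
```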
No cities in the UK share a name with another city.
There are 1,245 towns in the dataset and the results here are slightly more interesting – the winner is Seaton, of which there are three (in Cornwall, Devon and Yorkshire). This isn’t particularly surprising, as Seaton was probably derived from Sea Town, and given that we are an island nation there is a lot of sea for towns to be located by! A number of names are well-known for belonging to two towns in the UK, such as Newport, St Ives and Ashford, but there are 12 such shared names in total, listed below:
- St Ives
This is where it starts to get more interesting, but also more confusing. The feature type of Other Settlements seems to include anything that’s not a town or a city and there are over 34,000 of them. Nearly 31,000 of these are unique (a fairly impressive feat), but a number of them seem to be more popular than the rest – the top ten are listed below and include three variations on calling a place a new town (New Town, Newtown and Newton) as well as a number of other names relating the settlements to their location (West End, North End, East End) or what is there (Church End).
I was intrigued by the differences between the names of New Town, Newtown and Newton and wondered if there may be differences across the country in which name was used. Also, I wondered whether any areas of the country had significantly more of these new settlements than others. The map below shows the distribution of the three names across the country (click to see a larger version):
A number of patterns are noticeable:
- There seem to be a lot of these settlements around the Welsh border (the area known as the Welsh Marches). I’d suspect this is because of the frequently changing location of the border in this area, leading both sides to rename a town as ‘new’ once they’d captured (or re-captured) it.
- There is a distinct lack of these settlements in central Wales – fairly obviously, as most names there are in Welsh
- Scotland has far more Newtons than any other version of the name, and these are concentrated on the eastern side of the country.
- New Town seems to be far more common in southern England, although there is a concentration of Newtons in Cornwall.
I’m not sure what exactly one can learn from the above, but it’s interesting to me at least. If anyone has any ideas which explain the distribution of these names better then leave a comment.
For a long time I have been searching for a simple, easy-to-use, comprehensive list of freely available GIS datasets that I can use in my academic work – or for any other non-commercial purposes (eg. teaching, ‘just for fun’ applications, etc). All of the lists that I have found have been out-of-date, riddled with adverts, or specific to one field or country, so…I decided to make my own:
The screenshot above just shows the first few links – there are many more, split into categories by field (for example, natural disasters, population, ecology and administrative boundaries) and by geographical area (for example, global, UK, Puerto Rico). The site is updated regularly, and I frequently run a link-checking tool to ensure that all of the links still work. It is intentionally designed as one simple page – my thinking being that it is far simpler to search for a dataset using Ctrl-F (the in-browser search facility) than to navigate around a slow database-backed site.
The web address for the site is:
Have a look and let me know what you think. I hope this will become a useful resource for the GIS community.
Just a quick post this time, as I’m currently enjoying a nice holiday (well, holiday combined with work) in France. I had to post this because I’ve just realised that one of my biggest gripes with ArcGIS has been fixed in version 10! Hooray!
I suspect a lot of other people have been frustrated by this too: if you want to take an ArcGIS map document and use it on another computer it is (or at least, was) very difficult. The .mxd file only contains references to the actual data for the map, so you have to find where all of the data is stored and take that with you too – and what’s more, the references are often stored as relative path names, so even moving it to a different folder on the same computer is a pain. In fact, I seem to remember being taught in an undergraduate ArcGIS course never to move an ArcGIS .mxd file once I had created it!
Anyway, ArcGIS 10 appears to have a new function called Map Packages. This allows you to package a .mxd file with all of the data that it uses into one (quite big, probably) .mpk file, which is then completely portable. Sounds great!
I haven’t been able to test it yet (I haven’t got ArcGIS on the laptop with me in France), but it sounds like it’ll be very useful.
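For anyone wanting to script it, arcpy also appears to expose the same functionality as a Package Map geoprocessing tool – again, I haven’t tried this, so treat it as a hedged sketch and check the tool name and parameters for your ArcGIS version:

```python
# Untested sketch: package a map document and its data into a single .mpk file.
# Paths are placeholders; check the Package Map tool's parameters in your ArcGIS version.
import arcpy

arcpy.PackageMap_management(r"C:\maps\my_map.mxd", r"C:\maps\my_map.mpk")
```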
For more information, see this ArcGIS blog post.
Summary: Great book – both for GIS concepts and for teaching Python libraries. Lives up to the boast on the front cover – you really will learn to create complete mapping applications, learning a lot of useful tools and techniques on the way.
Reference: Westra, E., 2010, Python Geospatial Development, Packt Publishing, Birmingham, UK, 508 pages (Amazon link)
Before I start this review I should probably point out that I am a PhD student working in Remote Sensing and Geographical Information Systems, so I expected to know a fair amount of the theory in this book. The reason I wanted to read it though, was to learn how to do this analysis using Python and its associated libraries. This book succeeded in teaching me how to use Python to perform geospatial analyses, and actually taught me a significant amount of wider GIS knowledge which I had not picked up through any of my university courses.
The book is divided into four main sections: the first introduces general GIS concepts, the second explains basic GIS operations in Python, the third shows how to use databases with geographic data and the fourth combines all of the previous information into two GIS web-apps. It is always difficult to work out what level to pitch this sort of book at – as a number of potential readers will already have experience with GIS (and are using the book to learn about doing GIS analysis in Python, like me), but some will be complete beginners who want to introduce map-based analysis into their applications. The first few chapters of this book are pitched nicely at a mid-point between these two reader groups: the author explains things clearly and precisely without seeming patronising. Although I already knew much of the basics, I found the section on projections and co-ordinate systems very useful as I had never properly understood these (I’d always seen the different options in software I was using for Projected Co-ordinate Systems and Geographic Co-ordinate Systems, but I never knew the difference until I read this book!).
The next section explains how to use a number of Python GIS libraries, such as GDAL/OGR, PyProj and Shapely. The author starts with a general description of the capabilities of each library, and continues with a ‘cookbook-style’ approach showing how to accomplish various tasks with them. For example, instructions are given on converting projections and calculating great-circle distances with PyProj, extracting shape geometries and attributes from shapefiles using OGR, and doing basic GIS analysis with Shapely. Details are also given on joining these libraries together by exchanging data in the Well-Known Text (WKT) format – a format I hadn’t come across before, but which appears to be very useful. This section finishes by putting a number of these libraries together to do some real-world tasks, such as identifying parks near urban areas.
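To give a flavour of the sort of thing the book covers (this is my own sketch, not code from the book): a great-circle distance with PyProj, and a geometry passed around as WKT with Shapely.

```python
# My own sketch, not the book's code: pyproj for great-circle distances,
# Shapely + WKT for exchanging geometries between libraries.
from pyproj import Geod
from shapely import wkt

# Great-circle distance between London and New York on the WGS84 ellipsoid
geod = Geod(ellps="WGS84")
_, _, distance_m = geod.inv(-0.1276, 51.5072, -74.0060, 40.7128)
print(f"{distance_m / 1000:.0f} km")        # roughly 5,570 km

# Well-Known Text lets libraries pass geometries around as plain strings
square = wkt.loads("POLYGON((0 0, 1 0, 1 1, 0 1, 0 0))")
print(square.area)                           # 1.0
```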
The third section focuses on geodatabases – an area I know very little about. This section gives a good overview of the concept of a geodatabase, and then specific details about three geodatabases: PostGIS, MySQL and SpatiaLite. I was pleased to see that three contrasting databases were chosen, and a good listing of advantages and disadvantages of each was given. After introducing the workings of these databases, code examples for linking them to Python are given, starting from basic queries and going right up to complex spatial analysis performed within the database. A geospatial application (called DISTAL) is then implemented, showing how to combine geodatabase access with the GIS analyses explained in the previous section. This is implemented as a web-application, but previous experience with web programming is not needed as it is implemented using simple Python CGI scripts, and there are sidebars explaining terms that the reader may not have come across before.
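Again as a rough flavour (my own illustration rather than the book’s DISTAL code – the connection string and the ‘places’ table are invented): a simple nearest-places query against PostGIS from Python.

```python
# My own illustrative sketch: a PostGIS query from Python. The connection
# details and the 'places' table (with a 'geom' geometry column) are invented.
import psycopg2

conn = psycopg2.connect("dbname=gisdata user=gis")
cur = conn.cursor()

# Ten places nearest to a given WGS84 point, with distances in metres
cur.execute(
    """
    SELECT name,
           ST_Distance(geom::geography,
                       ST_SetSRID(ST_MakePoint(%s, %s), 4326)::geography) AS metres
    FROM places
    ORDER BY metres
    LIMIT 10;
    """,
    (-1.4044, 50.9097),
)
for name, metres in cur.fetchall():
    print(f"{name}: {metres:.0f} m")
```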
The fourth section is by far the most complicated, and deals with producing maps using a library called Mapnik and building geo-enabled web applications using GeoDjango. I must admit that I didn’t quite follow all of this chapter, although that is probably because I’m not hugely interested in, or experienced with, building web apps. In some ways a little too much emphasis is placed on how to do things using Django – and trying to introduce any web-app framework (be it Django, Ruby on Rails, or anything else) in a single chapter is a tall order – and not enough on the GIS side, but I can see why the author included it, as it brings together a fair number of the tools covered in the book into one coherent whole.
Overall, I’m very impressed with this book. If I had my way (and you never know, if I end up as a lecturer one day I might…), I’d make Chapter 2 part of the core reading for any GIS course, as I am completely shocked that it covers areas that I have never covered even when doing Advanced GIS courses at degree level! I should mention that as well as the chapters mentioned above there is a useful chapter on sources of geospatial data (which, again, mentioned sources that I’d not heard of), and a comprehensive index which makes it very easy to find things. The instructions on how to use the Python libraries (and, more importantly, how to join them together) are well-written and comprehensive and the introductions to GIS concepts are pitched at just the right level. I would thoroughly recommend this book for any GIS or geospatial data user for two main reasons: firstly, it gives great introductions to GIS concepts they may not have come across, and secondly, knowing how to do these things in Python can make certain jobs so much easier (how about a 10 line Python script rather than a few hours of repetitive data conversion?).
(Disclaimer: I was provided with a free review copy of this book)