Robin's Blog

Quick way to delete columns from a shapefile

I haven’t posted anything on this blog for a long time – sorry about that. I’ve been quite ill, and had a new baby – so blogging hasn’t been my top priority. Hopefully I’ll manage some slightly more regular posts now. Anyway, on with the post…

I recently needed to delete some attribute columns from a very large (multi-GB) shapefile. I had the shapefile open in QGIS, so decided the easiest way would be to do it through the GUI as follows:

  1. Open up the attribute table
  2. Turn on editing (far left toolbar button)
  3. Click the Delete Field button (third from the right, or press Ctrl-L) and select the fields to delete

I was surprised to find that this took ages. It seemed to refresh the attribute table multiple times throughout the process (maybe after deleting each separate field?), and that took ages to do (because the shapefile was so large).

I then found I needed to do this process again, and looked for a more efficient way – and I found one. Unsurprisingly, it uses the GDAL/OGR command-line tools – a very helpful set of tools which often provide superior features and/or performance.

Basically, rather than deleting fields, copy the data to a new file, selecting just the fields that you want. For example:

ogr2ogr -f "ESRI Shapefile" -sql "SELECT attribute1, attribute2 FROM input" output.shp input.shp

This will select just the columns attribute1 and attribute2 from the file input.shp.

Surprisingly this command doesn’t actually produce a full shapefile as an output – instead of producing output.shp, output.shx, output.prj and output.dbf (the full set of files that constitute a ‘shapefile’), it just creates output.dbf – the file that contains the attribute table. However, this is easily fixed: just copy the other input.* files and rename them as appropriate (or, if you don’t want to keep the input data, then just rename output.dbf as input.dbf).

Regression in Python using R-style formula – it’s easy!

I remember experimenting with doing regressions in Python using R-style formulae a long time ago, and I remember it being a bit complicated. Luckily it’s become really easy now – and I’ll show you just how easy.

Before running this you will need to install the pandas, statsmodels and patsy packages. If you’re using conda you should be able to do this by running the following from the terminal:

conda install statsmodels patsy

(and then say yes when it asks you to confirm it)

import pandas as pd
from statsmodels.formula.api import ols

Before we can do any regression, we need some data – so lets read some data on cars:

df = pd.read_csv("")

You may have noticed from the code above that you can just give a URL to the read_csv function and it will download it and open it – handy!

Anyway, here is the data:

Model MPG Cylinders Engine Disp Horsepower Weight Accelerate Year Origin
0 amc ambassador dpl 15.0 8 390.0 190 3850 8.5 70 American
1 amc gremlin 21.0 6 199.0 90 2648 15.0 70 American
2 amc hornet 18.0 6 199.0 97 2774 15.5 70 American
3 amc rebel sst 16.0 8 304.0 150 3433 12.0 70 American
4 buick estate wagon (sw) 14.0 8 455.0 225 3086 10.0 70 American

Before we do our regression it might be a good idea to look at simple correlations between columns. We can get the correlations between each pair of columns using the corr() method:

MPG Cylinders Engine Disp Horsepower Weight Accelerate Year
MPG 1.000000 -0.777618 -0.805127 -0.778427 -0.832244 0.423329 0.580541
Cylinders -0.777618 1.000000 0.950823 0.842983 0.897527 -0.504683 -0.345647
Engine Disp -0.805127 0.950823 1.000000 0.897257 0.932994 -0.543800 -0.369855
Horsepower -0.778427 0.842983 0.897257 1.000000 0.864538 -0.689196 -0.416361
Weight -0.832244 0.897527 0.932994 0.864538 1.000000 -0.416839 -0.309120
Accelerate 0.423329 -0.504683 -0.543800 -0.689196 -0.416839 1.000000 0.290316
Year 0.580541 -0.345647 -0.369855 -0.416361 -0.309120 0.290316 1.000000

Now we can do some regression using R-style formulae. In this case we’re trying to predict MPG based on the year that the car was released:

model = ols("MPG ~ Year", data=df)
results =

The ‘formula’ that we used above is the same as R uses: on the left is the dependent variable, on the right is the independent variable. The ols method is nice and easy, we just give it the formula, and then the DataFrame to use to get the data from (in this case, it’s called df). We then call fit() to actually do the regression.

We can easily get a summary of the results here – including all sorts of crazy statistical measures!

OLS Regression Results
Dep. Variable: MPG R-squared: 0.337
Model: OLS Adj. R-squared: 0.335
Method: Least Squares F-statistic: 198.3
Date: Sat, 20 Aug 2016 Prob (F-statistic): 1.08e-36
Time: 10:42:17 Log-Likelihood: -1280.6
No. Observations: 392 AIC: 2565.
Df Residuals: 390 BIC: 2573.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept -70.0117 6.645 -10.536 0.000 -83.076 -56.947
Year 1.2300 0.087 14.080 0.000 1.058 1.402
Omnibus: 21.407 Durbin-Watson: 1.121
Prob(Omnibus): 0.000 Jarque-Bera (JB): 15.843
Skew: 0.387 Prob(JB): 0.000363
Kurtosis: 2.391 Cond. No. 1.57e+03

We can do a more complex model easily too. First lets list the columns of the data to remind us what variables we have:

Index(['Model', 'MPG', 'Cylinders', 'Engine Disp', 'Horsepower', 'Weight',
       'Accelerate', 'Year', 'Origin'],

We can now add in more variables – doing multiple regression:

model = ols("MPG ~ Year + Weight + Horsepower", data=df)
results =
OLS Regression Results
Dep. Variable: MPG R-squared: 0.808
Model: OLS Adj. R-squared: 0.807
Method: Least Squares F-statistic: 545.4
Date: Sat, 20 Aug 2016 Prob (F-statistic): 9.37e-139
Time: 10:42:17 Log-Likelihood: -1037.4
No. Observations: 392 AIC: 2083.
Df Residuals: 388 BIC: 2099.
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept -13.7194 4.182 -3.281 0.001 -21.941 -5.498
Year 0.7487 0.052 14.365 0.000 0.646 0.851
Weight -0.0064 0.000 -15.768 0.000 -0.007 -0.006
Horsepower -0.0050 0.009 -0.530 0.597 -0.024 0.014
Omnibus: 41.952 Durbin-Watson: 1.423
Prob(Omnibus): 0.000 Jarque-Bera (JB): 69.490
Skew: 0.671 Prob(JB): 8.14e-16
Kurtosis: 4.566 Cond. No. 7.48e+04

We can see that bringing in some extra variables has increased the $R^2$ value from ~0.3 to ~0.8 – although we can see that the P value for the Horsepower is very high. If we remove Horsepower from the regression then it barely changes the results:

model = ols("MPG ~ Year + Weight", data=df)
results =
OLS Regression Results
Dep. Variable: MPG R-squared: 0.808
Model: OLS Adj. R-squared: 0.807
Method: Least Squares F-statistic: 819.5
Date: Sat, 20 Aug 2016 Prob (F-statistic): 3.33e-140
Time: 10:42:17 Log-Likelihood: -1037.6
No. Observations: 392 AIC: 2081.
Df Residuals: 389 BIC: 2093.
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept -14.3473 4.007 -3.581 0.000 -22.224 -6.470
Year 0.7573 0.049 15.308 0.000 0.660 0.855
Weight -0.0066 0.000 -30.911 0.000 -0.007 -0.006
Omnibus: 42.504 Durbin-Watson: 1.425
Prob(Omnibus): 0.000 Jarque-Bera (JB): 71.997
Skew: 0.670 Prob(JB): 2.32e-16
Kurtosis: 4.616 Cond. No. 7.17e+04

We can also see if introducing categorical variables helps with the regression. In this case, we only have one categorical variable, called Origin. Patsy automatically treats strings as categorical variables, so we don’t have to do anything special – but if needed we could wrap the variable name in C() to force it to be a categorical variable.

model = ols("MPG ~ Year + Origin", data=df)
results =
OLS Regression Results
Dep. Variable: MPG R-squared: 0.579
Model: OLS Adj. R-squared: 0.576
Method: Least Squares F-statistic: 178.0
Date: Sat, 20 Aug 2016 Prob (F-statistic): 1.42e-72
Time: 10:42:17 Log-Likelihood: -1191.5
No. Observations: 392 AIC: 2391.
Df Residuals: 388 BIC: 2407.
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept -61.2643 5.393 -11.360 0.000 -71.868 -50.661
Origin[T.European] 7.4784 0.697 10.734 0.000 6.109 8.848
Origin[T.Japanese] 8.4262 0.671 12.564 0.000 7.108 9.745
Year 1.0755 0.071 15.102 0.000 0.935 1.216
Omnibus: 10.231 Durbin-Watson: 1.656
Prob(Omnibus): 0.006 Jarque-Bera (JB): 10.589
Skew: 0.402 Prob(JB): 0.00502
Kurtosis: 2.980 Cond. No. 1.60e+03

You can see here that Patsy has automatically created extra variables for Origin: in this case, European and Japanese, with the ‘default’ being American. You can configure how this is done very easily – see here.

Just for reference, you can easily get any of the statistical outputs as attributes on the results object:

Intercept            -61.264305
Origin[T.European]     7.478449
Origin[T.Japanese]     8.426227
Year                   1.075484
dtype: float64

You can also really easily use the model to predict based on values you’ve got:

results.predict({'Year':90, 'Origin':'European'})
array([ 43.00766095])

Reminder about cross-platform case-sensitivity differences

This is just a very brief reminder about something you might run into when you’re trying to get your code to work on multiple platforms – in this case, OS X, Linux and Windows.

Basically: file names/paths are case-sensitive on Linux, but not on OS X or Windows.

Therefore, you could have some Python code like this:

f = open(os.path.join(base_path, 'LE72020252003106EDC00_B1.tif'))

which you might use to open part of a Landsat 7 image – and it would work absolutely fine on OS X and Windows, but fail on Linux. I initially assumed that the failure on Linux was due to some of the crazy path manipulation stuff that I had done to get base_path – but it wasn’t.

It was purely down to the fact that the file was actually called LE72020252003106EDC00_B1.TIF, and Linux treats LE72020252003106EDC00_B1.tif and LE72020252003106EDC00_B1.TIF as different files.

I’d always known that paths on Windows are not case-sensitive, and that they are case-sensitive on Linux – but I’d naively assumed that OS X paths were case-sensitive too, as OS X is based on a *nix backend, but I was wrong.

If you really have problems with this then you could fairly easily write a function that checked to see if a filename exists, and if it found that it didn’t then tried searching for files using something like a case-insensitive regular expression – but it’s probably just easiest to get the case of the filename right in the first place!

‘Got multiple values for argument’ error with keyword arguments in Python classes

This is a quick post to brief describe a problem I ran into the other day when trying to debug someone’s code – the answer may be entirely obvious to you, but it took me a while to work out, so I thought I’d document it here.

The problem that I was called over to help with was a line of code like this:

t.do_something(a=5, b=10)

where t was an instance of a class. Now, this wasn’t the way that I usually write code – I tend to only use keyword arguments after I’ve already used positional arguments – but it reminded me that in Python the following calls to the function f(a, b) are equivalent:

f(1, 2)
f(1, b=2)
f(a=1, b=2)

Anyway, going back to the original code: it gave the following error:

do_something() got multiple values for argument 'a'

which I thought was very strange, as there was definitely only one value of a given in the call to that method.

If you consider yourself to be a reasonably advanced Python programmer than you might want to stop here and see if you can work out what the problem is. Any ideas?

When you’ve had a bit of a think, continue below…

I had a look at the definition of t.do_something(), and it looked like this:

class Test:

    def do_something(a, b):
        # Do something here!
        print('a = %s' % a)
        print('b = %s' % b)

You may have noticed the problem now – although at first glance I couldn’t see anything wrong… There was definitely only one parameter called a, and it definitely wasn’t being passed twice…so what was going on?!

As you’ve probably noticed by now…this method was missing the self parameter – and should have been defined as do_something(self, a, b). Changing it to that made it work fine, but it’s worth thinking about exactly why we were getting that specific error.

Firstly, let’s have a look at a more ‘standard’ error that you might get when you forget to add self as the first argument for an instance method. We can see this by just calling the method without using keyword arguments (that is, t.do_something(1, 2)), which gives:

TypeError: test_noself() takes 2 positional arguments but 3 were given

Now, once you’ve been programming Python for a while you’ll be fairly familiar with this error from when you’ve forgotten to put self as the first parameter for an instance method. The reason this specific error is produced is that Python will always pass instance methods the value of self as well as the arguments you’ve given the method. So, when you run the code:

t.do_something(1, 2)

Python will change this ‘behind the scenes’, and actually run:

t.do_something(t, 1, 2)

and as do_something is only defined to take two arguments, you’ll get an error. Of course, if your function had been able to take three arguments (for example, if there was an optional third argument), then you would find that t (which is the value of self in this case) was being passed as the first argument (a), 1 as the value of the second argument (b) and 2 as the value of the third argument (which could have been called c). This is a good point to remind you that the first argument of methods is only called self by convention – and that Python itself doesn’t care what you call it (although you should always call it self!)

From this, you should be able to work out why you’re getting an error about getting multiple values for the argument a… What’s happening is that Python is passing self to the method, as the first argument (which we have called a), and is then passing the two other arguments that we specified as keyword arguments. So, the ‘behind the scenes’ code calling the function is:

t.do_something(t, a=1, b=2)

But, the first argument is called a, so this is basically equivalent to writing:

t.do_something(a=t, a=1, b=2)

which is obviously ambiguous – and so Python throws an error.

Interestingly, it is quite difficult to get into a situation in which Python throws this particular error – if you try to run the code above you get a different error:

SyntaxError: keyword argument repeated

as Python has realised that there is a problem from the syntax, before it even tries to run it. You can manage it by using dictionary unpacking:

def f(a, b):

d = {'a':1, 'b':2}
f(1, **d)

Here we are defining a function that takes two arguments, and then calling it with a single positional argument for a, and then using the ** method of dictionary unpacking to take the dictionary d and convert each key-value pair to a keyword argument and value combination.

So, congratulations if you’d have solved this problem far quicker than me – but I hope it has made you think a bit more about how Python handles positional and keyword arguments. A few points to remember:

  • Always remember to use self as the first argument of your methods! (This would have stopped this problem ever happening!)
  • But remember that the name self is just a convention, and Python will pass the instance of your class to your first argument regardless what it is called, which can cause weird problems.
  • All positional arguments can be passed as keyword arguments, and vice-versa – they are entirely interchangeable – which, again, can cause problems if this isn’t what you intended.

Showing code changes when teaching

A key – but challenging – part of learning to program is moving from writing technically-correct code “that works” to writing high-quality code that is sensibly decomposed into functions, generically-applicable and generally “good”. Indeed, you could say that this is exactly what Software Carpentry is about – taking you from someone bodging together a few bits of wood in the shed, to a skilled carpenter. As well as being challenging to learn, this is also challenging to teach: how should you show the progression from “working” to “good” code in a teaching context?

I’ve been struggling with this recently as part of some small-group programming teaching I’ve been doing. Simply showing the “before” and “after” ends up bombarding the students with too many changes at once: they can’t see how you get from one to the other, so I want some way to show the development of code over time as things are gradually done to it (for example, moving this code into a separate function, adding an extra argument to that function to make it more generic, renaming these variables and so on). Obviously when teaching face-to-face I can go through this interactively with the students – but some changes to real-world code are too large to do live – and students often seem to find these sorts of discussions a bit overwhelming, and want to refer back to the changes and reasoning later (or they may want to look at other examples I’ve given them). Therefore, I want some way to annotate these changes to give the explanation (to show why we’re moving that bit of code into a separate function, but not some other bit of code), but to still show them in context.

Exactly what code should be used for these examples is another discussion: I’ve used real-world code from other projects, code I’ve written specifically for demonstration, code I’ve written myself in the past and sometimes code that the students themselves have written.

So far, I’ve tried the following approaches for showing these changes with annotation:

  1. Making all of the changes to the code and providing a separate document with an ordered list of what I’ve changed and why.
    Simple and low-tech, but often difficult for the students to visualise each change
  2. The same as above but committing between each entry in the list.
    Allows them to step through git commits if they want, and to get back to how the code was after each individual change – but many of the students struggle to do this effectively in git, and it adds a huge technological barrier…particularly with Git’s ‘interesting’ user-interface.
  3. The same as above, but using Github’s line comments feature to put comments at specific locations in the code.
    Allows annotations at specific locations in the code, but rather clunky to step through the full diff view of commits in order using Github’s UI.

I suspect any solution will involve some sort of version control system used in some way (although I’m not sure that standard diffs are quite the best way to represent changes for this particular use-case), but possibly with a different interface on it.

Is this a problem anyone else has faced in their teaching? Can you suggest any tools or approaches that might make this easier – for both the teacher and students?

(This has also been cross-posted on the Software Carpentry blog)

Reading AERONET data in Pandas: a simple helper function

I use data from the AERONET network of sun photometers a lot in my work, and do a lot of processing of the data in Python. As part of this I usually want to load the data into pandas – but because of the format of the data, it’s not quite as simple as it could be.

So, for those of you who are impatient, here is some code that reads an AERONET data file into a pandas DataFrame which you can just download and use:

For those who want more details, read on…

Once you’ve downloaded an AERONET data file and unzipped it, you’ll find you have a file called something like 050101_161231_Chilbolton.lev20, and if you look at the start of the file it’ll look a bit like this:

Level 2.0. Quality Assured Data.<p>The following data are pre and post field calibrated, automatically cloud cleared and manually inspected.
Version 2 Direct Sun Algorithm
Location=Chilbolton,long=-1.437,lat=51.144,elev=88,Nmeas=13,PI=Iain_H._Woodhouse_and_Judith_Agnew_and_Judith_Jeffrey_and_Judith_Jeffery,[email protected][email protected]
AOD Level 2.0,All Points,UNITS can be found at,,,

You can see here that we have a few lines of metadata at the top of the file, including the ‘level’ of the data (AERONET data is provided at three levels, 1.0, 1.5 and 2.0, referring to the quality assurance of the data), and some information about the AERONET site.

In this function we’re just going to ignore this metadata, and start reading at the 5th line, which contains the column headers. Now, you’ll see that the data looks like a fairly standard CSV file, so we should be able to read it fairly easily with pd.read_csv. This is true, and you can read it using:

df = pd.read_csv(filename, skiprows=4)

However, you’ll find a few issues with the DataFrame you get back from that simple line of code: firstly dates and times are just left as strings (rather than being parsed into proper datetime columns) and missing data is still shown as the string ‘N/A’. We can solve both of these:

No data: read_csv allows us to specify how ‘no data’ values are represented in the data, so all we need to do is set this: pd.read_csv(filename, skiprows=4, na_values=['N/A']) Note: we need to give na_valueslist of values to treat as no data, hence we create a single-element list containing the string N/A.

Dates & times: These are a little harder, mainly because of the strange format in which they are provided in the file. Although the column header for the first column says Date(dd-mm-yy), the date is actually colon-separated (dd:mm:yy). This is a very unusual format for a date, so pandas won’t automatically convert it – we have to help it along a bit. So, first we define a function to parse a date from that strange format into a standard Python datetime:

dateparse = lambda x: pd.datetime.strptime(x, "%d:%m:%Y %H:%M:%S")

I could have written this as a normal function (def dateparse(x)), but I used a lambda expression as it seemed easier for such a short function. Once we’ve defined this function we tell pandas to use it to parse dates (date_parser=dateparse) and also tell it that the first two columns together represent the time of each observation, and they should be parsed as dates (parse_dates={'times':[0,1]}).

Putting all of this together, we get:

dateparse = lambda x: pd.datetime.strptime(x, "%d:%m:%Y %H:%M:%S")
aeronet = pd.read_csv(filename, skiprows=4, na_values=['N/A'],

That’s all we need to do to read in the data and convert the right columns, the rest of the function just does some cleaning up:

  1. We set the times as the index of the DataFrame, as it is the unique identifier for each observation – and makes it easy to join with other data later.
  2. We remove the JulianDay column, as it’s rather useless now that we have a properly parsed timestamp
  3. We drop any columns that are entirely NaN and any rows that are entirely NaN (that’s what dropna(axis=1, how='all') does).
  4. We rename a column, and then make sure the data is sorted
aeronet = aeronet.set_index('times')
del aeronet['Julian_Day']

# Drop any rows that are all NaN and any cols that are all NaN
# & then sort by the index
an = (aeronet.dropna(axis=1, how='all')
            .dropna(axis=0, how='all')
            .rename(columns={'Last_Processing_Date(dd/mm/yyyy)': 'Last_Processing_Date'})

You’ll notice that the last few bits of this ‘post-processing’ were done using ‘method-chaining’, where we just ‘chain’ pandas methods one after another. This is often a very convenient way to work in Python – see this blog post for more information.

So, that’s how this function works – now go off and process some AERONET data!

Conda revisions: letting you ‘rollback’ to a previous version of your environment

I now use Anaconda as my primary Python distribution – and my company have also adopted it for use on all of their developer machines as well as their servers – so I like to think I’m a relatively knowledgeable user. However, the other day I came across a wonderful feature that I’d never known about before…revisions!

The best way to explain is by a quick example. If you run conda list --revisions, you’ll get an output like this:

2016-06-10 20:20:37  (rev 10)

2016-06-10 20:22:19  (rev 11)
     libpng  {1.6.17 -> 1.6.22}

2016-06-10 20:25:49  (rev 12)

In this output you can see a number of specific versions (or revisions) of this environment (in this case the default conda environment), along with the date/time they were created, and the differences (installed packages shown as +, uninstalled shown as - and upgrades shown as ->). If you want to revert to a previous revision you can simply run conda install --revision N (where N is the revision number). This will ask you to confirm the relevant package uninstallation/installation – and get you back to exactly where you were before!

So, I think that’s pretty awesome – and really handy if you screw things up and want to go back to a previously working environment. I’ve got a few other hints for you though…

Firstly, if you ‘revert’ to a previous revision then you will find that an ‘inverse’ revision is created, simply doing the opposite of what the previous revision did. For example, if your revision list looks like this:

2016-06-14 21:12:34  (rev 1)

2016-06-14 21:13:08  (rev 2)

and you revert to revision 1 by running conda install --revision 1, and then run conda list --revisions again, you’ll get this:

2016-06-14 21:13:08 (rev 2)

2016-06-14 21:15:45 (rev 3)

You can see that the changes for revision 3 are just the inverse of revision 2.

One more thing is that I’ve found out that all of this data is stored in the history file in the conda-meta directory of your environment (CONDA_ROOT/conda-meta for your default environment and CONDA_ROOT/envs/ENV_NAME/conda-meta for any other environment). You don’t want to know why I went searching for this file (it’s a long story involving some stupidity on my part), but it’s got some really useful contents:

==> 2016-06-07 22:41:06 <==
# cmd: /Users/robin/anaconda3/bin/conda create --name hotbar python=2.7
# create specs: ['python 2.7*']
==> 2016-06-07 22:46:28 <==
# cmd: /Users/robin/anaconda3/envs/hotbar/bin/conda install matplotlib numpy scipy ipython jupyter mahotas statsmodels scikit-image pandas gdal tqdm

Specifically, it doesn’t just give you the list of what was installed, uninstalled or upgraded – but it also gives you the commands you ran! If you want, you can extract these commands with a bit of command-line magic:

cat ~/anaconda3/envs/hotbar/conda-meta/history | grep '# cmd' | cut -d" " -f3-

/Users/robin/anaconda3/bin/conda create --name hotbar python=2.7
/Users/robin/anaconda3/envs/hotbar/bin/conda install matplotlib numpy scipy ipython jupyter mahotas statsmodels scikit-image pandas gdal tqdm
/Users/robin/anaconda3/envs/hotbar/bin/conda install -c conda-forge rasterio

(For reference, the command-line magic gets the content of the history file, searches for all lines starting with # cmd, and then splits the line by spaces and extracts everything from the 3rd group onwards)

I find environment.yml files to be a bit of a pain sometimes (they’re not always cross-platform compatible – see this issue), so this is quite useful as it actually gives me the commands that I ran to create the environment.

Data sources for parameterising Radiative Transfer Models & atmospheric correction algorithms

There was a question recently on the Py6S mailing list about what data sources are best to use to provide atmospheric parameters (such as AOT, water vapour and ozone) for use with 6S, other atmospheric Radiative Transfer Models (such as MODTRAN) or other atmospheric correction algorithms (such as ATCOR). In the spirit of ‘reply to public’ I decided I’d post the response here, as it’d certainly be easier for other people to find in future!

The first thing to say is if you’re doing atmospheric correction, then by far the best solution is to derive as many of the parameters as possible from the satellite data that you are trying to correct. Sometimes this is relatively easy (for example, when using MODIS data, as there are standard MODIS products for AOT, water vapour and ozone) and sometimes harder (there are no standard atmospheric products for SPOT, and algorithms for estimating atmospheric parameters from Landsat are relatively limited). The main reasons for this are:

  1. The atmospheric data will be from the exact time of the satellite image acquisition
  2. Depending on the satellite, the atmospheric data may be provided for every pixel in the image, and if this isn’t the case then there might be more than one value across the image (for example, the LEDAPS algorithm estimates AOT over all pixels containing Dense Dark Vegetation). This is key as it allows the atmospheric correction algorithm to take into account changes in atmospheric conditions across the image. I have published research (Wilson et al., 2014) showing that failing to take into account this variability when doing atmospheric correction can lead to significant errors.

However, if you’ve found this post then you’ve probably either already tried using data from the satellite you’re working with, and found it isn’t possible – or you’re doing radiative transfer modelling for some other purpose than atmospheric correction, so the explanation above had no relevance to you. So, on with the other data sources – and in Part 1 we’re going to cover aerosol-related data.

Aerosol Optical Thickness

(Also known as Aerosol Optical Depth, just to confuse you!)

This is a measure of the haziness of the atmosphere, and is set in Py6S as: s.aot550 = x Please note that this means AOT at 550nm (yes, AOT is wavelength dependent!), so you may have to interpolate the AOT data from different wavelengths. I’ll post a separate post soon about how best to do this.

When choosing an AOT data source, you usually have a trade-off between accuracy, spatial resolution and temporal resolution. If you’re lucky, you can manage to get two of those, but not three! So, going in order of accuracy:


The Aerosol Robotic Network (AERONET) is the gold-standard in terms of accurate measurement of AOT. The measurements are acquired from ground-based sun photometers, which have the major benefit of being able to produce very accurate measurements at a high frequency – as they don’t depend on satellite orbits. However, there are two major disadvantages: they have a very low spatial resolution, and they only produce measurements when the sun is shining.

Specifically, although there are over 1,000 measurement sites listed on the AERONET website, many of these are for short-term measurement campaigns and so the number of sites regularly recording measurements over a long period of time is much lower. For example, excluding short-term sites, there are only three measurement sites in the UK (Hampshire, London and Edinburgh), and there are many parts of the world with very few sites (such as central Africa).

Also, as mentioned above, measurements are only acquired when there is a direct view of the sun from the instrument. This obviously means that measurements are not acquired during the night (which can stretch for many months near the poles), but also that cloud cover has a significant impact on the frequency of measurements. The AERONET cloud-screening algorithms have to be very strict, as even a tiny amount of cloud will cause a large error in the AOT measurement.

Good for: Very accurate measurements. High temporal resolution.

Bad for: Spatially-dense measurements. Cloudy areas.

Data available from: NASA AERONET site

Other sun photometer data

AERONET run a large network of sun photometers, but there are many other sun photometers run by other people/groups, and these can be good sources of data. Indeed, you may even have acquired your own measurements using your own sun photometer (such as the Microtops sun photometer pictured below). The advantages and disadvantages above still apply, although if you’ve taken measurements yourself then I assume they are in the right place and at the right time!

Data available from: There is a (slightly old now) list of other sun photometer data sources online. If you’ve got Microtops data that you want to process then have a look at my PyMicrotops module that makes the process a lot easier.


Data from the MODIS sensor on the Aqua and Terra satellites is operationally processed to produce AOT data at 10km and 3km resolution, the MOD04 and MOD04_3K products. AOT estimates are produced over all land surfaces, but measurements over bright surfaces (such as deserts and snow) tend to have a higher error.

The Aqua and Terra satellites are in a sun-synchronous orbit, so passes over each location at approximately 10:30 and 13:30 local solar time, giving a total of two measurements a day. Again, cloud cover is an issue, and the MODIS algorithm masks out any pixel containing cloud.

The MODIS data is provided as HDF files with loads of Scientific Data Sets (SDS, basically bands) – the one you want to use is called Optical_Depth_Land_And_Ocean, which is high quality AOT data (where the QAC, the measure of retrieval quality, is 3) from both land and ocean, using all algorithms.

The MOD08 dataset provides the same data at a lower resolution (1 degree), if spatial resolution is not important for you.

Good for: Moderate spatial resolution. Operational data availability. Reasonable accuracy.

Bad for: Temporal resolution. Cloudy areas.

Data available from: LAADSWeb (Unfortunately it comes in an annoying projection, but you can reproject it using the MODIS Reprojection Tool or HEG. MOD08 doesn’t have this issue, and can also be accessed through Google Earth Engine)


The Multi-Angle Implementation of Atmospheric Correction (MAIAC) algorithm (Lyapustin et al., 2011) produces AOT data at 1km resolution, through a novel method which combines a range of MODIS observations with differing view geometries.

Good for: High spatial resolution.

Bad for: Not currently operationally available.

Data available from: Data is not currently available online, but I think it is due to be released as an official MODIS product within the next six months or so.

Aerosol Type

There are lots of ways of parameterising aerosol types in radiative transfer models (see here for a list of the ways you can do this in Py6S) – and it is often difficult to work out what the best option to use is. I’ve listed some options below, but this is unlikely to be a comprehensive list – so please leave a comment if you know of any other data sources.


As well as providing AOT data, AERONET sun photometers also acquire other measurements which can be used to infer the aerosol type. This is almost certainly the most accurate way of setting the aerosol type – and there is a built-in function in Py6S that will help you do it. See here for full details on how to use it: basically you just download the AERONET data and give Py6S the filename and date/time and it does the rest.

Data available from: NASA AERONET site (find AERONET Inversions on the left-hand side)


The MOD04 product mentioned above also provides an estimation of aerosol type, in five categories: dust, sulfate, smoke, heavy absorbing smoke and mixed. It’s not entirely clear how best to match these to the predefined 6S aerosol types (which are biomass burning, continental, desert, maritime, stratospheric and urban) – although there is an option in 6S to set a user-defined aerosol profile, which allows you to specify the relative fraction of each of dust, water, oceanic and soot aerosols (see AeroProfile.User)

Data available from: LAADSWeb (Unfortunately it comes in an annoying projection, but you can reproject it using the MODIS Reprojection Tool or HEG.)

Aerosol type climatologies

A number of aerosol type climatologies have been produced, giving an indication of the typical aerosol type observed at each location, over time. These climatologies vary in the level of detail – both in terms of spatial resolution (which is often coarse) and temporal resolution (often monthly or seasonally). The best way to find these is to search for papers about aerosol climatologies, and then either look for a URL in the paper or contact the authors to ask for the data. I’ve only included a couple of examples here as new climatologies are being published frequently:

So, that’s it for this part. Please let me know of any datasets that I’ve missed – and stay tuned for Part 2 which will cover other parameters (water vapour, ozone and so on).


Non-traditional references to my work – and why they’re important

If someone wants to see how many times my work has been referenced, they’d probably go and look at my citation statistics, for example on my Google Scholar profile. At the time of writing, that shows that I have 16 citations overall, and a h-index of 2. However, I don’t think this tells the whole story.

Specifically, it only counts references to academic papers, books or conference proceedings (‘traditionally-citable items’), and it doesn’t take into account the far wider use of some of these items beyond traditional citations in other papers, books or conference proceedings. This misses out a lot of uses of my work – many of which are uses that I think are actually important (possibly more important than citations in other academic work).

(Please note that I’m not trying to point to other non-traditional references because I feel that my citation count is “too low”, and I’m also not trying to say that a mention of my website URL is equivalent of a citation of my paper in another journal article – but I think these things should be taken into account, and it is hard to do so at the moment).

Anyway, I’ve been doing some investigation into some of the other uses of my work – mostly using various Google and Google Scholar searches – and I’ve found a lot of non-traditional references, some of which I didn’t know about before. For example:

A full list of non-traditional references to my work is available here, but I hope you’ll agree that – although they are not traditional references in traditional academic works – these uses and applications of my work are important, and show a real impact – often beyond the academic community.

Resources for learning Python for Remote Sensing – or switching from IDL

I’m supervising an MSc student for her thesis this summer, and the work she’s doing with me is going to involve a fair amount of programming, in the context of remote sensing & GIS processing. She’s got experience programming in IDL from a programming course during the taught part of her Masters, but has no experience of Python.

I’ve just sent her an email with links to some useful resources, but in the spirit of Matt Might’s Blog tips for busy academics, I thought it would be worth doing a ‘reply to public’, and putting the list of resources here. So, here goes…

  1. Software Carpentry Python course:- this is designed for people new to programming, so some of it will be very easy for you. However, it takes you through a good example of scientific programming with Python, including plotting graphs and dealing with arrays. I suggest that you use the Jupyter Notebook to run through this – there are instructions on how to do that in the course, and there is a brief intro to the notebook in this YouTube video (start at approx 3 minutes – the stuff before that is irrelevant for you)
  2. Lewis’s Scientific Programming for RS course (from UCL): – The most useful bits will probably be Python 101, Plotting and Numerical Python and Geospatial Data. Click the ‘Course Notes’ under each section to see the detailed notes.
  3. Python Scripting with Spatial Data is also good – it gives a good intro to Python, and then covers spatial analysis using GDAL and RIOS (we won’t be using RIOS, but the GDAL stuff is good).
  4. Geoprocessing with Python using Open Source GIS: This is a very good set of slides and tutorials – along with assignments, homework tasks and solutions – from a course run at Utah State University. Unfortunately it is now very old – it was written in 2008/9 – and so it refers to a number of out of date things (such as the ‘numeric’ library for Python, which has been replaced by numpy). A lot of the GDAL content is still useful though – for example, this set of slides on reading raster data with GDAL is pretty good.
  5. There are various good tutorials from conferences such as SciPy (Scientific Python) and FOSS4G (Free and Open Source Software for Geospatial). For example, you can watch a video of a three-hour tutorial from SciPy 2015 called Geospatial Data with Open Source tools in Python, and you can find the slides and other resources here. This goes into quite a lot of depth, and new Python programmers may find it all quite daunting – but it demonstrates the nice modern ways of doing things (using libraries like Fiona and rasterio) rather than the less-nice and lower-level GDAL library.
  6. Another couple of resources (suggested by Chris Holden in the comments – thanks!) are the WUR Geoscripting resources, which start with R and then moves on to Python, and his own Open Source Geoprocessing tutorial, which covers GDAL and OGR, and demonstrates how to calculate NDVI and even goes as far as land cover classification.

There are also a few useful resources for switching from IDL to Python. Specifically:

  1. A Numpy reference for IDL users:  (Numpy is the Python library that provides functions to manipulate arrays – unlike IDL, this isn’t included by default in Python – but it does come with the Anaconda distribution I mentioned above).
  2. I wrote a blog post comparing a set of ‘Ten Little Programs’ in IDL with equivalents in Python, which should give you an idea of the similarities and differences, and how you can translate some of your code.

This is a very short list of a few resources – I’m sure there are some better ones out there, and so please let me know if you’ve got any recommendations!