Robin's Blog

Job: Research Assistant with Python & Remote Sensing/Image Processing experience

This isn’t normal content for my blog, but I thought a post here might reach people who would be interested in the job. Don’t worry, normal service will be resumed shortly – this isn’t going to turn into a job listing site!

A research assistant is required to assist with the development of an algorithm to monitor air quality at high-resolution from satellite images. Your day-to-day work will involve investigating extensions to the algorithm, implementing these extensions in Python code, and automating the processing and validation of large volumes of satellite imagery. You should have good scientific Python programming skills (including experience with libraries such as numpy, scipy and pandas), along with experience with one or more of remote sensing, image processing, computer vision and GIS.

You will be collaborating closely with the original developer of the algorithm, who is based at the University of Southampton and working closely with the Flowminder Foundation. The work is currently fixed-term for 2-5 months full-time on a contract basis, with pay equivalent to a salary of around £30,000pa. There is potential for contract extension and/or further work with the Flowminder Foundation. Part-time, remote or flexible working may be possible, and this role may suit a recent MSc or PhD graduate.

For further information, please contact me at [email protected]

Hacking the Worcester Wave thermostat in Python – Part 3

So, last time we worked out how communications were encrypted and managed to read the current status of the heating system (whether the boiler is on or not, the current temperature, and so on). That’s great – but it’d be even better if we could actually control the thermostat from Python: set the temperature, change from timer mode to manual mode etc. That’s what we’re going to focus on today.

So, using the same ‘man-in-the-middle’ approach I monitored the communications from the app as I changed various settings. When changing the temperature I got a message like this:

<message id="lj8Vq-31" to="[email protected]" type="chat"><subject>/heatingCircuits/hc1/manualTempOverride/temperature</subject><body>PUT /heatingCircuits/hc1/manualTempOverride/temperature HTTP:/1.0\nContent-Type: application/json\nContent-Length: 25\nUser-Agent: NefitEasy\n\n\n\nXmuIR7wCfDZpPrPkrb/CqQ==\n</body></message>

Again, tidying it up a bit, and removing the XMPP headers we see the body is:

PUT /heatingCircuits/hc1/manualTempOverride/temperature

Content-Type: application/json
Content-Length: 25
User-Agent: NefitEasy


If we decode the text at the bottom we find that it decodes as:


This looks like JSON, with a bit of null-padding at the end (presumably so that the encryption routine is given data with the length divisible by a certain number) – and it makes sense, as I set the thermostat to 16 degrees.

So, if we send this message (but with a different number) then presumably the temperature setting will change? Well…sort of!

You see, changing the temperature depends on what mode you are in. There are two modes, and the UMD part of the status message tells you which one you are in: manual or clock. If you’re in manual mode then you can change the temperature by simply sending a PUT message (like the one above) to /heatingCircuits/hc1/temperatureRoomManual, with JSON of {"value":21} (or whatever temperature you want).

However, if you’re in clock mode then you have to set the ‘override temperature’ (a PUT message to /heatingCircuits/hc1/manualTempOverride/temperature) with the same JSON as above, and you then have to turn the ‘temperature override function’ on (a PUT message to /heatingCircuits/hc1/manualTempOverride/status with the JSON {"value": "on"}).

Oh, and if you want to change the mode then you can just send a PUT message to /heatingCircuits/hc1/usermode with the JSON {"value": "clock"} or {"value":"manual"}.

You may be wondering whether you get a response from these messages or not: you do, but they’re not very interesting. Unless there has been an error, all you get is:

No Content
Content-Type: application/json
connection: close

There are loads of other messages you can send to do all sorts of complicated things (such as changing the timer programme), but I haven’t bothered to investigate any of those yet. I know they will use the same format as these messages, they’ll just have slightly more complicated JSON payloads, and may require the sending of multiple messages. I’m just happy that I can read the status of my thermostat, and control basic settings (mode and temperature)!

So, I haven’t actually mentioned Python that much in this series so far (sorry!) – although, in fact, most of my ‘trial and error’ work was done through Python using the great sleekxmpp library. I have to confess here that I haven’t written the code as well as I should have done: I should really have designed it to implement a proper Finite State Machine, and send and receive the appropriate messages at the appropriate times, all while updating internal information in the Python classes…but I didn’t! Sorry – dealing with all of that, working asynchronously, was too much like hard work.

So, I wrote a BaseWaveMessageBot class that implemented connecting, sending messages, encoding and decoding message payloads and a bit of simple error handling. That class has all of the complicated stuff in it, so I then wrote a couple of very simple classes (StatusBot and SetBot) that send the appropriate messages and process the responses. I then combined these in a nice class called WaveThermo. Currently there aren’t many methods in WaveThermo, but I will gradually add more functionality as I need it.

The code is available on Github and is fairly easy to use:

Of course, I’ve only tested it with my thermostat – so if it doesn’t work for you then please let me know!

So, that’s it for this time – next time I’ll talk about a bit of the work I’ve done with automated monitoring of the temperature and the thermostat state, and some of the interesting patterns I’ve found.

How to: add simple new commands to LaTeX to help with writing papers

Like a lot of academics, I write many documents in LaTeX – including almost all of my academic papers, and my PhD thesis! So, anything that can make my life easier is of interest to me. I was recently discussing this with a colleague (a co-author on a paper actually), and realised that lots of people don’t know how easy it is to create new custom commands in LaTeX. It’s actually really simple – and can be really useful – as I hope this post will explain.

So, one simple use for custom commands is to save you typing the same thing over and over again. In my thesis I talked a lot about air quality, and particularly PM2.5. Now, the correct way to typeset that in LaTeX is as:


This uses math-mode, but keeps the PM and the 2.5 as ‘normal text’ (otherwise it will look like P multiplied by M rather than PM as an abbreviation for particulate matter). This is a pain to keep typing all of the time – so a lot of people just write


which looks far worse (the top is ‘right’, the bottom is ‘wrong’):

Screen Shot 2016-02-06 at 20.03.10

Luckily, LaTeX’s ability to create new commands really easily can be used to save us a lot of work. Adding the following to your preamble (the section of your document before the \begin{document} line) will create a new command called \pmtext which will insert the ‘proper’ version of the PM2.5 code from above:


As you can see, \newcommand is quite simple, you give it two arguments: the name of the new command, and what the new command ‘does’. In this case the second argument is just some LaTeX code that typesets the PM text properly. I created this command fairly early on in the process of writing my PhD, and it made life a lot easier: I had less to type, I didn’t have to remember exactly how to typeset PM2.5 properly, and if I wanted to change how it was displayed I just had to change it in one place (basically all of the standard benefits of commands or functions in any programming language!).

Talking of programming languages, you can do far fancier things with \newcommand, including passing parameters. A good example here is some code that I add to every LaTeX document I write, which creates a new command called \todo. See if you can work out what it does:

\newcommand{\todo}[1] {\textbf{\textcolor{red}{#1}}}

Yup, that’s right: it displays the given text as bold and red – a perfect way to mark something that still needs doing. For example:

\todo{Somewhere here put in the radiosonde type boxplot}

will produce something that looks like this:

Screen Shot 2016-02-06 at 20.09.57

If you look at the \newcommand call, you can see that we give it an extra parameter in square brackets: this is the number of parameters that the command has. We can then reference the value of these parameters as #1 for the first parameter, #2 for the second etc. The rest of the ‘body’ of the command is just normal LaTeX code (if you want to use this code, make sure you add \usepackage{xcolor} too, so that the \textcolor command will work!).

The other command I add to almost every document I use is this:

\newcommand{\rh}{\todo{REF HERE }}

It depends on the \todo command, and allows me to mark the fact that I need to add a reference to a bit of text, without having to stop writing and actually find the reference. I find this has actually helped my writing a lot – as I can quite happily write multiple paragraphs and just scatter \rh around to remind me to come back and put in references later.

The final command I used a lot in my thesis, which shows a slightly more advanced use of \newcommand is


You may know that the \caption command can take an optional argument in square brackets for a short caption to be used in the List of Figures. Something like

\caption[Example Landsat image over Southampton]{Example Landsat image over Southampton, shown as a true colour composite, with XYZ procedure applied. Image data from USGS.}

As in that example, I often found myself repeating bits of text: I normally wanted the short caption to be the first part of the long caption, without all of the details given in the long caption. So, the \xcaption command (‘extra caption’) command I’ve created just takes two arguments, the first of which is the short caption, and the second of which is the text to add to the short caption to get the long caption – saving a lot of typing and mistakes!

So, I hope that’s shown how easy it can be to create simple new commands in LaTeX that can save you a lot of time and effort.

Hacking the Worcester Wave thermostat in Python – Part 2

In the previous part we had established that the Worcester Wave thermostat app communicates with a remote server (run by Worcester Bosch) using the XMPP protocol, with TLS encryption. However, because of the encryption we haven’t yet managed to see the actual content of any of these messages!

To decrypt the messages we need to do a man in the middle attack. In this sort of attack we put insert ourselves between the two ends of the conversation and intercept messages, decrypt them and then forward them on to the original destination. Although it is called an ‘attack’, here I am basically attacking myself – because I’m basically monitoring communications going from a phone I own through a network I own. Beware that doing this to networks that you do not own or have permission to use in this way is almost certainly illegal.

There is a good guide to setting all of this up using a tool called sslsplit, although I had to do things slightly differently as I couldn’t get sslsplit to work with the STARTTLS method used by the Worcester Wave (as you may remember from the previous part, STARTTLS is a way of starting the communication in an unencrypted manner, and then ‘turning on’ the encryption part-way through the communication).

The summary of the approach that I used is:

  1. I configured a Linux server on my network to use Network Address Translation (NAT) – so it works almost like a router. This means that I can then set up another device on the network to use that server as a ‘gateway’, which means it will send all traffic to that server, which will then forward it on to the correct place.
  2. I created a self-signed root certificate on the server. A root certificate is a ‘fully trusted’ certificate that can be used to trust any other certificates or keys derived from it (that explanation is probably technically wrong, but it’s conceptually right).
  3. I installed this root certificate on a spare Android phone, connected the phone to my home wifi and configured the Linux server as the gateway. I then tested, and could access the internet fine from the phone, with all of the communications going through the server.

Now, if I use the Worcester Wave app on the phone, all of the communications will go through the server – and the phone will believe the server when it says that it is the Bosch server at the other end, because of the root certificate we installed.

Now we’ve got all of the certificates and networking stuff configured, we just need to actually decrypt the messages. As I said above, I tried using SSLSplit, but it couldn’t seem to cope with STARTTLS. I found the same with Wireshark itself, so looked for another option.

Luckily, I found a great tool called starttls-mitm which does work with STARTTLS. Even more impressively it’s barely 80 lines of Python, and so it’s very easy to understand the code. This does mean that it is less configurable than a more complex tool like SSLSplit – but that’s not a problem for me, as the tool does exactly what I want . It is even configured for XMPP by default too! (Of course, as it is written in Python I could always modify the code myself if I needed to).

So, running starttls-mitm with the appropriate command-line parameters (basically your keys, certificates etc) will print out all communications: both the unencrypted ones before the STARTTLS call, and the decrypted version of the encrypted ones after the STARTTLS call. If we then start doing something with the app while this is running, what do we get?

Well, first we get the opening logging information starttls-mitm telling us what it is doing:

LISTENER ready on port 8443
CLIENT CONNECT from: ('', 57913)

We then start getting the beginnings of the communication:

C->S 129 '<stream:stream to="" xmlns="jabber:client" xmlns:stream="" version="1.0">'
S->C 442 '<?xml version=\'1.0\' encoding=\'UTF-8\'?><stream:stream xmlns:stream="" xmlns="jabber:client" from="" id="260d2859" xml:lang="en" version="1.0"><stream:features><starttls xmlns="urn:ietf:params:xml:ns:xmpp-tls"></starttls><mechanisms xmlns="urn:ietf:params:xml:ns:xmpp-sasl"><mechanism>DIGEST-MD5</mechanism></mechanisms><auth xmlns=""/></stream:features>'

Fairly obviously, C->S are messages from the client to the server, and S->C are messages back from the server (the numbers directly afterwards are just the length of the message). These are just the initial handshaking communications for the start of XMPP communication, and aren’t particularly exciting as we saw this in Wireshark too.

However, now we get to the interesting bit:

C->S 51 '<starttls xmlns="urn:ietf:params:xml:ns:xmpp-tls"/>'
S->C 50 '<proceed xmlns="urn:ietf:params:xml:ns:xmpp-tls"/>'
Wrapping sockets.

The STARTTLS message is sent, and the server says PROCEED, and – crucially – starttls-mitm notices this and announces that it is ‘wrapping sockets’ (basically enabling the decryption of the communication from this point onwards).

I’ll skip the boring TLS handshaking messages now, and skip to the initialisation of the XMPP protocol itself. I’m no huge XMPP expert, but basically the iq messages are ‘info/query’ messages which are part of the handshaking process with each side saying who they are, what they support etc. This part of the communication finishes with each side announcing its ‘presence’ (remember, XMPP is originally a chat protocol, so this is the equivalent of saying you are ‘Online’ or ‘Active’ on Skype, Facebook Messenger or whatever).

C->S 110 '<iq id="lj8Vq-1" type="set"><bind xmlns="urn:ietf:params:xml:ns:xmpp-bind"><resource>70</resource></bind></iq>'
S->C 188 '<iq type="result" id="lj8Vq-1" to=""><bind xmlns="urn:ietf:params:xml:ns:xmpp-bind"><jid>[email protected]/70</jid></bind></iq>'
C->S 87 '<iq id="lj8Vq-2" type="set"><session xmlns="urn:ietf:params:xml:ns:xmpp-session"/></iq>'
S->C 86 '<iq type="result" id="lj8Vq-2" to="[email protected]/70"/>'
C->S 74 '<iq id="lj8Vq-3" type="get"><query xmlns="jabber:iq:roster" ></query></iq>'
S->C 123 '<iq type="result" id="lj8Vq-3" to="[email protected]/70"><query xmlns="jabber:iq:roster"/></iq>'
C->S 34 '<presence id="lj8Vq-4"></presence>'
C->S 34 '<presence id="lj8Vq-5"></presence>'

Now all of the preliminary messages are dealt with, we get to the good bit. The message below is sent from the client (the phone app) to the server:

C->S 162 '<message id="lj8Vq-6" to="[email protected]" type="chat"><body>GET /ecus/rrc/uiStatus HTTP /1.0\nUser-Agent: NefitEasy</body></message>'

It basically seems to be a HTTP GET request embedded within an XMPP message. That seems a bit strange to me – why not just use HTTP directly? – but at least it is easy to understand. The URL that is being requested also makes sense – I was on the ‘home screen’ of the app at that point, so it was grabbing the status for displaying in the user-interface (things like current temperature, set-point temperature, whether the boiler is on or not, etc).

Now we can see the response from the server:

S->C 904 '<message to="[email protected]/70" type="chat" xml:lang="en" from="[email protected]/RRC-RestApi"><body>HTTP/1.0 200 OK\nContent-Length: 640\nContent-Type: application/json\nconnection: close\n\n5EBW5RuFo7QojD4F1Uv0kOde1MbeVA46P3RDX6ZEYKaKkbLxanqVR2I8ceuQNbxkgkfzeLgg6D5ypF9jo7yGVRbR/ydf4L4MMTHxvdxBubG5HhiVqJgSc2+7iPvhcWvRZrRKBEMiz8vAsd5JleS4CoTmbN0vV7kHgO2uVeuxtN5ZDsk3/cZpxiTvvaXWlCQGOavCLe55yQqmm3zpGoNFolGPTNC1MVuk00wpf6nbS7sFaRXSmpGQeGAfGNxSxfVPhWZtWRP3/ETi1Z+ozspBO8JZRAzeP8j0fJrBe9u+kDQJNXiMkgzyWb6Il6roSBWWgwYuepGYf/dSR9YygF6lrV+iQdZdyF08ZIgcNY5g5XWtm4LdH8SO+TZpP9aocLUVR1pmFM6m19MKP+spMg8gwPm6L9YuWSvd62KA8ASIQMtWbzFB6XjanGBQpVeMLI1Uzx4wWRaRaAG5qLTda9PpGk8K6LWOxHwtsuW/CDST/hE5jXvWqfVmrceUVqHz5Qcb0sjKRU5TOYA+JNigSf0Z4CIh7xD1t7bjJf9m6Wcyys/NkwZYryoQm99J2yH2khWXyd2DRETbsynr1AWrSRlStZ5H9ghPoYTqvKvgWsyMVTxbMOht86CzoufceI2W+Rr9</body></message>'

Oh. This looks a bit more complicated, and not very easily interpretable. Lets format the main body of the message formatted a bit more nicely:

HTTP/1.0 200 OK
Content-Length: 640
Content-Type: application/json
connection: close


So, it seems to be a standard HTTP response (200 OK), but the body looks like it is encoded somehow. I assume that the decoded body would be something like JSON or XML or something containing the various status values – but how do we decode it to get that?

I tried all sorts of things like Base64, MD5 and so on but nothing seemed to work. I gave up on this for a few days, while gently pondering it in the back of my mind. When I came back to it, I realised that the data here was probably actually encrypted, using the Access Code that comes with the Wave and the password that you set up when you first connect the Wave. Of course, to decrypt it we need to know how it was encrypted…so time to break out the next tool: a decompiler.

Yes, that’s right: to fully understand exactly what the Wave app is doing, I needed to decompile the Android app’s APK file and look at the code. I did this using the aptly named Android APK Decompiler, and got surprisingly readable Java code out of it! (I mean, it had a lot of goto statements, but at least the variables had sensible names!)

It’s difficult to explain the full details of the encryption/decryption algorithm in prose – so I’ve included the Python code I implemented to do this below. However, a brief summary is that: the main encryption is AES using ECB, with keys generated from the MD5 sums of combinations of the Access Code, the password and a ‘secret’ (a value hard-coded into the app).

def encode(s):
    abyte1 = get_md5(access + secret)
    abyte2 = get_md5(secret + password)

    key = abyte1 + abyte2

    a =
    a =, AES.MODE_ECB)
    res = a.encrypt(s)

    encoded = base64.b64encode(res)

    return encoded

def decode(data):
    decoded = base64.b64decode(data)

    abyte1 = get_md5(access + secret)
    abyte2 = get_md5(secret + password)

    key = abyte1 + abyte2

    a =
    a =, AES.MODE_ECB)
    res = a.decrypt(decoded)

    return res

Using these functions we can decrypt the response to the GET /ecus/rrc/uiStatus message that we saw earlier, and we get this:

{'id': '/ecus/rrc/uiStatus',
 'recordable': 0,
 'type': 'uiUpdate',
 'value': {'ARS': 'init',
  'BAI': 'CH',
  'BBE': 'false',
  'BLE': 'false',
  'BMR': 'false',
  'CPM': 'auto',
  'CSP': '31',
  'CTD': '2014-12-26T12:34:27+00:00 Fr',
  'CTR': 'room',
  'DAS': 'off',
  'DHW': 'on',
  'ESI': 'off',
  'FPA': 'off',
  'HED_DB': '',
  'HED_DEV': 'false',
  'HED_EN': 'false',
  'HMD': 'off',
  'IHS': 'ok',
  'IHT': '16.70',
  'MMT': '15.5',
  'PMR': 'false',
  'RS': 'off',
  'TAS': 'off',
  'TOD': '0',
  'TOR': 'on',
  'TOT': '17.0',
  'TSP': '17.0',
  'UMD': 'clock'},
 'writeable': 0}

This makes far more sense!

It may not be immediately apparent what each field is (three character variable names – great!), but some of them are fairly obvious (CTD presumably stands for something like Current Time/Date), or can be established by decoding a number of messages with the boiler in different states (showing that DHW stands for Domestic Hot Water and BAI for Burner Active Indicator).

We’ve made a lot of progress in the second part of this guide: we’ve now decrypted the communications, and worked out how to get all of the status information that is shown on the app home screen. At this point I set up a simple temperature monitoring system to produce nice graphs of temperature over time – but I’ll leave the description of that to later in the series. In the next part (Part 3) we’re going to look at sending messages to actually change the state of the thermostat (such as setting a new temperature, or switching to manual mode), and then have a look at the Python library I’ve written to control the thermostat.

Hacking the Worcester Wave thermostat in Python – Part 1

When we bought a new boiler last year, we decided to install a ‘smart thermostat’. There are a wide range available these days, including the Google Nest, the Hive (from British Gas), and the Worcester Bosch ‘Wave’. As we had a Worcester Bosch boiler we got the Wave – and it wasn’t much more expensive than a standard programmer/thermostat.

The thermostat looks like this:


It is mounted on the wall at the bottom of our stairs, and is connected by a cable to the boiler. As you can see, it has a simple touch-sensitive interface: you can increase or decrease the thermostat setting, see the current temperature and the current setting, change from manual to program mode, and see whether the boiler is on or not. For anything more advanced (such as setting the programmable timer) you have to use the mobile app, which looks like this:

Looks familiar, doesn’t it? You have the same simple interface on the app ‘home screen’, but you can also configure the more advanced settings, such as the timer programme:

screen568x568 (1)

Overall, I’m pleased with the Wave thermostat: it is far easier to configure the timer settings on a phone app than on a tiny fiddly controller, and there are all sorts of advanced features you can use (for example, ‘optimisation’, where the system gradually learns how long it takes your house to warm up, and turns the heating on at the right time to get it to the temperature you want at the time you want) – combined with being able to control the heating when away from home.

Anyway, of course one of the first things that I looked for once it had been installed was an API that I could use to monitor and control the thermostat from my home server, ideally through a Python script. Unfortunately there was nothing about an API on the Worcester Bosch website, and when I phoned Customer Services they told me than there was no API available. So, I thought I’d try and reverse engineer the protocol that was used, and see if I could hack together a Python interface.

How does the system overall work?

When starting to investigate this, my a priori understanding of the system was that the app must send messages of some sort to a remote server (presumably run or controlled by Bosch), which then forwards the messages on to the thermostat itself, and vice-versa. This is because you can access the thermostat via the app wherever you are: you don’t have to be on the same wireless network. Whatever protocol or ports that are used must be ones that can reliably get through home firewalls so that the remote server can actually talk to the thermostat.

What technologies do they use?

When exploring the app I had noticed that there was a Product Information screen which showed information about the software and hardware versions, and included a button labelled Worcester Wave uses Open Source software. Tapping this shows a list of open-source packages that are used by the system, including their various license agreements. The list consisted of:

  • AndroidPlot – a plotting library for Android
  • Guava – Google Java core libraries
  • XMP Toolkit – eXtensible Metadata Platform
  • Smack – a Java XMPP (also known as Jabber) chat protocol implementation
  • JSR305 Expert Group – Annotations for Software Defect Detection in Java
  • Lucent technologies – Not sure, couldn’t find this one!
  • Chromium – I’m not sure why the Chromium web-browser was used
  • Takayua OOURA – Ooura’s Mathematical Software
  • Eigen software – matrix manipulation tools
  • libresample – resampling of data (primarily audio, but presumably other datatypes too)

So, looking at that list suggests that the eXtensible Messaging and Presence Protocol (XMPP) is used for communication, potentially with some metadata embedded with the eXtensible Metadata Platform. The rest of the libraries aren’t very relevant to our work, but are interesting to see (I suspect the mathematical ones are for doing the maths for some of the advanced features like optimisation).

What is being sent, and where to?

So, we think the communication is using XMPP – now we need to confirm that, and work out what is being sent and where it is being sent to. So, I opened up Wireshark to start sniffing the traffic. After playing around a bit with the filters, I got this (click to enlarge):

Screen Shot 2016-01-02 at 22.45.15

This shows that my guess was correct: the XMPP protocol is being used to send messages between the phone running the Wave Android app ( on my local network) and a server run by Bosch ( So, we now know what protocols we are dealing with: that is the good news.

However, the bad news is the STARTTLS messages that we see there. You may recognise this from email configuration dialog boxes, as it is one of the security options for connecting to POP3/IMAP/SMTP servers. It stands for ‘Start Transport Layer Security’, and is basically a way of saying “I want to encrypt this conversation from now onwards, is that ok?” (for reference, the alternative is to do everything with encryption right from the start of the communication). Anyway, in this case the Bosch server responds with PROCEED (“Yes, I’m happy to do this encrypted”). From that point onwards, we see the TLS security negotiation (“What sort of encryption do you support?”, “I support X, Y and Z”, “Ok, lets use Z”, “Here are my keys” etc) followed by a lot of ‘Application Data’ messages:

Screen Shot 2016-01-02 at 22.54.06

We can’t actually see any of the content of the messages, as they are encrypted, so all we get is the raw hex dump: 80414a90ca64968de3a0acc5eb1b50108bbc5b26973626daaa….(and so on). Not very useful!

I think this is a good place to finish Part 1 of the story. We have worked out that communication goes from the app to a Bosch server, and it uses the XMPP protocol, but with TLS encryption using STARTTLS, so we haven’t been able to work out what any of the messages actually contain. Come back for Part 2 to hear how I managed to read the messages…

My MSc project: a simple ray-tracing radiative transfer model (with code!)

My PhD was done through a Doctoral Training Centre, and as part of this I had a taught year at the beginning of my PhD. During the summer of this year I had to do a ‘Summer Project’, which was basically a MSc thesis. My thesis was called “Can a single cloud spoil the view?”: modelling the effect of an isolated cumulus cloud on calculated surface solar irradiance” (I think the best thing about my thesis was the title!). It gained a ‘Distinction’ mark, and I was very pleased that it won the RSPSoc MSc thesis prize (again, I think they gave me the prize based on the title alone!). This is the story of this project…

Paths taken by light in the ray-tracing Radiative Transfer Model I developed for my MSc

Paths taken by light in the ray-tracing Radiative Transfer Model I developed for my MSc

During the year before my MSc, I had been doing a lot of investigation into Radiative Transfer Modelling. These models (known as RTMs) are used to estimate how light travels through the atmosphere: you basically tell them what the atmosphere is like (how hazy, how much water vapour etc), and they simulate the effect on light passing through the atmosphere.

I’d investigated a range of models – including simple models like SPCTRL2, good standard models like 6S, and advanced models like SCIATRAN. One thing that I noticed was that none of these models seemed to be able to incorporate clouds very well. SCIATRAN did support modelling clouds, but you had to have either an entirely clear or entirely overcast sky – you could have vertical variation in optical properties (ie. multiple layers of clouds) but not horizontal variation. I decided that I’d write some code to use one of these RTMs to model light passing through an atmosphere which is partially cloudy and partially clear – like most of the skies that we have in the UK.

So, I started reading more about these RTMs, and tried to work out how I might be able to extend them so that I could have a spatially-variable amount of cloud (eg. cloud on the left-hand side of my ‘simulated atmosphere’ and clear sky on the right-hand side – or even a single small cloud moving across the sky). I struggled with this for a while and didn’t really have much success at working out what I’d need to do. I naively thought that I would be able to write a tiny bit of code to replace the ‘homogenous atmosphere’ that SCIATRAN used with a ‘spatially-variable atmosphere’ and that would be it. Unfortunately, that wasn’t the case…

It turns out that the horizontally-homogeneous nature of the atmosphere is a fundamental assumption of the mathematics that these models are based upon. And I’d come up with a MSc project that required me to break this assumption. Ooops.

So – after panicking a bit and deciding that I can’t do a PhD or even a MSc – I decided that if the standard models wouldn’t let me do what I wanted then I’d have to implement a new Radiative Transfer Model myself, from scratch! A bit of a tall order…but I decided to have a go, figuring that even if it didn’t work I’d learn a lot in the process.

With a bit more reading I realised that the only way to take into account horizontally-varying atmospheric properties would be to implement a ray-tracing model. In this sort of model, individual rays of light are followed from their source (the sun), and each individual interaction with the atmosphere is simulated (scattering, absorption etc). They are far more time-consuming to run than ‘standard’ models – but allow far more interesting parameterisations, as the atmosphere is basically represented by a grid, which can be ‘filled’ with different atmospheric properties in each cell. Normally the grid would be a 3D grid, but I decided to use a 2D grid to make life easier – just taking into account vertical height into the atmosphere (sun at the top, sensor at the bottom) and horizontal distance along the ground.

I decided to base my model on SPECTRL2, which meant I could use a lot of the constants that were defined in the SPECTRL2 code – for things like Earth-sun distances, absorption at different wavelengths, etc. I then built a simple Monte Carlo Ray Tracing procedure that followed the flowchart below (click to enlarge):


I say “simple” – it took quite a while to get it working properly! Of course, after actually getting the model working, I had to work out how to actually parameterise the atmosphere, how to create a realistic cloud to put in this atmosphere, and then what computational experiments to run to actually answer my original question.

Answering the actual question I set out to answer was actually very easy to do once I’d got the model running – and it gave some interesting answers. For example, if there is a cloud near a sensor on the ground, but not directly between the sensor and sun, then the sensor will actually record a higher irradiance than it would if there were no cloud. This makes perfect sense when you think about it (the cloud will scatter some extra light towards the sensor), but I hadn’t thought of it before I did the experiment!

I don’t think there’s really much more to say here, apart from to direct you to my thesis for more information (including my detailed conclusions) and to say that the code is now available on Github. I have tested the code in the latest version of IDL (v8.5) and it is still working – although I should warn you that the code runs rather slowly!

Also, if you’re interested in using a proper 3D ray tracing RTM then look into the Discrete Anisotropic Radiative Transfer Model (DART). I’ve always been meaning to replicate my experiments with DART, but have never quite had the time…that’s another thing for my ‘list of papers to publish when I have time’.

Deprecating AutoZotBib

One of the most important things when developing software – particularly open-source software – is knowing when to stop working on something, relinquish responsibility and suggest that it should not be used any more. This is what I have recently done with AutoZotBib.

For those who don’t remember AutoZotBib, it is a tool that I first wrote a few years ago to automatically export Zotero bibliography libraries to the BibTeX format (used for LaTeX documents). I used it during the writing of my PhD, but there have always been some fairly significant problems with it. The main problem was performance: if you had a reasonably large Zotero library, then adding a new item would freeze the library for at least thirty seconds, while the export was performed. In fact, the Zotero developers added a warning to the plugins list page saying that my plugin may slow Zotero down significantly if you use it with a large library!

Over the years I did various updates to try and make this work better – and, in fact, I wrote most of a new version that did intelligent caching and only wrote out the bits that had changed…but I never quite got it fully working and never found the time to push it through to completion.

I haven’t had any time to work on it recently (the last commit was February 2014), and I haven’t even had the time to respond to the occasional support emails that I receive. I’d been thinking of dropping support for a while, but I wasn’t aware of anything else that did the same job, and I always had a hope that maybe, one day, I might find time to work on it.

Two things snapped me out of this, and made me accept that I needed to drop it:

  1. The latest version of Firefox requires all plugins to be signed, and as Zotero is a Firefox plugin (although it can run in standalone mode too), any Zotero plugins will also have to be signed. I tried to get AutoZotBib signed, but a lot has changed in the Firefox extension world since I first wrote it, and I couldn’t get it to work.
  2. I’ve found another plugin that does similar things, but does them better – and is well-maintained! It’s called Zotero Better BibTeX, and its Push Export feature does exactly what AutoZotBib did, but allows you to configure it to only export specific collections, allows export only when the computer is idle, etc.

So, I took a deep breath and updated the website to say that AutoZotBib is deprecated, and marked it as such on the Zotero Plugins page, suggesting that users use BetterBibTeX instead. Doing this has really taken a load off my mind – and I am now happy that I don’t have to try and motivate myself to work on it. Of course, the beauty of open-source software is that the code is still available, and so if anyone else wants to pick it up (or is desperate to use it, even though it is no longer supported) then they can!

Behind the paper: Are visibility-derived AOT estimates suitable for parameterising satellite data atmospheric correction algorithms?

This has been a bit slow coming, but I am now sticking to my promise to write a Behind the paper post for each of my published academic papers. This is about:

Wilson, R. T., E. J. Milton, and J. M. Nield (2015). Are visibility-derived AOT estimates suitable for parameterising satellite data atmospheric correction algorithms? International Journal of Remote Sensing 36 (6) 1675-1688

Publishers Link
Post-print PDF

Screen Shot 2015-10-22 at 21.17.03

Layman’s summary

Aerosol Optical Thickness (AOT) is a measure of how hazy the atmosphere is – that is, how easy it is for light to pass through it. It’s important to measure this accurately, as it has a range of applications. In this paper we focus on satellite image atmospheric correction. When satellites collect images of the Earth, the light has to pass through the atmosphere twice (once on the way from the sun to the Earth, once again on the way back from the Earth to the satellite) and this affects the light significantly. For example, it can be scattered or absorbed by various atmospheric constituents. We have to correct for this before we can use satellite images – and one of the ways to do that is to simulate what happens to light under the atmospheric conditions at the time, and then use this simulated information to remove the effects from the image.

To do this simulation we need various bits of information on what the atmosphere was like when the image was acquired – and one of these is the AOT value. The problem is that it’s quite difficult to get hold of AOT values. There are some ground measurements sites – but only about 300 of them across the whole world. Therefore, quite a lot of people use measurements of atmospheric visibility as a proxy for AOT. This has many benefits, as loads of these measurements are taken (by airports, and local meteorological organisations), but is a bit scientifically questionable, because atmospheric visibility is measured horizontally (as in “I can see for miles!”) and AOT is measured vertically. There are various ‘standard’ ways of estimating AOT from visibility – and some of these are built in to tools that do atmospheric correction of images – and I wanted to investigate how well these worked.

I used a few different datasets which had both visibility and AOT measurements, collected at the same time and place, and investigated the relationship. I found that the relationship was often very poor – and the error in the estimated AOT was never less than half of the mean AOT value (that is, if the mean AOT was 0.2, then the error would be 0.1 – not great!), and sometimes more than double the mean value! Simulating the effect on atmospheric correction showed that significant errors could result – and I recommended that visibility-derived AOT data should only be used for atmospheric correction as a last resort.

Key conclusions

  • Estimation of AOT from horizontal visibility measurements can produce significant errors.
  • Radiative transfer simulations using different models (eg. MODTRAN and 6S) with the same visibility may produce significantly different results due to the differing methods used for estimating AOT from visibility
  • Errors can be significant for both radiance values (significantly larger than the noise level of the sensor) and vegetation indices such as NDVI and ARVI.
  • Overall: other methods for estimating AOT should be used wherever possible – as they nearly all have smaller errors than visibility-based estimates – and great care should be taken at low visibilities, when the error is even higher.

Key results

  • Error in visibility-derived AOT is highest at low visibilities
  • Root Mean Square Error ranges from 1.05 for visibilities < 10km to 0.05 for visibilities > 40km
  • The error for low visibilities is many times the mean AOT at those visibilities (for example, an average error of 0.76 for visibilities < 10km, when the average AOT is only 0.38!)
  • Overall, MODTRAN appears to perform poorly compared to the other methods (6S and the Koschmieder formula) – and this is particularly pronounced at low visibilities
  • Atmospheric correction with these erroneously-estimate AOT values can produce significant errors in radiance, which range from three times the Noise Equivalent Delta Radiance to over thirty times the NEDR!
  • There can still be significant errors in vegetation indices (NDVI/ARVI), of up to 0.12 and 0.09 respectively.

History and Comments

This work developed from one of my previous papers (Spatial variability of the atmosphere over southern England, and its effects on scene-based atmospheric corrections). In that paper I investigated a range of methods for estimating AOT – and one of these was by estimating it from from visibility.

I had always been a bit frustrated that lots of atmospheric correction tools required visibility as an input parameter – and wouldn’t allow you enter AOT even if you had an actual AOT measurement (eg. from an AERONET site or a Microtops measurement). I started to wonder about the error involved in the use of visibility rather than AOT – and did a very brief assessment of the accuracy as part of my investigation for the spatial variability paper. An extension of that accuracy assessment turned into this paper.

Data, Code & Methods

The good news is that the analysis performed for this paper was designed from the beginning to be reproducible. I used the R programming language, and the ProjectTemplate package to make this really nice and easy. All of the code is available on Github, and the README file there explains that all you need to do to reproduce the analysis is run:


to initialise everything, install all of the packages etc. You can then run any of the files in the src directory to reproduce a specific analysis.

That’s all good news, so what is the bad news? Well, the problem with reproducing this work is that you need access to the data – and most of the data I used is only available to academics, or cannot be ‘rehosted’ by me. There are instructions in the repository showing how to get hold of the data, but it’s quite a complex process which requires registering with the British Atmospheric Data Centre, requesting access to datasets, and then downloading the various pieces of data.

This is a real pain, but there’s absolutely nothing I can do about it – sorry! I was really hoping I could use this as an example of reproducible research…but the data access situation makes that a bit difficult.

Previously Unpublicised Code: PyMicrotops

Continuing my series of code that I’ve written in the past, and stuck up on Github, but never actually talked about…this post is about PyMicrotops: a Python library for processing data from the Microtops Sun Photometer.


The Microtops (pictured above) measures light coming from the sun in a number of narrow wavebands, and then calculates various atmospheric parameters including the Aerosol Optical Thickness (AOT) and Precipitable Water Content (PWC). The instrument has it’s own built-in data logger, and the stored data can be downloaded by a serial connection to a computer (yes, it’s a fairly old – but very good – piece of kit!).

I’ve done a lot of work with Microtops instruments over the last few years, and put together the PyMicrotops library to help me. It’s functionality can be split into two main parts: downloading Microtops data from the instrument, and then processing the data itself.

Downloading the data from the Microtops instrument is difficult to do these days – as the original software provided by the manufacturer seems to no longer work. However, with PyMicrotops it is easy – just run read_microtops from the terminal, or the following snippet of Python code:

from PyMicrotops import Microtops
m = Microtops.read_from_serial('COM3', 'output_filename.csv')

Changing the port (you may want something like /dev/serial0 on Linux/Mac) and filename as appropriate).

Once you’ve read the data you can process it nice and easily in Python. If you read it using the Python code snippet above then you’ll already have the m object created, but if not then run the code below:

from PyMicrotops import Microtops

m = Microtops('filename.csv')

You can then do a range of useful tasks with the data, for example:

# Plot all of the AOT data
# Plot for a specific time period
# Get AOT at a specific wavelength - interpolating
# if needed, using the Angstrom coefficient

All outputs are returned as Pandas Series or DataFrames, and if you want to do more advanced processing then you can access the raw data read from the file as

PyMicrotops can be installed from PyPI by running pip install PyMicrotops, and the code is available on Github, which is also where any bug reports/feature requests should be posted.

My 2015 Python life

My last post about my favourite ‘new’ (well, new to me) Python packages seemed to be very well received. I’ll post a ‘debrief’ post within the next few weeks, reflecting on the various comments that were made on Hacker News, Reddit and so on, but before that I want to post a slightly more personal post about my work with Python during the last year. It’s been quite a big year for me in terms of my Python work – so lets get started!

My packages

At the beginning of 2015 I had four modules published on PyPI; now I have seven! Lets have a look at each of them in turn:


Py6S is my Python interface to the 6S Radiative Transfer Model (a model used to simulate how light passes through the atmosphere). It is a fairly specialist tool so will never get record numbers of downloads, but I’ve been impressed with number of users it has gained over the last year. It’s taken a while, but it seems to have started being used routinely by a range of scientists at various organisations including NASA!

The major change to Py6S in 2015 has been adding support for Python 3. This was actually meant to have been done in July 2014, but I think managed to make some sort of awful mistake with Git and revert the wrong commit… Anyway, it’s all working for Python 2 and Python 3 now! Most of the rest of the changes have just been small bugfixes, adding tests and so on.

Downloads in last month: ~1700


daterangeparser is a module to parse, you guessed it…date ranges. For example, it will take something like ’20th – 29th July 2014′ and create two datetime objects to represent the start and end dates. I first released it in April 2012 and it reached 1.0 in October 2013. I haven’t used it much recently, but other people seem to have found it and started using it – and submitting bug reports and pull requests! This really shows one of the huge benefits of open-source code: I’ve done very little with the code, but I’ve had nine pull requests adding new functionality and fixing bugs, which is great! I’m really glad that others can benefit from it.

Downloads in last month: ~1700


recipy is a new module for reproducible research in Python which was first released in August 2015. I’ve explained the story behind recipy elsewhere, but in brief: I created the first version at the Collaborations Workshop 2015 Hack Day (which my team won!), and we worked to improve it, before presenting at EuroSciPy 2015. The best way to find out how it works is to watch the video of my talk at EuroSciPy 2015.

Anyway, recipy was the first time that anything I had created went ‘viral’ (well, in a relatively minor way!). There were a lot of tweets about it, and a good discussion on reddit, and this led to a lot of downloads, quite a few contributions, and even a Linux Journal article! Overall there have been over 200 commits, 7 merged PRs, 52 closed issues and 7 released versions – not bad for a module released only six months ago!

Downloads in the last month: ~8800


I’ve already written a post about my implementation of the van Heuklon Ozone model – and this year I turned it into a proper package and released it on PyPI. Basically, I just got fed up with having to install it from git on new machines, as a lot of my scientific code requires it! It’s not particularly impressive – but it is useful to get a simple (computationally efficient) estimation of atmospheric ozone concentrations.

The only real change that I did this year was to make a conda package for it – see my previous post for details.

Downloads in the last month: ~110


Two other modules of mine have been ticking along with no work done to them: PyProSAIL (an interface to the ProSAIL model of canopy reflectance) and bib2coins (a tool to convert from BibTeX to COINS metadata for inclusion on webpages). Whenever I’ve used them they’ve worked fine for me – and I assume that either no-one else is using them, or that everything is working fine for them too (probably the former!).

Downloads in the last month: ~260 and ~100 respectively


For the first time this year I attended a conference designed entirely for programmers (all the conferences I’ve attended before have been for academics)…and it was great! I really enjoyed myself at EuroSciPy 2015: lots of the talks were interesting, my talk seemed to go down well (I won one of the Best Talk awards!), I managed to contribute usefully during the sprints, I met loads of interesting people…and even took my wife with me and introduced her to the Python programming community (you can see us both in the photo below). She was very proud to contribute a couple of PRs to the scikit-image project.


My wife and I at the EuroSciPy skimage sprint (photo by Juan Nunez-Iglesias)

Things I’ve (finally) started using…

This year I have finally started doing various things that I should have been doing for many years. A non-exhaustive list includes:

  • I’m now using Python 3 by default (hooray!). I can blame one of my colleagues in the Flowminder team for this – he insisted that we should write our code for mobile phone data analysis in Python 3 (not even trying to support Python 2) and I agreed, and now haven’t looked back! I’m now trying to ensure that as much of my code as possible (and definitely all of my code released to PyPI) supports Python 3…and I always start with Python 3 by default. The only code that I haven’t yet moved to Python 3 is the main algorithm I developed during my PhD…not because it can’t be converted, but just because I’ve got more important things to do to the code at the moment, and as I’m the only person using it (at the moment…) it isn’t a major problem.
  • I’m writing code that conforms to PEP8…at long last! I was doing quite a lot of the ‘really stupidly obvious’ bits of PEP8 anyway, but I’m now trying to stick very closely to PEP8, and it has made a noticeable difference in terms of the readability of my code. I have chosen to ignore a few bits of PEP8 (such as the 80 character line length, which I find is too short by the time you’ve had a couple of levels of indentation), but overall I like it. I use the flake8 tool to check this, and the linter and linter-flake8 packages for Atom to automatically check it while I’m coding. I must admit that most of this is my wife’s ‘fault’, as she was introduced to PEP8 as a non-negotiable requirement for submitting PRs to skimage while she was at EuroSciPy, and never considered that there was another way to write Python code (I sometimes wondered if she thought Python would raise a NotPEP8Compliant exception if she didn’t use PEP8!), and so – like a lot of good things – I picked it up from her.
  • I’ve written a lot of Python code this year, and for the first time a lot of this code has been written in a team – so I’ve learnt a lot about how to write maintainable, understandable code – and what problems will come back and bite me later and should really be sorted at the beginning.
  • I’ve started blogging a lot more, and some of my blog posts have actually become quite popular (particularly my post about my favourite 5 ‘new’ Python packages). I’ve also started a series called Previously Unpublicised Code, in which I talk about various pieces of code that I’ve put on Github over the years but have never really publicised or talked about at all. This has forced me to tidy up a lot of my code on Github – and I’ve made it a rule that I won’t post about any code unless it has a README, a LICENSE, and is mostly PEP8-compliant.

So, that’s mostly what I’ve been up-to in 2015 from a Python perspective – what have you been doing in Python in 2015?