Robin's Blog

Hacking the Worcester Wave thermostat in Python – Part 1

When we bought a new boiler last year, we decided to install a ‘smart thermostat’. There are a wide range available these days, including the Google Nest, the Hive (from British Gas), and the Worcester Bosch ‘Wave’. As we had a Worcester Bosch boiler we got the Wave – and it wasn’t much more expensive than a standard programmer/thermostat.

The thermostat looks like this:

Bosch-Wave-small

It is mounted on the wall at the bottom of our stairs, and is connected by a cable to the boiler. As you can see, it has a simple touch-sensitive interface: you can increase or decrease the thermostat setting, see the current temperature and the current setting, change from manual to program mode, and see whether the boiler is on or not. For anything more advanced (such as setting the programmable timer) you have to use the mobile app, which looks like this:
screen568x568

Looks familiar, doesn’t it? You have the same simple interface on the app ‘home screen’, but you can also configure the more advanced settings, such as the timer programme:

screen568x568 (1)

Overall, I’m pleased with the Wave thermostat: it is far easier to configure the timer settings on a phone app than on a tiny fiddly controller, and there are all sorts of advanced features you can use (for example, ‘optimisation’, where the system gradually learns how long it takes your house to warm up, and turns the heating on at the right time to get it to the temperature you want at the time you want) – combined with being able to control the heating when away from home.

Anyway, of course one of the first things that I looked for once it had been installed was an API that I could use to monitor and control the thermostat from my home server, ideally through a Python script. Unfortunately there was nothing about an API on the Worcester Bosch website, and when I phoned Customer Services they told me than there was no API available. So, I thought I’d try and reverse engineer the protocol that was used, and see if I could hack together a Python interface.

How does the system overall work?

When starting to investigate this, my a priori understanding of the system was that the app must send messages of some sort to a remote server (presumably run or controlled by Bosch), which then forwards the messages on to the thermostat itself, and vice-versa. This is because you can access the thermostat via the app wherever you are: you don’t have to be on the same wireless network. Whatever protocol or ports that are used must be ones that can reliably get through home firewalls so that the remote server can actually talk to the thermostat.

What technologies do they use?

When exploring the app I had noticed that there was a Product Information screen which showed information about the software and hardware versions, and included a button labelled Worcester Wave uses Open Source software. Tapping this shows a list of open-source packages that are used by the system, including their various license agreements. The list consisted of:

  • AndroidPlot – a plotting library for Android
  • Guava – Google Java core libraries
  • XMP Toolkit – eXtensible Metadata Platform
  • Smack – a Java XMPP (also known as Jabber) chat protocol implementation
  • JSR305 Expert Group – Annotations for Software Defect Detection in Java
  • Lucent technologies – Not sure, couldn’t find this one!
  • Chromium – I’m not sure why the Chromium web-browser was used
  • Takayua OOURA – Ooura’s Mathematical Software
  • Eigen software – matrix manipulation tools
  • libresample – resampling of data (primarily audio, but presumably other datatypes too)

So, looking at that list suggests that the eXtensible Messaging and Presence Protocol (XMPP) is used for communication, potentially with some metadata embedded with the eXtensible Metadata Platform. The rest of the libraries aren’t very relevant to our work, but are interesting to see (I suspect the mathematical ones are for doing the maths for some of the advanced features like optimisation).

What is being sent, and where to?

So, we think the communication is using XMPP – now we need to confirm that, and work out what is being sent and where it is being sent to. So, I opened up Wireshark to start sniffing the traffic. After playing around a bit with the filters, I got this (click to enlarge):

Screen Shot 2016-01-02 at 22.45.15

This shows that my guess was correct: the XMPP protocol is being used to send messages between the phone running the Wave Android app (192.168.0.42 on my local network) and a server run by Bosch (wa2-mz36-qrmzh6.bosch.de). So, we now know what protocols we are dealing with: that is the good news.

However, the bad news is the STARTTLS messages that we see there. You may recognise this from email configuration dialog boxes, as it is one of the security options for connecting to POP3/IMAP/SMTP servers. It stands for ‘Start Transport Layer Security’, and is basically a way of saying “I want to encrypt this conversation from now onwards, is that ok?” (for reference, the alternative is to do everything with encryption right from the start of the communication). Anyway, in this case the Bosch server responds with PROCEED (“Yes, I’m happy to do this encrypted”). From that point onwards, we see the TLS security negotiation (“What sort of encryption do you support?”, “I support X, Y and Z”, “Ok, lets use Z”, “Here are my keys” etc) followed by a lot of ‘Application Data’ messages:

Screen Shot 2016-01-02 at 22.54.06

We can’t actually see any of the content of the messages, as they are encrypted, so all we get is the raw hex dump: 80414a90ca64968de3a0acc5eb1b50108bbc5b26973626daaa….(and so on). Not very useful!

I think this is a good place to finish Part 1 of the story. We have worked out that communication goes from the app to a Bosch server, and it uses the XMPP protocol, but with TLS encryption using STARTTLS, so we haven’t been able to work out what any of the messages actually contain. Come back for Part 2 to hear how I managed to read the messages…


My MSc project: a simple ray-tracing radiative transfer model (with code!)

My PhD was done through a Doctoral Training Centre, and as part of this I had a taught year at the beginning of my PhD. During the summer of this year I had to do a ‘Summer Project’, which was basically a MSc thesis. My thesis was called “Can a single cloud spoil the view?”: modelling the effect of an isolated cumulus cloud on calculated surface solar irradiance” (I think the best thing about my thesis was the title!). It gained a ‘Distinction’ mark, and I was very pleased that it won the RSPSoc MSc thesis prize (again, I think they gave me the prize based on the title alone!). This is the story of this project…

Paths taken by light in the ray-tracing Radiative Transfer Model I developed for my MSc

Paths taken by light in the ray-tracing Radiative Transfer Model I developed for my MSc

During the year before my MSc, I had been doing a lot of investigation into Radiative Transfer Modelling. These models (known as RTMs) are used to estimate how light travels through the atmosphere: you basically tell them what the atmosphere is like (how hazy, how much water vapour etc), and they simulate the effect on light passing through the atmosphere.

I’d investigated a range of models – including simple models like SPCTRL2, good standard models like 6S, and advanced models like SCIATRAN. One thing that I noticed was that none of these models seemed to be able to incorporate clouds very well. SCIATRAN did support modelling clouds, but you had to have either an entirely clear or entirely overcast sky – you could have vertical variation in optical properties (ie. multiple layers of clouds) but not horizontal variation. I decided that I’d write some code to use one of these RTMs to model light passing through an atmosphere which is partially cloudy and partially clear – like most of the skies that we have in the UK.

So, I started reading more about these RTMs, and tried to work out how I might be able to extend them so that I could have a spatially-variable amount of cloud (eg. cloud on the left-hand side of my ‘simulated atmosphere’ and clear sky on the right-hand side – or even a single small cloud moving across the sky). I struggled with this for a while and didn’t really have much success at working out what I’d need to do. I naively thought that I would be able to write a tiny bit of code to replace the ‘homogenous atmosphere’ that SCIATRAN used with a ‘spatially-variable atmosphere’ and that would be it. Unfortunately, that wasn’t the case…

It turns out that the horizontally-homogeneous nature of the atmosphere is a fundamental assumption of the mathematics that these models are based upon. And I’d come up with a MSc project that required me to break this assumption. Ooops.

So – after panicking a bit and deciding that I can’t do a PhD or even a MSc – I decided that if the standard models wouldn’t let me do what I wanted then I’d have to implement a new Radiative Transfer Model myself, from scratch! A bit of a tall order…but I decided to have a go, figuring that even if it didn’t work I’d learn a lot in the process.

With a bit more reading I realised that the only way to take into account horizontally-varying atmospheric properties would be to implement a ray-tracing model. In this sort of model, individual rays of light are followed from their source (the sun), and each individual interaction with the atmosphere is simulated (scattering, absorption etc). They are far more time-consuming to run than ‘standard’ models – but allow far more interesting parameterisations, as the atmosphere is basically represented by a grid, which can be ‘filled’ with different atmospheric properties in each cell. Normally the grid would be a 3D grid, but I decided to use a 2D grid to make life easier – just taking into account vertical height into the atmosphere (sun at the top, sensor at the bottom) and horizontal distance along the ground.

I decided to base my model on SPECTRL2, which meant I could use a lot of the constants that were defined in the SPECTRL2 code – for things like Earth-sun distances, absorption at different wavelengths, etc. I then built a simple Monte Carlo Ray Tracing procedure that followed the flowchart below (click to enlarge):

Flowchart

I say “simple” – it took quite a while to get it working properly! Of course, after actually getting the model working, I had to work out how to actually parameterise the atmosphere, how to create a realistic cloud to put in this atmosphere, and then what computational experiments to run to actually answer my original question.

Answering the actual question I set out to answer was actually very easy to do once I’d got the model running – and it gave some interesting answers. For example, if there is a cloud near a sensor on the ground, but not directly between the sensor and sun, then the sensor will actually record a higher irradiance than it would if there were no cloud. This makes perfect sense when you think about it (the cloud will scatter some extra light towards the sensor), but I hadn’t thought of it before I did the experiment!

I don’t think there’s really much more to say here, apart from to direct you to my thesis for more information (including my detailed conclusions) and to say that the code is now available on Github. I have tested the code in the latest version of IDL (v8.5) and it is still working – although I should warn you that the code runs rather slowly!

Also, if you’re interested in using a proper 3D ray tracing RTM then look into the Discrete Anisotropic Radiative Transfer Model (DART). I’ve always been meaning to replicate my experiments with DART, but have never quite had the time…that’s another thing for my ‘list of papers to publish when I have time’.


Deprecating AutoZotBib

One of the most important things when developing software – particularly open-source software – is knowing when to stop working on something, relinquish responsibility and suggest that it should not be used any more. This is what I have recently done with AutoZotBib.

For those who don’t remember AutoZotBib, it is a tool that I first wrote a few years ago to automatically export Zotero bibliography libraries to the BibTeX format (used for LaTeX documents). I used it during the writing of my PhD, but there have always been some fairly significant problems with it. The main problem was performance: if you had a reasonably large Zotero library, then adding a new item would freeze the library for at least thirty seconds, while the export was performed. In fact, the Zotero developers added a warning to the plugins list page saying that my plugin may slow Zotero down significantly if you use it with a large library!

Over the years I did various updates to try and make this work better – and, in fact, I wrote most of a new version that did intelligent caching and only wrote out the bits that had changed…but I never quite got it fully working and never found the time to push it through to completion.

I haven’t had any time to work on it recently (the last commit was February 2014), and I haven’t even had the time to respond to the occasional support emails that I receive. I’d been thinking of dropping support for a while, but I wasn’t aware of anything else that did the same job, and I always had a hope that maybe, one day, I might find time to work on it.

Two things snapped me out of this, and made me accept that I needed to drop it:

  1. The latest version of Firefox requires all plugins to be signed, and as Zotero is a Firefox plugin (although it can run in standalone mode too), any Zotero plugins will also have to be signed. I tried to get AutoZotBib signed, but a lot has changed in the Firefox extension world since I first wrote it, and I couldn’t get it to work.
  2. I’ve found another plugin that does similar things, but does them better – and is well-maintained! It’s called Zotero Better BibTeX, and its Push Export feature does exactly what AutoZotBib did, but allows you to configure it to only export specific collections, allows export only when the computer is idle, etc.

So, I took a deep breath and updated the website to say that AutoZotBib is deprecated, and marked it as such on the Zotero Plugins page, suggesting that users use BetterBibTeX instead. Doing this has really taken a load off my mind – and I am now happy that I don’t have to try and motivate myself to work on it. Of course, the beauty of open-source software is that the code is still available, and so if anyone else wants to pick it up (or is desperate to use it, even though it is no longer supported) then they can!


Behind the paper: Are visibility-derived AOT estimates suitable for parameterising satellite data atmospheric correction algorithms?

This has been a bit slow coming, but I am now sticking to my promise to write a Behind the paper post for each of my published academic papers. This is about:

Wilson, R. T., E. J. Milton, and J. M. Nield (2015). Are visibility-derived AOT estimates suitable for parameterising satellite data atmospheric correction algorithms? International Journal of Remote Sensing 36 (6) 1675-1688

Publishers Link
Post-print PDF

Screen Shot 2015-10-22 at 21.17.03

Layman’s summary

Aerosol Optical Thickness (AOT) is a measure of how hazy the atmosphere is – that is, how easy it is for light to pass through it. It’s important to measure this accurately, as it has a range of applications. In this paper we focus on satellite image atmospheric correction. When satellites collect images of the Earth, the light has to pass through the atmosphere twice (once on the way from the sun to the Earth, once again on the way back from the Earth to the satellite) and this affects the light significantly. For example, it can be scattered or absorbed by various atmospheric constituents. We have to correct for this before we can use satellite images – and one of the ways to do that is to simulate what happens to light under the atmospheric conditions at the time, and then use this simulated information to remove the effects from the image.

To do this simulation we need various bits of information on what the atmosphere was like when the image was acquired – and one of these is the AOT value. The problem is that it’s quite difficult to get hold of AOT values. There are some ground measurements sites – but only about 300 of them across the whole world. Therefore, quite a lot of people use measurements of atmospheric visibility as a proxy for AOT. This has many benefits, as loads of these measurements are taken (by airports, and local meteorological organisations), but is a bit scientifically questionable, because atmospheric visibility is measured horizontally (as in “I can see for miles!”) and AOT is measured vertically. There are various ‘standard’ ways of estimating AOT from visibility – and some of these are built in to tools that do atmospheric correction of images – and I wanted to investigate how well these worked.

I used a few different datasets which had both visibility and AOT measurements, collected at the same time and place, and investigated the relationship. I found that the relationship was often very poor – and the error in the estimated AOT was never less than half of the mean AOT value (that is, if the mean AOT was 0.2, then the error would be 0.1 – not great!), and sometimes more than double the mean value! Simulating the effect on atmospheric correction showed that significant errors could result – and I recommended that visibility-derived AOT data should only be used for atmospheric correction as a last resort.

Key conclusions

  • Estimation of AOT from horizontal visibility measurements can produce significant errors.
  • Radiative transfer simulations using different models (eg. MODTRAN and 6S) with the same visibility may produce significantly different results due to the differing methods used for estimating AOT from visibility
  • Errors can be significant for both radiance values (significantly larger than the noise level of the sensor) and vegetation indices such as NDVI and ARVI.
  • Overall: other methods for estimating AOT should be used wherever possible – as they nearly all have smaller errors than visibility-based estimates – and great care should be taken at low visibilities, when the error is even higher.

Key results

  • Error in visibility-derived AOT is highest at low visibilities
  • Root Mean Square Error ranges from 1.05 for visibilities < 10km to 0.05 for visibilities > 40km
  • The error for low visibilities is many times the mean AOT at those visibilities (for example, an average error of 0.76 for visibilities < 10km, when the average AOT is only 0.38!)
  • Overall, MODTRAN appears to perform poorly compared to the other methods (6S and the Koschmieder formula) – and this is particularly pronounced at low visibilities
  • Atmospheric correction with these erroneously-estimate AOT values can produce significant errors in radiance, which range from three times the Noise Equivalent Delta Radiance to over thirty times the NEDR!
  • There can still be significant errors in vegetation indices (NDVI/ARVI), of up to 0.12 and 0.09 respectively.

History and Comments

This work developed from one of my previous papers (Spatial variability of the atmosphere over southern England, and its effects on scene-based atmospheric corrections). In that paper I investigated a range of methods for estimating AOT – and one of these was by estimating it from from visibility.

I had always been a bit frustrated that lots of atmospheric correction tools required visibility as an input parameter – and wouldn’t allow you enter AOT even if you had an actual AOT measurement (eg. from an AERONET site or a Microtops measurement). I started to wonder about the error involved in the use of visibility rather than AOT – and did a very brief assessment of the accuracy as part of my investigation for the spatial variability paper. An extension of that accuracy assessment turned into this paper.

Data, Code & Methods

The good news is that the analysis performed for this paper was designed from the beginning to be reproducible. I used the R programming language, and the ProjectTemplate package to make this really nice and easy. All of the code is available on Github, and the README file there explains that all you need to do to reproduce the analysis is run:

source('go.r')

to initialise everything, install all of the packages etc. You can then run any of the files in the src directory to reproduce a specific analysis.

That’s all good news, so what is the bad news? Well, the problem with reproducing this work is that you need access to the data – and most of the data I used is only available to academics, or cannot be ‘rehosted’ by me. There are instructions in the repository showing how to get hold of the data, but it’s quite a complex process which requires registering with the British Atmospheric Data Centre, requesting access to datasets, and then downloading the various pieces of data.

This is a real pain, but there’s absolutely nothing I can do about it – sorry! I was really hoping I could use this as an example of reproducible research…but the data access situation makes that a bit difficult.


Previously Unpublicised Code: PyMicrotops

Continuing my series of code that I’ve written in the past, and stuck up on Github, but never actually talked about…this post is about PyMicrotops: a Python library for processing data from the Microtops Sun Photometer.

Microtops

The Microtops (pictured above) measures light coming from the sun in a number of narrow wavebands, and then calculates various atmospheric parameters including the Aerosol Optical Thickness (AOT) and Precipitable Water Content (PWC). The instrument has it’s own built-in data logger, and the stored data can be downloaded by a serial connection to a computer (yes, it’s a fairly old – but very good – piece of kit!).

I’ve done a lot of work with Microtops instruments over the last few years, and put together the PyMicrotops library to help me. It’s functionality can be split into two main parts: downloading Microtops data from the instrument, and then processing the data itself.

Downloading the data from the Microtops instrument is difficult to do these days – as the original software provided by the manufacturer seems to no longer work. However, with PyMicrotops it is easy – just run read_microtops from the terminal, or the following snippet of Python code:

from PyMicrotops import Microtops
m = Microtops.read_from_serial('COM3', 'output_filename.csv')

Changing the port (you may want something like /dev/serial0 on Linux/Mac) and filename as appropriate).

Once you’ve read the data you can process it nice and easily in Python. If you read it using the Python code snippet above then you’ll already have the m object created, but if not then run the code below:

from PyMicrotops import Microtops

m = Microtops('filename.csv')

You can then do a range of useful tasks with the data, for example:

# Plot all of the AOT data
m.plot()
# Plot for a specific time period
m.plot('2014-07-10','2014-07-19')
# Get AOT at a specific wavelength - interpolating
# if needed, using the Angstrom coefficient
m.aot(550)

All outputs are returned as Pandas Series or DataFrames, and if you want to do more advanced processing then you can access the raw data read from the file as m.data.

PyMicrotops can be installed from PyPI by running pip install PyMicrotops, and the code is available on Github, which is also where any bug reports/feature requests should be posted.


My 2015 Python life

My last post about my favourite ‘new’ (well, new to me) Python packages seemed to be very well received. I’ll post a ‘debrief’ post within the next few weeks, reflecting on the various comments that were made on Hacker News, Reddit and so on, but before that I want to post a slightly more personal post about my work with Python during the last year. It’s been quite a big year for me in terms of my Python work – so lets get started!

My packages

At the beginning of 2015 I had four modules published on PyPI; now I have seven! Lets have a look at each of them in turn:

Py6S

Py6S is my Python interface to the 6S Radiative Transfer Model (a model used to simulate how light passes through the atmosphere). It is a fairly specialist tool so will never get record numbers of downloads, but I’ve been impressed with number of users it has gained over the last year. It’s taken a while, but it seems to have started being used routinely by a range of scientists at various organisations including NASA!

The major change to Py6S in 2015 has been adding support for Python 3. This was actually meant to have been done in July 2014, but I think managed to make some sort of awful mistake with Git and revert the wrong commit… Anyway, it’s all working for Python 2 and Python 3 now! Most of the rest of the changes have just been small bugfixes, adding tests and so on.

Downloads in last month: ~1700

daterangeparser

daterangeparser is a module to parse, you guessed it…date ranges. For example, it will take something like ’20th – 29th July 2014′ and create two datetime objects to represent the start and end dates. I first released it in April 2012 and it reached 1.0 in October 2013. I haven’t used it much recently, but other people seem to have found it and started using it – and submitting bug reports and pull requests! This really shows one of the huge benefits of open-source code: I’ve done very little with the code, but I’ve had nine pull requests adding new functionality and fixing bugs, which is great! I’m really glad that others can benefit from it.

Downloads in last month: ~1700

recipy

recipy is a new module for reproducible research in Python which was first released in August 2015. I’ve explained the story behind recipy elsewhere, but in brief: I created the first version at the Collaborations Workshop 2015 Hack Day (which my team won!), and we worked to improve it, before presenting at EuroSciPy 2015. The best way to find out how it works is to watch the video of my talk at EuroSciPy 2015.

Anyway, recipy was the first time that anything I had created went ‘viral’ (well, in a relatively minor way!). There were a lot of tweets about it, and a good discussion on reddit, and this led to a lot of downloads, quite a few contributions, and even a Linux Journal article! Overall there have been over 200 commits, 7 merged PRs, 52 closed issues and 7 released versions – not bad for a module released only six months ago!

Downloads in the last month: ~8800

vanHOzone

I’ve already written a post about my implementation of the van Heuklon Ozone model – and this year I turned it into a proper package and released it on PyPI. Basically, I just got fed up with having to install it from git on new machines, as a lot of my scientific code requires it! It’s not particularly impressive – but it is useful to get a simple (computationally efficient) estimation of atmospheric ozone concentrations.

The only real change that I did this year was to make a conda package for it – see my previous post for details.

Downloads in the last month: ~110

Others

Two other modules of mine have been ticking along with no work done to them: PyProSAIL (an interface to the ProSAIL model of canopy reflectance) and bib2coins (a tool to convert from BibTeX to COINS metadata for inclusion on webpages). Whenever I’ve used them they’ve worked fine for me – and I assume that either no-one else is using them, or that everything is working fine for them too (probably the former!).

Downloads in the last month: ~260 and ~100 respectively

Conferences

For the first time this year I attended a conference designed entirely for programmers (all the conferences I’ve attended before have been for academics)…and it was great! I really enjoyed myself at EuroSciPy 2015: lots of the talks were interesting, my talk seemed to go down well (I won one of the Best Talk awards!), I managed to contribute usefully during the sprints, I met loads of interesting people…and even took my wife with me and introduced her to the Python programming community (you can see us both in the photo below). She was very proud to contribute a couple of PRs to the scikit-image project.

CNg3PuJWgAAj0Qq

My wife and I at the EuroSciPy skimage sprint (photo by Juan Nunez-Iglesias)

Things I’ve (finally) started using…

This year I have finally started doing various things that I should have been doing for many years. A non-exhaustive list includes:

  • I’m now using Python 3 by default (hooray!). I can blame one of my colleagues in the Flowminder team for this – he insisted that we should write our code for mobile phone data analysis in Python 3 (not even trying to support Python 2) and I agreed, and now haven’t looked back! I’m now trying to ensure that as much of my code as possible (and definitely all of my code released to PyPI) supports Python 3…and I always start with Python 3 by default. The only code that I haven’t yet moved to Python 3 is the main algorithm I developed during my PhD…not because it can’t be converted, but just because I’ve got more important things to do to the code at the moment, and as I’m the only person using it (at the moment…) it isn’t a major problem.
  • I’m writing code that conforms to PEP8…at long last! I was doing quite a lot of the ‘really stupidly obvious’ bits of PEP8 anyway, but I’m now trying to stick very closely to PEP8, and it has made a noticeable difference in terms of the readability of my code. I have chosen to ignore a few bits of PEP8 (such as the 80 character line length, which I find is too short by the time you’ve had a couple of levels of indentation), but overall I like it. I use the flake8 tool to check this, and the linter and linter-flake8 packages for Atom to automatically check it while I’m coding. I must admit that most of this is my wife’s ‘fault’, as she was introduced to PEP8 as a non-negotiable requirement for submitting PRs to skimage while she was at EuroSciPy, and never considered that there was another way to write Python code (I sometimes wondered if she thought Python would raise a NotPEP8Compliant exception if she didn’t use PEP8!), and so – like a lot of good things – I picked it up from her.
  • I’ve written a lot of Python code this year, and for the first time a lot of this code has been written in a team – so I’ve learnt a lot about how to write maintainable, understandable code – and what problems will come back and bite me later and should really be sorted at the beginning.
  • I’ve started blogging a lot more, and some of my blog posts have actually become quite popular (particularly my post about my favourite 5 ‘new’ Python packages). I’ve also started a series called Previously Unpublicised Code, in which I talk about various pieces of code that I’ve put on Github over the years but have never really publicised or talked about at all. This has forced me to tidy up a lot of my code on Github – and I’ve made it a rule that I won’t post about any code unless it has a README, a LICENSE, and is mostly PEP8-compliant.

So, that’s mostly what I’ve been up-to in 2015 from a Python perspective – what have you been doing in Python in 2015?


My top 5 ‘new’ Python modules of 2015

As I’ve been blogging a lot more about Python over the last year, I thought I’d list a few of my favourite ‘new’ Python modules from 2015. These aren’t necessarily modules that were newly released in 2015, but modules that were ‘new to me’ this year – and may be new to you too!

tqdm

This module is so simplbut so useful – it makes it stupidly easy to display progress bars for loops in your code. So, if you have some code like this:

for item in items:
    process(item)

Just wrap the iterable in the tqdm function:

from tqdm import tqdm

for item in tqdm(items):
    process(item)

and you’ll get a beautiful progress bar display like this:
18%|█████████                               | 9/50 [00:09<;00:41,  1.00it/s

I introduced my wife to this recently and it blew her mind – it’s so beautifully simple, and so useful.

joblib

I had come across joblib previously, but I never really ‘got’ it – it seemed a bit of a mismash of various functions. I still feel like it’s a bit of a mishmash, but it’s very useful. I was re-introduced to it by one of my Flowminder colleagues, and we used it extensively in our data analysis code. So, what does it do? Well, three main things: 1) caching, 2) parallelisation, and 3) persistence (saving/loading data). I must admit that I haven’t used the parallel programming functionality yet, but I have used the other functions extensively.

The caching functionality allows you to easily ‘memoize’ functions with a simple decorator. This caches the results, and loads them from the cache when calling the function again using the same parameters – saving a lot of time. One tip for this is to choose the arguments of the function that you memoize carefully: although joblib uses a fairly fast hashing function to compare the arguments, it can still take a while if it is processing an absolutely enormous array (many Gigabytes!). In this case, it is often better to memoize a function that takes arguments of filenames, dates, model parameters or whatever else is used to create the large array – cutting out the loading of the large array and the hashing of that array on each call.

The persistence functionality is strongly linked to the memoization functions – as it is what is used to save the cached results to file. It basically performs the same function as the built-in pickle module (or the dill module – see below), but works really efficiently for objects that contain numpy arrays. The interface is exactly the same as the pickle interface (simple load and dump functions), so it’s really easy to switch over. One thing I didn’t realise before is that if you set compressed=True then a) your output files will be smaller (obviously!) and b) the output will all be in one file (as opposed to the default, which produces a .pkl file along with many .npy files).

folium

folium

I’ve barely scratched the surface of this library, but it’s been really helpful for doing quick visualisations of geographic data from within Python – and it even plays well with the Jupyter Notebook!

One of the pieces of example code from the documentation shows how easily it can be used:

map_1 = folium.Map(location=[45.372, -121.6972])
map_1.simple_marker([45.3288, -121.6625], popup='Mt. Hood Meadows')
map_1.simple_marker([45.3311, -121.7113], popup='Timberline Lodge')
map_1.create_map(path='output.html')

You can easily configure almost every aspect of the map above, including the background map used (any leaflet tileset will work), point icons, colours, sizes and pretty-much anything else. You can visualise GeoJSON data and do choropleth maps too (even linking to Pandas data frames!).

Again, I used this in my work with Flowminder, but have since used it in all sorts of other contexts too. Just taking the code above and putting the call to simple_marker in a loop makes it really easy to visualise a load of points.

The example above shows how to save a map to a specified HTML file – but to use it within the Jupyter Notebook just make sure that the map object (map_1 in the example above) is by itself on the final line in a cell, and the notebook will work its magic and display it inline…perfect!

tinydb

The first version of my ‘new’ module recipy (as presented at the Collaborations Workshop 2015) used MongoDB as the backend data store. However, this added significant complexity to the installation/set-up process, as you needed to install a MongoDB server first, get it running etc. I went looking for a pure-Python NoSQL database and came across TinyDB…which had a simple interface, and has handled everything I’ve thrown at it so far!

In the longer-term we are thinking of making the backend for recipy configurable – so that people can use MongoDB if they want some of the advantages that brings (being able to share the database easily across a network, better performance), but we’ll still keep TinyDB as the default as it just makes life so much easier!

dill

dill is a better pickle (geddit?). You’ve probably used the built-in pickle module to store various Python objects to disk but every so often you may have received an error like this:

---------------------------------------------------------------------------
PicklingError                             Traceback (most recent call last)
<ipython-input-5-aa42e6ee18b1> in <module>()
----> 1 pickle.dumps(x)

PicklingError: Can't pickle <function <lambda> at 0x103671e18>: attribute lookup <lambda> on __main__ failed

That’s because the pickle object is relatively limited in what it can pickle: it can’t cope with nested functions, lambdas, slices, and more. You may not often want to pickle those objects directly – but it is fairly common to come across these inside other objects you want to pickle, thus causing the pickling to fail.

dill solves this by simply being able to pickle a lot more stuff – almost everything, in fact, apart from frames, generators and tracebacks. As with joblib above, the interface is exactly the same as pickle (load and dump), so it’s really easy to switch.

Also, for those of you who like R’s ability to save the entire session in a .RData file, dill has dump_session and load_session functions too – which do exactly what you’d expect!

Bonus: Anaconda

This is a ‘bonus’ as it isn’t actually a Python module, but is something that I started using for the first time in 2015 (yes, I was late to the party!) but couldn’t manage without now!

Anaconda can be a little confusing because it consists of a number of separate things – multiple of which are called ‘Anaconda’. So, these are:

  • A cross-platform Scientific Python distribution – along the same lines as the Enthought Python Distribution, WinPython and so on. Once you’ve downloaded and installed it you get a full scientific Python stack including all of the standard libraries (numpy, scipy, pandas, matplotlib, sklearn, skimage…and many more). It is available in four flavours overall: Python 2 or Python 3, each of which has the option of the full Anaconda distribution, or the reduced-size Miniconda distribution. This leads nicely on to…
  • The conda package management system. This is designed to work with Python packages, but can be used for installing any binaries. You may think this sounds very much like pip, but it’s far better because a) it installs binaries (no more compiling numpy from source!), b) it deals with dependencies better, and c) you can easily create multiple environments which can have different packages installed and run different Python versions
  • The anaconda.org repository (previously called binstar) where you can create an account and upload binary conda packages to easily share with others. For example, I’ve got a couple of conda packages hosted there, which makes them really easy to install for anyone running conda.

So, there we go – my top five Python modules that were new to me in 2015. Hope you found it useful – and Merry Christmas and Happy New Year to you all!

(If you liked this post, then you may also like An easy way to install Jupyter Notebook extensions, Bokeh plots with DataFrame-based tooltips, and Orthogonal Distance Regression in Python)


An easy way to install Jupyter Notebook extensions

I wrote a post a few months ago about a couple of useful Jupyter (formerly known as IPython) notebook extensions, and commented that they were a bit of a pain to install. Well, I’ve found a great way to get around that problem – an extension called nbextensions that will help you manage your notebook extensions.

After installing it (see below for more about how to do this), you can go to a new URL in the notebook /nbextensions, and you get the wonderful page below (click to enlarge):

Screen Shot 2015-12-13 at 12.18.32

Here you can browse through a list of extensions, read their documentation and enable/disable them – all very easily.

Installing the nbextensions can be a little tricky, but there are detailed instructions at the nbextensions Github page. I’ve tried to make it easier by building a conda package which you can install by running:

conda install -c https://conda.anaconda.org/robintw nbextensions

I haven’t tested this on many other computers – so beware that it may not work for you, and it will definitely only work on Jupyter 4.0 or newer (there are instructions on the nbextensions github page for how to get it working with IPython 2.0 and 3.0). Let me know if this package works for you – and enjoy your new extensions!


Convolution in python – which function to use?

Slightly boringly, this very similar to my last post – but it’s also something useful that you may want to know, and that I’ll probably forget if I don’t write it down somewhere.

Basically, scipy.ndimage.filters.convolve is about twice as fast as scipy.signal.convolve2d.

I run convolutions a lot on satellite images, and Landsat images are around 8000 x 8000 pixels. Using a random 8000 x 8000 pixel image, with a 3 x 3 kernel (a size I often use), I find that convolve2d takes 2.9 seconds, whereas convolve takes 1.1 seconds – a useful speedup, particularly if you’re running a loop of various convolutions. The same sort of speed-up can be seen with larger kernel sizes, for example, a 9 x 9 kernel takes 14.7 seconds and 6.8 seconds – an even more useful speed-up.

When I switched my code from convolve2d to convolve, I had to do a bit of playing around to get the various options set to ensure that I got the same results. For reference, the following two lines of code are equivalent:

res1 = convolve2d(img, kernel, mode='same')
res2 = convolve(img, kernel, mode='constant', cval=0.0)

Interestingly these functions are both within scipy, so there must be a reason for them both to exist. Can anyone enlighten me? My naive view is that they could be merged – or if not, then a note could be added to the documentation saying that convolve is far faster!


Previously Unpublicised Code: manifestoclouds

This entry in my series covers manifestoclouds: my code for producing word clouds from political party manifestos.

This is very simple, generic code that just ties together a few libraries – and is by no means restricted to just political party manifestos – but I keep it around because I find it useful occasionally. I use the Python WordCloud library and matplotlib to produce the word clouds – after manually extracting the text from the PDFs using one of a myriad of tools depending on your platform.

The results look quite nice – and the results for the political party manifestos are quite politically interesting:


Green Labour

Everyone is very focused on people and what the party will do, but there are some interesting differences. The Conservatives have the joint focus on continuing things they have been doing during their years in coalition, but also doing various new things now that they’re no longer ‘held back’ by the Lib Dems. The Green party focus a lot on local things, and have a focus on what the public (both in terms of people, and in terms of the public sector) can do. The Labour party are talking a lot about needs, but also talk about work more than the Tories (surprisingly!).

Anyway, ignoring the politics, the code is available on Github as usual.