Robin's Blog

Favourite books I read in 2019

I made a concerted effort to read more in 2019 – and I succeeded, reading a total of 42 books over the year (and this doesn’t even include the many books I read to my toddler).

I’ve chosen a selection of my favourites from the year to highlight in this blog post in the hope that you may enjoy some of them too.

Non-fiction

Countdown to Zero Day: Stuxnet and the Launch of the World’s First Digital Weapon by Kim Zetter

This book covers the fascinating story of the Stuxnet virus which was created to attack nuclear enrichment plants in Iran. It has enough technical details to satisfy me (particularly if you read all the footnotes), but is still accessible to less technical readers. The story is told in a very engaging manner, and the subject matter is absolutely fascinating. My only criticism would be that the last couple of chapters get a bit repetitive – but that’s a minor issue.
Amazon link

Just Mercy by Bryan Stephenson

I saw this book recommended on many other people’s reading lists, and was glad it read it. It was well-written and easy to read from a language point of view, but very hard to read from an emotional point of view. The stories of miscarriages of justice – particularly for black people – are terrifying, and really reinforced my opposition to capital punishment.
Amazon link

The Hut 6 Story by Gordon Welchman

I visited Bletchley Park last year – on a rare child-free weekend with my wife – and saw this book referred to a number of times in the various exhibitions there. I’d read a lot of books about the Bletchley Park codebreakers before but this one is far more technical than most and gives a really detailed description of the method that Gordon worked out for cracking one of the Enigma codes. I must admit that the appendix covering how the ‘diagonal board’ addition to the Bombes worked went a bit over my head – but the rest of it was great.
Amazon link

Atomic Accidents by James Mahaffey

I was recommended this book by various people on Twitter who kept quoting bits about how people thought ‘Plutonium fires wouldn’t be a big deal’, alongside other amusing quotes. I thought I knew quite a bit about nuclear accidents – given that I worked for a nuclear power station company, and have quite an interest in accident investigations – but I really enjoyed this book and learned a lot about various accidents that I hadn’t heard of before. It’s very readable – although occasionally a bit repetitive – and a fun read.
Amazon link

Prisoners of Geography by Tim Marshall

I can’t remember how I came across this book, but I’m glad that I did – it’s a fascinating look at how geography (primarily physical geography) affects countries and their relationships with each other. Things like the locations of mountain ranges, or places where you can access deep-water ports, have huge geopolitical consequences – and this book explores this for a selection of ten countries/regions. This book really helped me understand a number of world events in their geopolitical context, and I think of it often when listening to the news or reading about current events.
Amazon link

The Matter of the Heart: A History of the Heart in Eleven Operations by Thomas Morris

This is a big book – and fairly heavy-going in places – but it’s worth the effort. It’s a fascinating look at the heart and how humans have learnt to ‘fix’ it in various ways. It’s split into chapters about various different operations – such as implanting pacemakers, replacing valves, or transplanting an entire heart – and each chapter covers the whole historical development of that operation, from first conception to eventual widespread success. There are lot of fascinating stories (did you know that CPR was only really introduced in the 1960s?) and it’s amazing how informally a lot of these operations started – and how many people unfortunately died before the operations became successful.
Amazon link

The Dam Busters by Paul Brickhill

I’d enjoyed some of Paul Brickhill’s other books (such as The Great Escape), and yet this book had been sitting on my shelf, unread, for years. I finally got round to reading it, and enjoyed it more than I thought. A lot of the first half of the book is actually about the development of the bomb – I thought it would be all about the actual raid itself – and I found this very enjoyable from a technical perspective. The story of the raid is well-written – but I found the later chapters about some of the other things that the squadron did less interesting.
Amazon link

The Vaccine Race: How Scientists Used Human Cells to Combat Killer Viruses by Meredith Wadman


I’d never really thought about how vaccines were made – but I found this book around the time that I took my son for some of his childhood vaccinations, and found it fascinating. There are a lot of great stories in this book, but the reason it’s at the end of my list is that it is a bit heavy-going at times, and some of the stories are probably a bit gruesome for some people. Still, it’s a good read.
Amazon link

Fiction

I’ve read far more fiction in the last year than I have for quite a while – but they were mostly books by two authors, so I’ll deal with those two authors separately below.

Robert Harris

I read Robert Harris’ book Enigma many years ago and really enjoyed it, but never got round to reading any of his other work. This year I made up for this, reading Conclave – which is about the intrigue surrounding the election of a new Pope, Pompeii – which focuses on an aqueduct engineer noticing changes around Vesuvius before the eruption, and An Officer and a Spy – which tells the true story of a miscarriage of justice in 19th century France. I thoroughly enjoyed all of these – there’s something about the way that Harris sets a scene and really helps you to get the atmosphere of Roman Pompeii or the Sistine Chapel during the vote for a new Pope.

Rosie Lewis

I came across Rosie Lewis through a free book available on the Kindle store and was gripped. Rosie is a foster carer and writes with clarity and feeling about her experiences fostering various children. Her books include Torn, Taken, Broken and Betrayed, and each of them has thoroughly engaged me. As with Just Mercy above, it is an easy, but emotional, read – I cried multiple times while reading these. My favourite was probably Taken, but they were all good.

The Last Days of Night by Graham Moore

This book is a novelisation of true events around the development of electricity and the electric light bulb, focusing particularly on the patent dispute between Tesla, Westinghouse and Edison over who invented the lightbulb – and also their arguments over the best sort of current to use (AC vs DC). The book has everything: nice technical detail on electrical engineering, and a love story with lots of intrigue along the way.
Amazon link


Talks I’m giving this year

As I’ve mentioned before, I give talks on a range of topics to various different audiences, including local science groups, school students and at programming conferences.

I’ve already got a number of talks in the calendar for this year, as detailed below. I’ll try and keep this post up-to-date as I agree to do more talks. All of these talks (so far) are in southern England – so if you’re local then please do come along and listen.

So far all of my bookings are for one of my talks – an introduction to satellite imaging and remote sensing called Monitoring the environment from space. I do a number of other talks (see list here) and I’d love the opportunity to present them to your group: please get in touch to find out more details.


Southampton Cafe Scientifique
21st January @ 20:00
St Denys, Southampton
Title: Monitoring the environment from space
More details

Isle of Wight Cafe Scientifique
10th February @ 19:00
Shanklin, Isle of Wight
Title: Monitoring the environment from space
More details

Three Counties Science Group
17th February @ 13:45
Chiddingford, near Godalming, Surrey
Title: Monitoring the environment from space
More details

Southampton Astronomy Society
9th April @ 19:30
Shirley, Southampton
Title: Monitoring the environment from space
More details


Creating an email service for my son’s childhood memories with Python

For a number of years – since my now-toddler son was a small baby – I’ve been keeping track of various childhood achievements or memories. When I first came up with this I was rather sleep-deprived, and couldn’t decide what the best way to store this information would be – so I went with a very simple option. I just created a Word document with a table in it with two columns: the date, and the activity/achievement/memory. For example:

file

This was very flexible as it allowed me to keep anything else I wanted in this document – and it was portable (to anyone who have access to some way of reading Word documents) – and accessible to non-technical people such as my son’s grandparents.

After a while though, I wondered if I’d made the right decision: shouldn’t I have put it into some other format that could be accessed programmatically? After all, if I kept doing this for his entire childhood then I’d have a lot of interesting data in there…

Well, it turns out that a Word table isn’t too awful a format to store this sort of data in – and you can access it fairly easily from Python.

Once I realised this, I worked out what I wanted to create: a service that would email me every morning listing the things I’d put as diary entries for that day in previous years. I was modelling this very much on the Timehop app that does a similar thing with photographs, tweets and so on, so I called it julian_timehop.

If you just want to go to the code then have a look at the github repo – otherwise, read on to find out how I did it.

Steps

Let’s start by thinking about what the main steps we need to take are:

  1. First we need to get hold of the document. I update it fairly regularly, and it lives on my laptop – whereas this script would need to run on my Linux server, so it can easily run at the same time each day. The easiest way around this was to store the document in Dropbox and use the Dropbox API to grab a copy when we run the script.

  2. We then need to parse the document to extract the table of diary entries.

  3. Once we’ve got the table, we can subset it to the rows that match today’s date (ignoring the year)

  4. We then need to prepare the text of an email based on these rows, and then send the email

Let’s look at each of these in turn now.

Getting the file from Dropbox

We want to do pretty-much the simplest operation possible using the Dropbox API: login and retrieve the latest version of a file. I’ve used the Dropbox API from Python before (see my post about analysing my thesis-writing timeline with Dropbox) and it’s pretty easy. In fact, you can accomplish this task with just four lines of code.

First, we need to connect to Dropbox and authenticate. To do this, we’ll use a Dropbox API key (see here for instructions on how to get one). We don’t want to include this API key directly in the code – as we could accidentally share it with someone else (for example, by uploading the code to Github) – so we store in an environment variable called DROPBOX_KEY.

We can get the key from this environment variable with

dropbox_key = os.environ.get('DROPBOX_KEY')

We can then create a Dropbox API connection and authenticate

dbx = dropbox.Dropbox(dropbox_key)

To download a file, we just call the files_download_to_file method

dbx.files_download_to_file(output_filename, path)

In this case the path argument is the path of the file inside the Dropbox folder – in my case the path is /Notes and diary entries for Julian.docx as the file is in the root Dropbox folder.

Putting this together we get a function to download a file from Dropbox

def download_file(path):
    """
    Download a file from Dropbox, returning the local filename of the downloaded file

    Requires the DROPBOX_KEY env var to be set to a valid Dropbox API key
    """
    dropbox_key = os.environ.get('DROPBOX_KEY')
    dbx = dropbox.Dropbox(dropbox_key)
    output_filename = 'document.docx'
    dbx.files_download_to_file(output_filename, path)

    return output_filename

That’s the first step completed; next we need to extract the table from the Word document.

Extracting the table

In a previous job I did some work which involved automating the creation of Powerpoint presentations, and I used the excellent python-pptx library for reading and writing Powerpoint files. Conveniently, there is a sister library available for Word documents called python-docx which works in a very similar way.

We’re going to convert the Word table to a pandas DataFrame, so after installing python-docx we need to import the main Document class, along with pandas itself

from docx import Document
import pandas as pd

We can parse the document by creating an instance of the Document class with the filename as a parameter

doc = Document(filename)

The doc object has various useful methods and attributes – and one of these is a list of tables in the document. We know that we want to parse the first table – so we just select the 0th index

tab = doc.tables[0]

To create a pandas DataFrame, we need a list containing the contents of each column: here that means a list of dates and a list of entries.

tab.column_cells(0) gives us an iterator over all the cells in column 0, and each cell has a .text method to give the text content of that cell – so we can write a list comprehension to extract all of the contents into a list

dates = [cell.text for cell in tab.column_cells(0)]

We can then use the very handy pd.to_datetime function to convert these to actual date objects. We pass the argument errors='coerce' to force it to parse all entries in the list, without giving errors if one of them isn’t a valid date (in this case it will return NaT or Not a Time).

We can do the same for descriptions, and then put the descriptions and dates together into a DataFrame.

Here is the full code:

def read_table_from_doc(filename):
    doc = Document(filename)
    tab = doc.tables[0]

    dates = [cell.text for cell in tab.column_cells(0)]
    dates = pd.to_datetime(dates, errors='coerce')

    descs = [cell.text for cell in tab.column_cells(1)]

    df = pd.DataFrame({'desc':descs}, index=dates)

    return df

Creating the text for an email

The next step is to create the text to put in an email message, listing the date and the various memories. I wanted an output that looked like this:


01 December

2018:
Memory from 2018

2018:
Another memory from 2018

2017:
A memory from 2017


The code for this is fairly simple, and I’ll only mention the interesting bits.

Firstly, we create a subset of the DataFrame, where we only have the rows where the date was the same as today’s date (ignoring the year):

today = datetime.datetime.now()
subdf = df[(df.index.month == today.month) & (df.index.day == today.day)]

Here we’re combining two boolean indexing operations with & – though do remember to use brackets, as the order of precedence inside these boolean expressions doesn’t always work in the way you’d expect (I’ve been caught out by this a number of times).

As I knew this would only be running on my server, I could use new Python 3.7 features – so I used f-strings. This means that the body of my loop to create the HTML for the email body looks like this

text += f"<p><b>{i.year!s}:</b></br>{row['desc']}</p>\n\n"

Here we’re including variables such as i.year (the year value of the datetime index) and row['desc'] (the value of the desc column of this row).

Putting it together into a function gives the following code, which either returns the HTML text of the email or None if there are no events matching this date

def get_formatted_message(df):
    today = datetime.datetime.now()

    subdf = df[(df.index.month == today.month) & (df.index.day == today.day)]

    if len(subdf) == 0:
        return

    title_date = datetime.datetime.now().strftime('%d %B')
    text = f'<h2>{title_date}</h2>\n\n'
    for i, row in subdf.iterrows():
        text += f"<p><b>{i.year!s}:</b></br>{row['desc']}</p>\n\n"

    return text

Sending the email

I’ve written code to send emails in Python before, and had great difficulty. Emails are a lot more difficult than a lot of people think, and often my emails never got sent, or never got to their destination, or broke in some other way.

This time I managed to avoid all of those problems by using the emails library. Writing a send_email function using this library was so easy:

def send_email(text):
    message = emails.html(html=text,
                          subject='Julian Timehop',
                          mail_from=('Julian Timehop Emailer', '[email protected]'))

    password = os.environ.get('SMTP_PASSWORD')

    r = message.send(to=('R Wilson', '[email protected]'),
                     smtp={'host':'mail.rtwilson.com',
                           'port': 465, 'ssl': True,
                           'user': '[email protected]', 'password': password})

All of the code above is self-explanatory: we’re creating a HTML email message (it details with all of the escaping and encoding necessary), grabbing the password from another environment variable and then sending the email. Easy!

Putting it all together

We’ve now written a function for each individual step – and now we just need to put the functions together. The benefit of writing your script this way is that the main part of your script is just a few lines.

In this case, all we need is

filename = download_file('/Notes and diary entries for Julian.docx')
df = read_table_from_doc(filename)
text = get_formatted_message(df)
if text is not None:
    send_email(text)

Another benefit of this is that for anyone (including yourself) coming back to it in the future, it is very easy to get an overview of what the script does, before delving into the details.

So, I now have this set up on my server to send me an email every morning with some memories of my son’s childhood. Here’s to many more happy memories – and remember to check out the code if you’re interested.


I do freelance work on Python programming and data science – please see my freelance website for more details.


Automatically downloading nursery photos from ParentZone using Selenium

My son goes to a nursery part-time, and the nursery uses a system called ParentZone from Connect Childcare to send information between us (his parents) and nursery. Primarily, this is used to send us updates on the boring details of the day (what he’s had to eat, nappy changes and so on), and to send ‘observations’ which include photographs of what he’s been doing at nursery. The interfaces include a web app (pictured below) and a mobile app:

file

I wanted to be able to download all of these photos easily to keep them in my (enormous) set of photos of my son, without manually downloading each one. So, I wrote a script to do this, with the help of Selenium.

If you want to jump straight to the script, then have a look at the ParentZonePhotoDownloader Github repository. The script is documented and has a nice command-line interface. For more details on how I created it, read on…

Selenium is a browser automation tool that allows you to control pretty-much everything a browser does through code, while also accessing the underlying HTML that the browser is displaying. This makes it the perfect choice for scraping websites that have a lot of Javascript – like the ParentZone website.

To get Selenium working you need to install a ‘webdriver’ that will connect to a particular web browser and do the actual controlling of the browser. I’ve chosen to use chromedriver to control Google Chrome. See the Getting Started guide to see how to install chromedriver – but it’s basically as simple as downloading a binary file and putting it in your PATH.

My script starts off fairly simply, by creating an instance of the Chrome webdriver, and navigating to the ParentZone homepage:

driver = webdriver.Chrome()
driver.get("https://www.parentzone.me/")

The next line: driver.implicitly_wait(10) tells Selenium to wait up to 10 seconds for elements to appear before giving up and giving an error. This is useful for sites that might be slightly slow to load (eg. those with large pictures).

We then fill in the email address and password in the login form:

email_field = driver.find_element_by_xpath('//*[@id="login"]/fieldset/div[1]/input')
email_field.clear()
email_field.send_keys(email)

Here we’re selecting the email address field using it’s XPath, which is a sort of query language for selecting nodes from an XML document (or, by extension, an HTML document – as HTML is a form of XML). I have some basic knowledge of XPath, but usually I just copy the expressions I need from the Chrome Dev Tools window. To do this, select the right element in Dev Tools, then right click on the element’s HTML code and choose ‘Copy->Copy XPath’:

file

We then clear the field, and fake the typing of the email string that we took as a command-line argument.

We then repeat the same thing for the password field, and then just send the ‘Enter’ key to submit the field (easier than finding the right submit button and fake-clicking it).

Once we’ve logged in and gone to the correct page (the ‘timeline’ page) we want to narrow down the page to just show ‘Observations’ (as these are usually the only posts that have photographs). We do this by selecting a dropdown, and then choosing an option from the dropdown box:

dropdown = Select(driver.find_element_by_xpath('//*[@id="filter"]/div[2]/div[4]/div/div[1]/select'))
dropdown.select_by_value('7')

I found the right value (7) to set this to by reading the HTML code where the options were defined, which included this line: <option value="7">Observation</option>.

We then click the ‘Submit’ button:

submit_button = driver.find_element_by_id('submit-filter')
submit_button.click()

Now we get to the bit that had me stuck for a while… The page has ‘infinite scrolling’ – that is, as you scroll down, more posts ‘magically’ appear. We need to scroll right down to the bottom so that we have all of the observations before we try to download them.

I tried using various complicated Javascript functions, but none of them seemed to work – so I settled on a naive way to do it. I simply send the ‘End’ key (which scrolls to the end of the page), wait a few seconds, and then count the number of photos on the page (in this case, elements with the class img-responsive, which is used for photos from observations). When this number stops increasing, I know I’ve reached the point where there are no more pictures to load.

The code that does this is fairly easy to understand:

html = driver.find_element_by_tag_name('html')
old_n_photos = 0
while True:
    # Scroll
    html.send_keys(Keys.END)
    time.sleep(3)
    # Get all photos
    media_elements = driver.find_elements_by_class_name('img-responsive')
    n_photos = len(media_elements)

    if n_photos > old_n_photos:
        old_n_photos = n_photos
    else:
        break

We’ve now got a page with all the photos on it, so we just need to extract them. In fact, we’ve already got a list of all of these photo elements in media_elements, so we just iterate through this and grab some details for each image. Specifically, we get the image URL with element.get_attribute('src'), and then extract the unique image ID from that URL. We then choose the filename to save the file as based on the type of element that was used to display it on the web page (the element.tag_name). If it was a <img> tag then it’s an image, if it was a <video> tag then it was a video.

We then download the image/video file from the website using the requests library (that is, not through Selenium, but separately, just using the URL obtained through Selenium):

# For each image that we've found
for element in media_elements:
    image_url = element.get_attribute('src')
    image_id = image_url.split("&d=")[-1]

    # Deal with file extension based on tag used to display the media
    if element.tag_name == 'img':
        extension = 'jpg'
    elif element.tag_name == 'video':
        extension = 'mp4'
    image_output_path = os.path.join(output_folder,
                                        f'{image_id}.{extension}')

    # Only download and save the file if it doesn't already exist
    if not os.path.exists(image_output_path):
        r = requests.get(image_url, allow_redirects=True)
        open(image_output_path, 'wb').write(r.content)

Putting this all together into a command-line script was made much easier by the click library. Adding the following decorators to the top of the main function creates a whole command-line interface automatically – even including prompts to specify parameters that weren’t specified on the command-line:

@click.command()
@click.option('--email', help='Email address used to log in to ParentZone',
              prompt='Email address used to log in to ParentZone')
@click.option('--password', help='Password used to log in to ParentZone',
              prompt='Password used to log in to ParentZone')
@click.option('--output_folder', help='Output folder',
              default='./output')

So, that’s it. Less than 100 lines in total for a very useful script that saves me a lot of tedious downloading. The full script is available on Github

_I do freelance work in Python programming and data science – see my freelance website for more details._


Using SQLAlchemy to access MySQL without frustrating library installation issues

This is more a ‘note to myself’ than anything else, but I expect some other people might find it useful.

I’ve often struggled with accessing MySQL from Python, as the ‘default’ MySQL library for Python is MySQLdb. This library has a number of problems: 1) it is Python 2 only, and 2) it requires compiling against the MySQL C library and header files, and so can’t be simply installed using pip.

There is a Python 3 version of MySQLdb called mysqlclient, but this also requires compiling against the MySQL libraries and header files, so can be complicated to install.

The best library I’ve found as a replacement is PyMySQL which is a pure Python library (so no need to install MySQL libraries and header files). It’s API is basically exactly the same as MySQLdb, so it’s easy to switch across.

Right, that’s the introduction – and we’re really at the actual point of this post, which is how to go about using the PyMySQL library ‘under the hood’ when you’re accessing databases through SQLAlchemy.

The weird thing is that I’m not actually using SQLAlchemy by choice in my code – but it is used by pandas to convert between SQL and data frames.

For example, you can write code like this:

from sqlalchemy import create_engine
eng = create_engine('mysql://user:[email protected]/database')
df.to_sql('table', eng, if_exists='append', index=False)

which will append the data in df to a table in a database running on the local machine.

The create_engine call is a SQLAlchemy function which creates an engine to handle all of the complex communication to and from a specific database.

Now, when you specify a database connection string with the mysql:// prefix, SQLAlchemy tries to use the MySQLdb library to do the underlying communication with the MySQL database – and fails if it can’t be found.

So, now we’re at the actual solution: which is that you can give SQLAlchemy a ‘dialect’ to use to connect to a database – and this can be used to change the underlying library that is used to talk to the database.

So, you can change your connection string to mysql+pymysql://user:[email protected]0.0.1/database and it will use the PyMySQL library. It’s as simple as that!

There are other dialects that you can use to connect to MySQL using different underlying libraries – although these aren’t recommended by the authors of SQLAlchemy. You can find a list of them here.

_I do data science work – including processing data in MySQL databases – as part of my freelance work. Please contact me for more details._


A couple of handy zsh/bash functions for Python programmers

Just a quick post today, to tell you about a couple of simple zsh functions that I find handy as a Python programmer.

First, pyimp – a very simple function that tries to import a module in Python and displays the output. If there is no output then the import succeeded, otherwise you’ll see the error. This saves constantly going into a Python interpreter and trying to import something, making that ‘has it worked or not’ cycle a bit quicker when installing a tricky package.

The function is defined as

function pyimp() { python -c "import $1" }

This just calls Python with the -c flag which tells it to execute the code you’ve given on the command line – which in this case is just an import command.

You can see below that it returns nothing for a module which is importable, but returns the error for anything which fails:

$ pyimp numpy
$ pyimp blah
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'blah'

The second is pycd which changes directory to the folder where a particular module is defined. This can be useful if you want to inspect the code of the module in depth, or if you’ve installed the module in ‘develop mode’ and want to actually edit the code.

It’s defined as:

function pycd () {
    pushd python -c "import os.path, $1; print(os.path.dirname($1.__file__))";
}

It just changes to the directory that modulename.file is located in – again, fairly simple but quite useful.

As you’ve read to here, I’ll drop in a bonus function to display the column names of a CSV file:

function csvcols() { head -n 1 $1 | tr , \\n }

This combines the unix head tool to get the line of a file and the tr tool to convert commas to newlines, to make a handy little command.


Easily specifying colours from the default colour cycle in matplotlib

Another quick matplotlib tip today: specifically, how easily specify colours from the standard matplotlib colour cycle.

A while back, when matplotlib overhauled their themes and colour schemes, they changed the default cycle of colours used for lines in matplotlib. Previously the first line was pure blue (color='b' in matplotlib syntax), then red, then green etc. They, very sensibly, changed this to a far nicer selection of colours.

However, this change made one thing a bit more difficult – as I found recently. I had plotted a couple of simple lines:

x_values = [0, 1, 2]
line1 = np.array([10, 20, 30])
line2 = line1[::-1]

plt.plot(x_values, line1)
plt.plot(x_values, line2)

which gives

file

I then wanted to plot a shaded area around the second line (the yellow one) – for example, to show the uncertainty in that line.

You can do this with the plt.fill_between function, like this:

plt.fill_between(x_values, line2 - 5, line2 + 5, alpha=0.3, color='y')

This produces a shaded line which extends from 5 below the line to 5 above the line:

file

Unfortunately the colours don’t look quite right: the line isn’t yellow, so doing a partially-transparent yellow background doesn’t look quite right.

I spent a while looking into how to extract the colour of the line so I could use this for the shading, before finding a really easy way to do it. To get the colours in the default colour cycle you can simply use the strings 'C0', 'C1', 'C2' etc. So, in this case just

plt.fill_between(x_values, line2 - 5, line2 + 5, alpha=0.3, color='C1')

The result looks far better now the colours match:

file

I found out about this from a wonderful graphical matplotlib cheatsheet created by Nicolas Rougier – I’d strongly suggest you check it out, there are all sorts of useful things on there that I never knew about!

Just in case you need to do this the manual way, then there are two fairly straightforward ways to get the colour of the second line.

The first is to get the default colour cycle from the matplotlib settings, and extract the relevant colour:

cycle_colors = plt.rcParams['axes.prop_cycle'].by_key()['color']

which gives a list of colours like this:

['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', ...]

You can then just use one of these colours in the call to plt.fill_between – for example:

plt.fill_between(x_values, line2 - 5, line2 + 5, alpha=0.3, color=cycle_colors[1])

The other way is to actually extract the colour of the actual line you plotted, and then use that for the plt.fill_between call

x_values = [0, 1, 2]
line1 = np.array([10, 20, 30])
line2 = line1[::-1]

plt.plot(x_values, line1)
plotted_line = plt.plot(x_values, line2)
plt.fill_between(x_values, line2 - 5, line2 + 5, alpha=0.3, color=plotted_line[0].get_color())

Here we save the result of the plt.plot call when we plot the second line. This gives us a list of the Line2D objects that were created, and we then extract the first (and only) element and call the get_color() method to extract the colour.

I do freelance work in data science and data visualisation – including using matplotlib. If you’d like to work with me, have a look at my freelance website or email me.


Markdown in WordPress – and writing blog posts is fun again

As you may have noticed, I hadn’t blogged here for quite a while, but have recently started blogging regularly again. This is mostly due to sorting out various WordPress issues I was having, and installing some new plugins to make writing blog posts fun again.

Ever since I installed the WordPress update that added the ‘Gutenberg’ editor, I had various problems with editing and creating new posts. I eventually switched back to the Classic Editor (following these instructions), but still wasn’t really happy. I’ve never really been a huge fan of the WordPress editor – it has been a fiddly to get things formatted the way I want, and it’s never dealt with code snippets very well.

I’ve had some plugins installed to do syntax highlighting, but these have required typing the code into a separate little dialog, and not being able to edit it easily after adding it. I really wanted to be able to include code as easily as I do in Markdown documents using ‘code fence’ syntax. For example, something like this:

  `python
def func(x):
    print(x)
    return 2*x
  `

(with no spaces between the backticks – I had to include those or that example would have been syntax highlighted for me)

Basically, I wanted to write my posts in Markdown. I investigated static blog generators, but didn’t want to deal with converting all of my previous posts, and trying to make sure URLs still redirected properly and so on.

Anyway, I found a solution which works really well for me: the WP Githuber MD plugin.

This allows you to write your posts in Markdown, and it supports Github-style fenced code blocks, with syntax highlighting.

All you need to do is install it and then enable the correct settings. To do this:

  1. Go to the Plugins -> Installed Plugins page
  2. Find ‘WP Githuber MD’ and click ‘Settings’
  3. Go to the ‘Modules’ tab at the top
  4. Turn the switch on the right-hand side of the ‘Syntax Highlight’ heading on
  5. Fiddle with the syntax highlighting settings to your own preferences
  6. (Optional) Turn on the switch next to ‘Image Paste’ to make it really easy to add images to your posts

That’s all that needs doing – now your code blocks will be nicely formatted, and you don’t have to bother with typing code into silly dialogs, just write the post in Markdown and insert code as usual and everything ‘just works’.

As a brief postscript, the ‘Image Paste’ functionality is also really useful. Simply copy an image from somewhere on your computer – often I’m copying something like a matplotlib graph produced by a Python script – and then switch to the Markdown editor and paste. The image will then be uploaded to your WordPress Media Library and the right code to include the image will be inserted. All done with a single keypress!

So yes, overall, I am a big fan of WP Githuber MD – I’ve not been asked to say this, but it has really transformed my blog editing experience!


Five new-ish Python things – Part 1

I keep gathering links of interesting Python things I’ve seen around the internet: new packages, good tutorials, and so on – and so I thought I’d start a series where I share them every so often.

Not all of these are new new – some have been around for a while but are new to me – and so they might be new to you too!

Also, there is a distinct ‘PyData’ flavour to these things – they’re all things I’ve come across in my work in data science and geographic processing with Python.

So, on with the list:

removestar

I try really hard to follow the PEP8 style guide for my Python code – but I wasn’t so disciplined in the past, and so I’ve got a lot of old code sitting around which isn’t styled particularly well.

One of the things PEP8 recommends against is using: from blah import *. In my code I used to do a lot of from matplotlib.pyplot import *, and from Py6S import * – but it’s a pain to go through old code and work out what functions are actually used, and replace the import with something like from matplotlib.pyplot import plot, xlabel, title.

removestar is a tool that will do that for you! Just install it with pip install removestar and then it provides a command-line tool to fix your imports for you.

For example, using removestar on the Py6S case study code by running:

removestar ncaveo.py

Gives the following diff as output:

--- original/ncaveo.py
+++ fixed/ncaveo.py
@@ -1,7 +1,7 @@
 # Import Py6S
-from Py6S import *
+from Py6S import Geometry, GroundReflectance, SixS, SixSHelpers
 # Import the Matplotlib plotting environment
-from matplotlib.pyplot import *
+from matplotlib.pyplot import clf, legend, plot, savefig, xlabel, ylabel
 # Import the functions for copying objects
 import copy

To run it on all of the Python files in a module, and do the edits inplace rather than just showing the diffs, you can run it as follows:

removestar -i module_folder/

ipynb-quicklook

file
If you use OS X then you’ll know about the very handy ‘quicklook’ feature that shows you a preview of the selected file in Finder when pressing the spacebar. You can add support for new filetypes to quicklook using quicklook plugins – and I’d already set up a number of useful plugins which will show syntax-highlighted code, preview JSON, CSV and Markdown files nicely, and so on.

I only discovered ipynb-quicklook last week, and it does what you’d expect: it provides previews of Jupyter Notebook files from the Finder. Simply follow the instructions to place the ipynb-quicklook.qlgenerator file in your ~/Library/QuickLook folder, and it ‘Just Works’ – and it’s really quick to render the files too!

Nicolas Rougier’s Matplotlib Cheatsheet

file

This is a great cheatsheet for the matplotlib plotting library from Nicolas Rougier. It’s a great quick reference for all the various matplotlib settings and functions, and reminded me of a number of things matplotlib can do that I’d forgotten about.

Find the high-resolution cheatsheet image here and the repository with all the code used to create it here. Nicolas is also writing a book called Scientific Visualization – Python & Matplotlib which looks great – and it’ll be released open-access once it’s finished (you can donate to see it ‘in progress’).

PyGEOS

If you’re not interested in geographic data processing using Python then this probably won’t interest you…but for those who are interested this looks great. PyGEOS provides native Python bindings to the GEOS library which is used for geometry manipulation by many geospatial tools (such as calculating distances, or finding out whether one geometry contains another). However, by using the underlying C library PyGEOS bypasses the Python interpreter for a lot of the calculations, allowing them to be vectorised efficiently and making it very fast to apply these geometry functions: their preliminary performance tests show speedups ranging from 4x to 136x. The interface is very simple too – for example:

import pygeos
import numpy as np

points = [
    pygeos.Geometry("POINT (1 9)"),
    pygeos.Geometry("POINT (3 5)"),
    pygeos.Geometry("POINT (7 6)")
]
box = pygeos.box(2, 2, 7, 7)
pygeos.contains(box, points)

This project is still in the early days – but definitely one to watch as I think it will have a big impact on the efficiency of Python-based spatial analysis.

napari

file
napari is a fast multi-dimensional image viewer for Python. I found out about it through an extremely comprehensive blog post written by Juan Nunez-Iglesias where he explains the background to the project and what problems it is designed to solve.

One of the key features of napari is that it has a full Python API, allowing you to easily visualise images from within Python – as easily as using imshow() from matplotlib, but with far more features. For example, to view three of the scikit-image sample images just run:

from skimage import data
import napari

with napari.gui_qt():
    viewer = napari.Viewer()
    viewer.add_image(data.astronaut(), name='astronaut')
    viewer.add_image(data.moon(), name='moon')
    viewer.add_image(data.camera(), name='camera')

You can then add some vector points over the image – for example, to use as starting points for a segmentation:

points = np.array([[100, 100], [200, 200], [300, 100]])
viewer.add_points(points, size=30)

That is very useful for me already, and it’s just a tiny taste of what napari has to offer. I’ve only played with it for a short time, but I can already see it being really useful for me next time I’m doing a computer vision project, and I’m already planning to discuss some potential new features to help with satellite imagery work. Definitely something to check out if you’re involved in image processing in any way.


If you liked this, then get me to work for you! I do freelance work in data science, Python development and geospatial analysis – please contact me for more details, or look at my freelance website


Setting limits for a choropleth layer manually with leaflet-choropleth

Following on from my last post on plotting choropleth maps with the leaflet-choropleth library, I’m now going to talk about a small addition I’ve made to the library.

Leaflet-choropleth has built-in functionality to automatically categorise your data: you tell it how many categories you’d like and it splits it up. However, once I’d set up my webmap with leaflet-choropleth, using the automatically generated categories, my client said she wanted specific categories to be used. Unfortunately leaflet-choropleth didn’t support that…so I added it!

(It always pleases me a lot that if you’re in a situation where some open-source code doesn’t do what you want it to do, you can just modify it – and then you can contribute the code back to the project too!)

The pull request for this new functionality hasn’t yet been merged, but the updated code is available from my fork. The specific file you need is the updated choropleth.js file. Once you’ve replaced the original choropleth.js with this new version, you will be able to use a new limits option when calling L.choropleth. For example:

var layer_IMD = L.choropleth(geojson, {
    valueProperty: 'IMDRank',
    limits: [1000, 5000, 30000],
    scale: ['red', 'orange', 'yellow'],
    style: {
        color: '#111111', // border color
        weight: 1,
        fillOpacity: 0.5,
        fillColor: '#ffffff'
    }
}).addTo(map);

The value of the limits property should be the ‘dividing lines’ for the limits: so in this case there will be categories of < 1000, 1000-5000, etc.

I think that’s pretty-much all I can say about this – the code for an example map using this new functionality is available on Github and you can see a live map demo here.

This work was done while analysing GIS data and producing a webmap for a freelancing client. If you’d like me to do something similar for you, have a look at my freelance website or email me.