Robin's Blog

Automatically downloading nursery photos from ParentZone using Selenium

My son goes to a nursery part-time, and the nursery uses a system called ParentZone from Connect Childcare to send information between us (his parents) and the nursery. Primarily, this is used to send us updates on the boring details of the day (what he’s had to eat, nappy changes and so on), and to send ‘observations’ which include photographs of what he’s been doing at nursery. The interfaces include a web app (pictured below) and a mobile app:

file

I wanted to be able to download all of these photos easily to keep them in my (enormous) set of photos of my son, without manually downloading each one. So, I wrote a script to do this, with the help of Selenium.

If you want to jump straight to the script, then have a look at the ParentZonePhotoDownloader GitHub repository. The script is documented and has a nice command-line interface. For more details on how I created it, read on…

Selenium is a browser automation tool that allows you to control pretty much everything a browser does through code, while also accessing the underlying HTML that the browser is displaying. This makes it the perfect choice for scraping websites that have a lot of JavaScript – like the ParentZone website.

To get Selenium working you need to install a ‘webdriver’ that will connect to a particular web browser and do the actual controlling of the browser. I’ve chosen to use chromedriver to control Google Chrome. See the Getting Started guide to see how to install chromedriver – but it’s basically as simple as downloading a binary file and putting it in your PATH.

My script starts off fairly simply, by creating an instance of the Chrome webdriver, and navigating to the ParentZone homepage:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.parentzone.me/")

The next line: driver.implicitly_wait(10) tells Selenium to wait up to 10 seconds for elements to appear before giving up and raising an error. This is useful for sites that might be slightly slow to load (e.g. those with large pictures).

We then fill in the email address and password in the login form:

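# Recent Selenium 4 releases removed the find_element_by_* helpers;
# find_element(By.XPATH, ...) is the equivalent call
from selenium.webdriver.common.by import By

email_field = driver.find_element(By.XPATH, '//*[@id="login"]/fieldset/div[1]/input')
email_field.clear()
email_field.send_keys(email)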

Here we’re selecting the email address field using its XPath. XPath is a query language for selecting nodes from an XML document, and it works just as well on the HTML document tree that the browser builds. I have some basic knowledge of XPath, but usually I just copy the expressions I need from the Chrome Dev Tools window. To do this, select the right element in Dev Tools, then right click on the element’s HTML code and choose ‘Copy->Copy XPath’:

file

We then clear the field, and fake the typing of the email string that we took as a command-line argument.

We then repeat the same thing for the password field, and then just send the ‘Enter’ key to submit the form (easier than finding the right submit button and fake-clicking it).

Once we’ve logged in and gone to the correct page (the ‘timeline’ page) we want to narrow down the page to just show ‘Observations’ (as these are usually the only posts that have photographs). We do this by selecting a dropdown, and then choosing an option from the dropdown box:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
dropdown = Select(driver.find_element(By.XPATH, '//*[@id="filter"]/div[2]/div[4]/div/div[1]/select'))
dropdown.select_by_value('7')

I found the right value (7) to set this to by reading the HTML code where the options were defined, which included this line: <option value="7">Observation</option>.

We then click the ‘Submit’ button:

from selenium.webdriver.common.by import By
submit_button = driver.find_element(By.ID, 'submit-filter')
submit_button.click()

Now we get to the bit that had me stuck for a while… The page has ‘infinite scrolling’ – that is, as you scroll down, more posts ‘magically’ appear. We need to scroll right down to the bottom so that we have all of the observations before we try to download them.

I tried using various complicated JavaScript functions, but none of them seemed to work – so I settled on a naive way to do it. I simply send the ‘End’ key (which scrolls to the end of the page), wait a few seconds, and then count the number of photos on the page (in this case, elements with the class img-responsive, which is used for photos from observations). When this number stops increasing, I know I’ve reached the point where there are no more pictures to load.

The code that does this is fairly easy to understand:

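import time
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

html = driver.find_element(By.TAG_NAME, 'html')
old_n_photos = 0
while True:
    # Scroll to the bottom of the page and wait for new posts to load
    html.send_keys(Keys.END)
    time.sleep(3)
    # Get all photos currently on the page
    media_elements = driver.find_elements(By.CLASS_NAME, 'img-responsive')
    n_photos = len(media_elements)

    # Stop once scrolling no longer adds any photos
    if n_photos > old_n_photos:
        old_n_photos = n_photos
    else:
        break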
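The same "scroll, wait, count, stop when the count stops growing" idea can be sketched without any browser at all, which makes it easy to try out. In this sketch scroll_until_stable, scroll, count_items and max_rounds are all hypothetical names, not part of the real script; max_rounds is an added safety limit so that the loop cannot run forever.

```python
import time
from typing import Callable


def scroll_until_stable(scroll: Callable[[], None],
                        count_items: Callable[[], int],
                        pause: float = 3.0,
                        max_rounds: int = 200) -> int:
    """Call scroll() repeatedly until count_items() stops increasing.

    scroll and count_items are hypothetical callables supplied by the caller.
    Returns the largest item count that was seen.
    """
    previous = 0
    for _ in range(max_rounds):
        scroll()
        time.sleep(pause)  # give the page time to load more posts
        current = count_items()
        if current <= previous:
            # Nothing new appeared after scrolling, so assume we hit the end
            break
        previous = current
    return previous
```

With Selenium, scroll would send Keys.END to the html element and count_items would return the number of img-responsive elements on the page.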

We’ve now got a page with all the photos on it, so we just need to extract them. In fact, we’ve already got a list of all of these photo elements in media_elements, so we just iterate through this and grab some details for each image. Specifically, we get the image URL with element.get_attribute('src'), and then extract the unique image ID from that URL. We then choose the file extension to save the file with based on the type of element that was used to display it on the web page (the element.tag_name): an <img> tag means it’s an image, and a <video> tag means it’s a video.

We then download the image/video file from the website using the requests library (that is, not through Selenium, but separately, just using the URL obtained through Selenium):

import os
import requests

# For each image that we've found
for element in media_elements:
    image_url = element.get_attribute('src')
    image_id = image_url.split("&d=")[-1]

    # Deal with file extension based on tag used to display the media
    if element.tag_name == 'img':
        extension = 'jpg'
    elif element.tag_name == 'video':
        extension = 'mp4'
    else:
        # Skip anything that is neither an image nor a video
        continue
    image_output_path = os.path.join(output_folder,
                                     f'{image_id}.{extension}')

    # Only download and save the file if it doesn't already exist
    if not os.path.exists(image_output_path):
        r = requests.get(image_url, allow_redirects=True)
        # Use a context manager so the file is closed after writing
        with open(image_output_path, 'wb') as f:
            f.write(r.content)
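Splitting on "&d=" works as long as d is the last parameter in the URL. A slightly more defensive variant is sketched below, on the assumption (taken from that split) that the ID lives in a query parameter called d. The helper name media_output_path and the example.com URLs are made up for illustration.

```python
import os
from urllib.parse import parse_qs, urlparse

# Map the HTML tag used to display the media to a file extension
EXTENSIONS = {'img': 'jpg', 'video': 'mp4'}


def media_output_path(url, tag_name, output_folder):
    """Return the path to save a media file to, or None if it should be skipped.

    Assumption: the media ID is carried in a 'd' query parameter.
    """
    extension = EXTENSIONS.get(tag_name)
    if extension is None:
        return None  # neither an image nor a video
    ids = parse_qs(urlparse(url).query).get('d')
    if not ids:
        return None  # no ID found in the URL
    return os.path.join(output_folder, f'{ids[-1]}.{extension}')
```

Because the query string is parsed properly, the ID is still extracted correctly if another parameter follows d in the URL.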

Putting this all together into a command-line script was made much easier by the click library. Adding the following decorators to the top of the main function creates a whole command-line interface automatically – even including prompts to specify parameters that weren’t specified on the command-line:

@click.command()
@click.option('--email', help='Email address used to log in to ParentZone',
              prompt='Email address used to log in to ParentZone')
@click.option('--password', help='Password used to log in to ParentZone',
              prompt='Password used to log in to ParentZone')
@click.option('--output_folder', help='Output folder',
              default='./output')
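To make it clearer what those decorators are doing for us, here is a rough standard-library analogue using argparse and getpass. This is not click and not the code from the real script: parse_args, ask and ask_secret are hypothetical names, and the prompting that click gives for free has to be written by hand.

```python
import argparse
import getpass


def parse_args(argv=None, ask=input, ask_secret=getpass.getpass):
    """Parse the three options, prompting for any that are missing.

    ask and ask_secret are injectable (hypothetical) prompt functions,
    which keeps the function easy to test.
    """
    parser = argparse.ArgumentParser(description='Download photos from ParentZone')
    parser.add_argument('--email', help='Email address used to log in to ParentZone')
    parser.add_argument('--password', help='Password used to log in to ParentZone')
    parser.add_argument('--output_folder', default='./output', help='Output folder')
    args = parser.parse_args(argv)

    # Prompt for anything that was not given on the command line
    if args.email is None:
        args.email = ask('Email address used to log in to ParentZone: ')
    if args.password is None:
        args.password = ask_secret('Password used to log in to ParentZone: ')
    return args
```

That is roughly twenty lines to do what the three click.option decorators do, which is a fair illustration of why click makes this easier.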

So, that’s it. Less than 100 lines in total for a very useful script that saves me a lot of tedious downloading. The full script is available on GitHub.


Using SQLAlchemy to access MySQL without frustrating library installation issues

This is more a ‘note to myself’ than anything else, but I expect some other people might find it useful.

I’ve often struggled with accessing MySQL from Python, as the ‘default’ MySQL library for Python is MySQLdb. This library has a number of problems: 1) it is Python 2 only, and 2) it requires compiling against the MySQL C library and header files, and so can’t be simply installed using pip.

There is a Python 3 version of MySQLdb called mysqlclient, but this also requires compiling against the MySQL libraries and header files, so can be complicated to install.

The best library I’ve found as a replacement is PyMySQL which is a pure Python library (so no need to install MySQL libraries and header files). Its API is basically exactly the same as MySQLdb, so it’s easy to switch across.

Right, that’s the introduction – and we’re really at the actual point of this post, which is how to go about using the PyMySQL library ‘under the hood’ when you’re accessing databases through SQLAlchemy.

The weird thing is that I’m not actually using SQLAlchemy by choice in my code – but it is used by pandas to convert between SQL and data frames.

For example, you can write code like this:

from sqlalchemy import create_engine
eng = create_engine('mysql://user:password@localhost/database')
df.to_sql('table', eng, if_exists='append', index=False)

which will append the data in df to a table in a database running on the local machine.

The create_engine call is a SQLAlchemy function which creates an engine to handle all of the complex communication to and from a specific database.

Now, when you specify a database connection string with the mysql:// prefix, SQLAlchemy tries to use the MySQLdb library to do the underlying communication with the MySQL database – and fails if it can’t be found.

So, now we’re at the actual solution: the connection string can name not just the database type (the ‘dialect’, here mysql) but also the driver library that SQLAlchemy should use to talk to the database, in the form dialect+driver://.

So, you can change your connection string to mysql+pymysql://user:password@localhost/database and it will use the PyMySQL library. It’s as simple as that!
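One thing to watch when building these strings by hand: a password containing characters such as @ or / needs URL-escaping, or the connection string will be parsed incorrectly. A minimal sketch of a helper that does this follows; mysql_url is a hypothetical name, not part of SQLAlchemy, and the credentials are placeholders.

```python
from urllib.parse import quote_plus


def mysql_url(user, password, host, database, driver='pymysql'):
    """Build a mysql+<driver>:// connection string with escaped credentials."""
    return (f'mysql+{driver}://{quote_plus(user)}:{quote_plus(password)}'
            f'@{host}/{database}')


# The @ and / in the password are escaped so they can't be confused
# with the separators in the URL
print(mysql_url('user', 'p@ss/word', 'localhost', 'database'))
```

The resulting string can be passed straight to create_engine.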

There are other drivers that you can use to connect to MySQL in the same way – although these aren’t recommended by the authors of SQLAlchemy. You can find a list of them here.

_I do data science work – including processing data in MySQL databases – as part of my freelance work. Please contact me for more details._


A couple of handy zsh/bash functions for Python programmers

Just a quick post today, to tell you about a couple of simple zsh functions that I find handy as a Python programmer.

First, pyimp – a very simple function that tries to import a module in Python and displays the output. If there is no output then the import succeeded, otherwise you’ll see the error. This saves constantly going into a Python interpreter and trying to import something, making that ‘has it worked or not’ cycle a bit quicker when installing a tricky package.

The function is defined as

function pyimp() { python -c "import $1"; }

This just calls Python with the -c flag which tells it to execute the code you’ve given on the command line – which in this case is just an import command.

You can see below that it returns nothing for a module which is importable, but returns the error for anything which fails:

$ pyimp numpy
$ pyimp blah
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'blah'

The second is pycd which changes directory to the folder where a particular module is defined. This can be useful if you want to inspect the code of the module in depth, or if you’ve installed the module in ‘develop mode’ and want to actually edit the code.

It’s defined as:

function pycd () {
    pushd "$(python -c "import os.path, $1; print(os.path.dirname($1.__file__))")";
}

It just changes to the directory that modulename.__file__ is located in – again, fairly simple but quite useful.
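The lookup that pycd does can also be written as a small Python function. This sketch uses importlib.util.find_spec rather than importing the module and reading __file__, so for a top-level module its code is not executed; module_dir is a hypothetical name.

```python
import importlib.util
import os.path


def module_dir(name):
    """Return the directory containing the file that defines module `name`."""
    spec = importlib.util.find_spec(name)
    # Built-in and namespace modules have no single file on disk
    if spec is None or not spec.has_location:
        raise ModuleNotFoundError(f'No file found for module {name!r}')
    return os.path.dirname(spec.origin)


print(module_dir('json'))
```

For a package such as json this gives the package folder itself, which is the same place pycd would take you.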

As you’ve read to here, I’ll drop in a bonus function to display the column names of a CSV file:

function csvcols() { head -n 1 "$1" | tr , \\n; }

This combines the unix head tool to get the first line of a file and the tr tool to convert commas to newlines, to make a handy little command.
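One limitation of the tr approach is that it splits on every comma, including commas inside quoted column names. If that matters for your files, a Python version using the csv module handles the quoting. This is just a sketch, and csv_columns is a hypothetical name.

```python
import csv


def csv_columns(path):
    """Return the column names from the first row of a CSV file."""
    with open(path, newline='') as f:
        # next() with a default gives an empty list for an empty file
        return next(csv.reader(f), [])
```

For example, a header line of name,"city, country",age comes back as three columns rather than four.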


Easily specifying colours from the default colour cycle in matplotlib

Another quick matplotlib tip today: specifically, how to easily specify colours from the standard matplotlib colour cycle.

A while back, when matplotlib overhauled their themes and colour schemes, they changed the default cycle of colours used for lines in matplotlib. Previously the first line was pure blue (color='b' in matplotlib syntax), then red, then green etc. They, very sensibly, changed this to a far nicer selection of colours.

However, this change made one thing a bit more difficult – as I found recently. I had plotted a couple of simple lines:

import numpy as np
import matplotlib.pyplot as plt

x_values = [0, 1, 2]
line1 = np.array([10, 20, 30])
line2 = line1[::-1]

plt.plot(x_values, line1)
plt.plot(x_values, line2)

which gives

file

I then wanted to plot a shaded area around the second line (the yellow one) – for example, to show the uncertainty in that line.

You can do this with the plt.fill_between function, like this:

plt.fill_between(x_values, line2 - 5, line2 + 5, alpha=0.3, color='y')

This produces a shaded band which extends from 5 below the line to 5 above the line:

file

Unfortunately the colours don’t look quite right: the line isn’t actually yellow (it is orange, the second colour in the new default cycle), so a partially-transparent yellow band behind it clashes.

I spent a while looking into how to extract the colour of the line so I could use this for the shading, before finding a really easy way to do it. To get the colours in the default colour cycle you can simply use the strings 'C0', 'C1', 'C2' etc. So, in this case just

plt.fill_between(x_values, line2 - 5, line2 + 5, alpha=0.3, color='C1')

The result looks far better now the colours match:

file

I found out about this from a wonderful graphical matplotlib cheatsheet created by Nicolas Rougier – I’d strongly suggest you check it out, there are all sorts of useful things on there that I never knew about!

Just in case you need to do this the manual way, there are two fairly straightforward ways to get the colour of the second line.

The first is to get the default colour cycle from the matplotlib settings, and extract the relevant colour:

cycle_colors = plt.rcParams['axes.prop_cycle'].by_key()['color']

which gives a list of colours like this:

['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', ...]

You can then just use one of these colours in the call to plt.fill_between – for example:

plt.fill_between(x_values, line2 - 5, line2 + 5, alpha=0.3, color=cycle_colors[1])

The other way is to extract the colour of the actual line you plotted, and then use that for the plt.fill_between call:

x_values = [0, 1, 2]
line1 = np.array([10, 20, 30])
line2 = line1[::-1]

plt.plot(x_values, line1)
plotted_line = plt.plot(x_values, line2)
plt.fill_between(x_values, line2 - 5, line2 + 5, alpha=0.3, color=plotted_line[0].get_color())

Here we save the result of the plt.plot call when we plot the second line. This gives us a list of the Line2D objects that were created, and we then extract the first (and only) element and call the get_color() method to extract the colour.

I do freelance work in data science and data visualisation – including using matplotlib. If you’d like to work with me, have a look at my freelance website or email me.


Markdown in WordPress – and writing blog posts is fun again

As you may have noticed, I hadn’t blogged here for quite a while, but have recently started blogging regularly again. This is mostly due to sorting out various WordPress issues I was having, and installing some new plugins to make writing blog posts fun again.

Ever since I installed the WordPress update that added the ‘Gutenberg’ editor, I had various problems with editing and creating new posts. I eventually switched back to the Classic Editor (following these instructions), but still wasn’t really happy. I’ve never really been a huge fan of the WordPress editor – it has been fiddly to get things formatted the way I want, and it’s never dealt with code snippets very well.

I’ve had some plugins installed to do syntax highlighting, but these have required typing the code into a separate little dialog, and not being able to edit it easily after adding it. I really wanted to be able to include code as easily as I do in Markdown documents using ‘code fence’ syntax. For example, something like this:

` ` `python
def func(x):
    print(x)
    return 2*x
` ` `

(with no spaces between the backticks – I had to include those or that example would have been syntax highlighted for me)

Basically, I wanted to write my posts in Markdown. I investigated static blog generators, but didn’t want to deal with converting all of my previous posts, and trying to make sure URLs still redirected properly and so on.

Anyway, I found a solution which works really well for me: the WP Githuber MD plugin.

This allows you to write your posts in Markdown, and it supports Github-style fenced code blocks, with syntax highlighting.

All you need to do is install it and then enable the correct settings. To do this:

  1. Go to the Plugins -> Installed Plugins page
  2. Find ‘WP Githuber MD’ and click ‘Settings’
  3. Go to the ‘Modules’ tab at the top
  4. Turn the switch on the right-hand side of the ‘Syntax Highlight’ heading on
  5. Fiddle with the syntax highlighting settings to your own preferences
  6. (Optional) Turn on the switch next to ‘Image Paste’ to make it really easy to add images to your posts

That’s all that needs doing – now your code blocks will be nicely formatted, and you don’t have to bother with typing code into silly dialogs, just write the post in Markdown and insert code as usual and everything ‘just works’.

As a brief postscript, the ‘Image Paste’ functionality is also really useful. Simply copy an image from somewhere on your computer – often I’m copying something like a matplotlib graph produced by a Python script – and then switch to the Markdown editor and paste. The image will then be uploaded to your WordPress Media Library and the right code to include the image will be inserted. All done with a single keypress!

So yes, overall, I am a big fan of WP Githuber MD – I’ve not been asked to say this, but it has really transformed my blog editing experience!


Five new-ish Python things – Part 1

I keep gathering links of interesting Python things I’ve seen around the internet: new packages, good tutorials, and so on – and so I thought I’d start a series where I share them every so often.

Not all of these are new new – some have been around for a while but are new to me – and so they might be new to you too!

Also, there is a distinct ‘PyData’ flavour to these things – they’re all things I’ve come across in my work in data science and geographic processing with Python.

So, on with the list:

removestar

I try really hard to follow the PEP8 style guide for my Python code – but I wasn’t so disciplined in the past, and so I’ve got a lot of old code sitting around which isn’t styled particularly well.

One of the things PEP8 recommends against is using: from blah import *. In my code I used to do a lot of from matplotlib.pyplot import *, and from Py6S import * – but it’s a pain to go through old code and work out what functions are actually used, and replace the import with something like from matplotlib.pyplot import plot, xlabel, title.

removestar is a tool that will do that for you! Just install it with pip install removestar and then it provides a command-line tool to fix your imports for you.

For example, using removestar on the Py6S case study code by running:

removestar ncaveo.py

Gives the following diff as output:

--- original/ncaveo.py
+++ fixed/ncaveo.py
@@ -1,7 +1,7 @@
 # Import Py6S
-from Py6S import *
+from Py6S import Geometry, GroundReflectance, SixS, SixSHelpers
 # Import the Matplotlib plotting environment
-from matplotlib.pyplot import *
+from matplotlib.pyplot import clf, legend, plot, savefig, xlabel, ylabel
 # Import the functions for copying objects
 import copy

To run it on all of the Python files in a module, and do the edits in place rather than just showing the diffs, you can run it as follows:

removestar -i module_folder/
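If you just want to find out which of your old files contain star imports before running removestar on them, the standard-library ast module can do that. To be clear, this sketch is not how removestar itself works, it only lists the offending lines; find_star_imports is a hypothetical name.

```python
import ast


def find_star_imports(source):
    """Return (line number, module name) for each 'from x import *' in source."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom) and any(a.name == '*' for a in node.names):
            found.append((node.lineno, node.module))
    return found


code = "from Py6S import *\nimport copy\nfrom matplotlib.pyplot import *\n"
print(find_star_imports(code))
```

Working out which names to import instead is the hard part, and that is the bit removestar does for you.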

ipynb-quicklook

If you use OS X then you’ll know about the very handy ‘quicklook’ feature that shows you a preview of the selected file in Finder when pressing the spacebar. You can add support for new filetypes to quicklook using quicklook plugins – and I’d already set up a number of useful plugins which will show syntax-highlighted code, preview JSON, CSV and Markdown files nicely, and so on.

I only discovered ipynb-quicklook last week, and it does what you’d expect: it provides previews of Jupyter Notebook files from the Finder. Simply follow the instructions to place the ipynb-quicklook.qlgenerator file in your ~/Library/QuickLook folder, and it ‘Just Works’ – and it’s really quick to render the files too!

Nicolas Rougier’s Matplotlib Cheatsheet


This is a great cheatsheet for the matplotlib plotting library from Nicolas Rougier. It’s a great quick reference for all the various matplotlib settings and functions, and reminded me of a number of things matplotlib can do that I’d forgotten about.

Find the high-resolution cheatsheet image here and the repository with all the code used to create it here. Nicolas is also writing a book called Scientific Visualization – Python & Matplotlib which looks great – and it’ll be released open-access once it’s finished (you can donate to see it ‘in progress’).

PyGEOS

If you’re not interested in geographic data processing using Python then this probably won’t interest you…but for those who are, this looks great. PyGEOS provides native Python bindings to the GEOS library, which many geospatial tools use for geometry manipulation (such as calculating distances, or finding out whether one geometry contains another). Because PyGEOS calls the underlying C library directly on whole arrays of geometries, it bypasses the Python interpreter for a lot of the calculations, allowing them to be vectorised efficiently and making it very fast to apply these geometry functions: their preliminary performance tests show speedups ranging from 4x to 136x. The interface is very simple too – for example:

import pygeos
import numpy as np

points = [
    pygeos.Geometry("POINT (1 9)"),
    pygeos.Geometry("POINT (3 5)"),
    pygeos.Geometry("POINT (7 6)")
]
box = pygeos.box(2, 2, 7, 7)
pygeos.contains(box, points)

This project is still in the early days – but definitely one to watch as I think it will have a big impact on the efficiency of Python-based spatial analysis.

napari

napari is a fast multi-dimensional image viewer for Python. I found out about it through an extremely comprehensive blog post written by Juan Nunez-Iglesias where he explains the background to the project and what problems it is designed to solve.

One of the key features of napari is that it has a full Python API, allowing you to easily visualise images from within Python – as easily as using imshow() from matplotlib, but with far more features. For example, to view three of the scikit-image sample images just run:

from skimage import data
import napari

with napari.gui_qt():
    viewer = napari.Viewer()
    viewer.add_image(data.astronaut(), name='astronaut')
    viewer.add_image(data.moon(), name='moon')
    viewer.add_image(data.camera(), name='camera')

You can then add some vector points over the image – for example, to use as starting points for a segmentation:

import numpy as np

points = np.array([[100, 100], [200, 200], [300, 100]])
viewer.add_points(points, size=30)

That is very useful for me already, and it’s just a tiny taste of what napari has to offer. I’ve only played with it for a short time, but I can already see it being really useful for me next time I’m doing a computer vision project, and I’m already planning to discuss some potential new features to help with satellite imagery work. Definitely something to check out if you’re involved in image processing in any way.


If you liked this, then get me to work for you! I do freelance work in data science, Python development and geospatial analysis – please contact me for more details, or look at my freelance website


Setting limits for a choropleth layer manually with leaflet-choropleth

Following on from my last post on plotting choropleth maps with the leaflet-choropleth library, I’m now going to talk about a small addition I’ve made to the library.

Leaflet-choropleth has built-in functionality to automatically categorise your data: you tell it how many categories you’d like and it splits it up. However, once I’d set up my webmap with leaflet-choropleth, using the automatically generated categories, my client said she wanted specific categories to be used. Unfortunately leaflet-choropleth didn’t support that…so I added it!

(It always pleases me a lot that if you’re in a situation where some open-source code doesn’t do what you want it to do, you can just modify it – and then you can contribute the code back to the project too!)

The pull request for this new functionality hasn’t yet been merged, but the updated code is available from my fork. The specific file you need is the updated choropleth.js file. Once you’ve replaced the original choropleth.js with this new version, you will be able to use a new limits option when calling L.choropleth. For example:

var layer_IMD = L.choropleth(geojson, {
    valueProperty: 'IMDRank',
    limits: [1000, 5000, 30000],
    scale: ['red', 'orange', 'yellow'],
    style: {
        color: '#111111', // border color
        weight: 1,
        fillOpacity: 0.5,
        fillColor: '#ffffff'
    }
}).addTo(map);

The value of the limits property should be the ‘dividing lines’ between the categories: so in this case there will be categories of < 1000, 1000-5000, etc.

I think that’s pretty much all I can say about this – the code for an example map using this new functionality is available on GitHub and you can see a live map demo here.

This work was done while analysing GIS data and producing a webmap for a freelancing client. If you’d like me to do something similar for you, have a look at my freelance website or email me.


Automatically generating a legend for a choropleth layer in Leaflet

Some work I’ve been doing recently has involved putting together a webmap using the Leaflet library. I’ve been very impressed with how Leaflet works, and the range of plugins available for it.

leaflet-choropleth is an extension for Leaflet that makes it easy to generate choropleth maps. The docs for this module are pretty good, so I’ll just show a quick example of how to use it in a fairly basic way:

var layer_IMD = L.choropleth(geojson, {
        valueProperty: 'IMDRank',
        scale: ['red', 'orange', 'yellow'],
        style: {
            color: '#111111', // border color
            weight: 1,
            fillOpacity: 0.5,
            fillColor: '#ffffff'
        }
    }).addTo(map);

This displays a choropleth based on the GeoJSON data in geojson, and uses a red-orange-yellow colourmap, basing the colours on the IMDRank property of each GeoJSON feature.

This will produce something like this – a map of Index of Multiple Deprivation values in Southampton, UK (read on if you want to see a GitHub repository of a full map):


One thing I wanted to do was create a legend for this layer in the Leaflet layers control. The leaflet-choropleth docs give an example of creating a legend, but I don’t really like the style, and the legend appears in a separate box rather than in the layers control for the map.

So, I put together a JavaScript function to create the sort of legend I wanted. For those who just want to use the function, it’s below. For those who want more details, read on…

function legend_for_choropleth_layer(layer, name, units, id) {
    // Generate a HTML legend for a Leaflet layer created using choropleth.js
    //
    // Arguments:
    // layer: The leaflet Layer object referring to the layer - must be a layer using
    //        choropleth.js
    // name: The name to display in the layer control (will be displayed above the legend, and next
    //       to the checkbox)
    // units: A suffix to put after each numerical range in the layer (for example to specify the
    //        units of the values, but could be used for other purposes)
    // id: The id to give the <ul> element that is used to create the legend. Useful to allow the legend
    //     to be shown/hidden programmatically
    //
    // Returns:
    // The HTML ready to be used in the specification of the layers control
    var limits = layer.options.limits;
    var colors = layer.options.colors;
    var labels = [];

    // Start with just the name that you want displayed in the layer selector
    var HTML = name

    // For each limit value, create a string of the form 'X-Y'
    limits.forEach(function (limit, index) {
        if (index === 0) {
            var to = parseFloat(limits[index]).toFixed(0);
            var range_str = "< " + to;
        }
        else {
            var from = parseFloat(limits[index - 1]).toFixed(0);
            var to = parseFloat(limits[index]).toFixed(0);
            var range_str = from + "-" + to;
        }

        // Put together a <li> element with the relevant classes, and the right colour and text
        labels.push('<li class="sublegend-item"><div class="sublegend-color" style="background-color: ' +
            colors[index] + '"> </div> ' + range_str + units + '</li>');
    })

    // Put all the <li> elements together in a <ul> element
    HTML += '<ul id="' + id + '" class="sublegend">' + labels.join('') + '</ul>';

    return HTML;
}

This function is fairly simple: it loops through the limits that have been defined for each of the categories in the choropleth map, and generates a chunk of HTML for each of the different categories (specifically, a <li> element), and these elements are put together and wrapped in a <ul> to produce the final HTML for the legend. We also set CSS classes for each element of the legend, so we can style them nicely later.
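To see exactly which range strings come out of that loop, here is the labelling logic transcribed into Python, purely so it can be run outside the browser. It only covers the text of each range, not the HTML, and range_labels is a hypothetical name. One caveat: Python’s number formatting can round an exact .5 differently from JavaScript’s toFixed, so treat it as a sketch.

```python
def range_labels(limits, units=''):
    """Return the legend text for each limit: '< X' first, then 'X-Y' ranges."""
    labels = []
    for index, limit in enumerate(limits):
        to = f'{float(limit):.0f}'
        if index == 0:
            # The first category has no lower bound
            label = f'< {to}'
        else:
            start = f'{float(limits[index - 1]):.0f}'
            label = f'{start}-{to}'
        labels.append(label + units)
    return labels


print(range_labels([1000, 5000, 30000]))
```

Each label is then paired with the colour at the same index when the <li> elements are built.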

When setting up the layers control in Leaflet you pass an object mapping display names (the text you want displayed in the layers control) to Layer objects – something like this:

var layers = {
    'OpenStreetMap': layer_OSM,
    'IMD': layer_IMD
};

var layersControl = L.control.layers({},
    layers,
    { collapsed: false }).addTo(map);

To use the function to generate a legend, replace the simple display name with a call to the function, wrapped in [] to make it a computed property name (an ES2015 feature), because JavaScript does not allow a function call as a plain object key. For example:

var layers = {
    'OpenStreetMap': layer_OSM,
    [legend_for_choropleth_layer(layer_IMD, 'IMD', '', 'legend_IMD')]: layer_IMD
};

Here we’re passing layer_IMD as the Layer object, IMD as the name to display above the legend, no units (so the empty string), and telling it to give the legend HTML element an ID of legend_IMD.

This produces a legend that looks something like this:

To get this nice looking legend, we use the following CSS:

.sublegend-color {
    width: 20px;
    border: 1px solid #666666;
    display: inline-block;
    opacity: 0.5;
}

.sublegend-item {
    padding-top: 0.2em;
}

.sublegend {
    list-style: none;
    padding-inline-start: 24px;
    margin-top: 0px;
}

Just for one final touch, I’d like the legend to disappear when the layer is ‘turned off’, and reappear when it is ‘turned on’ again. This is particularly useful when you have multiple choropleth layers on a map and the combined length of the legends makes the layers control very long.

We can do this with a quick bit of jQuery (yes, I know it can be done in pure JavaScript, but I prefer using jQuery as it’s generally easier). Remember that one of the parameters to the legend_for_choropleth_layer function was the HTML ID to give the legend? Now you know why: we need to use that ID to hide and show the legend.

We connect to some of the Leaflet events to find out when the layers are turned on or off, and then use the jQuery hide and show methods. There’s one little niggle though: we have to use the setTimeout function to ensure that we only run this once – otherwise we get multiple events raised and it causes problems. So, the code to do this is:

layer_IMD.on('add', function () {
    // Need setTimeout so that we don't get multiple
    // onadd/onremove events raised
    setTimeout(function () {
        $('#legend_IMD').show();
    });
});

layer_IMD.on('remove', function () {
    // Need setTimeout so that we don't get multiple
    // onadd/onremove events raised
    setTimeout(function () {
        $('#legend_IMD').hide();
    });
});

You can see how this works by looking at the final map here – try turning the IMD layer off and on again.

All of the code behind this example is available on Github if you want to check how it all fits together.

This work was done while analysing GIS data and producing a webmap for a freelancing client. If you’d like me to do something similar for you, have a look at my freelance website or email me.


Easily hiding items from the legend in matplotlib

When producing some graphs for a client recently, I wanted to hide some labels from a legend in matplotlib. I started investigating complex arguments to the plt.legend function, but it turned out that there was a really simple way to do it…

If you start your label for a plot item with an underscore (_) then that item will be hidden from the legend.

For example:

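import numpy as np
import matplotlib.pyplot as plt

plt.plot(np.random.rand(20), label='Random 1')
plt.plot(np.random.rand(20), label='Random 2')
# The leading underscore keeps this line out of the legend
plt.plot(np.random.rand(20), label='_Hidden label')
plt.legend()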

produces a plot like this:
file

You can see that the third line is hidden from the legend – just because we started its label with an underscore.
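If you want to check this without looking at the figure, here is a minimal sketch that asks matplotlib which entries it would put in the legend. It assumes the non-interactive Agg backend (so it runs without a display) and uses the Axes method get_legend_handles_labels, which skips labels that start with an underscore:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots()
ax.plot(np.random.rand(20), label="Random 1")
ax.plot(np.random.rand(20), label="Random 2")
ax.plot(np.random.rand(20), label="_Hidden label")

# Only artists whose labels do not start with "_" are returned here
handles, labels = ax.get_legend_handles_labels()
print(labels)
```

The printed list should contain only the two visible labels, even though all three lines are drawn on the axes.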

I found this particularly useful when I wanted to plot a load of lines in the same colour to show all the data for something, and then highlight a few lines that meant specific things. For example:

for i in range(20):
    plt.plot(np.random.rand(20), label='_Hidden', color='gray', alpha=0.3)
plt.plot(np.random.rand(20), label='Important Line 1')
plt.plot(np.random.rand(20), label='Important Line 2')
plt.legend()

file

My next step was to do this when plotting from pandas. In this case I had a dataframe that had a column for each line I wanted to plot in the ‘background’, and then a separate dataframe with each of the ‘special’ lines to highlight.

This code will create an example dataframe of ‘background’ data, plus a Series standing in for the ‘special’ data:

df = pd.DataFrame()

for i in range(20):
    df[f'Data{i}'] = np.random.rand(20)

special = pd.Series(data=np.random.rand(20))

Plotting this produces a legend with all the individual lines showing:

df.plot(color='gray', alpha=0.3)

file

However, just by changing the column names to start with an underscore you can hide all the entries in the legend. In this example, I actually set one of the columns to a name without an underscore, so that column can be used as a label to represent all of these lines:

cols = ["_" + col for col in df.columns]
cols[0] = 'All other data'
df.columns = cols

Plotting again using exactly the same command as above gives us the plot below, along with some warnings saying that a load of legend items are going to be ignored (in case we accidentally had pandas columns starting with _):

file
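If you do this renaming often, it could be wrapped in a small helper. This is only a sketch: the function name hide_from_legend and its parameters are made up for illustration and are not part of pandas or matplotlib.

```python
def hide_from_legend(columns, keep=0, shared_label="All other data"):
    """Return new column names that hide every column from the legend
    except one, which gets a shared label (hypothetical helper)."""
    # Prefix every name with an underscore so matplotlib skips it
    new_names = ["_" + str(col) for col in columns]
    # Give one column a visible label to represent the whole group
    new_names[keep] = shared_label
    return new_names


names = hide_from_legend([f"Data{i}" for i in range(4)])
print(names)
```

With a dataframe you would then assign the result back, as in df.columns = hide_from_legend(df.columns). Note that this sketch raises an IndexError if there are no columns at all.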

Putting it all together, we can plot both dataframes, with a sensible legend:

ax = df.plot(color='gray', alpha=0.3)
special.plot(ax=ax, label='Special data')
plt.legend()

file

Advert: I do freelance data science work – please see here for more details.


Calculating Rayleigh Reflectance using Py6S

A user of Py6S recently contacted me to ask if it was possible to get an output of Rayleigh reflectance from Py6S. Unfortunately this email wasn’t sent to the Py6S Google Group, so I thought I’d write a blog post explaining how to do this, and showing a few outputs (reminder: please post Py6S questions there rather than emailing me directly, then people with questions in the future can find the answers there rather than asking again).

So, first of all, what is Rayleigh reflectance? Well, it’s the reflectance (as measured at the top-of-atmosphere) that is caused by Rayleigh scattering in the atmosphere. This is the wavelength-dependent scattering of light by gas molecules in the atmosphere – and it is an inescapable effect of light passing through the atmosphere.

So, on to how to calculate it in Py6S. Unfortunately the underlying 6S model doesn’t provide Rayleigh reflectance as an output, so we have to do a bit more work to calculate it.

First, let’s import Py6S and set up a few basic parameters:

from Py6S import *

s = SixS()

# Standard altitude settings for the sensor
# and target
s.altitudes.set_sensor_satellite_level()
s.altitudes.set_target_sea_level()

# Wavelength of 0.5 micrometres (500nm)
s.wavelength = Wavelength(0.5)

Now, to calculate the reflectance which is entirely due to Rayleigh scattering we need to ‘turn off’ everything else that could contribute to the reflectance. First, we ‘turn off’ the ground contribution by setting the ground reflectance to zero:

s.ground_reflectance = GroundReflectance.HomogeneousLambertian(0)

Then we turn off aerosol scattering:

s.aero_profile = AeroProfile.PredefinedType(AeroProfile.NoAerosols)

and also atmospheric absorption by gases:

s.atmos_profile = AtmosProfile.PredefinedType(AtmosProfile.NoGaseousAbsorption)

We can then run the simulation (using s.run()) and look at the outputs. The best way to do this is to just run:

print(s.outputs.fulltext)

to look at the ‘pretty’ text output that Py6S provides. The value we want is the ‘apparent reflectance’ – which is the reflectance at the top-of-atmosphere. Because we’ve turned off everything else, this will be purely caused by the Rayleigh reflectance.

We can access this value programmatically as s.outputs.apparent_reflectance.

So, that’s how to get the Rayleigh reflectance – but there are a few more interesting things to say…

Firstly, we don’t actually have to set the ground reflectance to zero. If we set the ground reflectance to something else – for example:

s.ground_reflectance = GroundReflectance.HomogeneousLambertian(GroundReflectance.GreenVegetation)

and run the simulation, then we will get a different answer for the apparent reflectance – because the ground reflectance is now being taken into account – but we will see the value we want as the atmospheric intrinsic reflectance. This is the reflectance that comes directly from the atmosphere (in this case just from Rayleigh scattering, but in normal situations this would include aerosol scattering as well). This can be accessed programmatically as s.outputs.atmospheric_intrinsic_reflectance.

One more thing, just to show that Rayleigh reflectance in Py6S behaves in the manner that we’d expect from what we know of the physics… We can put together a bit of code that will extract the Rayleigh reflectance at various wavelengths and plot a graph. Rayleigh scattering is roughly proportional to λ⁻⁴, so we’d expect a steeply decreasing curve, showing high Rayleigh reflectance at short wavelengths and low reflectance at long wavelengths.

The code below will do this:

from Py6S import *
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Jupyter/IPython magic: remove the next line if running as a plain script
%matplotlib inline

s = SixS()

s.altitudes.set_sensor_satellite_level()
s.altitudes.set_target_sea_level()
s.aero_profile = AeroProfile.PredefinedType(AeroProfile.NoAerosols)
s.atmos_profile = AtmosProfile.PredefinedType(AtmosProfile.NoGaseousAbsorption)

wavelengths = np.arange(0.3, 1.0, 0.05)
results = []

for wv in wavelengths:
    s.wavelength = Wavelength(wv)
    s.run()

    results.append({'wavelength': wv,
                   'rayleigh_refl': s.outputs.atmospheric_intrinsic_reflectance})

results = pd.DataFrame(results)

results.plot(x='wavelength', y='rayleigh_refl', style='x-', label='Rayleigh Reflectance', grid=True)
plt.xlabel(r'Wavelength ($\mu m$)')  # raw string so \m is not treated as an escape
plt.ylabel('Rayleigh Reflectance (no units)')

This produces the following graph, which shows exactly what the physics predicts:

file
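As a rough sanity check on the shape of that curve, we can work out what a simple λ⁻⁴ scaling gives. This is just arithmetic on the textbook Rayleigh scattering law, not a Py6S calculation: the function name rayleigh_ratio is made up for illustration, and the reflectance simulated by 6S won’t follow this scaling exactly, because it also depends on things like the viewing geometry and multiple scattering.

```python
def rayleigh_ratio(wavelength_um, reference_um):
    """Scattering strength at wavelength_um relative to reference_um,
    assuming a pure lambda^-4 law (an approximation)."""
    return (reference_um / wavelength_um) ** 4


# Compare a few visible/near-infrared wavelengths against 0.8 micrometres
for wv in (0.4, 0.5, 0.6, 0.8):
    ratio = rayleigh_ratio(wv, 0.8)
    print(f"{wv:.1f} um: {ratio:.2f}x the scattering at 0.8 um")
```

Halving the wavelength from 0.8 to 0.4 micrometres increases the scattering by a factor of 16 under this law, which is why the curve drops off so quickly towards longer wavelengths.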

There’s nothing particularly revolutionary in that chunk of code – we’ve just combined the code I demonstrated earlier, and then looped through various wavelengths and run the model for each wavelength.

The way that we’re storing the results from the model deserves a brief explanation, as this is a pattern I use a lot. Each time the model is run, a new dict is appended to a list – and this dict has entries for the various parameters we’re interested in (in this case just wavelength) and the various results we’re interested in (in this case just Rayleigh reflectance). After we’ve finished the loop we can simply pass this list of dicts to pd.DataFrame() and get a nice pandas DataFrame back – ready to display, plot or analyse further.
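The same pattern extends naturally to more than one parameter. Here is a minimal sketch using a stand-in function instead of a real model run: fake_model and its formula are entirely made up for illustration (it is not a Py6S call), and solar_z is just an example second parameter.

```python
import pandas as pd


def fake_model(wavelength, solar_z):
    """Hypothetical stand-in for a model run; the formula is arbitrary."""
    return 0.01 * (1 + solar_z / 90) / wavelength ** 4


rows = []
for wavelength in (0.4, 0.5, 0.6):
    for solar_z in (0, 30):
        # One dict per run: the parameters used, plus the result
        rows.append({"wavelength": wavelength,
                     "solar_z": solar_z,
                     "output": fake_model(wavelength, solar_z)})

# Each dict becomes a row, and each key becomes a column
df = pd.DataFrame(rows)
print(df)
```

Because the parameters are stored alongside the results, the dataframe can be filtered, grouped or pivoted by any of them afterwards.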