Robin's Blog

Using SQLAlchemy to access MySQL without frustrating library installation issues

November 26, 2019

This is more a ‘note to myself’ than anything else, but I expect some other people might find it useful.

I’ve often struggled with accessing MySQL from Python, as the ‘default’ MySQL library for Python is MySQLdb. This library has a number of problems: 1) it is Python 2 only, and 2) it requires compiling against the MySQL C library and header files, and so can’t be simply installed using pip.

There is a Python 3 version of MySQLdb called mysqlclient, but this also requires compiling against the MySQL libraries and header files, so can be complicated to install.

The best library I’ve found as a replacement is PyMySQL which is a pure Python library (so no need to install MySQL libraries and header files). It’s API is basically exactly the same as MySQLdb, so it’s easy to switch across.

Right, that’s the introduction – and we’re really at the actual point of this post, which is how to go about using the PyMySQL library ‘under the hood’ when you’re accessing databases through SQLAlchemy.

The weird thing is that I’m not actually using SQLAlchemy by choice in my code – but it is used by pandas to convert between SQL and data frames.

For example, you can write code like this:

from sqlalchemy import create_engine
eng = create_engine('mysql://user:pass@127.0.0.1/database')
df.to_sql('table', eng, if_exists='append', index=False)

which will append the data in df to a table in a database running on the local machine.

The create_engine call is a SQLAlchemy function which creates an engine to handle all of the complex communication to and from a specific database.

Now, when you specify a database connection string with the mysql:// prefix, SQLAlchemy tries to use the MySQLdb library to do the underlying communication with the MySQL database – and fails if it can’t be found.

So, now we’re at the actual solution: which is that you can give SQLAlchemy a ‘dialect’ to use to connect to a database – and this can be used to change the underlying library that is used to talk to the database.

So, you can change your connection string to mysql+pymysql://user:pass@127.0.0.1/database and it will use the PyMySQL library. It’s as simple as that!

There are other dialects that you can use to connect to MySQL using different underlying libraries – although these aren’t recommended by the authors of SQLAlchemy. You can find a list of them here.

_I do data science work – including processing data in MySQL databases – as part of my freelance work. Please contact me for more details._

1 Comment

A couple of handy zsh/bash functions for Python programmers

November 20, 2019

Just a quick post today, to tell you about a couple of simple zsh functions that I find handy as a Python programmer.

First, pyimp – a very simple function that tries to import a module in Python and displays the output. If there is no output then the import succeeded, otherwise you’ll see the error. This saves constantly going into a Python interpreter and trying to import something, making that ‘has it worked or not’ cycle a bit quicker when installing a tricky package.

The function is defined as

function pyimp() { python -c "import $1" }

This just calls Python with the -c flag which tells it to execute the code you’ve given on the command line – which in this case is just an import command.

You can see below that it returns nothing for a module which is importable, but returns the error for anything which fails:

$ pyimp numpy
$ pyimp blah
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'blah'

The second is pycd which changes directory to the folder where a particular module is defined. This can be useful if you want to inspect the code of the module in depth, or if you’ve installed the module in ‘develop mode’ and want to actually edit the code.

It’s defined as:

function pycd () {
    pushd python -c "import os.path, $1; print(os.path.dirname($1.__file__))";
}

It just changes to the directory that modulename.file is located in – again, fairly simple but quite useful.

As you’ve read to here, I’ll drop in a bonus function to display the column names of a CSV file:

function csvcols() { head -n 1 $1 | tr , \\n }

This combines the unix head tool to get the line of a file and the tr tool to convert commas to newlines, to make a handy little command.

Share your thoughts

Easily specifying colours from the default colour cycle in matplotlib

November 14, 2019

Another quick matplotlib tip today: specifically, how easily specify colours from the standard matplotlib colour cycle.

A while back, when matplotlib overhauled their themes and colour schemes, they changed the default cycle of colours used for lines in matplotlib. Previously the first line was pure blue (color='b' in matplotlib syntax), then red, then green etc. They, very sensibly, changed this to a far nicer selection of colours.

However, this change made one thing a bit more difficult – as I found recently. I had plotted a couple of simple lines:

x_values = [0, 1, 2]
line1 = np.array([10, 20, 30])
line2 = line1[::-1]

plt.plot(x_values, line1)
plt.plot(x_values, line2)

which gives

file

I then wanted to plot a shaded area around the second line (the yellow one) – for example, to show the uncertainty in that line.

You can do this with the plt.fill_between function, like this:

plt.fill_between(x_values, line2 - 5, line2 + 5, alpha=0.3, color='y')

This produces a shaded line which extends from 5 below the line to 5 above the line:

file

Unfortunately the colours don’t look quite right: the line isn’t yellow, so doing a partially-transparent yellow background doesn’t look quite right.

I spent a while looking into how to extract the colour of the line so I could use this for the shading, before finding a really easy way to do it. To get the colours in the default colour cycle you can simply use the strings 'C0', 'C1', 'C2' etc. So, in this case just

plt.fill_between(x_values, line2 - 5, line2 + 5, alpha=0.3, color='C1')

The result looks far better now the colours match:

file

I found out about this from a wonderful graphical matplotlib cheatsheet created by Nicolas Rougier – I’d strongly suggest you check it out, there are all sorts of useful things on there that I never knew about!

Just in case you need to do this the manual way, then there are two fairly straightforward ways to get the colour of the second line.

The first is to get the default colour cycle from the matplotlib settings, and extract the relevant colour:

cycle_colors = plt.rcParams['axes.prop_cycle'].by_key()['color']

which gives a list of colours like this:

['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', ...]

You can then just use one of these colours in the call to plt.fill_between – for example:

plt.fill_between(x_values, line2 - 5, line2 + 5, alpha=0.3, color=cycle_colors[1])

The other way is to actually extract the colour of the actual line you plotted, and then use that for the plt.fill_between call

x_values = [0, 1, 2]
line1 = np.array([10, 20, 30])
line2 = line1[::-1]

plt.plot(x_values, line1)
plotted_line = plt.plot(x_values, line2)
plt.fill_between(x_values, line2 - 5, line2 + 5, alpha=0.3, color=plotted_line[0].get_color())

Here we save the result of the plt.plot call when we plot the second line. This gives us a list of the Line2D objects that were created, and we then extract the first (and only) element and call the get_color() method to extract the colour.

I do freelance work in data science and data visualisation – including using matplotlib. If you’d like to work with me, have a look at my freelance website or email me.

Share your thoughts

Markdown in WordPress – and writing blog posts is fun again

November 6, 2019

As you may have noticed, I hadn’t blogged here for quite a while, but have recently started blogging regularly again. This is mostly due to sorting out various WordPress issues I was having, and installing some new plugins to make writing blog posts fun again.

Ever since I installed the WordPress update that added the ‘Gutenberg’ editor, I had various problems with editing and creating new posts. I eventually switched back to the Classic Editor (following these instructions), but still wasn’t really happy. I’ve never really been a huge fan of the WordPress editor – it has been a fiddly to get things formatted the way I want, and it’s never dealt with code snippets very well.

I’ve had some plugins installed to do syntax highlighting, but these have required typing the code into a separate little dialog, and not being able to edit it easily after adding it. I really wanted to be able to include code as easily as I do in Markdown documents using ‘code fence’ syntax. For example, something like this:

  `python
def func(x):
    print(x)
    return 2*x
  `

(with no spaces between the backticks – I had to include those or that example would have been syntax highlighted for me)

Basically, I wanted to write my posts in Markdown. I investigated static blog generators, but didn’t want to deal with converting all of my previous posts, and trying to make sure URLs still redirected properly and so on.

Anyway, I found a solution which works really well for me: the WP Githuber MD plugin.

This allows you to write your posts in Markdown, and it supports Github-style fenced code blocks, with syntax highlighting.

All you need to do is install it and then enable the correct settings. To do this:

Go to the Plugins -> Installed Plugins page
Find ‘WP Githuber MD’ and click ‘Settings’
Go to the ‘Modules’ tab at the top
Turn the switch on the right-hand side of the ‘Syntax Highlight’ heading on
Fiddle with the syntax highlighting settings to your own preferences
(Optional) Turn on the switch next to ‘Image Paste’ to make it really easy to add images to your posts

That’s all that needs doing – now your code blocks will be nicely formatted, and you don’t have to bother with typing code into silly dialogs, just write the post in Markdown and insert code as usual and everything ‘just works’.

As a brief postscript, the ‘Image Paste’ functionality is also really useful. Simply copy an image from somewhere on your computer – often I’m copying something like a matplotlib graph produced by a Python script – and then switch to the Markdown editor and paste. The image will then be uploaded to your WordPress Media Library and the right code to include the image will be inserted. All done with a single keypress!

So yes, overall, I am a big fan of WP Githuber MD – I’ve not been asked to say this, but it has really transformed my blog editing experience!

2 Comments

Five new-ish Python things – Part 1

October 29, 2019

I keep gathering links of interesting Python things I’ve seen around the internet: new packages, good tutorials, and so on – and so I thought I’d start a series where I share them every so often.

Not all of these are new new – some have been around for a while but are new to me – and so they might be new to you too!

Also, there is a distinct ‘PyData’ flavour to these things – they’re all things I’ve come across in my work in data science and geographic processing with Python.

So, on with the list:

removestar

I try really hard to follow the PEP8 style guide for my Python code – but I wasn’t so disciplined in the past, and so I’ve got a lot of old code sitting around which isn’t styled particularly well.

One of the things PEP8 recommends against is using: from blah import *. In my code I used to do a lot of from matplotlib.pyplot import *, and from Py6S import * – but it’s a pain to go through old code and work out what functions are actually used, and replace the import with something like from matplotlib.pyplot import plot, xlabel, title.

removestar is a tool that will do that for you! Just install it with pip install removestar and then it provides a command-line tool to fix your imports for you.

For example, using removestar on the Py6S case study code by running:

removestar ncaveo.py

Gives the following diff as output:

--- original/ncaveo.py
+++ fixed/ncaveo.py
@@ -1,7 +1,7 @@
 # Import Py6S
-from Py6S import *
+from Py6S import Geometry, GroundReflectance, SixS, SixSHelpers
 # Import the Matplotlib plotting environment
-from matplotlib.pyplot import *
+from matplotlib.pyplot import clf, legend, plot, savefig, xlabel, ylabel
 # Import the functions for copying objects
 import copy

To run it on all of the Python files in a module, and do the edits inplace rather than just showing the diffs, you can run it as follows:

removestar -i module_folder/

ipynb-quicklook

file
If you use OS X then you’ll know about the very handy ‘quicklook’ feature that shows you a preview of the selected file in Finder when pressing the spacebar. You can add support for new filetypes to quicklook using quicklook plugins – and I’d already set up a number of useful plugins which will show syntax-highlighted code, preview JSON, CSV and Markdown files nicely, and so on.

I only discovered ipynb-quicklook last week, and it does what you’d expect: it provides previews of Jupyter Notebook files from the Finder. Simply follow the instructions to place the ipynb-quicklook.qlgenerator file in your ~/Library/QuickLook folder, and it ‘Just Works’ – and it’s really quick to render the files too!

Nicolas Rougier’s Matplotlib Cheatsheet

file

This is a great cheatsheet for the matplotlib plotting library from Nicolas Rougier. It’s a great quick reference for all the various matplotlib settings and functions, and reminded me of a number of things matplotlib can do that I’d forgotten about.

Find the high-resolution cheatsheet image here and the repository with all the code used to create it here. Nicolas is also writing a book called Scientific Visualization – Python & Matplotlib which looks great – and it’ll be released open-access once it’s finished (you can donate to see it ‘in progress’).

PyGEOS

If you’re not interested in geographic data processing using Python then this probably won’t interest you…but for those who are interested this looks great. PyGEOS provides native Python bindings to the GEOS library which is used for geometry manipulation by many geospatial tools (such as calculating distances, or finding out whether one geometry contains another). However, by using the underlying C library PyGEOS bypasses the Python interpreter for a lot of the calculations, allowing them to be vectorised efficiently and making it very fast to apply these geometry functions: their preliminary performance tests show speedups ranging from 4x to 136x. The interface is very simple too – for example:

import pygeos
import numpy as np

points = [
    pygeos.Geometry("POINT (1 9)"),
    pygeos.Geometry("POINT (3 5)"),
    pygeos.Geometry("POINT (7 6)")
]
box = pygeos.box(2, 2, 7, 7)
pygeos.contains(box, points)

This project is still in the early days – but definitely one to watch as I think it will have a big impact on the efficiency of Python-based spatial analysis.

napari

file
napari is a fast multi-dimensional image viewer for Python. I found out about it through an extremely comprehensive blog post written by Juan Nunez-Iglesias where he explains the background to the project and what problems it is designed to solve.

One of the key features of napari is that it has a full Python API, allowing you to easily visualise images from within Python – as easily as using imshow() from matplotlib, but with far more features. For example, to view three of the scikit-image sample images just run:

from skimage import data
import napari

with napari.gui_qt():
    viewer = napari.Viewer()
    viewer.add_image(data.astronaut(), name='astronaut')
    viewer.add_image(data.moon(), name='moon')
    viewer.add_image(data.camera(), name='camera')

You can then add some vector points over the image – for example, to use as starting points for a segmentation:

points = np.array([[100, 100], [200, 200], [300, 100]])
viewer.add_points(points, size=30)

That is very useful for me already, and it’s just a tiny taste of what napari has to offer. I’ve only played with it for a short time, but I can already see it being really useful for me next time I’m doing a computer vision project, and I’m already planning to discuss some potential new features to help with satellite imagery work. Definitely something to check out if you’re involved in image processing in any way.

If you liked this, then get me to work for you! I do freelance work in data science, Python development and geospatial analysis – please contact me for more details, or look at my freelance website

1 Comment

Setting limits for a choropleth layer manually with leaflet-choropleth

October 23, 2019

Following on from my last post on plotting choropleth maps with the leaflet-choropleth library, I’m now going to talk about a small addition I’ve made to the library.

Leaflet-choropleth has built-in functionality to automatically categorise your data: you tell it how many categories you’d like and it splits it up. However, once I’d set up my webmap with leaflet-choropleth, using the automatically generated categories, my client said she wanted specific categories to be used. Unfortunately leaflet-choropleth didn’t support that…so I added it!

(It always pleases me a lot that if you’re in a situation where some open-source code doesn’t do what you want it to do, you can just modify it – and then you can contribute the code back to the project too!)

The pull request for this new functionality hasn’t yet been merged, but the updated code is available from my fork. The specific file you need is the updated choropleth.js file. Once you’ve replaced the original choropleth.js with this new version, you will be able to use a new limits option when calling L.choropleth. For example:

var layer_IMD = L.choropleth(geojson, {
    valueProperty: 'IMDRank',
    limits: [1000, 5000, 30000],
    scale: ['red', 'orange', 'yellow'],
    style: {
        color: '#111111', // border color
        weight: 1,
        fillOpacity: 0.5,
        fillColor: '#ffffff'
    }
}).addTo(map);

The value of the limits property should be the ‘dividing lines’ for the limits: so in this case there will be categories of < 1000, 1000-5000, etc.

I think that’s pretty-much all I can say about this – the code for an example map using this new functionality is available on Github and you can see a live map demo here.

This work was done while analysing GIS data and producing a webmap for a freelancing client. If you’d like me to do something similar for you, have a look at my freelance website or email me.

Share your thoughts

Automatically generating a legend for a choropleth layer in Leaflet

October 14, 2019

Some work I’ve been doing recently has involved putting together a webmap using the Leaflet library. I’ve been very impressed with how Leaflet works, and the range of plugins available for it.

leaflet-choropleth is an extension for Leaflet that allows easy generation of choropleth maps in Leaflet. The docs for this module are pretty good, so I’ll just show a quick example of how to use it in a fairly basic way:

var layer_IMD = L.choropleth(geojson, {
        valueProperty: 'IMDRank',
        scale: ['red', 'orange', 'yellow'],
        style: {
            color: '#111111', // border color
            weight: 1,
            fillOpacity: 0.5,
            fillColor: '#ffffff'
        }
    }).addTo(map);

This displays a choropleth based on the GeoJSON data in geojson, and uses a red-orange-yellow colourmap, basing the colours on the IMDRank property of each GeoJSON feature.

This will produce something like this – a map of Index of Multiple Deprivation values in Southampton, UK (read later if you want to see a Github repository of a full map):

One thing I wanted to do was create a legend for this layer in the Leaflet layers control. The leaflet-choropleth docs give an example of creating a legend, but I don’t really like the style, and the legend appears in a separate box rather than in the layers control for the map.

So, I put together a javascript function to create the sort of legend I wanted. For those who just want to use the function, it’s below. For those who want more details, read on…

function legend_for_choropleth_layer(layer, name, units, id) {
    // Generate a HTML legend for a Leaflet layer created using choropleth.js
    //
    // Arguments:
    // layer: The leaflet Layer object referring to the layer - must be a layer using
    //        choropleth.js
    // name: The name to display in the layer control (will be displayed above the legend, and next
    //       to the checkbox
    // units: A suffix to put after each numerical range in the layer - for example to specify the
    //        units of the values - but could be used for other purposes)
    // id: The id to give the <ul> element that is used to create the legend. Useful to allow the legend
    //     to be shown/hidden programmatically
    //
    // Returns:
    // The HTML ready to be used in the specification of the layers control
    var limits = layer.options.limits;
    var colors = layer.options.colors;
    var labels = [];

    // Start with just the name that you want displayed in the layer selector
    var HTML = name

    // For each limit value, create a string of the form 'X-Y'
    limits.forEach(function (limit, index) {
        if (index === 0) {
            var to = parseFloat(limits[index]).toFixed(0);
            var range_str = "< " + to;
        }
        else {
            var from = parseFloat(limits[index - 1]).toFixed(0);
            var to = parseFloat(limits[index]).toFixed(0);
            var range_str = from + "-" + to;
        }

        // Put together a <li> element with the relevant classes, and the right colour and text
        labels.push('<li class="sublegend-item"><div class="sublegend-color" style="background-color: ' +
            colors[index] + '">Â </div>Â ' + range_str + units + '</li>');
    })

    // Put all the <li> elements together in a <ul> element
    HTML += '<ul id="' + id + '" class="sublegend">' + labels.join('') + '</ul>';

    return HTML;
}

This function is fairly simple: it loops through the limits that have been defined for each of the categories in the choropleth map, and generates a chunk of HTML for each of the different categories (specifically, a <li> element), and these elements are put together and wrapped in a <ul> to produce the final HTML for the legend. We also set CSS classes for each element of the legend, so we can style them nicely later.

When setting up the layers control in Leaflet you pass an object mapping display names (the text you want displayed in the layers control) to Layer objects – something like this:

var layers = {
    'OpenStreetMap': layer_OSM,
    'IMD': layer_IMD
};

var layersControl = L.control.layers({},
    layers,
    { collapsed: false }).addTo(map);

To use the function to generate a legend, replace the simple display name with a call to the function, wrapped in []‘s because of javascript’s weird inability to parse function calls in object keys. For example:

var layers = {
    'OpenStreetMap': layer_OSM,
    [legend_for_choropleth_layer(layer_IMD, 'IMD', '', 'legend_IMD')]: layer_IMD
};

Here we’re passing layer_IMD as the Layer object, IMD as the name to display above the legend, no units (so the empty string), and telling it to give the legend HTML element an ID of legend_IMD.

This produces a legend that looks something like this:

To get this nice looking legend, we use the following CSS:

.sublegend-color {
    width: 20px;
    border: 1px solid #666666;
    display: inline-block;
    opacity: 0.5;
}

.sublegend-item {
    padding-top: 0.2em;
}

.sublegend {
    list-style: none;
    padding-inline-start: 24px;
    margin-top: 0px;
}

Just for one final touch, I’d like the legend to disappear when the layer is ‘turned off’, and appear again when it is ‘turned on’ again. This is particularly useful when you have multiple choropleth layers on a map and the combined length of the legends make the layers control very long.

We can do this with a quick bit of jQuery (yes, I know it can be done in pure javascript, but I prefer using jQuery as it’s generally easier). Remember that one of the parameters to the legend_for_choropleth_layer function was the HTML ID to give the legend? Now you know why: we need to use that ID to hide and show the legend.

We connect to some of the Leaflet events to find out when the layers are turned on or off, and then use the jQuery hide and show methods. There’s one little niggle though: we have to use the setTimeout function to ensure that we only run this once – otherwise we get multiple events raised and it causes problems. So, the code to do this is:

layer_IMD.on('add', function () {
    // Need setTimeout so that we don't get multiple
    // onadd/onremove events raised
    setTimeout(function () {
        $('#legend_IMD').show();
    });
});

layer_IMD.on('remove', function () {
    // Need setTimeout so that we don't get multiple
    // onadd/onremove events raised
    setTimeout(function () {
        $('#legend_IMD').hide();
    });
});

You can see how this works by looking at the final map here – try turning the IMD layer off and on again.

All of the code behind this example is available on Github if you want to check how it all fits together.

This work was done while analysing GIS data and producing a webmap for a freelancing client. If you’d like me to do something similar for you, have a look at my freelance website or email me.

2 Comments

Easily hiding items from the legend in matplotlib

October 8, 2019

When producing some graphs for a client recently, I wanted to hide some labels from a legend in matplotlib. I started investigating complex arguments to the plt.legend function, but it turned out that there was a really simple way to do it…

If you start your label for a plot item with an underscore (_) then that item will be hidden from the legend.

For example:

plt.plot(np.random.rand(20), label='Random 1')
plt.plot(np.random.rand(20), label='Random 2')
plt.plot(np.random.rand(20), label='_Hidden label')
plt.legend()

produces a plot like this:
file

You can see that the third line is hidden from the legend – just because we started its label with an underscore.

I found this particularly useful when I wanted to plot a load of lines in the same colour to show all the data for something, and then highlight a few lines that meant specific things. For example:

for i in range(20):
    plt.plot(np.random.rand(20), label='_Hidden', color='gray', alpha=0.3)
plt.plot(np.random.rand(20), label='Important Line 1')
plt.plot(np.random.rand(20), label='Important Line 2')
plt.legend()

file

My next step was to do this when plotting from pandas. In this case I had a dataframe that had a column for each line I wanted to plot in the ‘background’, and then a separate dataframe with each of the ‘special’ lines to highlight.

This code will create a couple of example dataframes:

df = pd.DataFrame()

for i in range(20):
    df[f'Data{i}'] = np.random.rand(20)

special = pd.Series(data=np.random.rand(20))

Plotting this produces a legend with all the individual lines showing:

df.plot(color='gray', alpha=0.3)

file

However, just by changing the column names to start with an underscore you can hide all the entries in the legend. In this example, I actually set one of the columns to a name without an underscore, so that column can be used as a label to represent all of these lines:

cols = ["_" + col for col in df.columns]
cols[0] = 'All other data'
df.columns = cols

Plotting again using exactly the same command as above gives us this – along with some warnings saying that a load of legend items are going to be ignored (in case we accidentally had pandas columns starting with _)

file

Putting it all together, we can plot both dataframes, with a sensible legend:

ax = df.plot(color='gray', alpha=0.3)
special.plot(ax=ax, label='Special data')
plt.legend()

file

Advert: I do freelance data science work – please see here for more details.

4 Comments

Calculating Rayleigh Reflectance using Py6S

October 2, 2019

A user of Py6S recently contacted me to ask if it was possible to get an output of Rayleigh reflectance from Py6S. Unfortunately this email wasn’t sent to the Py6s Google Group, so I thought I’d write a blog post explaining how to do this, and showing a few outputs (reminder: please post Py6S questions there rather than emailing me directly, then people with questions in the future can find the answers there rather than asking again).

So, first of all, what is Rayleigh reflectance? Well, it’s the reflectance (as measured at the top-of-atmosphere) that is caused by Rayleigh scattering in the atmosphere. This is the wavelength-dependent scattering of light by gas molecules in the atmosphere – and it is an inescapable effect of light passing through the atmosphere.

So, on to how to calculate it in Py6S. Unfortunately the underlying 6S model doesn’t provide Rayleigh reflectance as an output, so we have to do a bit more work to calculate it.

First, let’s import Py6S and set up a few basic parameters:

from Py6S import *

s = SixS()

# Standard altitude settings for the sensor
# and target
s.altitudes.set_sensor_satellite_level()
s.altitudes.set_target_sea_level()

# Wavelength of 0.5nm
s.wavelength = Wavelength(0.5)

Now, to calculate the reflectance which is entirely due to Rayleigh scattering we need to ‘turn off’ everything else that is going on that could contribute to the reflectance. First, we ‘turn off’ the ground reflectance by setting it to zero, so we won’t have any contribution from the ground reflectance:

s.ground_reflectance = GroundReflectance.HomogeneousLambertian(0)

Then we turn off aerosol scattering:

s.aero_profile = AeroProfile.PredefinedType(AeroProfile.NoAerosols)

and also atmospheric absorption by gases:

s.atmos_profile = AtmosProfile.PredefinedType(AtmosProfile.NoGaseousAbsorption)

We can then run the simulation (using s.run()) and look at the outputs. The best way to do this is to just run:

print(s.outputs.fulltext)

to look at the ‘pretty’ text output that Py6S provides. The value we want is the ‘apparent reflectance’ – which is the reflectance at the top-of-atmosphere. Because we’ve turned off everything else, this will be purely caused by the Rayleigh reflectance.

We can access this value programmatically as s.outputs.apparent_reflectance.

So, that’s how to get the Rayleigh reflectance – but there are a few more interesting things to say…

Firstly, we don’t actually have to set the ground reflectance to zero. If we set the ground reflectance to something else – for example:

s.ground_reflectance = GroundReflectance.HomogeneousLambertian(GroundReflectance.GreenVegetation)

and run the simulation, then we will get a different answer for the apparent radiance – because the ground reflectance is now being taken into account – but we will see the value we want as the atmospheric intrinsic reflectance. This is the reflectance that comes directly from the atmosphere (in this case just from Rayleigh scattering, but in normal situations this would include aerosol scattering as well). This can be accessed programmatically as s.outputs.atmospheric_intrinsic_reflectance.

One more thing, just to show that Rayleigh reflectance in Py6S behaves in the manner that we’d expect from what we know of the physics… We can put together a bit of code that will extract the Rayleigh reflectance at various wavelengths and plot a graph – we’d expect an exponentially-decreasing curve, showing high Rayleigh reflectance at low wavelengths, and vice versa.

The code below will do this:

from Py6S import *
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

s = SixS()

s.altitudes.set_sensor_satellite_level()
s.altitudes.set_target_sea_level()
s.aero_profile = AeroProfile.PredefinedType(AeroProfile.NoAerosols)
s.atmos_profile = AtmosProfile.PredefinedType(AtmosProfile.NoGaseousAbsorption)

wavelengths = np.arange(0.3, 1.0, 0.05)
results = []

for wv in wavelengths:
    s.wavelength = Wavelength(wv)
    s.run()

    results.append({'wavelength': wv,
                   'rayleigh_refl': s.outputs.atmospheric_intrinsic_reflectance})

results = pd.DataFrame(results)

results.plot(x='wavelength', y='rayleigh_refl', style='x-', label='Rayleigh Reflectance', grid=True)
plt.xlabel('Wavelength ($\mu m$)')
plt.ylabel('Rayleigh Reflectance (no units)')

This produces the following graph, which shows exactly what the physics predicts:

file

There’s nothing particularly revolutionary in that chunk of code – we’ve just combined the code I demonstrated earlier, and then looped through various wavelengths and run the model for each wavelength.

The way that we’re storing the results from the model deserves a brief explanation, as this is a pattern I use a lot. Each time the model is run, a new dict is appended to a list – and this dict has entries for the various parameters we’re interested in (in this case just wavelength) and the various results we’re interested in (in this case just Rayleigh reflectance). After we’ve finished the loop we can simply pass this list of dicts to pd.DataFrame() and get a nice pandas DataFrame back – ready to display, plot or analyse further.

8 Comments

Automatically annotating a boxplot in matplotlib

September 26, 2019

You can probably tell from the sudden influx of matplotlib posts that I’ve been doing a lot of work plotting graphs recently…

I have produced a number of boxplots to compare different sets of data. Some of these graphs are for a non-technical audience, and my client agreed that a boxplot was the best way to visualise the data, but wanted the various elements of the boxplot to be labelled so the audience could work out how to interpret it.

I started doing this manually using the plt.annotate function, but quickly got fed up with manually positioning everything – so I wrote a quick function to do it for me.

If you just want the code then here it is – it’s not perfect, but it should be a good starting point for you:

def annotate_boxplot(bpdict, annotate_params=None,
                     x_offset=0.05, x_loc=0,
                     text_offset_x=35,
                     text_offset_y=20):
    """Annotates a matplotlib boxplot with labels marking various centile levels.

    Parameters:
    - bpdict: The dict returned from the matplotlib `boxplot` function. If you're using pandas you can
    get this dict by setting `return_type='dict'` when calling `df.boxplot()`.
    - annotate_params: Extra parameters for the plt.annotate function. The default setting uses standard arrows
    and offsets the text based on other parameters passed to the function
    - x_offset: The offset from the centre of the boxplot to place the heads of the arrows, in x axis
    units (normally just 0-n for n boxplots). Values between around -0.15 and 0.15 seem to work well
    - x_loc: The x axis location of the boxplot to annotate. Usually just the number of the boxplot, counting
    from the left and starting at zero.
    text_offset_x: The x offset from the arrow head location to place the associated text, in 'figure points' units
    text_offset_y: The y offset from the arrow head location to place the associated text, in 'figure points' units
    """
    if annotate_params is None:
        annotate_params = dict(xytext=(text_offset_x, text_offset_y), textcoords='offset points', arrowprops={'arrowstyle':'->'})

    plt.annotate('Median', (x_loc + 1 + x_offset, bpdict['medians'][x_loc].get_ydata()[0]), **annotate_params)
    plt.annotate('25%', (x_loc + 1 + x_offset, bpdict['boxes'][x_loc].get_ydata()[0]), **annotate_params)
    plt.annotate('75%', (x_loc + 1 + x_offset, bpdict['boxes'][x_loc].get_ydata()[2]), **annotate_params)
    plt.annotate('5%', (x_loc + 1 + x_offset, bpdict['caps'][x_loc*2].get_ydata()[0]), **annotate_params)
    plt.annotate('95%', (x_loc + 1 + x_offset, bpdict['caps'][(x_loc*2)+1].get_ydata()[0]), **annotate_params)

You can then run code like this:

df = pd.DataFrame({'Column 1': np.random.normal(size=100),
                   'Column 2': np.random.normal(scale=2, size=100)})

bpdict = df.boxplot(whis=[5, 95], return_type='dict')
annotate_boxplot(bpdict, x_loc=1)

This will produce something like this:

file

You can pass various parameters to change the display. For example, to make the labels closer to the boxplot and lower than the thing they’re pointing at, set the parameters as: text_offset_x=20 and text_offset_y=-20, giving:

file

So, how does this work? Well, when you create a boxplot, matplotlib very helpfully returns you a dict containing the matplotlib objects referring to each part of the boxplot: the box, the median line, the whiskers etc. It looks a bit like this:

{'boxes': [<matplotlib.lines.Line2D object at 0x1179db908>,
           <matplotlib.lines.Line2D object at 0x1176ac3c8>],
 'caps': [<matplotlib.lines.Line2D object at 0x117736668>,
          <matplotlib.lines.Line2D object at 0x1177369b0>,
          <matplotlib.lines.Line2D object at 0x1176acda0>,
          <matplotlib.lines.Line2D object at 0x1176ace80>],
 'fliers': [<matplotlib.lines.Line2D object at 0x117736dd8>,
            <matplotlib.lines.Line2D object at 0x1176b07b8>],
 'means': [],
 'medians': [<matplotlib.lines.Line2D object at 0x117736cf8>,
             <matplotlib.lines.Line2D object at 0x1176b0470>],
 'whiskers': [<matplotlib.lines.Line2D object at 0x11774ef98>,
              <matplotlib.lines.Line2D object at 0x117736320>,
              <matplotlib.lines.Line2D object at 0x1176ac710>,
              <matplotlib.lines.Line2D object at 0x1176aca58>]}

Each of these objects is a matplotlib.lines.Line2D object, which has a get_xdata() and get_ydata() method (see the docs for more details). In this case, all we’re interested in is the y locations, so get_ydata() suffices.

All we do then is grab the right co-ordinate from the list of co-ordinates that are returned (noting that for the box we have to look at the 0th and the 2nd co-ordinates to get the bottom and top of the box respectively). We also have to remember that the caps dict entry has two objects for each individual boxplot – as there are caps at the bottom and the top – so we have to be a bit careful with selecting those.

The other useful thing to point out is that you can choose what co-ordinate systems you use with the plt.annotate function – so you can set textcoords='offset points' and then set the same xytext value each time we call it – and all the labels will be offset from their arrowhead location the same amount. That saves us manually calculating a location for the text each time.

This function was gradually expanded to allow more configurability – and could be expanded a lot more, but should work as a good starting point for anyone wanting to do the same sort of thing. It depends on a lot of features of plt.annotate – for more details on how to use this function look at its documentation or the Advanced Annotation guide.

3 Comments