Robin's Blog

Accessing Planetary Computer STAC files in DuckDB

Microsoft Planetary Computer is a wonderful archive of geospatial datasets (primarily raster images of various types), provided with a STAC catalog to enable them to be easily searched through an API. That’s fine for normal usage where you want to find a selection of images and access the images themselves, but less useful when you want to do some analysis of the metadata itself.

For that use case, Planetary Computer provide the STAC metadata in bulk, stored as GeoParquet files. Their documentation page explains how to use this with geopandas and do some large-ish scale processing with Dask. I wanted to try this with DuckDB – a newer tool that is excellent at accessing and processing Parquet files, including efficiently accessing files available via HTTP, and only downloading the relevant parts of the file. So, this post explains how I managed to do this – showing various different approaches I tried and how (or if!) each of them worked.

Getting URLs

Planetary Computer URLs are basically just URLs to files on Azure Blob Storage. However, the URLs require signing with a Shared Access Signature (SAS) key. Luckily, there is a Planetary Computer Python module that will use an API to generate the correct keys for us.

In the docs for using the STAC GeoParquet files, they give this code:

import pystac_client
import planetary_computer

catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1/",
    modifier=planetary_computer.sign_inplace,
)

asset = catalog.get_collection("io-lulc-9-class").assets["geoparquet-items"]

This gets a collection, and the geoparquet-items collection-level asset, having used a ‘modifier’ to sign the URLs as they are acquired from the STAC API. The URL is then stored in asset.href.

Using the DuckDB CLI – with HTTP URLs

When using GeoPandas, the docs show that you can pass a storage_options parameter to the read_parquet method and it will ‘just work’:

import geopandas

df = geopandas.read_parquet(
    asset.href, storage_options=asset.extra_fields["table:storage_options"]
)

However, to use the file in DuckDB we need a full URL. The SQL code we’re going to use in DuckDB for a quick test looks like this:

SELECT * FROM <URL> LIMIT 1

This just gets the first row of whatever URL is given.

Unfortunately, though, the URL provided by the STAC catalog looks like this:

abfs://items/io-lulc-9-class.parquet

If we try that with DuckDB, we find it doesn’t know how to deal with it. So, we need to convert this to a standard HTTP URL that can be downloaded as if it were just typed into a web browser (or used with cURL or wget). After a bit of playing around, I found I could create a full URL with this Python code:

storage_options = asset.extra_fields["table:storage_options"]
url = (
    f"https://{storage_options['account_name']}.dfs.core.windows.net/"
    f"{asset.href[7:]}?{storage_options['credential']}"
)

That’s taking the account_name and inserting it into the host part of the URL, then extracting the path from the original URL (stripping the abfs:// prefix) and appending it, and finally adding the SAS token (stored as credential in the storage options) as a URL query parameter.

That results in a URL that looks like this (split over multiple lines for readability):

https://pcstacitems.dfs.core.windows.net/items/io-lulc-9-class.parquet
?st=2024-06-14T19%3A04%3A20Z&se=2024-06-15T19%3A49%3A20Z&sp=rl&sv=2024-05-04&
sr=c&skoid=9c8ff44a-6a2c-4dfb-b298-1c9212f64d9a
&sktid=72f988bf-86f1-41af-91ab-2d7cd011db47&skt=2024-06-15T09%3A00%3A14Z
&ske=2024-06-22T09%3A00%3A14Z&sks=b
&skv=2024-05-04
&sig=A0e%2BihbAagHmIZ%2Bg8gzH71TavRYQMZiHWJ/Uk9j0Who%3D

(note: that specific URL won’t work for you, as the SAS tokens are time-limited).

If we insert that into our DuckDB SQL statement then we find it works (click to enlarge):
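We’ll come to the Python API properly later, but as a quick aside: you can run the same one-row test directly from Python too. Here’s a minimal sketch, assuming the url variable from the snippet above is still in scope:

import duckdb

# 'url' is the signed HTTPS URL we constructed above; DuckDB reads
# Parquet over HTTP via its httpfs extension
duckdb.sql("INSTALL httpfs")
duckdb.sql("LOAD httpfs")
duckdb.sql(f"SELECT * FROM '{url}' LIMIT 1").show()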

Great! Now let’s try with a different collection on Planetary Computer – in this case the Sentinel 2 collection. We can make the URL in the same way as before, by changing how we define asset:

asset = catalog.get_collection("sentinel-2-l2a").assets["geoparquet-items"]

and then using the same query in DuckDB with the new URL. Unfortunately, this time we get an error:

Invalid Input Error: File '<URL>' too small to be a Parquet file

The problem here is that for larger collections, the STAC metadata is spread out over a series of GeoParquet files inside a folder, and the URL we’ve been given is just to the folder (even though it ends in .parquet). As we’re just using HTTP, there is no way to do things like list the files in a folder, so DuckDB has no way to find out what files are within the folder and start reading them. We need to find another way.

Using the DuckDB CLI – with the Azure extension

Conveniently, there is an Azure extension for DuckDB, and it lets you use URLs like 'abfss://⟨my_filesystem⟩/⟨path⟩/⟨my_file⟩.⟨parquet_or_csv⟩'. That’s a slightly different URL scheme to the one we’ve been given (abfss as opposed to abfs), but we can easily sort that.

Looking at the authentication docs, though, it seems to require you to specify either a connection string, a service principal, or the ‘Azure Credential Chain’ (which, I think, works with the az command-line tool and various environment variables you may have set up). We don’t have any of those – they all have a far broader scope than what we’ve got, which is just a SAS token for a specific file or folder. It looks like the Azure extension doesn’t support this, so we’ll have to look for another way.
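For reference, the documented route looks roughly like the sketch below (run from Python here, for consistency with the rest of the post). The connection string is a placeholder – it’s exactly the account-level credential we don’t have, which is why this approach doesn’t work for us:

import duckdb

con = duckdb.connect()
con.sql("INSTALL azure")
con.sql("LOAD azure")
# The extension wants account-level credentials such as a connection string,
# not a per-folder SAS token like the one Planetary Computer gives us
con.sql("SET azure_storage_connection_string = '<your connection string here>'")
con.sql("SELECT * FROM 'abfss://items/io-lulc-9-class.parquet' LIMIT 1")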

Using the DuckDB Python library – with an Azure connector

As well as the command-line interface, DuckDB can also be used from Python. To set this up, just run pip install duckdb. While you’re at it, you might want to install pandas and the adlfs library for connecting to Azure Storage.

Using this is actually quite simple, first import the libraries:

import duckdb
from adlfs.spec import AzureBlobFileSystem

then set up the Azure connection:

a = AzureBlobFileSystem(account_name='pcstacitems',
   sas_token=asset.extra_fields["table:storage_options"]['credential'])

Note how here we’re using the sas_token parameter to provide a SAS token. We could use a different parameter here to provide a connection string or some other credential type. I couldn’t find much real-world use of this sas_token parameter when looking online – so this is probably the key ‘less well documented’ bit to take away from this article.

Continuing, we then connect to DuckDB and ‘register’ this Azure connection:

connection = duckdb.connect()
connection.register_filesystem(a)

From there, we can run a SQL query like this:

query = connection.sql(f"""
SELECT * FROM 'abfs://items/sentinel-2-l2a.parquet/*.parquet' LIMIT 1;
""")
query.to_df()

The call to to_df at the end converts the result into a Pandas DataFrame for easy viewing and manipulation. The results are shown below (click to enlarge):

Note that we’re passing a URL of abfs://items/sentinel-2-l2a.parquet/*.parquet – this is the URL from the STAC catalog with /*.parquet added to the end to ensure DuckDB picks up the large number of Parquet files stored there. I’d have thought this would work without that, but if I leave it out I get an error saying:

TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NoneType'

I suspect this is something to do with how things are passed to the Azure connection that we registered with DuckDB, but I’m not entirely sure. If you do know, then please leave a comment!

A few fun queries

So, now we’ve got this working, what can we do? Well, we can – fairly efficiently – run queries across all the STAC metadata for Sentinel 2 images. I’ll just give a few examples of queries I’ve run below.

We can find out how many images there are in the collection in total:

SELECT COUNT(*) FROM 'abfs://items/sentinel-2-l2a.parquet/*.parquet'

We can do the same for just the images acquired in 2020 – by using a fact we know about how the individual parquet files are named (from the Planetary Computer docs):

SELECT COUNT(*) FROM 'abfs://items/sentinel-2-l2a.parquet/*2020*.parquet'

And we can start looking at how many scenes were captured with different amounts of cloud cover:

SELECT
    COUNT(*) / (SELECT COUNT(*) FROM 'abfs://items/sentinel-2-l2a.parquet/*2020*.parquet') AS proportion,
    CASE
        WHEN "eo:cloud_cover" BETWEEN 0 AND 5 THEN '0-5%'
        WHEN "eo:cloud_cover" BETWEEN 5 AND 10 THEN '5-10%'
        WHEN "eo:cloud_cover" BETWEEN 10 AND 15 THEN '10-15%'
    END AS category
FROM 'abfs://items/sentinel-2-l2a.parquet/*2020*.parquet'
GROUP BY category

This tells us that 20% of scenes had between 0 and 5% cloud cover (quite a high number, I thought – but then again, I am used to living in the UK!), and around 4-5% in each of the 5-10% and 10-15% categories.
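If you want to chart that breakdown rather than just eyeballing the numbers, here’s a minimal sketch using the connection we registered earlier – it assumes the SQL above is stored in a string called cloud_cover_query:

# cloud_cover_query holds the cloud-cover SQL from above
df = connection.sql(cloud_cover_query).to_df()

# Drop the NULL category (scenes with more than 15% cloud cover)
# and plot the proportion of scenes in each cloud-cover band
df.dropna(subset=["category"]).plot.bar(x="category", y="proportion")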

There are plenty of other analyses that you could do with these Parquet files, of course. At some point I might even get around to the task which initially made me look into this: that is, trying to find Landsat and Sentinel 2 scenes that were acquired over the same location at very similar times. I think I’ll leave that for another day though…


Introducing offline_folium

Another new-ish package that I’ve never got around to writing about on my blog is offline_folium. It has a somewhat niche use-case, but it seems like a few people have found it useful.

In brief, it allows you to use the folium package for creating interactive maps from Python, but without an internet connection. Folium is built on top of the Leaflet web mapping library – and it loads all the relevant JS and CSS files directly from CDNs. This means that if you try and use folium without an internet connection it just doesn’t work. Offline_folium works around this by downloading the files when you do have an internet connection, and then re-writing the links to the files to point to the offline versions.

I originally created this package when doing freelance work on a project for the UK Navy – they wanted to have interactive maps on an air-gapped computer, so I built a system that would allow this (in this case, the JS/CSS file download step was run as part of building the ‘app’ we produced). I should note at this point that without an internet connection the default OpenStreetMap background mapping will not work. In some situations that would be a significant problem – but we were using other background mapping, coastline vector files and so on, so it wasn’t a problem for us.

So, how do you use this package?

First, install it using pip install offline_folium. Then make sure you have an internet connection and run python -m offline_folium – this will download the JS/CSS files and store them in a sensible place (chosen automatically). Now you’re all set up.

When you’re without an internet connection and want to use folium, you need to import it in a slightly different way – first import offline from offline_folium, and then import folium, like this:

from offline_folium import offline
import folium

Now you can use folium as usual, for example by creating a simple map object:

m = folium.Map()
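You can then build up the map as normal and save it to a standalone HTML file to open in your browser – though remember that, as noted above, the default OpenStreetMap background tiles won’t load without a connection. The filename here is just an example:

# Save the map to a standalone HTML file that can be opened in a browser
m.save("offline_map.html")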

So, that’s pretty much it. People are actively submitting pull requests at the moment, so hopefully the functionality will expand to work with various folium plugins too.

For more information (or to submit a PR yourself), see the Github repo.


Introducing pyAURN – a Python package for accessing UK air quality data

I realised recently that I’d never actually blogged about my pyAURN package – so it’s about time that I did.

When doing some freelance work on air quality a while back, I wanted an easy way to access UK air quality data from the Automatic Urban and Rural Network (AURN). Unfortunately, there isn’t a nice API for accessing the data. Strangely, though, they do provide the data as a series of RData files for use with the openair R package. I wanted to use the data from Python, though – and conveniently, there is a Python package for reading RData files.
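For example, a package such as pyreadr can read these files directly. Here’s a minimal sketch – the filename is just an illustration, and pyAURN wraps all of this up for you anyway:

import pyreadr

# Read an RData file into a dict-like of pandas DataFrames, keyed by R object name
result = pyreadr.read_r("MY1_2021.RData")
df = list(result.values())[0]
print(df.head())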

So, I put all these together into a simple Python package called pyAURN. It is strongly based upon openair, but is a lot more limited in functionality – it basically only covers importing the data, and doesn’t have many plotting or analysis functions.

Here’s an example of how to use it:

from pyaurn import importAURN, importMeta, timeAverage

# Download metadata of site IDs, names, locations etc
metadata = importMeta()

# Download six years of data for the Marylebone Road site
# (MY1 is the site ID for this site)
# Note: range(2016, 2022) covers the six years 2016-2021.
# Alternatively, pass an explicit list of years, e.g. [2016, 2017, 2018, 2019, 2020, 2021]
data = importAURN("MY1", range(2016, 2022))

# Resample the data to monthly means
data_monthly = timeAverage(data, avg_time="month", statistic="mean")
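Since pyAURN deliberately leaves plotting up to you, here’s a rough sketch of charting those monthly means with pandas and matplotlib – note that I’m assuming a lowercase no2 column here, so check the column names in your own DataFrame:

import matplotlib.pyplot as plt

# Plot the monthly-mean NO2 concentration
# (the column name 'no2' is an assumption - check data_monthly.columns)
data_monthly.plot(y="no2")
plt.show()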

I found this really useful for my air quality analysis work – and I hope you do too. The package is on PyPI, so you can run pip install pyaurn or view the project on Github.


How to install the Python triangle package on an Apple Silicon Mac

I was recently trying to set up RasterVision on my Apple Silicon Mac (specifically an M1 MacBook Pro, but I’m pretty sure this applies to any Apple Silicon Mac). It all went fine until it came time to install the triangle package, when I got an error. The error output is fairly long, but the key part is at the end, shown here:

triangle/core.c:196:12: fatal error: 'longintrepr.h' file not found
        #include "longintrepr.h"
                 ^~~~~~~~~~~~~~~
      1 error generated.
      error: command '/usr/bin/clang' failed with exit code 1
      [end of output]

It took me quite a bit of searching to find the answer (Google just isn’t very good at giving relevant results these days), but actually it turns out to be very simple. The latest version of triangle on PyPI doesn’t work on Apple Silicon, but the code in the Github repository does work, so you can install directly from Github with this command:

pip install git+https://github.com/drufat/triangle.git

and it should all work fine.

Once you’ve done this, install rastervision again and it should recognise that the triangle package is already installed and not try to install it again.


How to subscribe to releases on Github

For a while I’d wished there was an easy way to get notified when my favourite open-source packages release a new version. I’d often see something on social media, but that tended to only be for the larger packages – and I wanted to keep up with the smaller ones too.

When I actually bothered to put five minutes of work into finding out if this was possible, I found it was pretty easy. So, here’s how to do it:

  1. Go to the Github repository page for the package you’re interested in.

  2. Click the down arrow next to the ‘Watch’ badge at the top right of the header:

  3. Click ‘Custom’ and then tick ‘Releases’ and click ‘Apply’:

And that’s it! Now you’ll get a notification email when a new release is created on Github. Of course, this relies on the specific project using Github Releases, which not all open-source projects do, but a fairly high proportion of them seem to. After doing this you’ll get an email when there is a new release, with the details of the release and links to see it on Github. An example email for the stac-fastapi repository is below:

Update: Peter in the comments informed me that you can also get an Atom feed for use in your RSS reader by going to https://github.com/{owner}/{repo}/releases.atom – for example, this link for my Py6S package.


Some matplotlib tips – a reblog

I was looking through my past blog posts recently, and thought a few of them were worth ‘reblogging’ so that more people could see them (now that my blog posts are getting more readers). So, here are a few posts on matplotlib tips.

Matplotlib titles have configurable locations – and you can have more than one at once!


This post explains how to create matplotlib titles in various locations.

Easily hiding items from the legend in matplotlib


This post explains how to easily hide items from the legend in matplotlib.

Easily specifying colours from the default colour cycle in matplotlib


This post shows how to specify colours from the default colour cycle – ranging from a very simple way to more complex methods that might work in other situations.


New Projects page on my website

Just a quick post here to say that I’ve added a new Projects page to my freelance website. I realised I didn’t have anywhere online that I could point people to that had links to all of the ‘non-work’ (maybe that should be ‘non-paid’) projects I’ve made.

These projects include my Free GIS Data site, the British Placename Mapper, Py6S and more. I’ve also put together a separate page (linked from the projects page) with all my university theses (PhD, MSc and undergraduate) and other university work – which still get a remarkably high number of downloads.

Have a look here, or see a screenshot of the first few entries below:


Simple segmentation of geospatial images

I had a need to do some segmentation of some satellite imagery the other day, for a client. Years ago I was quite experienced at doing segmentation and classification using eCognition but that was using the university’s license, and I don’t have a license myself (and they’re very expensive). So, I wanted a free solution.

First, however, let’s talk a little bit about what segmentation is (thanks to Sonya in the comments for pointing out that I didn’t cover this!). Segmentation is a way of splitting an image up into groups of adjacent pixels (‘segments’ or ‘objects’) that ‘look right’ and, ideally, represent objects in the image. For example, an image of cells from a microscope might be segmented into individual cells, or individual organelles inside the cell (depending on the scale), a satellite image might be segmented into fields, clumps of urban area or individual buildings/swimming pools/driveways – again, depending on the scale. Segmentation uses properties of the image (like colour, texture, sharp lines etc) to try and identify these segments.

Once you’ve segmented an image, you can then do things with the segments – for example, you can classify each segment into a different category (building, road, garden, lake). This is often easier than classifying individual pixels, as you have group statistics about the segment (not just ‘value in the red band’, but a whole histogram of values in the red band, plus mean, median, max etc, plus texture measures and more). Sometimes you may want to use the segment outlines themselves as part of your output (eg. as building outlines), other times they are just used as a way of doing something else (like classification). This whole approach to image processing is often known as Object-based Image Analysis.
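To make that a bit more concrete, here’s a minimal sketch of computing per-segment statistics with numpy and scipy – the arrays here are random placeholders, but in practice they’d be an image band and the matching segment-label raster:

import numpy as np
from scipy import ndimage

# Placeholder data: in practice these would come from your imagery
# and your segmentation output
red_band = np.random.rand(100, 100)
segments = np.random.randint(1, 11, size=(100, 100))

segment_ids = np.unique(segments)

# Per-segment statistics of the red band - the kind of features you'd
# feed into a per-segment classifier
mean_red = ndimage.mean(red_band, labels=segments, index=segment_ids)
std_red = ndimage.standard_deviation(red_band, labels=segments, index=segment_ids)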

There are various segmentation tools in the scikit-image library, but I’ve often struggled using them on satellite or aerial imagery – the algorithms seem better suited to images with a clear foreground and background.
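For illustration, calling one of the scikit-image segmenters is easy enough – here’s a minimal sketch using the SLIC superpixel algorithm on a placeholder RGB array – it’s just that the roughly uniform segments it produces often don’t line up with the objects you care about in satellite scenes:

import numpy as np
from skimage.segmentation import slic

# Placeholder RGB image - in practice this would be your satellite/aerial data
image = np.random.rand(200, 200, 3)

# SLIC produces roughly equal-sized 'superpixels'
segments = slic(image, n_segments=250, compactness=10, channel_axis=-1)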

Luckily, I remembered RSGISLib – a very comprehensive library of remote sensing and GIS functions. I last used it many years ago, when most of the documentation was for using it from C++, and installation was a pain. I’m very pleased to say that installation is nice and easy now, and all of the examples are in Python.

So, doing segmentation – using an algorithm specifically designed for segmenting satellite/aerial images – is actually really easy now. Here’s how:

First, install RSGISLib. By far the easiest way is to use conda, but there is further documentation on other installation methods, and there are Docker containers available.

conda install -c conda-forge rsgislib

Then it’s a simple matter of calling the relevant function from Python. The documentation shows the segmentation functions available, and the one you’re most likely to want to use is the Shepherd segmentation algorithm, which is described in this paper. So, to call it, run something like this:

from rsgislib.segmentation.shepherdseg import run_shepherd_segmentation

run_shepherd_segmentation(input_image, output_seg_image,
                          gdalformat='GTiff',
                          calc_stats=False,
                          num_clusters=20,
                          min_n_pxls=300)

The parameters are fairly self-explanatory – it takes the input_image filename (any GDAL-supported format will work) and produces an output at the output_seg_image filename, in the gdalformat given. The calc_stats parameter is important if you’re using a format like GeoTIFF, or any other format that doesn’t support a Raster Attribute Table (RATs are mostly supported by somewhat more unusual formats like KEA). You’ll need to set it to False if your format doesn’t support RATs – I found that if I forgot to set it to False then the script crashed when trying to write the stats.
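For example, if you write the output to the KEA format instead – which does support Raster Attribute Tables – you can leave the statistics calculation turned on (the output filename here is just an example):

run_shepherd_segmentation(input_image, 'output_seg.kea',
                          gdalformat='KEA',
                          calc_stats=True,
                          num_clusters=20,
                          min_n_pxls=300)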

The final two parameters control how the segmentation algorithm itself works. I’ll leave you to read the paper to find out the details, but the names are fairly self-explanatory.

The output of the algorithm will look something like this:

It’s a raster where all the pixels in the first segment have the value 1, all the pixels in the second segment have the value 2, and so on. The image above uses a greyscale ‘black to white’ colormap, so as the segment values increase towards the bottom of the image, they show as more white.

You can convert this raster output to a set of vector polygons, one for each segment, by using any standard raster to vector ‘polygonize’ algorithm. The easiest is probably using GDAL, by running a command like:

gdal_polygonize.py SegRaster.tif SegVector.gpkg

This will give you a result that looks like the red lines on this image:
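If you’d rather stay in Python than shell out to GDAL, a rough equivalent using rasterio and geopandas (a sketch, assuming both are installed) looks like this:

import geopandas as gpd
import rasterio
from rasterio import features
from shapely.geometry import shape

with rasterio.open("SegRaster.tif") as src:
    # shapes() needs a supported integer dtype, so cast the segment IDs
    segments = src.read(1).astype("int32")
    # Generate (geometry, segment_id) pairs for each contiguous region
    records = [
        {"geometry": shape(geom), "segment_id": value}
        for geom, value in features.shapes(segments, transform=src.transform)
    ]
    gdf = gpd.GeoDataFrame(records, geometry="geometry", crs=src.crs)

gdf.to_file("SegVector.gpkg")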

So, there’s a simple way of doing satellite image segmentation in Python. I hope it was useful.


What’s the largest building in Southampton? Find out with 5 lines of code

Recently I became suddenly curious about the sizes of buildings in Southampton, UK, where I live. I don’t know what triggered this sudden curiosity, but I wanted to know what the largest buildings in Southampton are. In this context, I’m using “largest” to mean largest in terms of land area covered – ie. the area of the outline when viewed in plan view. Luckily I know about useful sources of geographic data, and also know how to use GeoPandas, so I could answer this question pretty quickly – in fact, in only five lines of code. I’ll take you through this code below, as an example of a very simple GIS analysis.

First I needed to get hold of the data. I know Ordnance Survey release data on buildings in Great Britain, but to make this even easier we can go to Alastair Rae’s website where he has split the OS data up into Local Authority areas. We need to download the buildings data for Southampton, so we go here and download a GeoPackage file.

Then we need to create a Python environment to do the analysis in. You can do this in various ways – with virtualenvs, conda environments or whatever – but you just need to ensure that Jupyter and GeoPandas are installed. Then create a new notebook and you’re ready to start coding.

First, we import geopandas:

import geopandas as gpd

and then load the buildings data:

buildings = gpd.read_file("Southampton_buildings.gpkg")

The buildings GeoDataFrame has a load of polygon geometries, one for each building. We can calculate the area of a polygon with the .area property – so to create a new ‘area’ column in the GeoDataFrame we can run:

buildings['area'] = buildings.geometry.area

I’m only interested in the largest buildings, so we can now sort by this new area column, and take the first twenty entries:

top20 = buildings.sort_values('area', ascending=False).head(20)

We can then use the lovely explore function to show these buildings on a map. This will load an interactive map in the Jupyter notebook:

top20.explore()

If you’d like to save the interactive map to a standalone HTML file then you can do this instead:

top20.explore().save("map.html")

I’ve done that, and uploaded that HTML file to my website – and you can view it here.

So, putting all the code together, we have:

import geopandas as gpd
buildings = gpd.read_file("Southampton_buildings.gpkg")
buildings['area'] = buildings.geometry.area
top20 = buildings.sort_values('area', ascending=False).head(20)
top20.explore()

Five lines of code, with a simple analysis, resulting in an interactive map, and all with the power of GeoPandas.

Hopefully in a future post I’ll do a bit more work on this data – I’d like to make a prettier map, and I’d like to try and find some way to test my friends and see if they can work out what buildings they are.