Robin's Blog

Accessing Planetary Computer STAC files in DuckDB

Microsoft Planetary Computer is a wonderful archive of geospatial datasets (primarily raster images of various types), provided with a STAC catalog to enable them to be easily searched through an API. That’s fine for normal usage where you want to find a selection of images and access the images themselves, but less useful when you want to do some analysis of the metadata itself.

For that use case, Planetary Computer provide the STAC metadata in bulk, stored as GeoParquet files. Their documentation page explains how to use this with geopandas and do some large-ish scale processing with Dask. I wanted to try this with DuckDB – a newer tool that is excellent at accessing and processing Parquet files, including efficiently accessing files available via HTTP, and only downloading the relevant parts of the file. So, this post explains how I managed to do this – showing various different approaches I tried and how (or if!) each of them worked.

Getting URLs

Planetary Computer URLs are basically just URLs to files on Azure Blob Storage. However, the URLs require signing with a Shared Access Signature (SAS) key. Luckily, there is a Planetary Computer Python module that will use an API to generate the correct keys for us.

In the docs for using the STAC GeoParquet files, they give this code:

import pystac_client
import planetary_computer

catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1/",
    modifier=planetary_computer.sign_inplace,
)

asset = catalog.get_collection("io-lulc-9-class").assets["geoparquet-items"]

This gets a collection, and the geoparquet-items collection-level asset, having used a ‘modifier’ to sign the URLs as they are acquired from the STAC API. The URL is then stored in asset.href.

Using the DuckDB CLI – with HTTP URLs

When using GeoPandas, the docs show that you can pass a storage_options parameter to the read_parquet method and it will ‘just work’:

import geopandas

df = geopandas.read_parquet(
    asset.href, storage_options=asset.extra_fields["table:storage_options"]
)

However, to use the file in DuckDB we need a full URL. The SQL code we’re going to use in DuckDB for a quick test looks like this:

SELECT * FROM <URL> LIMIT 1

This just gets the first row of whatever URL is given.

Unfortunately, though, the URL provided by the STAC catalog looks like this:

abfs://items/io-lulc-9-class.parquet

If we try that with DuckDB, we find it doesn’t know how to deal with it. So, we need to convert this to a standard HTTP URL that can be downloaded as if it were just typed into a web browser (or used with cURL or wget). After a bit of playing around, I found I could create a full URL with this Python code:

storage_options = asset.extra_fields["table:storage_options"]
url = (
    f"https://{storage_options['account_name']}.dfs.core.windows.net/"
    f"{asset.href[7:]}?{storage_options['credential']}"
)

That’s taking the account_name and inserting it into the host part of the URL, then extracting the path from the original URL (stripping the abfs:// prefix) and appending it, and finally adding the SAS token (stored as credential in the storage options) as a URL query parameter.

That results in a URL that looks like this (split over multiple lines for readability):

https://pcstacitems.dfs.core.windows.net/items/io-lulc-9-class.parquet
?st=2024-06-14T19%3A04%3A20Z&se=2024-06-15T19%3A49%3A20Z&sp=rl&sv=2024-05-04&
sr=c&skoid=9c8ff44a-6a2c-4dfb-b298-1c9212f64d9a
&sktid=72f988bf-86f1-41af-91ab-2d7cd011db47&skt=2024-06-15T09%3A00%3A14Z
&ske=2024-06-22T09%3A00%3A14Z&sks=b
&skv=2024-05-04
&sig=A0e%2BihbAagHmIZ%2Bg8gzH71TavRYQMZiHWJ/Uk9j0Who%3D

(note: that specific URL won’t work for you, as the SAS tokens are time-limited).

If we insert that into our DuckDB SQL statement then we find it works (click to enlarge):
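We’ll come to the Python API properly later, but as a quick aside: you can run the same one-row test directly from Python too. Here’s a minimal sketch, assuming the url variable from the snippet above is still in scope:

import duckdb

# 'url' is the signed HTTPS URL we constructed above; DuckDB reads
# Parquet over HTTP via its httpfs extension
duckdb.sql("INSTALL httpfs")
duckdb.sql("LOAD httpfs")
duckdb.sql(f"SELECT * FROM '{url}' LIMIT 1").show()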

Great! Now let’s try with a different collection on Planetary Computer – in this case the Sentinel 2 collection. We can make the URL in the same way as before, by changing how we define asset:

asset = catalog.get_collection("sentinel-2-l2a").assets["geoparquet-items"]

and then using the same query in DuckDB with the new URL. Unfortunately, this time we get an error:

Invalid Input Error: File '<URL>' too small to be a Parquet file

The problem here is that for larger collections, the STAC metadata is spread out over a series of GeoParquet files inside a folder, and the URL we’ve been given is just to the folder (even though it ends in .parquet). As we’re just using HTTP, there is no way to do things like list the files in a folder, so DuckDB has no way to find out what files are within the folder and start reading them. We need to find another way.

Using the DuckDB CLI – with the Azure extension

Conveniently, there is an Azure extension for DuckDB, and it lets you use URLs like 'abfss://⟨my_filesystem⟩/⟨path⟩/⟨my_file⟩.⟨parquet_or_csv⟩'. That’s a slightly different URL scheme to the one we’ve been given (abfss as opposed to abfs), but we can easily sort that.

Looking at the authentication docs, though, it seems to require you to specify either a connection string, a service principal, or the ‘Azure Credential Chain’ (which, I think, works with the az command-line tool and various environment variables you may have set up). We don’t have any of those – they all have a far broader scope than what we’ve got, which is just a SAS token for a specific file or folder. It looks like the Azure extension doesn’t support this, so we’ll have to look for another way.
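For reference, the documented route looks roughly like the sketch below (run from Python here, for consistency with the rest of the post). The connection string is a placeholder – it’s exactly the account-level credential we don’t have, which is why this approach doesn’t work for us:

import duckdb

con = duckdb.connect()
con.sql("INSTALL azure")
con.sql("LOAD azure")
# The extension wants account-level credentials such as a connection string,
# not a per-folder SAS token like the one Planetary Computer gives us
con.sql("SET azure_storage_connection_string = '<your connection string here>'")
con.sql("SELECT * FROM 'abfss://items/io-lulc-9-class.parquet' LIMIT 1")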

Using the DuckDB Python library – with an Azure connector

As well as the command-line interface, DuckDB can also be used from Python. To set this up, just run pip install duckdb. While you’re at it, you might want to install pandas and the adlfs library for connecting to Azure Storage.

Using this is actually quite simple, first import the libraries:

import duckdb
from adlfs.spec import AzureBlobFileSystem

then set up the Azure connection:

a = AzureBlobFileSystem(account_name='pcstacitems',
   sas_token=asset.extra_fields["table:storage_options"]['credential'])

Note how here we’re using the sas_token parameter to provide a SAS token. We could use a different parameter here to provide a connection string or some other credential type. I couldn’t find much real-world use of this sas_token parameter when looking online – so this is probably the key ‘less well documented’ bit to take away from this article.

Continuing, we then connect to DuckDB and ‘register’ this Azure connection:

connection = duckdb.connect()
connection.register_filesystem(a)

From there, we can run a SQL query like this:

query = connection.sql(f"""
SELECT * FROM 'abfs://items/sentinel-2-l2a.parquet/*.parquet' LIMIT 1;
""")
query.to_df()

The call to to_df at the end converts the result into a Pandas DataFrame for easy viewing and manipulation. The results are shown below (click to enlarge):

Note that we’re passing a URL of abfs://items/sentinel-2-l2a.parquet/*.parquet – this is the URL from the STAC catalog with /*.parquet added to the end to ensure DuckDB picks up the large number of Parquet files stored there. I’d have thought this would work without that, but if I leave it out I get an error saying:

TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NoneType'

I suspect this is something to do with how things are passed to the Azure connection that we registered with DuckDB, but I’m not entirely sure. If you do know, then please leave a comment!

A few fun queries

So, now we’ve got this working, what can we do? Well, we can – fairly efficiently – run queries across all the STAC metadata for Sentinel 2 images. I’ll just give a few examples of queries I’ve run below.

We can find out how many images there are in the collection in total:

SELECT COUNT(*) FROM 'abfs://items/sentinel-2-l2a.parquet/*.parquet'

We can do the same for just the images acquired in 2020 – by using a fact we know about how the individual parquet files are named (from the Planetary Computer docs):

SELECT COUNT(*) FROM 'abfs://items/sentinel-2-l2a.parquet/*2020*.parquet'

And we can start looking at how many scenes were captured with different amounts of cloud cover:

SELECT
    COUNT(*) / (SELECT COUNT(*) FROM 'abfs://items/sentinel-2-l2a.parquet/*2020*.parquet') AS proportion,
    CASE
        WHEN "eo:cloud_cover" BETWEEN 0 AND 5 THEN '0-5%'
        WHEN "eo:cloud_cover" BETWEEN 5 AND 10 THEN '5-10%'
        WHEN "eo:cloud_cover" BETWEEN 10 AND 15 THEN '10-15%'
    END AS category
FROM 'abfs://items/sentinel-2-l2a.parquet/*2020*.parquet'
GROUP BY category

This tells us that 20% of scenes had between 0 and 5% cloud cover (quite a high number, I thought – but then again, I am used to living in the UK!), and around 4-5% in each of the 5-10% and 10-15% categories.
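If you want to chart that breakdown rather than just eyeballing the numbers, here’s a minimal sketch using the connection we registered earlier – it assumes the SQL above is stored in a string called cloud_cover_query:

# cloud_cover_query holds the cloud-cover SQL from above
df = connection.sql(cloud_cover_query).to_df()

# Drop the NULL category (scenes with more than 15% cloud cover)
# and plot the proportion of scenes in each cloud-cover band
df.dropna(subset=["category"]).plot.bar(x="category", y="proportion")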

There are plenty of other analyses that you could do with these Parquet files, of course. At some point I might even get around to the task which initially made me look into this: that is, trying to find Landsat and Sentinel 2 scenes that were acquired over the same location at very similar times. I think I’ll leave that for another day though…


Introducing offline_folium

Another new-ish package that I’ve never got around to writing about on my blog is offline_folium. It has a somewhat niche use-case, but it seems like a few people have found it useful.

In brief, it allows you to use the folium package for creating interactive maps from Python, but without an internet connection. Folium is built on top of the Leaflet web mapping library – and it loads all the relevant JS and CSS files directly from CDNs. This means that if you try and use folium without an internet connection it just doesn’t work. Offline_folium works around this by downloading the files when you do have an internet connection, and then re-writing the links to the files to point to the offline versions.

I originally created this package when doing freelance work on a project for the UK Navy – they wanted to have interactive maps on an air-gapped computer, so I built a system that would allow this (in this case, the JS/CSS file download step was run as part of building the ‘app’ we produced). I should note at this point that without an internet connection the default OpenStreetMap background mapping will not work. In some situations that would be a significant problem – but we were using other background mapping, coastline vector files and so on, so it wasn’t a problem for us.

So, how do you use this package?

First, install it using pip install offline_folium. Then make sure you have an internet connection and run python -m offline_folium – this will download the JS/CSS files and store them in a sensible place (chosen automatically). Now you’re all set up.

When you’re without an internet connection and want to use folium, you need to import it in a slightly different way – first import offline from offline_folium, and then import folium, like this:

from offline_folium import offline
import folium

Now you can use folium as usual, for example by creating a simple map object:

m = folium.Map()
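You can then build up the map as normal and save it to a standalone HTML file to open in your browser – though remember that, as noted above, the default OpenStreetMap background tiles won’t load without a connection. The filename here is just an example:

# Save the map to a standalone HTML file that can be opened in a browser
m.save("offline_map.html")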

So, that’s pretty much it. People are actively submitting pull requests at the moment, so hopefully the functionality will expand to work with various folium plugins too.

For more information (or to submit a PR yourself), see the Github repo.


Introducing pyAURN – a Python package for accessing UK air quality data

I realised recently that I’d never actually blogged about my pyAURN package – so it’s about time that I did.

When doing some freelance work on air quality a while back, I wanted an easy way to access UK air quality data from the Automatic Urban and Rural Network (AURN). Unfortunately, there isn’t a nice API for accessing the data. Strangely, though, they do provide the data as a series of RData files for use with the openair R package. I wanted to use the data from Python, though – and conveniently, there is a Python package for reading RData files.
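For example, a package such as pyreadr can read these files directly. Here’s a minimal sketch – the filename is just an illustration, and pyAURN wraps all of this up for you anyway:

import pyreadr

# Read an RData file into a dict-like of pandas DataFrames, keyed by R object name
result = pyreadr.read_r("MY1_2021.RData")
df = list(result.values())[0]
print(df.head())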

So, I put all these together into a simple Python package called pyAURN. It is strongly based upon openair, but is a lot more limited in functionality – it basically only covers importing the data, and doesn’t have many plotting or analysis functions.

Here’s an example of how to use it:

from pyaurn import importAURN, importMeta, timeAverage

# Download metadata of site IDs, names, locations etc
metadata = importMeta()

# Download six years of data for the Marylebone Road site
# (MY1 is the site ID for this site)
# Note: range(2016, 2022) covers the six years 2016-2021.
# Alternatively, pass an explicit list of years, e.g. [2016, 2017, 2018, 2019, 2020, 2021]
data = importAURN("MY1", range(2016, 2022))

# Resample the data to monthly means
data_monthly = timeAverage(data, avg_time="month", statistic="mean")
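Since pyAURN deliberately leaves plotting up to you, here’s a rough sketch of charting those monthly means with pandas and matplotlib – note that I’m assuming a lowercase no2 column here, so check the column names in your own DataFrame:

import matplotlib.pyplot as plt

# Plot the monthly-mean NO2 concentration
# (the column name 'no2' is an assumption - check data_monthly.columns)
data_monthly.plot(y="no2")
plt.show()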

I found this really useful for my air quality analysis work – and I hope you do too. The package is on PyPI, so you can run pip install pyaurn or view the project on Github.


How to install the Python triangle package on an Apple Silicon Mac

I was recently trying to set up RasterVision on my Apple Silicon Mac (specifically an M1 MacBook Pro, but I’m pretty sure this applies to any Apple Silicon Mac). It all went fine until it came time to install the triangle package, when I got an error. The error output is fairly long, but the key part is at the end, shown here:

triangle/core.c:196:12: fatal error: 'longintrepr.h' file not found
        #include "longintrepr.h"
                 ^~~~~~~~~~~~~~~
      1 error generated.
      error: command '/usr/bin/clang' failed with exit code 1
      [end of output]

It took me quite a bit of searching to find the answer (Google just isn’t very good at giving relevant results these days), but actually it turns out to be very simple. The latest version of triangle on PyPI doesn’t work on Apple Silicon, but the code in the Github repository does work, so you can install directly from Github with this command:

pip install git+https://github.com/drufat/triangle.git

and it should all work fine.

Once you’ve done this, install rastervision again and it should recognise that the triangle package is already installed and not try to install it again.


How to subscribe to releases on Github

For a while I’d wished there was an easy way to get notified when my favourite open-source packages release a new version. I’d often see something on social media, but that tended to only be for the larger packages – and I wanted to keep up with the smaller ones too.

When I actually bothered to put five minutes of work into finding out if this was possible, I found it was pretty easy. So, here’s how to do it:

  1. Go to the Github repository page for the package you’re interested in.

  2. Click the down arrow next to the ‘Watch’ badge at the top right of the header:

  3. Click ‘Custom’ and then tick ‘Releases’ and click ‘Apply’:

And that’s it! Now you’ll get a notification email when a new release is created on Github. Of course, this relies on the specific project using Github Releases, which not all open-source projects do, but a fairly high proportion of them seem to. After doing this you’ll get an email when there is a new release, with the details of the release and links to see it on Github. An example email for the stac-fastapi repository is below:

Update: Peter in the comments informed me that you can also get an Atom feed for use in your RSS reader by going to https://github.com/{owner}/{repo}/releases.atom – for example, this link for my Py6S package.


Some matplotlib tips – a reblog

I was looking through my past blog posts recently, and thought a few of them were worth ‘reblogging’ so that more people could see them (now that my blog posts are getting more readers). So, here are a few posts on matplotlib tips.

Matplotlib titles have configurable locations – and you can have more than one at once!


This post explains how to create matplotlib titles in various locations.

Easily hiding items from the legend in matplotlib


This post explains how to easily hide items from the legend in matplotlib.

Easily specifying colours from the default colour cycle in matplotlib


This post shows how to specify colours from the default colour cycle – ranging from a very simple way to more complex methods that might work in other situations.


New Projects page on my website

Just a quick post here to say that I’ve added a new Projects page to my freelance website. I realised I didn’t have anywhere online that I could point people to that had links to all of the ‘non-work’ (maybe that should be ‘non-paid’) projects I’ve made.

These projects include my Free GIS Data site, the British Placename Mapper, Py6S and more. I’ve also put together a separate page (linked from the projects page) with all my university theses (PhD, MSc and undergraduate) and other university work – which still get a remarkably high number of downloads.

Have a look here, or see a screenshot of the first few entries below:


Simple segmentation of geospatial images

I had a need to do some segmentation of some satellite imagery the other day, for a client. Years ago I was quite experienced at doing segmentation and classification using eCognition but that was using the university’s license, and I don’t have a license myself (and they’re very expensive). So, I wanted a free solution.

First, however, let’s talk a little bit about what segmentation is (thanks to Sonya in the comments for pointing out that I didn’t cover this!). Segmentation is a way of splitting an image up into groups of adjacent pixels (‘segments’ or ‘objects’) that ‘look right’ and, ideally, represent objects in the image. For example, an image of cells from a microscope might be segmented into individual cells, or individual organelles inside the cell (depending on the scale), a satellite image might be segmented into fields, clumps of urban area or individual buildings/swimming pools/driveways – again, depending on the scale. Segmentation uses properties of the image (like colour, texture, sharp lines etc) to try and identify these segments.

Once you’ve segmented an image, you can then do things with the segments – for example, you can classify each segment into a different category (building, road, garden, lake). This is often easier than classifying individual pixels, as you have group statistics about the segment (not just ‘value in the red band’, but a whole histogram of values in the red band, plus mean, median, max etc, plus texture measures and more). Sometimes you may want to use the segment outlines themselves as part of your output (eg. as building outlines), other times they are just used as a way of doing something else (like classification). This whole approach to image processing is often known as Object-based Image Analysis.
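To make that a bit more concrete, here’s a minimal sketch of computing per-segment statistics with numpy and scipy – the arrays here are random placeholders, but in practice they’d be an image band and the matching segment-label raster:

import numpy as np
from scipy import ndimage

# Placeholder data: in practice these would come from your imagery
# and your segmentation output
red_band = np.random.rand(100, 100)
segments = np.random.randint(1, 11, size=(100, 100))

segment_ids = np.unique(segments)

# Per-segment statistics of the red band - the kind of features you'd
# feed into a per-segment classifier
mean_red = ndimage.mean(red_band, labels=segments, index=segment_ids)
std_red = ndimage.standard_deviation(red_band, labels=segments, index=segment_ids)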

There are various segmentation tools in the scikit-image library, but I’ve often struggled using them on satellite or aerial imagery – the algorithms seem better suited to images with a clear foreground and background.
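For illustration, calling one of the scikit-image segmenters is easy enough – here’s a minimal sketch using the SLIC superpixel algorithm on a placeholder RGB array – it’s just that the roughly uniform segments it produces often don’t line up with the objects you care about in satellite scenes:

import numpy as np
from skimage.segmentation import slic

# Placeholder RGB image - in practice this would be your satellite/aerial data
image = np.random.rand(200, 200, 3)

# SLIC produces roughly equal-sized 'superpixels'
segments = slic(image, n_segments=250, compactness=10, channel_axis=-1)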

Luckily, I remembered RSGISLib – a very comprehensive library of remote sensing and GIS functions. I last used it many years ago, when most of the documentation was for using it from C++, and installation was a pain. I’m very pleased to say that installation is nice and easy now, and all of the examples are in Python.

So, doing segmentation – using an algorithm specifically designed for segmenting satellite/aerial images – is actually really easy now. Here’s how:

First, install RSGISLib. By far the easiest way is to use conda, but there is further documentation on other installation methods, and there are Docker containers available.

conda install -c conda-forge rsgislib

Then it’s a simple matter of calling the relevant function from Python. The documentation shows the segmentation functions available, and the one you’re most likely to want to use is the Shepherd segmentation algorithm, which is described in this paper. So, to call it, run something like this:

from rsgislib.segmentation.shepherdseg import run_shepherd_segmentation

run_shepherd_segmentation(input_image, output_seg_image,
                          gdalformat='GTiff',
                          calc_stats=False,
                          num_clusters=20,
                          min_n_pxls=300)

The parameters are fairly self-explanatory – it takes the input_image filename (any GDAL-supported format will work) and produces an output at the output_seg_image filename, in the gdalformat given. The calc_stats parameter is important if you’re using a format like GeoTIFF, or any other format that doesn’t support a Raster Attribute Table (RATs are mostly supported by somewhat more unusual formats like KEA). You’ll need to set it to False if your format doesn’t support RATs – I found that if I forgot to set it to False then the script crashed when trying to write the stats.
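For example, if you write the output to the KEA format instead – which does support Raster Attribute Tables – you can leave the statistics calculation turned on (the output filename here is just an example):

run_shepherd_segmentation(input_image, 'output_seg.kea',
                          gdalformat='KEA',
                          calc_stats=True,
                          num_clusters=20,
                          min_n_pxls=300)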

The final two parameters control how the segmentation algorithm itself works. I’ll leave you to read the paper to find out the details, but the names are fairly self-explanatory.

The output of the algorithm will look something like this:

It’s a raster where all the pixels in the first segment have the value 1, all the pixels in the second segment have the value 2, and so on. The image above uses a greyscale ‘black to white’ colormap, so as the segment values increase towards the bottom of the image, they show as more white.

You can convert this raster output to a set of vector polygons, one for each segment, by using any standard raster to vector ‘polygonize’ algorithm. The easiest is probably using GDAL, by running a command like:

gdal_polygonize.py SegRaster.tif SegVector.gpkg

This will give you a result that looks like the red lines on this image:
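If you’d rather stay in Python than shell out to GDAL, a rough equivalent using rasterio and geopandas (a sketch, assuming both are installed) looks like this:

import geopandas as gpd
import rasterio
from rasterio import features
from shapely.geometry import shape

with rasterio.open("SegRaster.tif") as src:
    # shapes() needs a supported integer dtype, so cast the segment IDs
    segments = src.read(1).astype("int32")
    # Generate (geometry, segment_id) pairs for each contiguous region
    records = [
        {"geometry": shape(geom), "segment_id": value}
        for geom, value in features.shapes(segments, transform=src.transform)
    ]
    gdf = gpd.GeoDataFrame(records, geometry="geometry", crs=src.crs)

gdf.to_file("SegVector.gpkg")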

So, there’s a simple way of doing satellite image segmentation in Python. I hope it was useful.


What’s the largest building in Southampton? Find out with 5 lines of code

Recently I became suddenly curious about the sizes of buildings in Southampton, UK, where I live. I don’t know what triggered this sudden curiosity, but I wanted to know what the largest buildings in Southampton are. In this context, I’m using “largest” to mean largest in terms of land area covered – ie. the area of the outline when viewed in plan view. Luckily I know about useful sources of geographic data, and also know how to use GeoPandas, so I could answer this question pretty quickly – in fact, in only five lines of code. I’ll take you through this code below, as an example of a very simple GIS analysis.

First I needed to get hold of the data. I know Ordnance Survey release data on buildings in Great Britain, but to make this even easier we can go to Alastair Rae’s website where he has split the OS data up into Local Authority areas. We need to download the buildings data for Southampton, so we go here and download a GeoPackage file.

Then we need to create a Python environment to do the analysis in. You can do this in various ways – with virtualenvs, conda environments or whatever – but you just need to ensure that Jupyter and GeoPandas are installed. Then create a new notebook and you’re ready to start coding.

First, we import geopandas:

import geopandas as gpd

and then load the buildings data:

buildings = gpd.read_file("Southampton_buildings.gpkg")

The buildings GeoDataFrame has a load of polygon geometries, one for each building. We can calculate the area of a polygon with the .area property – so to create a new ‘area’ column in the GeoDataFrame we can run:

buildings['area'] = buildings.geometry.area

I’m only interested in the largest buildings, so we can now sort by this new area column, and take the first twenty entries:

top20 = buildings.sort_values('area', ascending=False).head(20)

We can then use the lovely explore function to show these buildings on a map. This will load an interactive map in the Jupyter notebook:

top20.explore()

If you’d like to save the interactive map to a standalone HTML file then you can do this instead:

top20.explore().save("map.html")

I’ve done that, and uploaded that HTML file to my website – and you can view it here.

So, putting all the code together, we have:

import geopandas as gpd
buildings = gpd.read_file("Southampton_buildings.gpkg")
buildings['area'] = buildings.geometry.area
top20 = buildings.sort_values('area', ascending=False).head(20)
top20.explore()

Five lines of code, with a simple analysis, resulting in an interactive map, and all with the power of GeoPandas.

Hopefully in a future post I’ll do a bit more work on this data – I’d like to make a prettier map, and I’d like to try and find some way to test my friends and see if they can work out what buildings they are.