Robin's Blog

Searching an aerial photo with text queries – a demo and how it works

Summary: I’ve created a demo web app where you can search an aerial photo of Southampton, UK using text queries such as "roundabout", "tennis court" or "ship". It uses vector embeddings to do this – which I explain in this blog post.

In this post I’m going to try and explain a bit more about how this works.

Firstly, I should explain that the only data used for the searching is the aerial image data itself – even though a number of these things will be shown on the OpenStreetMap map, none of that data is used, so you can also search for things that wouldn’t be shown on a map (like a blue bus)

The main technique that lets us do this is vector embeddings. I strongly suggest you read Simon Willison’s great article/talk on embeddings but I’ll try and explain here too. An embedding model lets you turn a piece of data (for example, some text, or an image) into a constant-length vector – basically just a sequence of numbers. This vector would look something like [0.283, -0.825, -0.481, 0.153, ...] and would be the same length (often hundreds or even thousands of elements long) regardless how long the data you fed into it was.

In this case, I’m using the SkyCLIP model which produces vectors that are 768 elements long. One of the key features of these vectors are that the model is trained to produce similar vectors for things that are similar in some way. For example, a text embedding model may produce a similar vector for the words "King" and "Queen", or "iPad" and "tablet". The ‘closer’ a vector is to another vector, the more similar the data that produced it.

The SkyCLIP model was trained on image-text pairs – so a load of images that had associated text describing what was in the image. SkyCLIP’s training data "contains 5.2 million remote sensing image-text pairs in total, covering more than 29K distinct semantic tags" – and these semantic tags and the text descriptions of them were generated from OpenStreetMap data.

Once we’ve got the vectors, how do we work out how close vectors are? Well, we can treat the vectors as encoding a point in 768-dimensional space. That’s a bit difficult to visualise – so imagine a point in 2- or 3-dimensional space as that’s easier, plotted on a graph. Vectors for similar things will be located physically closer on the graph – and one way of calculating similarity between two vectors is just to measure the multi-dimensional distance on a graph. In this situation we’re actually using cosine similarity, which gives a number between -1 and +1 representing the similarity of two vectors.

So, we now have a way to calculate an embedding vector for any piece of data. The next step we take is to split the aerial image into lots of little chunks – we call them ‘image chips’ – and calculate the embedding of each of those chunks, and then compare them to the embedding calculated from the text query.

I used the RasterVision library for this, and I’ll show you a bit of the code. First, we generate a sliding window dataset, which will allow us to then iterate over image chips. We define the size of the image chip to be 200×200 pixels, with a ‘stride’ of 100 pixels which means each image chip will overlap the ones on each side by 100 pixels. We then configure it to resize the output to 224×224 pixels, which is the size that the SkyCLIP model expects as input.

ds = SemanticSegmentationSlidingWindowGeoDataset.from_uris(
        image_uri=uri,
        image_raster_source_kw=dict(channel_order=[0, 1, 2]),
        size=200,
        stride=100,
        out_size=224,
    )

We then iterate over all of the image chips, run the model to calculate the embedding and stick it into a big array:

dl = DataLoader(ds, batch_size=24)

EMBEDDING_DIM_SIZE = 768
embs = torch.zeros(len(ds), EMBEDDING_DIM_SIZE)

with torch.inference_mode(), tqdm(dl, desc='Creating chip embeddings') as bar:
    i = 0
    for x, _ in bar:
        x = x.to(DEVICE)
        emb = model.encode_image(x)
        embs[i:i + len(x)] = emb.cpu()
        i += len(x)

# normalize the embeddings
embs /= embs.norm(dim=-1, keepdim=True)

embs.shape

We also do a fair amount of fiddling around to get the locations of each chip and store those too.

Once we’ve stored all of those (I’ll get on to storage in a moment), we need to calculate the embedding of the text query too – which can be done with code like this:

text = tokenizer(text_queries)
with torch.inference_mode():
        text_features = model.encode_text(text.to(DEVICE))
        text_features /= text_features.norm(dim=-1, keepdim=True)
        text_features = text_features.cpu()

It’s then ‘just’ a matter of comparing the text query embedding to the embeddings of all of the image chips, and finding the ones that are closest to each other.

To do this, we can use a vector database. There are loads of different vector databases to choose from, but I’d recently been to a tutorial at PyData Southampton (I’m one of the co-organisers, and I strongly recommend attending if you’re in the area) which used the Pinecone serverless vector database, and they have a fairly generous free tier, so I thought I’d try that.

Pinecone, like all other vector databases, allows you to insert a load of vectors and their metadata (in this case, their location in the image) into the database, and then search the database to find the vectors closest to a ‘search vector’ you provide.

I won’t bother showing you all the code for this side of things: it’s fairly standard code for calling Pinecone APIs, mostly copied from their tutorials.

I then wrapped this all up in a FastAPI API, and put a simple Javascript front-end on it to display the results on a Leaflet web map. I also added some basic caching to stop us hitting the Pinecone API too frequently (as there is limit to the number of API calls you can make on the free plan). And that’s pretty-much it.

I hope the explanation made sense: have a play with the app here and post a comment with any questions.


Who reads my blog? Send me an email or comment if you do!

I’m interested to find out who is reading my blog. Following the lead of Jamie Tanna who was in turn copying Terence Eden (both of whose blogs I read), I’d like to ask people who read this to drop me an email or leave a comment on this post if you read this blog and see this post. I have basic analytics on this blog, and I seem to get a reasonable number of views – but I don’t know how many of those are real people, and how many are bots etc.

Feel free to just say hi, but if you have chance then I’d love to find out a bit more about you and how you read this. Specifically, feel free to answer any or all of the following questions:

  • Do you read this on the website or via RSS?
  • Do you check regularly/occasionally for new posts, or do you follow me on social media (if so, which one?) to see new posts?
  • How did you find my blog in the first place?
  • Are you interested in and/or working in one of my specialisms – like geospatial data, general data science/data processing, Python or similar?
  • Which posts do you find most interesting? Programming posts on how to do things? Geographic analyses? Book reviews? Rare unusual posts (disability, recipes etc)?
  • Have you met me in real life?
  • Is there anything particular you’d like me to write about?

The comments box should be just below here, and my email is robin@rtwilson.com

Thanks!


A load of links…

For months now I’ve been collecting a load of links saying that I’ll get round to blogging them "soon". Well, I’m currently babysitting for a friend’s daughter (who is sleeping peacefully upstairs), so I’ve finally found time to write them up.

So, here are a load of links – a lot of them are geospatial- or data-related, but I hope you might find something here you like even if those aren’t your specialisms:

  • QGIS Visual Changelog for v3.36 – QGIS is great GIS software that I use a lot in my work. The reason this link is here though is because of their ‘visual changelog’ that they write for every new version – it is a really good way of showing what has changed in a GUI application, and does a great job of advertising the new features.
  • List of unusual units of measurement – a fascinating list from Wikipedia. I think my favourite are the barn (and outhouse/shed) and the shake – you’ll have to read the article to see what these are!
  • List of humourous units of measurement – linked from the above is this list of humourous units of measurement, including the ‘beard-second’, and the Smoot.
  • lonboard – a relatively new package for very efficiently creating interactive webmaps of geospatial data from Python. It is so much faster than geopandas.explore(), and is developing at a rapid pace. I’m using it a lot these days.
  • A new metric for measuring learning – an interesting post from Greg Wilson about the time it takes learners to realise that something was worth learning. I wonder what the values are for some things that I do a lot of – remote sensing, GIS, Python, cloud computing, cloud-native geospatial etc.
  • The first global 30-m land-cover dynamic monitoring product with fine classification system from 1985 to 2022 – an interesting dataset that I haven’t had chance to investigate in detail yet, but purports to give 30m global land cover every 5 years from 1985 and every year from 2000.
  • CyberChef – from GCHQ (yes, the UK signals intelligence agency) comes this very useful web-based tool for converting data formats, extracting data, doing basic image processing, extracting files and more. You can even build multiple tools into a ‘recipe’ or pipeline.
  • envreport – a simple Python script to report lots of information about the Python environment in a diffable form – I can see this being very useful
  • Antarctic glaciers named after satellites – seven Antarctic glaciers have been named after satellites, reflecting (ha!) the importance of satellites in monitoring Antarctica
  • MoMAColors and MetBrewer – these are color palettes derived from artwork at the Museum of Modern Art and the Metropolitan Museum of Art in New York. There are some beautiful sets of colours in here (see below) which I want to use in some future data visualisations.

  • geospatial-cli – a list of geospatial command-line tools. Lots of them I knew, but there were some gems I hadn’t come across before here too.
  • map-gl-tools – a nice Javascript package to add a load of syntactic sugar to the MapLibreJS library. I’ve recently started using this library and found things quite clunky (coming from a Leaflet background) so this really helped
  • CoolWalks – a nice research paper looking at creating walking routes on the shaded side of the street for various cities
  • Writing efficient code for GeoPandas and Shapely in 2023 – a very useful notebook showing how to do things efficiently in GeoPandas in 2023. There are a load of old ways of doing things which are no longer the most efficient!
  • Inside PostGIS: Calculating Distance – a post explaining how PostGIS’s distance calculation algorithms work
  • quackosm – a Python and CLI tool for reading OpenStreetMap data using DuckDB – either for future analysis in DuckDB or for export to GeoParquet
  • Comparing odc.stac.load and stackstac for raster composite workflow – fairly self-explanatory, but it’s interesting to see the differences between two widely-used tools in the STAC ecosystem
  • Towards Flexible Data Schemas – this article shows how the flexibility of data schemes in specifications like STAC really help their adoption and use for diverse purposes

Accessing Planetary Computer STAC files in DuckDB

Microsoft Planetary Computer is a wonderful archive of geospatial datasets (primarily raster images of various types), provided with a STAC catalog to enable them to be easily searched through an API. That’s fine for normal usage where you want to find a selection of images and access the images themselves, but less useful when you want to do some analysis of the metadata itself.

For that use case, Planetary Computer provide the STAC metadata in bulk, stored as GeoParquet files. Their documentation page explains how to use this with geopandas and do some large-ish scale processing with Dask. I wanted to try this with DuckDB – a newer tool that is excellent at accessing and processing Parquet files, including efficiently accessing files available via HTTP, and only downloading the relevant parts of the file. So, this post explains how I managed to do this – showing various different approaches I tried and how (or if!) each of them worked.

Getting URLs

Planetary Computer URLs are basically just URLs to files on Azure Blob Storage. However, the URLs require signing with a Shared Access Signature (SAS) key. Luckily, there is a Planetary Computer Python module that will use an API to generate the correct keys for us.

In the docs for using the STAC GeoParquet files, they give this code:

catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1/",
    modifier=planetary_computer.sign_inplace,
)

asset = catalog.get_collection("io-lulc-9-class").assets["geoparquet-items"]

This gets a collection, and the geoparquet-items collection-level asset, having used a ‘modifier’ to sign the URLs as they are acquired from the STAC API. The URL is then stored in asset.href.

Using the DuckDB CLI – with HTTP URLs

When using GeoPandas, the docs show that you can pass a storage_options parameter to the read_parquet method and it will ‘just work’:

df = geopandas.read_parquet(
    asset.href, storage_options=asset.extra_fields["table:storage_options"]
)

However, to use the file in DuckDB we need a full URL. The SQL code we’re going to use in DuckDB for a quick test looks like this:

SELECT * FROM <URL> LIMIT 1

This just gets the first row of whatever URL is given.

Unfortunately, though, the URL provided by the STAC catalog looks like this:

abfs://items/io-lulc-9-class.parquet

If we try that with DuckDB, we find it doesn’t know how to deal with it. So, we need to convert this to a standard HTTP URL that can be downloaded as if it were just typed into a web browser (or used with cURL or wget). After a bit of playing around, I found I could create a full URL with this Python code:

url = f'https://{asset.extra_fields["table:storage_options"]
["account_name"]}.dfs.core.windows.net/{asset.href[7:]}?
{asset.extra_fields["table:storage_options"]["credential"]}'

That’s taking the account_name and sticking it in as part of the host part of the UK, then extracting the path from the original URL and adding that, and then finishing with the SAS token (stored as credential in the storage options) as a URL query parameter.

That results in a URL that looks like this (split over multiple lines for readability):

https://pcstacitems.dfs.core.windows.net/items/io-lulc-9-class.parquet
?st=2024-06-14T19%3A04%3A20Z&se=2024-06-15T19%3A49%3A20Z&sp=rl&sv=2024-05-04&
sr=c&skoid=9c8ff44a-6a2c-4dfb-b298-1c9212f64d9a
&sktid=72f988bf-86f1-41af-91ab-2d7cd011db47&skt=2024-06-15T09%3A00%3A14Z
&ske=2024-06-22T09%3A00%3A14Z&sks=b
&skv=2024-05-04
&sig=A0e%2BihbAagHmIZ%2Bg8gzH71TavRYQMZiHWJ/Uk9j0Who%3D

(note: that specific URL won’t work for you, as the SAS tokens are time-limited).

If we insert that into our DuckDB SQL statement then we find it works (click to enlarge):

Great! Now let’s try with a different collection on Planetary Computer – in this case the Sentinel 2 collection. We can make the URL in the same way as before, by changing how we define asset:

asset = catalog.get_collection("sentinel-2-l2a").assets["geoparquet-items"]

and then using the same query in DuckDB with the new URL. Unfortunately, this time we get an error:

Invalid Input Error: File '<URL>' too small to be a Parquet file

The problem here is that for larger collections, the STAC metadata is spread out over a series of GeoParquet files inside a folder, and the URL we’ve been given is just to the folder (even though it ends in .parquet). As we’re just using HTTP, there is no way to do things like list the files in a folder, so DuckDB has no way to find out what files are within the folder and start reading them. We need to find another way.

Using the DuckDB CLI – with the Azure extension

Conveniently, there is an Azure extension for DuckDB, and it lets you use URLs like 'abfss://⟨my_filesystem⟩/⟨path⟩/⟨my_file⟩.⟨parquet_or_csv⟩'. That’s a slightly different URL scheme to the one we’ve been given (abfss as opposed to abfs), but we can easily sort that.

Looking at the authentication docs though, it seems to require you to specify either a connection string, a service principal or using the ‘Azure Credential Chain’ (which, I think, works with the az command line and various environment variables that you may have set up). We don’t have any of those – they’re all a far broader scope than what we’ve got, which is just a SAS token for a specific file or folder. It looks like the Azure extension doesn’t support this, so we’ll have to look for another way.

Using the DuckDB Python library – with an Azure connector

As well as the command-line interface, DuckDB can also be used from Python. To set this up, just run pip install duckdb. While you’re at it, you might want to install pandas and the adlfs library for connecting to Azure Storage.

Using this is actually quite simple, first import the libraries:

import duckdb
from adlfs.spec import AzureBlobFileSystem

then set up the Azure connection:

a = AzureBlobFileSystem(account_name='pcstacitems',
   sas_token=asset.extra_fields["table:storage_options"]['credential'])

Note how here we’re using the sas_token parameter to provide a SAS token. We could use a different parameter here to provide a connection string or some other credential type. I couldn’t find much real-world use of this sas_token parameter when looking online – so this is probably the key ‘less well documented’ bit to take away from this article.

Continuing, we then connect to DuckDB and ‘register’ this Azure connection:

connection = duckdb.connect()
connection.register_filesystem(a)

From there, we can run a SQL query like this:

query = connection.sql(f"""
SELECT * FROM 'abfs://items/sentinel-2-l2a.parquet/*.parquet' LIMIT 1;
""")
query.to_df()

The call to to_df at the end converts the result into a Pandas DataFrame for easy viewing and manipulation. The results are shown below (click to enlarge):

Note that we’re passing a URL of abfs://items/sentinel-2-l2a.parquet/*.parquet – this is the URL from the STAC catalog with /*.parquet added to the end to ensure DuckDB picks up the large number of Parquet files stored there. I’d have thought this would have worked without that, but if I miss that out I get a error saying:

TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NoneType'

I suspect this is something to do with how things are passed to the Azure connection that we registered with DuckDB, but I’m not entirely sure. If you do know, then please leave a comment!

A few fun queries

So, now we’ve got this working, what can we do? Well, we can – fairly efficiently – run queries across all the STAC metadata for Sentinel 2 images. I’ll just give a few examples of queries I’ve run below.

We can find out how many images there are in the collection in total:

SELECT COUNT(*) FROM 'abfs://items/sentinel-2-l2a.parquet/*.parquet'

We can do the same for just the images acquired in 2020 – by using a fact we know about how the individual parquet files are named (from the Planetary Computer docs):

SELECT COUNT(*) FROM 'abfs://items/sentinel-2-l2a.parquet/*2020*.parquet'

And we can start looking at how many scenes were captured with different amounts of cloud cover:

SELECT COUNT(*) / (SELECT COUNT(*) FROM 'abfs://items/sentinel-2-l2a.parquet/*2020*.parquet'), CASE
    WHEN "eo:cloud_cover" between 0 and 5 then '0-5%'
    WHEN "eo:cloud_cover" between 5 and 10 then '5-10%'
    WHEN "eo:cloud_cover" between 10 and 15 then '10-15%'
    END AS category
FROM 'abfs://items/sentinel-2-l2a.parquet/*2020*.parquet' GROUP BY category,

This tells us that 20% of scenes had between 0 and 5% cloud cover (quite a high number, I thought – but then again, I am used to living in the UK!), and around 4-5% in each of the 5-10% and 10-15% categories.

There are plenty of other analyses that you could do with these Parquet files, of course. At some point I might even get around to the task which initially made me look into this: that is, trying to find Landsat and Sentinel 2 scenes that were acquired over the same location at very similar times. I think I’ll leave that for another day though…


Introducing offline_folium

Another new-ish package that I’ve never got around to writing about on my blog is offline_folium. It has a somewhat niche use-case, but it seems like a few people have found it useful.

In brief, it allows you to use the folium package for creating interactive maps from Python, but without an internet connection. Folium is built on top of the Leaflet web mapping library – and it loads all the relevant JS and CSS files directly from CDNs. This means that if you try and use folium without an internet connection it just doesn’t work. Offline_folium works around this by downloading the files when you do have an internet connection, and then re-writing the links to the files to point to the offline versions.

I originally created this package when doing freelance work on a project for the UK Navy – they wanted to have interactive maps on an air-gapped computer, so I built a system that would allow this (in this case, the JS/CSS file download step was run as part of building the ‘app’ we produced). I should note at this point that without an internet connection the default OpenStreetMap background mapping will not work. In some situations that would be a significant problem – but we were using other background mapping, coastline vector files and so on, so it wasn’t a problem for us.

So, how do you use this package?

First, install it using pip install offline_folium. Then make sure you have an internet connection and run python -m offline_folium – this will download the JS/CSS files and store them in a sensible place (chosen automatically). Now you’re all set up.

Once you have no internet connection and want to use folium, you must import it in a slightly different way – first import offline from offline_folium, and then import folium, like this:

from offline_folium import offline
import folium

Now you can use folium as usual, for example by creating a simple map object:

m = folium.Map()

So, that’s pretty-much it. People are actively submitting pull requests at the moment, so hopefully the functionality will expand to work with various folium plugins too.

For more information (or to submit a PR yourself), see the Github repo.


Introducing pyAURN – a Python package for accessing UK air quality data

I realised recently that I’d never actually blogged about my pyAURN package – so it’s about time that I did.

When doing some freelance work on air quality a while back, I wanted an easy way to access UK air quality from the Automatic Urban and Rural Network (AURN). Unfortunately, there isn’t a nice API for accessing the data. Strangely, though, they do provide the data in a series of RData files for use by the openair R package. I wanted to use the data from Python though – but conveniently there is a Python package for reading RData files.

So, I put all these together into a simple Python package called pyAURN. It is strongly based upon openair, but is a lot more limited in functionality – it basically only covers importing the data, and doesn’t have many plotting or analysis functions.

Here’s an example of how to use it:

from pyaurn import importAURN, importMeta, timeAverage

# Download metadata of site IDs, names, locations etc
metadata = importMeta()

# Download 4 years of data for the Marylebone Road site
# (MY1 is the site ID for this site)
# Note: range(2016, 2022) will produce a list of six years: 2016, 2017, 2018, 2019, 2020, and 2021. 
# Alternatively define a list of years to use eg. [2016,2017,2018,2019,2020,2021]
data = importAURN("MY1", range(2016, 2022))

# Group the DataFrame by a frequency of monthly, and the statistic mean(). 
data_monthly = timeAverage(data,avg_time="month",statistic="mean")

I found this really useful for my air quality analysis work – and I hope you do too. The package is on PyPI, so you can run pip install pyaurn or view the project on Github.


How to install the Python triangle package on an Apple Silicon Mac

I was recently trying to set up RasterVision on my Apple Silicon Mac (specifically a M1 MacBook Pro, but I’m pretty sure this applies to any Apple Silicon Mac). It all went fine until it came time to install the triangle package, when I got an error. The error output is fairly long, but the key part is the end part here:

triangle/core.c:196:12: fatal error: 'longintrepr.h' file not found
        #include "longintrepr.h"
                 ^~~~~~~~~~~~~~~
      1 error generated.
      error: command '/usr/bin/clang' failed with exit code 1
      [end of output]

It took me quite a bit of searching to find the answer (Google just isn’t very good at giving relevant results these days), but actually it turns out to be very simple. The latest version of triangle on PyPI doesn’t work on Apple Silicon, but the code in the Github repository does work, so you can install directly from Github with this command:

pip install git+https://github.com/drufat/triangle.git

and it should all work fine.

Once you’ve done this, install rastervision again and it should recognise that the triangle package is already installed and not try to install it again.


How to subscribe to releases on Github

For a while I’d wished there was an easy way to get notified when my favourite open-source packages release a new version. I’d often see something on social media, but that tended to only be for the larger packages – and I wanted to keep up with the smaller ones too.

When I actually bothered to put five minutes of work into finding out if this was possible, I found it was pretty easy. So, here’s how to do it:

  1. Go to the Github repository page for the package you’re interested in.

  2. Click the down arrow next to the ‘Watch’ badge at the top right of the header:

  1. Click ‘Custom’ and then tick ‘Releases’ and click ‘Apply’:

And that’s it! Now you’ll get a notification email when a new release is created on Github. Of course, this relies on the specific project using Github Releases, which not all open-source projects do, but a fairly high proportion of them seem to. After doing this you’ll get an email when there is a new release, with the details of the release and links to see it on Github. An example email for the stac-fastapi repository is below:

Update: Peter in the comments informed me that you can also get an Atom feed for use in your RSS reader by going to https://github.com/{owner}/{repo}/releases.atom. For example, this link for my Py6S package


Some matplotlib tips – a reblog

I was looking through my past blog posts recently, and thought a few of them were worth ‘reblogging’ so that more people could see them (now that my blog posts are getting more readers). So, here are a few posts on matplotlib tips.

Matplotlib titles have configurable locations – and you can have more than one at once!


This post explains how to create matplotlib titles in various locations.

Easily hiding items from the legend in matplotlib


This post explains how to easily hide items from the legend in matplotlib.

Easily specifying colours from the default colour cycle in matplotlib


This post shows how to specify colours from the default colour cycle – ranging from a very simple way to more complex methods that might work in other situations.