Robin's Blog

How to fix GeoPandas drop_duplicates on geometry column not getting rid of all duplicates?

I use GeoPandas for a lot of my vector GIS data manipulation in Python.

I had a situation the other day where I ended up with duplicates of some geometries in my GeoDataFrame, and I wanted to remove them. The simple way to do this is to use the underlying pandas method drop_duplicates on the geometry column, like:

gdf.drop_duplicates('geometry')

However, even after running this I still had some duplicate geometries present. I couldn’t quite believe this, so I checked multiple times and then started examining the geometries themselves.

What I found was that my duplicates were technically different geometries, but they looked the same when viewing them on a map. That’s because my geometries were LineStrings and I had two copies of the geometry: one with co-ordinates listed in the order left-to-right, and one in the order right-to-left.

This is illustrated in the image below: both lines look the same, but one line has the individual vertex co-ordinates in order from left-to-right and one has the same co-ordinates in order from right-to-left.

These two geometries will show as the same when using the geometry.equals() method, but won’t be picked up by drop_duplicates. That’s because drop_duplicates just serialises the geometry to Well-Known Binary and compares those to check for equality.
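To make the difference concrete, here is a minimal sketch using Shapely directly. The two lines and their co-ordinates are made up for illustration; the point is only that topological equality ignores vertex order while the Well-Known Binary bytes do not.

```python
from shapely.geometry import LineString

# Two hypothetical lines with the same vertices listed in opposite order
left_to_right = LineString([(0, 0), (1, 1), (2, 0)])
right_to_left = LineString([(2, 0), (1, 1), (0, 0)])

# Topological equality: both lines cover the same points, so this is True
same_shape = left_to_right.equals(right_to_left)

# Byte-level comparison: the vertices are stored in a different order,
# so the Well-Known Binary representations differ and this is False
same_wkb = left_to_right.wkb == right_to_left.wkb

print(same_shape, same_wkb)
```

So any de-duplication that relies on the serialised bytes will treat these two lines as distinct, even though they are the same line on a map.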

I started implementing various complex (and computationally-intensive) ways to deal with this, and then posted an issue on the GeoPandas GitHub page. Someone there gave me a simple solution which I want to share with you.

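All you need to do is normalise the geometries with gdf.normalize() first. Note that normalize() returns a new GeoSeries rather than modifying the GeoDataFrame in place, so the result needs to be assigned back to the geometry column. Likewise, drop_duplicates returns a new GeoDataFrame, so keep its result. So, the full code would be:

gdf['geometry'] = gdf.normalize()
gdf = gdf.drop_duplicates('geometry')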

The normalize() method puts the vertices into a standard order so that they can be compared easily. This works for vertex order in lines and polygons, and ring orders in complex polygons.
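Here is a small end-to-end sketch of the approach, assuming a GeoPandas version that provides normalize() and a geometry column called 'geometry'. The data, the 'name' column and the variable names are hypothetical. It assigns the normalised geometries back to the geometry column before dropping duplicates, and also shows a variant that keeps the original vertex order by using the normalised WKB only as a comparison key.

```python
import geopandas as gpd
from shapely.geometry import LineString

# Hypothetical data: rows "a" and "b" are the same line in opposite directions
gdf = gpd.GeoDataFrame(
    {"name": ["a", "b", "c"]},
    geometry=[
        LineString([(0, 0), (1, 1), (2, 0)]),
        LineString([(2, 0), (1, 1), (0, 0)]),
        LineString([(0, 0), (5, 5)]),
    ],
)

# Without normalising, the reversed line is not treated as a duplicate
before = gdf.drop_duplicates("geometry")

# normalize() returns a new GeoSeries, so assign it back to the geometry column
normalized = gdf.copy()
normalized["geometry"] = normalized.normalize()
after = normalized.drop_duplicates("geometry")

# Variant: leave the stored geometries untouched and use the normalised
# WKB bytes only as the key for spotting duplicates
keep_mask = ~gdf.normalize().to_wkb().duplicated()
after_original_order = gdf[keep_mask]

print(len(before), len(after), len(after_original_order))
```

The first approach overwrites the stored vertex order with the normalised one, which is fine if direction does not matter to you. If line direction is meaningful in your data, the mask-based variant is probably the safer choice.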


This post originally appeared on Robin's Blog.



