September 30, 2022

Spatial joins with geopandas and shapely

Recently a project at work required me to write an automated and reusable script to add country data to NSDIC’s Free Air Gravity Anomaly dataset. The classic open-source solution would have been PyQGIS’s processing module, but QGIS strangely still doesn’t support editing CSV layers out of the box, which would make for annoying intermediate steps converting to and from GeoJSON.

Fortunately in 2022 Python has an excellent set of geospatial libraries available. After loading the IceBridge data into a pandas DataFrame, we can easily build a geopandas GeoDataFrame with a new column of Shapely Points initialized from the provided Longitude and Latitude fields:

import pandas as pd
import geopandas as gpd
import shapely

df = pd.read_csv("/home/eric/icebridge_cleaned.csv")
geometry = [shapely.geometry.Point(xy) for xy in zip(df.Longitude, df.Latitude)]
gdf = gpd.GeoDataFrame(df, crs="EPSG:4326", geometry=geometry)

Next load the country data (geopandas.read_file() can read nearly any form of geospatial data into a GeoDataFrame) and drop unnecessary columns:

countries = gpd.read_file("/home/eric/EEZ_Land_v3_202030.shp")
countries = countries["geometry", "TERRITORY1"]

Now all that’s left to do is run the join. While we’d intuitively expect to use intersect, quirks under the hood in Shapely make the equivalent within predicate 100-1000 times faster, and the only real option for datasets of this size. We’ll finish up by dropping the geometry column we generated early (we only needed it to be able to run the join), as well as the generated index column which we don’t care about, and renaming the new column to something more intuitive:

res = gpd.sjoin(gdf, countries, how="left", predicate="within")
res = res.drop(["index_right"], axis=1)
res = res.drop(["geometry", "index_right"], axis=1)
res = res.rename(columns={"TERRITORY1": "Country"})

Now we have our original IceBridge CSV data with a new column describing which country each observation was made in.