A few words about spatial data processing
Preview image (by author)Disclaimer: in this article we will demonstrate all approaches using the open-source library estaty. This library appeared due to the desire to formalize the algorithms we use in our work into a tool that will be available to other developers. In other words, this article is written by the maintainer of the library.
Today we would like to continue discussing the topic of spatial data processing using open-source Python library. We have already talked about how to combine Open Street Map and Landsat open data to verify areas of green zones nearby real estate objects.
Now let’s consider another type of analysis: proximity analysis, or the availability (or accessibility) of some useful objects, such as parks, hospitals, kindergartens, etc. (Figure 1).
Figure 1. Schematic illustrating the approach to proximity analysis (let’s imagine that here it is about hairdressing salons, for example) (image by author)
A little bit about proximity analysis
We propose to start the story with a brief literature review. I like the article “Location Analytics Use-Cases that Every Data Scientist Should Know” (attention — it is Member-only story) that pretty easily introduces the reader to what geographic information analysis is.
The articles “Creating a Cost Distance Surface to Measure Park Access” and “The growing state of Utah: a proximity analysis of Utah residents to medical care facilities” also show quite clearly what characteristics in cities engineers can calculate to perform such analyses. The topic is quite popular and a lot of both scientific papers and practical applications have been developed — let’s join in.
Library (that we use)
Now let’s go to a more concrete conversation: the tools that can be used for this proximity analysis (in addition to geographic information systems like QGIS or ArcGIS and proprietary services as the end user of the product).
We, as developers, would like to analyze not aggregated data but raw values with the help of convenient libraries or services (preferably open-source, of course). We have not found any suitable tools for us, so we develop and use our own open-source library written in Python. It is called estaty and if you want, you can see the documentation page after reading the article:
https://estaty.readthedocs.io/en/latest/?badge=latest
Based on this library, we compose use cases. Based on these examples, we developed microservices that help real estate managers perform analysis. In addition to the ready-made services, however, these scripts are organized as use cases and are available to anyone who wants to use the library.
estaty can use different data sources to perform the calculations, but the open examples (to keep them truly open) are based on OpenStreetMap data. In the current paper, we want to show how we can do a fairly quick collateral estimation and still scale the approach to new data sources if needed.
Analysis (which we can do)
First, install the library using the command
pip install estaty==0.1.0
or, if you’re using poetry:
poetry add estaty==0.1.0
We are now ready to use version 0.1.0
Let’s start the implementation. We are going to provide a proximity analysis for the following address: “Berlin, Neustädtische Kirchstraße 4–7” — coordinates {latitude: 52.5171411, longitude: 13.3857187}. We will analyze the neighborhood of this object within a radius of 2 km.
In order to find out how long it will take us to get to the nearest bar (according to OSM — it is worth mentioning here), let’s write and run the following code:
https://medium.com/media/0a75b5d27f34c9b131fc0c251b8453cd/href
Here the following occurs — we construct a processing pipeline for analyzing a property. The pipeline will look like a graph consisting of sequential transformations over the data:
Figure 2. Spatial data processing pipeline for calculating distances of routes to bars using OpenStreetMap data (image by author)
Let’s take a closer look at what’s going “under the hood”. Let’s start with the fact that the estaty architecture is built on 5 abstract layers (in the case of the Pipeline above, only three types of nodes are used):
DataSource — loads data from the required source, brings the data to a common standard (vector and raster). No matter what the source was, the output from this node will be only in two possible variations, which allows to unify their further processing;Preprocessor — preprocessing of vector or raster data, e.g. assigning a new coordinate reference system (CRS);Merger — combining data in the required way. Possible combinations: vector data with vector data, raster with raster and vector data with raster data;Analyzer — core of the library — uses simple atomic transformations over rasters and vector objects to perform some analysis, such as area matching or pathfinding;Report — optional final analysis module, which allows generating a report in a user-friendly format.
By combining these nodes in a certain way it is possible to compose different pipelines for data analysis without modifying the ways of processing the original data — by substituting solely the original sources or ways of analysis (see Animation 1).
Animation 1. Flexibility of the library when preparing a pipeline for analysis (animation by author)
So, Figure 2 shows that only three nodes are used in the analysis — this is a fairly simple pipeline. As a result of its work we will get a geopandas GeoDataFrame (a table with geometries of objects) with linear objects — routes to the objects of our interest. In this case — to the bars. Visualization of the result, by the way, will look as follows:
Figure 3. Routes to bars with distances calculated by the algorithm (image by author)
We have the rights to do what we want with the obtained set of linear objects — for example, to find the distance to the nearest bar, or to request the average value from the sample (which is done in the code):
Min length: 308.93, mAverage length: 2306.49, m
As a true expert in the location of bars in the centers of large cities, I can say that the bars in the area of analysis, in reality, can be much more. But we agreed at the beginning that we would analyze based on OSM data, so….
The Pipeline itself is independent of the source of data on which the analysis is performed. This is ensured by the weak interconnectedness of the library modules. Let’s start with the DataSource node — it has a built-in mechanism of data conversion to one type (for example, vector data here will be in the format geopandas GeoDataFrame and will always be only three types, no matter what source we used for this, all of them will be converted to points, lines, or polygons). This is followed by a pre-processor that will automatically determine the appropriate metric projection for our data. The only thing that matters to the pre-processor is that the data must come from the node’s DataSource, the rest should be taken care of by the node before it. And our simple pipeline is completed by an analysis block, which will operate on any geometric objects, no matter if they are area, linear or point.
Thus, we can do exactly the same calculations for other categories of data that we will extract from OSM (of course, the list is not limited only to OSM), for example, proximity analysis for schools, parks, water objects, public toilets, waste disposal bins, police stations and so on — anything (for the types of objects that can be defined by tags, see the wiki page Map features or in the documentation estaty about data sources):
Figure 5. Proximity analysis for schools, parks, water objects, waste disposal bins, police stations (image by author)
How the approach can be scaled up
We mentioned that spatial data of any nature can be analyzed: social data, natural objects or artificial constructions. Let’s look even deeper into the source code. From the figures we can see that distance calculation is performed on routes built with regard to roads, which is reasonable, because usually we don’t fly to bars by helicopters, but walk. Therefore, the ‘distance’ analysis includes the search for routes on the road graph, using the one from OSM. That is, the algorithm searches for a route on the following abstractions (we relied on the following article, highly recommended — OSMnx: Python for Street Networks):
Figure 6. Graph of walking roads, which is used to search for optimal routes (image by author)
A graph here is a geographically referenced (have latitude and longitude attributes) nodes and edges. To find the shortest route from object A to object B, we need to take three steps:
Find the nearest node in the graph to object A (let’s call it node 1)Find the nearest node in the graph to object B (let’s call it node 2)Using the graph traversal algorithm, find the optimal route from node 1 to node 2
If we want to obtain the distance, there is a need to sum up the length of the route from node 1 to node 2 with the distances from object A to node 1 and from object B to node 2. The last two components can be calculated in a straight line because the network with nodes is quite dense. But here there is a difficulty — if object A and B are points, it is relatively easy to find the nodes nearest to them (calculate the distances between points). If the geometry of object A is, for example, an area, then we have to look for the “nearest” node to the polygon. What to do in the case of a linear object was also a problem to solve
Then let’s answer the question, “How to calculate distances from graph nodes to vector objects of different types: lines, points and polygons in a unified way”? For this purpose, the library converts objects to point type (it is very easy to find the distance between points) before starting the calculation of routes. We call such transformations representations (Figure 7).
Figure 7. Some ways to represent vector objects (image by author)
That is, instead of writing and maintaining three algorithms, we write one (search for routes to points) and then reduce all the initial data types to one. How to reduce them to one type is an ambiguous question. We could do it simply by calculating centroids. However, in the case of calculation in urban areas and especially for large objects (like parks), finding routes for centroids may be unnecessary — we consider that we have reached the park not when we are in its center, but when we take the first step from the asphalted path into the pleasant dirt park darkness.
So, we decided to use an approach where we find the closest point of a polygon or linear object to the object of analysis (from which the routes are built). Then this point is considered as the final destination node — to this node that the further route through the graph will be built.
More advanced analysis
You can’t impress anyone with routes to bars alone. So let’s complicate the task. Let’s assume that we want to make a more complex analysis by combining data from different sources. For example, we want to calculate how much on average we have to walk to parks with oak trees (we like to walk after bars and admire the park with trees). To do this, we will combine data from two sources:
Let’s download vector data from GBIF | Global Biodiversity Information Facility (check https://www.gbif.org/, relevant for 20 October 2023)OpenStreetMap — native integration of data downloading in the library
GBIF data is a vector layer with point geometries that show the locations where certain species of plants and animals (and something else) have been found. In this case we will be interested in the Oak (or Quercus robur in Latin). This is what our data will look like when we overlay a point layer with oaks on top of the parks polygons:
Figure 8. Combining OSM park data and GBIF data to calculate routes to oak woodland parks (image by author)
The code to analyze looks like this:
https://medium.com/media/bc32a4d23f98770ba1e642190aeefcde/href
One important thing to note is that the points do not have to fall exactly within the park’s polygon. We combine data from different sources and they may differ slightly. Therefore, we set a buffer of 10 meters and look at the result (Figure 9) — we will consider that there is an oak tree in the park if the park polygon has an intersection with the buffer polygon according to GBIF data:
Figure 9. Routes to parks where oak trees occur, with calculated distances (image by author)
A bit more about relevance
The topic of analysis is in demand and obvious. Consequently, for example, functionality of routing is available in many source code solutions (while preparing this article I came across an interesting notebook “Exploring routing tools for calculating paths for a set of origin-destination pairs”), as well as in popular user services like Google Maps, and services related to maps implicitly. If we talk about proximity analysis, then here too there are a lot of implemented solutions. For instance, we can mention such a tool as pandana.
All these tools are certainly useful, but using them in application examples requires writing some auxiliary code. We are trying to make a convenient tool for our own team and want to share our findings with the community. In the approach presented in this article we can mention data source agnostic approach and a flexible way of combining data. We will be glad to receive comments and suggestions.
Useful links:
https://github.com/red5ai/estaty — Python open-source library for processing spatial data and preparing prototype algorithms for property analysis. The library is quite new, but we plan to develop and improve it;https://estaty.readthedocs.io/en/latest/ — documentation page;https://github.com/red5ai/estaty_examples — open-source repository with example launches for this article
Datasets used in this post (& licenses):
Map data copyrighted OpenStreetMap contributors and available from https://www.openstreetmap.org
License — Open Data Commons Open Database License
Links are up to date at 25th of October 2023GBIF.org (20 October 2023) GBIF Occurrence Download https://doi.org/10.15468/dl.f487j5
License — Attribution-NonCommercial 4.0 International
Links are up to date at 25th of October 2023
The story about proximity analysis using OpenStreetMap data and more, was presented by Mikhail Sarafanov and the Wiredhut team
Proximity Analysis to Find the Nearest Bar Using Python was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.