Identifying Topical Hot Spots in Urban Areas

Hipster Hot-Spots in Budapest.

A generic framework using OpenStreetMap and DBSCAN spatial clustering to capture the most hyped urban areas

In this article, I show a quick and easy-to-use methodology for identifying hot spots for a given interest, based on points of interest (POIs) collected from OpenStreetMap (OSM) and clustered with the DBSCAN algorithm of sklearn. First, I collect the raw data of POIs belonging to a couple of categories that I found using ChatGPT and assumed to be characteristic of what is sometimes called the hip lifestyle (e.g., cafes, bars, marketplaces, yoga studios). After converting that data into a handy GeoDataFrame, I do the geospatial clustering and, finally, evaluate the results based on how well the different urban functionalities mix in each cluster.

While the choice of the topic I call ‘hipster’ and the POI categories linked to it is somewhat arbitrary, they can be easily replaced by other topics and categories — the automatic hot-spot detecting method remains the same. The advantages of such an easy-to-adopt method range from identifying local innovation hubs supporting innovation planning to detecting urban subcenters supporting urban planning initiatives, assessing different market opportunities for businesses, analyzing real-estate investment opportunities, or capturing tourist hotspots.

All images were created by the author.

1. Acquire data from OSM

First, I get the admin polygon of the target city. As Budapest is my hometown, I use that for easy (on-field) validation. However, as I am only using the global database of OSM, these steps can easily be reproduced for any other part of the world that OSM covers. In particular, I use the OSMnx package to get the admin boundaries in a super easy way.

import osmnx as ox # version: 1.0.1

city = 'Budapest'
admin = ox.geocode_to_gdf(city)
admin.plot()

The result of this code block:

The admin boundaries of Budapest.

Now, use the Overpass API to download the POIs that fall within the bounding box of Budapest’s admin boundaries. In the amenity_mapping list, I compiled the POI categories that I associate with the hipster lifestyle. I also have to note that this is a vague, non-expert categorization, and with the methods presented here, anyone may update the list of categories accordingly. Additionally, one may incorporate other POI data sources that contain more fine-grained, multi-level categorization for a more accurate characterization of the given topic. In other words, this list can be changed in any way you see fit, from covering the hipsterish things better to readjusting this exercise to any other topic categorization (e.g., food courts, shopping areas, tourist hotspots, etc.).

Note: as the Overpass downloader returns all results within a bounding box, at the end of this code block, I filter out those POIs outside of the admin boundaries by using GeoPandas’ overlay function.

import overpy # version: 0.6
from shapely.geometry import Point # version: 1.7.1
import geopandas as gpd # version: 0.9.0

# start the api
api = overpy.Overpass()

# get the enclosing bounding box
minx, miny, maxx, maxy = admin.to_crs(4326).bounds.T[0]
# Overpass expects the order (south, west, north, east)
bbox = ','.join([str(miny), str(minx), str(maxy), str(maxx)])

# define the OSM categories of interest
amenity_mapping = [
    ("amenity", "cafe"),
    ("tourism", "gallery"),
    ("amenity", "pub"),
    ("amenity", "bar"),
    ("amenity", "marketplace"),
    ("sport", "yoga"),
    ("amenity", "studio"),
    ("shop", "music"),
    ("shop", "second_hand"),
    ("amenity", "foodtruck"),
    ("amenity", "music_venue"),
    ("shop", "books"),
]

# iterate over all categories, call the overpass api,
# and add the results to the poi_data list
poi_data = []

for idx, (amenity_cat, amenity) in enumerate(amenity_mapping):
    query = f"""node["{amenity_cat}"="{amenity}"]({bbox});out;"""
    result = api.query(query)
    print(amenity, len(result.nodes))

    for node in result.nodes:
        data = {}
        name = node.tags.get('name', 'N/A')
        data['name'] = name
        data['amenity'] = amenity_cat + '__' + amenity
        # overpy returns Decimal coordinates, so cast them to float
        data['geometry'] = Point(float(node.lon), float(node.lat))
        poi_data.append(data)

# transform the results into a geodataframe
# (OSM coordinates are WGS84, so set the CRS before the overlay)
gdf_poi = gpd.GeoDataFrame(poi_data, crs=4326)
print(len(gdf_poi))
# keep only the POIs that fall inside the admin boundaries
gdf_poi = gpd.overlay(gdf_poi, admin[['geometry']])
print(len(gdf_poi))

The result of this code block is the frequency distribution of each downloaded POI category:

The frequency distribution of each downloaded POI category.
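The loop above sends one request per category. As a possible variation, the filters could be combined into a single Overpass QL union, so only one request is needed. The sketch below only builds the query string; the helper name build_overpass_union_query and the bounding box values are illustrative assumptions, not part of the original workflow.

```python
def build_overpass_union_query(tag_pairs, south, west, north, east):
    """Build one Overpass QL query that unions several node filters."""
    # Overpass bounding boxes are ordered (south, west, north, east)
    bbox = f"{south},{west},{north},{east}"
    filters = "".join(
        f'node["{key}"="{value}"]({bbox});' for key, value in tag_pairs
    )
    return f"[out:json];({filters});out;"


# Hypothetical, approximate bounding box around Budapest
example_query = build_overpass_union_query(
    [("amenity", "cafe"), ("shop", "books")],
    47.35, 18.93, 47.61, 19.33,
)
print(example_query)
```

With a union query, the category of each returned node is no longer known from the loop variable, so it would have to be read back from the tags of each node.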

2. Visualize the POI data

Now, visualize all the 2101 POIs:

import matplotlib.pyplot as plt
f, ax = plt.subplots(1,1,figsize=(10,10))
admin.plot(ax=ax, color = 'none', edgecolor = 'k', linewidth = 2)
gdf_poi.plot(column = 'amenity', ax=ax, legend = True, alpha = 0.3)

The result of this code cell:

Budapest with all the downloaded POIs labeled by their categories.

This plot is pretty difficult to interpret, except that the city center is clearly super crowded, so let’s go for an interactive visualization tool, Folium.

import folium
import branca.colormap as cm

# get the centroid of the city and set up the map
x, y = admin.geometry.to_list()[0].centroid.xy
m = folium.Map(location=[y[0], x[0]], zoom_start=12, tiles='CartoDB Dark_Matter')
colors = ['blue', 'green', 'red', 'purple', 'orange', 'pink', 'gray', 'cyan', 'magenta', 'yellow', 'lightblue', 'lime']

# assign a color to each amenity category
amenity_colors = {}
unique_amenities = gdf_poi['amenity'].unique()
for i, amenity in enumerate(unique_amenities):
    amenity_colors[amenity] = colors[i % len(colors)]

# visualize the pois with a scatter plot
for idx, row in gdf_poi.iterrows():
    amenity = row['amenity']
    lat = row['geometry'].y
    lon = row['geometry'].x
    color = amenity_colors.get(amenity, 'gray') # default to gray if not in the colormap

    folium.CircleMarker(
        location=[lat, lon],
        radius=3,
        color=color,
        fill=True,
        fill_color=color,
        fill_opacity=1.0, # No transparency for dot markers
        popup=amenity,
    ).add_to(m)

# show the map
m

The default view of this map (which you can easily change by adjusting the zoom_start=12 parameter):

Budapest with all the downloaded POIs labeled by their categories — interactive version, first zoom setting.

Then, it is possible to change the zoom parameter and replot the map, or simply zoom in using the mouse:

Budapest with all the downloaded POIs labeled by their categories — interactive version, second zoom setting.

Or completely zoom out:

Budapest with all the downloaded POIs labeled by their categories — interactive version, third zoom setting.

3. Spatial clustering

Now that I have all the necessary POIs at hand, I go for the DBSCAN algorithm, first writing a function that takes the POIs and does the clustering. I will only fine-tune the eps parameter of DBSCAN, which essentially quantifies the characteristic size of a cluster: the maximum distance between POIs to be grouped together. Additionally, I transform the geometries to a local CRS (EPSG:23700) to work in SI units. More on CRS conversions here.
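To see why the reprojection matters: eps is interpreted in the units of the coordinates, so on raw longitude/latitude values an eps of 50 would mean 50 degrees rather than 50 meters. The rough sketch below uses a spherical Earth approximation (an assumption made for illustration only; the helper name meters_per_degree is hypothetical) to show that, at Budapest's latitude, one degree also covers a different ground distance along the two axes.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius, spherical approximation


def meters_per_degree(lat_deg):
    """Approximate ground length of one degree of latitude and of longitude."""
    per_deg_lat = math.pi * EARTH_RADIUS_M / 180
    per_deg_lon = per_deg_lat * math.cos(math.radians(lat_deg))
    return per_deg_lat, per_deg_lon


lat_m, lon_m = meters_per_degree(47.5)  # roughly Budapest's latitude
print(round(lat_m), round(lon_m))

# the same 50 m corresponds to a different number of degrees on each axis
print(50 / lat_m, 50 / lon_m)
```

A projected CRS in meters avoids this distortion, which is why the clustering is run on the reprojected geometries.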

from sklearn.cluster import DBSCAN # version: 0.24.1
from collections import Counter

# do the clustering
def apply_dbscan_clustering(gdf_poi, eps):

    # use the projected x, y coordinates as features
    feature_matrix = gdf_poi['geometry'].apply(lambda geom: (geom.x, geom.y)).tolist()
    dbscan = DBSCAN(eps=eps, min_samples=1) # You can adjust min_samples as needed
    cluster_labels = dbscan.fit_predict(feature_matrix)
    gdf_poi['cluster_id'] = cluster_labels

    return gdf_poi

# transforming to local crs
gdf_poi_filt = gdf_poi.to_crs(23700)

# do the clustering (eps is in meters in this CRS)
eps_value = 50
clustered_gdf_poi = apply_dbscan_clustering(gdf_poi_filt, eps_value)

# Print the GeoDataFrame with cluster IDs
print('Number of clusters found: ', len(set(clustered_gdf_poi.cluster_id)))
clustered_gdf_poi

The result of this cell:

Preview on the POI GeoDataFrame where each POI is labeled by its cluster id.
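With min_samples=1, as in the call above, every POI counts as a core point, so no point is labeled as noise and the clusters are simply the connected components of the graph that links any two POIs at most eps apart. The plain-Python sketch below illustrates that equivalence on made-up coordinates; eps_components is a hypothetical helper, it compares all pairs (so it is only meant for tiny inputs), and its label numbering is not guaranteed to match sklearn's.

```python
import math


def eps_components(points, eps):
    """Label points by connected component of the 'at most eps apart' graph."""
    parent = list(range(len(points)))

    def find(i):
        # follow parent links to the root, shortening the path on the way
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= eps:
                parent[find(i)] = find(j)

    # relabel the roots as 0, 1, 2, ... in order of first appearance
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(len(points))]


# made-up coordinates in meters
toy_points = [(0, 0), (40, 0), (80, 0), (500, 500)]
labels = eps_components(toy_points, eps=50)
print(labels)
```

Note the chaining effect: the first and third toy points are 80 m apart, yet they end up in the same group because the point between them bridges the gap. This is one reason why a cluster can stretch much further than eps.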

There are 1237 clusters, which seems to be a bit too many if we are just looking for cozy, hipsterish hotspots. Let’s take a look at their size distribution and then pick a size threshold, as calling a cluster with two POIs a hotspot is probably not really sound anyway.

clusters = clustered_gdf_poi.cluster_id.to_list()
clusters_cnt = Counter(clusters).most_common()

f, ax = plt.subplots(1,1,figsize=(8,4))
ax.hist([cnt for c, cnt in clusters_cnt], bins = 20)
ax.set_yscale('log')
ax.set_xlabel('Cluster size', fontsize = 14)
ax.set_ylabel('Number of clusters', fontsize = 14)

The result of this cell:

Cluster size distribution.

Based on the gap in the histogram, let’s keep clusters with at least 10 POIs! For now, this is a simple enough working hypothesis. However, the filter could be refined in more sophisticated ways as well, for instance, by incorporating the number of different POI types or the geographical area covered.

to_keep = [c for c, cnt in Counter(clusters).most_common() if cnt>9]
clustered_gdf_poi = clustered_gdf_poi[clustered_gdf_poi.cluster_id.isin(to_keep)]
clustered_gdf_poi = clustered_gdf_poi.to_crs(4326)
len(to_keep)

This snippet shows that there are 15 clusters satisfying the filtering.
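As one possible refinement of the size-only filter, a cluster could also be required to contain a minimum number of distinct POI categories. The sketch below works on plain (cluster_id, category) pairs with made-up data; the function name select_hotspots and both thresholds are illustrative assumptions, not values used in this article.

```python
from collections import defaultdict


def select_hotspots(records, min_size=10, min_categories=3):
    """Keep cluster ids with enough POIs and enough distinct categories."""
    by_cluster = defaultdict(list)
    for cluster_id, category in records:
        by_cluster[cluster_id].append(category)
    return sorted(
        cid
        for cid, cats in by_cluster.items()
        if len(cats) >= min_size and len(set(cats)) >= min_categories
    )


# made-up records: cluster 0 is large but has one category only,
# cluster 1 is large and mixed, cluster 2 is too small
toy_records = (
    [(0, "amenity__bar")] * 12
    + [(1, "amenity__bar")] * 5
    + [(1, "amenity__cafe")] * 4
    + [(1, "shop__books")] * 3
    + [(2, "amenity__cafe")] * 2
)
kept = select_hotspots(toy_records)
print(kept)
```

On the real data, the pairs could be taken from the cluster_id and amenity columns of the clustered GeoDataFrame.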

Once we have the 15 true hipster clusters, put them on a map:

import folium
import random

# center the map on the bounding box of the clustered POIs
min_longitude, min_latitude, max_longitude, max_latitude = clustered_gdf_poi.total_bounds
m = folium.Map(location=[(min_latitude+max_latitude)/2, (min_longitude+max_longitude)/2], zoom_start=14, tiles='CartoDB Dark_Matter')

# get unique, random colors for each cluster
unique_clusters = clustered_gdf_poi['cluster_id'].unique()
cluster_colors = {cluster: "#{:02x}{:02x}{:02x}".format(random.randint(0, 255), random.randint(0, 255), random.randint(0, 255)) for cluster in unique_clusters}

# visualize the pois
for idx, row in clustered_gdf_poi.iterrows():
    lat = row['geometry'].y
    lon = row['geometry'].x
    cluster_id = row['cluster_id']
    color = cluster_colors[cluster_id]

    # create a dot marker
    folium.CircleMarker(
        location=[lat, lon],
        radius=3,
        color=color,
        fill=True,
        fill_color=color,
        fill_opacity=0.9,
        popup=row['amenity'],
    ).add_to(m)

# show the map
m

Hipster POI clusters, first zoom level.

Hipster POI clusters, second zoom level.

Hipster POI clusters, third zoom level.

4. Comparing the clusters

Each cluster counts as a fancy, hipster cluster — however, they all must be unique in some way or another, right? Let’s see how unique they are by comparing the portfolio of POI categories they have to offer.

First, shoot for diversity and measure the variety/diversity of POI categories in each cluster by computing their entropy.

import math
import pandas as pd

def get_entropy_score(tags):
    # count how many times each category occurs
    tag_counts = {}
    total_tags = len(tags)
    for tag in tags:
        if tag in tag_counts:
            tag_counts[tag] += 1
        else:
            tag_counts[tag] = 1

    # Shannon entropy (natural logarithm) of the category shares
    tag_probabilities = [count / total_tags for count in tag_counts.values()]
    shannon_entropy = -sum(p * math.log(p) for p in tag_probabilities)
    return shannon_entropy

# create a dict where each cluster has its own list of amenities
clusters_amenities = clustered_gdf_poi.groupby(by = 'cluster_id')['amenity'].apply(list).to_dict()

# compute and store the entropy scores
entropy_data = []
for cluster, amenities in clusters_amenities.items():
    E = get_entropy_score(amenities)
    entropy_data.append({'cluster' : cluster, 'size' :len(amenities), 'entropy' : E})

# add the entropy scores to a dataframe
entropy_data = pd.DataFrame(entropy_data)
entropy_data

The result of this cell:

The diversity (entropy) of each cluster based on its POI profile.
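As a quick sanity check on how to read these scores: the entropy is 0 when a cluster contains a single category, and it reaches ln(k) when k categories are equally frequent. The sketch below verifies both bounds on made-up tag lists. It also shows a normalized variant that divides by ln(k), which is not used in this article and is included only as an assumption about how clusters with different numbers of categories could be put on a common 0 to 1 scale.

```python
import math
from collections import Counter


def shannon_entropy(tags):
    """Shannon entropy (natural log) of the category shares in a tag list."""
    counts = Counter(tags)
    total = len(tags)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def normalized_entropy(tags):
    """Entropy divided by its maximum ln(k) for the k categories present."""
    k = len(set(tags))
    return shannon_entropy(tags) / math.log(k) if k > 1 else 0.0


only_bars = ["bar"] * 5
even_mix = ["bar", "cafe", "books", "gallery"]

print(shannon_entropy(only_bars), shannon_entropy(even_mix), math.log(4))
print(normalized_entropy(even_mix))
```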

And a quick correlation analysis on this table:

entropy_data.corr()

The correlation between cluster features.
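By default, pandas' corr() computes the pairwise Pearson correlation coefficient. For readers who want to see what that number measures, here is a minimal plain-Python sketch on made-up sizes and entropies (the values are illustrative only and are not the results of this article).

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# made-up cluster sizes and entropies, for illustration only
toy_sizes = [10, 12, 15, 20, 40]
toy_entropies = [0.9, 1.1, 1.0, 1.4, 1.6]
r = pearson(toy_sizes, toy_entropies)
print(round(r, 2))
```

Keep in mind that the cluster ID is just a label, so its correlation with the other columns carries no meaning.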

After computing the correlation between cluster ID, cluster size, and cluster entropy, there is a significant correlation between size and entropy; however, it is far from explaining all the diversity. Apparently, some hotspots are indeed more diverse than others, while others are somewhat more specialized. What are they specialized in? I will answer this question by comparing the POI profile of each cluster to the overall distribution of each POI type within clusters and picking the top three POI categories most typical of a cluster compared to the average.

# packing the poi profiles into dictionaries
clusters = sorted(list(set(clustered_gdf_poi.cluster_id)))
amenity_profile_all = dict(Counter(clustered_gdf_poi.amenity).most_common())
amenity_profile_all = {k : v / sum(amenity_profile_all.values()) for k, v in amenity_profile_all.items()}

# computing the relative frequency of each category
# and keeping only the above-average (>1) and top 3 candidates
clusters_top_profile = {}
for cluster in clusters:

    # share of each category within this cluster
    amenity_profile_cls = dict(Counter(clustered_gdf_poi[clustered_gdf_poi.cluster_id == cluster].amenity).most_common())
    amenity_profile_cls = {k : v / sum(amenity_profile_cls.values()) for k, v in amenity_profile_cls.items()}

    # ratio of the in-cluster share to the overall share
    clusters_top_amenities = []
    for a, cnt in amenity_profile_cls.items():
        ratio = cnt / amenity_profile_all[a]
        if ratio > 1:
            clusters_top_amenities.append((a, ratio))
    clusters_top_amenities = sorted(clusters_top_amenities, key=lambda tup: tup[1], reverse=True)
    clusters_top_amenities = clusters_top_amenities[:3]

    clusters_top_profile[cluster] = [c[0] for c in clusters_top_amenities]

# print, for each cluster, its top categories:
for cluster, top_amenities in clusters_top_profile.items():
    print(cluster, top_amenities)

The result of this code-block:

The unique amenity fingerprint of each cluster.

The top category descriptions already show some trends. For example, cluster 17 is clearly for drinking, while cluster 19 also mixes in music, possibly partying. Cluster 91, with bookstores, galleries, and cafes, is certainly a place for daytime relaxation, while cluster 120, with music and a gallery, can be a great warm-up for any pub crawl. From the distribution, we can also see that hopping into a bar is always appropriate (or, depending on the use case, we should think of further normalizations based on category frequencies)!
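To make the ratio behind these fingerprints concrete: it is the share of a category inside the cluster divided by its share across all clustered POIs, so a value above 1 means the category is over-represented in that cluster. The toy example below uses made-up counts and a hypothetical helper named relative_shares, and it assumes the cluster's tags are a subset of the overall tags.

```python
from collections import Counter


def relative_shares(cluster_tags, all_tags):
    """Share of each category in the cluster divided by its overall share."""
    cluster_counts, all_counts = Counter(cluster_tags), Counter(all_tags)
    return {
        cat: (cnt / len(cluster_tags)) / (all_counts[cat] / len(all_tags))
        for cat, cnt in cluster_counts.items()
    }


# made-up counts: bars dominate everywhere, bookstores are rare overall
all_tags = ["bar"] * 60 + ["cafe"] * 30 + ["books"] * 10
cluster_tags = ["bar"] * 6 + ["books"] * 4

ratios = relative_shares(cluster_tags, all_tags)
print(ratios)
```

In this toy cluster, bars are the most frequent category but are not over-represented relative to the overall profile, while the rarer bookstores stand out. It also hints at why further normalization may be useful: for rare categories, a handful of POIs is enough to produce a large ratio.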

Concluding remarks

As a local resident, I can confirm that these clusters make perfect sense and represent the desired urban functionality mix quite well despite the simple methodology. Of course, this is a quick pilot that can be enriched and touched up in several ways, such as:

- Relying on more detailed POI categorization and selection
- Considering the POI categories when doing the clustering (semantic clustering)
- Enriching the POI information with, e.g., social media reviews and ratings

Identifying Topical Hot Spots in Urban Areas was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
