An applied project detailing the performance of LLMs for geocoding, in comparison to modern geocoding APIs
Photo by Sylwia Bartyzel on Unsplash
The modern convenience of typing a location into your phone’s search bar and seeing it automatically appear on a map is one that is often taken for granted. But have you ever paused to wonder how this seamless interaction and translation between text and maps actually works? The answer to this question is geocoding.
Esri, a global leader in geospatial software, defines geocoding as “the process of transforming a description of a location — such as a pair of coordinates, an address, or a name of a place — to a location on the earth’s surface.” Geocoding is what empowers the advanced functionalities that are found in navigation apps, mapping services, and geographic information science (GIScience) platforms. Various providers like Esri, Google, Mapbox, and others all provide geocoding APIs that are able to take in location descriptions and return latitude and longitude values, or coordinates, for the given descriptions, allowing us to think spatially about the data.
With the rise of generative AI and Large Language Models (LLMs) like OpenAI’s GPT, Google’s Bard, or Meta’s LLaMA, comes excellent opportunities for making use of the technologies for geospatial applications. The uses range widely, from code generation with GitHub’s Copilot to image segmentation with Meta’s Segment Anything Model (SAM), or even potentially to geocoding.
In this article, the suitability of utilizing “out-of-the-box” generative AI for geocoding unstructured location descriptions will be examined. This evaluation will be performed on a small dataset of vehicle accident data from Minnesota. Through this analysis, the effectiveness of standard LLMs in geocoding tasks and the exploration of one of generative AI’s potential geospatial use cases will be pursued, in comparison to the prevailing conventional approaches.
Understanding Geocoding and its Integration with AI
Modern geocoders consist of two essential components, a reference dataset and a geocoding algorithm. The reference data often contains both explicit and relative descriptions of places attached to a geographic location, meaning that not only are explicit descriptions like addresses tied to a location, but also more unstructured descriptions of places are tied to locations as well. A matching algorithm then may be used to find suitable matches between an input description and the descriptions that are contained within the reference dataset. One simple example of a matching algorithm could be the use of an interpolation algorithm to pinpoint a street address’s location by estimating the position between two known addresses.
The concept of predictive geocoding, employing AI and machine learning to enhance the geocoding process, has a longstanding history. Techniques including Natural Language Processing (NLP) and Deep Learning have been proposed and utilized, yielding varied levels of success. The use of AI and ML in geocoding is not a recent development. However, the emergence of generative AI presents itself as a new frontier for geocoding, much like it has to numerous other domains.
Navigating Challenges & Exploring Future Opportunities
As you may know, LLMs are trained using a vast amount of textual data drawn from the internet, books, journal articles, and various other sources. This often, if not always, lacks comprehensive geospatial information. This lack of geospatial training data in LLMs has implications on the potential and applicability in understanding and solving geospatial challenges. With no foundational domain-specific knowledge, how can we expect a model to perform well on an intricate problem?
The answer is that we simply cannot.
In this analysis, I assess the suitability of LLMs as a standalone benchmark and within the context of workflows using traditional GIScience methods. The outcomes underscore a familiar point — while new technology may be impressive, it doesn’t always lead to enhanced performance when tackling complex challenges.
Case Study: Unstructured Location Descriptions of Car Accidents
Data Collection and Preparation
To test out and quantify the geocoding capabilities of LLMs, a list of 100 unstructured location descriptions of vehicle accidents in Minnesota were randomly selected from a dataset that was scraped from the web. The ground truth coordinates for all 100 accidents were manually created through the use of various mapping applications like Google Maps and the Minnesota Department of Transportation’s Traffic Mapping Application (TMA).
Some sample location descriptions are featured below.
US Hwy 71 at MN Hwy 60 , WINDOM, Cottonwood CountyEB Highway 10 near Joplin St NW, ELK RIVER, Sherburne CountyEB I 90 / HWY 22, FOSTER TWP, Faribault CountyHighway 75 milepost 403, SAINT VINCENT TWP, Kittson County65 Highway / King Road, BRUNSWICK TWP, Kanabec County
As seen in the examples above, there are wide variety of possibilities for how the description can be structured, as well as what defines the location. One example of this is the fourth description, which features a mile marker number, which is far less likely to be matched in any sort of geocoding process, since that information typically isn’t included in any sort of reference data. Finding the ground truth coordinates for descriptions like this one relied heavily on the use of the Minnesota Department of Transportation’s Linear Referencing System (LRS) which provides a standardized approach of how roads are measured through out the State, with which mile markers play a vital role in. This data can be accessed through the TMA application mentioned previously.
Methodology & Geocoding Strategies
After preparing the data, five separate notebooks were set up to test out different geocoding processes. Their configurations are as follows.
1. Google Geocoding API, used on the raw location description
2. Esri Geocoding API, used on the raw location description
3. Google Geocoding API, used on an OpenAI GPT 3.5 standardized location description
4. Esri Geocoding API, used on an OpenAI GPT 3.5 standardized location description
5. OpenAI GPT 3.5, used as a geocoder itself
To summarize, the Google and Esri geocoding APIs were used on both the raw descriptions as well as descriptions that were standardized using a short prompt that was passed into the OpenAI GPT 3.5 model. The Python code for this standardization process can be seen below.
def standardize_location(df, description_series):
df[“ai_location_description”] = df[description_series].apply(_gpt_chat)
return df
def _gpt_chat(input_text):
prompt = “””Standardize the following location description into text
that could be fed into a Geocoding API. When responding, only
return the output text.”””
response = openai.ChatCompletion.create(
model=”gpt-3.5-turbo”,
messages=[
{“role”: “system”, “content”: prompt},
{“role”: “user”, “content”: input_text},
],
temperature=0.7,
n=1,
max_tokens=150,
stop=None,
)
return response.choices[0].message.content.strip().split(“n”)[-1]
The four test cases using geocoding APIs used the code below to make API requests to their respective geocoders and return the resulting coordinates for all 100 descriptions.
# Esri Geocoder
def geocode_esri(df, description_series):
df[“xy”] = df[description_series].apply(
_single_esri_geocode
)
df[“x”] = df[“xy”].apply(
lambda row: row.split(“,”)[0].strip()
)
df[“y”] = df[“xy”].apply(
lambda row: row.split(“,”)[1].strip()
)
df[“x”] = pd.to_numeric(df[“x”], errors=”coerce”)
df[“y”] = pd.to_numeric(df[“y”], errors=”coerce”)
df = df[df[“x”].notna()]
df = df[df[“y”].notna()]
return df
def _single_esri_geocode(input_text):
base_url = “https://geocode-api.arcgis.com/arcgis/rest/services/World/GeocodeServer/findAddressCandidates”
params = {
“f”: “json”,
“singleLine”: input_text,
“maxLocations”: “1”,
“token”: os.environ[“GEOCODE_TOKEN”],
}
response = requests.get(base_url, params=params)
data = response.json()
try:
x = data[“candidates”][0][“location”][“x”]
y = data[“candidates”][0][“location”][“y”]
except:
x = None
y = None
return f”{x}, {y}”# Google Geocoder
def geocode_google(df, description_series):
df[“xy”] = df[description_series].apply(
_single_google_geocode
)
df[“x”] = df[“xy”].apply(
lambda row: row.split(“,”)[0].strip()
)
df[“y”] = df[“xy”].apply(
lambda row: row.split(“,”)[1].strip()
)
df[“x”] = pd.to_numeric(df[“x”], errors=”coerce”)
df[“y”] = pd.to_numeric(df[“y”], errors=”coerce”)
df = df[df[“x”].notna()]
df = df[df[“y”].notna()]
return df
def _single_google_geocode(input_text):
base_url = “https://maps.googleapis.com/maps/api/geocode/json”
params = {
“address”: input_text,
“key”: os.environ[“GOOGLE_MAPS_KEY”],
“bounds”: “43.00,-97.50 49.5,-89.00”,
}
response = requests.get(base_url, params=params)
data = response.json()
try:
x = data[“results”][0][“geometry”][“location”][“lng”]
y = data[“results”][0][“geometry”][“location”][“lat”]
except:
x = None
y = None
return f”{x}, {y}”
Additionally, one final process tested was to use GPT 3.5 as the geocoder itself, without the help of any geocoding API. The code for this process looked nearly identical to the standardization code used above, but featured a different prompt, shown below.
Geocode the following address. Return a latitude (Y) and longitude (X) as accurately as possible. When responding, only return the output text in the following format: X, Y
Performance Metrics and Insights
After the various processes were developed, each process was run and several performance metrics were calculated, both in terms of execution time and geocoding accuracy. These metrics are listed below.
| Geocoding Process | Mean | StdDev | MAE | RMSE |
| ——————- | —— | —— | —— | —— |
| Google with GPT 3.5 | 0.1012 | 1.8537 | 0.3698 | 1.8565 |
| Google with Raw | 0.1047 | 1.1383 | 0.2643 | 1.1431 |
| Esri with GPT 3.5 | 0.0116 | 0.5748 | 0.0736 | 0.5749 |
| Esri with Raw | 0.0001 | 0.0396 | 0.0174 | 0.0396 |
| GPT 3.5 Geocoding | 2.1261 | 80.022 | 45.416 | 80.050 | | Geocoding Process | 75% ET | 90% ET | 95% ET | Run Time |
| ——————- | —— | —— | —— | ——– |
| Google with GPT 3.5 | 0.0683 | 0.3593 | 3.3496 | 1m 59.9s |
| Google with Raw | 0.0849 | 0.4171 | 3.3496 | 0m 23.2s |
| Esri with GPT 3.5 | 0.0364 | 0.0641 | 0.1171 | 2m 22.7s |
| Esri with Raw | 0.0362 | 0.0586 | 0.1171 | 0m 51.0s |
| GPT 3.5 Geocoding | 195.54 | 197.86 | 199.13 | 1m 11.9s |
The metrics are explained in more detail here. Mean represents the mean error (in terms of Manhattan distance, or the total of X and Y difference from the ground truth, in decimal degrees). StdDev represents the standard deviation of error (in terms of Manhattan distance, in decimal degrees). MAE represents the mean absolute error (in terms of Manhattan distance, in decimal degrees). RMSE represents the root mean square error (in terms of Manhattan distance, in decimal degrees). 75%, 90%, 95% ET represents the error threshold for that given percent (in terms of Euclidean distance, in decimal degrees), meaning that for a given percentage, that percentage of records falls within the resulting value’s distance from the ground truth. Lastly, run time simply represents the total time taken to run the geocoding process on 100 records.
Clearly, GPT 3.5 performs far worse on its own. Although, if a couple outliers are taken out of the picture (which were labelled by the model as being located in other continents), for the most part the results of that process don’t look too out of place, visually at least.
It is also interesting to see that the LLM-standardization process actually decreased accuracy, which I personally found a bit surprising, since my whole intention of introducing that component was to hopefully slightly improve the overall accuracy of the geocoding process. It is worth noting that the prompts themselves could have been a part of the problem here, and it is worth further exploring the role of “prompt engineering” in geospatial contexts.
The last main takeaway from this analysis is the execution time differences, with which any process that includes the use of GPT 3.5 performs significantly slower. Esri’s geocoding API is also slower than Google’s in this setting too. Rigorous testing was not performed, however, so these results should be taken with that into consideration.
Concluding Thoughts
Although the “out-of-the-box” geocoding capabilities of OpenAI’s GPT 3.5 model might not match the sophistication of modern geocoders, the testing underscores a potentially promising outlook. The results highlight substantial room for improvement, suggesting that in the foreseeable future, the geospatial capabilities of Large Language Models (LLMs) have plenty of opportunities to improve and eventually make an impact on geocoding as we know it.
There are many specific use cases where LLMs could potentially be sufficient for geocoding, as is. However, as this example shows, there is a discrepancy between LLM’s capabilities and the demands of geocoding tasks requiring a high level of precision at a fine spatial resolution. Therefore, while LLMs hold potential, this example demonstrates the criticality of precision and accuracy for certain applications.
Overall, generative AI presents itself as an exciting innovation with broad and far-reaching implications and opportunities across the landscape of geography and GIS, including for the use of geocoding. Ongoing advancements are continuing to be made at a staggering pace, allowing for continued progress to be made on developing integrations between generative AI and geospatial every day.
References:
[1] Wikipedia contributors, Address geocoding (2023), Wikipedia
[2] A. Hassan, The Future of Geospatial AI (2023), Spatial Data Science
[3] L. Mearian, What are LLMs, and how are they used in generative AI? (2023), ComputerWorld
Acknowledgments:
I would like to extend my appreciation to Dr. Bryan Runck for his invaluable support, guidance, and expertise in contributing to the editing and review of this article.
The Language of Locations: Evaluating Generative AI’s Geocoding Proficiency was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
Rattling nice style and great articles, hardly anything else we need : D.