Claude plays GeoGuessr
Jerry Wei
January 06, 2025.
Constantly seeking new ways to evaluate and benchmark our AI models helps us understand their current capabilities and drives future development and improvements. In this spirit, I recently examined a fun task: How well can Claude play the popular web game GeoGuessr?
What is GeoGuessr?
For those who are unfamiliar, GeoGuessr is an online game that challenges players to guess the location of a randomly-selected Google Street View image. Players can pan the camera and navigate along roads to gather more context before placing their guess on a world map. The closer the guess is to the actual location, the higher the score.
Traditionally, GeoGuessr is a test of geographical knowledge, visual perception, and deductive reasoning. But could an AI excel at this task? I decided to find out.
Experimental setup
Preface: code for these experiments is publicly-available on my GitHub.
While the full GeoGuessr game allows for camera movement and map-based guessing, I simplified the task for my experiments. I presented Claude with a single static Google Street View image and asked it to directly output its guess as a latitude–longitude coordinate pair. I used imagery from the OpenStreetView-5M dataset, which contains 5.1 million images spanning 225 countries; each image is annotated with latitude–longitude coordinates representing its geolocation. For these experiments, I used the test set containing 210,122 image–coordinate pairs.
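To make the setup concrete, here's a rough sketch of how the test split might be loaded. The file layout and column names (`test.csv`, `id`, `latitude`, `longitude`) are assumptions about how the OSV-5M release is organized, not a description of my actual loading code; check the dataset card for the exact schema.

```python
import csv
from pathlib import Path


def load_test_split(root: str) -> list[dict]:
    """Load (image path, ground-truth lat/lon) records for the OSV-5M test split.

    Assumes a layout like <root>/test.csv plus <root>/images/test/<id>.jpg;
    see the dataset card for the exact column names and directory structure.
    """
    records = []
    with open(Path(root) / "test.csv", newline="") as f:
        for row in csv.DictReader(f):
            records.append(
                {
                    "image_path": Path(root) / "images" / "test" / f"{row['id']}.jpg",
                    "latitude": float(row["latitude"]),
                    "longitude": float(row["longitude"]),
                }
            )
    return records
```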
I briefly iterated on prompts using the Anthropic workbench and ended up with the following prompt, which instructs Claude to first analyze the image using chain-of-thought reasoning and then produce a predicted latitude–longitude pair.
The assistant is playing GeoGuessr, a game where an image of a random Google Street View location is shown and the player has to guess the location of the image on a world map.
In the following conversation, the assistant will be shown a single image and must make its best guess of the location of the image by providing a latitude and longitude coordinate pair.
Human: [INSERT IMAGE] Here is an image from the Geoguessr game.
* Please reason about where you think this image is in <thinking> tags.
* Next, provide your final answer of your predicted latitude and longitude coordinates in <latitude> and <longitude> tags.
* The latitude and longitude coordinates that you give me should just be the `float` numbers; do not provide any thing else.
* You will NOT be penalized on the length of your reasoning, so feel free to think as long as you want.
Assistant: Certainly! I'll analyze the image in <thinking> tags and then provide my reasoning and final estimate of the latitude and longitude.
<thinking>
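For reference, here's a rough sketch of what a single query could look like with the Anthropic Python SDK. The model ID, image handling, and tag parsing are illustrative; the full evaluation code is in the GitHub repository linked above.

```python
import base64
import re

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SYSTEM_PROMPT = (
    "The assistant is playing GeoGuessr, a game where an image of a random "
    "Google Street View location is shown and the player has to guess the "
    "location of the image on a world map."
)

USER_PROMPT = (
    "Here is an image from the Geoguessr game.\n"
    "* Please reason about where you think this image is in <thinking> tags.\n"
    "* Next, provide your final answer of your predicted latitude and longitude "
    "coordinates in <latitude> and <longitude> tags."
)

ASSISTANT_PREFILL = (
    "Certainly! I'll analyze the image in <thinking> tags and then provide my "
    "reasoning and final estimate of the latitude and longitude.\n<thinking>"
)


def guess_location(image_path: str, model: str = "claude-3-5-sonnet-20241022"):
    """Send one Street View image to Claude and parse its lat/lon guess."""
    with open(image_path, "rb") as f:
        image_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

    response = client.messages.create(
        model=model,
        max_tokens=1024,
        system=SYSTEM_PROMPT,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/jpeg",
                            "data": image_b64,
                        },
                    },
                    {"type": "text", "text": USER_PROMPT},
                ],
            },
            # Prefilled assistant turn, mirroring the prompt above.
            {"role": "assistant", "content": ASSISTANT_PREFILL},
        ],
    )
    completion = response.content[0].text
    lat = float(re.search(r"<latitude>(.*?)</latitude>", completion, re.S).group(1))
    lon = float(re.search(r"<longitude>(.*?)</longitude>", completion, re.S).group(1))
    return lat, lon
```

The prefilled assistant turn mirrors the "Certainly! …" line in the prompt above, nudging the model directly into its <thinking> block.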
Scoring metric
To evaluate Claude's performance, I used the scoring function from GeoGuessr's in-game metric:
score = 5000 * exp(-distance / 1492.7)
Here, distance is the Haversine distance (in kilometers) between Claude's guessed coordinates and the ground-truth location. Because the score decays exponentially with distance, a given error in kilometers costs far more points when the guess is close to the target than when it is far away: for example, being 250 km off instead of 200 km off lowers the score much more than being 1,050 km off instead of 1,000 km off.
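For concreteness, here's a short, self-contained implementation of the distance and scoring computation (a sketch; the evaluation code on GitHub is the authoritative version):

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def geoguessr_score(guess: tuple[float, float], truth: tuple[float, float]) -> float:
    """5,000 points for a perfect guess, decaying exponentially with distance."""
    distance = haversine_km(*guess, *truth)
    return 5000 * math.exp(-distance / 1492.7)


# Example: guessing Paris when the answer is Brussels (~264 km apart)
# scores roughly 4,190 points.
```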
Models used
I used the following models from the Claude-Haiku* and Claude-Sonnet families. I didn't test Claude-Opus models because of the large inference cost.
* I used an internal version of Claude-3.5-Haiku with multimodality enabled; as of this blog post's publication, that model is not yet available to the public.
Results
Claude outperforms the average human player
The GeoGuessr world map, which has more than one million locations and more than 148M players, shows an average score of 10,548 across 5 images, yielding an average per-image score of 2,109.4. Additionally, this Reddit post analyzing player statistics on the r/geoguessr subreddit shows that the best player averaged 22,897 across 5 images over 812 games, yielding an average per-image score of 4,579.4. I used these two per-image averages as baselines for the average human player and a top human player, respectively, as I wasn't able to find more-extensive statistics.
The top-performing Claude-3.5-Sonnet model achieved an average score of 2,268.97, surpassing the estimated human average of 2,109.4. However, it still fell short of expert-human performance (4,579.4). All tested models obtained a perfect score of 5,000 on at least one image, demonstrating the potential for precise localization. Note that the error bars in the bar plot are tiny because the evaluation set is so large.
Claude has reasonable chains of thought
I picked several images for which Claude scored highly and examined its reasoning chain. Qualitatively, Claude demonstrated an impressive ability to pick up on subtle geographic cues (vegetation, architecture, vehicles, signage) and synthesize them into coherent localization reasoning.
Country-level predictions
Examining country-level predictions from Claude-3.5-Sonnet (October) revealed a strong bias towards guessing a location in the United States. This likely reflects the geographic distribution of the training data and real-world Street View coverage. Future work could explore calibration techniques to mitigate this bias, though it's unclear what the true distribution of Google Street View images is. The following heatmap is on a log-10 scale because the United States was extremely overrepresented.
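One low-effort way to reproduce this kind of country-level breakdown is to map each predicted coordinate pair to a country code offline and count predictions per country. The sketch below assumes the third-party reverse_geocoder package; it is not necessarily what my analysis code uses.

```python
import math
from collections import Counter

import reverse_geocoder as rg  # offline (lat, lon) -> nearest-place lookup


def country_counts(predictions: list[tuple[float, float]]) -> Counter:
    """Count how often each ISO country code appears among predicted coordinates."""
    results = rg.search(predictions)  # batched lookup; each hit includes a 'cc' field
    return Counter(hit["cc"] for hit in results)


def log10_counts(counts: Counter) -> dict[str, float]:
    """Log-10 counts, matching the log-10 scale of the heatmap above."""
    return {cc: math.log10(n) for cc, n in counts.items()}
```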
Score distributions
The score distributions for all models were heavily skewed, with the bulk of guesses receiving scores at or near 0 (completely incorrect guesses). This mirrors typical human performance, where most attempts are wild misses with occasional lucky hits.
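As a rough illustration of the kind of summary this involves, the sketch below computes a few skew-revealing statistics with NumPy; the thresholds are arbitrary choices, not the ones used for the plots.

```python
import numpy as np


def summarize_scores(scores: list[float]) -> dict[str, float]:
    """Summary statistics for a heavily right-skewed GeoGuessr score distribution."""
    s = np.asarray(scores, dtype=float)
    return {
        "mean": float(s.mean()),
        "median": float(np.median(s)),
        "frac_below_100": float((s < 100).mean()),    # wild misses
        "frac_perfect": float((s >= 4999.5).mean()),  # essentially exact guesses
    }
```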
Limitations and future directions
These preliminary experiments offer a tantalizing glimpse into Claude's geolocation capabilities, but this simplified setup has significant limitations:
Static imagery doesn't capture the full richness of the GeoGuessr game, which allows for dynamic exploration; if anything, this restriction puts Claude at a disadvantage compared with human players, who can move around to gather more context.
Direct coordinate output differs from the native map-based interface.
The prompt was only minimally tuned and could almost certainly be optimized further.
I didn't test larger models because of the amount of compute required.
Future investigations could focus on:
Developing an interactive computer-use environment for Claude to play full rounds of GeoGuessr.
Best-of-N sampling, averaging coordinates across N samples, or other inference-time strategies to improve performance (see the sketch after this list).
Prompt engineering to elicit more-consistent and more-grounded reasoning.
Evaluating the performance of larger models like Claude-3-Opus.
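On the best-of-N point in particular: naively averaging latitude and longitude values breaks down near the antimeridian, so any coordinate averaging would probably want to happen on the sphere. A minimal sketch of that idea (purely illustrative; not something I've run in these experiments):

```python
import math


def average_guesses(guesses: list[tuple[float, float]]) -> tuple[float, float]:
    """Average N lat/lon guesses by averaging their unit vectors on the sphere.

    This avoids the wrap-around problem of averaging longitudes directly
    (e.g., the mean of 179 and -179 degrees should be 180, not 0).
    """
    x = y = z = 0.0
    for lat, lon in guesses:
        phi, lmb = math.radians(lat), math.radians(lon)
        x += math.cos(phi) * math.cos(lmb)
        y += math.cos(phi) * math.sin(lmb)
        z += math.sin(phi)
    n = len(guesses)
    x, y, z = x / n, y / n, z / n
    lat_deg = math.degrees(math.atan2(z, math.hypot(x, y)))
    lon_deg = math.degrees(math.atan2(y, x))
    return lat_deg, lon_deg
```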
Conclusions
GeoGuessr serves as a compelling test bed for the geospatial-reasoning capabilities of large language models. By iterating on this task, we can push the boundaries of what AI can achieve in geographic localization and, more broadly, in synthesizing knowledge across modalities to solve complex problems.
While Claude's GeoGuessr abilities are already impressive, it's clear there's still significant room for growth. As we refine our models and training pipelines, I expect to see continued progress on this task and many others. The journey to artificial general intelligence is long, but each small step brings us closer to a future where AI can master not just trivia and language, but every task humans can perform.
BibTeX
If you’d like to cite this post, you can use the following BibTeX entry:
@misc{jerry2025claude,
  author = {Jerry Wei},
  title = {Claude plays GeoGuessr},
  date = {2025-01-06},
  year = {2025},
  url = {https://www.jerrywei.net/blog/claude-plays-geoguessr},
}
Acknowledgements
Thanks to Drake Thomas, Zoe Blumenfeld, Kevin Garcia, Stuart Ritchie, and Jan Leike for helpful feedback on this blog post!