The Cloud-Based Geospatial Benchmark: Challenges and LLM Evaluation
Background
Large Language Models (LLMs) are increasingly used to support scientific research, including data analysis, coding, and interpretation. In Earth Observation and geospatial science, these tools show promise for analyzing satellite imagery and large environmental datasets. However, their real-world reliability remains difficult to assess. Many existing evaluations focus on abstract tasks or intermediate code quality rather than on whether a model can correctly solve realistic scientific problems end to end. This creates a credibility gap: without rigorous, domain-specific benchmarks, it is hard to know when LLM-generated results can be trusted for environmental analysis, decision-making, or education. The lab developed this work to address the need for transparent, reproducible evaluation of LLMs on practical geospatial tasks.

Approach
We created the Cloud-Based Geospatial Benchmark (CBGB), a curated set of 45 real-world challenges drawn from geography and environmental science. Each challenge asks an LLM to generate code that produces a single, unambiguous numerical answer, such as an area, an average value, or a rate of change derived from satellite data. The problems span three difficulty levels (Easy, Intermediate, and Difficult) and reflect the kinds of analyses routinely performed in remote sensing workflows. While the tasks can be solved on different platforms, they are well suited to cloud-based systems with large geospatial data catalogs. We evaluated leading LLMs both in a single-pass setting and in an iterative setting where models could correct their code after receiving execution errors, mirroring how analysts actually work; a minimal sketch of both the task format and the iterative loop appears below.
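To make the task format concrete, here is a hedged sketch of what a CBGB-style solution might look like in the Earth Engine Python API. The dataset, region, and statistic are illustrative assumptions, not an actual benchmark item; the essential property is that the script prints one unambiguous number.

```python
import ee

ee.Initialize()  # assumes an authenticated Earth Engine session

# Illustrative Easy-style task (not from CBGB): mean 2020 MODIS NDVI
# over a small rectangle near San Francisco, reported as one number.
region = ee.Geometry.Rectangle([-122.6, 37.6, -122.3, 37.9])
ndvi_2020 = (
    ee.ImageCollection('MODIS/061/MOD13A2')
    .filterDate('2020-01-01', '2021-01-01')
    .select('NDVI')
    .mean()
    .multiply(0.0001)  # apply the MODIS NDVI scale factor
)
stats = ndvi_2020.reduceRegion(
    reducer=ee.Reducer.mean(),
    geometry=region,
    scale=1000,
)
print(stats.get('NDVI').getInfo())  # the single numerical answer
```

The two evaluation settings can be sketched as a scoring loop like the one below. Everything here is an assumption about how such a harness could work: the names run_candidate and score, the three-round limit, and the relative tolerance are all hypothetical, and the published paper defines the actual protocol.

```python
import subprocess
import tempfile

def run_candidate(code: str) -> tuple[bool, str]:
    """Execute model-generated code in a fresh subprocess."""
    with tempfile.NamedTemporaryFile('w', suffix='.py', delete=False) as f:
        f.write(code)
        path = f.name
    # Note: subprocess.run raises TimeoutExpired if the limit is exceeded;
    # a production harness would catch that and treat it as a failure.
    proc = subprocess.run(
        ['python', path], capture_output=True, text=True, timeout=600
    )
    return proc.returncode == 0, proc.stdout + proc.stderr

def score(prompt: str, expected: float, generate_code, rounds: int = 3,
          rel_tol: float = 0.01) -> bool:
    """Single-pass scoring is rounds=1; iterative mode feeds errors back."""
    feedback = prompt
    for _ in range(rounds):
        ok, output = run_candidate(generate_code(feedback))
        if ok:
            # Read the last printed line as the candidate's numeric answer.
            try:
                answer = float(output.strip().splitlines()[-1])
            except (ValueError, IndexError):
                answer = float('nan')
            # Crude relative tolerance; the real check may differ.
            return abs(answer - expected) <= rel_tol * max(abs(expected), 1.0)
        # Iterative setting: return the execution error to the model.
        feedback = f"{prompt}\n\nYour previous code failed:\n{output}\nFix it."
    return False
```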
Key Findings

Impact
This work provides a practical framework for evaluating LLMs in geospatial science using realistic, reproducible tasks. By focusing on end-to-end problem solving with clear numerical answers, CBGB helps close the credibility gap in Geo-AI. The benchmark supports more informed adoption of LLMs in research, education, and environmental decision-making, and offers a foundation for improving future AI systems that interact with complex Earth Observation data.
Resources
Published Paper: Cardille JA, Johnston R, Ilyushchenko S, Kartiwa J, Shamsi Z, Abraham M, Azad K, Ahmed K, Bergeron Quick E, Caughie N, Jencz N, Dyson K, Puzzi Nicolau A, Lopez-Ornelas MF, Saah D, Brenner M, Venugopalan S, Ponda SS. The Cloud-Based Geospatial Benchmark: Challenges and LLM Evaluation. In: TerraBytes-ICML 2025 Workshop; 2025. Downloadable PDF (OpenReview): https://openreview.net/pdf?id=oaYShIy3Xe.
Source Code Repository:
• CBGB Benchmark (Earth Engine Community GitHub) — Official repository containing the benchmark definitions, example solutions, and supporting scripts hosted within Google’s Earth Engine Community organization on GitHub:
https://github.com/google/earthengine-community/tree/master/experimental/cbgb_benchmark