Cardille Computational Landscape Ecology Lab
  • Home
  • Research
    • Remote Sensing & Change Detection
    • Geo-AI
    • Aquatic
    • Landscape Ecology
    • Books
  • Team
    • Current lab members
    • Past lab members
    • Invitation To Students
    • Funding
  • Courses
  • Publications
  • Service
  • Contact

Cloud-Based Geospatial Benchmarks

The Cloud-Based Geospatial Benchmark: Challenges and LLM Evaluation

Figure: Agricultural Land Use Analysis in California's Central Valley (GitHub repository linked from the image)
Figure: Calculating Iron Oxide Ratio (IOR) for Hydrothermal Rock Detection (GitHub repository linked from the image)
Figure: Calculating Road Length in Lesotho (GitHub repository linked from the image)
Background

Large Language Models (LLMs) are increasingly used to support scientific research, including data analysis, coding, and interpretation. In Earth Observation and geospatial science, these tools show promise for analyzing satellite imagery and large environmental datasets. However, their real-world reliability remains difficult to assess. Many existing evaluations focus on abstract tasks or intermediate code quality rather than whether a model can correctly solve realistic scientific problems end to end. This creates a credibility gap: without rigorous, domain-specific benchmarks, it is hard to know when LLM-generated results can be trusted for environmental analysis, decision-making, or education. The lab developed this work to address the need for transparent, reproducible evaluation of LLMs on practical geospatial tasks.

Approach

We created the Cloud-Based Geospatial Benchmark (CBGB), a curated set of 45 real-world challenges drawn from geography and environmental science. Each challenge asks an LLM to generate code that produces a single, unambiguous numerical answer, such as an area, average value, or rate of change derived from satellite data. The problems span three difficulty levels—Easy, Intermediate, and Difficult—and reflect the kinds of analyses routinely performed in remote sensing workflows. While the tasks can be solved on different platforms, they are well suited to cloud-based systems with large geospatial data catalogs. We evaluated leading LLMs both in a single-pass setting and in an iterative setting where models could correct their code after receiving execution errors, mirroring how analysts actually work.
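To make the structure concrete, the sketch below shows one way a CBGB-style challenge could be represented and graded: a task prompt, a single expected numerical answer, and a tolerance check. The challenge text, expected value, and tolerance here are illustrative inventions, not items from the actual benchmark.

```python
# Hypothetical sketch of grading a CBGB-style challenge.
# All specific values below are invented for illustration.
from dataclasses import dataclass

@dataclass
class Challenge:
    prompt: str           # task description given to the LLM
    expected: float       # the single, unambiguous numerical answer
    rel_tolerance: float  # allowed relative error when grading

def grade(challenge: Challenge, model_answer: float) -> bool:
    """Mark a model's numeric answer correct if it is within tolerance."""
    return abs(model_answer - challenge.expected) <= (
        challenge.rel_tolerance * abs(challenge.expected)
    )

# Illustrative use: a made-up area-estimation task.
task = Challenge(
    prompt="Compute the cropland area (km^2) in region X for 2020.",
    expected=1234.5,
    rel_tolerance=0.01,
)
print(grade(task, 1240.0))  # within 1% of the expected value → True
```

Grading on a single numeric output sidesteps judging intermediate code quality: any correct workflow, on any platform, earns credit.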

Key Findings
  • Iterative feedback substantially improves performance: Models that could see and correct execution errors consistently outperformed those operating without feedback. This highlights the importance of interactive, loop-based evaluation rather than one-shot testing.
  • Performance scales with problem difficulty: Most models solved Easy problems reliably, results diverged on Intermediate problems, and only the strongest models succeeded on a subset of Difficult problems. This confirms that the benchmark is not saturated and remains challenging.
  • Reasoning-oriented models perform best overall: Models designed to explicitly reason through problems achieved higher accuracy, with the top-performing agents reaching 71% accuracy across all challenges.
​
Impact

This work provides a practical framework for evaluating LLMs in geospatial science using realistic, reproducible tasks. By focusing on end-to-end problem solving with clear numerical answers, CBGB helps close the credibility gap in Geo-AI. The benchmark supports more informed adoption of LLMs in research, education, and environmental decision-making, and offers a foundation for improving future AI systems that interact with complex Earth Observation data.

Resources

Published Paper: Cardille JA, Johnston R, Ilyushchenko S, Kartiwa J, Shamsi Z, Abraham M, Azad K, Ahmed K, Bergeron Quick E, Caughie N, Jencz N, Dyson K, Puzzi Nicolau A, Lopez-Ornelas MF, Saah D, Brenner M, Venugopalan S, Ponda SS. The Cloud-Based Geospatial Benchmark: challenges and LLM evaluation. In: TerraBytes-ICML 2025 Workshop. 2025. Downloadable PDF (OpenReview): https://openreview.net/pdf?id=oaYShIy3Xe.

Source Code Repository:
• CBGB Benchmark (Earth Engine Community GitHub): the official repository containing the benchmark definitions, example solutions, and supporting scripts, hosted within Google's Earth Engine Community organization on GitHub:
https://github.com/google/earthengine-community/tree/master/experimental/cbgb_benchmark

Back to Geo-AI Overview

Back to Research