Cloud-Based Geospatial Benchmark (CBGB)
|
The Cloud-Based Geospatial Benchmark: Challenges and LLM Evaluation
How well can Large Language Models (LLMs) write code to solve geospatial problems? The Cloud-Based Geospatial Benchmark (CBGB) is a new set of 45 challenges designed to rigorously evaluate the capability of LLMs and their agents to generate code for complex Earth Observation (EO) tasks, particularly those involving the analysis of satellite imagery and geospatial data.
The core purpose of CBGB is to measure how effectively LLMs can produce code that yields short, numerical answers to a variety of geospatial scenarios in geography and environmental science, tasks that are often readily solved using the extensive data catalogs and powerful APIs of platforms like Google Earth Engine. A key finding is that models equipped with an error-correction feedback loop consistently perform better, mirroring the iterative nature of real-world geospatial analysis and underscoring the benchmark's non-saturated difficulty, especially on complex problems. The benchmark is distinctive because its challenges, curated by domain experts and categorized by difficulty, focus on end-to-end problem-solving that results in a single, unambiguous numerical output, rather than merely evaluating intermediate code artifacts. |
AI-Accelerated Scientific Discovery
|
An AI system to help scientists write expert-level empirical software
This paper details an AI system that combines a Large Language Model (LLM) with Tree Search (TS) to help scientists write expert-level empirical software. The core idea is to frame software development as a "scorable task," in which the system iteratively generates and tests code to maximize a specified quality metric. The system demonstrated superhuman performance across diverse scientific domains, including discovering 40 novel methods for single-cell data analysis that outperformed top human-developed algorithms, and generating COVID-19 forecasting models that outperformed the robust CDC ensemble.
Key mechanisms behind this success include the system's ability to tirelessly explore a vast solution space, generate novel recombinations of existing ideas, and integrate complex research concepts from external literature, significantly accelerating the scientific discovery cycle. |
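The "scorable task" framing above can be illustrated with a toy best-first tree search. This is a simplified sketch, not the paper's system: `mutate` and `score` are hypothetical stand-ins for LLM-proposed code revisions and the task's quality metric, and candidates are plain lists of integers rather than programs.

```python
import heapq

def score(candidate: list[int]) -> float:
    # Stand-in quality metric: reward candidates whose sum nears 10.
    # In the real system this would be, e.g., forecast accuracy.
    return -abs(sum(candidate) - 10)

def mutate(candidate: list[int]) -> list[list[int]]:
    # Stand-in for LLM-generated revisions of a candidate program.
    return [candidate + [1], candidate + [2], candidate + [3]]

def tree_search(root: list[int], budget: int = 50) -> list[int]:
    # Best-first search: always expand the highest-scoring node,
    # keeping the best candidate seen within the evaluation budget.
    frontier = [(-score(root), root)]
    best = root
    for _ in range(budget):
        if not frontier:
            break
        _, node = heapq.heappop(frontier)
        if score(node) > score(best):
            best = node
        for child in mutate(node):
            heapq.heappush(frontier, (-score(child), child))
    return best
```

The same generate-score-expand loop, with an LLM as the mutation operator and a scientific evaluation metric as the score, is what lets such a system explore recombinations of ideas far beyond what a human could test by hand.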