CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark

    Read original: arXiv:2409.11363 - Published 9/18/2024 by Zachary S. Siegel, Sayash Kapoor, Nitya Nadgir, Benedikt Stroebl, Arvind Narayanan

    Overview

    • Introduces CORE-Bench, a benchmark that measures how accurately AI agents can reproduce the computational results of published research
    • Aims to foster credibility in published research by incentivizing researchers to ensure their work is computationally reproducible
    • Provides a standardized way to assess an AI agent's ability to reproduce the computational experiments described in a research paper

    Plain English Explanation

    CORE-Bench is a tool designed to help improve the credibility of scientific research. Often, when researchers publish their work, it can be difficult for others to reproduce the computational experiments they describe. This can undermine confidence in the findings.

    CORE-Bench addresses this issue by providing a benchmark that evaluates AI agents on their ability to reproduce the results of a study using the code and data provided with it. If an agent can rerun the released materials and arrive at the reported results, that is evidence the paper is computationally reproducible; it does not by itself show that the study's design or data are sound.

    By incentivizing researchers to ensure their work is computationally reproducible, CORE-Bench aims to foster greater trust in published research and encourage more rigorous scientific practices.

    Technical Explanation

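    CORE-Bench (Computational Reproducibility Agent Benchmark) evaluates how accurately AI agents can reproduce the results of published studies using the code and data the authors provide. According to the paper's abstract, it consists of 270 tasks based on 90 scientific papers across three disciplines (computer science, social science, and medicine), spans three difficulty levels, and includes both language-only and vision-language tasks.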

    The key elements of CORE-Bench include:

    1. Paper Selection: Researchers curate a set of high-quality research papers that cover a diverse range of scientific domains and computational techniques.

    2. Computational Experiment Extraction: For each paper, the researchers extract the computational experiments described in the paper, including the data, code, and computational environment required to reproduce the experiments.

    3. Agent Evaluation: AI agents are tasked with attempting to reproduce the computational experiments for each paper. The agents are evaluated on their ability to successfully recreate the experiments, as well as the efficiency and fidelity of their reproduction.

    4. Reproducibility Scoring: CORE-Bench provides a standardized evaluation system that measures how accurately agents reproduce each paper's results. According to the paper's abstract, the evaluation is designed to run in a fast, parallelizable way.

    By providing a standardized benchmark, CORE-Bench aims to incentivize researchers to ensure their work is computationally reproducible, ultimately enhancing the credibility of published research.
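
    The scoring element above can be illustrated with a minimal Python sketch. It is not the authors' evaluation harness. The `Task` class, the "easy"/"hard" level labels, the all-questions-must-match rule, and the 5% numeric tolerance are all assumptions made for illustration; the summary only states that tasks come in three difficulty levels and that agents are scored on accuracy.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical task record: one paper at one difficulty level."""
    paper_id: str
    level: str       # illustrative label, e.g. "easy" or "hard"
    expected: dict   # question -> reference answer taken from the paper


def task_correct(task, reported, rel_tol=0.05):
    """Assumed rule: a task counts only if every question matches.

    Numeric answers are compared within a relative tolerance (an assumption);
    all other answers must match exactly.
    """
    for question, reference in task.expected.items():
        if question not in reported:
            return False
        answer = reported[question]
        if isinstance(reference, (int, float)) and not isinstance(reference, bool):
            if not isinstance(answer, (int, float)):
                return False
            if not math.isclose(answer, reference, rel_tol=rel_tol):
                return False
        elif answer != reference:
            return False
    return True


def accuracy_by_level(tasks, reports):
    """Return the fraction of correctly reproduced tasks per difficulty level.

    `reports` maps (paper_id, level) to the answers an agent reported.
    A missing report is scored as a failed task.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for task in tasks:
        totals[task.level] += 1
        reported = reports.get((task.paper_id, task.level), {})
        if task_correct(task, reported):
            hits[task.level] += 1
    return {level: hits[level] / totals[level] for level in totals}
```

    Reporting accuracy per level rather than as one pooled number keeps results on harder tasks visible instead of averaging them away.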

    Critical Analysis

    The CORE-Bench paper acknowledges several caveats and limitations of the approach:

    • The selection of papers and computational experiments included in the benchmark may not be representative of all scientific domains or computational techniques.
    • The evaluation of AI agents may be influenced by the specific implementation details of the benchmark, which could introduce biases.
    • Computational reproducibility is just one aspect of research credibility, and other factors, such as experimental design and data validity, are not directly addressed by CORE-Bench.

    Additionally, the paper does not discuss potential issues that could arise from the use of CORE-Bench, such as the risk of researchers gaming the system or the challenges of evaluating complex computational workflows.

    Overall, while CORE-Bench represents an important step towards fostering greater credibility in published research, further research and refinement may be needed to address these limitations and ensure the widespread adoption and effectiveness of the benchmark.

    Conclusion

    CORE-Bench is a novel approach to addressing the issue of computational reproducibility in scientific research. By providing a standardized benchmark for evaluating AI agents' ability to reproduce the computational experiments described in published papers, CORE-Bench aims to incentivize researchers to ensure their work is computationally reproducible.

    This, in turn, has the potential to increase the credibility and trustworthiness of published research, which is crucial for advancing scientific knowledge and informing important decisions in fields like healthcare, policy, and technology development. While CORE-Bench has some limitations, it represents a significant step towards creating a more robust and reliable scientific ecosystem.



    This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


    Related Papers

    CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark

    Zachary S. Siegel, Sayash Kapoor, Nitya Nadgir, Benedikt Stroebl, Arvind Narayanan

    AI agents have the potential to aid users on a variety of consequential tasks, including conducting scientific research. To spur the development of useful agents, we need benchmarks that are challenging, but more crucially, directly correspond to real-world tasks of interest. This paper introduces such a benchmark, designed to measure the accuracy of AI agents in tackling a crucial yet surprisingly challenging aspect of scientific research: computational reproducibility. This task, fundamental to the scientific process, involves reproducing the results of a study using the provided code and data. We introduce CORE-Bench (Computational Reproducibility Agent Benchmark), a benchmark consisting of 270 tasks based on 90 scientific papers across three disciplines (computer science, social science, and medicine). Tasks in CORE-Bench consist of three difficulty levels and include both language-only and vision-language tasks. We provide an evaluation system to measure the accuracy of agents in a fast and parallelizable way, saving days of evaluation time for each run compared to a sequential implementation. We evaluated two baseline agents: the general-purpose AutoGPT and a task-specific agent called CORE-Agent. We tested both variants using two underlying language models: GPT-4o and GPT-4o-mini. The best agent achieved an accuracy of 21% on the hardest task, showing the vast scope for improvement in automating routine scientific tasks. Having agents that can reproduce existing work is a necessary step towards building agents that can conduct novel research and could verify and improve the performance of other research agents. We hope that CORE-Bench can improve the state of reproducibility and spur the development of future research agents.

    9/18/2024

    ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery

    Ziru Chen, Shijie Chen, Yuting Ning, Qianheng Zhang, Boshi Wang, Botao Yu, Yifei Li, Zeyi Liao, Chen Wei, Zitong Lu, Vishal Dey, Mingyi Xue, Frazier N. Baker, Benjamin Burns, Daniel Adu-Ampratwum, Xuhui Huang, Xia Ning, Song Gao, Yu Su, Huan Sun

    The advancements of large language models (LLMs) have piqued growing interest in developing LLM-based language agents to automate scientific discovery end-to-end, which has sparked both excitement and skepticism about the true capabilities of such agents. In this work, we argue that for an agent to fully automate scientific discovery, it must be able to complete all essential tasks in the workflow. Thus, we call for rigorous assessment of agents on individual tasks in a scientific workflow before making bold claims on end-to-end automation. To this end, we present ScienceAgentBench, a new benchmark for evaluating language agents for data-driven scientific discovery. To ensure the scientific authenticity and real-world relevance of our benchmark, we extract 102 tasks from 44 peer-reviewed publications in four disciplines and engage nine subject matter experts to validate them. We unify the target output for every task to a self-contained Python program file and employ an array of evaluation metrics to examine the generated programs, execution results, and costs. Each task goes through multiple rounds of manual validation by annotators and subject matter experts to ensure its annotation quality and scientific plausibility. We also propose two effective strategies to mitigate data contamination concerns. Using our benchmark, we evaluate five open-weight and proprietary LLMs, each with three frameworks: direct prompting, OpenHands, and self-debug. Given three attempts for each task, the best-performing agent can only solve 32.4% of the tasks independently and 34.3% with expert-provided knowledge. These results underscore the limited capacities of current language agents in generating code for data-driven discovery, let alone end-to-end automation for scientific research.

    10/8/2024

    AI Agents That Matter

    Sayash Kapoor, Benedikt Stroebl, Zachary S. Siegel, Nitya Nadgir, Arvind Narayanan

    AI agents are an exciting new research direction, and agent development is driven by benchmarks. Our analysis of current agent benchmarks and evaluation practices reveals several shortcomings that hinder their usefulness in real-world applications. First, there is a narrow focus on accuracy without attention to other metrics. As a result, SOTA agents are needlessly complex and costly, and the community has reached mistaken conclusions about the sources of accuracy gains. Our focus on cost in addition to accuracy motivates the new goal of jointly optimizing the two metrics. We design and implement one such optimization, showing its potential to greatly reduce cost while maintaining accuracy. Second, the benchmarking needs of model and downstream developers have been conflated, making it hard to identify which agent would be best suited for a particular application. Third, many agent benchmarks have inadequate holdout sets, and sometimes none at all. This has led to agents that are fragile because they take shortcuts and overfit to the benchmark in various ways. We prescribe a principled framework for avoiding overfitting. Finally, there is a lack of standardization in evaluation practices, leading to a pervasive lack of reproducibility. We hope that the steps we introduce for addressing these shortcomings will spur the development of agents that are useful in the real world and not just accurate on benchmarks.

    7/2/2024

    MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation

    Qian Huang, Jian Vora, Percy Liang, Jure Leskovec

    A central aspect of machine learning research is experimentation, the process of designing and running experiments, analyzing the results, and iterating towards some positive outcome (e.g., improving accuracy). Could agents driven by powerful language models perform machine learning experimentation effectively? To answer this question, we introduce MLAgentBench, a suite of 13 tasks ranging from improving model performance on CIFAR-10 to recent research problems like BabyLM. For each task, an agent can perform actions like reading/writing files, executing code, and inspecting outputs. We then construct an agent that can perform ML experimentation based on ReAct framework. We benchmark agents based on Claude v1.0, Claude v2.1, Claude v3 Opus, GPT-4, GPT-4-turbo, Gemini-Pro, and Mixtral and find that a Claude v3 Opus agent is the best in terms of success rate. It can build compelling ML models over many tasks in MLAgentBench with 37.5% average success rate. Our agents also display highly interpretable plans and actions. However, the success rates vary considerably; they span from 100% on well-established older datasets to as low as 0% on recent Kaggle challenges created potentially after the underlying LM was trained. Finally, we identify several key challenges for LM-based agents such as long-term planning and reducing hallucination. Our code is released at https://github.com/snap-stanford/MLAgentBench.

    4/16/2024