ForecastBench: A Dynamic Benchmark of AI Forecasting Capabilities

    Published 12/3/2024 by Ezra Karger, Houtan Bastani, Chen Yueh-Han, Zachary Jacobs, Danny Halawi, Fred Zhang, Philip E. Tetlock

    Overview

    • Forecasts of future events are essential for making informed decisions.
    • Machine learning (ML) systems have the potential to generate forecasts at scale.
    • However, there is no standard way to evaluate the accuracy of ML forecasting systems.

    Figure 4: An example market-based question from the human survey.

    Source               URL                            Number of Questions  Combinations
    RFI                  randforecastinginitiative.org                   15           105
    Manifold Markets     manifold.markets                               392        76,636
    Metaculus            metaculus.com                                  800       319,600
    Polymarket           polymarket.com                                 534       142,311
    Market Total                                                      1,741       538,652
    ACLED                acleddata.com                                3,150     4,959,675
    DBnomics             db.nomics.world                                 52         1,326
    FRED                 fred.stlouisfed.org                            166        13,695
    Wikipedia            wikipedia.org                                  335        55,945
    Yahoo! Finance       finance.yahoo.com                              504       126,756
    Dataset Total                                                     4,207     5,157,397
    Question Bank Total                                               5,948     5,696,049

    Table 1: Question bank composition, grouped by source type (market or dataset).
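
    The Combinations column in Table 1 matches the number of unordered pairs that can be formed from each source's count, n(n-1)/2, and the total rows are sums of the per-source rows. The short check below is only a sketch to make that arithmetic explicit; the variable names are illustrative, and the figures are copied from the table.

```python
from math import comb

# Per-source counts and combinations, copied from Table 1.
counts = {
    "RFI": 15, "Manifold Markets": 392, "Metaculus": 800, "Polymarket": 534,
    "ACLED": 3150, "DBnomics": 52, "FRED": 166, "Wikipedia": 335,
    "Yahoo! Finance": 504,
}
combinations = {
    "RFI": 105, "Manifold Markets": 76_636, "Metaculus": 319_600,
    "Polymarket": 142_311, "ACLED": 4_959_675, "DBnomics": 1_326,
    "FRED": 13_695, "Wikipedia": 55_945, "Yahoo! Finance": 126_756,
}

# A source with n items yields comb(n, 2) = n * (n - 1) / 2 unordered pairs.
mismatches = [
    name for name, n in counts.items() if comb(n, 2) != combinations[name]
]

total_count = sum(counts.values())
total_combinations = sum(combinations.values())

print("rows that disagree with n choose 2:", mismatches)
print("totals:", total_count, total_combinations)
```

    Note that the total rows add up the per-source pair counts; they are not n choose 2 of the total count, so pairs that span two different sources are not included in these figures.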

    Plain English Explanation

    ForecastBench is a new benchmark that aims to address this gap. It is a dynamic test that automatically generates and regularly updates a set of 1,000 questions about future events whose answers are not yet known when forecasts are submitted. Because the outcomes have not happened yet, they cannot appear in a model's training data, so there is no risk of data leakage artificially inflating a system's performance.

    The researchers compared the forecasts of expert human forecasters, the general public, and large language models (LLMs) on a random subset of 200 questions from the benchmark. Although LLMs have shown super-human performance on many other tasks, they fell short here: the expert human forecasters outperformed the top-performing LLM by a statistically significant margin (p-value = 0.01).

    The researchers make the results publicly available on a leaderboard at www.forecastbench.org. This allows researchers to track the progress of AI systems in this important area of forecasting future events.

    Key Findings

    • ForecastBench is a new dynamic benchmark for evaluating the forecasting capabilities of machine learning systems.
    • It consists of 1,000 questions about future events with no known answers at the time of submission.
    • Expert human forecasters outperformed the top-performing large language model in a statistically significant way.

    Technical Explanation

    The researchers developed ForecastBench to address the lack of a standardized way to evaluate the forecasting capabilities of machine learning systems. ForecastBench automatically generates and regularly updates a set of 1,000 questions about future events. These questions have no known answers at the time of submission, ensuring there is no risk of data leakage that could artificially inflate a system's performance.
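
    The leakage guard described above can be sketched as a filter followed by a random draw: keep only questions whose outcomes are still unknown on the submission date, then sample the question set from what remains. This is a minimal illustration, not the benchmark's actual pipeline; the `Question` record, its field names, and `select_open_questions` are hypothetical.

```python
import random
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Question:
    """Hypothetical record; the field names are assumptions for this sketch."""
    question_id: str
    resolution_date: date  # day the outcome becomes known


def select_open_questions(bank, submission_date, k, seed=0):
    """Randomly pick up to k questions that resolve after submission_date."""
    # Keep only questions whose outcome is still unknown when forecasts are due.
    unresolved = [q for q in bank if q.resolution_date > submission_date]
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return rng.sample(unresolved, min(k, len(unresolved)))


# Toy bank (made up for illustration): two questions already resolved, three open.
bank = [
    Question("q1", date(2024, 1, 15)),
    Question("q2", date(2024, 6, 1)),
    Question("q3", date(2025, 1, 1)),
    Question("q4", date(2025, 3, 31)),
    Question("q5", date(2025, 12, 31)),
]
submission_date = date(2024, 7, 1)
question_set = select_open_questions(bank, submission_date, k=2)
print([q.question_id for q in question_set])
```

    Because every selected question resolves after the submission date, no answer can already be present in a model's training data at the time the forecast is made.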

    To quantify the capabilities of current ML systems, the researchers collected forecasts from expert human forecasters, the general public, and large language models (LLMs) on a random subset of 200 questions from the benchmark. The results showed that while LLMs have achieved super-human performance on many benchmarks, they performed less well on this forecasting task. Expert human forecasters outperformed the top-performing LLM in a statistically significant way (p-value = 0.01).
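
    This summary does not say how accuracy was scored or which test produced the p-value. As a minimal sketch of how such a comparison could be run, the code below assumes each forecast is a probability scored with the Brier score (squared error against the 0/1 outcome, lower is better) and that two forecasters are compared with a paired sign-flip permutation test. Both choices, the function names, and the toy numbers are assumptions for illustration and are not taken from the paper.

```python
import random


def brier_score(forecast, outcome):
    """Squared error between a probability forecast and a 0/1 outcome."""
    return (forecast - outcome) ** 2


def paired_permutation_p_value(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation test on paired per-question scores."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the two labels are exchangeable per question,
        # so each paired difference is equally likely to have either sign.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / len(diffs)) >= observed:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return (extreme + 1) / (n_resamples + 1)


# Synthetic outcomes and forecasts, made up purely to exercise the functions.
outcomes = [1, 0, 1, 1, 0, 0, 1, 0]
forecasts_a = [0.9, 0.2, 0.8, 0.7, 0.1, 0.3, 0.6, 0.2]
forecasts_b = [0.6, 0.4, 0.5, 0.6, 0.4, 0.5, 0.5, 0.4]

scores_a = [brier_score(f, o) for f, o in zip(forecasts_a, outcomes)]
scores_b = [brier_score(f, o) for f, o in zip(forecasts_b, outcomes)]
p_value = paired_permutation_p_value(scores_a, scores_b)
print("mean Brier A:", sum(scores_a) / len(scores_a))
print("mean Brier B:", sum(scores_b) / len(scores_b))
print("p-value:", p_value)
```

    Pairing by question matters because some questions are much harder than others; comparing per-question differences removes that shared difficulty from the comparison.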

    Critical Analysis

    The researchers acknowledge that ForecastBench is a first step towards a standardized benchmark for evaluating forecasting capabilities, and that further research is needed to refine and expand the benchmark. Additionally, the sample size of 200 questions used in the initial evaluation is relatively small, and testing on the full set of 1,000 questions could provide more robust and generalizable results.

    It would also be valuable to explore the specific factors that contribute to the superior performance of expert human forecasters compared to LLMs. Understanding the strengths and weaknesses of each approach could help inform the development of more accurate and reliable forecasting systems in the future.

    Conclusion

    ForecastBench represents an important step towards developing a standardized way to evaluate the forecasting capabilities of machine learning systems. The finding that expert human forecasters outperformed the top-performing LLM suggests that there is still room for improvement in the forecasting abilities of AI systems. Continued research and development in this area could lead to significant advancements in the field of forecasting, with important implications for decision-making and planning across a wide range of domains.


    Read original: arXiv:2409.19839



    This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


    Related Papers

    Can Language Models Use Forecasting Strategies?

    Sarah Pratt, Seth Blumberg, Pietro Kreitlon Carolino, Meredith Ringel Morris

    Advances in deep learning systems have allowed large models to match or surpass human accuracy on a number of skills such as image classification, basic programming, and standardized test taking. As the performance of the most capable models begins to saturate on tasks where humans already achieve high accuracy, it becomes necessary to benchmark models on increasingly complex abilities. One such task is forecasting the future outcome of events. In this work we describe experiments using a novel dataset of real world events and associated human predictions, an evaluation metric to measure forecasting ability, and the accuracy of a number of different LLM based forecasting designs on the provided dataset. Additionally, we analyze the performance of the LLM forecasters against human predictions and find that models still struggle to make accurate predictions about the future. Our follow-up experiments indicate this is likely due to models' tendency to guess that most events are unlikely to occur (which tends to be true for many prediction datasets, but does not reflect actual forecasting abilities). We reflect on next steps for developing a systematic and reliable approach to studying LLM forecasting.

    6/10/2024

    Can time series forecasting be automated? A benchmark and analysis

    Anvitha Thirthapura Sreedhara, Joaquin Vanschoren

    In the field of machine learning and artificial intelligence, time series forecasting plays a pivotal role across various domains such as finance, healthcare, and weather. However, the task of selecting the most suitable forecasting method for a given dataset is a complex task due to the diversity of data patterns and characteristics. This research aims to address this challenge by proposing a comprehensive benchmark for evaluating and ranking time series forecasting methods across a wide range of datasets. This study investigates the comparative performance of many methods from two prominent time series forecasting frameworks, AutoGluon-Timeseries, and sktime to shed light on their applicability in different real-world scenarios. This research contributes to the field of time series forecasting by providing a robust benchmarking methodology and facilitating informed decision-making when choosing forecasting methods for achieving optimal prediction.

    7/26/2024

    Reasoning and Tools for Human-Level Forecasting

    Elvis Hsieh, Preston Fu, Jonathan Chen

    Language models (LMs) trained on web-scale datasets are largely successful due to their ability to memorize large amounts of training data, even if only present in a few examples. These capabilities are often desirable in evaluation on tasks such as question answering but raise questions about whether these models can exhibit genuine reasoning or succeed only at mimicking patterns from the training data. This distinction is particularly salient in forecasting tasks, where the answer is not present in the training data, and the model must reason to make logical deductions. We present Reasoning and Tools for Forecasting (RTF), a framework of reasoning-and-acting (ReAct) agents that can dynamically retrieve updated information and run numerical simulation with equipped tools. We evaluate our model with questions from competitive forecasting platforms and demonstrate that our method is competitive with and can outperform human predictions. This suggests that LMs, with the right tools, can indeed think and adapt like humans, offering valuable insights for real-world decision-making.

    11/4/2024

    Humans vs Large Language Models: Judgmental Forecasting in an Era of Advanced AI

    Mahdi Abolghasemi, Odkhishig Ganbold, Kristian Rotaru

    This study investigates the forecasting accuracy of human experts versus Large Language Models (LLMs) in the retail sector, particularly during standard and promotional sales periods. Utilizing a controlled experimental setup with 123 human forecasters and five LLMs, including ChatGPT4, ChatGPT3.5, Bard, Bing, and Llama2, we evaluated forecasting precision through Mean Absolute Percentage Error. Our analysis centered on the effect of the following factors on forecasters' performance: the supporting statistical model (baseline and advanced), whether the product was on promotion, and the nature of external impact. The findings indicate that LLMs do not consistently outperform humans in forecasting accuracy and that advanced statistical forecasting models do not uniformly enhance the performance of either human forecasters or LLMs. Both human and LLM forecasters exhibited increased forecasting errors, particularly during promotional periods and under the influence of positive external impacts. Our findings call for careful consideration when integrating LLMs into practical forecasting processes.

    5/20/2024