Questionable practices in machine learning

    Published 10/31/2024 by Gavin Leech, Juan J. Vazquez, Niclas Kupper, Misha Yagudin, Laurence Aitchison

    Overview

    • This paper examines questionable practices that can arise in machine learning (ML) research, such as overfitting, publication bias, and misleading evaluations.
    • The authors highlight the importance of addressing these issues to ensure the reliability and integrity of ML-driven science.
    • They draw connections to related work on topics like unraveling overoptimism and publication bias in ML-driven science, lessons for reliable machine learning, and the importance of embracing negative results in ML.

    Original caption: Figure 1: An idealised research process.

    Practice | Section | Description | Stage | Accidental?
    Training Contamination | 3.1.1 | Training on the test set (e.g. in the web corpus) | Training | Plausibly
    Prompt Contamination | 3.1.2 | Just putting test data into the prompt (few-shot) | Evaluation | Plausibly
    RAG Contamination | 3.1.3 | Leaking benchmark data via Retrieval Augmented Generation | Evaluation | Plausibly
    Dirty Paraphrases | 3.1.4 | Rephrasing test data and training on it | Collection | No
    Contamination Laundering | 3.1.5 | Contaminated models generating training data | Collection | Plausibly
    Thieved Test | 3.1.6 | Obtaining private test labels | Collection | No
    User Contamination | 3.1.7 | Post-training on test data in user prompts | Training | Plausibly
    Over-hyping | 3.1.8 | Tuning hyperparameters further after test | Training | Plausibly
    Meta-contamination | 3.1.9 | Reusing contaminated hyperparameters/designs | Training | Plausibly
    Semantic Duplicates | 3.1.10 | Train and test set include near-identical points | Collection | Plausibly
    Baseline Nerfing | 3.2.1 | Optimising training parameters of baselines less | Evaluation | Plausibly
    Baseline Hacking | 3.2.1 | Choosing weak baselines to compare to | Evaluation | No
    Runtime Nerfing | 3.2.2 | Optimising baselines’ inference parameters less | Evaluation | Plausibly
    Runtime Hacking | 3.2.3 | Post-hoc best inference parameters or decoding | Evaluation | No
    Benchmark Hacking | 3.2.4 | Choosing easier benchmarks | Evaluation | Plausibly
    Subset Hacking | 3.2.5 | Subsetting the benchmark until you win | Evaluation | No
    Harness Hacking | 3.2.6 | Choosing evaluation details after test | Evaluation | No
    Golden Seed | 3.2.7 | Training/tuning with many different seeds | Training | No
    Prompt Nerfing | 3.2.8 | Undertuning prompts of baseline models (e.g. fewer few-shot examples) | Evaluation | Plausibly
    Prompt Hacking | 3.2.8 | Choosing the best prompt strategy post-hoc (few-shot examples, system/user prompt, CoT) | Evaluation | Plausibly
    Superfluous Cog | 3.3.1 | Redundant module added to claim novelty | Design | Plausibly
    Whack-a-mole | 3.3.2 | Monitoring for specific failures and fine-tuning them away ad hoc | Training | No
    Benchmark Decoration | 3.3.3 | Pretraining on benchmark / instruction data | Training | Plausibly
    p-hacking | 3.3.4 | When bolding SOTA results, flawed sampling | Reporting | Plausibly
    Point Scores | 3.3.5 | Reporting single run results i.e. no error bars | Reporting | No
    Outright Lies | 3.3.6 | Fabricating results. Included for completeness. | Reporting | No
    Over/Underclaiming | 3.3.7 | Misleading claims about model capabilities | Reporting | No
    Reification | 3.3.8 | General claims from narrow ML benchmarks | Reporting | No
    Nonzero-shot | 3.3.9 | Claiming ‘zero-shot’ while training on examples | Reporting | Plausibly
    Misarithmetic Mean | 3.3.10 | Using arithmetic mean on normalised results | Reporting | Plausibly
    Parameter Smuggling | 3.3.11 | Under-reporting the model size; or substituting-in more embedding parameters | Reporting | No
    File Drawer | 3.3.12 | Failing to report negative benchmark studies | Reporting | No
    Inductive Smuggling | 3.4.2 | Handcrafting inductive bias for a task | Design | No
    Label Noise | 3.4.3 | Using benchmarks known to be error-ridden | Collection | Plausibly

    Original caption: Table 1: Some questionable or fraudulent practices in ML
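
    As an illustration of one entry in Table 1, the "Misarithmetic Mean" practice (3.3.10) can be made concrete with a small sketch. One well-known reason the arithmetic mean is risky on normalised results is that the ranking can depend on which system is used as the baseline, whereas the geometric mean does not have this problem. The scores and the names `scores_a` and `scores_b` below are made up for illustration and do not come from the paper.

```python
import math


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    # exp of the mean log; requires strictly positive values
    return math.exp(sum(math.log(v) for v in values) / len(values))


def normalise(scores, baseline):
    # Express each score as a ratio to the baseline's score on the same benchmark
    return [s / b for s, b in zip(scores, baseline)]


# Hypothetical raw scores for two systems on two benchmarks (higher is better)
scores_a = [10.0, 100.0]
scores_b = [20.0, 50.0]

b_relative_to_a = normalise(scores_b, scores_a)  # ratios 2.0 and 0.5
a_relative_to_b = normalise(scores_a, scores_b)  # ratios 0.5 and 2.0

# Arithmetic mean: each system looks 25% better than the other,
# depending only on which one was chosen as the baseline.
print("B vs A (arithmetic):", arithmetic_mean(b_relative_to_a))
print("A vs B (arithmetic):", arithmetic_mean(a_relative_to_b))

# Geometric mean: both directions agree that the systems are tied.
print("B vs A (geometric):", geometric_mean(b_relative_to_a))
print("A vs B (geometric):", geometric_mean(a_relative_to_b))
```

    In this toy case both arithmetic means come out at 1.25, so either system can be presented as the winner, while the geometric mean is about 1.0 in both directions.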

    Plain English Explanation

    The paper discusses problematic practices that can creep into machine learning research. One issue is overfitting, where models perform exceptionally well on the data they were trained on, but fail to generalize to new, unseen data. This can lead to overconfident claims about a model's capabilities.

    Another concern is publication bias, where researchers are more likely to publish positive results that show their methods working well, while negative or inconclusive findings often go unpublished. This skews the literature and gives an unrealistic impression of the field's progress.
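
    A toy simulation can show how this skew arises. The sketch below is not from the paper: it assumes a hypothetical method with no real effect, measured with Gaussian noise across many studies, and a made-up publication rule that only results above a threshold get written up. All function names and parameter values are illustrative assumptions.

```python
import random


def simulate_studies(n_studies, true_effect, noise_sd, seed=0):
    # Each study observes the true effect plus independent Gaussian noise
    rng = random.Random(seed)
    return [rng.gauss(true_effect, noise_sd) for _ in range(n_studies)]


def published_only(effects, threshold):
    # Assumed publication rule: only sufficiently positive results appear
    return [e for e in effects if e > threshold]


def mean(values):
    return sum(values) / len(values)


# Hypothetical setting: the method has no real effect at all
all_effects = simulate_studies(n_studies=1000, true_effect=0.0, noise_sd=1.0)
published = published_only(all_effects, threshold=1.0)

print("studies run:", len(all_effects))
print("studies published:", len(published))
print("mean effect, all studies:", round(mean(all_effects), 3))
print("mean effect, published only:", round(mean(published), 3))
```

    Under these assumptions the full set of studies averages close to zero, while the published subset averages above the threshold by construction, which is the unrealistic impression of progress described above.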

    The paper also highlights misleading evaluations, where the metrics used to assess a model's performance may not actually capture its true capabilities or real-world applicability. For example, popular benchmarks for evaluating privacy defenses in ML have been shown to be unreliable.

    By addressing these problematic practices, the authors argue, the field of machine learning can become more rigorous, reliable, and transparent, which in turn supports advances such as better uncertainty quantification in large language models.

    Technical Explanation

    The paper begins by discussing the rise of machine learning as a powerful tool for scientific discovery, but notes that this has also led to the emergence of questionable research practices. The authors highlight three key issues:

    1. Overfitting: The authors explain how machine learning models can become overly specialized to the training data, leading to inflated performance metrics that do not reflect real-world generalization. They draw connections to related work on unraveling overoptimism and publication bias in ML-driven science.

    2. Publication bias: The paper discusses the tendency for positive results to be published more often than negative or inconclusive ones, which often go unreported. This can skew the scientific literature and give an unrealistic impression of progress in the field. The authors relate this to lessons for reliable machine learning and the importance of embracing negative results.

    3. Misleading evaluations: The authors examine how the metrics used to assess machine learning models, particularly in the context of privacy defenses, can be misleading and fail to capture real-world performance. They discuss the issues with evaluating machine learning privacy defenses and the need for more robust evaluation methods.
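
    The inflated metrics described in item 1 can be reproduced even with models that have learned nothing, simply by selecting among many candidates on the test set. The sketch below is an illustrative assumption rather than an experiment from the paper: the "models" are hypothetical pseudo-random guessers (implemented with a hash so the run is deterministic), the labels are random, and the set sizes are arbitrary.

```python
import hashlib
import random


def guess(model_id, x):
    # A "model" that ignores the task: a fixed pseudo-random bit per (model, input)
    digest = hashlib.sha256(f"{model_id}:{x}".encode()).digest()
    return digest[0] & 1


def accuracy(model_id, inputs, labels):
    correct = sum(guess(model_id, x) == y for x, y in zip(inputs, labels))
    return correct / len(inputs)


rng = random.Random(0)
test_inputs = list(range(100))            # small, reused test set
fresh_inputs = list(range(100, 10_100))   # data never used for selection
test_labels = [rng.randint(0, 1) for _ in test_inputs]
fresh_labels = [rng.randint(0, 1) for _ in fresh_inputs]

# Pick the candidate that happens to score best on the test set
candidates = range(200)
best = max(candidates, key=lambda m: accuracy(m, test_inputs, test_labels))

best_test_acc = accuracy(best, test_inputs, test_labels)
best_fresh_acc = accuracy(best, fresh_inputs, fresh_labels)

print("selected model, test accuracy:", best_test_acc)
print("selected model, fresh accuracy:", best_fresh_acc)
```

    Every candidate here is at chance level, yet the selected one scores noticeably above 50% on the test set it was chosen on and falls back to roughly 50% on fresh data. The same mechanism applies, less visibly, when hyperparameters, seeds, or prompts are chosen after looking at test results.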

    Critical Analysis

    The paper raises valid concerns about the potential for questionable practices to undermine the reliability and integrity of machine learning research. The authors provide a nuanced and well-reasoned critique, acknowledging the field's rapid progress while also highlighting important caveats and areas for improvement.

    One potential limitation of the research is that it focuses primarily on issues within the machine learning research community, without delving deeply into the broader societal implications of these practices. For example, the authors could have explored how misleading evaluations and publication biases might impact real-world deployments of machine learning systems and their effects on individuals and communities.

    Additionally, while the paper makes a strong case for addressing these problematic practices, it could be strengthened by providing more concrete recommendations or frameworks for how the research community can work to mitigate them. Further research in this direction could help translate the authors' insights into actionable steps for improving the reliability and transparency of machine learning-driven science.

    Conclusion

    This paper sheds important light on the emergence of questionable practices in machine learning research, such as overfitting, publication bias, and misleading evaluations. By drawing connections to related work and highlighting the need for more rigorous and transparent approaches, the authors make a compelling case for addressing these issues to ensure the integrity and reliability of ML-driven scientific discoveries.

    As the field of machine learning continues to advance, it will be crucial for researchers, practitioners, and the broader public to remain vigilant and critical in their assessment of the methods and findings presented. Addressing the problematic practices outlined in this paper can help pave the way for a more trustworthy and impactful future for machine learning and its applications.

    Read original: arXiv:2407.12220



    This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!
