# The Optimal Choice of Hypothesis Is the Weakest, Not the Shortest

2301.12987

5

0

📈

## Abstract

If $A$ and $B$ are sets such that $A subset B$, generalisation may be understood as the inference from $A$ of a hypothesis sufficient to construct $B$. One might infer any number of hypotheses from $A$, yet only some of those may generalise to $B$. How can one know which are likely to generalise? One strategy is to choose the shortest, equating the ability to compress information with the ability to generalise (a proxy for intelligence). We examine this in the context of a mathematical formalism of enactive cognition. We show that compression is neither necessary nor sufficient to maximise performance (measured in terms of the probability of a hypothesis generalising). We formulate a proxy unrelated to length or simplicity, called weakness. We show that if tasks are uniformly distributed, then there is no choice of proxy that performs at least as well as weakness maximisation in all tasks while performing strictly better in at least one. In experiments comparing maximum weakness and minimum description length in the context of binary arithmetic, the former generalised at between $1.1$ and $5$ times the rate of the latter. We argue this demonstrates that weakness is a far better proxy, and explains why Deepmind's Apperception Engine is able to generalise effectively.

Get summaries of the top AI research delivered straight to your inbox:

## Overview

- This paper examines the relationship between compression and the ability to generalize from one set (A) to a larger set (B).
- The authors propose a new concept called "weakness" as a proxy for generalization, which they argue performs better than simply minimizing description length.
- They demonstrate this through experiments comparing maximum weakness and minimum description length in the context of binary arithmetic.

## Plain English Explanation

The paper explores the idea that the ability to compress information may be a proxy for the ability to generalize from a smaller set (A) to a larger set (B). The authors reason that if we can find a shorter way to describe the information in A, then that compressed representation might be able to generate the larger set B.

However, the authors show that compression alone is not enough - there may be many potential compressed representations of A, but only some of them will actually generalize to B. To address this, they introduce a new concept called "weakness," which they argue is a better proxy for generalization than simply minimizing the length of the description.

The key insight is that by focusing on the "weakest" or most constrained hypothesis that can generate A, you are more likely to find one that can also generate the larger set B. The authors demonstrate this through experiments on binary arithmetic, where the "maximum weakness" approach outperformed the "minimum description length" approach in terms of the rate of generalization.

The authors suggest that this idea of "weakness" could help explain why certain AI systems, like DeepMind's Apperception Engine, are able to generalize effectively - they may be homing in on the weakest, most constrained hypotheses that can still account for the training data.

## Technical Explanation

The paper formalizes the concept of generalization in the context of enactive cognition. If we have a set A that is a subset of a larger set B, generalization can be understood as the process of inferring a hypothesis that is sufficient to construct B from A.

The authors explore the idea that the ability to compress information, as measured by the minimum description length of a hypothesis, may be a proxy for the ability to generalize. They show, however, that compression is neither necessary nor sufficient to maximize the probability of a hypothesis generalizing.

To address this, the authors introduce a new concept called "weakness," which is unrelated to length or simplicity. They prove that if tasks are uniformly distributed, there is no choice of proxy that performs at least as well as weakness maximization in all tasks while performing strictly better in at least one.

In their experiments, the authors compare the performance of maximum weakness and minimum description length in the context of binary arithmetic. They find that the maximum weakness approach generalizes at between 1.1 and 5 times the rate of the minimum description length approach. This, they argue, demonstrates that weakness is a far better proxy for generalization than compression alone.

## Critical Analysis

The paper makes a compelling case for the "weakness" proxy as a more effective way to identify hypotheses that can generalize from A to B, compared to simply minimizing description length. The authors provide a strong theoretical foundation and empirical evidence to support their claims.

However, the paper does not address potential limitations or caveats of the weakness approach. For example, it's unclear how the approach would scale to more complex real-world problems, or how sensitive it is to the specific distribution of tasks. Additionally, the authors do not discuss potential practical challenges in implementing the weakness maximization strategy in real-world AI systems.

It would also be helpful to see the authors contextualize their findings within the broader literature on generalization in AI, and explore potential connections or differences with other approaches.

Overall, this paper presents an interesting and potentially impactful idea, but further research and discussion would be beneficial to fully understand its implications and limitations.

## Conclusion

This paper introduces a novel concept called "weakness" as a proxy for generalization, which the authors argue outperforms the more common approach of minimizing description length. Through a combination of theoretical analysis and empirical experiments, the authors demonstrate that weakness maximization is a more effective strategy for identifying hypotheses that can generalize from a smaller set (A) to a larger set (B).

The findings have potentially significant implications for the development of more robust and generalizable AI systems, as the weakness approach may help address some of the limitations of current approaches focused solely on compression or simplicity. Further research and real-world testing would be needed to fully understand the broader applicability and practical implementation of this idea.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

## Related Papers

❗

### Is Complexity an Illusion?

Michael Timothy Bennett

0

0

Simplicity is held by many to be the key to general intelligence. Simpler models tend to generalise, identifying the cause or generator of data with greater sample efficiency. The implications of the correlation between simplicity and generalisation extend far beyond computer science, addressing questions of physics and even biology. Yet simplicity is a property of form, while generalisation is of function. In interactive settings, any correlation between the two depends on interpretation. In theory there could be no correlation and yet in practice, there is. Previous theoretical work showed generalisation to be a consequence of weak constraints implied by function, not form. Experiments demonstrated choosing weak constraints over simple forms yielded a 110-500% improvement in generalisation rate. Here we show that all constraints can take equally simple forms, regardless of weakness. However if forms are spatially extended, then function is represented using a finite subset of forms. If function is represented using a finite subset of forms, then we can force a correlation between simplicity and generalisation by making weak constraints take simple forms. If function is determined by a goal directed process that favours versatility (e.g. natural selection), then efficiency demands weak constraints take simple forms. Complexity has no causal influence on generalisation, but appears to due to confounding.

5/31/2024

### New!Minimal Communication-Cost Statistical Learning

Milad Sefidgaran, Abdellatif Zaidi, Piotr Krasnowski

0

0

A client device which has access to $n$ training data samples needs to obtain a statistical hypothesis or model $W$ and then to send it to a remote server. The client and the server devices share some common randomness sequence as well as a prior on the hypothesis space. In this problem a suitable hypothesis or model $W$ should meet two distinct design criteria simultaneously: (i) small (population) risk during the inference phase and (ii) small 'complexity' for it to be conveyed to the server with minimum communication cost. In this paper, we propose a joint training and source coding scheme with provable in-expectation guarantees, where the expectation is over the encoder's output message. Specifically, we show that by imposing a constraint on a suitable Kullback-Leibler divergence between the conditional distribution induced by a compressed learning model $widehat{W}$ given $W$ and the prior, one guarantees simultaneously small average empirical risk (aka training loss), small average generalization error and small average communication cost. We also consider a one-shot scenario in which the guarantees on the empirical risk and generalization error are obtained for every encoder's output message.

6/13/2024

### Theoretical Analysis of Weak-to-Strong Generalization

Hunter Lang, David Sontag, Aravindan Vijayaraghavan

0

0

Strong student models can learn from weaker teachers: when trained on the predictions of a weaker model, a strong pretrained student can learn to correct the weak model's errors and generalize to examples where the teacher is not confident, even when these examples are excluded from training. This enables learning from cheap, incomplete, and possibly incorrect label information, such as coarse logical rules or the generations of a language model. We show that existing weak supervision theory fails to account for both of these effects, which we call pseudolabel correction and coverage expansion, respectively. We give a new bound based on expansion properties of the data distribution and student hypothesis class that directly accounts for pseudolabel correction and coverage expansion. Our bounds capture the intuition that weak-to-strong generalization occurs when the strong model is unable to fit the mistakes of the weak teacher without incurring additional error. We show that these expansion properties can be checked from finite data and give empirical evidence that they hold in practice.

5/28/2024

### Quantifying the Gain in Weak-to-Strong Generalization

Moses Charikar, Chirag Pabbaraju, Kirankumar Shiragur

0

0

Recent advances in large language models have shown capabilities that are extraordinary and near-superhuman. These models operate with such complexity that reliably evaluating and aligning them proves challenging for humans. This leads to the natural question: can guidance from weak models (like humans) adequately direct the capabilities of strong models? In a recent and somewhat surprising work, Burns et al. (2023) empirically demonstrated that when strong models (like GPT-4) are finetuned using labels generated by weak supervisors (like GPT-2), the strong models outperform their weaker counterparts -- a phenomenon they term weak-to-strong generalization. In this work, we present a theoretical framework for understanding weak-to-strong generalization. Specifically, we show that the improvement in performance achieved by strong models over their weaker counterparts is quantified by the misfit error incurred by the strong model on labels generated by the weaker model. Our theory reveals several curious algorithmic insights. For instance, we can predict the amount by which the strong model will improve over the weak model, and also choose among different weak models to train the strong model, based on its misfit error. We validate our theoretical findings through various empirical assessments.

5/27/2024