We show that content on the web is often translated into many languages, and the low quality of these multi-way translations indicates they were likely created using Machine Translation (MT). Multi-way parallel, machine generated content not only dominates the translations in lower resource languages; it also constitutes a large fraction of the total web content in those languages. We also find evidence of a selection bias in the type of content which is translated into many languages, consistent with low quality English content being translated en masse into many lower resource languages, via MT. Our work raises serious concerns about training models such as multilingual large language models on both monolingual and bilingual data scraped from the web.

## Overview

- This paper investigates how much web content is machine-translated, using multi-way parallelism (the same text appearing as translations in more than two languages) as its main signal.
- The researchers build a large multi-way parallel corpus called MWccMatrix, which contains 6.4 billion unique sentences in 90 languages, and use it to analyze the extent of machine translation on the web.
- The findings suggest that a significant portion of the web's content is machine-translated, with important implications for machine translation research, web content quality, and the understanding of multilingual language models.

## Plain English Explanation

This paper looks at how much of the content on the internet is automatically translated by machines, rather than being written by humans. The researchers created a huge dataset called MWccMatrix, which contains billions of sentences gathered from the web in 90 languages. They used this dataset to study the extent of machine translation on the web.

The key finding is that a [surprisingly large amount](https://aimodels.fyi/papers/arxiv/how-multilingual-are-large-language-models-fine) of the content on the internet is machine-translated, rather than being originally written in that language. This has important implications for [how we think about machine translation](https://aimodels.fyi/papers/arxiv/using-machine-translation-to-augment-multilingual-classification) and the [quality of content](https://aimodels.fyi/papers/arxiv/paradigm-shift-future-machine-translation-lies-large) on the web. It also affects our understanding of [multilingual language models](https://aimodels.fyi/papers/arxiv/survey-multi-modal-machine-translation-tasks-methods), which may be learning from a lot of machine-translated text.

Overall, this research provides valuable insights into the scale and nature of machine translation on the internet, which [could help shape the future of multilingual AI](https://aimodels.fyi/papers/arxiv/could-we-have-had-better-multilingual-llms).

## Technical Explanation

The researchers build MWccMatrix, a multi-way parallel corpus of 6.4 billion unique sentences in 90 languages. They group sentences that are translations of one another into sets, so each set records how many languages a given piece of text appears in. The translations that appear in many languages are of low quality, which the authors take as evidence that they were likely produced by machine translation.
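To make the grouping step concrete, here is a minimal sketch of one common way to turn pairwise translation links into multi-way sets: treat each sentence as a node, each aligned pair as an edge, and collect connected components with union-find. This is an illustration under stated assumptions, not necessarily the authors' procedure; the function name `build_translation_tuples`, the helper `parallelism`, and the toy sentences are all hypothetical.

```python
from collections import defaultdict


def build_translation_tuples(pairs):
    """Group (lang, sentence) nodes linked by aligned pairs into connected components.

    `pairs` is an iterable of ((lang, sentence), (lang, sentence)) links.
    Returns a list of sets; each set is one group of mutual translations.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        # Walk up to the root of this node's component.
        root = node
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every visited node straight at the root.
        while parent[node] != root:
            parent[node], node = root, parent[node]
        return root

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for left, right in pairs:
        union(left, right)

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return list(groups.values())


def parallelism(group):
    """Number of distinct languages in one translation group."""
    return len({lang for lang, _ in group})


# Toy data (invented for illustration only).
pairs = [
    (("en", "Hello world"), ("fr", "Bonjour le monde")),
    (("en", "Hello world"), ("de", "Hallo Welt")),
    (("en", "Good night"), ("es", "Buenas noches")),
]
tuples = build_translation_tuples(pairs)
```

In this toy input, the first two links share the English sentence, so they merge into one three-language group, while the last link stays a two-language pair. Groups covering three or more languages are the multi-way parallel ones.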

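Their analysis indicates that a significant share of the web's content, estimated at around 30-50%, is machine-translated. The effect is strongest in lower-resource languages, where multi-way parallel, machine-generated text dominates the available translations and makes up a large fraction of all web content in those languages. The affected pages include not just user-generated content but also professional and commercial web pages. The researchers also find evidence that machine translation is used extensively for indexing and crawling web content in multiple languages.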
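A per-language share of this kind can be computed along the following lines. This is a minimal sketch, assuming sentences have already been grouped into sets of mutual translations and that "multi-way" means three or more languages; the function name `multiway_share`, the `min_langs` parameter, and the toy data are hypothetical and do not reproduce the paper's numbers.

```python
from collections import Counter


def multiway_share(translation_tuples, min_langs=3):
    """Fraction of each language's sentences that sit in a multi-way parallel group.

    `translation_tuples` is an iterable of sets of (lang, sentence) items.
    A group counts as multi-way if it spans at least `min_langs` languages.
    """
    total = Counter()
    multiway = Counter()
    for group in translation_tuples:
        langs = {lang for lang, _ in group}
        is_multiway = len(langs) >= min_langs
        for lang, _ in group:
            total[lang] += 1
            if is_multiway:
                multiway[lang] += 1
    return {lang: multiway[lang] / total[lang] for lang in total}


# Toy groups (invented for illustration only).
groups = [
    {("en", "Hello world"), ("fr", "Bonjour le monde"), ("de", "Hallo Welt")},
    {("en", "Good night"), ("es", "Buenas noches")},
]
share = multiway_share(groups)
```

Here `share["en"]` is 0.5 because one of the two English sentences belongs to a three-language group, while `share["es"]` is 0.0. On real data, comparing these shares between high-resource and lower-resource languages is the kind of contrast the paper describes.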

The implications of these findings are far-reaching. They suggest that both the monolingual and the bilingual web-scraped data used to train machine translation systems and multilingual language models may contain a large amount of machine-translated text, which could limit model quality, particularly in lower-resource languages. The prevalence of machine-translated content also raises questions about web content quality and about users' ability to critically evaluate information sources.
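One possible mitigation is to filter training data by how many languages a sentence appears in. The sketch below is an assumption-laden illustration rather than a method from the paper: the function name `filter_by_parallelism` and the `max_langs` threshold are hypothetical, and a real pipeline would need to tune the threshold and check what useful text it discards.

```python
def filter_by_parallelism(translation_tuples, max_langs=2):
    """Keep only sentences from groups spanning at most `max_langs` languages.

    `translation_tuples` is an iterable of sets of (lang, sentence) items.
    Highly multi-way parallel groups are dropped on the assumption that they
    are more likely to be machine-translated.
    """
    kept = []
    for group in translation_tuples:
        n_langs = len({lang for lang, _ in group})
        if n_langs <= max_langs:
            # Sort for a deterministic order, since sets are unordered.
            kept.extend(sorted(group))
    return kept


# Toy groups (invented for illustration only).
candidate_groups = [
    {("en", "Hello world"), ("fr", "Bonjour le monde"), ("de", "Hallo Welt")},
    {("en", "Good night"), ("es", "Buenas noches")},
]
kept_sentences = filter_by_parallelism(candidate_groups, max_langs=2)
```

With `max_langs=2`, only the two-language pair survives. The trade-off is that a strict threshold also removes genuine human translations that happen to exist in many languages.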

## Critical Analysis

The researchers acknowledge several limitations to their study. The MWccMatrix corpus, while very large, may not be fully representative of the entire web. There could be biases in the web pages that are crawled and included in the dataset.

Additionally, the researchers' techniques for identifying machine-translated content, while sophisticated, may not be perfect. It's possible that some human-written content is mistakenly classified as machine-translated, or vice versa.
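Both kinds of error can be quantified with precision and recall, given a hand-labeled sample. The following is a minimal sketch with invented labels; the function name `precision_recall` and the sample values are hypothetical and are not results from the paper.

```python
def precision_recall(predicted, actual):
    """Precision and recall for a binary 'machine-translated' label.

    `predicted` and `actual` are equal-length sequences of booleans, where
    True means 'machine-translated'. Returns (precision, recall); a value is
    0.0 when its denominator is zero.
    """
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Invented labels for five sentences (illustration only).
predicted = [True, True, False, True, False]
actual = [True, False, False, True, True]
precision, recall = precision_recall(predicted, actual)
```

In this sample there are two true positives, one human-written sentence flagged as machine-translated (a false positive), and one machine-translated sentence that was missed (a false negative), so both precision and recall come out to 2/3.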

Further research is needed to better understand the nuances of machine translation on the web, such as how it varies across different domains, languages, and types of content. Longitudinal studies could also shed light on how the prevalence of machine translation has changed over time.

Despite these caveats, this study provides a valuable and sobering look at the current state of web content creation. It highlights the need for greater awareness and critical thinking around the origins and trustworthiness of online information, as well as the potential pitfalls in relying on machine-translated data for training AI systems.

## Conclusion

This paper reveals that a [surprisingly large amount](https://aimodels.fyi/papers/arxiv/how-multilingual-are-large-language-models-fine) of the web's content is machine-translated, rather than being originally written in that language. This has important implications for [machine translation research](https://aimodels.fyi/papers/arxiv/using-machine-translation-to-augment-multilingual-classification), the [quality and reliability of web content](https://aimodels.fyi/papers/arxiv/paradigm-shift-future-machine-translation-lies-large), and our understanding of [multilingual language models](https://aimodels.fyi/papers/arxiv/survey-multi-modal-machine-translation-tasks-methods).

The researchers' insights could help shape the [future of multilingual AI](https://aimodels.fyi/papers/arxiv/could-we-have-had-better-multilingual-llms) by highlighting the need to better account for the prevalence of machine-translated text in training data and web content. This study serves as an important wake-up call for both researchers and internet users to be more critical and discerning about the origins and quality of online information.