A Shocking Amount of the Web is Machine Translated: Insights from Multi-Way Parallelism

2401.05749

Published 6/7/2024 by Brian Thompson, Mehak Preet Dhaliwal, Peter Frisch, Tobias Domhan, Marcello Federico
Abstract

We show that content on the web is often translated into many languages, and the low quality of these multi-way translations indicates they were likely created using Machine Translation (MT). Multi-way parallel, machine generated content not only dominates the translations in lower resource languages; it also constitutes a large fraction of the total web content in those languages. We also find evidence of a selection bias in the type of content which is translated into many languages, consistent with low quality English content being translated en masse into many lower resource languages, via MT. Our work raises serious concerns about training models such as multilingual large language models on both monolingual and bilingual data scraped from the web.

Overview

  • This paper investigates the prevalence of machine-translated content on the web and provides insights into the multi-way parallelism (the alignment of text across multiple languages) of such content.
  • The researchers build a large multi-way parallel corpus called MWccMatrix, containing 6.4 billion unique sentences in 90 languages, and use it to analyze the extent of machine translation on the web.
  • The findings suggest that a significant portion of the web's content is machine-translated, with important implications for machine translation research, web content quality, and the understanding of multilingual language models.

Plain English Explanation

This paper looks at how much of the content on the internet is automatically translated by machines, rather than being written by humans. The researchers created a huge dataset called MWccMatrix, which groups sentences from the web with their translations across 90 different languages. They used this dataset to study the extent of machine translation on the web.

The key finding is that a surprisingly large amount of the content on the internet is machine-translated rather than originally written in that language, especially in lower-resource languages. This has important implications for how we think about machine translation and the quality of content on the web. It also affects our understanding of multilingual language models, which may be learning from a lot of machine-translated text.

Overall, this research provides valuable insights into the scale and nature of machine translation on the internet, which could help shape the future of multilingual AI.

Technical Explanation

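The researchers create a large-scale multi-way parallel corpus called MWccMatrix, which contains 6.4 billion unique sentences in 90 languages. They group sentences that are translations of one another into translation tuples, so each tuple records how many languages a given piece of text appears in. Because the translations that appear in many languages at once are of low quality, the authors conclude that they were likely produced by machine translation.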
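To make the idea of aligning text across many languages concrete, here is a minimal Python sketch that merges bilingual sentence pairs into multi-way groups using a union-find structure. It is an illustration under stated assumptions, not the authors' pipeline: the function name `build_translation_tuples`, the input shape, and the toy data are all hypothetical.

```python
from collections import defaultdict


def build_translation_tuples(pairs):
    """Group bilingual sentence pairs into multi-way translation tuples.

    pairs: iterable of ((lang_a, sent_a), (lang_b, sent_b)) items
           (hypothetical input format chosen for this sketch).
    Returns a list of sets, each holding (lang, sentence) items that are
    connected through at least one chain of pairwise translations.
    """
    parent = {}

    def find(item):
        # Register unseen items as their own root, then walk to the root.
        parent.setdefault(item, item)
        while parent[item] != item:
            parent[item] = parent[parent[item]]  # path halving
            item = parent[item]
        return item

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for a, b in pairs:
        union(a, b)

    # Collect every item under its root to form the tuples.
    groups = defaultdict(set)
    for item in list(parent):
        groups[find(item)].add(item)
    return list(groups.values())


# Toy data (invented for illustration only).
example_pairs = [
    (("en", "Hello"), ("fr", "Bonjour")),
    (("fr", "Bonjour"), ("de", "Hallo")),
    (("en", "Thanks"), ("es", "Gracias")),
]
example_tuples = build_translation_tuples(example_pairs)
print(sorted(len(t) for t in example_tuples))  # [2, 3]
```

In this toy run, "Hello", "Bonjour", and "Hallo" end up in one three-language group even though no English-German pair was given directly, which is what makes the grouping multi-way rather than pairwise.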

Their analysis shows that multi-way parallel, machine-generated content dominates the translations in lower-resource languages and also makes up a large fraction of the total web content in those languages. The researchers also find evidence of a selection bias in what gets translated into many languages, consistent with low-quality English content being translated en masse into lower-resource languages via MT.
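One way to quantify a claim like this is to measure, for each language, what share of its sentences sit in groups that span several languages. The sketch below assumes each group is a set of (language, sentence) items and treats three or more languages as "multi-way"; the function name `multiway_share`, the `min_langs` parameter, and the toy data are hypothetical and only illustrate the arithmetic.

```python
from collections import Counter


def multiway_share(tuples, min_langs=3):
    """Per-language fraction of sentences found in multi-way groups.

    tuples: iterable of sets of (lang, sentence) items.
    min_langs: number of distinct languages a group needs to count as
               multi-way (3 is an assumption of this sketch).
    """
    total = Counter()
    multiway = Counter()
    for group in tuples:
        n_langs = len({lang for lang, _ in group})
        for lang, _ in group:
            total[lang] += 1
            if n_langs >= min_langs:
                multiway[lang] += 1
    return {lang: multiway[lang] / total[lang] for lang in total}


# Toy data (invented for illustration only).
toy_tuples = [
    {("en", "Hello"), ("fr", "Bonjour"), ("de", "Hallo")},
    {("en", "Thanks"), ("es", "Gracias")},
]
shares = multiway_share(toy_tuples)
print(shares["en"], shares["fr"], shares["es"])  # 0.5 1.0 0.0
```

A language whose share is close to 1.0 has almost all of its translated sentences in many-language groups, which is the pattern the summary describes for lower-resource languages.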

The implications of these findings are far-reaching. They suggest that the training data used for machine translation and multilingual language models may be heavily skewed towards machine-translated text, potentially limiting their performance. The prevalence of machine-translated content also raises questions about web content quality and the ability of users to critically evaluate information sources.

Critical Analysis

The researchers acknowledge several limitations to their study. The MWccMatrix corpus, while very large, may not be fully representative of the entire web. There could be biases in the web pages that are crawled and included in the dataset.

Additionally, the researchers' techniques for identifying machine-translated content, while sophisticated, may not be perfect. It's possible that some human-written content is mistakenly classified as machine-translated, or vice versa.

Further research is needed to better understand the nuances of machine translation on the web, such as how it varies across different domains, languages, and types of content. Longitudinal studies could also shed light on how the prevalence of machine translation has changed over time.

Despite these caveats, this study provides a valuable and sobering look at the current state of web content creation. It highlights the need for greater awareness and critical thinking around the origins and trustworthiness of online information, as well as the potential pitfalls in relying on machine-translated data for training AI systems.

Conclusion

This paper reveals that a surprisingly large amount of the web's content is machine-translated, rather than being originally written in that language. This has important implications for machine translation research, the quality and reliability of web content, and our understanding of multilingual language models.

The researchers' insights could help shape the future of multilingual AI by highlighting the need to better account for the prevalence of machine-translated text in training data and web content. This study serves as an important wake-up call for both researchers and internet users to be more critical and discerning about the origins and quality of online information.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

How Multilingual Are Large Language Models Fine-Tuned for Translation?

Aquia Richburg, Marine Carpuat

A new paradigm for machine translation has recently emerged: fine-tuning large language models (LLM) on parallel text has been shown to outperform dedicated translation systems trained in a supervised fashion on much larger amounts of parallel data (Xu et al., 2024a; Alves et al., 2024). However, it remains unclear whether this paradigm can enable massively multilingual machine translation or whether it requires fine-tuning dedicated models for a small number of language pairs. How does translation fine-tuning impact the MT capabilities of LLMs for zero-shot languages, zero-shot language pairs, and translation tasks that do not involve English? To address these questions, we conduct an extensive empirical evaluation of the translation quality of the TOWER family of language models (Alves et al., 2024) on 132 translation tasks from the multi-parallel FLORES-200 data. We find that translation fine-tuning improves translation quality even for zero-shot languages on average, but that the impact is uneven depending on the language pairs involved. These results call for further research to effectively enable massively multilingual translation with LLMs.

6/3/2024

Using Machine Translation to Augment Multilingual Classification

Adam King

An all-too-present bottleneck for text classification model development is the need to annotate training data and this need is multiplied for multilingual classifiers. Fortunately, contemporary machine translation models are both easily accessible and have dependable translation quality, making it possible to translate labeled training data from one language into another. Here, we explore the effects of using machine translation to fine-tune a multilingual model for a classification task across multiple languages. We also investigate the benefits of using a novel technique, originally proposed in the field of image captioning, to account for potential negative effects of tuning models on translated data. We show that translated data are of sufficient quality to tune multilingual classifiers and that this novel loss technique is able to offer some improvement over models tuned without it.

5/10/2024

A Paradigm Shift: The Future of Machine Translation Lies with Large Language Models

Chenyang Lyu, Zefeng Du, Jitao Xu, Yitao Duan, Minghao Wu, Teresa Lynn, Alham Fikri Aji, Derek F. Wong, Siyou Liu, Longyue Wang

Machine Translation (MT) has greatly advanced over the years due to the developments in deep neural networks. However, the emergence of Large Language Models (LLMs) like GPT-4 and ChatGPT is introducing a new phase in the MT domain. In this context, we believe that the future of MT is intricately tied to the capabilities of LLMs. These models not only offer vast linguistic understandings but also bring innovative methodologies, such as prompt-based techniques, that have the potential to further elevate MT. In this paper, we provide an overview of the significant enhancements in MT that are influenced by LLMs and advocate for their pivotal role in upcoming MT research and implementations. We highlight several new MT directions, emphasizing the benefits of LLMs in scenarios such as Long-Document Translation, Stylized Translation, and Interactive Translation. Additionally, we address the important concern of privacy in LLM-driven MT and suggest essential privacy-preserving strategies. By showcasing practical instances, we aim to demonstrate the advantages that LLMs offer, particularly in tasks like translating extended documents. We conclude by emphasizing the critical role of LLMs in guiding the future evolution of MT and offer a roadmap for future exploration in the sector.

4/3/2024

A Survey on Multi-modal Machine Translation: Tasks, Methods and Challenges

Huangjun Shen, Liangying Shao, Wenbo Li, Zhibin Lan, Zhanyu Liu, Jinsong Su

In recent years, multi-modal machine translation has attracted significant interest in both academia and industry due to its superior performance. It takes both textual and visual modalities as inputs, leveraging visual context to tackle the ambiguities in source texts. In this paper, we begin by offering an exhaustive overview of 99 prior works, comprehensively summarizing representative studies from the perspectives of dominant models, datasets, and evaluation metrics. Afterwards, we analyze the impact of various factors on model performance and finally discuss the possible research directions for this task in the future. Over time, multi-modal machine translation has developed more types to meet diverse needs. Unlike previous surveys confined to the early stage of multi-modal machine translation, our survey thoroughly concludes these emerging types from different aspects, so as to provide researchers with a better understanding of its current state.

5/24/2024