A Japanese-Chinese Parallel Corpus Using Crowdsourcing for Web Mining

2405.09017

Published 5/16/2024 by Masaaki Nagata, Makoto Morishita, Katsuki Chousa, Norihito Yasuda

Abstract

Using crowdsourcing, we collected more than 10,000 URL pairs (parallel top page pairs) of bilingual websites that contain parallel documents and created a Japanese-Chinese parallel corpus of 4.6M sentence pairs from these websites. We used a Japanese-Chinese bilingual dictionary of 160K word pairs for document and sentence alignment. We then used high-quality 1.2M Japanese-Chinese sentence pairs to train a parallel corpus filter based on statistical language models and word translation probabilities. We compared the translation accuracy of the model trained on these 4.6M sentence pairs with that of the model trained on Japanese-Chinese sentence pairs from CCMatrix (12.4M), a parallel corpus from global web mining. Although our corpus is only one-third the size of CCMatrix, we found that the accuracy of the two models was comparable and confirmed that it is feasible to use crowdsourcing for web mining of parallel data.

Overview

  • Collected over 10,000 URL pairs of bilingual websites with parallel documents
  • Created a Japanese-Chinese parallel corpus of 4.6M sentence pairs from these websites
  • Used a 160K word pair Japanese-Chinese bilingual dictionary for document and sentence alignment
  • Trained a parallel corpus filter using 1.2M high-quality Japanese-Chinese sentence pairs
  • Compared the translation accuracy of the model trained on 4.6M sentence pairs to the model trained on 12.4M sentence pairs from CCMatrix
  • Found the accuracy of the two models was comparable, demonstrating the feasibility of using crowdsourcing for web mining of parallel data

Plain English Explanation

Using crowdsourcing, the researchers collected more than 10,000 pairs of website URLs whose pages carry the same content in Japanese and Chinese. From these websites they built a parallel corpus, a dataset of sentences paired with their translations in the other language. The corpus contains 4.6 million sentence pairs.

To align documents and sentences across the two languages, the researchers used a Japanese-Chinese dictionary with 160,000 word pairs. They then used 1.2 million high-quality Japanese-Chinese sentence pairs to train a filtering model. This model identifies which sentence pairs are good translations of each other, based on statistical patterns in the language.
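The dictionary-based matching described above can be sketched roughly as follows. This is a minimal illustration, not the researchers' actual aligner: it assumes the sentences are already split into words, and the function names (`dictionary_overlap`, `best_match`) and the four-entry `TOY_DICT` are hypothetical stand-ins for the 160,000-pair dictionary.

```python
from typing import Dict, List, Set, Tuple


def dictionary_overlap(
    ja_tokens: List[str],
    zh_tokens: List[str],
    ja_to_zh: Dict[str, Set[str]],
) -> float:
    """Fraction of Japanese tokens that have a dictionary translation
    appearing somewhere in the Chinese sentence."""
    if not ja_tokens:
        return 0.0
    zh_set = set(zh_tokens)
    matched = sum(1 for tok in ja_tokens if ja_to_zh.get(tok, set()) & zh_set)
    return matched / len(ja_tokens)


def best_match(
    ja_tokens: List[str],
    zh_candidates: List[List[str]],
    ja_to_zh: Dict[str, Set[str]],
) -> Tuple[int, float]:
    """Return (index, score) of the Chinese candidate with the highest overlap.
    Assumes at least one candidate is given."""
    scores = [dictionary_overlap(ja_tokens, zh, ja_to_zh) for zh in zh_candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Toy dictionary: hypothetical entries, not taken from the paper.
TOY_DICT = {
    "猫": {"猫"},
    "魚": {"鱼"},
    "食べる": {"吃"},
    "犬": {"狗"},
}

ja = ["猫", "が", "魚", "を", "食べる"]
zh_candidates = [
    ["狗", "在", "跑"],
    ["猫", "吃", "鱼"],
]
index, score = best_match(ja, zh_candidates, TOY_DICT)
```

Here the second Chinese sentence is selected because three of the five Japanese tokens have a dictionary translation in it. A real aligner would also have to handle word segmentation, sentence order, and one-to-many matches, which this sketch ignores.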

The researchers compared machine translation models trained on their 4.6 million sentence pair corpus and on the Japanese-Chinese sentence pairs from CCMatrix, a 12.4 million sentence pair corpus built by global web mining. Even though their corpus is only about a third the size, the translation accuracy of the two models was similar. This suggests that crowdsourcing the list of bilingual websites is a feasible way to mine parallel data from the web, without relying solely on global web mining.

Technical Explanation

Using crowdsourcing, the researchers collected more than 10,000 URL pairs (parallel top page pairs) of bilingual websites that contain parallel documents. From these websites, they created a Japanese-Chinese parallel corpus containing 4.6 million sentence pairs.

To align the documents and sentences between the two languages, the researchers used a Japanese-Chinese bilingual dictionary with 160,000 word pairs. They then used 1.2 million high-quality Japanese-Chinese sentence pairs to train a parallel corpus filter. This filter uses statistical language models and word translation probabilities to identify which sentence pairs are good translations of each other.
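One way such a filter could combine the two signals is sketched below. The summary does not specify the model details, so everything here is an assumption made for illustration: a unigram language model with add-one smoothing stands in for the statistical language model, an IBM Model 1 style average stands in for the word translation probabilities, and the names (`train_unigram_lm`, `translation_logprob`, `keep_pair`), the toy probability table, and the threshold are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def train_unigram_lm(sentences: List[List[str]]) -> Dict[str, float]:
    """Add-one smoothed unigram model estimated from clean target sentences."""
    counts = Counter(tok for sent in sentences for tok in sent)
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen tokens
    probs = {tok: (c + 1) / (total + vocab) for tok, c in counts.items()}
    probs["<unk>"] = 1 / (total + vocab)
    return probs


def lm_logprob(tokens: List[str], lm: Dict[str, float]) -> float:
    """Average per-token log-probability under the unigram model."""
    if not tokens:
        return float("-inf")
    return sum(math.log(lm.get(t, lm["<unk>"])) for t in tokens) / len(tokens)


def translation_logprob(
    src: List[str],
    tgt: List[str],
    t_table: Dict[Tuple[str, str], float],
    floor: float = 1e-6,
) -> float:
    """IBM Model 1 style score: for each target token, average t(tgt | src)
    over all source tokens, then average the logs over target tokens."""
    if not src or not tgt:
        return float("-inf")
    total = 0.0
    for t in tgt:
        p = sum(t_table.get((s, t), 0.0) for s in src) / len(src)
        total += math.log(max(p, floor))  # floor avoids log(0)
    return total / len(tgt)


def keep_pair(src, tgt, lm, t_table, threshold: float) -> bool:
    """Keep the pair if fluency plus translation score reaches the threshold."""
    score = lm_logprob(tgt, lm) + translation_logprob(src, tgt, t_table)
    return score >= threshold


# Toy data: hypothetical values, not estimated from the paper's 1.2M pairs.
clean_zh = [["猫", "吃", "鱼"], ["狗", "吃", "肉"]]
lm = train_unigram_lm(clean_zh)
t_table = {("猫", "猫"): 0.9, ("魚", "鱼"): 0.9, ("食べる", "吃"): 0.9}

src = ["猫", "魚", "食べる"]
good_kept = keep_pair(src, ["猫", "吃", "鱼"], lm, t_table, threshold=-5.0)
bad_kept = keep_pair(src, ["狗", "在", "跑"], lm, t_table, threshold=-5.0)
```

With these toy values the true translation passes the threshold and the unrelated sentence does not. In practice the language model and the translation table would both be estimated from the clean training pairs, and the threshold tuned on held-out data.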

The researchers compared the translation accuracy of a model trained on their 4.6 million sentence pair corpus to that of a model trained on the 12.4 million Japanese-Chinese sentence pairs from CCMatrix, a parallel corpus built by global web mining. Although their corpus is only about one-third the size (4.6M / 12.4M ≈ 0.37), the translation accuracy of the two models was comparable. This confirms that it is feasible to use crowdsourcing to identify parallel websites for web mining, and so to build useful parallel corpora for machine translation without relying solely on global web mining.

Critical Analysis

The researchers acknowledge that their corpus of 4.6 million sentence pairs is relatively small compared to the 12.4 million sentence pairs in the CCMatrix corpus. They also note that the quality of the sentence pairs in their corpus may vary, since the source websites were identified by crowd workers rather than by global web mining.

While the researchers found that the translation accuracy of the two models was comparable, CCMatrix may still have an advantage in certain domains or use cases. The researchers encourage further research to explore the strengths and limitations of both approaches to parallel corpus construction.

Additionally, the researchers do not provide much detail on the specific crowdsourcing techniques they used to collect the website pairs. More research may be needed to understand the best practices and challenges of using crowdsourcing for parallel data collection.

Overall, the researchers have demonstrated a promising approach to building useful parallel corpora through crowdsourcing, but there is still room for further exploration and refinement of the methodology.

Conclusion

This research shows that crowdsourcing can be used to find bilingual websites for web mining and so build a large-scale parallel corpus for machine translation, without relying solely on global web mining. The researchers created a Japanese-Chinese corpus of 4.6 million sentence pairs that achieved translation accuracy comparable to that of the much larger CCMatrix corpus.

This suggests that crowdsourcing can be a viable and cost-effective approach for building high-quality parallel data, which is a critical component of cross-lingual language models and other multilingual natural language processing applications. As the field of machine translation continues to advance, techniques like this that leverage the wisdom of the crowd could play an important role in accelerating progress and making language technologies more accessible to diverse global audiences.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

KazParC: Kazakh Parallel Corpus for Machine Translation

Rustem Yeshpanov, Alina Polonskaya, Huseyin Atakan Varol

We introduce KazParC, a parallel corpus designed for machine translation across Kazakh, English, Russian, and Turkish. The first and largest publicly available corpus of its kind, KazParC contains a collection of 371,902 parallel sentences covering different domains and developed with the assistance of human translators. Our research efforts also extend to the development of a neural machine translation model nicknamed Tilmash. Remarkably, the performance of Tilmash is on par with, and in certain instances, surpasses that of industry giants, such as Google Translate and Yandex Translate, as measured by standard evaluation metrics, such as BLEU and chrF. Both KazParC and Tilmash are openly available for download under the Creative Commons Attribution 4.0 International License (CC BY 4.0) through our GitHub repository.

4/11/2024

Quality Does Matter: A Detailed Look at the Quality and Utility of Web-Mined Parallel Corpora

Surangika Ranathunga, Nisansa de Silva, Menan Velayuthan, Aloka Fernando, Charitha Rathnayake

We conducted a detailed analysis on the quality of web-mined corpora for two low-resource languages (making three language pairs, English-Sinhala, English-Tamil and Sinhala-Tamil). We ranked each corpus according to a similarity measure and carried out an intrinsic and extrinsic evaluation on different portions of this ranked corpus. We show that there are significant quality differences between different portions of web-mined corpora and that the quality varies across languages and datasets. We also show that, for some web-mined datasets, Neural Machine Translation (NMT) models trained with their highest-ranked 25k portion can be on par with human-curated datasets.

6/17/2024

Building a Large Japanese Web Corpus for Large Language Models

Naoaki Okazaki, Kakeru Hattori, Hirai Shota, Hiroki Iida, Masanari Ohi, Kazuki Fujii, Taishi Nakamura, Mengsay Loem, Rio Yokota, Sakae Mizuki

Open Japanese large language models (LLMs) have been trained on the Japanese portions of corpora such as CC-100, mC4, and OSCAR. However, these corpora were not created for the quality of Japanese texts. This study builds a large Japanese web corpus by extracting and refining text from the Common Crawl archive (21 snapshots of approximately 63.4 billion pages crawled between 2020 and 2023). This corpus consists of approximately 312.1 billion characters (approximately 173 million pages), which is the largest of all available training corpora for Japanese LLMs, surpassing CC-100 (approximately 25.8 billion characters), mC4 (approximately 239.7 billion characters) and OSCAR 23.10 (approximately 74 billion characters). To confirm the quality of the corpus, we performed continual pre-training on Llama 2 7B, 13B, 70B, Mistral 7B v0.1, and Mixtral 8x7B Instruct as base LLMs and gained consistent (6.6-8.1 points) improvements on Japanese benchmark datasets. We also demonstrate that the improvement on Llama 2 13B brought from the presented corpus was the largest among those from other existing corpora.

4/30/2024

Smart Bilingual Focused Crawling of Parallel Documents

Cristian García-Romero, Miquel Esplà-Gomis, Felipe Sánchez-Martínez

Crawling parallel texts (texts that are mutual translations) from the Internet is usually done following a brute-force approach: documents are massively downloaded in an unguided process, and only a fraction of them end up leading to actual parallel content. In this work we propose a smart crawling method that guides the crawl towards finding parallel content more rapidly. Our approach builds on two different models: one that infers the language of a document from its URL, and another that infers whether a pair of URLs link to parallel documents. We evaluate both models in isolation and their integration into a crawling tool. The results demonstrate the individual effectiveness of both models and highlight that their combination enables the early discovery of parallel content during crawling, leading to a reduction in the amount of downloaded documents deemed useless, and yielding a greater quantity of parallel documents compared to conventional crawling approaches.

5/24/2024