Using crowdsourcing, we collected more than 10,000 URL pairs (parallel top page pairs) of bilingual websites that contain parallel documents and created a Japanese-Chinese parallel corpus of 4.6M sentence pairs from these websites. We used a Japanese-Chinese bilingual dictionary of 160K word pairs for document and sentence alignment. We then used 1.2M high-quality Japanese-Chinese sentence pairs to train a parallel corpus filter based on statistical language models and word translation probabilities. We compared the translation accuracy of the model trained on these 4.6M sentence pairs with that of the model trained on Japanese-Chinese sentence pairs from CCMatrix (12.4M), a parallel corpus from global web mining. Although our corpus is only one-third the size of CCMatrix, we found that the accuracy of the two models was comparable and confirmed that it is feasible to use crowdsourcing for web mining of parallel data.

## Overview

- Collected over 10,000 URL pairs of bilingual websites with parallel documents
- Created a Japanese-Chinese parallel corpus of 4.6M sentence pairs from these websites
- Used a Japanese-Chinese bilingual dictionary of 160K word pairs for document and sentence alignment
- Trained a parallel corpus filter, based on statistical language models and word translation probabilities, using 1.2M high-quality Japanese-Chinese sentence pairs
- Compared the translation accuracy of a model trained on the 4.6M sentence pairs with that of a model trained on 12.4M sentence pairs from CCMatrix
- Found the accuracy of the two models was comparable, demonstrating the feasibility of using crowdsourcing for web mining of parallel data

## Plain English Explanation

The researchers used crowdsourcing to [collect a large number of website pairs](https://aimodels.fyi/papers/arxiv/building-large-japanese-web-corpus-large-language): more than 10,000 pairs of URLs pointing to the Japanese and Chinese versions of the same bilingual website. They used these websites to create a [parallel corpus](https://aimodels.fyi/papers/arxiv/kazparc-kazakh-parallel-corpus-machine-translation), a dataset of sentence-level translations between the two languages. This corpus contained 4.6 million sentence pairs.

To align documents and sentences between the two languages, the researchers used a Japanese-Chinese dictionary with 160,000 word pairs. They then used 1.2 million high-quality sentence pairs to train a filtering model. This model identifies which sentence pairs are good translations of each other, based on statistical patterns in the two languages.

The researchers compared a machine translation model trained on their 4.6 million sentence pair corpus with one trained on CCMatrix, a larger corpus of 12.4 million sentence pairs produced by global web mining. Even though their corpus was only about a third the size (4.6M versus 12.4M, roughly 37%), the translation accuracy of the two models was similar. This suggests that crowdsourcing the discovery of bilingual websites is a feasible way to mine useful parallel data from the web.

## Technical Explanation

Using crowdsourcing, the researchers [collected more than 10,000 URL pairs](https://aimodels.fyi/papers/arxiv/building-large-japanese-web-corpus-large-language) (parallel top page pairs) of bilingual websites that contained parallel documents. From these websites, they created a Japanese-Chinese parallel corpus containing 4.6 million sentence pairs.

To align the documents and sentences between the two languages, the researchers used a Japanese-Chinese bilingual dictionary with 160,000 word pairs. They then used 1.2 million high-quality sentence pairs to train a parallel corpus filter. This filter used statistical language models and word translation probabilities to identify which sentence pairs were good translations of each other.
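The two ideas in this step, dictionary-based matching and filtering by word translation probabilities, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the toy dictionary, the probability table, and both thresholds are hypothetical, the probability score is a simplified IBM Model 1 style average, and the language-model component of the filter is omitted. Sentences are assumed to be tokenized already.

```python
import math

# Toy stand-in for a bilingual dictionary (hypothetical entries;
# the dictionary described in the paper has 160K word pairs).
JA_ZH_DICT = {
    "猫": {"猫"},
    "水": {"水"},
    "飲む": {"喝"},
    "図書館": {"图书馆"},
}

# Toy word translation probabilities t(zh | ja) (hypothetical values;
# in practice these would be estimated from trusted parallel data).
T_PROB = {
    ("猫", "猫"): 0.9,
    ("水", "水"): 0.9,
    ("飲む", "喝"): 0.8,
}


def dictionary_overlap(ja_tokens, zh_tokens, dictionary):
    """Fraction of Japanese tokens that have at least one dictionary
    translation present in the Chinese sentence."""
    if not ja_tokens:
        return 0.0
    zh_set = set(zh_tokens)
    hits = sum(1 for tok in ja_tokens if dictionary.get(tok, set()) & zh_set)
    return hits / len(ja_tokens)


def translation_log_prob(ja_tokens, zh_tokens, t_prob, floor=1e-6):
    """Length-normalized log-probability of the Chinese tokens given the
    Japanese tokens, in the style of IBM Model 1 (uniform alignment)."""
    if not ja_tokens or not zh_tokens:
        return float("-inf")
    total = 0.0
    for zh in zh_tokens:
        # Average t(zh | ja) over all Japanese tokens, floored to avoid log(0).
        p = sum(t_prob.get((ja, zh), 0.0) for ja in ja_tokens) / len(ja_tokens)
        total += math.log(max(p, floor))
    return total / len(zh_tokens)


def keep_pair(ja_tokens, zh_tokens, dictionary, t_prob,
              min_overlap=0.5, min_log_prob=-5.0):
    """Keep a candidate pair only if both scores clear their thresholds
    (threshold values are illustrative assumptions)."""
    return (dictionary_overlap(ja_tokens, zh_tokens, dictionary) >= min_overlap
            and translation_log_prob(ja_tokens, zh_tokens, t_prob) >= min_log_prob)


ja_sentence = ["猫", "が", "水", "を", "飲む"]
zh_sentence = ["猫", "喝", "水"]
print(dictionary_overlap(ja_sentence, zh_sentence, JA_ZH_DICT))
print(keep_pair(ja_sentence, zh_sentence, JA_ZH_DICT, T_PROB))
print(keep_pair(ja_sentence, ["图书馆"], JA_ZH_DICT, T_PROB))
```

In this toy setup the matching pair passes both checks, while the unrelated Chinese sentence fails because none of its words is a dictionary translation of a Japanese token. A real filter would tune the thresholds on held-out data and add language-model scores for fluency.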

The researchers compared the translation accuracy of a model trained on their 4.6 million sentence pair corpus with that of a model trained on 12.4 million sentence pairs from CCMatrix, a parallel corpus built by global web mining. Although their corpus is only about one-third the size, the translation accuracy of the two models was comparable. This confirms that crowdsourcing, used here for [identifying parallel web pages](https://aimodels.fyi/papers/arxiv/meta4xnli-crosslingual-parallel-corpus-metaphor-detection-interpretation), is a feasible way to mine parallel data from the web for machine translation.

## Critical Analysis

The researchers acknowledge that their corpus of 4.6 million sentence pairs is relatively small compared to the 12.4 million sentence pairs in the CCMatrix corpus. They also note that the quality of the sentence pairs in their corpus may vary, as the source websites were identified through crowdsourcing and the sentence pairs were then aligned automatically.

Although the researchers found that the translation accuracy of the two models was comparable, CCMatrix may still have an advantage in certain domains or use cases. The researchers encourage further research to explore the strengths and limitations of both approaches to parallel corpus construction.

Additionally, the researchers do not provide much detail on the specific crowdsourcing techniques they used to collect the website pairs. [More research](https://aimodels.fyi/papers/arxiv/simultaneous-interpretation-corpus-construction-by-large-language) may be needed to understand the best practices and challenges of using crowdsourcing for parallel data collection.

Overall, the researchers have demonstrated a promising approach to building useful parallel corpora through crowdsourcing, but there is still room for further exploration and refinement of the methodology.

## Conclusion

This research shows that crowdsourcing can be used to find bilingual websites from which a large-scale parallel corpus for machine translation can be mined. The researchers created a Japanese-Chinese corpus of 4.6 million sentence pairs, and a model trained on it achieved translation accuracy comparable to that of a model trained on the much larger CCMatrix corpus.

This suggests that crowdsourcing can be a viable approach for building parallel data, which is a critical component of [cross-lingual language models](https://aimodels.fyi/papers/arxiv/continual-pre-training-cross-lingual-llm-adaptation) and other multilingual natural language processing applications. As machine translation continues to advance, techniques like this could help extend language technologies to more language pairs and more diverse global audiences.