Open Japanese large language models (LLMs) have been trained on the Japanese portions of corpora such as CC-100, mC4, and OSCAR. However, these corpora were not created for the quality of Japanese texts. This study builds a large Japanese web corpus by extracting and refining text from the Common Crawl archive (21 snapshots of approximately 63.4 billion pages crawled between 2020 and 2023). This corpus consists of approximately 312.1 billion characters (approximately 173 million pages), which is the largest of all available training corpora for Japanese LLMs, surpassing CC-100 (approximately 25.8 billion characters), mC4 (approximately 239.7 billion characters) and OSCAR 23.10 (approximately 74 billion characters). To confirm the quality of the corpus, we performed continual pre-training on Llama 2 7B, 13B, 70B, Mistral 7B v0.1, and Mixtral 8x7B Instruct as base LLMs and gained consistent (6.6-8.1 points) improvements on Japanese benchmark datasets. We also demonstrate that the improvement on Llama 2 13B brought from the presented corpus was the largest among those from other existing corpora.

## Overview

- This paper describes the construction of a large Japanese web corpus for use in training large language models.
- The authors extracted Japanese text from the Common Crawl archive: 21 snapshots covering approximately 63.4 billion pages crawled between 2020 and 2023.
- They then refined the extracted text to remove low-quality content so that the corpus is suitable for training large language models.
- The resulting corpus contains approximately 312.1 billion characters (approximately 173 million pages), which the authors report is the largest of the available training corpora for Japanese LLMs.

## Plain English Explanation

Large language models, such as those used for [natural language processing](https://aimodels.fyi/papers/arxiv/chinese-tiny-llm-pretraining-chinese-centric-large) tasks, need a great deal of text to train on. In this paper, the researchers set out to create a large, high-quality corpus of Japanese web content that could be used to train these models.

Rather than crawling the web themselves, they started from Common Crawl, a public archive of crawled web pages, and extracted the Japanese text from it. Web text covers many different topics and styles of writing, but not all of it is equally useful for training language models, so the researchers also spent time cleaning up the data, removing low-quality or irrelevant material.

After this processing, the final corpus contained approximately 312.1 billion characters of Japanese text (about 173 million pages) - an immense amount of data that can be used to [pre-train](https://aimodels.fyi/papers/arxiv/pretraining-updating-language-domain-specific-large-language) large language models specifically for the Japanese language. This will enable the development of more capable and accurate AI systems that can understand and generate [natural-sounding Japanese](https://aimodels.fyi/papers/arxiv/simultaneous-interpretation-corpus-construction-by-large-language).

## Technical Explanation

The authors did not run their own crawler. They started from the Common Crawl archive, using 21 snapshots that together hold approximately 63.4 billion pages crawled between 2020 and 2023, and extracted the Japanese text from it. They then processed the raw web data to remove low-quality content, such as pages with excessive ads, broken links, or non-textual content.

To further improve the quality of the corpus, the researchers applied a number of filters. These included removing near-duplicate pages, pages with too little text, and pages with a high proportion of non-Japanese text. They also removed pages containing inappropriate or offensive content.
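The filters described above can be sketched in a few lines of Python. This is only an illustration, not the authors' pipeline: the thresholds (`MIN_CHARS`, `MIN_JAPANESE_RATIO`, `threshold`), the function names, and the use of character n-gram Jaccard similarity for near-duplicate detection are all assumptions made for the example.

```python
import re
import unicodedata

# Hiragana, katakana, and CJK unified ideographs (kanji).
JAPANESE_CHAR = re.compile(r"[\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF]")

MIN_CHARS = 400           # hypothetical minimum page length in characters
MIN_JAPANESE_RATIO = 0.5  # hypothetical minimum share of Japanese characters


def japanese_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are kana or kanji."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(1 for c in chars if JAPANESE_CHAR.match(c))
    return hits / len(chars)


def keep_page(text: str, min_chars: int = MIN_CHARS,
              min_ratio: float = MIN_JAPANESE_RATIO) -> bool:
    """Keep a page only if it is long enough and mostly Japanese."""
    text = unicodedata.normalize("NFKC", text)
    return len(text) >= min_chars and japanese_ratio(text) >= min_ratio


def char_ngrams(text: str, n: int = 5) -> set:
    """Set of character n-grams; Japanese has no spaces, so we shingle characters."""
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two n-gram sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def drop_near_duplicates(pages: list, threshold: float = 0.8) -> list:
    """Keep the first page of each group of near-duplicates (quadratic time)."""
    kept, kept_shingles = [], []
    for page in pages:
        shingles = char_ngrams(page)
        if all(jaccard(shingles, seen) < threshold for seen in kept_shingles):
            kept.append(page)
            kept_shingles.append(shingles)
    return kept
```

The pairwise comparison in `drop_near_duplicates` is quadratic in the number of pages, so it only works for small samples. At the scale of Common Crawl, an approximate method such as MinHash with locality-sensitive hashing would be the usual substitute.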

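The final clean corpus contains approximately 312.1 billion characters of Japanese text (approximately 173 million pages). According to the authors, this makes it the largest available training corpus for Japanese LLMs, ahead of CC-100 (approximately 25.8 billion characters), mC4 (approximately 239.7 billion characters), and OSCAR 23.10 (approximately 74 billion characters). To confirm its quality, they performed continual pre-training on Llama 2 7B, 13B, and 70B, Mistral 7B v0.1, and Mixtral 8x7B Instruct, and report consistent improvements of 6.6 to 8.1 points on Japanese benchmark datasets. The authors believe this resource will enable the development of more capable Japanese language AI systems, including those for [natural language processing](https://aimodels.fyi/papers/arxiv/construction-domain-specified-japanese-large-language-model) and [machine translation](https://aimodels.fyi/papers/arxiv/continual-pre-training-cross-lingual-llm-adaptation).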
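To put the sizes reported in the abstract in proportion, the short calculation below derives a few ratios from them. The input figures are the approximate values the abstract gives; the variable and function names are made up for this example, and the derived numbers are back-of-the-envelope values rather than results reported by the authors.

```python
# Approximate figures as reported in the paper's abstract.
CORPUS_CHARS = 312.1e9   # characters in the new corpus
CORPUS_PAGES = 173e6     # pages in the new corpus
CRAWLED_PAGES = 63.4e9   # pages in the 21 Common Crawl snapshots (all languages)

OTHER_CORPORA_CHARS = {
    "CC-100": 25.8e9,
    "mC4": 239.7e9,
    "OSCAR 23.10": 74e9,
}


def size_ratios() -> dict:
    """How many times larger the new corpus is than each existing corpus."""
    return {name: CORPUS_CHARS / chars
            for name, chars in OTHER_CORPORA_CHARS.items()}


# Average page length, and the share of all crawled pages that ended up
# in the corpus (the crawl covers every language, not just Japanese).
avg_chars_per_page = CORPUS_CHARS / CORPUS_PAGES
retained_fraction = CORPUS_PAGES / CRAWLED_PAGES

if __name__ == "__main__":
    for name, ratio in size_ratios().items():
        print(f"{name}: {ratio:.1f}x smaller than the new corpus")
    print(f"average characters per page: {avg_chars_per_page:.0f}")
    print(f"share of crawled pages retained: {retained_fraction:.2%}")
```

By these figures the corpus is roughly 12 times the size of the Japanese portion of CC-100, about 1.3 times that of mC4, and about 4.2 times that of OSCAR 23.10, with an average of around 1,800 characters per page.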

## Critical Analysis

The researchers thoroughly describe their data collection and curation process, which is important for ensuring the quality and representativeness of the final corpus. However, they do not provide much detail on the specific sources of the web pages (e.g., the distribution across different types of sites) or the geographic/demographic coverage of the content.

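The authors do compare the corpus with other Japanese datasets: it is larger in character count than the Japanese portions of CC-100, mC4, and OSCAR 23.10, and they report that the improvement on Llama 2 13B from this corpus was the largest among those from existing corpora. Still, a size comparison and benchmark scores say little about how the corpora differ in content, so it would be helpful to understand in more detail how this corpus fits in with the broader landscape of resources for Japanese NLP and language model training.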

Lastly, the paper does not discuss potential biases or limitations of the web-crawled data, such as over-representation of certain topics or perspectives. Further analysis of the corpus characteristics and potential issues would strengthen the critical evaluation of this work.

## Conclusion

This paper presents the construction of a large, high-quality Japanese web corpus containing approximately 312.1 billion characters of text. The authors describe their process of extracting text from Common Crawl and then filtering and cleaning it to create a resource suitable for training advanced language models.

According to the authors, the resulting corpus is the largest available training corpus for Japanese LLMs, and continual pre-training on it improved several base models on Japanese benchmarks. It will likely enable significant advancements in [Japanese natural language processing](https://aimodels.fyi/papers/arxiv/chinese-tiny-llm-pretraining-chinese-centric-large) and the development of more capable AI systems for the Japanese market. This work contributes an important building block for the future of Japanese language technology.