Cross-lingual continual pre-training of large language models (LLMs) initially trained on English corpus allows us to leverage the vast amount of English language resources and reduce the pre-training cost. In this study, we constructed Swallow, an LLM with enhanced Japanese capability, by extending the vocabulary of Llama 2 to include Japanese characters and conducting continual pre-training on a large Japanese web corpus. Experimental results confirmed that the performance on Japanese tasks drastically improved through continual pre-training, and the performance monotonically increased with the amount of training data up to 100B tokens. Consequently, Swallow achieved superior performance compared to other LLMs that were trained from scratch in English and Japanese. An analysis of the effects of continual pre-training revealed that it was particularly effective for Japanese question answering tasks. Furthermore, to elucidate effective methodologies for cross-lingual continual pre-training from English to Japanese, we investigated the impact of vocabulary expansion and the effectiveness of incorporating parallel corpora. The results showed that the efficiency gained through vocabulary expansion had no negative impact on performance, except for the summarization task, and that the combined use of parallel corpora enhanced translation ability.

## Overview

- This paper explores a method for enhancing the Japanese language capabilities of large language models (LLMs) through a process called "continual pre-training."
- The goal is to improve performance on Japanese-language tasks by continuing to train an English-centric model (Llama 2) on a large Japanese web corpus, reusing what the model already learned instead of training from scratch.
- The researchers evaluate the resulting model, Swallow, on Japanese benchmarks and compare it with other LLMs that were trained from scratch in English and Japanese.

## Plain English Explanation

The paper describes a way to make large language models (LLMs) better at working with the Japanese language. LLMs are powerful AI systems that can understand and generate human language, but many of them, including Llama 2, are trained mostly on English text, which makes them less effective in languages like Japanese.

To address this, the researchers use a "continual pre-training" approach. They take the already-trained model, add Japanese characters to its vocabulary, and keep training it on a large corpus of Japanese web text. The model develops a deeper understanding of Japanese while building on the English training it already has, which is cheaper than training a new model from scratch.

The researchers tested this approach on Japanese language benchmarks, including question answering, summarization, and translation. Performance on Japanese tasks improved drastically and kept rising as more training data was used, up to 100B tokens. The resulting model, Swallow, outperformed other LLMs that were trained from scratch in English and Japanese.

## Technical Explanation

The researchers enhance the Japanese capabilities of an existing LLM through cross-lingual "continual pre-training": pre-training of Llama 2 is resumed on Japanese data after its initial, English-centric pre-training is complete. The resulting model is called Swallow.

The key idea is to leverage the knowledge an LLM has acquired from vast English language resources and then specialize the model for Japanese. This is done in two steps: the vocabulary of Llama 2 is extended to include Japanese characters, and pre-training is continued on a large Japanese web corpus.
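To make the vocabulary-extension step concrete, here is a minimal sketch of why adding Japanese tokens shortens sequences. It is a toy under stated assumptions, not the authors' implementation: the greedy tokenizer, the vocabulary entries, the embedding size, and the choice to initialize each new embedding as the mean of the embeddings the original vocabulary would have used for that token are all hypothetical illustrations.

```python
import random
from typing import Dict, List, Tuple

NUM_BYTE_IDS = 256  # ids 0..255 stand for raw UTF-8 bytes (toy byte fallback)


def tokenize(text: str, vocab: Dict[str, int]) -> List[int]:
    """Greedy longest-match tokenization with UTF-8 byte fallback."""
    max_len = max((len(tok) for tok in vocab), default=0)
    ids: List[int] = []
    i = 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if piece in vocab:
                ids.append(vocab[piece])
                i += n
                break
        else:
            # No vocabulary entry matched: emit one id per UTF-8 byte.
            ids.extend(text[i].encode("utf-8"))
            i += 1
    return ids


def expand_vocab(
    vocab: Dict[str, int],
    embeddings: List[List[float]],
    new_tokens: List[str],
) -> Tuple[Dict[str, int], List[List[float]]]:
    """Append new tokens; each new embedding row is the mean of the rows
    the ORIGINAL vocabulary would have used for that token (an assumed
    initialization heuristic, chosen here for illustration)."""
    new_vocab = dict(vocab)
    new_emb = [row[:] for row in embeddings]
    dim = len(embeddings[0])
    for tok in new_tokens:
        if tok in new_vocab:
            continue
        old_ids = tokenize(tok, vocab)
        mean = [sum(embeddings[j][d] for j in old_ids) / len(old_ids)
                for d in range(dim)]
        new_vocab[tok] = len(new_emb)
        new_emb.append(mean)
    return new_vocab, new_emb


# Hypothetical English-centric vocabulary and random toy embeddings.
rng = random.Random(0)
DIM = 4
base_vocab = {"the": 256, " ": 257, "model": 258}
base_emb = [[rng.uniform(-1, 1) for _ in range(DIM)]
            for _ in range(NUM_BYTE_IDS + len(base_vocab))]

text = "日本語の言語モデル"
new_vocab, new_emb = expand_vocab(
    base_vocab, base_emb, ["日本語", "の", "言語", "モデル"]
)

tokens_before = len(tokenize(text, base_vocab))
tokens_after = len(tokenize(text, new_vocab))
print(tokens_before, tokens_after)  # 27 4
```

With the original vocabulary every Japanese character falls back to three UTF-8 bytes, so the nine-character string costs 27 tokens; with four added entries it costs 4. Shorter sequences mean more text fits into the same training and inference budget, which is the efficiency gain the abstract attributes to vocabulary expansion.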

The researchers evaluate the approach on Japanese benchmarks covering tasks such as question answering, summarization, and translation, and compare Swallow against other LLMs that were trained from scratch in English and Japanese. Related lines of work include [domain-specific pre-training](https://aimodels.fyi/papers/arxiv/pretraining-updating-language-domain-specific-large-language), [translation-based pre-training](https://aimodels.fyi/papers/arxiv/novel-paradigm-boosting-translation-capabilities-large-language), and [continual learning](https://aimodels.fyi/papers/arxiv/continual-learning-large-language-models-comprehensive-survey).

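The experiments show that continual pre-training drastically improves performance on Japanese tasks, that performance increases monotonically with the amount of training data up to 100B tokens, and that the gains are largest on Japanese question answering. The researchers also study two design choices: vocabulary expansion, which improved efficiency with no negative impact on performance except on summarization, and the addition of parallel corpora, which enhanced translation ability.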
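The abstract reports that using parallel corpora together with the Japanese data enhanced translation ability, but it does not say how the sentence pairs were presented to the model. The sketch below shows one hypothetical way to do it: each English-Japanese pair is rendered as a single text sample and shuffled into the monolingual stream. The template, the function names, and the example sentences are assumptions made for illustration only.

```python
import random
from typing import List, Tuple


def format_pair(en: str, ja: str) -> str:
    """Render one sentence pair as a plain-text training sample.
    The 'English:/Japanese:' template is a hypothetical choice."""
    return f"English: {en}\nJapanese: {ja}"


def build_training_stream(
    mono_docs: List[str],
    pairs: List[Tuple[str, str]],
    seed: int = 0,
) -> List[str]:
    """Mix monolingual documents and formatted pairs into one shuffled list."""
    samples = list(mono_docs) + [format_pair(en, ja) for en, ja in pairs]
    random.Random(seed).shuffle(samples)  # seeded for reproducibility
    return samples


# Toy data standing in for a Japanese web corpus and a parallel corpus.
mono_docs = ["今日は良い天気です。", "東京は日本の首都です。"]
pairs = [("Good morning.", "おはようございます。")]

stream = build_training_stream(mono_docs, pairs, seed=0)
for sample in stream:
    print(sample)
    print("---")
```

Because the pairs become ordinary text samples, the training loop itself needs no changes; only the data mixture does. The ratio of parallel to monolingual data is left implicit here and would be a tunable choice in practice.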

## Critical Analysis

The paper presents a compelling approach for improving the Japanese language capabilities of large language models, but it also acknowledges several caveats and limitations. One key concern is the potential for the continual pre-training process to degrade the model's performance on non-Japanese language tasks, as the model becomes more specialized.

Additionally, the analysis of where continual pre-training helps is reported at the task level: the gains are largest for Japanese question answering. A finer-grained look at the specific linguistic phenomena that benefit the most would help inform future research and applications.

Another potential limitation is the reliance on a specific set of Japanese language benchmarks for evaluating the model's performance. It would be valuable to explore the model's real-world performance on a wider range of Japanese language tasks, including more conversational and domain-specific applications.

Finally, although continual pre-training is motivated as a way to reduce pre-training cost, training on up to 100B tokens still requires a substantial compute budget, which could be a practical consideration for some users or developers. Further research on the scalability and efficiency of this method would be valuable.

## Conclusion

This paper presents Swallow, an LLM with enhanced Japanese capability built by extending the vocabulary of Llama 2 and continuing its pre-training on a large Japanese web corpus. The resulting model improves drastically on Japanese tasks, especially question answering, and outperforms other LLMs trained from scratch in English and Japanese.

The results of the study suggest that this approach could be a promising way to adapt LLMs for specific language domains, with potential applications in areas such as machine translation, dialogue systems, and content generation. The insights from this research could also inform the development of more efficient and targeted language model training strategies, contributing to the ongoing advancement of natural language processing technologies.