Pretraining and Updating Language- and Domain-specific Large Language Model: A Case Study in Japanese Business Domain

2404.08262

Published 4/17/2024 by Kosuke Takahashi, Takahiro Omi, Kosuke Arima, Tatsuya Ishigaki

Abstract

Several previous studies have considered language- and domain-specific large language models (LLMs) as separate topics. This study explores the combination of a non-English language and a high-demand industry domain, focusing on a Japanese business-specific LLM. This type of model requires expertise in the business domain, strong language skills, and regular updates of its knowledge. We trained a 13-billion-parameter LLM from scratch using a new dataset of business texts and patents, and continually pretrained it with the latest business documents. Further, we propose a new benchmark for Japanese business domain question answering (QA) and evaluate our models on it. The results show that our pretrained model improves QA accuracy without losing general knowledge, and that continual pretraining enhances adaptation to new information. Our pretrained model and business domain benchmark are publicly available.

Overview

  • The paper explores pretraining and updating a large language model (LLM) for the Japanese business domain.
  • The researchers trained a 13-billion-parameter LLM from scratch on a new dataset of business texts and patents, then continually pretrained it on the latest business documents.
  • They propose a new benchmark for Japanese business-domain question answering (QA) and evaluate their models on it.

Plain English Explanation

In this study, the researchers wanted to create a language model that is particularly useful for Japanese business text. Rather than starting from an existing model, they trained a 13-billion-parameter large language model from scratch on a new dataset of business texts and patents.

A business model also needs regular updates of its knowledge, so the researchers then continued training it on the latest business documents. This step, called "continual pretraining," lets the model take in new information without being retrained from the beginning.

To measure the results, the researchers built a new benchmark for Japanese business-domain question answering (QA) and evaluated their models on it. They report that the pretrained model improves QA accuracy without losing general knowledge, and that continual pretraining improves adaptation to new information.

The key insight here is that while large language models can be very powerful, they may not be optimally suited for specific domains or tasks. By further training these models on domain-specific data, researchers can create more specialized and effective AI systems for real-world applications, such as assisting with business operations or automating certain business processes.

Technical Explanation

The researchers pretrained a 13-billion-parameter LLM from scratch on a new dataset of business texts and patents, then continually pretrained it on the latest business documents. The abstract frames this design around three requirements for a business-specific model: expertise in the business domain, strong language skills, and regular updates of its knowledge.
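The train-then-update workflow can be illustrated with a deliberately tiny stand-in. The sketch below is not the authors' neural training setup: it assumes a character-bigram model with add-one smoothing, and the class name `BigramLM` and all sentences are hypothetical placeholders. It only shows the general idea that continuing to train on newer text, without discarding what was already learned, lowers perplexity on recent material.

```python
import math
from collections import Counter


class BigramLM:
    """Toy character-bigram language model with add-one smoothing."""

    def __init__(self):
        self.pair_counts = Counter()  # counts of (previous char, next char)
        self.prev_counts = Counter()  # counts of previous char
        self.vocab = set()

    def train(self, texts):
        """Add counts in place; calling this again continues training."""
        for text in texts:
            self.vocab.update(text)
            for prev, nxt in zip(text, text[1:]):
                self.pair_counts[(prev, nxt)] += 1
                self.prev_counts[prev] += 1

    def perplexity(self, text):
        """Per-character perplexity of `text` under the current counts."""
        pairs = list(zip(text, text[1:]))
        if not pairs:
            raise ValueError("text must contain at least two characters")
        vocab_size = len(self.vocab | set(text))
        log_prob = 0.0
        for prev, nxt in pairs:
            numerator = self.pair_counts[(prev, nxt)] + 1
            denominator = self.prev_counts[prev] + vocab_size
            log_prob += math.log(numerator / denominator)
        return math.exp(-log_prob / len(pairs))


# Made-up English placeholder sentences, not the paper's data.
initial_corpus = [
    "the company reported annual revenue",
    "the patent describes a new device",
]
latest_documents = ["the startup raised new funding in april"]
recent_text = "the startup raised funding"

model = BigramLM()
model.train(initial_corpus)        # stage 1: train from scratch
ppl_before = model.perplexity(recent_text)

model.train(latest_documents)      # stage 2: continue training on newer text
ppl_after = model.perplexity(recent_text)

print(f"perplexity on recent text: {ppl_before:.2f} -> {ppl_after:.2f}")
```

In a real setting the second stage would resume gradient-based training of the same network on the newer corpus; the count-based update here is only the simplest analogue of that step.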

For evaluation, the researchers propose a new benchmark for Japanese business-domain question answering (QA). On this benchmark, they report that the pretrained model improves QA accuracy without losing general knowledge, and that continual pretraining enhances adaptation to new information.
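A comparison of this kind comes down to scoring each model's answers against reference answers. The sketch below assumes an exact-match accuracy metric, which is an assumption and may differ from the scoring the paper actually uses. The function name `exact_match_accuracy` and every question answer shown are hypothetical placeholders, not benchmark items or reported results.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions equal to the reference after trimming and lowercasing."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one reference answer is required")
    correct = sum(
        pred.strip().lower() == ref.strip().lower()
        for pred, ref in zip(predictions, references)
    )
    return correct / len(references)


# Placeholder answers invented for illustration only.
references = ["tokyo", "2023", "yes", "acme"]
base_predictions = ["Tokyo", "2021", "no", "acme"]
updated_predictions = ["Tokyo", "2023", "no", "Acme "]

base_accuracy = exact_match_accuracy(base_predictions, references)
updated_accuracy = exact_match_accuracy(updated_predictions, references)

print(f"base model accuracy:    {base_accuracy:.2f}")
print(f"updated model accuracy: {updated_accuracy:.2f}")
```

Exact match is strict, so free-form answers often need a softer metric or human review; it is used here only because it is the simplest way to turn a set of answers into a single accuracy number.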

The key contributions of this work are the combination of language- and domain-specific pretraining in a single model, the continual pretraining step that keeps the model's knowledge current, and the new QA benchmark for the Japanese business domain. The pretrained model and the benchmark are publicly available.

Critical Analysis

The study has several potential limitations. First, the pretraining dataset, while substantial, may not fully capture the breadth and complexity of the Japanese business domain. Important genres or topics may be underrepresented in the training data, which could limit the model's performance on certain tasks.

Additionally, a question answering benchmark, while relevant to the business domain, may not fully represent the real-world challenges faced by companies and industry professionals. Future work could involve more extensive testing with end users to better understand the practical applications and limitations of the model.

Another concern is possible bias in the pretraining data or the model's outputs. As with any large language model, there is a risk that the system may perpetuate or amplify societal biases present in the training data. The abstract does not discuss this issue, and further investigation into the fairness and ethics of the model would be valuable.

Despite these limitations, the study represents an important step towards creating more specialized and effective language models for real-world business applications. The researchers' approach of domain-specific pretraining and comprehensive evaluation could serve as a useful blueprint for similar efforts in other industries or languages, such as creating German-focused language models for the healthcare domain or developing Chinese-centric language models for the tech sector.

Conclusion

This paper demonstrates the value of pretraining and updating large language models for specific domains, using the Japanese business sector as a case study. By pretraining a 13-billion-parameter LLM from scratch on business texts and patents, and then continually pretraining it on the latest business documents, the researchers created a specialized model that improves accuracy on their business-domain QA benchmark without losing general knowledge.

The insights from this work could inform the development of similar domain-specific language models in other industries and languages, potentially leading to more effective AI-powered tools and services for businesses and professionals. As large language models continue to advance, it will be important for researchers and developers to explore ways to tailor these powerful systems to the unique needs and challenges of different real-world applications and contexts.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Construction of Domain-specified Japanese Large Language Model for Finance through Continual Pre-training

Masanori Hirano, Kentaro Imajo

Large language models (LLMs) are now widely used in various fields, including finance. However, Japanese financial-specific LLMs have not been proposed yet. Hence, this study aims to construct a Japanese financial-specific LLM through continual pre-training. Before tuning, we constructed Japanese financial-focused datasets for continual pre-training. As a base model, we employed a Japanese LLM that achieved state-of-the-art performance on Japanese financial benchmarks among the 10-billion-class parameter models. After continual pre-training using the datasets and the base model, the tuned model performed better than the original model on the Japanese financial benchmarks. Moreover, the outputs comparison results reveal that the tuned model's outputs tend to be better than the original model's outputs in terms of the quality and length of the answers. These findings indicate that domain-specific continual pre-training is also effective for LLMs. The tuned model is publicly available on Hugging Face.

4/17/2024

Continual Pre-Training for Cross-Lingual LLM Adaptation: Enhancing Japanese Language Capabilities

Kazuki Fujii, Taishi Nakamura, Mengsay Loem, Hiroki Iida, Masanari Ohi, Kakeru Hattori, Hirai Shota, Sakae Mizuki, Rio Yokota, Naoaki Okazaki

Cross-lingual continual pre-training of large language models (LLMs) initially trained on English corpus allows us to leverage the vast amount of English language resources and reduce the pre-training cost. In this study, we constructed Swallow, an LLM with enhanced Japanese capability, by extending the vocabulary of Llama 2 to include Japanese characters and conducting continual pre-training on a large Japanese web corpus. Experimental results confirmed that the performance on Japanese tasks drastically improved through continual pre-training, and the performance monotonically increased with the amount of training data up to 100B tokens. Consequently, Swallow achieved superior performance compared to other LLMs that were trained from scratch in English and Japanese. An analysis of the effects of continual pre-training revealed that it was particularly effective for Japanese question answering tasks. Furthermore, to elucidate effective methodologies for cross-lingual continual pre-training from English to Japanese, we investigated the impact of vocabulary expansion and the effectiveness of incorporating parallel corpora. The results showed that the efficiency gained through vocabulary expansion had no negative impact on performance, except for the summarization task, and that the combined use of parallel corpora enhanced translation ability.

4/30/2024

Quantifying Memorization of Domain-Specific Pre-trained Language Models using Japanese Newspaper and Paywalls

Shotaro Ishihara

Dominant pre-trained language models (PLMs) have been successful in high-quality natural language generation. However, the analysis of their generation is not mature: do they acquire generalizable linguistic abstractions, or do they simply memorize and recover substrings of the training data? Especially, few studies focus on domain-specific PLM. In this study, we pre-trained domain-specific GPT-2 models using a limited corpus of Japanese newspaper articles and quantified memorization of training data by comparing them with general Japanese GPT-2 models. Our experiments revealed that domain-specific PLMs sometimes copy and paste on a large scale. Furthermore, we replicated the empirical finding that memorization is related to duplication, model size, and prompt length, in Japanese the same as in previous English studies. Our evaluations are relieved from data contamination concerns by focusing on newspaper paywalls, which prevent their use as training data. We hope that our paper encourages a sound discussion such as the security and copyright of PLMs.

4/29/2024

A Novel Paradigm Boosting Translation Capabilities of Large Language Models

Jiaxin Guo, Hao Yang, Zongyao Li, Daimeng Wei, Hengchao Shang, Xiaoyu Chen

This paper presents a study on strategies to enhance the translation capabilities of large language models (LLMs) in the context of machine translation (MT) tasks. The paper proposes a novel paradigm consisting of three stages: Secondary Pre-training using Extensive Monolingual Data, Continual Pre-training with Interlinear Text Format Documents, and Leveraging Source-Language Consistent Instruction for Supervised Fine-Tuning. Previous research on LLMs focused on various strategies for supervised fine-tuning (SFT), but their effectiveness has been limited. While traditional machine translation approaches rely on vast amounts of parallel bilingual data, our paradigm highlights the importance of using smaller sets of high-quality bilingual data. We argue that the focus should be on augmenting LLMs' cross-lingual alignment abilities during pre-training rather than solely relying on extensive bilingual data during SFT. Experimental results conducted using the Llama2 model, particularly on Chinese-Llama2 after monolingual augmentation, demonstrate the improved translation capabilities of LLMs. A significant contribution of our approach lies in Stage2: Continual Pre-training with Interlinear Text Format Documents, which requires less than 1B training data, making our method highly efficient. Additionally, in Stage3, we observed that setting instructions consistent with the source language benefits the supervised fine-tuning process. Experimental results demonstrate that our approach surpasses previous work and achieves superior performance compared to models such as NLLB-54B and GPT3.5-text-davinci-003, despite having a significantly smaller parameter count of only 7B or 13B. This achievement establishes our method as a pioneering strategy in the field of machine translation.

4/16/2024