Pretraining and Updating Language- and Domain-specific Large Language Model: A Case Study in Japanese Business Domain

2404.08262

Published 4/17/2024 by Kosuke Takahashi, Takahiro Omi, Kosuke Arima, Tatsuya Ishigaki

Abstract

Several previous studies have considered language- and domain-specific large language models (LLMs) as separate topics. This study explores the combination of a non-English language and a high-demand industry domain, focusing on a Japanese business-specific LLM. This type of a model requires expertise in the business domain, strong language skills, and regular updates of its knowledge. We trained a 13-billion-parameter LLM from scratch using a new dataset of business texts and patents, and continually pretrained it with the latest business documents. Further we propose a new benchmark for Japanese business domain question answering (QA) and evaluate our models on it. The results show that our pretrained model improves QA accuracy without losing general knowledge, and that continual pretraining enhances adaptation to new information. Our pretrained model and business domain benchmark are publicly available.

Overview

  • The paper explores pretraining and updating a large language model (LLM) for the Japanese business domain.
  • The researchers trained a 13-billion-parameter model from scratch on a new dataset of business texts and patents, then continually pretrained it on the latest business documents.
  • They propose a new benchmark for Japanese business-domain question answering (QA) and evaluate their models on it.

Plain English Explanation

In this study, the researchers wanted to create a language model that would be particularly useful for understanding and generating text in the Japanese business domain. Rather than adapting an existing model, they trained a 13-billion-parameter large language model from scratch.

The training data was a new dataset of business texts and patents. This initial training, called "pretraining," lets the model learn the vocabulary, writing style, and topics that are common in the Japanese business world. Because business knowledge goes out of date, the researchers then kept training the model on the latest business documents, a step called "continual pretraining."

To test the models, the researchers built a new benchmark of questions about the Japanese business domain. According to the abstract, the pretrained model improved question-answering accuracy without losing general knowledge, and continual pretraining helped the model adapt to new information.

The key insight here is that while large language models can be very powerful, they may not be well suited to a specific language and domain out of the box. By training on domain-specific data, and refreshing that training as new documents appear, researchers can build more specialized and effective AI systems for real-world applications, such as assisting with business operations.

Technical Explanation

The researchers trained a 13-billion-parameter LLM from scratch on a new dataset of Japanese business texts and patents. They then continually pretrained it on the latest business documents, so that the model's knowledge could be updated without repeating the full training run.
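The two-stage data flow described in the abstract (initial pretraining, then an update on the latest documents) can be pictured as splitting a dated corpus at a cutoff. The following is only a minimal sketch under stated assumptions: the `Document` class, the `published` field, the `split_by_cutoff` helper, the cutoff date, and the sample texts are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Document:
    """Hypothetical corpus record; the paper's actual data schema is not given here."""
    text: str
    published: date  # assumed metadata field


def split_by_cutoff(docs, cutoff):
    """Split documents into an initial pretraining set and a later update set."""
    initial = [d for d in docs if d.published < cutoff]
    update = [d for d in docs if d.published >= cutoff]
    return initial, update


# Made-up sample corpus, for illustration only.
corpus = [
    Document("Quarterly earnings summary", date(2022, 5, 1)),
    Document("Patent filing abstract", date(2023, 1, 15)),
    Document("New product announcement", date(2023, 11, 30)),
]

initial_set, update_set = split_by_cutoff(corpus, cutoff=date(2023, 6, 1))
print(len(initial_set), "documents for initial pretraining")
print(len(update_set), "documents for the continual pretraining update")
```

In a real pipeline the update set would feed a further training run that starts from the already pretrained weights; that step is omitted here because it depends on the training framework.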

To evaluate the models, the researchers propose a new benchmark for Japanese business-domain question answering (QA). The abstract reports that the pretrained model improves QA accuracy without losing general knowledge, and that continual pretraining enhances adaptation to new information.
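A comparison of this kind comes down to scoring each model's answers against gold answers on a domain benchmark and on a general one. The sketch below assumes exact-match scoring, which may differ from the metric used in the paper; the function names and all answers are made up for illustration and are not the paper's results.

```python
def accuracy(predictions, gold):
    """Fraction of predictions that exactly match the gold answers."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p.strip() == g.strip() for p, g in zip(predictions, gold))
    return correct / len(gold)


def compare(base_preds, updated_preds, gold):
    """Report accuracy for two models on one benchmark, plus the difference."""
    base = accuracy(base_preds, gold)
    updated = accuracy(updated_preds, gold)
    return {"base": base, "updated": updated, "delta": updated - base}


# Hypothetical toy data: a business QA set and a general-knowledge set.
business_gold = ["A", "B", "C", "D"]
general_gold = ["X", "Y"]

business = compare(["A", "B", "X", "Y"], ["A", "B", "C", "Y"], business_gold)
general = compare(["X", "Y"], ["X", "Y"], general_gold)

print("business QA:", business)
print("general QA:", general)
```

A positive `delta` on the business set together with a `delta` near zero on the general set is the pattern that would correspond to "improves QA accuracy without losing general knowledge."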

The main contributions are the combination of language-specific and domain-specific pretraining in a single model, the use of continual pretraining to keep the model's knowledge current, and the new QA benchmark for the Japanese business domain. The abstract states that the pretrained model and the benchmark are publicly available.

Critical Analysis

The researchers acknowledge several limitations in their study. First, the dataset used for pretraining, while substantial, may not have fully captured the breadth and complexity of the Japanese business domain. There may be important genres or topics that were underrepresented in the training data, which could limit the model's performance on certain tasks.

Additionally, the evaluation tasks used in the study, while relevant to the business domain, may not be fully representative of the real-world challenges faced by companies and industry professionals. The researchers suggest that future work should involve more extensive testing with end-users to better understand the practical applications and limitations of the updated language model.

Another concern is possible bias in the pretraining data or in the model's outputs. As with any large language model, there is a risk that the system may perpetuate or amplify societal biases present in the training data. The researchers do not address this issue in depth, and further investigation into the fairness and ethics of the updated model would be valuable.

Despite these limitations, the study represents an important step towards creating more specialized and effective language models for real-world business applications. The researchers' approach of domain-specific pretraining and comprehensive evaluation could serve as a useful blueprint for similar efforts in other industries or languages, such as creating German-focused language models for the healthcare domain or developing Chinese-centric language models for the tech sector.

Conclusion

This paper demonstrates the value of pretraining and updating large language models for specific domains, using the Japanese business sector as a case study. By training a 13-billion-parameter LLM from scratch on business texts and patents, and then continually pretraining it on the latest business documents, the researchers created a specialized model that improved accuracy on their business-domain QA benchmark without losing general knowledge.

The insights from this work could inform the development of similar domain-specific language models in other industries and languages, potentially leading to more effective AI-powered tools and services for businesses and professionals. As large language models continue to advance, it will be important for researchers and developers to explore ways to tailor these powerful systems to the unique needs and challenges of different real-world applications and contexts.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Construction of Domain-specified Japanese Large Language Model for Finance through Continual Pre-training

Masanori Hirano, Kentaro Imajo

Large language models (LLMs) are now widely used in various fields, including finance. However, Japanese financial-specific LLMs have not been proposed yet. Hence, this study aims to construct a Japanese financial-specific LLM through continual pre-training. Before tuning, we constructed Japanese financial-focused datasets for continual pre-training. As a base model, we employed a Japanese LLM that achieved state-of-the-art performance on Japanese financial benchmarks among the 10-billion-class parameter models. After continual pre-training using the datasets and the base model, the tuned model performed better than the original model on the Japanese financial benchmarks. Moreover, the outputs comparison results reveal that the tuned model's outputs tend to be better than the original model's outputs in terms of the quality and length of the answers. These findings indicate that domain-specific continual pre-training is also effective for LLMs. The tuned model is publicly available on Hugging Face.

4/17/2024

Continual Pre-Training for Cross-Lingual LLM Adaptation: Enhancing Japanese Language Capabilities

Kazuki Fujii, Taishi Nakamura, Mengsay Loem, Hiroki Iida, Masanari Ohi, Kakeru Hattori, Hirai Shota, Sakae Mizuki, Rio Yokota, Naoaki Okazaki

Cross-lingual continual pre-training of large language models (LLMs) initially trained on English corpus allows us to leverage the vast amount of English language resources and reduce the pre-training cost. In this study, we constructed Swallow, an LLM with enhanced Japanese capability, by extending the vocabulary of Llama 2 to include Japanese characters and conducting continual pre-training on a large Japanese web corpus. Experimental results confirmed that the performance on Japanese tasks drastically improved through continual pre-training, and the performance monotonically increased with the amount of training data up to 100B tokens. Consequently, Swallow achieved superior performance compared to other LLMs that were trained from scratch in English and Japanese. An analysis of the effects of continual pre-training revealed that it was particularly effective for Japanese question answering tasks. Furthermore, to elucidate effective methodologies for cross-lingual continual pre-training from English to Japanese, we investigated the impact of vocabulary expansion and the effectiveness of incorporating parallel corpora. The results showed that the efficiency gained through vocabulary expansion had no negative impact on performance, except for the summarization task, and that the combined use of parallel corpora enhanced translation ability.

4/30/2024

A Reality check of the benefits of LLM in business

Ming Cheung

Large language models (LLMs) have achieved remarkable performance in language understanding and generation tasks by leveraging vast amounts of online texts. Unlike conventional models, LLMs can adapt to new domains through prompt engineering without the need for retraining, making them suitable for various business functions, such as strategic planning, project implementation, and data-driven decision-making. However, their limitations in terms of bias, contextual understanding, and sensitivity to prompts raise concerns about their readiness for real-world applications. This paper thoroughly examines the usefulness and readiness of LLMs for business processes. The limitations and capacities of LLMs are evaluated through experiments conducted on four accessible LLMs using real-world data. The findings have significant implications for organizations seeking to leverage generative AI and provide valuable insights into future research directions. To the best of our knowledge, this represents the first quantified study of LLMs applied to core business operations and challenges.

6/18/2024

Quantifying Memorization of Domain-Specific Pre-trained Language Models using Japanese Newspaper and Paywalls

Shotaro Ishihara

Dominant pre-trained language models (PLMs) have been successful in high-quality natural language generation. However, the analysis of their generation is not mature: do they acquire generalizable linguistic abstractions, or do they simply memorize and recover substrings of the training data? Especially, few studies focus on domain-specific PLM. In this study, we pre-trained domain-specific GPT-2 models using a limited corpus of Japanese newspaper articles and quantified memorization of training data by comparing them with general Japanese GPT-2 models. Our experiments revealed that domain-specific PLMs sometimes copy and paste on a large scale. Furthermore, we replicated the empirical finding that memorization is related to duplication, model size, and prompt length, in Japanese the same as in previous English studies. Our evaluations are relieved from data contamination concerns by focusing on newspaper paywalls, which prevent their use as training data. We hope that our paper encourages a sound discussion such as the security and copyright of PLMs.

4/29/2024