H2O-Danube-1.8B Technical Report

2401.16818

Published 4/16/2024 by Philipp Singer, Pascal Pfeiffer, Yauhen Babakhin, Maximilian Jeblick, Nischay Dhankhar, Gabor Fodor, Sri Satish Ambati

Abstract

We present H2O-Danube, a series of small 1.8B language models consisting of H2O-Danube-1.8B, trained on 1T tokens, and the incremental improved H2O-Danube2-1.8B trained on an additional 2T tokens. Our models exhibit highly competitive metrics across a multitude of benchmarks and, as of the time of this writing, H2O-Danube2-1.8B achieves the top ranking on Open LLM Leaderboard for all models below the 2B parameter range. The models follow core principles of LLama 2 and Mistral, and we leverage and refine various techniques for pre-training large language models. We additionally release chat models trained with supervised fine-tuning followed by direct preference optimization. We make all models openly available under Apache 2.0 license further democratizing LLMs to a wider audience economically.

Overview

  • Presents H2O-Danube, a series of small 1.8B language models
  • H2O-Danube-1.8B is trained on 1T tokens, and H2O-Danube2-1.8B is trained on an additional 2T tokens
  • Models exhibit highly competitive metrics across multiple benchmarks
  • H2O-Danube2-1.8B achieves top ranking on Open LLM Leaderboard for models below 2B parameters
  • Models follow the core principles of Llama 2 and Mistral, leveraging and refining techniques for pre-training large language models
  • Chat models are also released, trained with supervised fine-tuning followed by direct preference optimization
  • Models made openly available under Apache 2.0 license to democratize LLMs

Plain English Explanation

The researchers have developed H2O-Danube, a series of small language models with 1.8 billion parameters. The first model, H2O-Danube-1.8B, was trained on 1 trillion tokens of text, and the second, H2O-Danube2-1.8B, was trained on an additional 2 trillion tokens. Both score competitively on a range of benchmarks, and at the time the paper was written H2O-Danube2-1.8B ranked first on the Open LLM Leaderboard among models with fewer than 2 billion parameters.

The models follow the core design principles of Llama 2 and Mistral, two influential open language models. The researchers also apply and refine a range of existing techniques for pre-training large language models.

In addition to the base language models, the researchers have released chat models that were first fine-tuned on example conversations (supervised fine-tuning) and then further trained with direct preference optimization (DPO), a method that learns from pairs of preferred and rejected responses. All of these models are freely available under the Apache 2.0 license, which makes them easier to access, use, and build on.

Technical Explanation

The H2O-Danube series of language models consists of two main versions: H2O-Danube-1.8B, which was trained on 1 trillion tokens of text data, and H2O-Danube2-1.8B, which was trained on an additional 2 trillion tokens. Both models have 1.8 billion parameters, placing them in the "small" category of large language models.
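To put the token counts above in perspective, the following is a rough back-of-the-envelope sketch. It assumes that H2O-Danube2-1.8B's 2T tokens come on top of the first model's 1T (as "additional" suggests), and it uses the common rule of thumb of about 6 FLOPs per parameter per token for training compute. That rule of thumb, and all function and variable names, are illustrative assumptions and do not come from the report.

```python
# Back-of-the-envelope training budget for the two H2O-Danube stages.
PARAMS = 1.8e9  # parameter count stated in the report

# Tokens trained on in each stage, in release order.
STAGE_TOKENS = {
    "H2O-Danube-1.8B": 1e12,   # 1T tokens
    "H2O-Danube2-1.8B": 2e12,  # additional 2T tokens
}


def cumulative_tokens(stage_tokens):
    """Return the running token total after each stage, in insertion order."""
    totals, running = {}, 0.0
    for name, tokens in stage_tokens.items():
        running += tokens
        totals[name] = running
    return totals


def approx_train_flops(params, tokens):
    """Rule-of-thumb estimate (an assumption, not from the report):
    roughly 6 FLOPs per parameter per training token."""
    return 6.0 * params * tokens


totals = cumulative_tokens(STAGE_TOKENS)
for name, tokens in totals.items():
    print(
        f"{name}: {tokens:.1e} tokens seen, "
        f"{tokens / PARAMS:.0f} tokens per parameter, "
        f"~{approx_train_flops(PARAMS, tokens):.2e} training FLOPs"
    )
```

The tokens-per-parameter ratio is the more useful figure here: it shows how heavily a model of this size is trained relative to its capacity, which is one plausible reason a 1.8B model can stay competitive on benchmarks.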

These models follow the core principles of the Llama 2 and Mistral language models. The researchers combined and refined a range of existing pre-training techniques to reach competitive performance across a wide range of benchmarks.

In addition to the main language models, the researchers also trained chat models using supervised fine-tuning followed by direct preference optimization. These chat models are designed to engage in more natural, conversational interactions with users.
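As a minimal sketch of what the direct preference optimization step optimizes, the function below computes the standard DPO loss for a single preference pair. It is not the authors' training code: the inputs are assumed to be summed log-probabilities of each full response under the model being trained (the policy) and under a frozen reference model (typically the supervised fine-tuned checkpoint), and the default `beta=0.1` is a hypothetical value chosen for illustration.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) response pair.

    Each argument is the log-probability of a whole response.
    beta controls how far the policy may drift from the reference
    (0.1 is an illustrative default, not a value from the report).
    """
    # Implicit rewards: how much more likely the policy makes each
    # response compared with the frozen reference model.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)
    # The loss shrinks as the policy favors the chosen response more.
    return -log_sigmoid(margin)


# Policy identical to the reference: no preference has been learned yet.
baseline = dpo_loss(-10.0, -10.0, -10.0, -10.0)
# Policy has shifted probability toward the chosen response.
improved = dpo_loss(-8.0, -12.0, -10.0, -10.0)
print(baseline, improved)
```

Because the loss depends only on log-probability differences relative to the reference model, no separate reward model is needed, which is the main practical appeal of DPO over reinforcement-learning-based preference tuning.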

All of the H2O-Danube models, including the chat variants, are made openly available under the Apache 2.0 license. This open-source approach helps democratize access to large language models, allowing a wider audience to utilize and build upon these powerful AI systems.

Critical Analysis

The H2O-Danube models are a notable step forward for small language models, particularly given their reported performance across a wide range of benchmarks. The researchers' approach of following the core principles of Llama 2 and Mistral, while refining the pre-training techniques, has produced capable models at the 1.8B scale.

However, it's important to note that the paper does not provide detailed information about the specific techniques and methodologies used in the pre-training process. While the researchers mention leveraging and refining various approaches, a more in-depth explanation of the innovations and modifications would be helpful for a deeper understanding of the models' capabilities and potential limitations.

Additionally, the paper does not discuss the potential biases or ethical considerations associated with the H2O-Danube models. As large language models can sometimes exhibit undesirable biases or generate harmful content, it would be valuable for the researchers to address these concerns and outline their strategies for mitigating such issues.

Furthermore, the paper lacks a comprehensive analysis of the chat models' performance and their ability to engage in natural, contextual conversations. While the release of these chat models is a positive step, a more detailed evaluation of their conversational skills and user experience would provide valuable insights.

Conclusion

The H2O-Danube series is a notable contribution to small language models. By following the core principles of Llama 2 and Mistral and refining the pre-training techniques, the researchers have developed 1.8B models that perform strongly across a variety of benchmarks.

The open-source release of these models, including the chat variants, is a commendable effort to democratize access to powerful AI systems and foster a wider ecosystem of language model development and application. However, the paper could benefit from more detailed explanations of the technical innovations, potential biases and ethical considerations, as well as a more in-depth evaluation of the chat models' conversational abilities.

Overall, the H2O-Danube models are a promising development in the ongoing quest to create highly capable and accessible large language models that can positively impact various domains and applications.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Xmodel-LM Technical Report

Yichuan Wang, Yang Liu, Yu Yan, Qun Wang, Shulei Wu, Xucheng Huang, Ling Jiang

We introduce Xmodel-LM, a compact and efficient 1.1B language model pre-trained on around 2 trillion tokens. Trained on our self-built dataset (Xdata), which balances Chinese and English corpora based on downstream task optimization, Xmodel-LM exhibits remarkable performance despite its smaller size. It notably surpasses existing open-source language models of similar scale. Our model checkpoints and code are publicly accessible on GitHub at https://github.com/XiaoduoAILab/XmodelLM.

6/14/2024

mHuBERT-147: A Compact Multilingual HuBERT Model

Marcely Zanon Boito, Vivek Iyer, Nikolaos Lagos, Laurent Besacier, Ioan Calapodescu

We present mHuBERT-147, the first general-purpose massively multilingual HuBERT speech representation model trained on 90K hours of clean, open-license data. To scale up the multi-iteration HuBERT approach, we use faiss-based clustering, achieving 5.2x faster label assignment than the original method. We also apply a new multilingual batching up-sampling strategy, leveraging both language and dataset diversity. After 3 training iterations, our compact 95M parameter mHuBERT-147 outperforms larger models trained on substantially more data. We rank second and first on the ML-SUPERB 10min and 1h leaderboards, with SOTA scores for 3 tasks. Across ASR/LID tasks, our model consistently surpasses XLS-R (300M params; 436K hours) and demonstrates strong competitiveness against the much larger MMS (1B params; 491K hours). Our findings indicate that mHuBERT-147 is a promising model for multilingual speech tasks, offering an unprecedented balance between high performance and parameter efficiency.

6/12/2024

OpenLLM-Ro -- Technical Report on Open-source Romanian LLMs trained starting from Llama 2

Mihai Masala, Denis C. Ilie-Ablachim, Dragos Corlatescu, Miruna Zavelca, Marius Leordeanu, Horia Velicu, Marius Popescu, Mihai Dascalu, Traian Rebedea

In recent years, Large Language Models (LLMs) have achieved almost human-like performance on various tasks. While some LLMs have been trained on multilingual data, most of the training data is in English. Hence, their performance in English greatly exceeds their performance in other languages. This document presents our approach to training and evaluating the first foundational and chat LLM specialized for Romanian.

5/20/2024

Tele-FLM Technical Report

Xiang Li, Yiqun Yao, Xin Jiang, Xuezhi Fang, Chao Wang, Xinzhang Liu, Zihan Wang, Yu Zhao, Xin Wang, Yuyao Huang, Shuangyong Song, Yongxiang Li, Zheng Zhang, Bo Zhao, Aixin Sun, Yequan Wang, Zhongjiang He, Zhongyuan Wang, Xuelong Li, Tiejun Huang

Large language models (LLMs) have showcased profound capabilities in language understanding and generation, facilitating a wide array of applications. However, there is a notable paucity of detailed, open-sourced methodologies on efficiently scaling LLMs beyond 50 billion parameters with minimum trial-and-error cost and computational resources. In this report, we introduce Tele-FLM (aka FLM-2), a 52B open-sourced multilingual large language model that features a stable, efficient pre-training paradigm and enhanced factual judgment capabilities. Tele-FLM demonstrates superior multilingual language modeling abilities, measured by BPB on textual corpus. Besides, in both English and Chinese foundation model evaluation, it is comparable to strong open-sourced models that involve larger pre-training FLOPs, such as Llama2-70B and DeepSeek-67B. In addition to the model weights, we share the core designs, engineering practices, and training details, which we expect to benefit both the academic and industrial communities.

4/26/2024