Airesearch

Rank:

Average Model Cost: $0.0000

Number of Runs: 87,759

Models by this creator

šŸ—£ļø

wav2vec2-large-xlsr-53-th

The wav2vec2-large-xlsr-53-th model is a finetuned version of the pretrained wav2vec2-large-xlsr-53 model, trained specifically for Automatic Speech Recognition (ASR) in Thai. It is finetuned on the Thai Common Voice Corpus 7.0 dataset and uses tokenizers such as syllable_tokenize, word_tokenize (PyThaiNLP), and deepcut. The model is benchmarked using Word Error Rate (WER) and Character Error Rate (CER), with and without spell correction. The finetuning and evaluation code is provided in the repository. Please note that the APIs are not finetuned with the Common Voice 7.0 data.
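As a rough illustration, transcription with this kind of checkpoint can go through the Hugging Face transformers library. This is a minimal sketch only: the model ID airesearch/wav2vec2-large-xlsr-53-th, the 16 kHz input rate, and the audio file name are assumptions, not details stated in this listing.

```python
# Minimal sketch: Thai ASR with a wav2vec2 XLSR checkpoint.
# Model ID, 16 kHz sample rate, and file name are assumptions --
# check the model card before relying on them.
import torch
import torchaudio
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

model_id = "airesearch/wav2vec2-large-xlsr-53-th"  # assumed Hugging Face ID
processor = Wav2Vec2Processor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

# Load a (mono) audio file and resample to 16 kHz, the usual XLSR input rate.
speech, sr = torchaudio.load("example_thai.wav")  # hypothetical file
speech = torchaudio.functional.resample(speech, sr, 16_000).squeeze(0)

inputs = processor(speech.numpy(), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy CTC decoding to a Thai transcript.
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids)[0])
```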

$-/run

66.4K

Huggingface

šŸ¤–

wangchanberta-base-att-spm-uncased

wangchanberta-base-att-spm-uncased is a pretrained RoBERTa BASE model trained on assorted Thai texts. It can be used for various natural language processing tasks such as masked language modeling, multiclass/multilabel text classification, and token classification. The model was trained on a large dataset of Thai sentences, and the vocabulary was created using the SentencePiece unigram model. The model has a maximum sequence length of 416 subword tokens. It was trained for 500,000 steps with a batch size of 4,096 and optimized using Adam with a learning rate of 3e-4.
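For instance, the masked language modeling head can be exercised through the fill-mask pipeline. A minimal sketch, assuming the checkpoint is published as airesearch/wangchanberta-base-att-spm-uncased and exposes a standard mask token; the Thai example sentence is hypothetical.

```python
# Minimal sketch: masked language modeling with WangchanBERTa.
# The model ID follows common Hugging Face conventions and is an assumption,
# not something stated in this listing.
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="airesearch/wangchanberta-base-att-spm-uncased",  # assumed ID
)

# Predict the masked token in a Thai sentence ("I like to eat <mask>").
masked_text = "ผมชอบกิน" + fill_mask.tokenizer.mask_token
for candidate in fill_mask(masked_text):
    print(candidate["token_str"], round(candidate["score"], 4))
```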

$-/run

20.8K

Huggingface

šŸ“Š

wangchanberta-base-wiki-20210520-spm-finetune-qa

Finetuned from airesearchth/wangchanberta-base-wiki-20210520-spmd on the training sets of iapp_wiki_qa_squad, thaiqa_squad, and nsc_qa (examples with cosine similarity above 0.8 to any validation or test example were removed; contexts of the latter two datasets are trimmed to around 300 newmm words). Benchmarks are shared on wandb using the validation and test sets of iapp_wiki_qa_squad. Trained with thai2transformers.
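The exact training command is not reproduced in this listing. As an illustration only, here is a hedged sketch of extractive QA inference through the question-answering pipeline; the model ID below is an assumption, and the Thai question/context pair is a made-up example.

```python
# Minimal sketch: extractive QA with the finetuned WangchanBERTa checkpoint.
# The model ID is assumed; this only illustrates inference, not the
# thai2transformers training run mentioned in the card.
from transformers import pipeline

qa = pipeline(
    "question-answering",
    model="airesearch/wangchanberta-base-wiki-20210520-spm-finetune-qa",  # assumed ID
)

result = qa(
    question="ą¹€ą¸”ą¸·ą¸­ą¸‡ą¸«ą¸„ą¸§ą¸‡ą¸‚ą¸­ą¸‡ą¸›ą¸£ąø°ą¹€ą¸—ą¸Øไทยคือą¸­ąø°ą¹„ą¸£",            # "What is the capital of Thailand?"
    context="ą¸ąø£ą¸øą¸‡ą¹€ą¸—ą¸žą¸”ą¸«ą¸²ą¸™ą¸„ą¸£ą¹€ą¸›ą¹‡ą¸™ą¹€ą¸”ą¸·ą¸­ą¸‡ą¸«ą¸„ą¸§ą¸‡ą¸‚ą¸­ą¸‡ą¸›ą¸£ąø°ą¹€ą¸—ą¸Øไทย",  # "Bangkok is the capital of Thailand."
)
print(result["answer"], result["score"])
```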

$-/run

113

Huggingface

šŸ¤Æ

wangchanberta-base-wiki-newmm

Pretrained RoBERTa BASE model on the Thai Wikipedia corpus. The script and documentation can be found at this repository. The architecture of the pretrained model is based on RoBERTa [Liu et al., 2019]. You can use the pretrained model for masked language modeling (i.e. predicting a masked token in the input text). In addition, finetuned models are provided for multiclass text classification, multilabel text classification, and token classification. The getting-started notebook for the WangchanBERTa models can be found at this Colab notebook.

The wangchanberta-base-wiki-newmm model was pretrained on Thai Wikipedia. Specifically, we use the Wikipedia dump articles from 20 August 2020 (dumps.wikimedia.org/thwiki/20200820/), excluding lists and tables. Texts are preprocessed with a set of cleaning rules described in the repository. Regarding the vocabulary, we use word-level tokens from PyThaiNLP's dictionary-based tokenizer, namely newmm. The total number of word-level tokens in the vocabulary is 97,982. We sample sentences contiguously up to a length of at most 512 tokens; sentences that cross the 512-token boundary are split, with an additional token inserted as a document separator. This is the same approach as proposed by [Liu et al., 2019] (called "FULL-SENTENCES"). Regarding the masking procedure, for each sequence we sample 15% of the tokens and replace them with the mask token.

Train/Val/Test splits: we sequentially split the data into 944,782 sentences for the training set, 24,863 sentences for the validation set, and 24,862 sentences for the test set.

Pretraining: the model was trained on 32 V100 GPUs for 31,250 steps with a batch size of 8,192 (16 sequences per device with 16 accumulation steps) and a sequence length of 512 tokens. The optimizer is Adam with a learning rate of $7e-4$, $\beta_1 = 0.9$, $\beta_2 = 0.98$ and $\epsilon = 1e-6$. The learning rate is warmed up for the first 1,250 steps and linearly decayed to zero. The checkpoint with the minimum validation loss is selected as the best model checkpoint. A BibTeX entry and citation info are provided in the original model card.
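The pretraining recipe above maps onto a fairly standard PyTorch setup. The sketch below only mirrors the stated hyperparameters (Adam at $7e-4$, $\beta_1 = 0.9$, $\beta_2 = 0.98$, $\epsilon = 1e-6$, 1,250 warmup steps, linear decay over 31,250 steps, vocabulary of 97,982 tokens); it is not the authors' training script.

```python
# Minimal sketch of the optimizer/schedule described above, not the authors'
# actual pretraining code. `model` stands in for any RoBERTa-style module.
import torch
from transformers import RobertaConfig, RobertaForMaskedLM, get_linear_schedule_with_warmup

model = RobertaForMaskedLM(RobertaConfig(vocab_size=97_982))  # vocab size from the card

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=7e-4,            # learning rate stated in the description
    betas=(0.9, 0.98),  # beta_1, beta_2
    eps=1e-6,
)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=1_250,     # warmup for the first 1,250 steps
    num_training_steps=31_250,  # total pretraining steps
)

# Inside the training loop, one optimization step would then be:
#   loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```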

$-/run

105

Huggingface

šŸ“¶

wangchanberta-base-wiki-spm

Pretrained RoBERTa BASE model on the Thai Wikipedia corpus. The script and documentation can be found at this repository. The architecture of the pretrained model is based on RoBERTa [Liu et al., 2019]. You can use the pretrained model for masked language modeling (i.e. predicting a masked token in the input text). In addition, finetuned models are provided for multiclass text classification, multilabel text classification, and token classification. The getting-started notebook for the WangchanBERTa models can be found at this Colab notebook.

The wangchanberta-base-wiki-spm model was pretrained on Thai Wikipedia. Specifically, we use the Wikipedia dump articles from 20 August 2020 (dumps.wikimedia.org/thwiki/20200820/), excluding lists and tables. Texts are preprocessed with a set of cleaning rules described in the repository. Regarding the vocabulary, we use subword tokens trained with the SentencePiece library on the training set of the Thai Wikipedia corpus. The total number of subword tokens is 24,000. We sample sentences contiguously up to a length of at most 512 tokens; sentences that cross the 512-token boundary are split, with an additional token inserted as a document separator. This is the same approach as proposed by [Liu et al., 2019] (called "FULL-SENTENCES"). Regarding the masking procedure, for each sequence we sample 15% of the tokens and replace them with the mask token.

Train/Val/Test splits: we sequentially split the data into 944,782 sentences for the training set, 24,863 sentences for the validation set, and 24,862 sentences for the test set.

Pretraining: the model was trained on 32 V100 GPUs for 31,250 steps with a batch size of 8,192 (16 sequences per device with 16 accumulation steps) and a sequence length of 512 tokens. The optimizer is Adam with a learning rate of $7e-4$, $\beta_1 = 0.9$, $\beta_2 = 0.98$ and $\epsilon = 1e-6$. The learning rate is warmed up for the first 1,250 steps and linearly decayed to zero. The checkpoint with the minimum validation loss is selected as the best model checkpoint. A BibTeX entry and citation info are provided in the original model card.

$-/run

97

Huggingface

šŸ¤·

xlm-roberta-base-finetune-qa

Finetuned from xlm-roberta-base on the training sets of iapp_wiki_qa_squad, thaiqa_squad, and nsc_qa (examples with cosine similarity above 0.8 to any validation or test example were removed; contexts of the latter two datasets are trimmed to around 300 newmm words). Benchmarks are shared on wandb using the validation and test sets of iapp_wiki_qa_squad. Trained with thai2transformers.

$-/run

58

Huggingface

šŸ–¼ļø

wangchanberta-base-wiki-sefr

Pretrained RoBERTa BASE model on the Thai Wikipedia corpus. The script and documentation can be found at this repository. The architecture of the pretrained model is based on RoBERTa [Liu et al., 2019]. You can use the pretrained model for masked language modeling (i.e. predicting a masked token in the input text). In addition, finetuned models are provided for multiclass text classification, multilabel text classification, and token classification. The getting-started notebook for the WangchanBERTa models can be found at this Colab notebook.

The wangchanberta-base-wiki-sefr model was pretrained on Thai Wikipedia. Specifically, we use the Wikipedia dump articles from 20 August 2020 (dumps.wikimedia.org/thwiki/20200820/), excluding lists and tables. Texts are preprocessed with a set of cleaning rules described in the repository. Regarding the vocabulary, we use the Stacked Ensemble Filter and Refine (SEFR) tokenizer (engine="best") [Limkonchotiwat et al., 2020], based on probabilities from the CNN-based deepcut tokenizer [Kittinaradorn et al., 2019]. The total number of word-level tokens in the vocabulary is 92,177. We sample sentences contiguously up to a length of at most 512 tokens; sentences that cross the 512-token boundary are split, with an additional token inserted as a document separator. This is the same approach as proposed by [Liu et al., 2019] (called "FULL-SENTENCES"). Regarding the masking procedure, for each sequence we sample 15% of the tokens and replace them with the mask token.

Train/Val/Test splits: we sequentially split the data into 944,782 sentences for the training set, 24,863 sentences for the validation set, and 24,862 sentences for the test set.

Pretraining: the model was trained on 32 V100 GPUs for 31,250 steps with a batch size of 8,192 (16 sequences per device with 16 accumulation steps) and a sequence length of 512 tokens. The optimizer is Adam with a learning rate of $7e-4$, $\beta_1 = 0.9$, $\beta_2 = 0.98$ and $\epsilon = 1e-6$. The learning rate is warmed up for the first 1,250 steps and linearly decayed to zero. The checkpoint with the minimum validation loss is selected as the best model checkpoint. A BibTeX entry and citation info are provided in the original model card.

$-/run

49

Huggingface

šŸŒ

bert-base-multilingual-cased-finetune-qa

Finetuned from bert-base-multilingual-cased on the training sets of iapp_wiki_qa_squad, thaiqa_squad, and nsc_qa (examples with cosine similarity above 0.8 to any validation or test example were removed; contexts of the latter two datasets are trimmed to around 300 newmm words). Benchmarks are shared on wandb using the validation and test sets of iapp_wiki_qa_squad. Trained with thai2transformers.

$-/run

39

Huggingface

šŸ…

wangchanberta-base-wiki-syllable

Pretrained RoBERTa BASE model on the Thai Wikipedia corpus. The script and documentation can be found at this repository. The architecture of the pretrained model is based on RoBERTa [Liu et al., 2019]. You can use the pretrained model for masked language modeling (i.e. predicting a masked token in the input text). In addition, finetuned models are provided for multiclass text classification, multilabel text classification, and token classification. The getting-started notebook for the WangchanBERTa models can be found at this Colab notebook.

The wangchanberta-base-wiki-syllable model was pretrained on Thai Wikipedia. Specifically, we use the Wikipedia dump articles from 20 August 2020 (dumps.wikimedia.org/thwiki/20200820/), excluding lists and tables. Texts are preprocessed with a set of cleaning rules described in the repository. Regarding the vocabulary, we use a Thai syllable-level dictionary-based tokenizer, denoted syllable, from PyThaiNLP [Phatthiyaphaibun et al., 2016]. The total number of tokens in the vocabulary is 59,235. We sample sentences contiguously up to a length of at most 512 tokens; sentences that cross the 512-token boundary are split, with an additional token inserted as a document separator. This is the same approach as proposed by [Liu et al., 2019] (called "FULL-SENTENCES"). Regarding the masking procedure, for each sequence we sample 15% of the tokens and replace them with the mask token.

Train/Val/Test splits: we sequentially split the data into 944,782 sentences for the training set, 24,863 sentences for the validation set, and 24,862 sentences for the test set.

Pretraining: the model was trained on 32 V100 GPUs for 31,250 steps with a batch size of 8,192 (16 sequences per device with 16 accumulation steps) and a sequence length of 512 tokens. The optimizer is Adam with a learning rate of $7e-4$, $\beta_1 = 0.9$, $\beta_2 = 0.98$ and $\epsilon = 1e-6$. The learning rate is warmed up for the first 1,250 steps and linearly decayed to zero. The checkpoint with the minimum validation loss is selected as the best model checkpoint. A BibTeX entry and citation info are provided in the original model card.

$-/run

25

Huggingface

šŸ”„

xlm-roberta-base-finetuned

Finetuned XLM-RoBERTa BASE models on Thai sequence and token classification datasets. The script and documentation can be found at this repository. We use the pretrained cross-lingual RoBERTa model as proposed by [Conneau et al., 2020] and download the pretrained PyTorch model via Hugging Face's Model Hub (https://huggingface.co/xlm-roberta-base). You can use the finetuned models for multiclass text classification, multilabel text classification, and token classification. The example notebook demonstrating how to use a finetuned model for inference can be found at this Colab notebook. A BibTeX entry and citation info are provided in the original model card.
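As with the WangchanBERTa finetunes, inference can go through the standard pipelines. A minimal sketch, noting that the model ID below is assumed from this listing and that the actual finetuned heads (which dataset, which labels) would need to be checked on the model card; the Thai input is a made-up example.

```python
# Minimal sketch: Thai text classification with the finetuned XLM-R model.
# The model ID is an assumption based on this listing's creator and model name.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="airesearch/xlm-roberta-base-finetuned",  # assumed ID
)

# Hypothetical Thai review text: "This restaurant is very delicious."
print(classifier("ą¸£ą¹‰ą¸²ą¸™ą¸™ą¸µą¹‰ą¸­ą¸£ą¹ˆą¸­ą¸¢ą¸”ą¸²ą¸"))
```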

$-/run

20

Huggingface
