
Wangchanberta Base Wiki Sefr



Pretrained RoBERTa BASE model on the Thai Wikipedia corpus. The script and documentation can be found at this repository.

The architecture of the pretrained model is based on RoBERTa [Liu et al., 2019].

You can use the pretrained model for masked language modeling (i.e. predicting a mask token in the input text). In addition, we provide fine-tuned models for multiclass text classification, multilabel text classification, and token classification. The getting started notebook for the WangchanBERTa model can be found at this Colab notebook.

The wangchanberta-base-wiki-sefr model was pretrained on Thai Wikipedia. Specifically, we use the articles from the Wikipedia dump of 20 August 2020. We exclude lists and tables. Texts are preprocessed with a set of rules.

Regarding the vocabulary, we use the Stacked Ensemble Filter and Refine (SEFR) tokenizer (engine="best") [Limkonchotiwat et al., 2020], based on probabilities from the CNN-based deepcut [Kittinaradorn et al., 2019]. The total number of word-level tokens in the vocabulary is 92,177.

We sample sentences contiguously to form sequences of at most 512 tokens. When a sentence overlaps the 512-token boundary, we split it with an additional token as a document separator. This is the same approach as proposed by [Liu et al., 2019] (called "FULL-SENTENCES").

Regarding the masking procedure, for each sequence we sample 15% of the tokens and replace them with the mask token.

Train/Val/Test splits

We split the data sequentially into 944,782 sentences for the training set, 24,863 sentences for the validation set, and 24,862 sentences for the test set.

Pretraining

The model was trained on 32 V100 GPUs for 31,250 steps with a batch size of 8,192 (16 sequences per device with 16 accumulation steps) and a sequence length of 512 tokens. The optimizer is Adam with a learning rate of $7e-4$, $\beta_1 = 0.9$, $\beta_2 = 0.98$ and $\epsilon = 1e-6$. The learning rate is warmed up for the first 1,250 steps and then linearly decayed to zero. The model checkpoint with the minimum validation loss is selected as the best checkpoint.
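The pretraining numbers above can be checked with a short sketch. It shows how the batch size of 8,192 follows from 32 GPUs, 16 sequences per device and 16 accumulation steps, and how the learning rate might evolve over the 31,250 steps. The function names are hypothetical, and the exact schedule shape (warmup starting from zero, decay reaching zero exactly at the final step) is an assumption, since the description only states "warmed up for the first 1250 steps and linearly decayed to zero".

```python
def effective_batch_size(num_devices: int, per_device: int, accumulation_steps: int) -> int:
    """Sequences contributing to one optimizer step."""
    return num_devices * per_device * accumulation_steps


def linear_warmup_decay_lr(
    step: int,
    peak_lr: float = 7e-4,
    warmup_steps: int = 1250,
    total_steps: int = 31250,
) -> float:
    """Learning rate at a given step (assumed schedule shape, see lead-in).

    Rises linearly from 0 to peak_lr over warmup_steps, then falls
    linearly so that it reaches 0 at total_steps.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


if __name__ == "__main__":
    # 32 V100 GPUs, 16 sequences per device, 16 accumulation steps
    print("effective batch size:", effective_batch_size(32, 16, 16))
    for step in (0, 625, 1250, 16250, 31250):
        print(step, linear_warmup_decay_lr(step))
```

Under these assumptions the rate peaks at step 1,250 and is half the peak at step 16,250, the midpoint of the decay phase.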



Creator Models

Wangchanberta Base Att Spm Uncased    $?    20,824
Bert Base Multilingual Cased Finetune Qa    $?    39
Wangchanberta Base Wiki Newmm    $?    105
Wangchanberta Base Wiki Spm    $?    97
Bert Base Multilingual Cased Finetuned    $?    17



Summary of this model and related resources.

Model Name: Wangchanberta Base Wiki Sefr
Model Link: View on HuggingFace
API Spec: View on HuggingFace
Github Link: No Github link provided
Paper Link: No paper link provided

