Snunlp
Rank:
Average Model Cost: $0.0000
Number of Runs: 9,826
Models by this creator
KR-FinBert-SC · $-/run · 3.6K runs · Huggingface
KR-ELECTRA-discriminator · $-/run · 2.6K runs · Huggingface
KoRean-based ELECTRA (KR-ELECTRA)

This is a release of a Korean-specific ELECTRA model with comparable or better performance, developed by the Computational Linguistics Lab at Seoul National University. The model performs notably well on tasks involving informal text, such as review documents, while remaining competitive on other kinds of tasks.

Released Model

We pre-trained KR-ELECTRA following the base-scale ELECTRA model. Training was done with TensorFlow v1 on a v3-8 TPU on Google Cloud Platform, using the training parameters of the base-scale ELECTRA model.

Training data: 34 GB of Korean text, including Wikipedia documents, news articles, legal texts, news comments, product reviews, and so on. The corpus is balanced, with equal proportions of written and spoken data.

Vocabulary: 30,000 tokens. We used morpheme-based unit tokens, built with the Mecab-Ko morpheme analyzer.

Checkpoints: TensorFlow v1 model (download) and PyTorch models on HuggingFace.

Finetuning

We used and slightly edited the finetuning code from KoELECTRA, with additionally adjusted hyperparameters. The code and config files we used are available from our GitHub. The baseline results are taken from KoELECTRA.

Citation
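A minimal sketch of loading the released PyTorch checkpoint through the transformers library; the Hub id snunlp/KR-ELECTRA-discriminator and the sample sentence are assumptions here, and the official finetuning setup remains the KoELECTRA-based code mentioned above:

```python
# Hedged sketch: load the KR-ELECTRA discriminator from the Hugging Face Hub
# and encode one informal, review-style Korean sentence.
# The model id "snunlp/KR-ELECTRA-discriminator" is an assumed Hub id.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("snunlp/KR-ELECTRA-discriminator")
model = AutoModel.from_pretrained("snunlp/KR-ELECTRA-discriminator")

inputs = tokenizer("배송도 빠르고 품질도 좋아요!", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Contextual token embeddings: (batch_size, sequence_length, hidden_size)
print(outputs.last_hidden_state.shape)
```

For an actual downstream task, the same checkpoint would be wrapped with a task head (for example AutoModelForSequenceClassification) and finetuned, which is what the KoELECTRA-derived scripts do.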
KR-BERT-char16424 · $-/run · 1.6K runs · Huggingface
KoRean-based BERT pre-trained (KR-BERT)

This is a release of Korean-specific, small-scale BERT models with comparable or better performance, developed by the Computational Linguistics Lab at Seoul National University and described in the paper "KR-BERT: A Small-Scale Korean-Specific Language Model".

Vocab, Parameters and Data

Sub-character: Korean text is basically written with Hangul syllable characters, which can be decomposed into sub-characters, or graphemes. To accommodate this, we trained a new vocabulary and BERT model on two different representations of the corpus: syllable characters and sub-characters. If you use the sub-character model, you should preprocess your data into sub-characters first (a sketch of this step appears after this card).

Tokenization: We use the BidirectionalWordPiece model to reduce search cost while keeping the possibility of choice. It applies BPE in both the forward and backward directions to obtain two candidates and chooses the one with the higher frequency.

Models

Requirements: transformers == 2.1.1, tensorflow < 2.0

Downstream tasks

Naver Sentiment Movie Corpus (NSMC): To use the sub-character version of our models, set the subchar argument to True. You can use the original BERT WordPiece tokenizer by passing bert for the tokenizer argument, or our BidirectionalWordPiece tokenizer by passing ranked.

tensorflow: After downloading our pretrained models, put them in a models directory inside the krbert_tensorflow directory.

pytorch: After downloading our pretrained models, put them in a pretrained directory inside the krbert_pytorch directory. The PyTorch code structure follows that of https://github.com/aisolab/nlp_implementation.

NSMC Acc.

Citation

If you use these models, please cite the following paper:

Contacts

nlp.snu@gmail.com
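The sub-character preprocessing code referenced above is not reproduced on this page. As a rough sketch of the idea (not the repository's own script), Hangul syllables can be decomposed into graphemes with Unicode normalization:

```python
# Hedged sketch of the sub-character preprocessing step: decompose each
# precomposed Hangul syllable block into its graphemes (jamo) before
# tokenization, e.g. "한" into the graphemes ㅎ, ㅏ, ㄴ. This illustrates the
# idea only; use the preprocessing code shipped with KR-BERT for real data.
import unicodedata

def to_subchar(text: str) -> str:
    # NFKD normalization splits precomposed Hangul syllables into jamo.
    return unicodedata.normalize("NFKD", text)

sentence = "영화가 정말 재미있다"
decomposed = to_subchar(sentence)

# The decomposed string has more code points than the original, since each
# syllable is now a sequence of sub-characters.
print(len(sentence), len(decomposed))
```

The syllable-character model itself can be loaded through transformers (assuming the Hub id snunlp/KR-BERT-char16424); the subchar and tokenizer arguments mentioned above belong to the repository's own training scripts rather than to the transformers API.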
KR-SBERT-V40K-klueNLI-augSTS · $-/run · 1.5K runs · Huggingface
KR-FinBert · $-/run · 59 runs · Huggingface
KR-ELECTRA-generator · $-/run · 49 runs · Huggingface