Allegro
Rank:Average Model Cost: $0.0000
Number of Runs: 73,576
Models by this creator
herbert-base-cased
HerBERT is a BERT-based language model trained on Polish corpora using Masked Language Modelling (MLM) and Sentence Structural Objective (SSO) with dynamic masking of whole words. It was trained on six different corpora available for the Polish language. The training dataset was tokenized into subwords using character-level byte-pair encoding (CharBPETokenizer) with a vocabulary size of 50k tokens. The model can be used for various NLP tasks in Polish. It was trained by the Machine Learning Research Team at Allegro and the Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences. It is licensed under CC BY 4.0.
$-/run
66.8K
Huggingface
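The "dynamic masking of whole words" mentioned for herbert-base-cased can be illustrated with a minimal sketch. It is not the authors' training code: the chunking tokenizer, the "<mask>" placeholder and the masking probability are assumptions made for illustration, and the usual BERT refinements (random or unchanged replacements, the SSO objective) are left out.

```python
import random

MASK = "<mask>"  # placeholder mask symbol, assumed for this sketch


def toy_tokenize(word, size=3):
    """Hypothetical stand-in for a subword tokenizer: fixed-size chunks."""
    return [word[i:i + size] for i in range(0, len(word), size)]


def mask_whole_words(words, tokenize, mask_prob=0.15, rng=None):
    """Return subword tokens where selected words are masked as a whole."""
    if rng is None:
        rng = random.Random()
    tokens = []
    for word in words:
        pieces = tokenize(word)
        if rng.random() < mask_prob:
            # Mask every subword of the chosen word, not just one piece.
            tokens.extend([MASK] * len(pieces))
        else:
            tokens.extend(pieces)
    return tokens


words = "Allegro trenuje modele językowe".split()
# "Dynamic": a new mask is drawn each time the sentence is seen,
# instead of fixing one mask per sentence during preprocessing.
for epoch in range(3):
    rng = random.Random(epoch)
    print(mask_whole_words(words, toy_tokenize, mask_prob=0.3, rng=rng))
```

Masking at the word level means the model cannot recover a hidden subword simply by looking at the remaining pieces of the same word.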
plt5-base
plT5 Base belongs to the plT5 family of T5-based language models trained on Polish corpora. The models were optimized for the original T5 denoising target. plT5 was trained on six different corpora available for the Polish language. The training dataset was tokenized into subwords using a SentencePiece unigram model with a vocabulary size of 50k tokens. The model is licensed under CC BY 4.0; if you use it, please cite the accompanying paper. It was trained by the Machine Learning Research Team at Allegro and the Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences. Contact: klejbenchmark@allegro.pl
$-/run
2.2K
Huggingface
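The "T5 denoising target" named in the plt5-base description can be sketched as span corruption: selected spans are replaced by sentinel tokens in the input, and the target lists each sentinel followed by the tokens it hides. This is a simplified illustration, not the plT5 training pipeline. The `<extra_id_N>` sentinel names follow the common T5 convention and are an assumption here, and the spans are fixed by hand rather than sampled at random.

```python
def span_corrupt(tokens, spans):
    """Build a T5-style (input, target) pair.

    spans: sorted, non-overlapping (start, end) index pairs, end exclusive.
    """
    inputs, targets = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[cursor:start])  # keep text before the span
        inputs.append(sentinel)              # hide the span in the input
        targets.append(sentinel)             # announce it in the target
        targets.extend(tokens[start:end])    # then spell out what was hidden
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return inputs, targets


tokens = "Dziękuję ci za zaproszenie na przyjęcie w zeszłym tygodniu".split()
inputs, targets = span_corrupt(tokens, [(2, 4), (7, 8)])
print(" ".join(inputs))
print(" ".join(targets))
```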
herbert-large-cased
HerBERT is a BERT-based language model trained on Polish corpora using Masked Language Modelling (MLM) and Sentence Structural Objective (SSO) with dynamic masking of whole words. For more details, see the paper "HerBERT: Efficiently Pretrained Transformer-based Language Model for Polish". Model training and experiments were conducted with transformers version 2.9.
HerBERT was trained on six different corpora available for the Polish language. The training dataset was tokenized into subwords using character-level byte-pair encoding (CharBPETokenizer) with a vocabulary size of 50k tokens. The tokenizer itself was trained with the tokenizers library. The authors encourage using the fast version of the tokenizer, HerbertTokenizerFast.
The model is licensed under CC BY 4.0; if you use it, please cite the accompanying paper. It was trained by the Machine Learning Research Team at Allegro and the Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences. Contact: klejbenchmark@allegro.pl
$-/run
1.6K
Huggingface
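A minimal loading sketch for herbert-large-cased with the fast tokenizer recommended in its description. The Hub ID "allegro/herbert-large-cased" is assumed from the creator and model names on this page, and the code assumes a recent transformers release with PyTorch installed; it is not the example from the original model card. The helper name `embed_sentences` is hypothetical.

```python
HERBERT_MODEL_ID = "allegro/herbert-large-cased"  # assumed Hub ID


def embed_sentences(sentences, model_id=HERBERT_MODEL_ID):
    """Return the last hidden states for a batch of sentences."""
    # Imported lazily so the module can be loaded without the dependencies.
    import torch
    from transformers import AutoModel, HerbertTokenizerFast

    tokenizer = HerbertTokenizerFast.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    model.eval()

    # Pad to the longest sentence so the batch forms one tensor.
    batch = tokenizer(sentences, padding=True, return_tensors="pt")
    with torch.no_grad():
        output = model(**batch)
    return output.last_hidden_state  # shape: (batch, tokens, hidden)


# Example call (downloads the weights on first use):
# hidden = embed_sentences(["Ala ma kota.", "HerBERT to model językowy."])
```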
plt5-small
plT5 Small belongs to the plT5 family of T5-based language models trained on Polish corpora. The models were optimized for the original T5 denoising target. plT5 was trained on six different corpora available for the Polish language. The training dataset was tokenized into subwords using a SentencePiece unigram model with a vocabulary size of 50k tokens. The model is licensed under CC BY 4.0; if you use it, please cite the accompanying paper. It was trained by the Machine Learning Research Team at Allegro and the Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences. Contact: klejbenchmark@allegro.pl
$-/run
1.2K
Huggingface
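The unigram tokenization mentioned for the plT5 models can be sketched as a search for the most probable split of a word into vocabulary pieces. The sketch below covers only that segmentation step with a hand-made toy vocabulary. The pieces and probabilities are invented for illustration, and the training procedure that SentencePiece uses to obtain a 50k vocabulary is not shown.

```python
import math


def unigram_segment(text, log_probs):
    """Most probable split of `text` into pieces under a unigram model."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best log-probability of text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs:
                score = best[start] + log_probs[piece]
                if score > best[end]:
                    best[end] = score
                    back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end to recover the pieces.
    pieces, end = [], n
    while end > 0:
        pieces.append(text[back[end]:end])
        end = back[end]
    return pieces[::-1]


# Toy vocabulary with made-up probabilities (illustration only).
vocab = {"kot": 0.2, "ek": 0.1, "k": 0.05, "o": 0.05,
         "t": 0.05, "e": 0.05, "kotek": 0.01}
log_probs = {piece: math.log(p) for piece, p in vocab.items()}
print(unigram_segment("kotek", log_probs))
```

With these toy numbers the split "kot" + "ek" scores higher than the single piece "kotek", because the product 0.2 × 0.1 exceeds 0.01.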
herbert-klej-cased-v1
HerBERT is a BERT-based language model trained on Polish corpora using only the MLM objective with dynamic masking of whole words. For more details, see the paper "KLEJ: Comprehensive Benchmark for Polish Language Understanding". The training dataset is a combination of several publicly available corpora for the Polish language.
The training dataset was tokenized into subwords using the HerBERT tokenizer, a character-level byte-pair encoding with a vocabulary size of 50k tokens. The tokenizer itself was trained on Wolne Lektury and a publicly available subset of the National Corpus of Polish with the fastBPE library. The tokenizer uses the XLMTokenizer implementation; for that reason, it should be loaded as allegro/herbert-klej-cased-tokenizer-v1.
HerBERT was evaluated on the KLEJ benchmark, a publicly available set of nine evaluation tasks for Polish language understanding. It had the best average performance and obtained the best results on three of the tasks. The full leaderboard is available online.
Model training and experiments were conducted with transformers version 2.0. HerBERT can also be loaded using AutoTokenizer and AutoModel. The model is licensed under CC BY-SA 4.0; if you use it, please cite the accompanying paper. It was trained by the Allegro Machine Learning Research team. Contact: klejbenchmark@allegro.pl
$-/run
770
Huggingface
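Because herbert-klej-cased-v1 keeps its tokenizer in a separate repository, loading it takes two identifiers. The sketch below uses AutoTokenizer and AutoModel, which the description names as a supported way to load the model. The model ID "allegro/herbert-klej-cased-v1" is assumed by analogy with the tokenizer ID given in the description, the helper name `run_herbert_klej` is hypothetical, and the code assumes transformers with PyTorch installed.

```python
KLEJ_MODEL_ID = "allegro/herbert-klej-cased-v1"                # assumed Hub ID
KLEJ_TOKENIZER_ID = "allegro/herbert-klej-cased-tokenizer-v1"  # from the description


def run_herbert_klej(text, model_id=KLEJ_MODEL_ID, tokenizer_id=KLEJ_TOKENIZER_ID):
    """Encode one sentence and return the raw model outputs."""
    # Imported lazily so the module can be loaded without the dependency.
    from transformers import AutoModel, AutoTokenizer

    # The tokenizer and the weights come from two different repositories.
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_id)
    model = AutoModel.from_pretrained(model_id)

    input_ids = tokenizer.encode(text, return_tensors="pt")
    return model(input_ids)


# Example call (downloads the weights on first use):
# outputs = run_herbert_klej("Kto ma lepszą sztukę, ma lepszy rząd.")
```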
plt5-large
plT5 Large belongs to the plT5 family of T5-based language models trained on Polish corpora. The models were optimized for the original T5 denoising target. plT5 was trained on six different corpora available for the Polish language. The training dataset was tokenized into subwords using a SentencePiece unigram model with a vocabulary size of 50k tokens. The model is licensed under CC BY 4.0; if you use it, please cite the accompanying paper. It was trained by the Machine Learning Research Team at Allegro and the Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences. Contact: klejbenchmark@allegro.pl
$-/run
624
Huggingface
herbert-klej-cased-tokenizer-v1
The HerBERT tokenizer is a character-level byte-pair encoding with a vocabulary size of 50k tokens. It was trained on Wolne Lektury and a publicly available subset of the National Corpus of Polish with the fastBPE library. The tokenizer uses the XLMTokenizer implementation from transformers and should be used together with the HerBERT model. It is licensed under CC BY-SA 4.0; if you use it, please cite the accompanying paper. The tokenizer was created by the Allegro Machine Learning Research team. Contact: klejbenchmark@allegro.pl
$-/run
362
Huggingface
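Character-level byte-pair encoding, as used by the HerBERT tokenizer, starts from single characters and repeatedly merges the most frequent adjacent pair. The sketch below shows that learning loop on a toy word list. It is not the fastBPE implementation: the "</w>" end-of-word marker and the word counts are assumptions for illustration, and a real 50k vocabulary would need far more merges and data.

```python
from collections import Counter


def learn_bpe(word_counts, num_merges):
    """Learn BPE merges from a {word: count} dict, starting from characters."""
    # Each word becomes a tuple of symbols; "</w>" marks the end of a word.
    vocab = {tuple(word) + ("</w>",): count
             for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite each word with the chosen pair fused into one symbol.
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + count
        vocab = new_vocab
    return merges, vocab


merges, vocab = learn_bpe({"lektura": 4, "lektury": 3, "lek": 2}, num_merges=3)
print(merges)
print(vocab)
```

The learned merge list is applied in the same order at tokenization time, so frequent words end up as single tokens while rare words fall back to smaller pieces.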