Aubmindlab
Rank:
Average Model Cost: $0.0000
Number of Runs: 59,518
Models by this creator
aragpt2-base
AraGPT2-base is a language model trained on a large Arabic dataset. It is based on the GPT2 architecture and was trained with the LAMB optimizer. The model can be used for Arabic text generation tasks, is compatible with the transformers library, and can be fine-tuned using TensorFlow. Its pretraining data includes a filtered version of the OSCAR corpus, an Arabic Wikipedia dump, the 1.5B-word Arabic Corpus, the OSIAN Corpus, and Assafir news articles. Generated text should be used for research and scientific purposes only, and the model should be cited if used.
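A minimal sketch of generating Arabic text with this checkpoint through the transformers pipeline API; the prompt and sampling settings are illustrative assumptions, not taken from the model card.

```python
# Minimal sketch: Arabic text generation with aragpt2-base via the transformers
# pipeline. Assumes `pip install transformers torch` and the checkpoint id
# "aubmindlab/aragpt2-base" on the Hugging Face Hub (as listed above).
from transformers import pipeline

generator = pipeline("text-generation", model="aubmindlab/aragpt2-base")

prompt = "يحكى أن مزارعا"  # example Arabic prompt ("It is said that a farmer ...")
outputs = generator(
    prompt,
    max_length=60,            # total length of prompt + generated tokens
    do_sample=True,           # sample instead of greedy decoding
    top_p=0.95,
    num_return_sequences=1,
)
print(outputs[0]["generated_text"])
```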
$-/run
18.2K
Huggingface
bert-base-arabertv02
AraBERT is an Arabic pretrained language model based on Google's BERT architecture. It has been trained on a larger dataset and for a longer period of time compared to previous versions. AraBERTv1 uses pre-segmented text while AraBERTv2 has better preprocessing and a new vocabulary. The model has been evaluated on various downstream tasks and compared to other models. It is available in four new variants and can be accessed through the HuggingFace model page. It is recommended to apply the provided preprocessing function before training or testing on any dataset. The model can be cited using the provided reference.
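A hedged sketch of the recommended flow: clean the text with the preprocessing utility from the arabert package, then encode it with transformers. The import path and example sentence follow the arabert package documentation; treat them as assumptions if your package version differs.

```python
# Hedged sketch, assuming `pip install arabert transformers torch` and that
# ArabertPreprocessor is exposed at arabert.preprocess (per the arabert docs).
import torch
from transformers import AutoTokenizer, AutoModel
from arabert.preprocess import ArabertPreprocessor

model_name = "aubmindlab/bert-base-arabertv02"
preprocessor = ArabertPreprocessor(model_name=model_name)

text = "ولن نبالغ إذا قلنا إن هاتف أو كمبيوتر المكتب في زمننا هذا ضروري"
clean_text = preprocessor.preprocess(text)  # normalization, spacing around punctuation, etc.

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

inputs = tokenizer(clean_text, return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**inputs).last_hidden_state  # (1, seq_len, 768)
print(hidden_states.shape)
```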
$-/run
16.3K
Huggingface
bert-base-arabertv2
AraBERT is an Arabic pretrained language model based on Google's BERT architecture. It comes in two versions, AraBERTv0.1 and AraBERTv1, with the latter using pre-segmented text. AraBERT has been evaluated on various downstream tasks such as sentiment analysis, named entity recognition, and question answering. AraBERTv2 is the latest version, which introduces improvements in preprocessing and vocabulary. It has been trained on a larger dataset and for a longer duration. The new dataset includes sources such as the OSCAR corpus, Arabic Wikipedia dump, and Assafir news articles. It is recommended to apply the provided preprocessing function before using AraBERT on any dataset. The model is available in TensorFlow 1.x format and can be downloaded from the HuggingFace models repository. If used, the authors request to be cited.
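A hedged sketch of masked-token prediction with this v2 checkpoint. v2 models expect Farasa-segmented input, which the arabert preprocessor applies when given a v2 model name; the example sentence and the word being masked are illustrative assumptions.

```python
# Hedged sketch, assuming `pip install arabert farasapy transformers torch`.
from transformers import pipeline
from arabert.preprocess import ArabertPreprocessor

model_name = "aubmindlab/bert-base-arabertv2"
preprocessor = ArabertPreprocessor(model_name=model_name)  # applies Farasa segmentation for v2
fill_mask = pipeline("fill-mask", model=model_name)

text = preprocessor.preprocess("عاصمة لبنان هي بيروت")
# Mask one word after preprocessing (assumes the city name is left unchanged
# by segmentation, since it carries no prefixes/suffixes).
masked = text.replace("بيروت", fill_mask.tokenizer.mask_token)

for prediction in fill_mask(masked, top_k=3):
    print(prediction["token_str"], round(prediction["score"], 3))
```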
$-/run
11.2K
Huggingface
bert-base-arabert
!!! A newer version of this model is available: AraBERTv2 !!!

AraBERT v1 & v2: Pre-training BERT for Arabic Language Understanding. AraBERT is an Arabic pretrained language model based on Google's BERT architecture and uses the same BERT-Base configuration. More details are available in the AraBERT paper and the AraBERT meetup. There are two versions of the model, AraBERTv0.1 and AraBERTv1; the difference is that AraBERTv1 uses pre-segmented text in which prefixes and suffixes were split using the Farasa segmenter. We evaluate AraBERT models on different downstream tasks and compare them to mBERT and other state-of-the-art models (to the extent of our knowledge). The tasks were sentiment analysis on six different datasets (HARD, ASTD-Balanced, ArsenTD-Lev, LABR), named entity recognition on ANERcorp, and Arabic question answering on Arabic-SQuAD and ARCD.

AraBERTv2 - What's New: AraBERT now comes in 4 new variants that replace the old v1 versions; more detail is available in the AraBERT folder, the README, and the AraBERT paper. All models are available on the HuggingFace model page under the aubmindlab name, with checkpoints in PyTorch, TF2, and TF1 formats.

Better pre-processing and a new vocabulary: we identified an issue with AraBERTv1's wordpiece vocabulary. The issue came from punctuation and numbers that were still attached to words when the wordpiece vocabulary was learned. We now insert a space between numbers and characters and around punctuation characters. The new vocabulary was learned using the BertWordpieceTokenizer from the tokenizers library and should now support the fast tokenizer implementation from the transformers library. P.S.: all the old BERT code should work with the new BERT; just change the model name and check the new preprocessing function. Please read the section on how to use the preprocessing function.

Bigger dataset and more compute: we used ~3.5 times more data and trained for longer. For dataset sources, see the Dataset section.

Dataset: the pretraining data used for the new AraBERT model is also used for Arabic GPT2 and ELECTRA. The dataset consists of 77GB, or 200,095,961 lines, or 8,655,948,860 words, or 82,232,988,358 characters (before applying Farasa segmentation). For the new dataset we added the unshuffled OSCAR corpus (after thoroughly filtering it) to the dataset used in AraBERTv1, but without the websites that we previously crawled: OSCAR unshuffled and filtered, the Arabic Wikipedia dump from 2020/09/01, the 1.5B-word Arabic Corpus, the OSIAN Corpus, and Assafir news articles. Huge thanks to Assafir for giving us the data.

Preprocessing: it is recommended to apply our preprocessing function before training or testing on any dataset. Install farasapy to segment text for AraBERT v1 & v2: pip install farasapy

TensorFlow 1.x models: the TF1.x models are available in the HuggingFace models repo. You can download them via git-lfs (clone the model repo, where MODEL_NAME is any model under the aubmindlab name) or via wget: go to the tf1_model.tar.gz file on huggingface.co/models/aubmindlab/MODEL_NAME, copy the oid sha256, then run wget https://cdn-lfs.huggingface.co/aubmindlab/aragpt2-base/INSERT_THE_SHA_HERE (e.g., for aragpt2-base: wget https://cdn-lfs.huggingface.co/aubmindlab/aragpt2-base/3766fc03d7c2593ff2fb991d275e96b81b0ecb2098b71ff315611d052ce65248).

If you used this model, please cite us. Google Scholar has our BibTeX wrong (missing name); use the reference given on the model page instead.

Acknowledgments: thanks to TensorFlow Research Cloud (TFRC) for the free access to Cloud TPUs (we couldn't have done it without this program) and to the AUB MIND Lab members for the continuous support. Also thanks to Yakshof and Assafir for data and storage access, and to Habib Rahal (https://www.behance.net/rahalhabib) for putting a face to AraBERT.

Contacts: Wissam Antoun (wfa07@mail.aub.edu | wissam.antoun@gmail.com), Fady Baly (fgb06@mail.aub.edu | baly.fady@gmail.com)
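A hedged sketch of the Farasa segmentation step the card recommends for v1 models. The FarasaSegmenter import follows the farasapy documentation and farasapy requires a Java runtime; treat both as assumptions about your environment.

```python
# Hedged sketch: pre-segment text with farasapy before tokenizing with AraBERTv1.
# Assumes `pip install farasapy transformers` and an available Java runtime.
from farasa.segmenter import FarasaSegmenter
from transformers import AutoTokenizer

segmenter = FarasaSegmenter(interactive=True)

text = "ولن نبالغ إذا قلنا إن هاتف أو كمبيوتر المكتب في زمننا هذا ضروري"
segmented = segmenter.segment(text)  # splits prefixes/suffixes with '+'
print(segmented)

tokenizer = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabert")
print(tokenizer.tokenize(segmented))
```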
$-/run
3.2K
Huggingface
bert-base-arabertv01
!!! A newer version of this model is available: AraBERTv02 !!!

AraBERT v1 & v2: Pre-training BERT for Arabic Language Understanding. AraBERT is an Arabic pretrained language model based on Google's BERT architecture and uses the same BERT-Base configuration. More details are available in the AraBERT paper and the AraBERT meetup. There are two versions of the model, AraBERTv0.1 and AraBERTv1; the difference is that AraBERTv1 uses pre-segmented text in which prefixes and suffixes were split using the Farasa segmenter. We evaluate AraBERT models on different downstream tasks and compare them to mBERT and other state-of-the-art models (to the extent of our knowledge). The tasks were sentiment analysis on six different datasets (HARD, ASTD-Balanced, ArsenTD-Lev, LABR), named entity recognition on ANERcorp, and Arabic question answering on Arabic-SQuAD and ARCD.

AraBERTv2 - What's New: AraBERT now comes in 4 new variants that replace the old v1 versions; more detail is available in the AraBERT folder, the README, and the AraBERT paper. All models are available on the HuggingFace model page under the aubmindlab name, with checkpoints in PyTorch, TF2, and TF1 formats.

Better pre-processing and a new vocabulary: we identified an issue with AraBERTv1's wordpiece vocabulary. The issue came from punctuation and numbers that were still attached to words when the wordpiece vocabulary was learned. We now insert a space between numbers and characters and around punctuation characters. The new vocabulary was learned using the BertWordpieceTokenizer from the tokenizers library and should now support the fast tokenizer implementation from the transformers library. P.S.: all the old BERT code should work with the new BERT; just change the model name and check the new preprocessing function. Please read the section on how to use the preprocessing function.

Bigger dataset and more compute: we used ~3.5 times more data and trained for longer. For dataset sources, see the Dataset section.

Dataset: the pretraining data used for the new AraBERT model is also used for Arabic GPT2 and ELECTRA. The dataset consists of 77GB, or 200,095,961 lines, or 8,655,948,860 words, or 82,232,988,358 characters (before applying Farasa segmentation). For the new dataset we added the unshuffled OSCAR corpus (after thoroughly filtering it) to the dataset used in AraBERTv1, but without the websites that we previously crawled: OSCAR unshuffled and filtered, the Arabic Wikipedia dump from 2020/09/01, the 1.5B-word Arabic Corpus, the OSIAN Corpus, and Assafir news articles. Huge thanks to Assafir for giving us the data.

Preprocessing: it is recommended to apply our preprocessing function before training or testing on any dataset. Install farasapy to segment text for AraBERT v1 & v2: pip install farasapy

TensorFlow 1.x models: the TF1.x models are available in the HuggingFace models repo. You can download them via git-lfs (clone the model repo, where MODEL_NAME is any model under the aubmindlab name) or via wget: go to the tf1_model.tar.gz file on huggingface.co/models/aubmindlab/MODEL_NAME, copy the oid sha256, then run wget https://cdn-lfs.huggingface.co/aubmindlab/aragpt2-base/INSERT_THE_SHA_HERE (e.g., for aragpt2-base: wget https://cdn-lfs.huggingface.co/aubmindlab/aragpt2-base/3766fc03d7c2593ff2fb991d275e96b81b0ecb2098b71ff315611d052ce65248).

If you used this model, please cite us. Google Scholar has our BibTeX wrong (missing name); use the reference given on the model page instead.

Acknowledgments: thanks to TensorFlow Research Cloud (TFRC) for the free access to Cloud TPUs (we couldn't have done it without this program) and to the AUB MIND Lab members for the continuous support. Also thanks to Yakshof and Assafir for data and storage access, and to Habib Rahal (https://www.behance.net/rahalhabib) for putting a face to AraBERT.

Contacts: Wissam Antoun (wfa07@mail.aub.edu | wissam.antoun@gmail.com), Fady Baly (fgb06@mail.aub.edu | baly.fady@gmail.com)
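As an alternative to the wget-with-sha approach described above, the TF 1.x archive can be fetched with the huggingface_hub client. This is a sketch, assuming the repository ships a file named tf1_model.tar.gz as the card states.

```python
# Hedged sketch: download the TF1 archive without copying the LFS sha by hand.
# Assumes `pip install huggingface_hub`.
from huggingface_hub import hf_hub_download

local_path = hf_hub_download(
    repo_id="aubmindlab/bert-base-arabertv01",  # any model under the aubmindlab name
    filename="tf1_model.tar.gz",                # archive name as given on the card
)
print("Downloaded to:", local_path)
```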
$-/run
3.2K
Huggingface
bert-base-arabertv02-twitter
AraBERTv0.2-Twitter: AraBERTv0.2-Twitter-base and -large are two new models for Arabic dialects and tweets, trained by continuing the pre-training with the MLM task on ~60M Arabic tweets (filtered from a collection of 100M). The two new models have emojis added to their vocabulary, in addition to common words that were not present before. The pre-training was done with a maximum sentence length of 64, for one epoch only.

AraBERT is an Arabic pretrained language model based on Google's BERT architecture and uses the same BERT-Base configuration. More details are available in the AraBERT paper and the AraBERT meetup.

Preprocessing: the model was trained on a sequence length of 64, so using a max length beyond 64 may result in degraded performance. It is recommended to apply our preprocessing function before training or testing on any dataset. The preprocessor keeps and spaces out emojis when used with a "twitter" model.

If you used this model, please cite us. Google Scholar has our BibTeX wrong (missing name); use the reference given on the model page instead.

Acknowledgments: thanks to TensorFlow Research Cloud (TFRC) for the free access to Cloud TPUs (we couldn't have done it without this program) and to the AUB MIND Lab members for the continuous support. Also thanks to Yakshof and Assafir for data and storage access, and to Habib Rahal (https://www.behance.net/rahalhabib) for putting a face to AraBERT.

Contacts: Wissam Antoun (wfa07@mail.aub.edu | wissam.antoun@gmail.com), Fady Baly (fgb06@mail.aub.edu | baly.fady@gmail.com)
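A hedged sketch for the Twitter variant: preprocess the tweet (the card says the preprocessor keeps and spaces out emojis for "twitter" models) and cap the input at the 64-token training length. The example tweet is an illustrative assumption.

```python
# Hedged sketch, assuming `pip install arabert transformers torch`.
import torch
from transformers import AutoTokenizer, AutoModel
from arabert.preprocess import ArabertPreprocessor

model_name = "aubmindlab/bert-base-arabertv02-twitter"
preprocessor = ArabertPreprocessor(model_name=model_name)

tweet = "الجو حلو اليوم 😍🔥"
clean_tweet = preprocessor.preprocess(tweet)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# The model was pre-trained with sequences of length 64, so truncate there.
inputs = tokenizer(clean_tweet, truncation=True, max_length=64, return_tensors="pt")
with torch.no_grad():
    embeddings = model(**inputs).last_hidden_state
print(embeddings.shape)
```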
$-/run
3.0K
Huggingface
bert-large-arabertv2
AraBERT v1 & v2: Pre-training BERT for Arabic Language Understanding. AraBERT is an Arabic pretrained language model based on Google's BERT architecture and uses the same BERT-Base configuration. More details are available in the AraBERT paper and the AraBERT meetup. There are two versions of the model, AraBERTv0.1 and AraBERTv1; the difference is that AraBERTv1 uses pre-segmented text in which prefixes and suffixes were split using the Farasa segmenter. We evaluate AraBERT models on different downstream tasks and compare them to mBERT and other state-of-the-art models (to the extent of our knowledge). The tasks were sentiment analysis on six different datasets (HARD, ASTD-Balanced, ArsenTD-Lev, LABR), named entity recognition on ANERcorp, and Arabic question answering on Arabic-SQuAD and ARCD.

AraBERTv2 - What's New: AraBERT now comes in 4 new variants that replace the old v1 versions; more detail is available in the AraBERT folder, the README, and the AraBERT paper. All models are available on the HuggingFace model page under the aubmindlab name, with checkpoints in PyTorch, TF2, and TF1 formats.

Better pre-processing and a new vocabulary: we identified an issue with AraBERTv1's wordpiece vocabulary. The issue came from punctuation and numbers that were still attached to words when the wordpiece vocabulary was learned. We now insert a space between numbers and characters and around punctuation characters. The new vocabulary was learned using the BertWordpieceTokenizer from the tokenizers library and should now support the fast tokenizer implementation from the transformers library. P.S.: all the old BERT code should work with the new BERT; just change the model name and check the new preprocessing function. Please read the section on how to use the preprocessing function.

Bigger dataset and more compute: we used ~3.5 times more data and trained for longer. For dataset sources, see the Dataset section.

Dataset: the pretraining data used for the new AraBERT model is also used for Arabic GPT2 and ELECTRA. The dataset consists of 77GB, or 200,095,961 lines, or 8,655,948,860 words, or 82,232,988,358 characters (before applying Farasa segmentation). For the new dataset we added the unshuffled OSCAR corpus (after thoroughly filtering it) to the dataset used in AraBERTv1, but without the websites that we previously crawled: OSCAR unshuffled and filtered, the Arabic Wikipedia dump from 2020/09/01, the 1.5B-word Arabic Corpus, the OSIAN Corpus, and Assafir news articles. Huge thanks to Assafir for providing us with the data.

Preprocessing: it is recommended to apply our preprocessing function before training or testing on any dataset. Install the arabert python package to segment text for AraBERT v1 & v2 or to clean your data: pip install arabert

TensorFlow 1.x models: the TF1.x models are available in the HuggingFace models repo. You can download them via git-lfs (clone the model repo, where MODEL_NAME is any model under the aubmindlab name) or via wget: go to the tf1_model.tar.gz file on huggingface.co/models/aubmindlab/MODEL_NAME, copy the oid sha256, then run wget https://cdn-lfs.huggingface.co/aubmindlab/aragpt2-base/INSERT_THE_SHA_HERE (e.g., for aragpt2-base: wget https://cdn-lfs.huggingface.co/aubmindlab/aragpt2-base/3766fc03d7c2593ff2fb991d275e96b81b0ecb2098b71ff315611d052ce65248).

If you used this model, please cite us. Google Scholar has our BibTeX wrong (missing name); use the reference given on the model page instead.

Acknowledgments: thanks to TensorFlow Research Cloud (TFRC) for the free access to Cloud TPUs (we couldn't have done it without this program) and to the AUB MIND Lab members for the continuous support. Also thanks to Yakshof and Assafir for data and storage access, and to Habib Rahal (https://www.behance.net/rahalhabib) for putting a face to AraBERT.

Contacts: Wissam Antoun (wfa07@mail.aub.edu | wissam.antoun@gmail.com), Fady Baly (fgb06@mail.aub.edu | baly.fady@gmail.com)
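A hedged sketch of extracting mean-pooled sentence embeddings from the large v2 checkpoint after the recommended preprocessing. The example sentences are illustrative assumptions; v2 preprocessing relies on Farasa via the arabert package.

```python
# Hedged sketch, assuming `pip install arabert farasapy transformers torch`.
import torch
from transformers import AutoTokenizer, AutoModel
from arabert.preprocess import ArabertPreprocessor

model_name = "aubmindlab/bert-large-arabertv2"
preprocessor = ArabertPreprocessor(model_name=model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

sentences = ["أحب القراءة كثيرا", "الطقس جميل اليوم"]
clean = [preprocessor.preprocess(s) for s in sentences]
batch = tokenizer(clean, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state          # (batch, seq, 1024)

mask = batch["attention_mask"].unsqueeze(-1)            # ignore padding tokens
embeddings = (hidden * mask).sum(1) / mask.sum(1)        # mean pooling
print(embeddings.shape)                                  # (2, 1024)
```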
$-/run
1.4K
Huggingface
bert-large-arabertv02
AraBERT v1 & v2: Pre-training BERT for Arabic Language Understanding. AraBERT is an Arabic pretrained language model based on Google's BERT architecture and uses the same BERT-Base configuration. More details are available in the AraBERT paper and the AraBERT meetup. There are two versions of the model, AraBERTv0.1 and AraBERTv1; the difference is that AraBERTv1 uses pre-segmented text in which prefixes and suffixes were split using the Farasa segmenter. We evaluate AraBERT models on different downstream tasks and compare them to mBERT and other state-of-the-art models (to the extent of our knowledge). The tasks were sentiment analysis on six different datasets (HARD, ASTD-Balanced, ArsenTD-Lev, LABR), named entity recognition on ANERcorp, and Arabic question answering on Arabic-SQuAD and ARCD.

AraBERTv2 - What's New: AraBERT now comes in 4 new variants that replace the old v1 versions; more detail is available in the AraBERT folder, the README, and the AraBERT paper. All models are available on the HuggingFace model page under the aubmindlab name, with checkpoints in PyTorch, TF2, and TF1 formats.

Better pre-processing and a new vocabulary: we identified an issue with AraBERTv1's wordpiece vocabulary. The issue came from punctuation and numbers that were still attached to words when the wordpiece vocabulary was learned. We now insert a space between numbers and characters and around punctuation characters. The new vocabulary was learned using the BertWordpieceTokenizer from the tokenizers library and should now support the fast tokenizer implementation from the transformers library. P.S.: all the old BERT code should work with the new BERT; just change the model name and check the new preprocessing function. Please read the section on how to use the preprocessing function.

Bigger dataset and more compute: we used ~3.5 times more data and trained for longer. For dataset sources, see the Dataset section.

Dataset: the pretraining data used for the new AraBERT model is also used for Arabic GPT2 and ELECTRA. The dataset consists of 77GB, or 200,095,961 lines, or 8,655,948,860 words, or 82,232,988,358 characters (before applying Farasa segmentation). For the new dataset we added the unshuffled OSCAR corpus (after thoroughly filtering it) to the dataset used in AraBERTv1, but without the websites that we previously crawled: OSCAR unshuffled and filtered, the Arabic Wikipedia dump from 2020/09/01, the 1.5B-word Arabic Corpus, the OSIAN Corpus, and Assafir news articles. Huge thanks to Assafir for giving us the data.

Preprocessing: it is recommended to apply our preprocessing function before training or testing on any dataset. Install farasapy to segment text for AraBERT v1 & v2: pip install farasapy

TensorFlow 1.x models: the TF1.x models are available in the HuggingFace models repo. You can download them via git-lfs (clone the model repo, where MODEL_NAME is any model under the aubmindlab name) or via wget: go to the tf1_model.tar.gz file on huggingface.co/models/aubmindlab/MODEL_NAME, copy the oid sha256, then run wget https://cdn-lfs.huggingface.co/aubmindlab/aragpt2-base/INSERT_THE_SHA_HERE (e.g., for aragpt2-base: wget https://cdn-lfs.huggingface.co/aubmindlab/aragpt2-base/3766fc03d7c2593ff2fb991d275e96b81b0ecb2098b71ff315611d052ce65248).

If you used this model, please cite us. Google Scholar has our BibTeX wrong (missing name); use the reference given on the model page instead.

Acknowledgments: thanks to TensorFlow Research Cloud (TFRC) for the free access to Cloud TPUs (we couldn't have done it without this program) and to the AUB MIND Lab members for the continuous support. Also thanks to Yakshof and Assafir for data and storage access, and to Habib Rahal (https://www.behance.net/rahalhabib) for putting a face to AraBERT.

Contacts: Wissam Antoun (wfa07@mail.aub.edu | wissam.antoun@gmail.com), Fady Baly (fgb06@mail.aub.edu | baly.fady@gmail.com)
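Since the card lists sentiment analysis among the evaluated downstream tasks, here is a hedged fine-tuning skeleton for sequence classification with the transformers Trainer. The hyperparameters are illustrative, and train_ds/eval_ds are hypothetical pre-tokenized datasets you would supply yourself.

```python
# Hedged sketch, assuming `pip install transformers torch` and pre-tokenized
# datasets with "input_ids", "attention_mask", and "labels" columns.
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
)

model_name = "aubmindlab/bert-large-arabertv02"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

args = TrainingArguments(
    output_dir="arabert-sentiment",     # hypothetical output directory
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=2e-5,
)

# trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds)
# trainer.train()
```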
$-/run
1.1K
Huggingface
aragpt2-mega
Arabic GPT2. You can find more information in our paper, AraGPT2. The code in this repository was used to train all GPT2 variants, and it supports training and fine-tuning GPT2 on GPUs and TPUs via the TPUEstimator API.

GPT2-base and GPT2-medium use the code in the gpt2 folder and can train models from the minimaxir/gpt-2-simple repository. These models were trained with the LAMB optimizer, follow the same architecture as GPT2, and are fully compatible with the transformers library. GPT2-large and GPT2-mega were trained using the imcaspar/gpt2-ml library and follow the Grover architecture. You can use the PyTorch classes found in grover/modeling_gpt2.py as a direct replacement for the classes in the transformers library (it should support v4.x of transformers). Both models were trained with the Adafactor optimizer, since the Adam and LAMB optimizers use too much memory, causing the model to not fit even one batch on a TPU core. AraGPT2 is trained on the same large Arabic dataset as AraBERTv2.

Usage: to test the model with transformers you need to use the GPT2LMHeadModel from arabert (pip install arabert); a sketch is given below. To fine-tune with transformers, follow the linked guide. To fine-tune with our code and TF 1.15.4, first create the training TFRecords, then run fine-tuning.

Model sizes: all models are available on the HuggingFace model page under the aubmindlab name, with checkpoints in PyTorch, TF2, and TF1 formats. For dataset sources, see the Dataset section.

Dataset: the pretraining data used for the new AraBERT model is also used for GPT2 and ELECTRA. The dataset consists of 77GB, or 200,095,961 lines, or 8,655,948,860 words, or 82,232,988,358 characters (before applying Farasa segmentation). For the new dataset we added the unshuffled OSCAR corpus (after thoroughly filtering it) to the dataset used in AraBERTv1, but without the websites that we previously crawled: OSCAR unshuffled and filtered, the Arabic Wikipedia dump from 2020/09/01, the 1.5B-word Arabic Corpus, the OSIAN Corpus, and Assafir news articles. Huge thanks to Assafir for giving us the data.

Disclaimer: the text generated by Arabic GPT2 is produced automatically by a neural network model trained on a large amount of text and does not represent the authors' or their institutions' official attitudes and preferences. The generated text should only be used for research and scientific purposes. If it infringes on your rights or violates social morality, please do not propagate it.

If you used this model, please cite us.

Acknowledgments: thanks to TensorFlow Research Cloud (TFRC) for the free access to Cloud TPUs (we couldn't have done it without this program) and to the AUB MIND Lab members for the continuous support. Also thanks to Yakshof and Assafir for data and storage access, and to Habib Rahal (https://www.behance.net/rahalhabib) for putting a face to AraBERT.

Contacts: Wissam Antoun (wfa07@mail.aub.edu | wissam.antoun@gmail.com), Fady Baly (fgb06@mail.aub.edu | baly.fady@gmail.com)
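A hedged sketch of generating text with aragpt2-mega. The card says the large/mega variants follow the Grover architecture and need the GPT2LMHeadModel shipped in the arabert package (grover/modeling_gpt2.py); the exact import path below is an assumption based on that layout, and the prompt and sampling settings are illustrative.

```python
# Hedged sketch, assuming `pip install arabert transformers torch`.
from transformers import GPT2TokenizerFast, pipeline
from arabert.aragpt2.grover.modeling_gpt2 import GPT2LMHeadModel  # assumed path per card layout
from arabert.preprocess import ArabertPreprocessor

model_name = "aubmindlab/aragpt2-mega"
preprocessor = ArabertPreprocessor(model_name=model_name)

tokenizer = GPT2TokenizerFast.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)  # Grover-style class, not the stock transformers one

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
prompt = preprocessor.preprocess("يحكى أن مزارعا مخادعا")
print(generator(prompt, max_length=80, do_sample=True, top_p=0.95)[0]["generated_text"])
```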
$-/run
983
Huggingface
araelectra-base-discriminator
AraELECTRA. ELECTRA is a method for self-supervised language representation learning that can pre-train transformer networks using relatively little compute. ELECTRA models are trained to distinguish "real" input tokens from "fake" input tokens generated by another neural network, similar to the discriminator of a GAN. AraELECTRA achieves state-of-the-art results on Arabic QA datasets. For a detailed description, please refer to the AraELECTRA paper: AraELECTRA: Pre-Training Text Discriminators for Arabic Language Understanding. A sketch of how to use the discriminator with transformers is given below.

Dataset: the pretraining data used for the new AraELECTRA model is also used for AraGPT2 and AraBERTv2. The dataset consists of 77GB, or 200,095,961 lines, or 8,655,948,860 words, or 82,232,988,358 characters (before applying Farasa segmentation). For the new dataset we added the unshuffled OSCAR corpus (after thoroughly filtering it) to the dataset used in AraBERTv1, but without the websites that we previously crawled: OSCAR unshuffled and filtered, the Arabic Wikipedia dump from 2020/09/01, the 1.5B-word Arabic Corpus, the OSIAN Corpus, and Assafir news articles. Huge thanks to Assafir for giving us the data.

Preprocessing: it is recommended to apply our preprocessing function before training or testing on any dataset. Install the arabert python package to segment text for AraBERT v1 & v2 or to clean your data: pip install arabert

TensorFlow 1.x models: you can find the PyTorch, TF2, and TF1 models in HuggingFace's Transformers library under the aubmindlab username, or fetch the TF1 archive with wget https://huggingface.co/aubmindlab/MODEL_NAME/resolve/main/tf1_model.tar.gz, where MODEL_NAME is any model under the aubmindlab name.

If you used this model, please cite us.

Acknowledgments: thanks to TensorFlow Research Cloud (TFRC) for the free access to Cloud TPUs (we couldn't have done it without this program) and to the AUB MIND Lab members for the continuous support. Also thanks to Yakshof and Assafir for data and storage access, and to Habib Rahal (https://www.behance.net/rahalhabib) for putting a face to AraBERT.

Contacts: Wissam Antoun (wfa07@mail.aub.edu | wissam.antoun@gmail.com), Fady Baly (fgb06@mail.aub.edu | baly.fady@gmail.com)
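A hedged sketch of using the discriminator in transformers: it scores each token as original vs. replaced, mirroring the standard ELECTRA pre-training example. The input sentence is an illustrative assumption.

```python
# Hedged sketch, assuming `pip install transformers torch`.
import torch
from transformers import ElectraForPreTraining, ElectraTokenizerFast

model_name = "aubmindlab/araelectra-base-discriminator"
discriminator = ElectraForPreTraining.from_pretrained(model_name)
tokenizer = ElectraTokenizerFast.from_pretrained(model_name)

sentence = "عاصمة لبنان هي بيروت"
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    logits = discriminator(**inputs).logits  # one logit per token

# A positive logit means the discriminator thinks the token was replaced ("fake").
predictions = (logits > 0).int().squeeze().tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"].squeeze().tolist())
for token, is_fake in zip(tokens, predictions):
    print(f"{token}\t{'replaced' if is_fake else 'original'}")
```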
$-/run
841
Huggingface