Anas-awadalla

Rank:

Average Model Cost: $0.0000

Number of Runs: 15,234

Models by this creator

mpt-1b-redpajama-200b

MPT-1b-RedPajama-200b is a transformer model with 1.3 billion parameters, trained on the RedPajama dataset. It follows a modified decoder-only transformer architecture and was trained for 200 billion tokens. The model uses training efficiency features such as FlashAttention, ALiBi, and QK LayerNorm. It does not use positional embeddings or biases. The training data consists of a mix of datasets, including RedPajama Common Crawl, C4, RedPajama GitHub, RedPajama Wikipedia, RedPajama Books, RedPajama Arxiv, and RedPajama StackExchange. The model was trained on 440 A100-40GBs using sharded data parallelism.
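
A minimal loading sketch with transformers is shown below. The Hub id, the prompt, and the generate() call are assumptions for illustration; trust_remote_code=True follows from the custom MosaicGPT architecture described in the Dolly variant's card further down this page.

```python
import transformers

# Assumed Hugging Face Hub id for this checkpoint (not stated on this page).
name = "mosaicml/mpt-1b-redpajama-200b"

# The custom MosaicGPT architecture is not part of the transformers package,
# so trust_remote_code=True is required.
model = transformers.AutoModelForCausalLM.from_pretrained(name, trust_remote_code=True)

# The training data was tokenized with the EleutherAI/gpt-neox-20b tokenizer,
# so the same tokenizer is used for inference.
tokenizer = transformers.AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

inputs = tokenizer("The RedPajama dataset is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)  # assumes MosaicGPT supports generate()
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```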

$-/run

9.2K

Huggingface

mpt-1b-redpajama-200b-dolly

MPT-1b-RedPajama-200b-dolly is a 1.3 billion parameter decoder-only transformer pre-trained on the RedPajama dataset and subsequently fine-tuned on the Databricks Dolly instruction dataset. The model was pre-trained for 200B tokens by sampling from the subsets of the RedPajama dataset in the same proportions as were used by the Llama series of models. It was trained by MosaicML and follows a modified decoder-only transformer architecture. This model is an instruction fine-tuned version of mpt-1b-redpajama-200b; in other words, mpt-1b-redpajama-200b is its pre-trained base.

Model Date
April 20, 2023

How to Use
This model requires that trust_remote_code=True be passed to the from_pretrained method, because it uses a custom model architecture, MosaicGPT, that is not yet part of the transformers package. MosaicGPT includes options for many training efficiency features such as FlashAttention (Dao et al. 2022), ALiBi, QK LayerNorm, and more. To use the optimized triton implementation of FlashAttention, load with attn_impl='triton' and move the model to bfloat16, as in the sketch that follows this card.

Model Description
This model uses the MosaicML LLM codebase, which can be found in the MosaicML Examples Repository. The architecture is a modification of a standard decoder-only transformer, with 24 layers, 16 attention heads, and width 2048. It differs from a standard transformer in the following ways: it uses ALiBi and does not use positional embeddings, it uses QK LayerNorm, and it does not use biases.

Training Data
Pre-training: the model was pre-trained for 200B tokens (batch size 2200, sequence length 2048) on the following data mix:
67% RedPajama Common Crawl
15% C4
4.5% RedPajama GitHub
4.5% RedPajama Wikipedia
4.5% RedPajama Books
2.5% RedPajama Arxiv
2% RedPajama StackExchange
This is the same data mix as was used for the Llama series of models (https://arxiv.org/abs/2302.13971). Each sample was drawn from one of the datasets, with the dataset selected with the probability specified above. Examples were shuffled within each dataset, and each example was constructed from as many sequences from that dataset as were necessary to fill the 2048-token sequence length. The data was tokenized using the EleutherAI/gpt-neox-20b tokenizer.
Fine-tuning: the model was fine-tuned on the databricks-dolly-15k dataset released by Databricks, following the same hyperparameters found in their train_dolly.py script.

Training Configuration
The model was pre-trained on 440 A100-40GBs for about half a day using the MosaicML Platform, with sharded data parallelism via FSDP.

Acknowledgements
This model builds on the work of Together, which created the RedPajama dataset with the goal of mimicking the training data used to create the Llama series of models. We gratefully acknowledge the hard work of the team that put together this dataset, and we hope this model serves as a useful companion to that work. This model also builds on the work of Databricks, which created the Dolly instruction fine-tuning dataset. We also gratefully acknowledge the work of the researchers who created the Llama series of models, which was the impetus for our efforts, and of those who worked on the RedPajama project.
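
A sketch of the triton/bfloat16 loading path that the "How to Use" section refers to. The Hub id and the attn_impl keyword override are assumptions; adjust to the exact call shown on the model card if it differs.

```python
import torch
import transformers

# Assumed Hub id for this checkpoint.
name = "mosaicml/mpt-1b-redpajama-200b-dolly"

# Triton FlashAttention plus bfloat16, as described in "How to Use" above.
# Passing attn_impl="triton" through from_pretrained assumes MosaicGPT accepts it
# as a config override.
model = transformers.AutoModelForCausalLM.from_pretrained(
    name,
    trust_remote_code=True,
    attn_impl="triton",
)
model.to(device="cuda:0", dtype=torch.bfloat16)
```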

$-/run

2.9K

Huggingface

gpt2-large-lr-1e5-span-head-finetuned-squad

This model is a fine-tuned version of gpt2-large on the SQuAD dataset.

Model description, intended uses & limitations, and training and evaluation data: more information needed.

Training procedure
The following hyperparameters were used during training:
learning_rate: 1e-05
train_batch_size: 8
eval_batch_size: 8
seed: 42
distributed_type: multi-GPU
num_devices: 2
total_train_batch_size: 16
total_eval_batch_size: 16
optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
lr_scheduler_type: linear
num_epochs: 2.0

Framework versions
Transformers 4.20.0.dev0
Pytorch 1.11.0+cu113
Datasets 2.3.2
Tokenizers 0.11.6
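
For reference, a sketch of transformers TrainingArguments that mirrors the hyperparameters listed above. The output_dir is illustrative, and the Adam betas and epsilon are the Trainer defaults, so they are not set explicitly.

```python
from transformers import TrainingArguments

# Mirrors the listed hyperparameters: lr 1e-05, per-device batch size 8 on 2 GPUs
# (total 16), seed 42, linear schedule, 2 epochs. output_dir is illustrative.
training_args = TrainingArguments(
    output_dir="gpt2-large-lr-1e5-span-head-finetuned-squad",
    learning_rate=1e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    lr_scheduler_type="linear",
    num_train_epochs=2.0,
)
```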

$-/run

161

Huggingface

bert-tiny-finetuned-squad

This model is a fine-tuned version of prajjwal1/bert-tiny on the SQuAD dataset.

Model description, intended uses & limitations, and training and evaluation data: more information needed.

Training procedure
The following hyperparameters were used during training:
learning_rate: 3e-05
train_batch_size: 64
eval_batch_size: 8
seed: 42
distributed_type: multi-GPU
optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
lr_scheduler_type: linear
num_epochs: 2.0

Framework versions
Transformers 4.17.0
Pytorch 1.11.0+cu113
Datasets 2.0.0
Tokenizers 0.11.6
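
Because this is a standard extractive question-answering checkpoint, it can be exercised with the transformers question-answering pipeline. A minimal sketch, with the Hub id anas-awadalla/bert-tiny-finetuned-squad assumed and the question and context invented for illustration:

```python
from transformers import pipeline

# Hub id assumed; replace with the actual repository name if it differs.
qa = pipeline("question-answering", model="anas-awadalla/bert-tiny-finetuned-squad")

result = qa(
    question="Which base model was fine-tuned?",
    context="bert-tiny-finetuned-squad is a fine-tuned version of prajjwal1/bert-tiny on the SQuAD dataset.",
)
print(result)  # {'score': ..., 'start': ..., 'end': ..., 'answer': ...}
```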

$-/run

101

Huggingface

gpt2-span-head-finetuned-squad

This model is a fine-tuned version of gpt2 on the SQuAD dataset.

Model description, intended uses & limitations, and training and evaluation data: more information needed.

Training procedure
The following hyperparameters were used during training:
learning_rate: 3e-05
train_batch_size: 16
eval_batch_size: 16
seed: 42
distributed_type: multi-GPU
num_devices: 2
total_train_batch_size: 32
total_eval_batch_size: 32
optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
lr_scheduler_type: linear
num_epochs: 2.0

Framework versions
Transformers 4.20.0.dev0
Pytorch 1.11.0+cu113
Datasets 2.3.2
Tokenizers 0.11.6

$-/run

73

Huggingface

gpt2-span-head-few-shot-k-32-finetuned-squad-seed-0

This model is a fine-tuned version of gpt2 on the SQuAD dataset, trained in a few-shot setting (k=32, seed 0).

Model description, intended uses & limitations, and training and evaluation data: more information needed.

Training procedure
The following hyperparameters were used during training:
learning_rate: 3e-05
train_batch_size: 12
eval_batch_size: 8
seed: 0
optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_ratio: 0.1
training_steps: 200

Framework versions
Transformers 4.20.0.dev0
Pytorch 1.11.0+cu113
Datasets 2.3.2
Tokenizers 0.11.6

$-/run

58

Huggingface

bert-medium-pretrained-finetuned-squad

This model is a fine-tuned version of anas-awadalla/bert-medium-pretrained-on-squad on the SQuAD dataset. It achieves the following results on the evaluation set:
Loss: 0.0973
exact_match: 77.9565
f1: 85.8530

Model description, intended uses & limitations, and training and evaluation data: more information needed.

Training procedure
The following hyperparameters were used during training:
learning_rate: 5e-05
train_batch_size: 16
eval_batch_size: 16
seed: 42
optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
lr_scheduler_type: linear
num_epochs: 3.0

Framework versions
Transformers 4.16.0.dev0
Pytorch 1.10.1+cu102
Datasets 1.17.0
Tokenizers 0.10.3
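
The exact_match and f1 figures above are the standard SQuAD metrics. A sketch of how such scores are computed with the evaluate library follows; the prediction and reference entries are invented purely for illustration and are not taken from this model's evaluation.

```python
import evaluate

# Load the SQuAD metric (exact match and token-level F1).
squad_metric = evaluate.load("squad")

predictions = [{"id": "001", "prediction_text": "Denver Broncos"}]
references = [
    {"id": "001", "answers": {"text": ["Denver Broncos"], "answer_start": [177]}}
]

print(squad_metric.compute(predictions=predictions, references=references))
# -> {'exact_match': 100.0, 'f1': 100.0}
```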

$-/run

57

Huggingface
