Ccdv
Rank:
Average Model Cost: $0.0000
Number of Runs: 3,857
Models by this creator
lsg-bart-base-16384-pubmed
$-/run
1.1K
Huggingface
lsg-bart-base-4096-pubmed
$-/run
918
Huggingface
lsg-xlm-roberta-base-4096
$-/run
903
Huggingface
lsg-bart-base-4096
LSG model. Transformers >= 4.23.1. This model relies on a custom modeling file; you need to add trust_remote_code=True. See #13467.
LSG ArXiv paper. The GitHub/conversion script is available at this link.
Contents: Usage, Parameters, Sparse selection type, Tasks.
This model is adapted from BART-base for encoder-decoder tasks without additional pretraining. It uses the same number of parameters/layers and the same tokenizer. It can handle long sequences faster and more efficiently than Longformer (LED) or BigBird (Pegasus) from the hub, and it relies on Local + Sparse + Global attention (LSG).
The model requires sequences whose length is a multiple of the block size. The model is "adaptive" and automatically pads the sequences if needed (adaptive=True in the config). It is nevertheless recommended to truncate the inputs with the tokenizer (truncation=True) and optionally to pad to a multiple of the block size (pad_to_multiple_of=...). Implemented in PyTorch.
Usage
The model relies on a custom modeling file, so you need to add trust_remote_code=True to use it.
Parameters
You can change various parameters, such as:
- the number of global tokens (num_global_tokens=1)
- local block size (block_size=128)
- sparse block size (sparse_block_size=128)
- sparsity factor (sparsity_factor=2)
- mask_first_token (mask the first token, since it is redundant with the first global token)
- see the config.json file for the full list
Default parameters work well in practice. If you are short on memory, reduce the block sizes, increase the sparsity factor and remove dropout in the attention score matrix.
Sparse selection type
There are 5 different sparse selection patterns. The best type is task dependent. Note that for sequences with length < 2*block_size, the type has no effect.
- sparsity_type="norm": select the highest-norm tokens. Works best for a small sparsity_factor (2 to 4). Additional parameters: none.
- sparsity_type="pooling": use average pooling to merge tokens. Works best for a small sparsity_factor (2 to 4). Additional parameters: none.
- sparsity_type="lsh": use the LSH algorithm to cluster similar tokens. Works best for a large sparsity_factor (4+). LSH relies on random projections, so inference may differ slightly across seeds. Additional parameter: lsg_num_pre_rounds=1 (pre-merge tokens n times before computing centroids).
- sparsity_type="stride": use a striding mechanism per head. Each head uses different tokens strided by sparsity_factor. Not recommended if sparsity_factor > num_heads.
- sparsity_type="block_stride": use a striding mechanism per head. Each head uses blocks of tokens strided by sparsity_factor. Not recommended if sparsity_factor > num_heads.
Tasks
Seq2Seq example for summarization and classification example: see the hedged sketches after this card.
BART
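The code snippets for these tasks were stripped from this listing. Below is a minimal sketch of both, assuming the checkpoint is hosted under the Hugging Face repo id ccdv/lsg-bart-base-4096 and loads through the standard transformers Auto classes; adjust the repo id and generation settings to your setup.

```python
# Minimal sketch: summarization (Seq2Seq) with the LSG BART checkpoint.
# Assumption: the hub repo id is "ccdv/lsg-bart-base-4096".
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "ccdv/lsg-bart-base-4096"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# trust_remote_code=True is required because the model ships a custom modeling file.
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, trust_remote_code=True)

long_document = "Replace me by any long text you want to summarize."
# Truncate to the 4096-token window and pad to a multiple of the block size, as recommended above.
inputs = tokenizer(
    long_document,
    return_tensors="pt",
    truncation=True,
    max_length=4096,
    padding=True,
    pad_to_multiple_of=128,
)
with torch.no_grad():
    summary_ids = model.generate(**inputs, max_new_tokens=128, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```

For classification, the same checkpoint can be loaded with a (freshly initialized) sequence classification head:

```python
# Minimal sketch: sequence classification on long inputs.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "ccdv/lsg-bart-base-4096"  # assumed hub path
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    trust_remote_code=True,
    num_labels=2,  # hypothetical label count for illustration
)
inputs = tokenizer("Replace me by any long text.", return_tensors="pt", truncation=True, max_length=4096)
logits = model(**inputs).logits
```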
$-/run
498
Huggingface
lsg-bart-base-16384-mediasum
$-/run
112
Huggingface
lsg-bart-base-16384-arxiv
$-/run
87
Huggingface
lsg-bart-base-16384
LSG model. Transformers >= 4.23.1. This model relies on a custom modeling file; you need to add trust_remote_code=True. See #13467.
LSG ArXiv paper. The GitHub/conversion script is available at this link.
Contents: Usage, Parameters, Sparse selection type, Tasks.
This model is adapted from BART-base for encoder-decoder tasks without additional pretraining. It uses the same number of parameters/layers and the same tokenizer. It can handle long sequences faster and more efficiently than Longformer (LED) or BigBird (Pegasus) from the hub, and it relies on Local + Sparse + Global attention (LSG).
The model requires sequences whose length is a multiple of the block size. The model is "adaptive" and automatically pads the sequences if needed (adaptive=True in the config). It is nevertheless recommended to truncate the inputs with the tokenizer (truncation=True) and optionally to pad to a multiple of the block size (pad_to_multiple_of=...). Implemented in PyTorch.
Usage
The model relies on a custom modeling file, so you need to add trust_remote_code=True to use it.
Parameters
You can change various parameters, such as:
- the number of global tokens (num_global_tokens=1)
- local block size (block_size=128)
- sparse block size (sparse_block_size=128)
- sparsity factor (sparsity_factor=2)
- mask_first_token (mask the first token, since it is redundant with the first global token)
- see the config.json file for the full list
Default parameters work well in practice. If you are short on memory, reduce the block sizes, increase the sparsity factor and remove dropout in the attention score matrix.
Sparse selection type
There are 5 different sparse selection patterns. The best type is task dependent. Note that for sequences with length < 2*block_size, the type has no effect.
- sparsity_type="norm": select the highest-norm tokens. Works best for a small sparsity_factor (2 to 4). Additional parameters: none.
- sparsity_type="pooling": use average pooling to merge tokens. Works best for a small sparsity_factor (2 to 4). Additional parameters: none.
- sparsity_type="lsh": use the LSH algorithm to cluster similar tokens. Works best for a large sparsity_factor (4+). LSH relies on random projections, so inference may differ slightly across seeds. Additional parameter: lsg_num_pre_rounds=1 (pre-merge tokens n times before computing centroids).
- sparsity_type="stride": use a striding mechanism per head. Each head uses different tokens strided by sparsity_factor. Not recommended if sparsity_factor > num_heads.
- sparsity_type="block_stride": use a striding mechanism per head. Each head uses blocks of tokens strided by sparsity_factor. Not recommended if sparsity_factor > num_heads.
Tasks
Seq2Seq example for summarization and classification example: see the hedged sketch after this card.
BART
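The 16384-token variant is used the same way as the 4096-token one; since the card above stresses the tunable LSG parameters, the sketch below shows overriding some of them at load time. It assumes the hub repo id ccdv/lsg-bart-base-16384 and relies on transformers forwarding unrecognized from_pretrained keyword arguments to the model config; verify the field names against the checkpoint's config.json.

```python
# Minimal sketch: loading the 16384-token LSG BART with non-default LSG parameters.
# Assumption: repo id "ccdv/lsg-bart-base-16384"; parameter names are taken from the card above.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "ccdv/lsg-bart-base-16384"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_id,
    trust_remote_code=True,   # custom modeling file
    num_global_tokens=16,     # more global tokens than the default of 1
    block_size=64,            # smaller local blocks to save memory
    sparse_block_size=64,
    sparsity_factor=4,
    sparsity_type="lsh",      # one of: norm, pooling, lsh, stride, block_stride
    mask_first_token=True,    # first token is redundant with the first global token
)

inputs = tokenizer(
    "Replace me by any very long text you want to summarize.",
    return_tensors="pt",
    truncation=True,
    max_length=16384,
    padding=True,
    pad_to_multiple_of=64,    # match the block size chosen above
)
summary_ids = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```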
$-/run
82
Huggingface
lsg-bert-base-uncased-4096
$-/run
78
Huggingface
lsg-pegasus-large-4096
LSG model. Transformers >= 4.23.1. This model relies on a custom modeling file; you need to add trust_remote_code=True. See #13467.
LSG ArXiv paper. The GitHub/conversion script is available at this link.
Contents: Usage, Parameters, Sparse selection type, Tasks.
This model is adapted from Pegasus-large for encoder-decoder tasks without additional pretraining. It uses the same number of parameters/layers and the same tokenizer. It can handle long sequences faster and more efficiently than Longformer (LED) or BigBird (Pegasus) from the hub, and it relies on Local + Sparse + Global attention (LSG).
The model requires sequences whose length is a multiple of the block size. The model is "adaptive" and automatically pads the sequences if needed (adaptive=True in the config). It is nevertheless recommended to truncate the inputs with the tokenizer (truncation=True) and optionally to pad to a multiple of the block size (pad_to_multiple_of=...). Implemented in PyTorch.
Usage
The model relies on a custom modeling file, so you need to add trust_remote_code=True to use it.
Parameters
You can change various parameters, such as:
- the number of global tokens (num_global_tokens=1)
- local block size (block_size=128)
- sparse block size (sparse_block_size=128)
- sparsity factor (sparsity_factor=2)
- see the config.json file for the full list
Default parameters work well in practice. If you are short on memory, reduce the block sizes, increase the sparsity factor and remove dropout in the attention score matrix.
Sparse selection type
There are 5 different sparse selection patterns. The best type is task dependent. Note that for sequences with length < 2*block_size, the type has no effect.
- sparsity_type="norm": select the highest-norm tokens. Works best for a small sparsity_factor (2 to 4). Additional parameters: none.
- sparsity_type="pooling": use average pooling to merge tokens. Works best for a small sparsity_factor (2 to 4). Additional parameters: none.
- sparsity_type="lsh": use the LSH algorithm to cluster similar tokens. Works best for a large sparsity_factor (4+). LSH relies on random projections, so inference may differ slightly across seeds. Additional parameter: lsg_num_pre_rounds=1 (pre-merge tokens n times before computing centroids).
- sparsity_type="stride": use a striding mechanism per head. Each head uses different tokens strided by sparsity_factor. Not recommended if sparsity_factor > num_heads.
- sparsity_type="block_stride": use a striding mechanism per head. Each head uses blocks of tokens strided by sparsity_factor. Not recommended if sparsity_factor > num_heads.
Tasks
Seq2Seq example for summarization and classification example: see the hedged sketch after this card.
Pegasus
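As with the BART variants, the summarization snippet was stripped from this listing; here is a minimal sketch under the assumption that the checkpoint lives at ccdv/lsg-pegasus-large-4096 and loads through the standard Auto classes.

```python
# Minimal sketch: long-document summarization with the LSG Pegasus checkpoint.
# Assumption: hub repo id "ccdv/lsg-pegasus-large-4096".
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "ccdv/lsg-pegasus-large-4096"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, trust_remote_code=True)  # custom modeling file

inputs = tokenizer(
    "Replace me by any long text you want to summarize.",
    return_tensors="pt",
    truncation=True,
    max_length=4096,
    padding=True,
    pad_to_multiple_of=128,  # default block size from the card above
)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=128, num_beams=4)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```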
$-/run
60
Huggingface
lsg-legal-base-uncased-4096
LSG model. Transformers >= 4.23.1. This model relies on a custom modeling file; you need to add trust_remote_code=True. See #13467.
LSG ArXiv paper. The GitHub/conversion script is available at this link.
Contents: Usage, Parameters, Sparse selection type, Tasks, Training global tokens.
This model is adapted from LEGAL-BERT without additional pretraining yet. It uses the same number of parameters/layers and the same tokenizer. It can handle long sequences faster and more efficiently than Longformer or BigBird (from Transformers), and it relies on Local + Sparse + Global attention (LSG).
The model requires sequences whose length is a multiple of the block size. The model is "adaptive" and automatically pads the sequences if needed (adaptive=True in the config). It is nevertheless recommended to truncate the inputs with the tokenizer (truncation=True) and optionally to pad to a multiple of the block size (pad_to_multiple_of=...). Encoder-decoder use is supported, but it has not been tested extensively. Implemented in PyTorch.
Usage
The model relies on a custom modeling file, so you need to add trust_remote_code=True to use it.
Parameters
You can change various parameters, such as:
- the number of global tokens (num_global_tokens=1)
- local block size (block_size=128)
- sparse block size (sparse_block_size=128)
- sparsity factor (sparsity_factor=2)
- mask_first_token (mask the first token, since it is redundant with the first global token)
- see the config.json file for the full list
Default parameters work well in practice. If you are short on memory, reduce the block sizes, increase the sparsity factor and remove dropout in the attention score matrix.
Sparse selection type
There are 5 different sparse selection patterns. The best type is task dependent. Note that for sequences with length < 2*block_size, the type has no effect.
- sparsity_type="norm": select the highest-norm tokens. Works best for a small sparsity_factor (2 to 4). Additional parameters: none.
- sparsity_type="pooling": use average pooling to merge tokens. Works best for a small sparsity_factor (2 to 4). Additional parameters: none.
- sparsity_type="lsh": use the LSH algorithm to cluster similar tokens. Works best for a large sparsity_factor (4+). LSH relies on random projections, so inference may differ slightly across seeds. Additional parameter: lsg_num_pre_rounds=1 (pre-merge tokens n times before computing centroids).
- sparsity_type="stride": use a striding mechanism per head. Each head uses different tokens strided by sparsity_factor. Not recommended if sparsity_factor > num_heads.
- sparsity_type="block_stride": use a striding mechanism per head. Each head uses blocks of tokens strided by sparsity_factor. Not recommended if sparsity_factor > num_heads.
Tasks
Fill mask example and classification example: see the hedged sketches after this card.
Training global tokens
To train the global tokens and the classification head only, freeze all other parameters: see the sketch after this card.
LEGAL-BERT
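The fill-mask and global-token training snippets were stripped from this listing. The sketches below assume the hub repo id ccdv/lsg-legal-base-uncased-4096; the parameter-name filter used for freezing is a guess, so inspect model.named_parameters() to find the exact module names the custom LSG code uses for global tokens.

```python
# Minimal sketch: fill-mask with the LSG LEGAL-BERT checkpoint.
# Assumption: hub repo id "ccdv/lsg-legal-base-uncased-4096".
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

model_id = "ccdv/lsg-legal-base-uncased-4096"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id, trust_remote_code=True)  # custom modeling file

fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill_mask("This agreement shall be governed by the laws of [MASK]."))
```

To train only the global tokens and the classification head, freeze everything else:

```python
# Minimal sketch: fine-tune only the global-token embeddings and the classifier head.
# The name filters ("global", "classifier") are assumptions about the module names.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "ccdv/lsg-legal-base-uncased-4096",  # assumed hub path
    trust_remote_code=True,
    num_labels=2,  # hypothetical label count
)

for name, param in model.named_parameters():
    # Keep gradients only for global-token embeddings and the classification head.
    param.requires_grad = ("global" in name) or ("classifier" in name)

print([n for n, p in model.named_parameters() if p.requires_grad])
```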
$-/run
56
Huggingface