Monsoon-nlp
Rank:
Average Model Cost: $0.0000
Number of Runs: 3,342
Models by this creator
hindi-bert: $-/run, 2.4K runs, Huggingface
bert-base-thai: $-/run, 650 runs, Huggingface
bangla-electra: $-/run, 126 runs, Huggingface
hindi-tpu-electra: $-/run, 46 runs, Huggingface
ar-seq2seq-gender-encoder: $-/run, 30 runs, Huggingface
ar-seq2seq-gender-decoder: $-/run, 30 runs, Huggingface
gpt-winowhy: $-/run, 29 runs, Huggingface
es-seq2seq-gender-encoder
This is a seq2seq model (encoder half) to "flip" gender in Spanish sentences. It is intended to augment your existing Spanish data or to generate counterfactuals for testing a model's decisions (would changing the gender of the subject or speaker change the output?). People's names are unchanged in this version, but you can use packages such as https://pypi.org/project/gender-guesser/ to handle them; a companion notebook is at https://colab.research.google.com/drive/1Ta_YkXx93FyxqEu_zJ-W23PjPumMNHe5, and a usage sketch follows this entry.

I originally developed a gender-flip Python script with BETO, the Spanish-language BERT from Universidad de Chile, using spaCy to parse dependencies in sentences. More about this project: https://medium.com/ai-in-plain-english/gender-bias-in-spanish-bert-1f4d76780617

The seq2seq model is trained on gender-flipped text from that script, run on the muchocine dataset and the first 6,853 lines of the OSCAR corpus (Spanish, deduplicated). The encoder and decoder started with weights and vocabulary from BETO (uncased).

This model is useful for generating male and female text samples, but it falls short of capturing gender diversity in the world and in the Spanish language. Some communities prefer the plural -@s to represent -os and -as, or -e and -es for gender-neutral or mixed-gender plurals, or use fewer gendered professional nouns (la juez rather than la jueza). These forms are not yet embraced by the Royal Spanish Academy and are not represented in the corpora and tokenizers used to build this project. In the future, this seq2seq project and script could help generate more text samples and prepare NLP models to understand us all better.
$-/run, 28 runs, Huggingface
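The sketch below is not from the model card: it assumes the encoder and decoder halves are published as the Hugging Face repos monsoon-nlp/es-seq2seq-gender-encoder and monsoon-nlp/es-seq2seq-gender-decoder, and that they can be paired with transformers' EncoderDecoderModel; the example sentence and generation settings are illustrative only.

```python
# Hedged sketch: pair the encoder and decoder halves with EncoderDecoderModel
# and generate a gender-flipped Spanish sentence.
# Repo IDs and generation settings are assumptions, not from the model card.
from transformers import AutoTokenizer, EncoderDecoderModel

ENCODER = "monsoon-nlp/es-seq2seq-gender-encoder"  # assumed repo ID
DECODER = "monsoon-nlp/es-seq2seq-gender-decoder"  # assumed repo ID

tokenizer = AutoTokenizer.from_pretrained(ENCODER)
model = EncoderDecoderModel.from_encoder_decoder_pretrained(ENCODER, DECODER)

text = "La profesora estaba cansada."  # illustrative input sentence
inputs = tokenizer(text, return_tensors="pt")
output_ids = model.generate(
    inputs.input_ids,
    decoder_start_token_id=tokenizer.cls_token_id,  # BERT-style decoder needs an explicit start token
    max_length=32,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

If the combined seq2seq weights are instead shipped as a single repo, loading that one checkpoint directly would replace the from_encoder_decoder_pretrained call.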
tamillion
This is the second version of a Tamil language model trained with Google Research's ELECTRA. Tokenization and pre-training Colab: https://colab.research.google.com/drive/1Pwia5HJIb6Ad4Hvbx5f-IjND-vCaJzSE?usp=sharing

V1: small model trained on a GPU, 190,000 steps. V2 (current): base model trained on a TPU with a larger corpus, 224,000 steps.

Evaluation: Sudalai Rajkumar's Tamil-NLP page contains classification and regression tasks: https://www.kaggle.com/sudalairajkumar/tamil-nlp (notebook: https://colab.research.google.com/drive/1_rW9HZb6G87-5DraxHvhPOzGmSMUc67_?usp=sharin). The model outperformed mBERT on news classification (random: 16.7%, mBERT: 53.0%, TaMillion: 75.1%) and slightly outperformed mBERT on movie reviews (RMSE: mBERT 0.657, TaMillion 0.626), with equivalent accuracy on the Tirukkural topic task. I didn't find a Tamil-language question answering dataset, but this model could be fine-tuned to train a QA model; see Hindi and Bengali examples here: https://colab.research.google.com/drive/1i6fidh2tItf_-IDkljMuaIGmEU6HT2Ar

Trained on IndicCorp Tamil (11GB, https://indicnlp.ai4bharat.org/corpora/) and a 1 October 2020 dump of https://ta.wikipedia.org (482MB). The vocabulary is included as vocab.txt in the upload. A fine-tuning sketch follows this entry.
$-/run, 28 runs, Huggingface
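The sketch below is not from the model card: it assumes the checkpoint is published as the Hugging Face repo monsoon-nlp/tamillion and loads it as an ELECTRA encoder with a fresh classification head for fine-tuning. The label count of 6 is chosen only to match the ~16.7% random baseline quoted above (six news categories); the example text is a placeholder.

```python
# Hedged sketch: load TaMillion with an untrained classification head,
# ready to fine-tune on a Tamil text-classification task.
# The repo ID and num_labels are assumptions for illustration.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "monsoon-nlp/tamillion"  # assumed repo ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=6)

batch = tokenizer(
    ["தமிழ் செய்தி உரை இங்கே"],  # placeholder Tamil text
    padding=True,
    truncation=True,
    return_tensors="pt",
)
with torch.no_grad():
    logits = model(**batch).logits
print(logits.shape)  # (1, 6); the classification head is random until fine-tuned
```

The same encoder could back a QA head (e.g. AutoModelForQuestionAnswering) if a Tamil QA dataset were available, as the card suggests.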
es-seq2seq-gender-decoder: $-/run, 23 runs, Huggingface