twitter-XLM-roBERTa-base for Sentiment Analysis
===============================================

This is a multilingual XLM-roBERTa-base model trained on ~198M tweets and fine-tuned for sentiment analysis. Sentiment fine-tuning covered 8 languages (Ar, En, Fr, De, Hi, It, Sp, Pt), but the model can be used for more languages (see the paper for details).

*   Paper: [XLM-T: A Multilingual Language Model Toolkit for Twitter](https://arxiv.org/abs/2104.12250).
*   Git Repo: [XLM-T official repository](https://github.com/cardiffnlp/xlm-t).

This model has been integrated into the [TweetNLP library](https://github.com/cardiffnlp/tweetnlp).

Example Pipeline
----------------

    from transformers import pipeline
    model_path = "cardiffnlp/twitter-xlm-roberta-base-sentiment"
    sentiment_task = pipeline("sentiment-analysis", model=model_path, tokenizer=model_path)
    sentiment_task("T'estimo!")
    

    [{'label': 'Positive', 'score': 0.6600581407546997}]
    

Full classification example
---------------------------

    from transformers import AutoModelForSequenceClassification
    from transformers import TFAutoModelForSequenceClassification
    from transformers import AutoTokenizer, AutoConfig
    import numpy as np
    from scipy.special import softmax
    
    # Preprocess text (username and link placeholders)
    def preprocess(text):
        new_text = []
        for t in text.split(" "):
            t = '@user' if t.startswith('@') and len(t) > 1 else t
            t = 'http' if t.startswith('http') else t
            new_text.append(t)
        return " ".join(new_text)
    
    MODEL = "cardiffnlp/twitter-xlm-roberta-base-sentiment"
    
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    config = AutoConfig.from_pretrained(MODEL)
    
    # PyTorch
    model = AutoModelForSequenceClassification.from_pretrained(MODEL)
    model.save_pretrained(MODEL)  # optional: writes a local copy to a directory named after MODEL
    
    text = "Good night "
    text = preprocess(text)
    encoded_input = tokenizer(text, return_tensors='pt')
    output = model(**encoded_input)
    scores = output[0][0].detach().numpy()
    scores = softmax(scores)
    
    # # TF
    # model = TFAutoModelForSequenceClassification.from_pretrained(MODEL)
    # model.save_pretrained(MODEL)
    
    # text = "Good night "
    # encoded_input = tokenizer(text, return_tensors='tf')
    # output = model(encoded_input)
    # scores = output[0][0].numpy()
    # scores = softmax(scores)
    
    # Print labels and scores
    ranking = np.argsort(scores)
    ranking = ranking[::-1]
    for i in range(scores.shape[0]):
        l = config.id2label[ranking[i]]
        s = scores[ranking[i]]
        print(f"{i+1}) {l} {np.round(float(s), 4)}")
    

Output:

    1) Positive 0.7673
    2) Neutral 0.2015
    3) Negative 0.0313
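
The scoring step above turns the model's raw logits into probabilities with softmax and then sorts the labels by probability. As a minimal sketch of that step in isolation, the snippet below uses only the standard library. The logits are hypothetical values, not model output, and the label order in `ID2LABEL` is an assumption; in real use, read the mapping from `config.id2label`.

```python
import math

# Assumed label order for illustration; read the real mapping from config.id2label.
ID2LABEL = {0: "Negative", 1: "Neutral", 2: "Positive"}

def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rank_labels(logits, id2label=ID2LABEL):
    """Return (label, probability) pairs, most probable first."""
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in order]

# Hypothetical logits, not produced by the model.
example_logits = [-1.5, 0.4, 1.7]
for rank, (label, prob) in enumerate(rank_labels(example_logits), start=1):
    print(f"{rank}) {label} {prob:.4f}")
```

Because softmax preserves the ordering of the logits, the top-ranked label is always the one with the largest logit; the probabilities only make the scores comparable across inputs.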
    

### Reference

    @inproceedings{barbieri-etal-2022-xlm,
        title = "{XLM}-{T}: Multilingual Language Models in {T}witter for Sentiment Analysis and Beyond",
        author = "Barbieri, Francesco  and
          Espinosa Anke, Luis  and
          Camacho-Collados, Jose",
        booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
        month = jun,
        year = "2022",
        address = "Marseille, France",
        publisher = "European Language Resources Association",
        url = "https://aclanthology.org/2022.lrec-1.27",
        pages = "258--266"
    }

## Model overview

The `twitter-xlm-roberta-base-sentiment` model is a multilingual XLM-roBERTa-base model trained on ~198M tweets and fine-tuned for sentiment analysis. The model supports sentiment analysis in 8 languages (Arabic, English, French, German, Hindi, Italian, Spanish, and Portuguese), but can potentially be used for more languages as well. This model was developed by [cardiffnlp](https://aimodels.fyi/creators/huggingFace/cardiffnlp).

Similar models include the [xlm-roberta-base-language-detection](https://aimodels.fyi/models/huggingFace/xlm-roberta-base-language-detection-papluca) model, which is a fine-tuned version of the XLM-RoBERTa base model for language identification, and the [xlm-roberta-large](https://aimodels.fyi/models/huggingFace/xlm-roberta-large-facebookai) and [xlm-roberta-base](https://aimodels.fyi/models/huggingFace/xlm-roberta-base-facebookai) models, which are the base and large versions of the multilingual XLM-RoBERTa model.

## Model inputs and outputs

### Inputs
- Text sequences for sentiment analysis

### Outputs
- A label indicating the predicted sentiment (Positive, Negative, or Neutral)
- A score between 0 and 1 representing the model's confidence in that label

## Capabilities

The `twitter-xlm-roberta-base-sentiment` model can perform sentiment analysis on text in 8 languages: Arabic, English, French, German, Hindi, Italian, Spanish, and Portuguese. It was trained on a large corpus of tweets, giving it the ability to analyze the sentiment of short, informal text.
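
Because the model was trained on tweets, the full classification example earlier in this card replaces user mentions and links with placeholders before tokenizing. The sketch below restates that `preprocess` function and applies it to a hypothetical tweet, so the effect on informal input is visible without loading the model.

```python
def preprocess(text):
    """Replace user mentions with '@user' and links with 'http'."""
    tokens = []
    for token in text.split(" "):
        if token.startswith("@") and len(token) > 1:
            token = "@user"  # anonymize mentions, keep a bare '@' untouched
        elif token.startswith("http"):
            token = "http"   # collapse any URL to a single placeholder
        tokens.append(token)
    return " ".join(tokens)

# Hypothetical tweet used only for illustration.
tweet = "@alice check this out https://example.com/post so good"
print(preprocess(tweet))
```

The function splits on single spaces only, so mentions or links attached to other characters (for example inside parentheses) are left unchanged.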

## What can I use it for?

This model can be used for applications that require multilingual sentiment analysis, such as social media monitoring, customer service analysis, and market research. Because it handles several languages, a single model can process text from a wide range of sources and users.
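
For monitoring use cases, per-message predictions are usually aggregated into a summary. The following is a minimal sketch, assuming predictions arrive as a list of `{'label': ..., 'score': ...}` dictionaries like the pipeline output shown earlier. The helper name `sentiment_share` and the sample predictions are hypothetical.

```python
from collections import Counter

def sentiment_share(predictions):
    """Return the fraction of predictions assigned to each label."""
    counts = Counter(p["label"] for p in predictions)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}

# Hypothetical predictions in the pipeline's output format, not real model output.
predictions = [
    {"label": "Positive", "score": 0.91},
    {"label": "Negative", "score": 0.75},
    {"label": "Positive", "score": 0.66},
    {"label": "Neutral", "score": 0.58},
]
print(sentiment_share(predictions))
```

This counts every prediction equally; weighting by score or dropping low-confidence predictions are design choices that depend on the application.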

## Things to try

One interesting thing to try with this model is to experiment with the different languages it supports. Since the model was trained on a diverse dataset of tweets, it may be able to capture nuances in sentiment that are specific to certain cultures or languages. Developers could try using the model to analyze sentiment in languages beyond the 8 it was specifically fine-tuned on, to see how it performs.
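
One way to structure such an experiment is to score a small labeled sample per language and compare accuracies. The sketch below assumes you have your own labeled examples; the helper `accuracy_by_language`, the stub classifier, and the sample sentences with their gold labels are all hypothetical and stand in for the real model so the snippet runs without downloading anything.

```python
from collections import defaultdict

def accuracy_by_language(samples, classify):
    """Compute accuracy per language.

    samples: iterable of (language, text, gold_label) tuples.
    classify: callable mapping a text to a predicted label.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for language, text, gold in samples:
        total[language] += 1
        if classify(text) == gold:
            correct[language] += 1
    return {language: correct[language] / total[language] for language in total}

def stub_classify(text):
    """Placeholder classifier for illustration only; replace with the real model."""
    return "Positive" if "!" in text else "Neutral"

# Hypothetical labeled samples in two languages outside the fine-tuning set.
samples = [
    ("ca", "T'estimo!", "Positive"),
    ("ca", "Avui plou.", "Neutral"),
    ("nl", "Wat een mooie dag!", "Positive"),
    ("nl", "Ik ben boos.", "Negative"),
]
print(accuracy_by_language(samples, stub_classify))
```

To evaluate the actual model, `classify` could wrap the pipeline from the example above, for instance `lambda text: sentiment_task(text)[0]["label"]`, provided the gold labels use the same label strings the pipeline returns.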

Another idea is to compare this model with zero-shot classification models, such as [bart-large-mnli](https://huggingface.co/facebook/bart-large-mnli) or the [valhalla](https://huggingface.co/models?filter=pipeline_tag%3Azero-shot-classification&search=valhalla) models, which can be given sentiment labels as candidate classes, to see how each fares on different types of text and tasks.