cross-en-de-roberta-sentence-transformer

Maintainer: T-Systems-onsite

Total Score

54

Last updated 5/17/2024


  • Model Link: View on HuggingFace
  • API Spec: View on HuggingFace
  • Github Link: No Github link provided
  • Paper Link: No paper link provided


Model overview

The cross-en-de-roberta-sentence-transformer model is a multilingual sentence embedding model fine-tuned by T-Systems-onsite. It is capable of computing semantically meaningful sentence embeddings for both English and German text. These embeddings can then be compared using cosine similarity to find sentences with similar meanings, which can be useful for tasks like semantic textual similarity, semantic search, and paraphrase mining.

The model is an extension of the Sentence-BERT (SBERT) architecture, which uses a Siamese network structure to derive sentence embeddings that can be compared efficiently. Compared to using a standard BERT or RoBERTa cross-encoder, this reduces the effort of finding the most similar pair in a collection of 10,000 sentences from roughly 65 hours to about 5 seconds, while maintaining high accuracy.

What sets this model apart is its ability to work cross-lingually. Sentences in either English or German are mapped to similar vector representations based on their semantic meaning. This allows you to, for example, search for results in German and also find relevant content in English.
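
To make this concrete, here is a minimal sketch of computing and comparing embeddings with this model. It assumes the sentence-transformers Python package is installed; the example sentences are invented for illustration, and the model ID is the one published on HuggingFace.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("T-Systems-onsite/cross-en-de-roberta-sentence-transformer")

sentences = [
    "This is an example sentence.",   # English
    "Dies ist ein Beispielsatz.",     # German translation of the same sentence
    "The weather is nice today.",     # unrelated English sentence
]

# One 768-dimensional vector per sentence.
embeddings = model.encode(sentences)

# Pairwise cosine similarities; the EN/DE translation pair should score
# much higher than the unrelated pair.
scores = util.cos_sim(embeddings, embeddings)
print(scores[0][1].item(), scores[0][2].item())
```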

Model inputs and outputs

Inputs

  • Text: The model takes text as input, which can be individual sentences or longer passages.

Outputs

  • Sentence embeddings: The model outputs a 768-dimensional vector representation for each input text. These sentence embeddings capture the semantic meaning of the input and can be compared using cosine similarity.

Capabilities

The cross-en-de-roberta-sentence-transformer model is particularly adept at tasks that require understanding the semantic similarity between text, such as:

  • Semantic Textual Similarity: Comparing the meaning of two sentences or passages and quantifying how similar they are.
  • Semantic Search: Retrieving the most relevant sentences or documents from a corpus based on the semantic meaning of a query (a minimal sketch follows after this list).
  • Paraphrase Mining: Identifying sentences that express the same meaning using different wording.

The model's cross-lingual capabilities make it well-suited for use cases involving both English and German text, where you may need to find semantically related content across languages.
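
As a hedged sketch of the semantic-search use case: the corpus and query below are invented, and util.semantic_search is the sentence-transformers helper for ranked retrieval.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("T-Systems-onsite/cross-en-de-roberta-sentence-transformer")

corpus = [
    "How do I reset my password?",
    "Wie kann ich mein Passwort zurücksetzen?",
    "Our office is closed on public holidays.",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

query = "Passwort vergessen"  # a German query should match both password FAQs
query_embedding = model.encode(query, convert_to_tensor=True)

# Returns, for each query, a ranked list of {"corpus_id", "score"} hits.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(corpus[hit["corpus_id"]], round(hit["score"], 3))
```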

What can I use it for?

This model can be a powerful tool for a variety of applications, including:

  • Enterprise search: Enable users to search your company's knowledge base or documents using natural language queries, and retrieve the most relevant content based on semantic meaning rather than just keyword matching.

  • Customer support: Automatically surface the most relevant FAQs or help articles for a customer's query, even if the wording doesn't exactly match the available content.

  • Content recommendation: Suggest related articles, blog posts, or other content to users based on the semantic similarity of the text, rather than just popularity or keyword overlap.

According to the maintainer's profile, T-Systems-onsite specializes in providing AI solutions for enterprises and may be open to partnering with companies that want to apply natural language processing capabilities like this model to their business challenges.

Things to try

One interesting aspect of this model is its ability to work cross-lingually between English and German. You could experiment with using the model to:

  • Implement a bilingual search engine, where users can query in either language and retrieve relevant results in both English and German.
  • Develop a paraphrase generation tool that can suggest alternative phrasings of a given sentence, even across the language barrier; the paraphrase-mining helper sketched after this list is one possible building block.
  • Analyze the differences in how semantically similar sentences are encoded in the two languages, which could provide insights into cultural or linguistic differences.
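
As one building block for these experiments, here is a hedged sketch of cross-lingual paraphrase mining using the library's built-in helper; the sentences are invented for illustration.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("T-Systems-onsite/cross-en-de-roberta-sentence-transformer")

sentences = [
    "The cat sits on the mat.",
    "Die Katze sitzt auf der Matte.",  # German paraphrase
    "I enjoy reading books.",
    "Ich lese gerne Bücher.",          # German paraphrase
]

# Returns (score, i, j) triples for the most similar sentence pairs,
# sorted by descending similarity.
pairs = util.paraphrase_mining(model, sentences)
for score, i, j in pairs[:2]:
    print(f"{score:.3f}  {sentences[i]}  <->  {sentences[j]}")
```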

By leveraging the unique cross-lingual capabilities of the cross-en-de-roberta-sentence-transformer model, you can unlock new possibilities for working with multilingual text data in your applications.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Models

distiluse-base-multilingual-cased-v1

sentence-transformers

Total Score

83

The distiluse-base-multilingual-cased-v1 is a sentence-transformers model that maps sentences and paragraphs to a 512-dimensional dense vector space. It can be used for tasks like clustering or semantic search. This model is similar to other sentence-transformers models such as paraphrase-xlm-r-multilingual-v1, paraphrase-multilingual-MiniLM-L12-v2, and paraphrase-multilingual-mpnet-base-v2, which also use the sentence-transformers framework.

Model inputs and outputs

Inputs

  • Text: The model takes in sentences or paragraphs of text as input.

Outputs

  • Embeddings: The model outputs a 512-dimensional dense vector representing the semantic meaning of the input text.

Capabilities

The distiluse-base-multilingual-cased-v1 model can be used for a variety of natural language processing tasks that benefit from semantic understanding of text, such as text clustering, information retrieval, and question answering. Its multilingual capabilities make it useful for working with text in different languages.

What can I use it for?

The distiluse-base-multilingual-cased-v1 model can be used for a wide range of applications that require understanding the semantic meaning of text, such as:

  • Semantic search: The model can be used to encode queries and documents into a dense vector space, allowing for efficient semantic search and retrieval.
  • Text clustering: The model's embeddings can be used to cluster similar text documents or paragraphs together.
  • Recommendation systems: The model's embeddings can be used to find semantically similar content to recommend to users.
  • Chatbots and dialogue systems: The model can be used to understand the meaning of user inputs in a multilingual setting.

Things to try

One interesting thing to try with the distiluse-base-multilingual-cased-v1 model is to compare its performance on various natural language tasks to the performance of the other sentence-transformers models. You could also experiment with using the model's embeddings in different downstream applications, such as building a semantic search engine or a text clustering system.
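
As a hedged illustration of the clustering idea, here is a minimal sketch that groups this model's embeddings with scikit-learn's KMeans; scikit-learn, the sentences, and the cluster count are assumptions for illustration.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer("sentence-transformers/distiluse-base-multilingual-cased-v1")

sentences = [
    "The stock market fell sharply today.",
    "Die Börse ist heute stark gefallen.",   # German
    "My dog loves long walks.",
    "Mein Hund liebt lange Spaziergänge.",   # German
]

# 512-dimensional embeddings, one per sentence.
embeddings = model.encode(sentences)

# Translation pairs should land in the same cluster.
labels = KMeans(n_clusters=2, n_init=10).fit_predict(embeddings)
print(labels)
```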

distiluse-base-multilingual-cased-v2

sentence-transformers

Total Score

134

The distiluse-base-multilingual-cased-v2 is a sentence-transformers model that maps sentences and paragraphs to a 512-dimensional dense vector space. It can be used for tasks like clustering or semantic search. This model is similar to other sentence-transformers models like distiluse-base-multilingual-cased-v1, paraphrase-multilingual-mpnet-base-v2, paraphrase-multilingual-MiniLM-L12-v2, and paraphrase-xlm-r-multilingual-v1, all of which were developed by the sentence-transformers team.

Model inputs and outputs

Inputs

  • Text: The model accepts text inputs, such as sentences or paragraphs.

Outputs

  • Sentence embeddings: The model outputs 512-dimensional dense vector representations of the input text.

Capabilities

The distiluse-base-multilingual-cased-v2 model can be used to encode text into semantic representations that capture the meaning and context of the input. These sentence embeddings can then be used for a variety of natural language processing tasks, such as information retrieval, text clustering, and semantic similarity analysis.

What can I use it for?

The sentence embeddings generated by this model can be used in a wide range of applications. For example, you could use the model to build a semantic search engine, where users can search for relevant content by providing a natural language query. The model could also be used to cluster similar documents or paragraphs, which could be useful for organizing large corpora of text data.

Things to try

One interesting thing to try with this model is to experiment with different pooling strategies for generating the sentence embeddings. The model uses mean pooling by default, but you could also try max pooling or other techniques to see how they affect the performance on your specific task. Additionally, you could try fine-tuning the model on your own dataset to adapt it to your domain-specific needs.
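
As a hedged sketch of the pooling experiment: sentence-transformers lets you assemble a model from modular components and choose the pooling mode explicitly. Note this sketch loads only the transformer backbone and omits the published model's additional Dense projection down to 512 dimensions, so its outputs are 768-dimensional.

```python
from sentence_transformers import SentenceTransformer, models

# Load just the transformer backbone from the published checkpoint.
word_model = models.Transformer("sentence-transformers/distiluse-base-multilingual-cased-v2")

# Swap the default mean pooling for max pooling.
pooling = models.Pooling(
    word_model.get_word_embedding_dimension(),
    pooling_mode="max",
)

model = SentenceTransformer(modules=[word_model, pooling])
embeddings = model.encode(["A quick pooling experiment."])
print(embeddings.shape)  # (1, 768) without the model's Dense layer
```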

paraphrase-multilingual-MiniLM-L12-v2

sentence-transformers

Total Score

487

The paraphrase-multilingual-MiniLM-L12-v2 model is a sentence-transformers model that maps sentences and paragraphs to a 384-dimensional dense vector space. It can be used for tasks like clustering or semantic search. This model is similar to other sentence-transformers models like paraphrase-MiniLM-L6-v2, paraphrase-multilingual-mpnet-base-v2, and paraphrase-xlm-r-multilingual-v1, which also map text to dense vector representations.

Model inputs and outputs

Inputs

  • Text data, such as sentences or paragraphs

Outputs

  • A 384-dimensional vector representation of the input text

Capabilities

The paraphrase-multilingual-MiniLM-L12-v2 model can be used to generate vector representations of text that capture semantic information. These vector representations can then be used for tasks like clustering, semantic search, and other applications that require understanding the meaning of text. For example, you could use this model to find similar documents or articles based on their content, or to group together documents that discuss similar topics.

What can I use it for?

The paraphrase-multilingual-MiniLM-L12-v2 model can be used for a variety of natural language processing tasks, such as:

  • Information retrieval: Use the sentence embeddings to find similar documents or articles based on their content.
  • Text clustering: Group together documents that discuss similar topics by clustering the sentence embeddings.
  • Semantic search: Use the sentence embeddings to find relevant documents or articles based on the meaning of a query.

You could incorporate this model into applications like search engines, recommendation systems, or content management systems to improve the user experience and surface more relevant information.

Things to try

One interesting thing to try with this model is to use it to generate embeddings for longer passages of text, such as articles or book chapters. The model can handle input up to 256 word pieces, so you could try feeding in larger chunks of text and see how the resulting embeddings capture the overall meaning and themes. You could then use these embeddings for tasks like document similarity or topic modeling.

Another thing to try is to finetune the model on a specific domain or task, such as legal documents or medical literature. This could help the model better capture the specialized vocabulary and concepts in that domain, making it more useful for applications like search or knowledge management.
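
As a hedged sketch of the longer-passage idea: the snippet below checks the model's maximum sequence length (input beyond it is silently truncated) and compares two invented multi-sentence documents.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
print(model.max_seq_length)  # tokens beyond this limit are truncated

doc_a = (
    "Renewable energy adoption is accelerating worldwide. "
    "Solar installations doubled in several markets last year."
)
doc_b = (
    "Solar and wind power capacity keeps growing. "
    "Many countries are expanding their renewable energy programs."
)

embeddings = model.encode([doc_a, doc_b], convert_to_tensor=True)
print(util.cos_sim(embeddings[0], embeddings[1]).item())
```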

paraphrase-xlm-r-multilingual-v1

sentence-transformers

Total Score

57

The paraphrase-xlm-r-multilingual-v1 model is a part of the sentence-transformers suite of models. It was created by the sentence-transformers team. This model is a multilingual sentence and paragraph encoder that maps text to a 768-dimensional dense vector space. It can be used for tasks like clustering or semantic search across multiple languages. The model is based on the XLM-RoBERTa architecture and was trained on a large corpus of over 1 billion sentence pairs from diverse sources. Some similar models in the sentence-transformers collection include paraphrase-multilingual-mpnet-base-v2, paraphrase-MiniLM-L6-v2, all-mpnet-base-v2, and all-MiniLM-L12-v2.

Model inputs and outputs

Inputs

  • Text: The model takes in one or more sentences or paragraphs as input.

Outputs

  • Sentence embeddings: The model outputs a 768-dimensional dense vector for each input text. These sentence embeddings capture the semantics of the input and can be used for downstream tasks.

Capabilities

The paraphrase-xlm-r-multilingual-v1 model is capable of encoding text in multiple languages into a shared semantic vector space. This allows for cross-lingual applications like multilingual semantic search or clustering. The model performs well on a variety of semantic textual similarity benchmarks.

What can I use it for?

This model can be used for a variety of natural language processing tasks that require understanding the semantic meaning of text, such as:

  • Semantic search: Use the sentence embeddings to find relevant documents or passages for a given query, across languages.
  • Text clustering: Group similar text documents or paragraphs together based on their semantic similarity.
  • Paraphrase detection: Identify sentences that convey the same meaning using the similarity between their embeddings.
  • Multilingual applications: Leverage the cross-lingual capabilities to build applications that work across languages.

Things to try

One interesting aspect of this model is its ability to capture the semantics of text in a multilingual setting. You could try using it to build a cross-lingual semantic search engine, where users can query in their preferred language and retrieve relevant results in multiple languages. Another idea is to use the model's embeddings to cluster news articles or social media posts in different languages around common topics or events.
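
As a hedged sketch of the multilingual clustering idea: the library's community-detection helper groups items whose pairwise similarity exceeds a threshold; the posts and threshold below are invented for illustration.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/paraphrase-xlm-r-multilingual-v1")

posts = [
    "Earthquake strikes off the coast of Japan.",
    "Erdbeben vor der Küste Japans.",            # German
    "Tremblement de terre au large du Japon.",   # French
    "New smartphone model announced today.",
    "Nouveau smartphone annoncé aujourd'hui.",   # French
]

embeddings = model.encode(posts, convert_to_tensor=True)

# Returns lists of indices for posts whose similarity exceeds the threshold.
clusters = util.community_detection(embeddings, threshold=0.7, min_community_size=2)
print(clusters)
```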
