[](#darkbert)DarkBERT
=====================

A BERT-like model pretrained with a Dark Web corpus as described in "DarkBERT: A Language Model for the Dark Side of the Internet (ACL 2023)"

[](#conditions)Conditions
=========================

DarkBERT is available for access upon request. Users may submit their request using the form below, which includes the **name of the user**, the **users institution**, the **users email address that matches the institution** (we especially emphasize this part; any non-academic addresses such as gmail, tutanota, protonmail, etc. are automatically rejected as it makes it difficult for us to verify your affiliation to the institution) and the **purpose of usage**. By requesting and downloading DarkBERT, the user agrees to the following: the user acknowledges that the use of this model is restricted to research and/or academic purposes only. Access to the model will be granted after the request is manually reviewed. A request may be declined if it does not sufficiently describe research purposes that follow the ACM Code of Ethics ([https://www.acm.org/code-of-ethics](https://www.acm.org/code-of-ethics)). The information provided by the requesting user will not be used in any way except for sending the dataset to the user and keeping track of request history for DarkBERT. By requesting for the model, the user agrees to our collection of the provided information. This model shall only be used for non-profit research purposes and in a manner consistent with fair practice. Do not redistribute this dataset to others. The user should indicate the source of this model (found at the bottom of the page) when using or citing the model in their research or article.

[](#what-is-included)What is included?
--------------------------------------

The preprocessed version of DarkBERT.

Benchmark datasets in the `benchmark-dataset` directory.

[](#sample-usage)Sample Usage
-----------------------------

    >>> from transformers import pipeline
    >>> folder_dir = "DarkBERT"
    >>> unmasker = pipeline('fill-mask', model=folder_dir)
    >>> unmasker("RagnarLocker, LockBit, and REvil are types of <mask>.")
    
    [{'score': 0.4952353239059448, 'token': 25346, 'token_str': ' ransomware', 'sequence': 'RagnarLocker, LockBit, and REvil are types of ransomware.'},
    {'score': 0.04661545157432556, 'token': 16886, 'token_str': ' malware', 'sequence': 'RagnarLocker, LockBit, and REvil are types of malware.'},
    {'score': 0.04217657446861267, 'token': 28811, 'token_str': ' wallets', 'sequence': 'RagnarLocker, LockBit, and REvil are types of wallets.'},
    {'score': 0.028982503339648247, 'token': 2196, 'token_str': ' drugs', 'sequence': 'RagnarLocker, LockBit, and REvil are types of drugs.'},
    {'score': 0.020001502707600594, 'token': 11344, 'token_str': ' hackers', 'sequence': 'RagnarLocker, LockBit, and REvil are types of hackers.'}]
    
    >>> from transformers import AutoModel, AutoTokenizer
    >>> model = AutoModel.from_pretrained(folder_dir)
    >>> tokenizer = AutoTokenizer.from_pretrained(folder_dir)
    >>> text = "Recent research has suggested that there are clear differences in the language used in the Dark Web compared to that of the Surface Web."
    >>> encoded = tokenizer(text, return_tensors="pt")
    >>> output = model(**encoded)
    >>> output[0].shape
    
    torch.Size([1, 27, 768])
    

[](#citation)Citation
---------------------

If you are using the DarkBERT model, please cite the following paper accordingly:

    Youngjin Jin, Eugene Jang, Jian Cui, Jin-Woo Chung, Yongjae Lee, and Seungwon Shin. 2023. DarkBERT: A Language Model for the Dark Side of the Internet. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 75157533, Toronto, Canada. Association for Computational Linguistics.

## Model overview

`DarkBERT` is a BERT-like language model that has been pretrained on a corpus of dark web data, as described in the research paper "DarkBERT: A Language Model for the Dark Side of the Internet (ACL 2023)". It was developed by the organization [s2w-ai](https://aimodels.fyi/creators/huggingFace/s2w-ai). This model differs from standard BERT models in that it has been exposed to a dataset focused on the darker corners of the internet, potentially giving it unique capabilities for understanding and processing that type of content.

The `DarkBERT` model shares similarities with other well-known BERT-based models like [BERT-large, uncased, whole-word masking](https://aimodels.fyi/models/huggingFace/bert-large-uncased-whole-word-masking-finetuned-squad-google-bert), [BERT-base, uncased](https://aimodels.fyi/models/huggingFace/bert-base-uncased-google-bert), [BERT-base, cased](https://aimodels.fyi/models/huggingFace/bert-base-cased-google-bert), and [DistilBERT-base, uncased](https://aimodels.fyi/models/huggingFace/distilbert-base-uncased-distilbert). Like these models, `DarkBERT` uses a masked language modeling (MLM) objective during pretraining, which allows it to learn rich contextual representations of text.

## Model inputs and outputs

### Inputs
- Text sequences of up to 512 tokens

### Outputs
- Predicted tokens to fill masked positions in the input text
- Confidence scores for each predicted token

## Capabilities

The `DarkBERT` model has been specifically trained on a dark web corpus, meaning it may have unique capabilities for understanding and processing content related to cybercrime, underground marketplaces, and other illicit activities found on the dark web. This could make it useful for tasks like detecting and analyzing mentions of specific dark web entities, understanding the sentiment and intent behind dark web-related communications, or identifying potential threats or illegal activities.

## What can I use it for?

The `DarkBERT` model could be a valuable tool for researchers, security professionals, and law enforcement agencies working to better understand and combat dark web-related activities. It could be used to aid in the analysis of dark web forum posts, dark web marketplace listings, and other dark web-related text data. Additionally, the model could be fine-tuned for specific tasks like named entity recognition, relation extraction, or text classification to further enhance its capabilities in this domain.

## Things to try

One interesting thing to try with `DarkBERT` would be to compare its performance on dark web-related tasks to that of standard BERT models. This could help shed light on the unique insights the model has gained from its specialized pretraining. You could also experiment with fine-tuning `DarkBERT` on different dark web-related datasets or tasks to further explore its capabilities.