[](#clinicalbert---bio--clinical-bert-model)ClinicalBERT - Bio + Clinical BERT Model
====================================================================================

The [Publicly Available Clinical BERT Embeddings](https://arxiv.org/abs/1904.03323) paper contains four unique clinicalBERT models: initialized with BERT-Base (`cased_L-12_H-768_A-12`) or BioBERT (`BioBERT-Base v1.0 + PubMed 200K + PMC 270K`) & trained on either all MIMIC notes or only discharge summaries.

This model card describes the Bio+Clinical BERT model, which was initialized from [BioBERT](https://arxiv.org/abs/1901.08746) & trained on all MIMIC notes.

[](#pretraining-data)Pretraining Data
-------------------------------------

The `Bio_ClinicalBERT` model was trained on all notes from [MIMIC III](https://www.nature.com/articles/sdata201635), a database containing electronic health records from ICU patients at the Beth Israel Hospital in Boston, MA. For more details on MIMIC, see [here](https://mimic.physionet.org/). All notes from the `NOTEEVENTS` table were included (~880M words).

[](#model-pretraining)Model Pretraining
---------------------------------------

### [](#note-preprocessing)Note Preprocessing

Each note in MIMIC was first split into sections using a rules-based section splitter (e.g. discharge summary notes were split into "History of Present Illness", "Family History", "Brief Hospital Course", etc. sections). Then each section was split into sentences using SciSpacy (`en core sci md` tokenizer).

### [](#pretraining-procedures)Pretraining Procedures

The model was trained using code from [Google's BERT repository](https://github.com/google-research/bert) on a GeForce GTX TITAN X 12 GB GPU. Model parameters were initialized with BioBERT (`BioBERT-Base v1.0 + PubMed 200K + PMC 270K`).

### [](#pretraining-hyperparameters)Pretraining Hyperparameters

We used a batch size of 32, a maximum sequence length of 128, and a learning rate of 5  105 for pre-training our models. The models trained on all MIMIC notes were trained for 150,000 steps. The dup factor for duplicating input data with different masks was set to 5. All other default parameters were used (specifically, masked language model probability = 0.15 and max predictions per sequence = 20).

[](#how-to-use-the-model)How to use the model
---------------------------------------------

Load the model via the transformers library:

    from transformers import AutoTokenizer, AutoModel
    tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
    model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
    

[](#more-information)More Information
-------------------------------------

Refer to the original paper, [Publicly Available Clinical BERT Embeddings](https://arxiv.org/abs/1904.03323) (NAACL Clinical NLP Workshop 2019) for additional details and performance on NLI and NER tasks.

[](#questions)Questions?
------------------------

Post a Github issue on the [clinicalBERT repo](https://github.com/EmilyAlsentzer/clinicalBERT) or email [emilya@mit.edu](mailto:emilya@mit.edu) with any questions.

## Model Overview

The `Bio_ClinicalBERT` model is a specialized language model trained on clinical notes from the MIMIC III dataset. It was initialized from the [BioBERT](https://arxiv.org/abs/1901.08746) model and further trained on the full set of MIMIC III notes, which contain over 880 million words of clinical text. This gives the model specialized knowledge and capabilities for working with biomedical and clinical language.

The `Bio_ClinicalBERT` model can be compared to similar models like [BioMedLM](https://aimodels.fyi/models/huggingFace/biomedlm-stanford-crfm), which was trained on biomedical literature, and the general [BERT-base](https://huggingface.co/bert-base-uncased) and [DistilBERT](https://aimodels.fyi/models/huggingFace/distilbert-base-uncased-distilbert) models, which have more general language understanding capabilities. By focusing the training on clinical notes, the `Bio_ClinicalBERT` model is able to better capture the nuances and specialized vocabulary of the medical domain.

## Model Inputs and Outputs

### Inputs
- Text data, such as clinical notes, research papers, or other biomedical/healthcare-related content

### Outputs
- Contextual embeddings that capture the meaning and relationships between words in the input text
- Predictions for various downstream tasks like named entity recognition, relation extraction, or text classification in the biomedical/clinical domain

## Capabilities

The `Bio_ClinicalBERT` model excels at understanding and processing text in the biomedical and clinical domains. It can be used for tasks like identifying medical entities, extracting relationships between clinical concepts, and classifying notes into different categories. The model's specialized training on the MIMIC III dataset gives it a strong grasp of medical terminology, abbreviations, and the structure of clinical documentation.

## What Can I Use It For?

The `Bio_ClinicalBERT` model can be a powerful tool for a variety of healthcare and biomedical applications. Some potential use cases include:

- Developing clinical decision support systems to assist medical professionals
- Automating the extraction of relevant information from electronic health records
- Improving the accuracy of medical text mining and knowledge discovery
- Building chatbots or virtual assistants to answer patient questions

By leveraging the specialized knowledge captured in the `Bio_ClinicalBERT` model, organizations can enhance their natural language processing capabilities for healthcare and life sciences applications.

## Things to Try

One interesting aspect of the `Bio_ClinicalBERT` model is its ability to handle long-form clinical notes. The model was trained on the full set of MIMIC III notes, which can be quite lengthy and contain a lot of domain-specific terminology and abbreviations. This makes it well-suited for tasks that require understanding the complete context of a clinical encounter, rather than just individual sentences or phrases.

Researchers and developers could explore using the `Bio_ClinicalBERT` model for tasks like summarizing patient histories, identifying key events in a clinical note, or detecting anomalies or potential issues that warrant further investigation by medical professionals.