Maintainer: MahmoodLab



Last updated 4/29/2024


  • Model Link: View on HuggingFace
  • API Spec: View on HuggingFace
  • GitHub Link: no link provided
  • Paper Link: no link provided


Model Overview

The UNI model is a large pretrained vision encoder for histopathology, developed by the MahmoodLab at Harvard/BWH. It was trained on over 100 million tissue images drawn from more than 100,000 whole slide images, spanning neoplastic, infectious, inflammatory, and normal tissue types. UNI demonstrates state-of-the-art performance across 34 clinical tasks, with particularly strong results on rare and underrepresented cancer types.

Unlike many other histopathology models that rely on open datasets like TCGA, CPTAC, and PAIP, UNI was trained on internal, private data sources. This helps mitigate the risk of data contamination when evaluating or deploying UNI on public or private histopathology datasets. The model can be used as a strong vision backbone for a variety of downstream medical imaging tasks.

The vit-base-patch16-224-in21k model is a similar Vision Transformer (ViT) architecture pretrained on the broader ImageNet-21k dataset, while the BiomedCLIP-PubMedBERT_256-vit_base_patch16_224 model combines a ViT encoder with a PubMedBERT text encoder for biomedical vision-language tasks. The nsfw_image_detection model is a fine-tuned ViT for the specialized task of NSFW image classification.

Model Inputs and Outputs


Inputs

  • Histopathology images, either individual tiles or whole slide images

Outputs

  • Learned visual representations that can be used as input features for downstream medical imaging tasks such as classification, segmentation, or detection


Capabilities

The UNI model excels at extracting robust visual features from histopathology imagery, particularly in challenging domains like rare cancer types. Its strong performance across 34 clinical tasks demonstrates its versatility and suitability as a general-purpose vision backbone for medical applications.

What Can I Use It For?

Researchers and practitioners in computational pathology can leverage the UNI model to build and evaluate a wide range of medical imaging models, without risk of data contamination on public benchmarks or private slide collections. The model can serve as a powerful feature extractor, providing high-quality visual representations as input to downstream classifiers, segmentation models, or other specialized medical imaging tasks.
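A common way to use a frozen encoder like UNI is to train a lightweight probe on its embeddings. The sketch below is illustrative only: it substitutes randomly generated vectors for real UNI tile embeddings and fits a simple nearest-centroid classifier, a minimal stand-in for the linear-probe-style evaluations used in computational pathology:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for UNI tile embeddings: 1024-dim features for two tissue classes.
# In practice these vectors would come from the frozen UNI encoder.
n_per_class, dim = 50, 1024
class0 = rng.normal(loc=0.0, scale=1.0, size=(n_per_class, dim))
class1 = rng.normal(loc=0.5, scale=1.0, size=(n_per_class, dim))

# "Train" a nearest-centroid probe on the frozen features.
centroids = np.stack([class0.mean(axis=0), class1.mean(axis=0)])

def predict(features: np.ndarray) -> np.ndarray:
    """Assign each feature vector to the nearest class centroid."""
    dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=-1)
    return dists.argmin(axis=1)

train = np.vstack([class0, class1])
labels = np.array([0] * n_per_class + [1] * n_per_class)
accuracy = (predict(train) == labels).mean()
print(f"probe accuracy: {accuracy:.2f}")
```

Because the encoder stays frozen, this kind of probe is cheap to fit and makes it easy to compare feature quality across backbones.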

Things to Try

One interesting avenue to explore would be fine-tuning the UNI model on specific disease domains or rare cancer types, to further enhance its performance in these critical areas. Researchers could also experiment with combining the UNI vision encoder with additional modalities, such as clinical metadata or genomic data, to develop even more robust and comprehensive medical AI systems.

This summary was produced with help from an AI and may contain inaccuracies; check out the links to read the original source documents!

Related Models






CONCH (CONtrastive learning from Captions for Histopathology) is a vision-language foundation model for histopathology developed by MahmoodLab. It demonstrates state-of-the-art performance across 14 computational pathology tasks, ranging from image classification to text-to-image retrieval and tissue segmentation. Unlike models trained on large public histology slide collections, CONCH avoids potential data contamination, making it suitable for building and evaluating pathology AI models with minimal risk.

Model inputs and outputs

CONCH is a versatile model that can handle both histopathology images and text.

Inputs

  • Histopathology images: images from different staining techniques, such as H&E, IHC, and special stains
  • Text: captions or clinical notes relevant to the histopathology images

Outputs

  • Image classification: categorizes histopathology images by disease type or tissue type
  • Text-to-image retrieval: retrieves relevant histopathology images for a textual query
  • Image-to-text retrieval: generates relevant text descriptions for a given histopathology image
  • Tissue segmentation: delineates different tissue regions within a histopathology image

Capabilities

CONCH is a powerful model that can be leveraged for a wide range of computational pathology tasks. Its pretraining on a large histopathology-specific dataset, combined with its state-of-the-art performance, makes it a valuable tool for researchers and clinicians working in digital pathology.

What can I use it for?

Researchers and clinicians in computational pathology can use CONCH for a variety of applications, such as:

  • Developing and evaluating pathology AI models: because CONCH was not trained on large public histology slide collections, it can be used to build and evaluate pathology AI models without the risk of data contamination
  • Automating image analysis and reporting: its image classification, tissue segmentation, and text generation capabilities can automate aspects of histopathology analysis and reporting
  • Facilitating research and collaboration: by providing a strong foundation for computational pathology tasks, CONCH can accelerate research and enable more effective collaboration between researchers and clinicians

Things to try

One interesting aspect of CONCH is its ability to process non-H&E stained images, such as IHCs and special stains. Researchers can compare the model's performance across staining techniques and investigate its versatility across histopathology imaging modalities. The model's text-to-image and image-to-text retrieval capabilities can also be used to explore the relationship between histopathology images and their associated textual descriptions, potentially leading to new insights in digital pathology.







The dino-vitb16 model is a Vision Transformer (ViT) trained using the DINO self-supervised learning method. Like other ViT models, it takes images as input, divides them into a sequence of fixed-size patches, and linearly embeds the patches before processing them with transformer encoder layers. The DINO training approach lets the model learn an effective inner representation of images without labeled data, making it a versatile foundation for a variety of downstream tasks.

In contrast to the vit-base-patch16-224-in21k and vit-base-patch16-224 models, which were pre-trained on ImageNet-21k in a supervised manner, dino-vitb16 was trained with the self-supervised DINO approach on a large collection of unlabeled images. This allows it to learn visual features in a more general, open-ended way, without being constrained to the specific classes and labels of ImageNet. The nsfw_image_detection model is another ViT-based model, but one fine-tuned for the specialized task of classifying images as "normal" or "NSFW" (not safe for work), demonstrating how the general capabilities of ViT models can be adapted to narrower use cases through further training.

Model inputs and outputs

Inputs

  • Images: divided into a sequence of 16x16 pixel patches and linearly embedded

Outputs

  • Image features: a set of feature representations for the input image, usable for downstream tasks like image classification, object detection, and more

Capabilities

The dino-vitb16 model is a powerful general-purpose image feature extractor, capable of capturing rich visual representations from input images. Unlike models trained solely on labeled datasets like ImageNet, the DINO training approach allows it to learn more versatile and transferable visual features. This makes it well-suited for a wide range of computer vision tasks, from image classification and object detection to image retrieval and visual reasoning. The learned representations can be fine-tuned or used directly as features for more specialized models.

What can I use it for?

You can use dino-vitb16 as a pre-trained feature extractor for your own image-based machine learning projects. By leveraging its general-purpose visual representations, you can build computer vision systems with less labeled data and fewer computational resources. For example, you could fine-tune the model on a smaller labeled dataset for image classification, use the features as input to an object detection or segmentation model, or apply it to image retrieval, where you need to find similar images in a large database.

Things to try

Because dino-vitb16 learns visual features in a self-supervised manner, it may generalize well to visual domains and tasks beyond those seen during pre-training. You could try fine-tuning the model on datasets very different from its pre-training distribution, such as medical images, satellite imagery, or artistic depictions, and observe how its performance and learned representations transfer. You could also use it as a feature extractor for multi-modal tasks such as image-text retrieval or visual question answering, where its visual representations can complement text-based features.
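The patch-embedding step described above (splitting an image into fixed 16x16 patches before the learned linear projection) can be sketched in a few lines. This toy NumPy version shows only the patchification; the projection and transformer layers are omitted:

```python
import numpy as np

def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into a sequence of flattened patch vectors,
    mirroring the first step of a ViT like dino-vitb16."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    patches = image.reshape(h // patch, patch, w // patch, patch, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)
    return patches

image = np.zeros((224, 224, 3))
tokens = patchify(image)
print(tokens.shape)  # (196, 768): 14 x 14 patches, each 16*16*3 values
```

In the real model, each 768-dimensional patch vector is then linearly projected and a [CLS] token and position embeddings are added before the transformer encoder runs.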







The dinov2-base model is a Vision Transformer (ViT) trained using the DINOv2 self-supervised learning method, developed by researchers at Facebook. DINOv2 lets the model learn robust visual features without direct supervision by pre-training on a large collection of unlabeled images. Like dino-vitb16, it is self-supervised; this contrasts with models such as vit-base-patch16-224-in21k, which were trained in a supervised fashion on labeled ImageNet data.

Model inputs and outputs

The dinov2-base model takes images as input and outputs a sequence of hidden feature representations. These features can be used for a variety of downstream computer vision tasks, such as image classification, object detection, or visual question answering.

Inputs

  • Images: divided into a sequence of fixed-size patches and linearly embedded

Outputs

  • Image feature representations: a sequence of hidden features, one per patch in the input image, usable for further processing in downstream tasks

Capabilities

The dinov2-base model is a powerful pre-trained vision model that can serve as a feature extractor for a wide range of computer vision applications. Because it was trained in a self-supervised manner on a large image dataset, it has learned robust visual representations that transfer effectively to various tasks, even with limited labeled data.

What can I use it for?

You can use dinov2-base for feature extraction in your computer vision projects. By feeding images through the model and extracting the final hidden representations, you can leverage its visual understanding for tasks like image classification, object detection, and visual question answering. This is particularly useful when you have a small dataset and want to build on the model's pre-trained knowledge.

Things to try

The model's self-supervised pre-training avoids the need for expensive manual labeling. You could experiment with fine-tuning it on your own dataset, or use the pre-trained features as input to a custom downstream model. You could also compare dinov2-base against other self-supervised and supervised vision models, such as dino-vitb16 and vit-base-patch16-224-in21k, to see how the different pre-training approaches affect performance on your specific task.







BiomedCLIP-PubMedBERT_256-vit_base_patch16_224 is a biomedical vision-language foundation model developed by researchers at Microsoft. It was pre-trained with contrastive learning on the PMC-15M dataset, a collection of 15 million figure-caption pairs from biomedical research articles in PubMed Central. The model uses PubMedBERT as the text encoder and a Vision Transformer as the image encoder, with domain-specific adaptations. Similar models include CLIP, a general-purpose vision-language model trained on a large web corpus; Bio_ClinicalBERT, a biomedical language model trained on clinical notes; and BioGPT, a generative biomedical language model.

Model Inputs and Outputs

The model takes two inputs, an image and a text caption. It encodes the image with the Vision Transformer and the text with PubMedBERT, then computes the similarity between the two representations.

Inputs

  • Image: a biomedical image
  • Text: a caption or description of the biomedical image

Outputs

  • Similarity score: a scalar value representing the similarity between the image and text inputs

Capabilities

BiomedCLIP-PubMedBERT_256-vit_base_patch16_224 excels at a variety of biomedical vision-language tasks, including cross-modal retrieval, image classification, and visual question answering, establishing new state-of-the-art performance on several standard benchmarks and substantially outperforming prior approaches.

What Can I Use It For?

Researchers can use the model to build novel applications in the biomedical domain that combine vision and language understanding. Some potential use cases include:

  • Biomedical image search and retrieval
  • Automatic captioning of biomedical images
  • Visual question answering on medical topics
  • Multimodal analysis of biomedical literature

The model is intended for research purposes only and should not be deployed in production systems; the maintainers caution that its generation capabilities in particular are not suitable for production use.

Things to Try

One strength of the model is cross-modal retrieval: finding relevant biomedical images for a given text query, or vice versa. Its aligned visual and textual representations could also be leveraged for tasks like medical image captioning or visual question answering. Another promising direction is fine-tuning the model on specialized downstream tasks, building on its rich biomedical pre-training to unlock capabilities tailored to particular medical domains or applications.
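The similarity scoring behind this kind of cross-modal retrieval reduces to cosine similarity between L2-normalized embeddings. The sketch below mimics CLIP-style text-to-image retrieval with random stand-in vectors (a 512-dimensional embedding space is assumed here), not actual BiomedCLIP outputs:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

rng = np.random.default_rng(1)

# Stand-ins for encoder outputs: three "images" and their "captions",
# where each caption embedding is a slightly perturbed copy of its image.
image_embeds = rng.normal(size=(3, 512))
text_embeds = image_embeds + 0.1 * rng.normal(size=(3, 512))

sim = cosine_similarity(text_embeds, image_embeds)
best_match = sim.argmax(axis=1)  # retrieve the most similar image per caption
print(best_match)
```

With a well-trained contrastive model, matching image-text pairs score near the top of each row, which is what makes the argmax retrieval above work.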
