Shi-labs
Rank: -
Average Model Cost: $0.0000
Number of Runs: 35,011
Models by this creator
oneformer_ade20k_swin_tiny
OneFormer is a multi-task universal image segmentation model trained on the ADE20k dataset using the Swin backbone. It is designed to perform semantic, instance, and panoptic segmentation tasks. The model utilizes a task token to condition the model on the task at hand, enabling it to perform well on different segmentation tasks using a single architecture. This checkpoint provides a tiny-sized version of the OneFormer model.
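A minimal usage sketch with the Hugging Face Transformers OneFormer classes, assuming the shi-labs/oneformer_ade20k_swin_tiny repo id and an arbitrary sample image URL; here the task token selects semantic segmentation:

```python
import requests
from PIL import Image
from transformers import OneFormerProcessor, OneFormerForUniversalSegmentation

# Any RGB image works; this COCO validation image is just a placeholder
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Repo id assumed to be shi-labs/oneformer_ade20k_swin_tiny
processor = OneFormerProcessor.from_pretrained("shi-labs/oneformer_ade20k_swin_tiny")
model = OneFormerForUniversalSegmentation.from_pretrained("shi-labs/oneformer_ade20k_swin_tiny")

# The task token ("semantic", "instance", or "panoptic") conditions the single model on the task
inputs = processor(images=image, task_inputs=["semantic"], return_tensors="pt")
outputs = model(**inputs)

# Post-process to a per-pixel class map at the original image resolution
semantic_map = processor.post_process_semantic_segmentation(
    outputs, target_sizes=[image.size[::-1]]
)[0]
print(semantic_map.shape)
```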
$-/run
17.4K
Huggingface
oneformer_ade20k_swin_large
OneFormer is a multi-task universal image segmentation framework that outperforms existing models in semantic, instance, and panoptic segmentation tasks. It uses a single model with a universal architecture and a task token to condition the model on the specific task. This particular model is trained on the ADE20k dataset with a large-sized version of the Swin backbone. It can be used for semantic, instance, and panoptic segmentation tasks. Other fine-tuned versions on different datasets are also available.
$-/run
6.6K
Huggingface
oneformer_coco_swin_large
OneFormer model trained on the COCO dataset (large-sized version, Swin backbone). It was introduced in the paper OneFormer: One Transformer to Rule Universal Image Segmentation by Jain et al. and first released in this repository.
Model description: OneFormer is the first multi-task universal image segmentation framework. It needs to be trained only once with a single universal architecture, a single model, and on a single dataset, to outperform existing specialized models across semantic, instance, and panoptic segmentation tasks. OneFormer uses a task token to condition the model on the task in focus, making the architecture task-guided for training, and task-dynamic for inference, all with a single model.
Intended uses & limitations: You can use this particular checkpoint for semantic, instance and panoptic segmentation. See the model hub to look for other fine-tuned versions on a different dataset.
How to use: Here is how to use this model (see the sketch below); for more examples, please refer to the documentation.
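The card's original snippet is not reproduced here; the following is a minimal sketch, assuming the shi-labs/oneformer_coco_swin_large repo id, showing panoptic segmentation (the task token can be swapped for "semantic" or "instance"):

```python
import requests
from PIL import Image
from transformers import OneFormerProcessor, OneFormerForUniversalSegmentation

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Repo id assumed to be shi-labs/oneformer_coco_swin_large
processor = OneFormerProcessor.from_pretrained("shi-labs/oneformer_coco_swin_large")
model = OneFormerForUniversalSegmentation.from_pretrained("shi-labs/oneformer_coco_swin_large")

# Panoptic segmentation via the task token
inputs = processor(images=image, task_inputs=["panoptic"], return_tensors="pt")
outputs = model(**inputs)

# Returns a segmentation map plus per-segment metadata
panoptic = processor.post_process_panoptic_segmentation(
    outputs, target_sizes=[image.size[::-1]]
)[0]
print(panoptic["segmentation"].shape, len(panoptic["segments_info"]))
```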
$-/run
4.4K
Huggingface
oneformer_cityscapes_swin_large
OneFormer model trained on the Cityscapes dataset (large-sized version, Swin backbone). It was introduced in the paper OneFormer: One Transformer to Rule Universal Image Segmentation by Jain et al. and first released in this repository.
Model description: OneFormer is the first multi-task universal image segmentation framework. It needs to be trained only once with a single universal architecture, a single model, and on a single dataset, to outperform existing specialized models across semantic, instance, and panoptic segmentation tasks. OneFormer uses a task token to condition the model on the task in focus, making the architecture task-guided for training, and task-dynamic for inference, all with a single model.
Intended uses & limitations: You can use this particular checkpoint for semantic, instance and panoptic segmentation. See the model hub to look for other fine-tuned versions on a different dataset.
How to use: Here is how to use this model (see the sketch below); for more examples, please refer to the documentation.
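As above, a minimal sketch assuming the shi-labs/oneformer_cityscapes_swin_large repo id, this time running instance segmentation:

```python
import requests
from PIL import Image
from transformers import OneFormerProcessor, OneFormerForUniversalSegmentation

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = OneFormerProcessor.from_pretrained("shi-labs/oneformer_cityscapes_swin_large")
model = OneFormerForUniversalSegmentation.from_pretrained("shi-labs/oneformer_cityscapes_swin_large")

# Instance segmentation via the task token
inputs = processor(images=image, task_inputs=["instance"], return_tensors="pt")
outputs = model(**inputs)

# Returns a segmentation map plus per-instance metadata
instance = processor.post_process_instance_segmentation(
    outputs, target_sizes=[image.size[::-1]]
)[0]
print(instance["segmentation"].shape, len(instance["segments_info"]))
```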
$-/run
3.6K
Huggingface
versatile-diffusion
Versatile Diffusion V1.0 Model Card
We built Versatile Diffusion (VD), the first unified multi-flow multimodal diffusion framework, as a step towards Universal Generative AI. Versatile Diffusion natively supports image-to-text, image-variation, text-to-image, and text-variation, and can be further extended to other applications such as semantic-style disentanglement, image-text dual-guided generation, latent image-to-text-to-image editing, and more. Future versions will support more modalities such as speech, music, video, and 3D. Resources for more information: GitHub, arXiv.
Model Details: One single flow of Versatile Diffusion contains a VAE, a diffuser, and a context encoder, and thus handles one task (e.g., text-to-image) under one data type (e.g., image) and one context type (e.g., text). The multi-flow structure of Versatile Diffusion is illustrated in a diagram in the original model card. Developed by: Xingqian Xu, Atlas Wang, Eric Zhang, Kai Wang, and Humphrey Shi. Model type: diffusion-based multimodal generation model. Language(s): English. License: MIT. Resources for more information: GitHub Repository, Paper.
Usage: You can use the model both with the 🧨 Diffusers library and with the SHI-Labs Versatile Diffusion codebase. Diffusers lets you choose between a unified pipeline and more memory-efficient, task-specific pipelines; make sure to install transformers from "main" in order to use this model. To use Versatile Diffusion for all tasks, it is recommended to use the VersatileDiffusionPipeline. The task-specific pipelines load only the weights that are needed onto the GPU; you can find all task-specific pipelines in the documentation. To use the original GitHub repository instead, follow the instructions there.
Cautions, Biases, and Content Acknowledgment: We would like to raise users' awareness of the potential issues and concerns with this demo. Like previous large foundation models, Versatile Diffusion can be problematic in some cases, partly due to imperfect training data and pretrained networks (VAEs / context encoders) with limited scope. In its future research phase, VD may do better on tasks such as text-to-image and image-to-text with the help of more powerful VAEs, more sophisticated network designs, and cleaner data. So far, we have kept all features available for research testing, both to show the great potential of the VD framework and to collect important feedback for improving the model in the future. We welcome researchers and users to report issues via the Hugging Face community discussion feature or by emailing the authors. Be aware that VD may output content that reinforces or exacerbates societal biases, as well as realistic faces, pornography, and violence. VD was trained on the LAION-2B dataset, which scraped non-curated online images and text; although we removed illegal content, unintended material may remain. VD in this demo is meant only for research purposes.
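A minimal text-to-image sketch with the Diffusers VersatileDiffusionPipeline named in the card; the shi-labs/versatile-diffusion repo id, the prompt, and the use of a CUDA device are assumptions:

```python
import torch
from diffusers import VersatileDiffusionPipeline

# Unified pipeline; the task-specific pipelines use less GPU memory
pipe = VersatileDiffusionPipeline.from_pretrained(
    "shi-labs/versatile-diffusion", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

generator = torch.Generator(device="cuda").manual_seed(0)
# Hypothetical prompt, for illustration only
image = pipe.text_to_image(
    "a red panda sitting in a bamboo forest", generator=generator
).images[0]
image.save("red_panda.png")
```

The same pipeline object also exposes image-variation and dual-guided generation; the task-specific pipelines cover the same tasks while loading only the weights they need.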
$-/run
1.6K
Huggingface
nat-mini-in1k-224
NAT (mini variant): NAT-Mini trained on ImageNet-1K at 224x224 resolution. It was introduced in the paper Neighborhood Attention Transformer by Hassani et al. and first released in this repository.
Model description: NAT is a hierarchical vision transformer based on Neighborhood Attention (NA). Neighborhood Attention is a restricted self-attention pattern in which each token's receptive field is limited to its nearest neighboring pixels. NA is a sliding-window attention pattern, and as a result is highly flexible and maintains translational equivariance. NA is implemented in PyTorch through its extension, NATTEN.
Intended uses & limitations: You can use the raw model for image classification. See the model hub to look for fine-tuned versions on a task that interests you.
Example: Here is how to use this model to classify an image from the COCO 2017 dataset into one of the 1,000 ImageNet classes (see the sketch below). For more examples, please refer to the documentation.
Requirements: Other than transformers, this model requires the NATTEN package. If you're on Linux, you can refer to shi-labs.com/natten for instructions on installing with pre-compiled binaries (just select your torch build to get the correct wheel URL). You can alternatively use pip install natten to compile on your device, which may take up to a few minutes. Mac users only have the latter option (no pre-compiled binaries). Refer to NATTEN's GitHub for more information.
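A minimal classification sketch, assuming the shi-labs/nat-mini-in1k-224 repo id and that NATTEN is installed as described above:

```python
import requests
from PIL import Image
from transformers import AutoImageProcessor, NatForImageClassification

# Sample image from the COCO 2017 validation set
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained("shi-labs/nat-mini-in1k-224")
model = NatForImageClassification.from_pretrained("shi-labs/nat-mini-in1k-224")

inputs = processor(images=image, return_tensors="pt")
logits = model(**inputs).logits

# Map the top logit to one of the 1,000 ImageNet class labels
predicted_class = logits.argmax(-1).item()
print(model.config.id2label[predicted_class])
```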
$-/run
344
Huggingface
oneformer_ade20k_dinat_large
OneFormer model trained on the ADE20k dataset (large-sized version, Dinat backbone). It was introduced in the paper OneFormer: One Transformer to Rule Universal Image Segmentation by Jain et al. and first released in this repository.
Model description: OneFormer is the first multi-task universal image segmentation framework. It needs to be trained only once with a single universal architecture, a single model, and on a single dataset, to outperform existing specialized models across semantic, instance, and panoptic segmentation tasks. OneFormer uses a task token to condition the model on the task in focus, making the architecture task-guided for training, and task-dynamic for inference, all with a single model.
Intended uses & limitations: You can use this particular checkpoint for semantic, instance and panoptic segmentation. See the model hub to look for other fine-tuned versions on a different dataset.
How to use: For more examples, please refer to the documentation.
$-/run
339
Huggingface
oneformer_coco_dinat_large
OneFormer model trained on the COCO dataset (large-sized version, Dinat backbone). It was introduced in the paper OneFormer: One Transformer to Rule Universal Image Segmentation by Jain et al. and first released in this repository.
Model description: OneFormer is the first multi-task universal image segmentation framework. It needs to be trained only once with a single universal architecture, a single model, and on a single dataset, to outperform existing specialized models across semantic, instance, and panoptic segmentation tasks. OneFormer uses a task token to condition the model on the task in focus, making the architecture task-guided for training, and task-dynamic for inference, all with a single model.
Intended uses & limitations: You can use this particular checkpoint for semantic, instance and panoptic segmentation. See the model hub to look for other fine-tuned versions on a different dataset.
How to use: For more examples, please refer to the documentation.
$-/run
252
Huggingface
oneformer_cityscapes_dinat_large
OneFormer model trained on the Cityscapes dataset (large-sized version, Dinat backbone). It was introduced in the paper OneFormer: One Transformer to Rule Universal Image Segmentation by Jain et al. and first released in this repository.
Model description: OneFormer is the first multi-task universal image segmentation framework. It needs to be trained only once with a single universal architecture, a single model, and on a single dataset, to outperform existing specialized models across semantic, instance, and panoptic segmentation tasks. OneFormer uses a task token to condition the model on the task in focus, making the architecture task-guided for training, and task-dynamic for inference, all with a single model.
Intended uses & limitations: You can use this particular checkpoint for semantic, instance and panoptic segmentation. See the model hub to look for other fine-tuned versions on a different dataset.
How to use: For more examples, please refer to the documentation.
$-/run
243
Huggingface
dinat-mini-in1k-224
DiNAT (mini variant): DiNAT-Mini trained on ImageNet-1K at 224x224 resolution. It was introduced in the paper Dilated Neighborhood Attention Transformer by Hassani et al. and first released in this repository.
Model description: DiNAT is a hierarchical vision transformer based on Neighborhood Attention (NA) and its dilated variant (DiNA). Neighborhood Attention is a restricted self-attention pattern in which each token's receptive field is limited to its nearest neighboring pixels. NA and DiNA are therefore sliding-window attention patterns, and as a result are highly flexible and maintain translational equivariance. They come with PyTorch implementations through the NATTEN package.
Intended uses & limitations: You can use the raw model for image classification. See the model hub to look for fine-tuned versions on a task that interests you.
Example: Here is how to use this model to classify an image from the COCO 2017 dataset into one of the 1,000 ImageNet classes (see the sketch below). For more examples, please refer to the documentation.
Requirements: Other than transformers, this model requires the NATTEN package. If you're on Linux, you can refer to shi-labs.com/natten for instructions on installing with pre-compiled binaries (just select your torch build to get the correct wheel URL). You can alternatively use pip install natten to compile on your device, which may take up to a few minutes. Mac users only have the latter option (no pre-compiled binaries). Refer to NATTEN's GitHub for more information.
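Usage mirrors the NAT example above; a minimal sketch assuming the shi-labs/dinat-mini-in1k-224 repo id, with DinatForImageClassification in place of the NAT class:

```python
import requests
from PIL import Image
from transformers import AutoImageProcessor, DinatForImageClassification

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained("shi-labs/dinat-mini-in1k-224")
model = DinatForImageClassification.from_pretrained("shi-labs/dinat-mini-in1k-224")

inputs = processor(images=image, return_tensors="pt")
# Pick the highest-scoring ImageNet class
predicted_class = model(**inputs).logits.argmax(-1).item()
print(model.config.id2label[predicted_class])
```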
$-/run
234
Huggingface