Kakaobrain

Rank:

Average Model Cost: $0.0000

Number of Runs: 22,288

Models by this creator

align-base

ALIGN (base model) is a dual-encoder model that aligns visual and text representations with contrastive learning. It pairs an EfficientNet vision encoder with a BERT text encoder, and is trained on COYO-700M, a large-scale noisy dataset of 700 million image-text pairs. The model is designed for zero-shot image classification and multi-modal embedding retrieval. It can be used with Transformers and is intended for research into zero-shot image classification and the potential impact of such models.
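A minimal sketch of zero-shot classification with this checkpoint through the Transformers library; the `AlignProcessor`/`AlignModel` classes and the `kakaobrain/align-base` checkpoint name follow the Hugging Face API, while the prompt template is an illustrative choice, not a requirement:

```python
# Sketch: zero-shot image classification with ALIGN via Transformers.
# Assumes the kakaobrain/align-base checkpoint is available; the prompt
# template below is an illustrative choice.
import torch
from transformers import AlignProcessor, AlignModel


def build_prompts(labels):
    """Turn candidate class names into natural-language prompts."""
    return [f"a photo of a {label}" for label in labels]


def classify(image, labels, checkpoint="kakaobrain/align-base"):
    """Score a PIL image against candidate labels, zero-shot."""
    processor = AlignProcessor.from_pretrained(checkpoint)
    model = AlignModel.from_pretrained(checkpoint)
    inputs = processor(text=build_prompts(labels), images=image,
                       return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image holds image-text similarity scores, one per label;
    # softmax turns them into a distribution over the candidate labels.
    probs = outputs.logits_per_image.softmax(dim=1)
    return dict(zip(labels, probs[0].tolist()))
```

The dual-encoder design means image and text embeddings can also be computed independently (via `get_image_features` / `get_text_features`) for retrieval use cases.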

Runs: 9.2K

Huggingface

karlo-v1-alpha-image-variations

Karlo is a text-conditional image generation model based on OpenAI's unCLIP architecture, with an improved super-resolution module that upscales 64px images to 256px while recovering high-frequency details in only a small number of denoising steps. Karlo is available in diffusers.

Karlo is composed of prior, decoder, and super-resolution modules. This repository includes the improved version of the standard super-resolution module, which upscales 64px to 256px in only 7 reverse steps, as illustrated in the figure below. Specifically, the standard SR module, trained with the DDPM objective, upscales 64px to 256px in the first 6 denoising steps using the respacing technique. An additional SR module, fine-tuned with a VQ-GAN-style loss, then performs the final reverse step to recover high-frequency details. We observe that this approach is very effective at upscaling low-resolution images in a small number of reverse steps.

We train all components from scratch on 115M image-text pairs, including COYO-100M, CC3M, and CC12M. For the prior and decoder, we use ViT-L/14 from OpenAI's CLIP repository. Unlike the original unCLIP implementation, we replace the trainable transformer in the decoder with the ViT-L/14 text encoder for efficiency. The SR module is first trained with the DDPM objective for 1M steps, followed by an additional 234K steps to fine-tune the extra component. The table below summarizes the important statistics of our components. In the checkpoint links, ViT-L-14 is equivalent to the original version, but we include it for convenience. We also note that ViT-L-14-stats is required to normalize the outputs of the prior module.

We quantitatively measure the performance of Karlo-v1.0.alpha on the validation splits of CC3M and MS-COCO; the table below presents CLIP score and FID. To measure FID, we resize the shorter side of each image to 256px and then crop at the center. We set the classifier-free guidance scales for the prior and decoder to 4 and 8 in all cases. We observe that our model achieves reasonable performance even with 25 decoder sampling steps. For more information, please refer to the upcoming technical report.

This alpha version of Karlo is trained on 115M image-text pairs, including a high-quality COYO-100M subset, CC3M, and CC12M. For those interested in a better version of Karlo trained on larger high-quality datasets, please visit the landing page of our application B^DISCOVER. If you find this repository useful in your research, please cite:
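Since this checkpoint is the image-variations variant, a minimal sketch of using it through diffusers' `UnCLIPImageVariationPipeline` might look as follows; the pipeline class and checkpoint name follow the diffusers API, while the device choice and `decoder_guidance_scale=8` (matching the decoder guidance reported above) are assumptions for illustration:

```python
# Sketch: image variations with Karlo via diffusers.
# Calling this function downloads several GB of weights, so the heavy
# import and model load are deferred until the function is invoked.
import torch


def generate_variations(image, num_variations=2, device="cuda"):
    """Return num_variations Karlo variations of a PIL image."""
    from diffusers import UnCLIPImageVariationPipeline

    pipe = UnCLIPImageVariationPipeline.from_pretrained(
        "kakaobrain/karlo-v1-alpha-image-variations",
        torch_dtype=torch.float16,
    ).to(device)
    # decoder_guidance_scale=8 mirrors the classifier-free guidance
    # scale reported for the decoder in the evaluation above.
    out = pipe(image=image, num_images_per_prompt=num_variations,
               decoder_guidance_scale=8.0)
    return out.images
```

For text-to-image generation (prior + decoder + SR), the sibling `kakaobrain/karlo-v1-alpha` checkpoint is used with `UnCLIPPipeline` instead, which exposes both prior and decoder guidance scales.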

Runs: 1.6K

Huggingface
