Naver-clova-ix
Models by this creator
🔍
donut-base-finetuned-docvqa
166
The donut-base-finetuned-docvqa model is a Donut model that has been fine-tuned on the DocVQA dataset. Donut consists of a vision encoder (Swin Transformer) and a text decoder (BART), allowing it to generate text conditioned on an input image. This fine-tuned version is specialized for document visual question answering tasks. The Donut model was originally introduced in the paper "OCR-free Document Understanding Transformer" by researchers from Naver's Clova AI division. This particular fine-tuned version was released by the same team, though they did not provide an official model card.

Model inputs and outputs

Inputs
- **Image**: The Donut model takes an image as input, which it encodes using the Swin Transformer vision encoder.

Outputs
- **Text**: The model generates text autoregressively using the BART decoder, producing answers or summaries conditioned on the input image.

Capabilities

The donut-base-finetuned-docvqa model is capable of understanding and answering questions about document images, without the need for optical character recognition (OCR). This can be useful for tasks like extracting information from invoices, forms, or other complex document layouts.

What can I use it for?

You can use the donut-base-finetuned-docvqa model for document visual question answering tasks, where the goal is to answer questions about the content and layout of document images. This could be helpful for automating information extraction from business documents, scientific papers, or other types of structured text. To see other fine-tuned versions of the Donut model, you can check the Hugging Face model hub.

Things to try

One key aspect of the Donut model is its ability to understand document layouts and visually complex inputs without relying on OCR. This could be useful for tasks where the document structure or formatting is an important part of the information to be extracted. You could try using the donut-base-finetuned-docvqa model to answer questions about the overall structure and content of documents, rather than just extracting specific pieces of text.
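Below is a minimal inference sketch using the Hugging Face transformers library, assuming the `<s_docvqa><s_question>...</s_question><s_answer>` prompt convention used by Donut's DocVQA fine-tunes; the image path and the example question are placeholders, not part of the original description.

```python
import re
import torch
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa")

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

image = Image.open("invoice.png").convert("RGB")  # hypothetical document image
pixel_values = processor(image, return_tensors="pt").pixel_values

# The DocVQA checkpoint expects the question embedded in a task prompt
question = "What is the invoice number?"
task_prompt = f"<s_docvqa><s_question>{question}</s_question><s_answer>"
decoder_input_ids = processor.tokenizer(
    task_prompt, add_special_tokens=False, return_tensors="pt"
).input_ids

outputs = model.generate(
    pixel_values.to(device),
    decoder_input_ids=decoder_input_ids.to(device),
    max_length=model.decoder.config.max_position_embeddings,
    pad_token_id=processor.tokenizer.pad_token_id,
    eos_token_id=processor.tokenizer.eos_token_id,
    use_cache=True,
    bad_words_ids=[[processor.tokenizer.unk_token_id]],
    return_dict_in_generate=True,
)

# Decode, strip special tokens and the task prompt, then convert to JSON
sequence = processor.batch_decode(outputs.sequences)[0]
sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()
print(processor.token2json(sequence))  # e.g. {"question": "...", "answer": "..."}
```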
Updated 5/28/2024
📊
donut-base
151
The donut-base model is a pre-trained Donut model. Donut is a document understanding transformer that does not require optical character recognition (OCR). It consists of a vision encoder (Swin Transformer) and a text decoder (BART). The model takes an image as input and generates corresponding text, without the need for OCR preprocessing. Similar models include the donut-base-finetuned-docvqa and donut-base-finetuned-cord-v2, which are fine-tuned versions of the base model on the DocVQA and CORD datasets, respectively. The nougat-base model is also based on the Donut architecture, but is trained for PDF-to-Markdown transcription.

Model inputs and outputs

Inputs
- **Image**: The model takes an image as input, which can be of any document or visual content.

Outputs
- **Text**: The model generates text that corresponds to the input image, without the need for OCR preprocessing.

Capabilities

The donut-base model is capable of understanding document images and generating relevant text without relying on OCR. This can be useful for tasks such as document image classification, parsing, or visual question answering, where the model can directly process the visual information without the need for an additional OCR step.

What can I use it for?

You can use the donut-base model as a starting point for fine-tuning on a specific document understanding task, such as those mentioned above. The model hub provides some examples of fine-tuned versions of the model, which you can explore to find one that suits your needs.

Things to try

One interesting thing to try with the donut-base model is to experiment with different input image sizes and resolutions. Since the model uses a Swin Transformer as the vision encoder, it may be able to handle a wide range of image sizes and still generate accurate text. You could also try using the model on a variety of document types, such as forms, invoices, or scientific papers, to see how it performs in different contexts.
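As a rough illustration of using donut-base as a fine-tuning starting point, the sketch below loads the checkpoint and registers task-specific special tokens; the token names (`<s_my_task>`, `<s_field>`) are purely hypothetical placeholders for whatever schema your downstream task needs.

```python
from transformers import DonutProcessor, VisionEncoderDecoderModel

# Load the pre-trained base checkpoint as a starting point for fine-tuning
processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base")

# Donut frames downstream tasks with special "task start" and field tokens;
# these names are illustrative, not an official schema
task_start_token = "<s_my_task>"
processor.tokenizer.add_tokens([task_start_token, "<s_field>", "</s_field>"])
model.decoder.resize_token_embeddings(len(processor.tokenizer))

# Tell the model which token begins decoding for this task
model.config.decoder_start_token_id = processor.tokenizer.convert_tokens_to_ids(task_start_token)
model.config.pad_token_id = processor.tokenizer.pad_token_id

# From here, pairs of pixel_values and target token ids from your dataset can be
# fed to a standard seq2seq training loop (e.g. Seq2SeqTrainer).
```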
Updated 5/27/2024
🏅
donut-base-finetuned-cord-v2
65
The donut-base-finetuned-cord-v2 model is a fine-tuned version of the Donut model, introduced in the paper "OCR-free Document Understanding Transformer" by Geewook Kim et al. The model consists of a Swin Transformer vision encoder and a BART text decoder, allowing it to perform document understanding tasks without requiring optical character recognition (OCR). This particular model has been fine-tuned on the CORD dataset, a receipt parsing dataset. It builds upon the capabilities of the base Donut model, which was pre-trained on a large corpus of document images and their corresponding text. Similar models include the nougat-latex-base model, which is fine-tuned for improved LaTeX code generation from images, and the TrOCR model, which is optimized for optical character recognition on printed text.

Model inputs and outputs

Inputs
- **Image**: The model takes an image as input, which it then encodes using the Swin Transformer vision encoder.

Outputs
- **Text**: The model generates text output, conditioned on the encoded image representation. This text output can be used for document understanding tasks such as information extraction, text summarization, or table recognition.

Capabilities

The donut-base-finetuned-cord-v2 model is capable of performing OCR-free document understanding tasks. It can extract and generate text from document images without requiring a separate OCR step. This can be particularly useful in scenarios where the document layout or formatting makes it challenging to apply traditional OCR techniques.

What can I use it for?

You can use the donut-base-finetuned-cord-v2 model for a variety of document understanding tasks, such as:

- **Information extraction**: Extracting key information (e.g., names, addresses, dates) from documents like invoices, contracts, or forms.
- **Text summarization**: Generating concise summaries of longer documents, such as research papers or legal documents.
- **Table recognition**: Identifying and extracting structured data from tables within document images.

The model's ability to perform these tasks without relying on OCR can make it particularly useful in scenarios where the document layout or formatting is complex or varied.

Things to try

One interesting aspect of the donut-base-finetuned-cord-v2 model is its potential to generalize beyond the CORD dataset it was fine-tuned on. You could experiment with using the model on other types of document images, such as financial reports, scientific papers, or government forms, to see how it performs. Additionally, you could explore fine-tuning the model further on your own dataset to tailor its performance to your specific use case.
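A short parsing sketch with transformers follows, assuming the `<s_cord-v2>` task start token that the CORD-v2 fine-tune is conventionally prompted with; the receipt image path is a placeholder.

```python
import re
import torch
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

image = Image.open("receipt.png").convert("RGB")  # hypothetical receipt image
pixel_values = processor(image, return_tensors="pt").pixel_values

# The CORD-v2 checkpoint is prompted with the dataset's task start token
decoder_input_ids = processor.tokenizer(
    "<s_cord-v2>", add_special_tokens=False, return_tensors="pt"
).input_ids

outputs = model.generate(
    pixel_values.to(device),
    decoder_input_ids=decoder_input_ids.to(device),
    max_length=model.decoder.config.max_position_embeddings,
    pad_token_id=processor.tokenizer.pad_token_id,
    eos_token_id=processor.tokenizer.eos_token_id,
    use_cache=True,
    bad_words_ids=[[processor.tokenizer.unk_token_id]],
    return_dict_in_generate=True,
)

sequence = processor.batch_decode(outputs.sequences)[0]
sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()
print(processor.token2json(sequence))  # nested fields such as line items and totals
```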
Updated 5/28/2024