donut-base-finetuned-docvqa

Maintainer: naver-clova-ix

Total Score: 166

Last updated: 5/28/2024


Run this model: Run on HuggingFace
API spec: View on HuggingFace
Github link: No Github link provided
Paper link: No paper link provided


Model overview

The donut-base-finetuned-docvqa model is a Donut model that has been fine-tuned on the DocVQA dataset. Donut consists of a vision encoder (Swin Transformer) and a text decoder (BART), allowing it to generate text conditioned on an input image. This fine-tuned version is specialized for document visual question answering tasks.

The Donut model was originally introduced in the paper OCR-free Document Understanding Transformer by researchers from Naver's Clova AI division. This particular fine-tuned version was released by the same team, though they did not provide an official model card.

Model inputs and outputs

Inputs

  • Image: The Donut model takes an image as input, which it encodes using the Swin Transformer vision encoder.

Outputs

  • Text: The model generates text autoregressively using the BART decoder, producing answers or summaries conditioned on the input image.

Capabilities

The donut-base-finetuned-docvqa model is capable of understanding and answering questions about document images, without the need for optical character recognition (OCR). This can be useful for tasks like extracting information from invoices, forms, or other complex document layouts.
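
As a concrete starting point, the sketch below shows one way to query the model with the Hugging Face transformers library (DonutProcessor and VisionEncoderDecoderModel). Treat it as a minimal, hedged example: the image path and the question are placeholders, and the task-prompt format follows the pattern used by Donut DocVQA fine-tunes, so double-check it against the model card for your transformers version.

```python
import re
import torch
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

# Load the image processor/tokenizer and the encoder-decoder model
processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa")

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Placeholder document image and question -- swap in your own
image = Image.open("invoice.png").convert("RGB")
question = "What is the invoice number?"

# DocVQA fine-tunes of Donut expect the question wrapped in a task prompt
prompt = f"<s_docvqa><s_question>{question}</s_question><s_answer>"
decoder_input_ids = processor.tokenizer(
    prompt, add_special_tokens=False, return_tensors="pt"
).input_ids

pixel_values = processor(image, return_tensors="pt").pixel_values

outputs = model.generate(
    pixel_values.to(device),
    decoder_input_ids=decoder_input_ids.to(device),
    max_length=model.decoder.config.max_position_embeddings,
    pad_token_id=processor.tokenizer.pad_token_id,
    eos_token_id=processor.tokenizer.eos_token_id,
    use_cache=True,
    bad_words_ids=[[processor.tokenizer.unk_token_id]],
    return_dict_in_generate=True,
)

# Strip special tokens and the task start token, then convert to JSON
sequence = processor.batch_decode(outputs.sequences)[0]
sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()
print(processor.token2json(sequence))  # e.g. {"question": "...", "answer": "..."}
```

The decoder is prompted with the question wrapped in special tokens, and processor.token2json converts the generated tag sequence back into a small dictionary containing the question and the predicted answer.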

What can I use it for?

You can use the donut-base-finetuned-docvqa model for document visual question answering tasks, where the goal is to answer questions about the content and layout of document images. This could be helpful for automating information extraction from business documents, scientific papers, or other types of structured text.

To see other fine-tuned versions of the Donut model, you can check the Hugging Face model hub.

Things to try

One key aspect of the Donut model is its ability to understand document layouts and visually complex inputs without relying on OCR. This could be useful for tasks where the document structure or formatting is an important part of the information to be extracted. You could try using the donut-base-finetuned-docvqa model to answer questions about the overall structure and content of documents, rather than just extracting specific pieces of text.



This summary was produced with help from an AI and may contain inaccuracies. Check out the links to read the original source documents!

Related Models


donut-base

Maintainer: naver-clova-ix

Total Score: 151

The donut-base model is a pre-trained Donut model. Donut is a document understanding transformer that does not require optical character recognition (OCR). It consists of a vision encoder (Swin Transformer) and a text decoder (BART). The model takes an image as input and generates corresponding text, without the need for OCR preprocessing. Similar models include the donut-base-finetuned-docvqa and donut-base-finetuned-cord-v2 models, which are fine-tuned versions of the base model on the DocVQA and CORD datasets, respectively. The nougat-base model is also based on the Donut architecture, but is trained for PDF-to-Markdown transcription.

Model inputs and outputs

Inputs

  • Image: The model takes an image as input, which can be of any document or visual content.

Outputs

  • Text: The model generates text that corresponds to the input image, without the need for OCR preprocessing.

Capabilities

The donut-base model is capable of understanding document images and generating relevant text without relying on OCR. This can be useful for tasks such as document image classification, parsing, or visual question answering, where the model can directly process the visual information without the need for an additional OCR step.

What can I use it for?

You can use the donut-base model as a starting point for fine-tuning on a specific document understanding task, such as those mentioned above. The model hub provides some examples of fine-tuned versions of the model, which you can explore to find one that suits your needs.

Things to try

One interesting thing to try with the donut-base model is to experiment with different input image sizes and resolutions. Since the model uses a Swin Transformer as the vision encoder, it may be able to handle a wide range of image sizes and still generate accurate text. You could also try using the model on a variety of document types, such as forms, invoices, or scientific papers, to see how it performs in different contexts.


donut-base-finetuned-cord-v2

Maintainer: naver-clova-ix

Total Score: 65

The donut-base-finetuned-cord-v2 model is a fine-tuned version of the Donut model, introduced in the paper "OCR-free Document Understanding Transformer" by Geewook Kim et al. The model consists of a Swin Transformer vision encoder and a BART text decoder, allowing it to perform document understanding tasks without requiring optical character recognition (OCR). This particular model has been fine-tuned on the CORD dataset, a document parsing dataset. It builds upon the capabilities of the base Donut model, which was pre-trained on a large corpus of document images and their corresponding text.

Similar models include the nougat-latex-base model, which is fine-tuned for improved LaTeX code generation from images, and the TrOCR model, which is optimized for optical character recognition of printed text.

Model inputs and outputs

Inputs

  • Image: The model takes an image as input, which it then encodes using the Swin Transformer vision encoder.

Outputs

  • Text: The model generates text output, conditioned on the encoded image representation. This text output can be used for document understanding tasks such as information extraction, text summarization, or table recognition.

Capabilities

The donut-base-finetuned-cord-v2 model is capable of performing OCR-free document understanding tasks. It can extract and generate text from document images without requiring a separate OCR step. This can be particularly useful in scenarios where the document layout or formatting makes it challenging to apply traditional OCR techniques.

What can I use it for?

You can use the donut-base-finetuned-cord-v2 model for a variety of document understanding tasks, such as:

  • Information extraction: Extracting key information (e.g., names, addresses, dates) from documents like invoices, contracts, or forms.
  • Text summarization: Generating concise summaries of longer documents, such as research papers or legal documents.
  • Table recognition: Identifying and extracting structured data from tables within document images.

The model's ability to perform these tasks without relying on OCR can make it particularly useful in scenarios where the document layout or formatting is complex or varied.

Things to try

One interesting aspect of the donut-base-finetuned-cord-v2 model is its potential to generalize beyond the CORD dataset it was fine-tuned on. You could experiment with using the model on other types of document images, such as financial reports, scientific papers, or government forms, to see how it performs. Additionally, you could explore fine-tuning the model further on your own dataset to tailor its performance to your specific use case.
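
If you want to try the model yourself, the snippet below is a minimal parsing sketch using the Hugging Face transformers library. It assumes a local receipt image (receipt.jpg is a placeholder path) and the <s_cord-v2> task prompt used by this fine-tune; adapt it as needed for your setup.

```python
import re
import torch
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Placeholder path -- any receipt-style document image
image = Image.open("receipt.jpg").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values

# The CORD fine-tune is prompted with a single task start token
decoder_input_ids = processor.tokenizer(
    "<s_cord-v2>", add_special_tokens=False, return_tensors="pt"
).input_ids

outputs = model.generate(
    pixel_values.to(device),
    decoder_input_ids=decoder_input_ids.to(device),
    max_length=model.decoder.config.max_position_embeddings,
    pad_token_id=processor.tokenizer.pad_token_id,
    eos_token_id=processor.tokenizer.eos_token_id,
    use_cache=True,
    bad_words_ids=[[processor.tokenizer.unk_token_id]],
    return_dict_in_generate=True,
)

# Decode the tag sequence and convert it into a nested dict of receipt fields
sequence = processor.batch_decode(outputs.sequences)[0]
sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()
print(processor.token2json(sequence))
```

Because the output is a structured tag sequence rather than free text, token2json returns named fields directly, which is what makes the model convenient for receipt and invoice parsing.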


OCR-Donut-CORD

Maintainer: jinhybr

Total Score: 118

The OCR-Donut-CORD model is a Donut model fine-tuned on the CORD dataset, a document parsing dataset. Donut, introduced in the paper "OCR-free Document Understanding Transformer" by Geewook Kim et al., consists of a Swin Transformer vision encoder and a BART text decoder. Given an image, the encoder first encodes it into a tensor of embeddings, which the decoder then uses to autoregressively generate text. Similar models include the donut-base-finetuned-cord-v2 and donut-base-finetuned-docvqa models, which are also Donut models fine-tuned on different datasets. The donut-base model is the pre-trained base version of Donut, meant to be fine-tuned on a downstream task.

Model inputs and outputs

Inputs

  • Image of a document

Outputs

  • Text extracted from the document image

Capabilities

The OCR-Donut-CORD model is capable of extracting text from document images without the need for optical character recognition (OCR). This can be useful for tasks like document parsing, where the model can directly generate structured text from the image without a separate OCR step.

What can I use it for?

You can use the OCR-Donut-CORD model to parse and extract text from document images, such as receipts, forms, or scientific papers. This can be particularly useful in scenarios where you need to process a large volume of documents, as the model can automate the text extraction process.

Things to try

One interesting thing to try with the OCR-Donut-CORD model is to compare its performance on different types of documents, such as handwritten notes, complex layouts, or documents with low image quality. This can help you understand the model's strengths and limitations, and guide you in selecting the best model for your specific use case.


nougat-base

Maintainer: facebook

Total Score: 126

The nougat-base model is a Donut model trained by Facebook to transcribe scientific PDFs into an easy-to-use Markdown format. It consists of a Swin Transformer as the vision encoder and an mBART model as the text decoder. The model is trained to autoregressively predict the Markdown given only the pixels of the PDF image as input.

The nougat-base model is similar to other Donut models like the nougat-latex-base and donut-base-finetuned-cord-v2 models. The nougat-latex-base model is fine-tuned from the nougat-base model to boost its proficiency in generating LaTeX code from images, while the donut-base-finetuned-cord-v2 model is a Donut model fine-tuned on the CORD dataset for document parsing tasks.

Model inputs and outputs

Inputs

  • PDF image: The nougat-base model takes a PDF page image as input and encodes it using a Swin Transformer vision encoder.

Outputs

  • Markdown text: The model outputs the corresponding Markdown text for the input PDF image, generated autoregressively by the mBART text decoder.

Capabilities

The nougat-base model is capable of accurately transcribing scientific PDF documents into a clean, structured Markdown format. This can be especially useful for researchers and academics who need to process and share large volumes of PDF content.

What can I use it for?

You can use the nougat-base model to automate the process of converting PDF documents into an easily consumable Markdown format. This could be beneficial for tasks such as:

  • Streamlining the publication process by automatically formatting research papers
  • Organizing and sharing meeting notes or other internal documents more effectively
  • Improving accessibility by converting PDF files into a more user-friendly format

Things to try

One interesting aspect of the nougat-base model is its ability to handle complex scientific documents, including equations and mathematical notation. You could try experimenting with the model's performance on PDF files containing a lot of technical content, and see how well it is able to capture and represent the information in Markdown format. Additionally, you could explore fine-tuning the nougat-base model on your own datasets or domain-specific content to further improve its accuracy and usefulness for your particular use case.
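
As a rough sketch of how you might run the model with the Hugging Face transformers library: nougat operates on an image of a single PDF page, so you would first render the page to an image (for example with pdf2image). In the snippet below, page.png is a placeholder for such a rendering, and the NougatProcessor calls follow the API in recent transformers releases, so verify them against your installed version.

```python
import torch
from PIL import Image
from transformers import NougatProcessor, VisionEncoderDecoderModel

processor = NougatProcessor.from_pretrained("facebook/nougat-base")
model = VisionEncoderDecoderModel.from_pretrained("facebook/nougat-base")

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Placeholder: an image of one PDF page (render the PDF to PNG beforehand)
image = Image.open("page.png").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values

# Autoregressively generate the Markdown transcription of the page
outputs = model.generate(
    pixel_values.to(device),
    min_length=1,
    max_new_tokens=1024,
    bad_words_ids=[[processor.tokenizer.unk_token_id]],
)

# Decode the generated tokens and clean up the raw output into Markdown
markdown = processor.batch_decode(outputs, skip_special_tokens=True)[0]
markdown = processor.post_process_generation(markdown)
print(markdown)
```

Increasing max_new_tokens lets the model transcribe longer pages; the value above is only an illustrative cap.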
