LayoutLM for Visual Question Answering
======================================

This is a fine-tuned version of the multi-modal [LayoutLM](https://aka.ms/layoutlm) model for the task of question answering on documents. It has been fine-tuned using both the [SQuAD2.0](https://huggingface.co/datasets/squad_v2) and [DocVQA](https://www.docvqa.org/) datasets.

Getting started with the model
------------------------------

To run these examples, you must have [PIL](https://pillow.readthedocs.io/en/stable/installation.html), [pytesseract](https://pypi.org/project/pytesseract/), and [PyTorch](https://pytorch.org/get-started/locally/) installed in addition to [transformers](https://huggingface.co/docs/transformers/index).

    from transformers import pipeline
    
    nlp = pipeline(
        "document-question-answering",
        model="impira/layoutlm-document-qa",
    )
    
    nlp(
        "https://templates.invoicehome.com/invoice-template-us-neat-750px.png",
        "What is the invoice number?"
    )
    # {'score': 0.9943977, 'answer': 'us-001', 'start': 15, 'end': 15}
    
    nlp(
        "https://miro.medium.com/max/787/1*iECQRIiOGTmEFLdWkVIH2g.jpeg",
        "What is the purchase amount?"
    )
    # {'score': 0.9912159, 'answer': '$1,000,000,000', 'start': 97, 'end': 97}
    
    nlp(
        "https://www.accountingcoach.com/wp-content/uploads/2013/10/income-statement-example@2x.png",
        "What are the 2020 net sales?"
    )
    # {'score': 0.59147286, 'answer': '$ 3,750', 'start': 19, 'end': 20}
    

**NOTE**: This model and pipeline were recently added to transformers via [PR #18407](https://github.com/huggingface/transformers/pull/18407) and [PR #18414](https://github.com/huggingface/transformers/pull/18414), so you'll need a recent version of transformers, for example:

    pip install git+https://github.com/huggingface/transformers.git@2ef774211733f0acf8d3415f9284c49ef219e991
    

About us
--------

This model was created by the team at [Impira](https://www.impira.com/).

## Model overview

The `layoutlm-document-qa` model is a fine-tuned version of the multi-modal [LayoutLM](https://aka.ms/layoutlm) model, created by the team at [Impira](https://www.impira.com/). It has been fine-tuned for the task of question answering on documents, using both the [SQuAD2.0](https://huggingface.co/datasets/squad_v2) and [DocVQA](https://www.docvqa.org/) datasets.

Another similar model created by Impira is the [layoutlm-invoices](https://aimodels.fyi/models/huggingFace/layoutlm-invoices-impira) model, which is also a fine-tuned version of LayoutLM, but specifically for question answering on invoices and other documents.

## Model inputs and outputs

### Inputs
- **Image**: The model takes an image of a document as input.
- **Question**: The model also takes a natural language question about the document as input.
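
If you already have OCR output, the `document-question-answering` pipeline can take it through its optional `word_boxes` argument instead of running pytesseract, with each box normalized to a 0 to 1000 range. The following is a minimal sketch of that normalization, assuming pixel boxes in `(left, top, right, bottom)` order; the helper names and the sample OCR values are hypothetical.

```python
from typing import List, Sequence, Tuple


def normalize_box(box: Sequence[float], width: float, height: float) -> List[int]:
    """Scale a pixel box (left, top, right, bottom) to the 0-1000 range."""
    left, top, right, bottom = box
    scaled = [
        1000 * left / width,
        1000 * top / height,
        1000 * right / width,
        1000 * bottom / height,
    ]
    # Clamp so OCR boxes that slightly overshoot the page stay in range.
    return [max(0, min(1000, int(value))) for value in scaled]


def to_word_boxes(
    ocr_words: Sequence[Tuple[str, Sequence[float]]], width: float, height: float
) -> List[Tuple[str, List[int]]]:
    """Pair each word with its normalized box: [(word, [x0, y0, x1, y1]), ...]."""
    return [(word, normalize_box(box, width, height)) for word, box in ocr_words]


# Hypothetical OCR output for a 750 x 1000 pixel page (not real model data).
ocr_words = [("Invoice", (75, 50, 225, 80)), ("us-001", (600, 50, 720, 80))]
word_boxes = to_word_boxes(ocr_words, width=750, height=1000)
```

The resulting list could then be passed along with the image and question, for example `nlp(image, question, word_boxes=word_boxes)`, provided your installed transformers version supports that argument.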

### Outputs
- **Answer**: The model outputs the answer to the given question, along with a confidence score.
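- **Start and end positions**: The model also outputs `start` and `end`, the word indices of the answer within the document's OCR text. In the examples above both indices are inclusive, so a one-word answer has `start` equal to `end`.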
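
As a rough sketch of how these outputs might be consumed, the helpers below recover the answer words from the `start` and `end` indices and apply a confidence cutoff. The function names, the `0.9` threshold, and the placeholder word list are assumptions for illustration; the result dict is copied from the income statement example above. Depending on the transformers version, the pipeline may return either a single dict or a list of dicts, so the sketch accepts both.

```python
from typing import Any, Dict, List, Optional, Union


def first_result(output: Union[Dict[str, Any], List[Dict[str, Any]]]) -> Optional[Dict[str, Any]]:
    """Return one result dict whether the pipeline gave a dict or a list."""
    if isinstance(output, list):
        return output[0] if output else None
    return output


def answer_words(words: List[str], result: Dict[str, Any]) -> List[str]:
    """Slice the OCR words covered by the answer (both indices inclusive)."""
    return words[result["start"]: result["end"] + 1]


def is_confident(result: Dict[str, Any], min_score: float = 0.9) -> bool:
    """Hypothetical cutoff: accept the answer only above min_score."""
    return result["score"] >= min_score


# Result copied from the income statement example above.
result = {"score": 0.59147286, "answer": "$ 3,750", "start": 19, "end": 20}

# Placeholder OCR words; only positions 19 and 20 matter for this sketch.
words = [f"word{i}" for i in range(19)] + ["$", "3,750"]

span = answer_words(words, first_result([result]))
```

Here `span` holds the two words that make up the answer, and `is_confident(result)` is false at the assumed `0.9` cutoff, which is one way to route low-scoring answers to manual review.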

## Capabilities

The `layoutlm-document-qa` model answers questions using both the text and the layout of a document, so it can locate values in invoices, forms, and statements where position on the page matters as much as wording. In the pipeline output, each answer is a single span of OCR words identified by its `start` and `end` indices.

For example, the answer `$ 3,750` in the income statement example above covers two adjacent words (indices 19 to 20), while the invoice number `us-001` is a single word.

## What can I use it for?

The `layoutlm-document-qa` model can be used for a variety of document-related tasks, such as:
- Automating the process of extracting information from invoices, receipts, and other business documents.
- Enhancing document search and retrieval systems by allowing users to ask natural language questions about document contents.
- Improving document understanding and comprehension for tasks like legal document analysis and medical record processing.
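
For the extraction use case, one possible pattern is to ask a fixed set of questions per document and keep only answers above a confidence cutoff. The sketch below assumes `qa` is any callable with the same `(image, question)` signature as the pipeline shown earlier; `extract_fields`, the field names, the `0.5` cutoff, and the `fake_qa` stand-in (used so the example runs without downloading the model) are all hypothetical.

```python
from typing import Any, Callable, Dict, Optional


def extract_fields(
    qa: Callable[[Any, str], Any],
    image: Any,
    questions: Dict[str, str],
    min_score: float = 0.5,
) -> Dict[str, Optional[str]]:
    """Ask one question per field; keep answers scoring at least min_score."""
    fields: Dict[str, Optional[str]] = {}
    for name, question in questions.items():
        output = qa(image, question)
        # Some transformers versions return a list of candidates, others a dict.
        if isinstance(output, list):
            result = output[0] if output else None
        else:
            result = output
        if result is not None and result["score"] >= min_score:
            fields[name] = result["answer"]
        else:
            fields[name] = None  # Low confidence or no answer: leave for review.
    return fields


def fake_qa(image: Any, question: str) -> Dict[str, Any]:
    """Stand-in for the real pipeline, returning canned results."""
    canned = {
        "What is the invoice number?": {"score": 0.99, "answer": "us-001", "start": 15, "end": 15},
        "What is the due date?": {"score": 0.12, "answer": "Total", "start": 40, "end": 40},
    }
    return canned[question]


questions = {
    "invoice_number": "What is the invoice number?",
    "due_date": "What is the due date?",
}
fields = extract_fields(fake_qa, "invoice.png", questions)
```

With the real model, `fake_qa` would be replaced by the `nlp` pipeline object, and the cutoff would need tuning against your own documents.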

## Things to try

One interesting aspect of the `layoutlm-document-qa` model is that it uses layout as well as text, which can be particularly useful for documents with complex layouts or formatting. You could try experimenting with different types of documents, such as forms, tables, or mixed-content pages, and compare the confidence scores and answer spans to see how the model performs.

Additionally, you could explore fine-tuning the model further on your own specialized document datasets to see if you can improve its performance on your specific use case.
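
If your own annotations store answers as plain text, fine-tuning an extractive model usually requires converting each answer into start and end word positions first. The following is a minimal sketch of one common approach, an exact case-insensitive match of the answer against the OCR words. It is not taken from Impira's training code, and `find_answer_span` is a hypothetical helper.

```python
from typing import List, Optional, Tuple


def find_answer_span(words: List[str], answer: str) -> Optional[Tuple[int, int]]:
    """Return inclusive (start, end) word indices of the first exact match."""
    target = [token.lower() for token in answer.split()]
    lowered = [word.lower() for word in words]
    if not target:
        return None
    for start in range(len(lowered) - len(target) + 1):
        if lowered[start:start + len(target)] == target:
            return start, start + len(target) - 1
    return None  # Answer text not found; such examples are often skipped.


# Hypothetical OCR words for a single line of an income statement.
words = ["Net", "sales", "$", "3,750"]
span = find_answer_span(words, "$ 3,750")
```

Exact matching misses answers that the OCR engine tokenizes or spells differently, so a fuzzy match may be needed for noisy scans.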