GLM-4V-9B
=========

Read this in [English](/THUDM/glm-4v-9b/blob/main/README_en.md)

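GLM-4V-9B is the open-source multimodal model in the GLM-4 series of pretrained models from Zhipu AI. **GLM-4V-9B** supports bilingual (Chinese and English) dialogue at an image resolution of 1120 \* 1120. In the multimodal evaluations reported below, GLM-4V-9B is compared against GPT-4-turbo-2024-04-09, Gemini 1.0 Pro, Qwen-VL-Max, and Claude 3 Opus.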

### Benchmark

Evaluation results for GLM-4V-9B and other models on multimodal benchmarks:

| Model | MMBench-EN-Test | MMBench-CN-Test | SEEDBench\_IMG | MMStar | MMMU | MME | HallusionBench | AI2D | OCRBench |
|---|---|---|---|---|---|---|---|---|---|
| **GPT-4o, 20240513** | 83.4 | 82.1 | 77.1 | 63.9 | 69.2 | 2310.3 | 55 | 84.6 | 736 |
| **GPT-4v, 20240409** | 81 | 80.2 | 73 | 56 | 61.7 | 2070.2 | 43.9 | 78.6 | 656 |
| **GPT-4v, 20231106** | 77 | 74.4 | 72.3 | 49.7 | 53.8 | 1771.5 | 46.5 | 75.9 | 516 |
| **InternVL-Chat-V1.5** | 82.3 | 80.7 | 75.2 | 57.1 | 46.8 | 2189.6 | 47.4 | 80.6 | 720 |
| **LlaVA-Next-Yi-34B** | 81.1 | 79 | 75.7 | 51.6 | 48.8 | 2050.2 | 34.8 | 78.9 | 574 |
| **Step-1V** | 80.7 | 79.9 | 70.3 | 50 | 49.9 | 2206.4 | 48.4 | 79.2 | 625 |
| **MiniCPM-Llama3-V2.5** | 77.6 | 73.8 | 72.3 | 51.8 | 45.8 | 2024.6 | 42.4 | 78.4 | 725 |
| **Qwen-VL-Max** | 77.6 | 75.7 | 72.7 | 49.5 | 52 | 2281.7 | 41.2 | 75.7 | 684 |
| **GeminiProVision** | 73.6 | 74.3 | 70.7 | 38.6 | 49 | 2148.9 | 45.7 | 72.9 | 680 |
| **Claude-3V Opus** | 63.3 | 59.2 | 64 | 45.7 | 54.9 | 1586.8 | 37.8 | 70.6 | 694 |
| **GLM-4v-9B** | 81.1 | 79.4 | 76.8 | 58.7 | 47.2 | 2163.8 | 46.6 | 81.1 | 786 |

**This is the model repository for GLM-4V-9B, which supports an `8K` context length.**

Quick Start
-----------

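    import torch
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoTokenizer

    device = "cuda"

    # trust_remote_code is needed because the checkpoint ships its own tokenizer and model code.
    tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4v-9b", trust_remote_code=True)

    query = 'Describe this image'
    image = Image.open("your image").convert('RGB')  # replace "your image" with a real file path
    # The image travels in the "image" field of the user message.
    inputs = tokenizer.apply_chat_template([{"role": "user", "image": image, "content": query}],
                                           add_generation_prompt=True, tokenize=True, return_tensors="pt",
                                           return_dict=True)  # chat mode

    inputs = inputs.to(device)
    model = AutoModelForCausalLM.from_pretrained(
        "THUDM/glm-4v-9b",
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True,
        trust_remote_code=True
    ).to(device).eval()

    # max_length counts prompt tokens plus generated tokens.
    # Sampling with top_k=1 keeps only the most likely token at each step.
    gen_kwargs = {"max_length": 2500, "do_sample": True, "top_k": 1}
    with torch.no_grad():
        outputs = model.generate(**inputs, **gen_kwargs)
        # generate() returns prompt + completion; keep only the newly generated tokens.
        outputs = outputs[:, inputs['input_ids'].shape[1]:]
        print(tokenizer.decode(outputs[0]))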
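
The generation settings above combine `do_sample=True` with `top_k=1`. The following is a minimal sketch of why that combination behaves like greedy decoding. It is written in plain Python and is not the library's own sampling code; `top_k_sample` and `greedy` are illustrative names, not `transformers` functions.

```python
import math
import random


def top_k_sample(logits, k, rng):
    """Sample a token index from the k highest-scoring logits."""
    # Keep the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax weights over the surviving logits (shifted by the max for stability).
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]
    # Draw one index in proportion to its weight.
    return rng.choices(top, weights=weights, k=1)[0]


def greedy(logits):
    """Return the index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


logits = [0.1, 2.5, -1.0, 1.7]
# With k=1 only one candidate survives, so every seed picks the same token.
samples = {top_k_sample(logits, k=1, rng=random.Random(seed)) for seed in range(5)}
```

With `k=1` the candidate set has a single entry, so the random draw has nothing to choose between and the result matches `greedy(logits)`.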
    

License
-------

Use of the GLM-4 model weights is subject to the [LICENSE](/THUDM/glm-4v-9b/blob/main/LICENSE).

Citation
--------



    @article{zeng2022glm,
      title={Glm-130b: An open bilingual pre-trained model},
      author={Zeng, Aohan and Liu, Xiao and Du, Zhengxiao and Wang, Zihan and Lai, Hanyu and Ding, Ming and Yang, Zhuoyi and Xu, Yifan and Zheng, Wendi and Xia, Xiao and others},
      journal={arXiv preprint arXiv:2210.02414},
      year={2022}
    }
    

    @inproceedings{du2022glm,
      title={GLM: General Language Model Pretraining with Autoregressive Blank Infilling},
      author={Du, Zhengxiao and Qian, Yujie and Liu, Xiao and Ding, Ming and Qiu, Jiezhong and Yang, Zhilin and Tang, Jie},
      booktitle={Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
      pages={320--335},
      year={2022}
    }
    

    @misc{wang2023cogvlm,
      title={CogVLM: Visual Expert for Pretrained Language Models},
      author={Weihan Wang and Qingsong Lv and Wenmeng Yu and Wenyi Hong and Ji Qi and Yan Wang and Junhui Ji and Zhuoyi Yang and Lei Zhao and Xixuan Song and Jiazheng Xu and Bin Xu and Juanzi Li and Yuxiao Dong and Ming Ding and Jie Tang},
      year={2023},
      eprint={2311.03079},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
    }

## Model Overview

`glm-4v-9b` is an open-source multimodal (vision-language) model developed by THUDM. It belongs to the GLM (General Language Model) family, which aims to provide open, bilingual models that perform well across a wide range of tasks.

The `glm-4v-9b` model builds on earlier work in the GLM family (see the citations above). In the benchmark results above, it scores higher than GPT-4v (20240409), Qwen-VL-Max, GeminiProVision, and Claude-3V Opus on MMBench-EN-Test, SEEDBench_IMG, MMStar, AI2D, and OCRBench, although all four of those models score higher on MMMU.

Compared with text-only language models, `glm-4v-9b` accepts images alongside text and handles both English and Chinese, which makes it suited to tasks such as image captioning and visual question answering.

## Model Inputs and Outputs

### Inputs
- **Text**: A conversation expressed as a list of messages, with the user's message formatted as `{"role": "user", "content": query}`.
- **Images**: An image can accompany the text. In the example above, a PIL image is placed in the `image` field of the user message passed to `tokenizer.apply_chat_template`.
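
As a rough sketch of the input format described above, a small helper can assemble the user message with or without an image. `build_user_message` is a hypothetical name introduced here for illustration; only the dictionary keys come from the example in this README.

```python
def build_user_message(query, image=None):
    """Return a user message dict in the shape used by the README example."""
    message = {"role": "user", "content": query}
    if image is not None:
        # In the README example this is a PIL image converted to RGB.
        message["image"] = image
    return message


text_only = build_user_message("Hello")

# object() is only a placeholder so the sketch runs without PIL or an image file.
placeholder_image = object()
with_image = build_user_message("Describe this image", image=placeholder_image)

messages = [with_image]
```

In real use, `messages` would be passed to `tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_tensors="pt", return_dict=True)`, as in the example above.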

### Outputs
- **Text Response**: The model generates token IDs, which are decoded with the tokenizer to obtain the text response. In the example above, the prompt tokens are sliced off the output before decoding.
- **Conversation History**: The model keeps no state between calls. To continue a dialogue, pass the earlier turns back in as part of the message list.
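
The two output-handling steps can be sketched with plain lists standing in for tensors. The helper names are hypothetical, and the `"assistant"` role name is an assumption about the chat template; how images are carried across turns is not covered here and may depend on the model's template.

```python
def strip_prompt(output_ids, prompt_length):
    """Drop the prompt tokens that appear at the start of each generated sequence."""
    return [seq[prompt_length:] for seq in output_ids]


def extend_history(history, user_text, assistant_text):
    """Return a new history with one more user/assistant exchange appended."""
    return history + [
        {"role": "user", "content": user_text},
        {"role": "assistant", "content": assistant_text},  # role name is an assumption
    ]


# Toy token IDs: a 3-token prompt followed by 2 generated tokens.
output_ids = [[11, 12, 13, 101, 102]]
new_tokens = strip_prompt(output_ids, prompt_length=3)

# The caller, not the model, carries the dialogue forward.
history = extend_history([], "What is in the picture?", "A cat on a sofa.")
next_messages = history + [{"role": "user", "content": "What colour is the cat?"}]
```

`strip_prompt` mirrors the slice `outputs[:, inputs['input_ids'].shape[1]:]` from the example above, and `next_messages` is what would be fed to the chat template for the following turn.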

## Capabilities

The benchmark results above cover multimodal tasks in both English and Chinese. `glm-4v-9b` scores 81.1 on MMBench-EN-Test, 79.4 on MMBench-CN-Test, and 76.8 on SEEDBench_IMG, all of which test understanding of images together with text.

It also scores 58.7 on MMStar, 81.1 on AI2D (diagram understanding), and 786 on OCRBench, the highest OCRBench score in the table. These results point to potential in visual question answering, diagram understanding, and optical character recognition.
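
To put these numbers in context, the sketch below ranks GLM-4v-9B against the other models in the benchmark table. The scores are copied from that table; the ranking helper is illustrative, and it assumes that a higher score is better on every benchmark.

```python
BENCHMARKS = ["MMBench-EN-Test", "MMBench-CN-Test", "SEEDBench_IMG", "MMStar",
              "MMMU", "MME", "HallusionBench", "AI2D", "OCRBench"]

# Scores copied from the benchmark table, in the column order above.
SCORES = {
    "GPT-4o, 20240513":    [83.4, 82.1, 77.1, 63.9, 69.2, 2310.3, 55, 84.6, 736],
    "GPT-4v, 20240409":    [81, 80.2, 73, 56, 61.7, 2070.2, 43.9, 78.6, 656],
    "GPT-4v, 20231106":    [77, 74.4, 72.3, 49.7, 53.8, 1771.5, 46.5, 75.9, 516],
    "InternVL-Chat-V1.5":  [82.3, 80.7, 75.2, 57.1, 46.8, 2189.6, 47.4, 80.6, 720],
    "LlaVA-Next-Yi-34B":   [81.1, 79, 75.7, 51.6, 48.8, 2050.2, 34.8, 78.9, 574],
    "Step-1V":             [80.7, 79.9, 70.3, 50, 49.9, 2206.4, 48.4, 79.2, 625],
    "MiniCPM-Llama3-V2.5": [77.6, 73.8, 72.3, 51.8, 45.8, 2024.6, 42.4, 78.4, 725],
    "Qwen-VL-Max":         [77.6, 75.7, 72.7, 49.5, 52, 2281.7, 41.2, 75.7, 684],
    "GeminiProVision":     [73.6, 74.3, 70.7, 38.6, 49, 2148.9, 45.7, 72.9, 680],
    "Claude-3V Opus":      [63.3, 59.2, 64, 45.7, 54.9, 1586.8, 37.8, 70.6, 694],
    "GLM-4v-9B":           [81.1, 79.4, 76.8, 58.7, 47.2, 2163.8, 46.6, 81.1, 786],
}


def rank_of(model, benchmark):
    """1-based rank of `model` on `benchmark`; tied models share the better rank."""
    col = BENCHMARKS.index(benchmark)
    own = SCORES[model][col]
    # Count the models that score strictly higher (assumes higher is better).
    return 1 + sum(1 for row in SCORES.values() if row[col] > own)


ranks = {b: rank_of("GLM-4v-9B", b) for b in BENCHMARKS}
for name, rank in ranks.items():
    print(f"{name}: rank {rank} of {len(SCORES)}")
```

A per-benchmark rank makes it easier to see where the model leads (OCRBench) and where it trails (MMMU) than a single headline comparison does.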

## What Can I Use It For?

Because `glm-4v-9b` is bilingual and multimodal, it can serve a variety of applications. Some potential use cases include:

- **Intelligent Assistants**: The model can hold natural-language conversations and answer questions about images, which makes it a candidate for virtual assistants that handle both text and image requests.

- **Image-Grounded Text Generation**: The model produces text conditioned on images, so developers can build applications that generate image captions, descriptions, or narratives about visual content.

- **Bilingual Language Understanding**: Organizations that work in both English and Chinese can use `glm-4v-9b` to build applications that handle either language.

- **Research and Development**: As an open-source model, `glm-4v-9b` can be a valuable resource for AI researchers and developers looking to explore the latest advancements in large language models and multimodal learning.

## Things to Try

`glm-4v-9b` takes both text and images as input. Developers and researchers can experiment with tasks such as image captioning, visual question answering, and text generation guided by an image.

Another avenue is bilingual use. Try the same questions in English and in Chinese and compare whether the answers stay consistent and on topic across the two languages.
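
One possible way to organise such a bilingual comparison is sketched below. `run_bilingual` and `echo_generate` are hypothetical names; `echo_generate` is a stand-in so the sketch runs without model weights, and in practice it would be replaced by a function that wraps the tokenizer and `model.generate` call from the example above.

```python
def run_bilingual(prompt_pairs, generate_fn):
    """Run each (english, chinese) prompt pair through generate_fn and collect replies."""
    results = []
    for english, chinese in prompt_pairs:
        results.append({
            "en": {"prompt": english, "reply": generate_fn(english)},
            "zh": {"prompt": chinese, "reply": generate_fn(chinese)},
        })
    return results


# Each pair asks the same question in English and in Chinese.
prompt_pairs = [
    ("Describe this image", "描述这张图片"),
    ("What text appears in this image?", "这张图片中有哪些文字？"),
]


def echo_generate(prompt):
    # Stand-in for a real model call; it only echoes the prompt.
    return f"<reply to: {prompt}>"


results = run_bilingual(prompt_pairs, echo_generate)
```

Keeping the two replies side by side in one record makes it straightforward to review each pair for consistency.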

Finally, the benchmark scores suggest that the model could be a reasonable starting point for fine-tuning on domain-specific data. Developers can experiment with adapting it to their own use cases.