InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing

arXiv  Hugging Face  GenEditEvalKit  TextEdit Benchmark

Shanghai AI Laboratory, InternVL-U Team

Welcome to the official repository for the InternVL-U project! If you find our work helpful, please give us a ⭐.

🎉 News

  • [2026/03/06] 🔥 InternVL-U technical report released.
  • [2026/03/06] Inference code and model checkpoint released.
  • [2026/03/06] 🛠️ GenEditEvalKit released — a unified evaluation toolkit for multimodal image generation and editing models, designed to help developers efficiently manage inference and evaluation across the many benchmarks for unified multimodal models (UMMs) and image generation and editing models. Check it out on [GitHub].
  • [2026/03/06] 📝 TextEdit Benchmark released — a high-quality, multi-scenario benchmark for evaluating the text-editing capabilities of image generation models. Try it out and see how well your model performs on challenging text-editing tasks. Check it out on [GitHub].

📖 Introduction

InternVL-U is a 4B-parameter unified multimodal model (UMM) that brings multimodal understanding, reasoning, image generation, and image editing into a single framework, aiming to democratize omni-capable multimodal intelligence at an efficient, practical model size.

Key Highlights

  • Unified yet modular design: built on unified contextual modeling with modality-specific modularity and decoupled visual representations.
  • Strong backbone + strong generator: integrates a state-of-the-art MLLM with a specialized MMDiT-based visual generation head.
  • High-quality data synthesis: a comprehensive data pipeline for high-semantic-density tasks (e.g., text rendering and editing, scientific reasoning) using Chain-of-Thought (CoT) to align abstract intent with precise visual execution.
  • Performance–efficiency win: at a limited parameter scale, InternVL-U outperforms open-source UMM baselines in generation and editing, while retaining strong multimodal understanding and reasoning.

We hope InternVL-U serves as a strong baseline and accelerates progress toward comprehensive, AGI-oriented omni-capable UMMs.

⚡Quick Start

Before getting started, make sure you have installed all required dependencies.

pip install -r requirements.txt
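If the demos below fail with import errors, a quick sanity check can pinpoint what is missing. This snippet is not part of the official tooling; the package names are assumed from the imports used in the demos below.

```python
import importlib.util

def check_deps(packages=("torch", "PIL", "internvlu")):
    """Return the required packages that cannot be found by this interpreter."""
    return [pkg for pkg in packages if importlib.util.find_spec(pkg) is None]

missing = check_deps()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All demo dependencies found.")
```

Using `importlib.util.find_spec` avoids actually importing heavy packages such as torch just to verify they are installed.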

The model checkpoint is available on Hugging Face.

Inference Demo

We provide the following demos to showcase InternVL-U’s unified pipeline for multimodal understanding, image generation, and image editing, with optional reasoning-guided (text+image) outputs.

Generate Text
import torch
from PIL import Image
from internvlu import InternVLUPipeline

prompt = "What is the amino acid shown in the picture?"
image = Image.open("assets/amino_acid.png").convert("RGB")

pipeline = InternVLUPipeline.from_pretrained(
    "/path/to/internvl-u-checkpoint",
    torch_dtype=torch.bfloat16,
)

pipeline.to("cuda")

tokenizer = pipeline.processor.tokenizer

with torch.no_grad():
    output = pipeline(
        prompt=prompt,
        image=image,
        max_new_tokens=1024,
        generation_mode="text",
    ).generate_output[0]

print(tokenizer.decode(output, skip_special_tokens=True))
Generate Image
import torch
from internvlu import InternVLUPipeline

prompt = """In the deep indigo night sky, a grand fireworks festival is at its peak, with countless dazzling sparks arranged precisely, condensing into the huge and dazzling "InternVL-U" letters. The letters are composed of highly saturated electric blue and dreamy purple fluorescent particles, presenting a futuristic streamlined font surrounded by scattered golden fragments resembling stardust, and the final "U" gives off a fluid metallic texture. Below is a brightly lit modern city, with the shimmering sea perfectly reflecting this stunning scene. Amidst the swirling smoke, it showcases the ultimate visual allure of technology and romance intertwined."""

pipeline = InternVLUPipeline.from_pretrained(
    "/path/to/internvl-u-checkpoint",
    torch_dtype=torch.bfloat16,
)

pipeline.to("cuda")

with torch.no_grad():
    image = pipeline(
        prompt=prompt,
        generation_mode="image",
        height=576,
        width=1024,
        generator=torch.Generator(device="cuda").manual_seed(42)
    ).images[0]

image.save("example_t2i.png")
Generate Image with Reasoning Guidance
import torch
from internvlu import InternVLUPipeline

# Chinese prompt: "Generate an image showing hope growing, full of a sense of strength."
prompt = """生成一张展现希望正在生长、充满力量感的画面。"""

pipeline = InternVLUPipeline.from_pretrained(
    "/path/to/internvl-u-checkpoint",
    torch_dtype=torch.bfloat16,
)

pipeline.to("cuda")

tokenizer = pipeline.processor.tokenizer

with torch.no_grad():
    output = pipeline(
        prompt=prompt,
        generation_mode="text_image",
        generator=torch.Generator(device="cuda").manual_seed(42)
    )
    output_text = output.generate_output[0]
    image = output.images[0]

print(tokenizer.decode(output_text, skip_special_tokens=True))
image.save("example_guided_t2i.png")
Edit Image
import torch
from PIL import Image
from internvlu import InternVLUPipeline

# Chinese prompt (summary): place the scholar in a warm indoor scene full of Spring
# Festival atmosphere — lanterns, paper-cut window decorations, red-and-gold ribbons,
# soft orange lighting — holding a red envelope with an irrepressible, joyful smile.
prompt = """将书生置身于一个充满春节氛围的温暖室内空间中,房间内张灯结彩,窗外远处有灯笼,窗边装饰着精致的剪纸,红金配色的彩带在室内轻轻垂落。背景为柔和温暖的橙黄色灯光和柔软的沙发,营造出温馨而治愈的春节夜晚氛围。书生双手捧着一个红包,红包上有金色的边框,脸上洋溢着抑制不住的开心笑容,浓眉大眼闪闪发亮,流露出惊喜与幸福的神情。他身体微微前倾,肩膀自然放松,仿佛忍不住想要展示这份喜悦。整个场景充满温馨、喜庆与治愈感,在柔和的暮色与节日装饰映衬下,展现出春节特有的欢乐与人情味。"""
input_image = Image.open("assets/intern.png").convert("RGB")

pipeline = InternVLUPipeline.from_pretrained(
    "/path/to/internvl-u-checkpoint",
    torch_dtype=torch.bfloat16,
)

pipeline.to("cuda")

with torch.no_grad():
    image = pipeline(
        prompt=prompt,
        image=input_image,
        generation_mode="image",
        height=input_image.size[1],
        width=input_image.size[0],
        generator=torch.Generator(device="cuda").manual_seed(42)
    ).images[0]

image.save("example_edit.png")
Edit Image with Reasoning Guidance
import torch
from PIL import Image
from internvlu import InternVLUPipeline

prompt = """Convert into the style of a Valentine's Festival picture, suitable for use as a profile picture on social media."""
input_image = Image.open("assets/lines_puppy.png").convert("RGB")

pipeline = InternVLUPipeline.from_pretrained(
    "/path/to/internvl-u-checkpoint",
    torch_dtype=torch.bfloat16,
)

pipeline.to("cuda")

tokenizer = pipeline.processor.tokenizer

with torch.no_grad():
    output = pipeline(
        prompt=prompt,
        image=input_image,
        num_beams=5,
        generation_mode="text_image",
        generator=torch.Generator(device="cuda").manual_seed(42),
    )
    output_text = output.generate_output[0]
    image = output.images[0]

print(tokenizer.decode(output_text, skip_special_tokens=True))
image.save("example_guided_edit.png")
Batch Inference
import torch
from PIL import Image
from internvlu import InternVLUPipeline

prompts = [
    "Segment the little boy",
    "Provide the bounding box for the little boy"
]
input_images = [Image.open("assets/panda_and_boy.png").convert("RGB")] * 2


pipeline = InternVLUPipeline.from_pretrained(
    "/path/to/internvl-u-checkpoint",
    torch_dtype=torch.bfloat16,
)

pipeline.to("cuda")

with torch.no_grad():
    output_images = pipeline(
        prompt=prompts,
        image=input_images,
        generation_mode="image",
        height=input_images[0].size[1],
        width=input_images[0].size[0],
        generator=torch.Generator(device="cuda").manual_seed(42)
    ).images

for idx, image in enumerate(output_images):
    image.save(f"example_edit_{idx}.png")
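In the batch demo above, prompts and images are matched positionally, so the two lists must have the same length. A tiny guard like the following fails fast on mismatched batches; it is a sketch for convenience, not part of the InternVL-U API.

```python
def pair_inputs(prompts, images):
    """Zip prompts with their input images, refusing silently mismatched batches."""
    if len(prompts) != len(images):
        raise ValueError(
            f"Got {len(prompts)} prompts but {len(images)} images; "
            "batch inputs are matched by position."
        )
    return list(zip(prompts, images))
```

Validating up front is preferable to `zip`'s default behavior, which would silently drop the trailing items of the longer list.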

Citation

If you find InternVL-U useful, please cite our technical report.



🙏 Acknowledgement

We sincerely thank the contributors of the following open-source projects for their valuable code, models, and datasets. InternVL-U is built upon and inspired by these outstanding works:

InternVL3.5, BAGEL, Qwen2.5-VL, Qwen3-VL, Qwen-Image, BLIP3-o, OpenGPT-4o-Image, ShareGPT-4o-Image, OmniGen2, UniWorld-V1, PicoBanana, Nano-Consist, and NHR-Edit, among others.
