InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing

arXiv  Hugging Face  GenEditEvalKit  TextEdit Benchmark

Shanghai AI Laboratory, InternVL-U Team

Welcome to the official repository for the InternVL-U project! If you find our work helpful, please give us a ⭐.

🎉 News

  • [2026/03/06] 🔥 InternVL-U technical report released.
  • [2026/03/06] Inference code and model checkpoint released.
  • [2026/03/06] 🛠️ GenEditEvalKit released — a unified evaluation toolkit for multimodal image generation and editing models, designed to help developers efficiently manage inference and evaluation across the many benchmarks for unified multimodal models (UMMs) and image generation and editing models. Check it out on [GitHub].
  • [2026/03/06] 📝 TextEdit Benchmark released — a high-quality, multi-scenario benchmark for evaluating the text-editing capabilities of image generation models. Try it out and see how well your model performs on challenging text-editing tasks. Check it out on [GitHub].

📖 Introduction

InternVL-U is a 4B-parameter unified multimodal model (UMM) that brings multimodal understanding, reasoning, image generation, and image editing into a single framework, aiming to democratize omni-capable multimodal intelligence at an efficient, practical model size.

Key Highlights

  • Unified yet modular design: built on unified contextual modeling with modality-specific modularity and decoupled visual representations.
  • Strong backbone + strong generator: integrates a state-of-the-art MLLM with a specialized MMDiT-based visual generation head.
  • High-quality data synthesis: a comprehensive data pipeline for high-semantic-density tasks (e.g., text rendering and editing, scientific reasoning) using Chain-of-Thought (CoT) to align abstract intent with precise visual execution.
  • Performance–efficiency win: at a limited parameter scale, InternVL-U outperforms open-source UMM baselines in generation and editing, while retaining strong multimodal understanding and reasoning.

We hope InternVL-U serves as a strong baseline and accelerates progress toward comprehensive, AGI-oriented omni-capable UMMs.

⚡Quick Start

Before getting started, make sure you have installed all required dependencies.

pip install -r requirements.txt
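If the demos below fail with import errors, a quick sanity check can pinpoint what is missing. This snippet is not part of the official tooling; the package names are assumed from the imports used in the demos below.

```python
import importlib.util

def check_deps(packages=("torch", "PIL", "internvlu")):
    """Return the required packages that cannot be found by this interpreter."""
    return [pkg for pkg in packages if importlib.util.find_spec(pkg) is None]

missing = check_deps()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All demo dependencies found.")
```

Using `importlib.util.find_spec` avoids actually importing heavy packages such as torch just to verify they are installed.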

The model checkpoint is available on Hugging Face.

Inference Demo

We provide the following demos to showcase InternVL-U’s unified pipeline for multimodal understanding, image generation, and image editing, with optional reasoning-guided (text+image) outputs.

Generate Text
import torch
from PIL import Image
from internvlu import InternVLUPipeline

prompt = "What is the amino acid shown in the picture?"
image = Image.open("assets/amino_acid.png").convert("RGB")

pipeline = InternVLUPipeline.from_pretrained(
    "/path/to/internvl-u-checkpoint",
    torch_dtype=torch.bfloat16,
)

pipeline.to("cuda")

tokenizer = pipeline.processor.tokenizer

with torch.no_grad():
    output = pipeline(
        prompt=prompt,
        image=image,
        max_new_tokens=1024,
        generation_mode="text",
    ).generate_output[0]

print(tokenizer.decode(output, skip_special_tokens=True))
Generate Image
import torch
from internvlu import InternVLUPipeline

prompt = """In the deep indigo night sky, a grand fireworks festival is at its peak, with countless dazzling sparks arranged precisely, condensing into the huge and dazzling "InternVL-U" letters. The letters are composed of highly saturated electric blue and dreamy purple fluorescent particles, presenting a futuristic streamlined font surrounded by scattered golden fragments resembling stardust, and the final "U" gives off a fluid metallic texture. Below is a brightly lit modern city, with the shimmering sea perfectly reflecting this stunning scene. Amidst the swirling smoke, it showcases the ultimate visual allure of technology and romance intertwined."""

pipeline = InternVLUPipeline.from_pretrained(
    "/path/to/internvl-u-checkpoint",
    torch_dtype=torch.bfloat16,
)

pipeline.to("cuda")

with torch.no_grad():
    image = pipeline(
        prompt=prompt,
        generation_mode="image",
        height=576,
        width=1024,
        generator=torch.Generator(device="cuda").manual_seed(42)
    ).images[0]

image.save("example_t2i.png")
Generate Image with Reasoning Guidance
import torch
from internvlu import InternVLUPipeline

# Chinese prompt: "Generate an image showing hope growing, full of a sense of strength."
prompt = """生成一张展现希望正在生长、充满力量感的画面。"""

pipeline = InternVLUPipeline.from_pretrained(
    "/path/to/internvl-u-checkpoint",
    torch_dtype=torch.bfloat16,
)

pipeline.to("cuda")

tokenizer = pipeline.processor.tokenizer

with torch.no_grad():
    output = pipeline(
        prompt=prompt,
        generation_mode="text_image",
        generator=torch.Generator(device="cuda").manual_seed(42)
    )
    output_text = output.generate_output[0]
    image = output.images[0]

print(tokenizer.decode(output_text, skip_special_tokens=True))
image.save("example_guided_t2i.png")
Edit Image
import torch
from PIL import Image
from internvlu import InternVLUPipeline

# Chinese prompt (summary): place the scholar in a warm indoor scene full of Spring
# Festival atmosphere — lanterns, paper-cut window decorations, red-and-gold ribbons,
# soft orange lighting — holding a red envelope with an irrepressible, joyful smile.
prompt = """将书生置身于一个充满春节氛围的温暖室内空间中,房间内张灯结彩,窗外远处有灯笼,窗边装饰着精致的剪纸,红金配色的彩带在室内轻轻垂落。背景为柔和温暖的橙黄色灯光和柔软的沙发,营造出温馨而治愈的春节夜晚氛围。书生双手捧着一个红包,红包上有金色的边框,脸上洋溢着抑制不住的开心笑容,浓眉大眼闪闪发亮,流露出惊喜与幸福的神情。他身体微微前倾,肩膀自然放松,仿佛忍不住想要展示这份喜悦。整个场景充满温馨、喜庆与治愈感,在柔和的暮色与节日装饰映衬下,展现出春节特有的欢乐与人情味。"""
input_image = Image.open("assets/intern.png").convert("RGB")

pipeline = InternVLUPipeline.from_pretrained(
    "/path/to/internvl-u-checkpoint",
    torch_dtype=torch.bfloat16,
)

pipeline.to("cuda")

with torch.no_grad():
    image = pipeline(
        prompt=prompt,
        image=input_image,
        generation_mode="image",
        height=input_image.size[1],
        width=input_image.size[0],
        generator=torch.Generator(device="cuda").manual_seed(42)
    ).images[0]

image.save("example_edit.png")
Edit Image with Reasoning Guidance
import torch
from PIL import Image
from internvlu import InternVLUPipeline

prompt = """Convert into the style of a Valentine's Festival picture, suitable for use as a profile picture on social media."""
input_image = Image.open("assets/lines_puppy.png").convert("RGB")

pipeline = InternVLUPipeline.from_pretrained(
    "/path/to/internvl-u-checkpoint",
    torch_dtype=torch.bfloat16,
)

pipeline.to("cuda")

tokenizer = pipeline.processor.tokenizer

with torch.no_grad():
    output = pipeline(
        prompt=prompt,
        image=input_image,
        num_beams=5,
        generation_mode="text_image",
        generator=torch.Generator(device="cuda").manual_seed(42),
    )
    output_text = output.generate_output[0]
    image = output.images[0]

print(tokenizer.decode(output_text, skip_special_tokens=True))
image.save("example_guided_edit.png")
Batch Inference
import torch
from PIL import Image
from internvlu import InternVLUPipeline

prompts = [
    "Segment the little boy",
    "Provide the bounding box for the little boy"
]
input_images = [Image.open("assets/panda_and_boy.png").convert("RGB")] * 2


pipeline = InternVLUPipeline.from_pretrained(
    "/path/to/internvl-u-checkpoint",
    torch_dtype=torch.bfloat16,
)

pipeline.to("cuda")

with torch.no_grad():
    output_images = pipeline(
        prompt=prompts,
        image=input_images,
        generation_mode="image",
        height=input_images[0].size[1],
        width=input_images[0].size[0],
        generator=torch.Generator(device="cuda").manual_seed(42)
    ).images

for idx, image in enumerate(output_images):
    image.save(f"example_edit_{idx}.png")
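In the batch demo above, prompts and images are matched positionally, so the two lists must have the same length. A tiny guard like the following fails fast on mismatched batches; it is a sketch for convenience, not part of the InternVL-U API.

```python
def pair_inputs(prompts, images):
    """Zip prompts with their input images, refusing silently mismatched batches."""
    if len(prompts) != len(images):
        raise ValueError(
            f"Got {len(prompts)} prompts but {len(images)} images; "
            "batch inputs are matched by position."
        )
    return list(zip(prompts, images))
```

Validating up front is preferable to `zip`'s default behavior, which would silently drop the trailing items of the longer list.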

Citation

If you find InternVL-U useful, please cite our technical report.



🙏 Acknowledgement

We sincerely thank the contributors of the following open-source projects for their valuable code, models, and datasets. InternVL-U is built upon and inspired by these outstanding works:

InternVL3.5, BAGEL, Qwen2.5-VL, Qwen3-VL, Qwen-Image, BLIP3-o, OpenGPT-4o-Image, ShareGPT-4o-Image, OmniGen2, UniWorld-V1, PicoBanana, Nano-Consist, and NHR-Edit, among others.
