VIGA: Vision-as-Inverse-Graphics Agent
via Interleaved Multimodal Reasoning

An agent that thinks with a graphics engine

Shaofeng Yin¹,⁴, Jiaxin Ge¹, Zora Zhiruo Wang², Xiuyu Li¹,
Michael J. Black³, Trevor Darrell¹, Angjoo Kanazawa¹, Haiwen Feng¹,³,⁴
¹University of California, Berkeley; ²Carnegie Mellon University; ³Max Planck Institute for Intelligent Systems; ⁴Impossible Inc.

TL;DR: We introduce a multimodal agent that reconstructs images as editable scene programs through an analysis-by-synthesis loop, employing interleaved multimodal reasoning and an evolving contextual memory to faithfully recover the scene, its physics, and various interactions.


Given any input image, VIGA reconstructs the 3D scene in Blender through an analysis-by-synthesis process. A generation agent proposes executable code that modifies the scene, then a verification agent observes the updated scene and produces feedback for refinement.
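The generate, render, and verify cycle described above can be sketched in a few lines. This is a minimal illustration, not VIGA's implementation: the `Memory` class and the `generate`, `render`, and `verify` callables are hypothetical names, and the toy stand-ins below replace the Blender scene program with a single integer so that the loop can run anywhere.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple


@dataclass
class Memory:
    """Evolving context: every (code, feedback) pair produced so far (hypothetical)."""
    history: List[Tuple[str, str]] = field(default_factory=list)


def analysis_by_synthesis(
    target: Any,
    generate: Callable[[Any, Memory], str],          # proposes scene code
    render: Callable[[str], Any],                    # executes code, returns an observation
    verify: Callable[[Any, Any], Tuple[bool, str]],  # compares observation with target
    max_steps: int = 10,
) -> Tuple[str, Memory]:
    memory = Memory()
    code = ""
    for _ in range(max_steps):
        code = generate(target, memory)              # generation agent
        observation = render(code)                   # graphics engine stand-in
        done, feedback = verify(target, observation) # verification agent
        memory.history.append((code, feedback))      # feedback drives the next proposal
        if done:
            break
    return code, memory


# Toy stand-ins (assumptions for illustration): the "scene" is one integer.
def toy_generate(target: int, memory: Memory) -> str:
    if not memory.history:
        return "0"
    last_code, feedback = memory.history[-1]
    step = 1 if feedback == "increase" else -1
    return str(int(last_code) + step)


def toy_render(code: str) -> int:
    return int(code)


def toy_verify(target: int, observation: int) -> Tuple[bool, str]:
    if observation == target:
        return True, "match"
    return False, "increase" if observation < target else "decrease"


final_code, memory = analysis_by_synthesis(5, toy_generate, toy_render, toy_verify)
print(final_code, len(memory.history))
```

In the real system the generator and verifier would be multimodal model calls and `render` would run Blender; the control flow is the only part this sketch tries to capture.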

Vibe-code a Physical Scene

VIGA iteratively synthesizes an executable 3D scene program through interleaved multimodal reasoning. Beyond 3D reconstruction, users can prompt the physical interactions they want, and VIGA augments the scene code to realize effects such as collision and fragmentation. VIGA can build scenes from basic geometric primitives in Blender, and it can also drive state-of-the-art generative models such as SAM-3D to produce high-quality assets (* marks scenes whose assets are generated by SAM-3D).

Art*

"Throw a ball to knock over all the objects."

Panels: Input, Static, Dynamic

Mirror

"Throw a ball to break the mirror."

Panels: Input, Static, Dynamic

Earthquake*

"Objects on the table shake and fall off."

Panels: Input, Static, Dynamic

Golden Gate Bridge

Panels: Input, Reconstruction

Restroom

Panels: Input, Reconstruction

Kitchen

Panels: Input, Reconstruction

Blue Room

Panels: Input, Reconstruction

Bathroom

Panels: Input, Reconstruction

Billiards

Panels: Input, Reconstruction

Fine-grained Visual Refinement

See how VIGA performs fine-grained visual refinement through an example of placing a basketball on a table.

BlenderBench

We introduce BlenderBench, a benchmark of 30 challenging tasks for evaluating 3D scene reconstruction. Across the four base models tested, VIGA achieves an average improvement of +80.24%.

BlenderBench consists of 30 challenging tasks covering camera adjustment (Task 1), multi-step editing (Task 2), and compositional editing (Task 3). We evaluate reconstruction fidelity and agentic behaviors, including multi-view spatial consistency, discrepancy localization, and sustained multimodal tool use. PL = Photometric Loss (lower is better); VLM = VLM Score (higher is better); Impr. = Improvement.

BlenderBench Overview
| Method | Task 1 (PL ↓) | Task 1 (VLM ↑) | Task 2 (PL ↓) | Task 2 (VLM ↑) | Task 3 (PL ↓) | Task 3 (VLM ↑) | Impr. ↑ |
|---|---|---|---|---|---|---|---|
| GPT-4o | 48.16 | 0.58 | 7.36 | 2.75 | 30.14 | 0.25 | - |
| VIGA (GPT-4o) | 8.56 ↓82% | 1.44 ↑148% | 5.11 ↓31% | 3.58 ↑30% | 14.51 ↓52% | 1.53 ↑512% | 114% |
| Claude-Sonnet-4 | 20.26 | 1.36 | 4.47 | 2.75 | 26.66 | 1.44 | - |
| VIGA (Claude-Sonnet-4) | 9.42 ↓54% | 2.47 ↑82% | 1.34 ↓70% | 3.67 ↑34% | 8.98 ↓66% | 1.80 ↑25% | 53% |
| Gemini-2.5-Pro | 40.65 | 1.75 | 10.14 | 2.78 | 31.66 | 0.64 | - |
| VIGA (Gemini-2.5-Pro) | 17.77 ↓56% | 2.33 ↑33% | 1.99 ↓80% | 3.75 ↑35% | 27.41 ↓13% | 0.66 ↑3% | 41% |
| Qwen3-VL-8B | 60.82 | 0.28 | 33.14 | 1.61 | 22.81 | 1.25 | - |
| VIGA (Qwen3-VL-8B) | 10.54 ↓83% | 1.31 ↑368% | 3.85 ↓88% | 3.33 ↑107% | 11.35 ↓50% | 2.25 ↑80% | 113% |
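The per-cell percentages in the table are relative changes against the corresponding base model. The short sketch below reproduces two of them and the headline average; the helper name `relative_improvement` is ours. The page does not state how each row's Impr. value is aggregated from its six cells, so that step is not reproduced here.

```python
def relative_improvement(baseline: float, value: float, lower_is_better: bool) -> float:
    """Percentage change relative to `baseline`, positive when `value` is better."""
    if lower_is_better:
        change = (baseline - value) / baseline
    else:
        change = (value - baseline) / baseline
    return 100.0 * change


# GPT-4o vs. VIGA (GPT-4o) on Task 1, values copied from the table.
pl_gain = relative_improvement(48.16, 8.56, lower_is_better=True)    # PL: lower is better
vlm_gain = relative_improvement(0.58, 1.44, lower_is_better=False)   # VLM: higher is better
print(round(pl_gain), round(vlm_gain))  # 82 148

# Mean of the (already rounded) Impr. column over the four base models.
impr_column = [114, 53, 41, 113]
mean_impr = sum(impr_column) / len(impr_column)
print(mean_impr)  # 80.25
```

The mean of the rounded Impr. column, 80.25, agrees with the reported +80.24% up to rounding of the per-model values.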

BlenderGym

VIGA also achieves significant improvements on existing benchmarks, demonstrating strong generalization capability.

BlenderGym evaluates single-step graphics editing across five categories. VIGA improves base models by +32.65% on average. N-CLIP = Negative CLIP Scores (lower is better).

| Method | Blend Shape ↓ | Placement ↓ | Geometry ↓ | Lighting ↓ | Material ↓ | Impr. ↑ |
|---|---|---|---|---|---|---|
| GPT-4o | 19.96 | 29.31 | 24.91 | 2.17 | 15.25 | - |
| VIGA (GPT-4o) | 16.70 ↓16% | 26.87 ↓8% | 15.76 ↓37% | 1.41 ↓35% | 9.39 ↓38% | 27% |
| Claude-Sonnet-4 | 19.92 | 42.55 | 32.52 | 2.26 | 18.21 | - |
| VIGA (Claude-Sonnet-4) | 16.16 ↓19% | 32.66 ↓23% | 21.23 ↓35% | 1.24 ↓45% | 9.46 ↓48% | 34% |
| Gemini-2.5-Pro | 17.62 | 37.83 | 39.97 | 2.42 | 15.33 | - |
| VIGA (Gemini-2.5-Pro) | 11.73 ↓33% | 21.11 ↓44% | 9.71 ↓76% | 1.15 ↓53% | 4.96 ↓68% | 47% |
| Qwen3-VL-8B | 21.78 | 41.44 | 14.02 | 12.30 | 17.16 | - |
| VIGA (Qwen3-VL-8B) | 27.07 | 28.12 ↓32% | 12.24 ↓13% | 2.58 ↓79% | 11.53 ↓33% | 23% |

Trajectory Visualization

[Figure: trajectory visualization]

* As shown in the last row, VIGA also generalizes to other programmatic content creation tasks, such as 2D slide design.

BibTeX

If you find our work useful, please consider citing:

@misc{yin2026visionasinversegraphicsagentinterleavedmultimodal,
      title={Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning}, 
      author={Shaofeng Yin and Jiaxin Ge and Zora Zhiruo Wang and Xiuyu Li and Michael J. Black and Trevor Darrell and Angjoo Kanazawa and Haiwen Feng},
      year={2026},
      eprint={2601.11109},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2601.11109}, 
}