Vision-as-Inverse-Graphics Agent
via Interleaved Multimodal Reasoning

An agent that thinks with a graphics engine like Blender

Shaofeng Yin1,4 Jiaxin Ge1 Zora Zhiruo Wang2 Xiuyu Li1
Michael J. Black3 Trevor Darrell1 Angjoo Kanazawa1 Haiwen Feng1,3,4,†
1University of California, Berkeley 2Carnegie Mellon University 3Max Planck Institute for Intelligent Systems 4Impossible Inc.

TL;DR: We introduce VIGA (Vision-as-Inverse-Graphics Agent), a multimodal agent that reconstructs images as editable scene programs through an analysis-by-synthesis loop, employing interleaved multimodal reasoning and an evolving contextual memory to "vibe code" the scene, its physics, and interactions.

Given any input image, VIGA reconstructs the 3D scene in Blender through interleaved multimodal reasoning. It alternates between generator and verifier roles, proposing executable code that modifies the scene, then observing the updated scene to produce feedback for refinement.
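The generate-execute-verify loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not VIGA's implementation: `generate`, `execute`, and `verify` are hypothetical callables standing in for the generator role, running the code in a graphics engine, and the verifier role, and the toy example replaces the Blender scene with a single number so the loop can run anywhere.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Memory:
    """Evolving context: every (code, feedback) pair produced so far."""
    history: List[Tuple[str, str]] = field(default_factory=list)


def analysis_by_synthesis(
    generate: Callable[[Memory], str],            # hypothetical generator role
    execute: Callable[[str], float],              # hypothetical: run code, observe result
    verify: Callable[[float], Tuple[bool, str]],  # hypothetical verifier role
    max_steps: int = 10,
) -> Tuple[str, Memory]:
    memory = Memory()
    code = ""
    for _ in range(max_steps):
        code = generate(memory)               # propose executable code
        observation = execute(code)           # apply it and observe the updated state
        done, feedback = verify(observation)  # compare against the target
        memory.history.append((code, feedback))
        if done:
            break
    return code, memory


# Toy stand-ins (assumptions for illustration): the "scene" is one number.
TARGET = 3.0


def toy_generate(memory: Memory) -> str:
    if not memory.history:
        return "x = 0.0"
    last_code, feedback = memory.history[-1]
    x = float(last_code.split("=")[1])
    step = 1.0 if feedback == "increase" else -1.0
    return f"x = {x + step}"


def toy_execute(code: str) -> float:
    return float(code.split("=")[1])


def toy_verify(observation: float) -> Tuple[bool, str]:
    if abs(observation - TARGET) < 1e-9:
        return True, "match"
    return False, ("increase" if observation < TARGET else "decrease")


final_code, memory = analysis_by_synthesis(toy_generate, toy_execute, toy_verify)
```

In this toy run the generator only reads the most recent feedback; a real agent would presumably condition on the full history and on rendered images rather than a scalar.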

Vibe-code a Physical Scene with Interactions

VIGA can build a scene from scratch by creating 3D assets from basic geometric primitives in Blender (e.g., parametric cuboids and spheroids). It can also use off-the-shelf 3D asset generation tools such as Meshy or SAM-3D to obtain high-quality assets, which provide a better starting point for building a physical scene with interactions. Results marked with * use assets generated by SAM-3D.
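As a rough picture of what a primitive-based scene program might look like, the sketch below emits a short Blender Python script from a list of primitive descriptions. The helper `emit_scene_program`, the dictionary layout, and all object names and dimensions are hypothetical and chosen only for illustration. The emitted script uses the Blender operators `bpy.ops.mesh.primitive_cube_add` and `bpy.ops.mesh.primitive_uv_sphere_add` and would need to be run inside Blender; the generator itself runs in plain Python.

```python
def emit_scene_program(primitives):
    """Return Blender Python source that creates the given primitives.

    Each primitive is a dict with hypothetical keys:
    kind ("cuboid" or "spheroid"), name, location (x, y, z), scale (x, y, z).
    """
    lines = ["import bpy", ""]
    for p in primitives:
        location = tuple(p["location"])
        scale = tuple(p["scale"])
        if p["kind"] == "cuboid":
            # Unit cube; the per-axis scale turns it into a cuboid.
            lines.append(
                f"bpy.ops.mesh.primitive_cube_add(size=1.0, location={location})"
            )
        elif p["kind"] == "spheroid":
            # Unit-radius sphere; the per-axis scale turns it into a spheroid.
            lines.append(
                f"bpy.ops.mesh.primitive_uv_sphere_add(radius=1.0, location={location})"
            )
        else:
            raise ValueError(f"unsupported primitive kind: {p['kind']!r}")
        # The newly added object is the active object in Blender.
        lines.append(f"bpy.context.object.scale = {scale}")
        lines.append(f"bpy.context.object.name = {p['name']!r}")
    return "\n".join(lines)


# Illustrative values only: a table top with a ball resting on it.
program = emit_scene_program([
    {"kind": "cuboid", "name": "TableTop", "location": (0, 0, 1.0), "scale": (2.0, 1.0, 0.1)},
    {"kind": "spheroid", "name": "Ball", "location": (0, 0, 1.2), "scale": (0.15, 0.15, 0.15)},
])
```

Keeping the scene as generated source text is what makes it editable: a later refinement step can change a location or scale by rewriting a single line.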

Art: input image, "Reconstruct the scene"*, "Throw a ball to knock over all the objects"
Mirror: input image, "Reconstruct the scene", "Throw a ball to break the mirror"
Earthquake: input image, "Reconstruct the scene"*, "Simulate an earthquake"

Result Gallery

[Gallery: input image and reconstruction render for each of Golden Gate Bridge, Restroom, Kitchen, Blue Room, and Bathroom]


Fine-grained Visual Refinement

Method Overview

See how VIGA performs fine-grained visual refinement through an example of placing a basketball on a table.

BlenderBench

We introduce BlenderBench, a comprehensive benchmark for evaluating 3D scene reconstruction. VIGA achieves an average improvement of +124.70% on BlenderBench.

BlenderBench consists of 30 challenging tasks covering camera adjustment (Task 1), multi-step editing (Task 2), and compositional editing (Task 3). We compare VIGA against a one-shot baseline and BlenderAlchemy, a memory-less variant of VIGA. We report PL = Photometric Loss (lower is better), VLM = VLM Score (higher is better), and Impr. = Improvement (calculated across all the metrics in the original paper, not just the ones shown here).
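This page does not define the photometric loss. A common choice for such a metric is the mean squared per-pixel difference between the rendered image and the target image, and the sketch below assumes that form. The function name, the nested-list grayscale image representation, and the absence of any scaling are all assumptions, so its values are not comparable to the PL numbers reported here.

```python
def photometric_loss(render, target):
    """Mean squared per-pixel difference between two equal-sized grayscale images.

    Images are nested lists of numbers (rows of pixel intensities).
    This is an assumed definition, not necessarily the one used by BlenderBench.
    """
    same_rows = len(render) == len(target)
    if not same_rows or any(len(r) != len(t) for r, t in zip(render, target)):
        raise ValueError("images must have the same shape")
    squared_diffs = [
        (a - b) ** 2
        for render_row, target_row in zip(render, target)
        for a, b in zip(render_row, target_row)
    ]
    return sum(squared_diffs) / len(squared_diffs)


# Tiny 2x2 example: one pixel differs by 2, so the loss is 4 / 4.
target_image = [[0.0, 0.0], [0.0, 0.0]]
rendered_image = [[0.0, 2.0], [0.0, 0.0]]
loss = photometric_loss(rendered_image, target_image)
```

A lower value means the render is closer to the target pixel by pixel, which matches the "lower is better" direction of the PL columns.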

BlenderBench Overview

Method                      Task 1 PL ↓  Task 1 VLM ↑  Task 2 PL ↓  Task 2 VLM ↑  Task 3 PL ↓  Task 3 VLM ↑  Impr. ↑ (%)
GPT-4o (One-Shot)           48.16        0.58          7.36         2.75          30.14        0.25          -
BlenderAlchemy (best-of-1)  10.62        1.75          6.13         3.08          19.60        0.67          67.56
VIGA-Ours (best-of-1)       8.56         1.44          5.11         3.58          14.51        1.53          113.96
BlenderAlchemy (best-of-4)  14.50        1.75          1.95         3.53          20.62        0.56          77.48
VIGA-Ours (best-of-4)       5.47         3.25          2.94         3.83          12.62        1.61          159.19

Qwen3-VL-8B (One-Shot)      60.82        0.28          33.14        1.61          22.81        1.25          -
BlenderAlchemy (best-of-1)  25.61        0.36          10.69        2.64          30.45        0.64          27.36
VIGA-Ours (best-of-1)       10.54        1.31          3.85         3.33          11.35        2.25          112.79
BlenderAlchemy (best-of-4)  11.57        1.19          5.38         2.94          23.62        0.93          82.24
VIGA-Ours (best-of-4)       8.80         1.38          5.02         3.08          9.08         2.02          112.87
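The headline +124.70% figure is consistent with the mean of the four VIGA rows in the Impr. column (two base models, each at best-of-1 and best-of-4). That reading is inferred from the numbers in the table rather than stated on the page; the short check below only reproduces the arithmetic.

```python
# Impr. (%) values copied from the VIGA-Ours rows of the BlenderBench table.
viga_impr = {
    ("GPT-4o", "best-of-1"): 113.96,
    ("GPT-4o", "best-of-4"): 159.19,
    ("Qwen3-VL-8B", "best-of-1"): 112.79,
    ("Qwen3-VL-8B", "best-of-4"): 112.87,
}

# Unweighted mean over the four settings (assumed aggregation).
average_impr = sum(viga_impr.values()) / len(viga_impr)
print(f"{average_impr:.2f}")  # prints 124.70
```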

BlenderGym

VIGA also achieves an average improvement of +35.32% on an existing benchmark, BlenderGym, demonstrating strong generalization across diverse graphics editing tasks.

BlenderGym evaluates single-step graphics editing across five categories. We report N-CLIP = Negative CLIP Score (lower is better) and Impr. = Improvement (calculated across all the metrics in the original paper, not just the one shown here).

Method                      Blend Shape ↓  Placement ↓  Geometry ↓  Lighting ↓  Material ↓  Impr. ↑
GPT-4o (One-Shot)           19.96          29.31        24.91       2.17        15.25       -
BlenderAlchemy (best-of-1)  20.19          28.07        20.61       1.82        12.05       12.33
VIGA-Ours (best-of-1)       16.70          26.87        15.76       1.41        9.39        26.88
BlenderAlchemy (best-of-4)  17.19          25.58        11.68       1.23        12.20       27.63
VIGA-Ours (best-of-4)       13.08          17.48        7.36        1.01        7.63        48.86

Qwen3-VL-8B (One-Shot)      21.78          41.44        14.02       12.30       17.16       -
BlenderAlchemy (best-of-1)  31.50          39.47        14.47       6.44        16.98       5.81
VIGA-Ours (best-of-1)       27.07          28.12        12.24       2.58        11.53       22.64
BlenderAlchemy (best-of-4)  19.60          36.34        12.62       4.23        17.12       23.23
VIGA-Ours (best-of-4)       17.97          24.28        10.08       2.53        9.30        42.88
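To read an individual column, one option is the relative reduction of a lower-is-better score against the one-shot baseline. The sketch below applies it to the GPT-4o one-shot and VIGA best-of-4 rows. This is not the Impr. column, which is computed across all metrics in the original paper, and `relative_reduction` is a hypothetical helper introduced here for illustration.

```python
def relative_reduction(baseline: float, method: float) -> float:
    """Percentage by which a lower-is-better score drops relative to the baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return 100.0 * (baseline - method) / baseline


# N-CLIP values copied from the BlenderGym table (GPT-4o block).
one_shot = {"Blend Shape": 19.96, "Placement": 29.31, "Geometry": 24.91,
            "Lighting": 2.17, "Material": 15.25}
viga_best_of_4 = {"Blend Shape": 13.08, "Placement": 17.48, "Geometry": 7.36,
                  "Lighting": 1.01, "Material": 7.63}

reductions = {
    category: relative_reduction(one_shot[category], viga_best_of_4[category])
    for category in one_shot
}

for category, value in reductions.items():
    print(f"{category}: {value:.1f}% lower than one-shot")
```

On these rows every category shows a positive reduction, with Geometry showing the largest.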

Trajectory Visualization

[Figure: trajectory visualization]

* As shown in the last row, VIGA also generalizes to other programmatic content creation tasks such as 2D slide design.

BibTeX

If you find our work useful, please consider citing:

@misc{yin2026visionasinversegraphicsagentinterleavedmultimodal,
      title={Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning}, 
      author={Shaofeng Yin and Jiaxin Ge and Zora Zhiruo Wang and Xiuyu Li and Michael J. Black and Trevor Darrell and Angjoo Kanazawa and Haiwen Feng},
      year={2026},
      eprint={2601.11109},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2601.11109}, 
}