An agent that thinks with a graphics engine like Blender
TL;DR: We introduce VIGA (Vision-as-Inverse-Graphics Agent), a multimodal agent that reconstructs images as editable scene programs through an analysis-by-synthesis loop, employing interleaved multimodal reasoning and an evolving contextual memory to "vibe code" the scene, its physics, and interactions.
Given any input image, VIGA reconstructs the 3D scene in Blender through interleaved multimodal reasoning. It alternates between generator and verifier roles, proposing executable code that modifies the scene, then observing the updated scene to produce feedback for refinement.
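The generator/verifier alternation described above can be sketched as a simple loop. This is a hypothetical illustration, not VIGA's actual implementation: `render`, `propose_edit`, and `critique` stand in for the real Blender renderer, the code-proposing VLM, and the verifying VLM.

```python
# Hypothetical sketch of an analysis-by-synthesis loop with alternating
# generator and verifier roles. All three callables are stand-ins.

def analysis_by_synthesis(target, scene, render, propose_edit, critique, max_steps=5):
    """Alternate generator/verifier roles until the critique accepts the render."""
    feedback = None
    for _ in range(max_steps):
        code = propose_edit(scene, target, feedback)  # generator: emit a scene edit
        scene = code(scene)                           # execute the proposed edit
        rendering = render(scene)                     # observe the updated scene
        ok, feedback = critique(rendering, target)    # verifier: compare to target
        if ok:
            break
    return scene
```

In the real system the "scene" is a Blender file and the edit is executable code; here any state and edit function with the same shape will do.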
VIGA can build a scene from scratch by constructing 3D assets from basic geometric primitives (e.g., parametric cuboids and spheroids) in Blender. It can also leverage off-the-shelf 3D asset generation tools such as Meshy or SAM-3D to obtain high-quality assets that provide a better starting point for building up a physical scene with interactions (*: assets generated from SAM-3D).
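As a toy illustration of what "parametric primitive" means here: inside Blender one would call e.g. `bpy.ops.mesh.primitive_cube_add`, but the same idea can be shown self-contained by generating a cuboid's corner vertices from its dimensions.

```python
# Toy stand-in for building an asset from a parametric primitive.
# In Blender this would go through the bpy API; here we just generate
# the eight corner vertices of an axis-aligned cuboid.

from itertools import product

def cuboid_vertices(width, depth, height, center=(0.0, 0.0, 0.0)):
    """Return the 8 corners of a cuboid with the given dimensions."""
    cx, cy, cz = center
    hw, hd, hh = width / 2, depth / 2, height / 2
    return [(cx + sx * hw, cy + sy * hd, cz + sz * hh)
            for sx, sy, sz in product((-1, 1), repeat=3)]
```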
(Gallery: input images paired with VIGA's reconstructions.)
See how VIGA performs fine-grained visual refinement through an example of placing a basketball on a table.
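Fine-grained refinement of this kind can be viewed as local search: nudge the object's placement and keep a move only if it lowers the loss against the target view. The sketch below is an assumption-laden simplification with a scalar `loss`; VIGA's actual feedback comes from comparing rendered images.

```python
# Minimal sketch of fine-grained placement refinement as local search.
# `loss` is a toy scalar objective standing in for image-based feedback.

def refine_position(pos, loss, step=0.5, max_iters=20):
    """Greedily nudge pos up or down, halving the step when no move helps."""
    best = loss(pos)
    for _ in range(max_iters):
        improved = False
        for delta in (step, -step):
            cand = pos + delta
            if loss(cand) < best:
                pos, best, improved = cand, loss(cand), True
                break
        if not improved:
            step /= 2  # shrink the step for finer adjustment
    return pos
```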
We introduce BlenderBench, a comprehensive benchmark for evaluating 3D scene reconstruction with 30 challenging tasks. VIGA achieves a +124.70% average improvement on BlenderBench.
BlenderBench's 30 tasks cover camera adjustment (Task 1), multi-step editing (Task 2), and compositional editing (Task 3). We compare VIGA against a one-shot baseline and BlenderAlchemy, a memory-less variant of VIGA. We report PL = Photometric Loss (lower is better), VLM = VLM Score (higher is better), and Impr. = Improvement (computed across all metrics in the original paper, not only those shown here).
| Method | Task 1 (PL ↓) | Task 1 (VLM ↑) | Task 2 (PL ↓) | Task 2 (VLM ↑) | Task 3 (PL ↓) | Task 3 (VLM ↑) | Impr. ↑ (%) |
|---|---|---|---|---|---|---|---|
| GPT-4o (One-Shot) | 48.16 | 0.58 | 7.36 | 2.75 | 30.14 | 0.25 | - |
| BlenderAlchemy (best-of-1) | 10.62 | 1.75 | 6.13 | 3.08 | 19.60 | 0.67 | 67.56 |
| VIGA-Ours (best-of-1) | 8.56 | 1.44 | 5.11 | 3.58 | 14.51 | 1.53 | 113.96 |
| BlenderAlchemy (best-of-4) | 14.50 | 1.75 | 1.95 | 3.53 | 20.62 | 0.56 | 77.48 |
| VIGA-Ours (best-of-4) | 5.47 | 3.25 | 2.94 | 3.83 | 12.62 | 1.61 | 159.19 |
| Qwen3-VL-8B (One-Shot) | 60.82 | 0.28 | 33.14 | 1.61 | 22.81 | 1.25 | - |
| BlenderAlchemy (best-of-1) | 25.61 | 0.36 | 10.69 | 2.64 | 30.45 | 0.64 | 27.36 |
| VIGA-Ours (best-of-1) | 10.54 | 1.31 | 3.85 | 3.33 | 11.35 | 2.25 | 112.79 |
| BlenderAlchemy (best-of-4) | 11.57 | 1.19 | 5.38 | 2.94 | 23.62 | 0.93 | 82.24 |
| VIGA-Ours (best-of-4) | 8.80 | 1.38 | 5.02 | 3.08 | 9.08 | 2.02 | 112.87 |
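One plausible way an average improvement percentage like the Impr. column could be computed is to take, per metric, the relative change over the one-shot baseline in the beneficial direction, then average. The exact formula here is an assumption, not the paper's definition.

```python
# Hedged sketch: average relative improvement over a baseline across metrics.
# For lower-is-better metrics the gain is (base - x) / base; for
# higher-is-better metrics it is (x - base) / base.

def avg_improvement(baseline, method, lower_is_better):
    """baseline, method: dicts metric -> value; lower_is_better: metric -> bool."""
    rels = []
    for name, base in baseline.items():
        x = method[name]
        rel = (base - x) / base if lower_is_better[name] else (x - base) / base
        rels.append(rel)
    return 100.0 * sum(rels) / len(rels)
```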
VIGA also achieves a +35.32% improvement on an existing benchmark, BlenderGym, demonstrating strong generalization across diverse graphics editing tasks.
BlenderGym evaluates single-step graphics editing across five categories. We report N-CLIP = Negative CLIP Score (lower is better) and Impr. = Improvement (computed across all metrics in the original paper, not only the one shown here).
| Method | Blend Shape ↓ | Placement ↓ | Geometry ↓ | Lighting ↓ | Material ↓ | Impr. ↑ |
|---|---|---|---|---|---|---|
| GPT-4o (One-Shot) | 19.96 | 29.31 | 24.91 | 2.17 | 15.25 | - |
| BlenderAlchemy (best-of-1) | 20.19 | 28.07 | 20.61 | 1.82 | 12.05 | 12.33 |
| VIGA-Ours (best-of-1) | 16.70 | 26.87 | 15.76 | 1.41 | 9.39 | 26.88 |
| BlenderAlchemy (best-of-4) | 17.19 | 25.58 | 11.68 | 1.23 | 12.20 | 27.63 |
| VIGA-Ours (best-of-4) | 13.08 | 17.48 | 7.36 | 1.01 | 7.63 | 48.86 |
| Qwen3-VL-8B (One-Shot) | 21.78 | 41.44 | 14.02 | 12.30 | 17.16 | - |
| BlenderAlchemy (best-of-1) | 31.50 | 39.47 | 14.47 | 6.44 | 16.98 | 5.81 |
| VIGA-Ours (best-of-1) | 27.07 | 28.12 | 12.24 | 2.58 | 11.53 | 22.64 |
| BlenderAlchemy (best-of-4) | 19.60 | 36.34 | 12.62 | 4.23 | 17.12 | 23.23 |
| VIGA-Ours (best-of-4) | 17.97 | 24.28 | 10.08 | 2.53 | 9.30 | 42.88 |
* VIGA also generalizes to other programmatic content creation tasks such as 2D slide design.
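For intuition on the N-CLIP metric: assuming it is one minus the cosine similarity between two embedding vectors (the precise definition is BlenderGym's; this is an illustrative stand-in), it can be computed as follows.

```python
import math

# Toy "negative CLIP score"-style distance between two embedding vectors,
# under the stated assumption that it is 1 - cosine similarity.

def neg_clip_score(a, b):
    """Return 1 - cos(a, b) for two plain lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)
```

Identical embeddings give 0 (a perfect match), orthogonal ones give 1, which is why lower is better in the table above.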
If you find our work useful, please consider citing:
@misc{yin2026visionasinversegraphicsagentinterleavedmultimodal,
title={Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning},
author={Shaofeng Yin and Jiaxin Ge and Zora Zhiruo Wang and Xiuyu Li and Michael J. Black and Trevor Darrell and Angjoo Kanazawa and Haiwen Feng},
year={2026},
eprint={2601.11109},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2601.11109},
}