An agent that thinks with a graphics engine like Blender
TL;DR: We introduce VIGA (Vision-as-Inverse-Graphics Agent), a multimodal agent that reconstructs images as editable scene programs through an analysis-by-synthesis loop, employing interleaved multimodal reasoning and an evolving contextual memory to "vibe code" the scene, its physics, and interactions.
Given any input image, VIGA reconstructs the 3D scene in Blender through interleaved multimodal reasoning. It alternates between generator and verifier roles, proposing executable code that modifies the scene, then observing the updated scene to produce feedback for refinement.
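The generator/verifier alternation described above can be sketched as a simple loop. This is a hypothetical illustration, not VIGA's actual implementation: `render`, `propose_edit`, and `critique` stand in for the real Blender renderer, the code-proposing VLM, and the verifying VLM.

```python
# Hypothetical sketch of an analysis-by-synthesis loop with alternating
# generator and verifier roles. All three callables are stand-ins.

def analysis_by_synthesis(target, scene, render, propose_edit, critique, max_steps=5):
    """Alternate generator/verifier roles until the critique accepts the render."""
    feedback = None
    for _ in range(max_steps):
        code = propose_edit(scene, target, feedback)  # generator: emit a scene edit
        scene = code(scene)                           # execute the proposed edit
        rendering = render(scene)                     # observe the updated scene
        ok, feedback = critique(rendering, target)    # verifier: compare to target
        if ok:
            break
    return scene
```

In the real system the "scene" is a Blender file and the edit is executable code; here any state and edit function with the same shape will do.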
VIGA can build a scene from scratch by constructing 3D assets from basic geometric primitives (e.g., parametric cuboids and spheroids) in Blender. It can also leverage off-the-shelf 3D asset generation tools such as Meshy or SAM-3D to obtain high-quality assets that provide a better starting point for building up a physical scene with interactions (*: assets generated from SAM-3D).
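As a toy illustration of what "parametric primitive" means here: inside Blender one would call e.g. `bpy.ops.mesh.primitive_cube_add`, but the same idea can be shown self-contained by generating a cuboid's corner vertices from its dimensions.

```python
# Toy stand-in for building an asset from a parametric primitive.
# In Blender this would go through the bpy API; here we just generate
# the eight corner vertices of an axis-aligned cuboid.

from itertools import product

def cuboid_vertices(width, depth, height, center=(0.0, 0.0, 0.0)):
    """Return the 8 corners of a cuboid with the given dimensions."""
    cx, cy, cz = center
    hw, hd, hh = width / 2, depth / 2, height / 2
    return [(cx + sx * hw, cy + sy * hd, cz + sz * hh)
            for sx, sy, sz in product((-1, 1), repeat=3)]
```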
(Gallery: input images paired with VIGA's reconstructions.)
See how VIGA performs fine-grained visual refinement through an example of placing a basketball on a table.
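Fine-grained refinement of this kind can be viewed as local search: nudge the object's placement and keep a move only if it lowers the loss against the target view. The sketch below is an assumption-laden simplification with a scalar `loss`; VIGA's actual feedback comes from comparing rendered images.

```python
# Minimal sketch of fine-grained placement refinement as local search.
# `loss` is a toy scalar objective standing in for image-based feedback.

def refine_position(pos, loss, step=0.5, max_iters=20):
    """Greedily nudge pos up or down, halving the step when no move helps."""
    best = loss(pos)
    for _ in range(max_iters):
        improved = False
        for delta in (step, -step):
            cand = pos + delta
            if loss(cand) < best:
                pos, best, improved = cand, loss(cand), True
                break
        if not improved:
            step /= 2  # shrink the step for finer adjustment
    return pos
```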
We introduce BlenderBench, a comprehensive benchmark for evaluating 3D scene reconstruction with 30 challenging tasks. VIGA achieves a +124.70% average improvement on BlenderBench.
BlenderBench's 30 tasks cover camera adjustment (Task 1), multi-step editing (Task 2), and compositional editing (Task 3). We compare VIGA against a one-shot baseline and BlenderAlchemy, a memory-less variant of VIGA. We report PL = Photometric Loss (lower is better), VLM = VLM Score (higher is better), and Impr. = Improvement (computed across all metrics in the original paper, not only those shown here).
| Method | Task 1 (PL ↓) | Task 1 (VLM ↑) | Task 2 (PL ↓) | Task 2 (VLM ↑) | Task 3 (PL ↓) | Task 3 (VLM ↑) | Impr. ↑ (%) |
|---|---|---|---|---|---|---|---|
| GPT-4o (One-Shot) | 48.16 | 0.58 | 7.36 | 2.75 | 30.14 | 0.25 | - |
| BlenderAlchemy (best-of-1) | 10.62 | 1.75 | 6.13 | 3.08 | 19.60 | 0.67 | 67.56 |
| VIGA-Ours (best-of-1) | 8.56 | 1.44 | 5.11 | 3.58 | 14.51 | 1.53 | 113.96 |
| BlenderAlchemy (best-of-4) | 14.50 | 1.75 | 1.95 | 3.53 | 20.62 | 0.56 | 77.48 |
| VIGA-Ours (best-of-4) | 5.47 | 3.25 | 2.94 | 3.83 | 12.62 | 1.61 | 159.19 |
| Qwen3-VL-8B (One-Shot) | 60.82 | 0.28 | 33.14 | 1.61 | 22.81 | 1.25 | - |
| BlenderAlchemy (best-of-1) | 25.61 | 0.36 | 10.69 | 2.64 | 30.45 | 0.64 | 27.36 |
| VIGA-Ours (best-of-1) | 10.54 | 1.31 | 3.85 | 3.33 | 11.35 | 2.25 | 112.79 |
| BlenderAlchemy (best-of-4) | 11.57 | 1.19 | 5.38 | 2.94 | 23.62 | 0.93 | 82.24 |
| VIGA-Ours (best-of-4) | 8.80 | 1.38 | 5.02 | 3.08 | 9.08 | 2.02 | 112.87 |
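One plausible way an average improvement percentage like the Impr. column could be computed is to take, per metric, the relative change over the one-shot baseline in the beneficial direction, then average. The exact formula here is an assumption, not the paper's definition.

```python
# Hedged sketch: average relative improvement over a baseline across metrics.
# For lower-is-better metrics the gain is (base - x) / base; for
# higher-is-better metrics it is (x - base) / base.

def avg_improvement(baseline, method, lower_is_better):
    """baseline, method: dicts metric -> value; lower_is_better: metric -> bool."""
    rels = []
    for name, base in baseline.items():
        x = method[name]
        rel = (base - x) / base if lower_is_better[name] else (x - base) / base
        rels.append(rel)
    return 100.0 * sum(rels) / len(rels)
```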
VIGA also achieves a +35.32% improvement on an existing benchmark, BlenderGym, demonstrating strong generalization across diverse graphics editing tasks.
BlenderGym evaluates single-step graphics editing across five categories. We report N-CLIP = Negative CLIP Score (lower is better) and Impr. = Improvement (computed across all metrics in the original paper, not only the one shown here).
| Method | Blend Shape ↓ | Placement ↓ | Geometry ↓ | Lighting ↓ | Material ↓ | Impr. ↑ |
|---|---|---|---|---|---|---|
| GPT-4o (One-Shot) | 19.96 | 29.31 | 24.91 | 2.17 | 15.25 | - |
| BlenderAlchemy (best-of-1) | 20.19 | 28.07 | 20.61 | 1.82 | 12.05 | 12.33 |
| VIGA-Ours (best-of-1) | 16.70 | 26.87 | 15.76 | 1.41 | 9.39 | 26.88 |
| BlenderAlchemy (best-of-4) | 17.19 | 25.58 | 11.68 | 1.23 | 12.20 | 27.63 |
| VIGA-Ours (best-of-4) | 13.08 | 17.48 | 7.36 | 1.01 | 7.63 | 48.86 |
| Qwen3-VL-8B (One-Shot) | 21.78 | 41.44 | 14.02 | 12.30 | 17.16 | - |
| BlenderAlchemy (best-of-1) | 31.50 | 39.47 | 14.47 | 6.44 | 16.98 | 5.81 |
| VIGA-Ours (best-of-1) | 27.07 | 28.12 | 12.24 | 2.58 | 11.53 | 22.64 |
| BlenderAlchemy (best-of-4) | 19.60 | 36.34 | 12.62 | 4.23 | 17.12 | 23.23 |
| VIGA-Ours (best-of-4) | 17.97 | 24.28 | 10.08 | 2.53 | 9.30 | 42.88 |
* VIGA also generalizes to other programmatic content creation tasks such as 2D slide design.
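For intuition on the N-CLIP metric: assuming it is one minus the cosine similarity between two embedding vectors (the precise definition is BlenderGym's; this is an illustrative stand-in), it can be computed as follows.

```python
import math

# Toy "negative CLIP score"-style distance between two embedding vectors,
# under the stated assumption that it is 1 - cosine similarity.

def neg_clip_score(a, b):
    """Return 1 - cos(a, b) for two plain lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)
```

Identical embeddings give 0 (a perfect match), orthogonal ones give 1, which is why lower is better in the table above.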
If you find our work useful, please consider citing:
@misc{yin2026visionasinversegraphicsagentinterleavedmultimodal,
title={Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning},
author={Shaofeng Yin and Jiaxin Ge and Zora Zhiruo Wang and Xiuyu Li and Michael J. Black and Trevor Darrell and Angjoo Kanazawa and Haiwen Feng},
year={2026},
eprint={2601.11109},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2601.11109},
}