An agent that thinks with a graphics engine
TL;DR: We introduce a multimodal agent that reconstructs images as editable scene programs through an analysis-by-synthesis loop, employing interleaved multimodal reasoning and an evolving contextual memory to faithfully recover the scene, its physics, and various interactions.
Given any input image, VIGA reconstructs the 3D scene in Blender through an analysis-by-synthesis process. A generation agent proposes executable code that modifies the scene, then a verification agent observes the updated scene and produces feedback for refinement.
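As a minimal sketch of this loop (the structure is assumed from the description above; the function arguments are hypothetical stand-ins for the actual generation and verification agents, not the released API):

```python
def analysis_by_synthesis(target_image, generate, execute, render, verify, max_rounds=10):
    """Hypothetical skeleton of the generation/verification refinement loop."""
    scene_program = []   # executable Blender code accumulated so far
    memory = []          # evolving contextual memory shared across rounds
    for _ in range(max_rounds):
        # Generation agent: propose code that edits the current scene.
        edit = generate(target_image, scene_program, memory)
        execute(edit)                    # run the proposed edit inside Blender
        views = render()                 # observe the updated scene
        # Verification agent: compare the renders with the target and give feedback.
        feedback, faithful = verify(target_image, views)
        scene_program.append(edit)
        memory.append({"edit": edit, "feedback": feedback})
        if faithful:                     # stop once the verifier is satisfied
            break
    return "\n".join(scene_program)
```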
VIGA iteratively synthesizes an executable 3D scene program through interleaved multimodal reasoning. Beyond 3D reconstruction, users can prompt for the physical interactions they want, and VIGA augments the scene code to realize diverse effects such as collision, fragmentation, and more. VIGA can build scenes not only from basic geometric primitives in Blender, but also by controlling state-of-the-art generative models like SAM-3D to generate high-quality assets (*: the assets in the scene are generated by SAM-3D).
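For illustration only, a toy example of the kind of executable Blender scene program this amounts to; the primitives, placements, and physics settings below are hypothetical and not taken from the paper:

```python
import bpy

# Ground plane: a passive (static) rigid body that other objects collide with.
bpy.ops.mesh.primitive_plane_add(size=10, location=(0, 0, 0))
bpy.ops.rigidbody.object_add()
bpy.context.object.rigid_body.type = 'PASSIVE'

# A cube standing on the ground.
bpy.ops.mesh.primitive_cube_add(size=1, location=(0, 0, 0.5))
bpy.ops.rigidbody.object_add()
bpy.context.object.rigid_body.type = 'ACTIVE'

# A ball dropped from above and slightly to the side, so it strikes the cube.
bpy.ops.mesh.primitive_uv_sphere_add(radius=0.3, location=(0.3, 0, 4))
bpy.ops.rigidbody.object_add()
bpy.context.object.rigid_body.type = 'ACTIVE'

# Camera so the verifier can observe renders of the simulated result.
bpy.ops.object.camera_add(location=(6, -6, 4), rotation=(1.1, 0, 0.8))
bpy.context.scene.camera = bpy.context.object
bpy.context.scene.frame_end = 60  # let the physics simulation play out
```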
"Throw a ball to knock over all the objects."
Input
"Throw a ball to break the mirror."
Input
"Objects on the table shake and fall off."
Input
Gallery: input images and VIGA's corresponding reconstructions (six examples).
See how VIGA performs fine-grained visual refinement through an example of placing a basketball on a table.
We introduce BlenderBench, a comprehensive benchmark of 30 challenging tasks for evaluating 3D scene reconstruction. VIGA achieves a +80.24% average improvement over base models.
BlenderBench consists of 30 challenging tasks covering camera adjustment (Task 1), multi-step editing (Task 2), and compositional editing (Task 3). We evaluate reconstruction fidelity and agentic behaviors including multi-view spatial consistency, discrepancy localization, and sustained multimodal tool use. PL = Photometric Loss (lower is better); VLM = VLM Score (higher is better); Impr. = average relative improvement over the corresponding base model.
| Method | Task 1 (PL ↓) | Task 1 (VLM ↑) | Task 2 (PL ↓) | Task 2 (VLM ↑) | Task 3 (PL ↓) | Task 3 (VLM ↑) | Impr. ↑ |
|---|---|---|---|---|---|---|---|
| GPT-4o | 48.16 | 0.58 | 7.36 | 2.75 | 30.14 | 0.25 | - |
| VIGA (GPT-4o) | 8.56 ↓82% | 1.44 ↑148% | 5.11 ↓31% | 3.58 ↑30% | 14.51 ↓52% | 1.53 ↑512% | 114% |
| Claude-Sonnet-4 | 20.26 | 1.36 | 4.47 | 2.75 | 26.66 | 1.44 | - |
| VIGA (Claude-Sonnet-4) | 9.42 ↓54% | 2.47 ↑82% | 1.34 ↓70% | 3.67 ↑34% | 8.98 ↓66% | 1.80 ↑25% | 53% |
| Gemini-2.5-Pro | 40.65 | 1.75 | 10.14 | 2.78 | 31.66 | 0.64 | - |
| VIGA (Gemini-2.5-Pro) | 17.77 ↓56% | 2.33 ↑33% | 1.99 ↓80% | 3.75 ↑35% | 27.41 ↓13% | 0.66 ↑3% | 41% |
| Qwen3-VL-8B | 60.82 | 0.28 | 33.14 | 1.61 | 22.81 | 1.25 | - |
| VIGA (Qwen3-VL-8B) | 10.54 ↓83% | 1.31 ↑368% | 3.85 ↓88% | 3.33 ↑107% | 11.35 ↓50% | 2.25 ↑80% | 113% |
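As a rough illustration of how the table's relative numbers can be reproduced, assuming PL is a mean per-pixel error between a render and the target image (the paper's exact formulation may differ):

```python
import numpy as np

def photometric_loss(render: np.ndarray, target: np.ndarray) -> float:
    """Assumed form of PL: mean absolute per-pixel error (lower is better)."""
    return float(np.mean(np.abs(render.astype(np.float32) - target.astype(np.float32))))

def relative_improvement(base: float, ours: float, lower_is_better: bool = True) -> float:
    """Percent change of VIGA over its base model on one metric."""
    if lower_is_better:
        return 100.0 * (base - ours) / base
    return 100.0 * (ours - base) / base

# Example: GPT-4o vs. VIGA (GPT-4o) on Task 1 photometric loss (values from the table above).
print(relative_improvement(48.16, 8.56))  # ~82%, matching the ↓82% entry
```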
VIGA also achieves significant improvements on existing benchmarks, demonstrating strong generalization capability.
BlenderGym evaluates single-step graphics editing across five categories. VIGA improves base models by +32.65% on average. N-CLIP = Negative CLIP Scores (lower is better).
| Method | Blend Shape ↓ | Placement ↓ | Geometry ↓ | Lighting ↓ | Material ↓ | Impr. ↑ |
|---|---|---|---|---|---|---|
| GPT-4o | 19.96 | 29.31 | 24.91 | 2.17 | 15.25 | - |
| VIGA (GPT-4o) | 16.70 ↓16% | 26.87 ↓8% | 15.76 ↓37% | 1.41 ↓35% | 9.39 ↓38% | 27% |
| Claude-Sonnet-4 | 19.92 | 42.55 | 32.52 | 2.26 | 18.21 | - |
| VIGA (Claude-Sonnet-4) | 16.16 ↓19% | 32.66 ↓23% | 21.23 ↓35% | 1.24 ↓45% | 9.46 ↓48% | 34% |
| Gemini-2.5-Pro | 17.62 | 37.83 | 39.97 | 2.42 | 15.33 | - |
| VIGA (Gemini-2.5-Pro) | 11.73 ↓33% | 21.11 ↓44% | 9.71 ↓76% | 1.15 ↓53% | 4.96 ↓68% | 47% |
| Qwen3-VL-8B | 21.78 | 41.44 | 14.02 | 12.30 | 17.16 | - |
| VIGA (Qwen3-VL-8B) | 27.07 | 28.12 ↓32% | 12.24 ↓13% | 2.58 ↓79% | 11.53 ↓33% | 23% |
* VIGA also generalizes to other programmatic content creation tasks, such as 2D slide design.
If you find our work useful, please consider citing:
@misc{yin2026visionasinversegraphicsagentinterleavedmultimodal,
  title={Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning},
  author={Shaofeng Yin and Jiaxin Ge and Zora Zhiruo Wang and Xiuyu Li and Michael J. Black and Trevor Darrell and Angjoo Kanazawa and Haiwen Feng},
  year={2026},
  eprint={2601.11109},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2601.11109},
}