ToolVQA: A Dataset for Multi-step Reasoning VQA with External Tools

Peking University
ICCV 2025
[Teaser figure]

ToolVQA contains 23K instances across 7 diverse task domains, involving 10 tools with distinct functions, with an average inference length of 2.78 reasoning steps per instance.

Abstract

Integrating external tools into Large Foundation Models (LFMs) has emerged as a promising approach to enhance their problem-solving capabilities. While existing studies have demonstrated strong performance in tool-augmented Visual Question Answering (VQA), recent benchmarks reveal significant gaps in real-world tool-use proficiency, particularly in functionally diverse multimodal settings requiring multi-step reasoning. In this work, we introduce ToolVQA, a large-scale multimodal dataset comprising 23K samples, designed to bridge this gap. Unlike previous datasets that rely on synthetic scenarios and simplified queries, ToolVQA features real-world visual contexts and challenging implicit multi-step reasoning tasks, better aligning with real user interactions. To construct this dataset, we propose ToolEngine, a novel data generation pipeline that employs image-guided Depth-First Search (DFS) with a Longest Common Subsequence (LCS)-based example matching mechanism to simulate human-like tool-use reasoning. ToolVQA encompasses 10 multimodal tools across 7 diverse domains, with an average inference length of 2.78 reasoning steps per sample. The LLaVA-7B model fine-tuned on ToolVQA not only achieves impressive performance on the ToolVQA test set, but also surpasses the large closed-source model GPT-3.5-turbo on five out-of-distribution (OOD) datasets, showing strong generalizability in real-world tool-use scenarios.


Through studying real user logs, we find that real-world tool usage involves:
1. real-world scenarios: complex visual scenes with real-world context (e.g., real photographs rather than synthetic images);
2. real-world queries: challenging queries with an implicit multi-step reasoning process.

However, existing datasets often lack these two critical properties. Their scenarios are synthesized rather than photographed, yielding overly simplistic scenes that do not reflect real-world conditions, and their queries either require only single-step reasoning or explicitly hint at the reasoning process, such as "using the Cheap YouTube API tool".


To bridge these gaps, we introduce ToolEngine, a data construction pipeline for generating multi-step reasoning VQA data for tool usage. From a single image alone, it produces high-quality visual reasoning data. ToolEngine employs an image-guided Depth-First Search (DFS) with a Longest Common Subsequence (LCS)-based example matching mechanism to simulate human-like tool-use reasoning.
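As a rough illustration of the matching step, the sketch below scores candidate in-context trajectories by the longest common subsequence of their tool-name sequences and retrieves the closest one to guide the next DFS expansion. The trajectory pool, scoring details, and function names here are assumptions for illustration, not ToolEngine's exact implementation.

# Illustrative sketch only: LCS-based matching of tool-name sequences.
# The reference trajectory pool and the exact scoring used by ToolEngine
# are assumptions here.
from typing import List, Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def best_matching_example(partial: List[str], pool: List[List[str]]) -> List[str]:
    """Return the reference trajectory whose tool sequence overlaps most
    (by LCS) with the partially generated trajectory."""
    return max(pool, key=lambda ref: lcs_length(partial, ref))


# Toy usage: the retrieved example then guides the next DFS expansion.
pool = [
    ["ImageCaption", "OCR", "Calculator"],
    ["ObjectDetection", "KnowledgeSearch"],
]
print(best_matching_example(["OCR", "Calculator"], pool))
# -> ['ImageCaption', 'OCR', 'Calculator']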


The LLaVA-7B model fine-tuned on ToolVQA not only achieves impressive performance on the ToolVQA test set, but also surpasses the large closed-source model GPT-3.5-turbo on five out-of-distribution (OOD) datasets, showing strong generalizability in real-world tool-use scenarios.


ToolVQA contains high-quality multi-step visual reasoning instances, serving as both a benchmark and a training ground for developing more capable, generalizable tool-using agents.
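To make the instance format concrete, the following minimal sketch shows what a single ToolVQA instance could look like. All class and field names (ToolVQAInstance, ToolCall, trajectory, ...) and the example values are illustrative assumptions, not the released schema.

# Hypothetical sketch of a ToolVQA instance; not the official release schema.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ToolCall:
    tool: str                  # e.g. "OCR", "ImageCaption", "Calculator"
    arguments: Dict[str, str]  # tool-specific input arguments
    observation: str           # result returned by the tool


@dataclass
class ToolVQAInstance:
    image_path: str             # real-world photograph the query is grounded in
    question: str               # query with an implicit multi-step reasoning process
    trajectory: List[ToolCall]  # averages ~2.78 tool calls per instance
    answer: str                 # final short answer


# Hypothetical example, not taken from the released data.
example = ToolVQAInstance(
    image_path="images/receipt.jpg",
    question="How much would two of the most expensive items on this receipt cost?",
    trajectory=[
        ToolCall("OCR", {"image": "images/receipt.jpg"}, "Steak $32.50, Salad $9.00, ..."),
        ToolCall("Calculator", {"expression": "32.50 * 2"}, "65.0"),
    ],
    answer="$65.00",
)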

BibTeX

@misc{yin2025toolvqadatasetmultistepreasoning,
  title={ToolVQA: A Dataset for Multi-step Reasoning VQA with External Tools}, 
  author={Shaofeng Yin and Ting Lei and Yang Liu},
  year={2025},
  eprint={2508.03284},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2508.03284}, 
}