ToolVQA: A Dataset for Multi-step Reasoning VQA with External Tools

Peking University
ICCV 2025
Teaser figure: In a GPT-driven data generation pipeline, a fixed in-context example limits the diversity of the generated data; LCS matching can integrate multiple examples to enhance that diversity.
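
As a minimal sketch, the matching step might look like the following, assuming tool-use trajectories are represented as sequences of tool names. The function names (`lcs_length`, `match_examples`) and the example pool are illustrative assumptions, not ToolEngine's actual interface.

```python
from functools import lru_cache

def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two tool-call sequences."""
    @lru_cache(maxsize=None)
    def rec(i: int, j: int) -> int:
        if i == len(a) or j == len(b):
            return 0
        if a[i] == b[j]:
            return 1 + rec(i + 1, j + 1)
        return max(rec(i + 1, j), rec(i, j + 1))
    return rec(0, 0)

def match_examples(trajectory: list[str],
                   example_pool: list[list[str]],
                   top_k: int = 3) -> list[list[str]]:
    """Rank real tool-use examples by LCS overlap with the current trajectory."""
    scored = sorted(example_pool,
                    key=lambda ex: lcs_length(trajectory, ex),
                    reverse=True)
    return scored[:top_k]

# Example: a partial trajectory matched against logged tool sequences.
pool = [["OCR", "Calculator"],
        ["ImageCaption", "Search", "Calculator"],
        ["OCR", "Search", "Translator"]]
print(match_examples(["OCR", "Search"], pool, top_k=2))
```

Because the trajectory prefix changes at every search step, the top-ranked examples change with it, which is what lets the pipeline draw guidance from multiple logged examples rather than one fixed template.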

Abstract

Integrating external tools into Large Foundation Models (LFMs) has emerged as a promising approach to enhance their problem-solving capabilities. While existing studies have demonstrated strong performance in tool-augmented Visual Question Answering (VQA), recent benchmarks reveal significant gaps in real-world tool-use proficiency, particularly in functionally diverse multimodal settings requiring multi-step reasoning. In this work, we introduce ToolVQA, a large-scale multimodal dataset comprising 23K instances, designed to bridge this gap. Unlike previous datasets that rely on synthetic scenarios and simplified queries, ToolVQA features real-world visual contexts and challenging implicit multi-step reasoning tasks, better aligning with real user interactions. To construct this dataset, we propose ToolEngine, a novel data generation pipeline that employs Depth-First Search (DFS) with a dynamic in-context example matching mechanism to simulate human-like tool-use reasoning. ToolVQA encompasses 10 multimodal tools across 7 diverse task domains, with an average inference length of 2.78 reasoning steps per instance. The LLaVA-7B model fine-tuned on ToolVQA not only achieves impressive performance on the ToolVQA test set, but also surpasses the large closed-source model GPT-3.5-turbo on various out-of-distribution (OOD) datasets, showing strong generalizability in real-world tool-use scenarios.


One effective approach to enhancing LFMs' tool-use capability is fine-tuning on large-scale datasets. However, a significant gap remains between existing large-scale datasets and real-world user needs. By studying real user logs, we find that real-world tool usage involves (1) real-world scenarios: complex visual scenes with real-world context (e.g., photos taken in the wild rather than synthetic images), and (2) real-world queries: challenging queries whose multi-step reasoning process is left implicit.

Existing datasets, however, often lack these two critical properties. Their scenarios are synthesized rather than drawn from real photographs, resulting in overly simplistic scenes that do not match real-world conditions. Their queries require only single-step reasoning or explicitly hint at the reasoning process, e.g., "using the Cheap YouTube API tool".

These setups oversimplify the task, creating a gap between synthetic and real-world tool usage that limits our ability to analyze LFMs' performance. In addition, some datasets rely on costly human annotation, which makes them hard to scale and thus limits their usefulness for fine-tuning to enhance LFMs' tool-use capabilities.


To bridge these gaps, we introduce ToolEngine, a data construction pipeline designed to generate multi-step reasoning VQA data for tool usage. ToolEngine employs Depth-First Search (DFS) to simulate a human-like multi-step reasoning process under the guidance of real user-context examples. To facilitate multi-step reasoning in the generated data, we introduce a novel dynamic example matching mechanism: at each step of the iterative search, it matches real-world tool-use examples against the search trajectory so far, guiding the controller's next tool selection.
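
The iterative search can be pictured with the sketch below, which reuses `match_examples` and `pool` from the earlier block. Here `select_tools`, `call_tool`, and `is_complete` are hypothetical stand-ins for the GPT-driven controller, the tool executor, and the termination check; their placeholder bodies exist only so the snippet runs end to end.

```python
from dataclasses import dataclass

@dataclass
class Step:
    tool: str
    output: str

def select_tools(trajectory, examples):
    # Placeholder policy: a real controller would prompt GPT with the
    # matched examples and propose candidate next tools.
    return ["OCR", "Search"]

def call_tool(tool, trajectory):
    # Placeholder executor: a real pipeline would invoke the actual tool.
    return Step(tool, f"<output of {tool}>")

def is_complete(trajectory):
    # Placeholder termination check standing in for the controller's decision.
    return len(trajectory) >= 2

def dfs_generate(trajectory, example_pool, max_depth=5):
    """DFS over tool-call sequences, re-matching in-context examples
    against the trajectory so far before each expansion."""
    if is_complete(trajectory):
        return [list(trajectory)]      # one finished reasoning chain
    if len(trajectory) >= max_depth:
        return []                      # prune overly deep branches
    chains = []
    # Dynamic matching: guidance examples depend on the current prefix.
    examples = match_examples([s.tool for s in trajectory], example_pool)
    for tool in select_tools(trajectory, examples):
        trajectory.append(call_tool(tool, trajectory))
        chains.extend(dfs_generate(trajectory, example_pool, max_depth))
        trajectory.pop()               # backtrack to try other tools
    return chains

print(len(dfs_generate([], pool)))  # 4 candidate chains with the stubs above
```

The backtracking step is what distinguishes this from greedy chain generation: when a branch dead-ends or exceeds the depth budget, the search returns to an earlier state and explores an alternative tool choice under freshly matched examples.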

This mechanism enables (1) extracting rich visual information from the input image and (2) integrating that information during reasoning to pose challenging queries. By freeing query generation from fixed templates, ToolEngine makes fuller use of the visual information and thereby increases the reasoning complexity of the generated queries.

The generated dataset, ToolVQA, contains 23K instances across 7 diverse task domains, with an average inference length of 2.78 reasoning steps per instance. The LLaVA-7B model fine-tuned on ToolVQA not only achieves impressive performance on the ToolVQA test set, but also surpasses the large closed-source model GPT-3.5-turbo on various out-of-distribution (OOD) datasets, showing strong generalizability in real-world tool-use scenarios.


BibTeX

BibTeX code here