Shaofeng Yin (殷绍峰)

I'm a junior undergraduate student at Peking University, majoring in artificial intelligence. I'm currently visiting Berkeley AI Research, where I've had a wonderful time working with Haven, Jiaxin, and Zora.

My early research focused on generalization in computer vision. As vision-language models (VLMs) have grown increasingly powerful, I've become deeply interested in the study of visual agents.

In life, I have a deep appreciation for art, literature, and philosophy, and I'm committed to volunteer teaching programs. I used to be passionate about competitive programming and earned the title of Codeforces Master at the age of 15, but I later stepped away due to burnout.

Email  /  Scholar  /  LinkedIn  /  GitHub

profile photo

📚 Selected Publications

ToolVQA: A Dataset for Multi-step Reasoning VQA with External Tools
Shaofeng Yin, Ting Lei, Yang Liu
ICCV, 2025
project page (coming soon!) / arXiv (coming soon!)

Recent benchmarks reveal significant gaps in real-world tool-use proficiency, particularly in functionally diverse multimodal settings that require multi-step reasoning. To bridge these gaps, we propose ToolEngine, a data generation pipeline that employs depth-first search (DFS) with a dynamic in-context example matching mechanism to simulate human-like tool-use reasoning.

Exploring Conditional Multi-Modal Prompts for Zero-shot HOI Detection
Ting Lei, Shaofeng Yin, Yuxin Peng, Yang Liu
ECCV, 2024
project page / arXiv

In this paper, we introduce a novel framework for zero-shot HOI detection using Conditional Multi-Modal Prompts (CMMP). This approach enhances the generalization of large foundation models, such as CLIP, when fine-tuned for HOI detection.

Exploring the Potential of Large Foundation Models for Open-Vocabulary HOI Detection
Ting Lei, Shaofeng Yin, Yang Liu
CVPR, 2024
project page / arXiv

In this paper, we introduce a novel end-to-end open-vocabulary HOI detection framework with conditional multi-level decoding and fine-grained semantic enhancement (CMD-SE), harnessing the potential of vision-language models (VLMs).

🏆 Selected Awards

2025: First Prize in the CVPR International CulturalVQA Benchmark Challenge
2025: SenseTime Scholarship (30 recipients per year in China)
2024: National Scholarship (highest honor for undergraduates in China)

📸 Selected Photography

Photography 0
Photography 1
Photography 2
Photography 3
Photography 4
Photography 5

The template is stolen from Jon Barron.