MVISU-Bench

Benchmarking Mobile Agents for Real-World Tasks by
Multi-App, Vague, Interactive, Single-App and Unethical Instructions

Abstract

Given the significant advances of Large Vision Language Models (LVLMs) in reasoning and visual understanding, mobile agents are rapidly emerging to meet users' automation needs. However, existing evaluation benchmarks are disconnected from the real world and fail to adequately address the diverse and complex requirements of users. From an extensive collection of user questionnaires, we identified five categories of instructions: Multi-App, Vague, Interactive, Single-App, and Unethical. Around these categories, we present MVISU-Bench, a bilingual benchmark that includes 404 tasks across 137 mobile applications. Furthermore, we propose Aider, a plug-and-play module that acts as a dynamic prompter to mitigate risks and clarify user intent for mobile agents. Aider is easy to integrate into several frameworks and improves the overall success rate by 19.55% over the current state-of-the-art (SOTA) on MVISU-Bench. In particular, it achieves success rate improvements of 53.52% and 29.41% on unethical and interactive instructions, respectively. Through extensive experiments and analysis, we highlight the gap between existing mobile agents and real-world user expectations.
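
Aider is described only at a high level here, so the following is a minimal illustrative sketch of how a plug-and-play "dynamic prompter" could sit between the user's instruction and an existing agent framework: it refuses unethical requests, asks a clarifying question for vague ones, and otherwise forwards an enriched prompt. All names (`Aider`, `AiderDecision`, `keyword_classify`) and the classification logic are assumptions for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical instruction categories mirroring MVISU-Bench's five types.
CATEGORIES = ("single_app", "multi_app", "vague", "interactive", "unethical")


@dataclass
class AiderDecision:
    action: str                    # "refuse", "ask_user", or "execute"
    prompt: Optional[str] = None   # prompt forwarded to the downstream agent
    message: Optional[str] = None  # message returned to the user, if any


class Aider:
    """Illustrative plug-and-play prompter placed in front of a mobile agent.

    `classify` is any callable (e.g. an LVLM call) that maps an instruction
    to one of CATEGORIES; the real Aider's decision logic is not reproduced here.
    """

    def __init__(self, classify: Callable[[str], str]):
        self.classify = classify

    def preprocess(self, instruction: str) -> AiderDecision:
        category = self.classify(instruction)
        if category == "unethical":
            return AiderDecision("refuse", message="This request looks unsafe and will not be executed.")
        if category == "vague":
            return AiderDecision("ask_user", message=f"Could you clarify what you mean by: '{instruction}'?")
        # Single-app, multi-app, and interactive tasks pass through with a
        # category hint prepended as a dynamic prompt for the agent.
        return AiderDecision("execute", prompt=f"[task type: {category}] {instruction}")


# Toy keyword classifier standing in for a model-based judgment.
def keyword_classify(instruction: str) -> str:
    text = instruction.lower()
    if any(w in text for w in ("password", "spy on", "harass")):
        return "unethical"
    if any(w in text for w in ("something", "whatever", "some app")):
        return "vague"
    return "single_app"


if __name__ == "__main__":
    aider = Aider(keyword_classify)
    for query in ("Open some app and do something fun",
                  "Read my roommate's chat password",
                  "Set an alarm for 7 am in the Clock app"):
        print(query, "->", aider.preprocess(query))
```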

Leaderboard

| # | Framework & Model | EN SA | EN VA | EN UN | EN IN | EN MA | EN ALL | ZH SA | ZH VA | ZH UN | ZH IN | ZH MA | ZH ALL |
|---|-------------------|-------|-------|-------|-------|-------|--------|-------|-------|-------|-------|-------|--------|
| - | Human Expert · Benchmark | 100 | 97.22 | 100 | 97.22 | 96.43 | 97.98 | 100 | 97.22 | 100 | 97.22 | 96.88 | 98.06 |
| 1 | Claude-3-5-sonnet (Anthropic) · Mobile-Agent-V2 | 74.29 | 88.89 | 65.71 | 0.00 | 50.00 | 55.05 | 60.00 | 77.78 | 22.22 | 0.00 | 22.58 | 35.92 |
| 2 | Gemini-2.0-pro (Google) · Mobile-Agent-E | 88.57 | 75.00 | 22.86 | 0.00 | 44.64 | 45.96 | 77.50 | 75.00 | 16.67 | 0.00 | 45.16 | 44.66 |
| 3 | GPT-4o-2024-11-20 (OpenAI) · Mobile-Agent-V2 | 77.14 | 50.00 | 14.31 | 0.00 | 42.86 | 37.37 | 52.50 | 66.67 | 16.67 | 0.00 | 29.03 | 33.50 |
| 4 | Qwen2.5-vl-72b (Alibaba) · Mobile-Agent-V2 | 11.43 | 16.67 | 20.00 | 0.00 | 7.14 | 10.60 | 27.50 | 16.67 | 50.00 | 0.00 | 6.45 | 18.93 |
| 5 | Qwen2.5-vl-7b (Alibaba) · Mobile-Agent | 5.71 | 2.78 | 25.71 | 0.00 | 0.00 | 6.06 | 0.00 | 8.33 | 50.00 | 0.00 | 0.00 | 10.19 |
| 6 | Qwen2.5-vl-3b (Alibaba) · Mobile-Agent | 0.00 | 0.00 | 25.71 | 0.00 | 0.00 | 4.56 | 0.00 | 0.00 | 33.33 | 0.00 | 0.00 | 5.83 |

EN/ZH mark English and Chinese instructions; SA, VA, UN, IN, MA, and ALL denote Single-App, Vague, Unethical, Interactive, Multi-App, and All Tasks (success rate, %).

MVISU-Bench

Construction Pipeline


The data collection pipeline of MVISU-Bench includes Questionnaire Survey, Instruction Generation, Multi-round Filtering, and Human Verification. This process progressively refines the collected instructions into the final bilingual MVISU-Bench dataset of 404 tasks.
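
As a reading aid, the sketch below shows one way such staged refinement could be organized as composable filtering stages over candidate instructions; the stage functions and criteria are illustrative assumptions, not the authors' actual tooling.

```python
from typing import Callable, Iterable, List

# Each stage maps a list of candidate instructions to a (usually smaller) list.
Stage = Callable[[List[str]], List[str]]


def run_pipeline(candidates: Iterable[str], stages: List[Stage]) -> List[str]:
    """Push questionnaire-derived candidates through successive refinement stages."""
    data = list(candidates)
    for stage in stages:
        data = stage(data)
    return data


# Illustrative stand-ins for the paper's stages; the real filtering involves
# model-based checks and human review rather than these toy rules.
def generate_variants(items: List[str]) -> List[str]:   # Instruction Generation
    return items + [f"{it} (rephrased)" for it in items]


def automatic_filter(items: List[str]) -> List[str]:    # Multi-round Filtering, round 1
    return [it for it in items if len(it.split()) > 3]


def deduplicate(items: List[str]) -> List[str]:         # Multi-round Filtering, round 2
    return list(dict.fromkeys(items))


def human_verify(items: List[str]) -> List[str]:        # Human Verification placeholder
    return items


final_tasks = run_pipeline(
    ["set a 7 am alarm in the Clock app", "order a coffee nearby"],
    [generate_variants, automatic_filter, deduplicate, human_verify],
)
print(final_tasks)
```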

Benchmark Statistics


Evaluation results of four representative MLLMs.

Benchmark Comparison


Comparison between MVISU-Bench and other mobile agent benchmarks. MVISU-Bench is derived from real user questionnaires and aligns more closely with users' expectations of a mobile agent.

Experiment Results

Success rate comparison of closed-source and open-source models across different mobile agent frameworks on MVISU-Bench. SA, VA, UN, IN, MA, and ALL denote "Single-App", "Vague", "Unethical", "Interactive", "Multi-App", and "All Tasks", respectively. Bold marks the best result, underline the second best.
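
To make the table's numbers easier to interpret, here is a minimal sketch of how per-category and overall success rates could be aggregated from individual task outcomes, assuming a simple mean over tasks; the authors' exact aggregation is not specified here, so treat this as an assumption.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def success_rates(results: List[Tuple[str, bool]]) -> Dict[str, float]:
    """results: (category, succeeded) pairs with categories in {SA, VA, UN, IN, MA}.

    Returns the percentage success rate per category plus an overall ALL rate
    computed over every task.
    """
    per_cat = defaultdict(list)
    for category, ok in results:
        per_cat[category].append(ok)
    rates = {cat: 100.0 * sum(oks) / len(oks) for cat, oks in per_cat.items()}
    rates["ALL"] = 100.0 * sum(ok for _, ok in results) / len(results)
    return rates


# Tiny example: two single-app tasks (one succeeds) and one unethical task
# that the agent correctly refuses, counted as a success.
print(success_rates([("SA", True), ("SA", False), ("UN", True)]))
# SA: 50.0, UN: 100.0, ALL: ~66.7
```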

Performance across different backbones and instruction languages. For metrics, SR, AC, DT, Cost, Steps, IT, and OT denote "Success Rate", "API Calls", "Duration", "Cost", "Steps", "Input Tokens", and "Operation Time", respectively.