Sunday, September 15, 2024

Windows Agent Arena (WAA): A Scalable Open-Sourced Windows AI Agent Platform for Testing and Benchmarking Multi-modal, Desktop AI Agent

Windows Agent Arena (WAA) is a benchmark for AI agents that aim to improve human productivity on the Windows operating system. Agents powered by large language models can automate computer tasks, which offers significant value for personal and professional productivity. Evaluating these agents has been difficult, however, because existing benchmarks fail to capture the complexity of real-world tasks on platforms like Windows, and large-scale evaluations have been slow and inefficient.

The WindowsAgentArena benchmark was introduced to address this. It is a comprehensive benchmark for evaluating AI agents in a Windows OS environment, and it uses cloud infrastructure to parallelize evaluations so that agent behavior can be tested quickly and realistically. The suite includes 154 diverse tasks that mirror everyday Windows workflows, with an evaluation criterion that rewards agents based on task completion. It runs in Docker containers, which keeps testing secure and scalable.

The Navi AI agent achieved a success rate of 19.5% on the WindowsAgentArena benchmark, which leaves considerable room for improvement as AI technologies evolve. Navi also performed strongly on a second, web-based benchmark, Mind2Web. It benefits from advanced perception techniques, such as visual markers and screen parsing, that enable more precise interactions and pave the way for more capable and efficient agents.

WindowsAgentArena offers a scalable, reproducible, and realistic testing platform for AI agents in the Windows OS ecosystem, giving researchers and developers the tools to push the boundaries of AI agent development. For more information and a free consultation, visit the AI Lab on Telegram @itinai or follow @itinaicom on Twitter.
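The evaluation setup described above, with tasks run in parallel and each scored on completion, can be illustrated with a minimal Python sketch. This is not the WindowsAgentArena code: the names `TaskResult`, `run_benchmark`, `success_rate`, and `toy_agent` are hypothetical, a local thread pool stands in for the cloud workers, and the toy agent simply pretends to succeed on some tasks.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class TaskResult:
    task_id: str
    completed: bool  # True if the task's completion check passed


def run_benchmark(
    task_ids: Sequence[str],
    run_task: Callable[[str], bool],
    max_workers: int = 4,
) -> List[TaskResult]:
    """Run every task in parallel and return results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of task_ids in its output.
        return list(pool.map(lambda t: TaskResult(t, run_task(t)), task_ids))


def success_rate(results: Sequence[TaskResult]) -> float:
    """Fraction of tasks completed; 0.0 for an empty run."""
    if not results:
        return 0.0
    return sum(r.completed for r in results) / len(results)


def toy_agent(task_id: str) -> bool:
    # Hypothetical stand-in for launching an agent in an isolated
    # environment and checking the final state against the task's goal.
    return task_id.endswith("0")


if __name__ == "__main__":
    tasks = [f"task-{i}" for i in range(20)]
    results = run_benchmark(tasks, toy_agent)
    print(f"success rate: {success_rate(results):.1%}")
```

In this sketch each task yields a single pass/fail outcome, so a reported figure such as 19.5% would correspond to the share of tasks whose completion check passed. A real harness would replace `toy_agent` with code that starts an isolated environment per task.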
