โ† Back to all models
๐Ÿ’ฌ

ByteDance: UI-TARS 7B

bytedanceยทText Generation
๐Ÿ”ฅ 73trending

UI-TARS-1.5 is a multimodal vision-language agent optimized for GUI-based environments, including desktop interfaces, web browsers, mobile systems, and games. Built by ByteDance, it builds upon the UI-TARS framework with reinforcement learning-based reasoning, enabling robust action planning and execution across virtual interfaces. This model achieves state-of-the-art results on a range of interactive and grounding benchmarks, including OSworld, WebVoyager, AndroidWorld, and ScreenSpot. It also demonstrates perfect task completion across diverse Poki games and outperforms prior models in Minecraft agent tasks. UI-TARS-1.5 supports thought decomposition during inference and shows strong scaling across variants, with the 1.5 version notably exceeding the performance of earlier 72B and 7B checkpoints.

#text+image->text#top-provider
๐Ÿงฎ

Undisclosed

Parameters

๐Ÿ“

128K tokens

Context Window

๐Ÿ”’

Proprietary

License

๐Ÿ“…

Jul 22, 2025

Released

๐Ÿ’ฐ Pricing

Input

$0.10

per 1M tokens

Output

$0.20

per 1M tokens

๐Ÿ”Œ

API Available

This model is accessible via API for integration into your applications.