OmniParser, developed by Microsoft Research, converts user interface screenshots (for example, of a mobile application) into structured, text-based elements. This makes it much easier for models such as GPT-4V to analyze an interface and generate precise actions tied to specific regions of the screen. Using detection and captioning models, OmniParser identifies interactive icons and extracts the semantics of the detected elements.
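To make the idea of "structured, text-based elements" concrete, here is a minimal sketch of how parsed elements might be serialized into text for a model. The `UIElement` record, its field names, and the `elements_to_prompt` helper are hypothetical and chosen for illustration only; they are not OmniParser's actual output schema or API.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class UIElement:
    """Hypothetical record for one parsed element (not OmniParser's real schema)."""
    element_id: int
    bbox: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), normalized to [0, 1]
    description: str    # caption or recognized text for the element
    interactable: bool  # True for icons/buttons, False for plain text


def elements_to_prompt(elements: List[UIElement]) -> str:
    """Serialize parsed elements into a plain-text listing, one line per element."""
    lines = []
    for el in elements:
        kind = "icon" if el.interactable else "text"
        x0, y0, x1, y1 = el.bbox
        lines.append(
            f'[{el.element_id}] {kind} "{el.description}" '
            f"at ({x0:.2f}, {y0:.2f}, {x1:.2f}, {y1:.2f})"
        )
    return "\n".join(lines)


# Illustrative data, made up for this example.
elements = [
    UIElement(0, (0.05, 0.02, 0.15, 0.08), "Back arrow", True),
    UIElement(1, (0.20, 0.02, 0.80, 0.08), "Settings", False),
]
print(elements_to_prompt(elements))
```

A listing like this can be placed in a prompt next to the screenshot, so the model can refer to an element by its ID instead of guessing pixel coordinates.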
Testing performed
In a series of tests, OmniParser was evaluated primarily on mobile applications, but also on desktop software. The results were very satisfactory: 90% of interface elements were detected without any particular tuning. Adjusting the configuration could yield an even higher detection rate.
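The text does not say how the 90% figure was measured. As an assumption, one common way to compute such a detection rate is to match predicted bounding boxes against hand-labeled ones using intersection over union (IoU). The sketch below shows that approach; the function names, the greedy matching, and the 0.5 threshold are illustrative choices, not the procedure used in these tests.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def detection_rate(ground_truth: List[Box], detections: List[Box],
                   threshold: float = 0.5) -> float:
    """Fraction of labeled boxes matched by a distinct detection (greedy matching)."""
    unused = list(detections)
    matched = 0
    for gt in ground_truth:
        # Pick the remaining detection that overlaps this labeled box the most.
        best = max(unused, key=lambda d: iou(gt, d), default=None)
        if best is not None and iou(gt, best) >= threshold:
            matched += 1
            unused.remove(best)  # each detection may match only one label
    return matched / len(ground_truth) if ground_truth else 0.0


# Made-up boxes: two labeled elements, only the first one is detected.
labels = [(0, 0, 10, 10), (20, 20, 30, 30)]
found = [(1, 1, 10, 10)]
print(detection_rate(labels, found))
```

Under this definition, a 90% rate would mean that nine out of ten labeled elements had a sufficiently overlapping detection.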