Human attention is intricately linked with decision-making behavior, such as subjective preferences and ratings, and often shapes it. Yet prior research has typically studied the two in isolation. For example, there is a large body of work on predictive models of human attention, which are useful for applications ranging from reducing visual distraction to optimizing interaction designs and enabling faster (progressive) rendering of very large images. A separate body of work models explicit, later-stage decision-making behavior, such as subjective preferences and aesthetic quality.
Recently, we began to focus our research on whether a single model can simultaneously predict different types of human interaction and feedback, unlocking exciting human-centric applications. In our previous blog post we demonstrated how a single machine learning (ML) model can predict rich human feedback on generated images (e.g., text-image misalignment, aesthetic quality, and problematic regions with artifacts, along with an explanation), and how those predictions can be used to evaluate and improve image generation results.
Following up on this effort, in “UniAR: A Unified model for predicting human Attention and Responses on diverse visual content”, we introduce a multimodal model that unifies a range of human visual behavior modeling tasks, with performance comparable to the best domain- and task-specific models. Inspired by recent progress in large vision-language models, we adopt a multimodal encoder-decoder transformer architecture, framing each behavior modeling task as an instance of a single image-and-text-to-text problem.
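To make the unification idea concrete, here is a minimal sketch of how several behavior modeling tasks over the same image could be framed as prompt/target pairs for one shared encoder-decoder. This is an illustrative assumption about the data framing, not the actual UniAR implementation; the prompt strings, the `Example` container, and the serialized targets are all hypothetical.

```python
# Illustrative sketch (assumed framing, not the UniAR codebase): each task
# becomes a (image, task prompt) -> serialized target pair, so a single
# image+text -> text model can learn attention and response tasks together.
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    image: bytes   # raw image content (placeholder)
    prompt: str    # natural-language task description fed to the encoder
    target: str    # serialized label the decoder learns to emit


def make_examples(image: bytes) -> List[Example]:
    """Frame several human-behavior tasks over one image as prompt/target
    pairs for a shared encoder-decoder (hypothetical serialization)."""
    return [
        Example(image, "predict attention heatmap", "<heatmap tokens>"),
        Example(image, "predict scanpath", "<fixation coordinate sequence>"),
        Example(image, "rate aesthetic quality from 1 to 5", "4"),
    ]
```

Because every task shares the same input and output interface, adding a new behavior type only requires a new prompt and target serialization rather than a new model head.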
This model enables a wide variety of applications. For example, it can provide near-instant feedback on the effectiveness of UIs and visual content, enabling designers and content-creation models to optimize their work for human-centric improvements. To the best of our knowledge, this represents the first attempt to unify the modeling of both implicit, early-perceptual behavior (what catches people’s attention) and explicit, later-stage decision-making (subjective preferences) across diverse visual content, including real images, mobile web pages, mobile UIs, and more.