[Screenshot of the index page at vab.bakelab.ai]

Researchers at BakeLab, a leading artificial intelligence research group at the University of Washington, in collaboration with UCSB, Stanford University, University of Notre Dame, and IBM Research, have announced the launch of the Visual Aesthetic Benchmark (VAB). This groundbreaking project aims to bridge the gap between technical image generation and human-centric aesthetic value, providing the most comprehensive framework to date for evaluating how "good" AI-generated images actually look.

As generative AI models like Midjourney, DALL-E, and Stable Diffusion flood the digital landscape, the industry has struggled to move beyond simple accuracy metrics. VAB introduces a sophisticated, human-aligned evaluation system grounded in over 13,000 expert judgments and spanning 2,000+ hours of commissioned professional critiques.

Unlike previous benchmarks that treat all images equally, VAB categorizes visual quality across three distinct domains, illustrated in a short data sketch after the list:

  • Fine Art: Evaluating nuances in calligraphy, ink-and-wash, and sketch techniques.
  • Illustration: Focusing on digital painting, anime-manga styles, and pixel art.
  • Photography: Assessing professional-grade standards in landscape, macro, and architectural photography.
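
Purely as an illustration of this taxonomy, one way a single benchmark item might be structured is sketched below. The field names, sub-style labels, and score are assumptions for the sake of the example, not VAB's actual schema.

```python
# Illustrative only: a hypothetical VAB-style benchmark item.
# Field names and sub-styles are assumptions drawn from the three
# domains above, not the project's published schema.
from dataclasses import dataclass

@dataclass
class BenchmarkItem:
    domain: str          # "fine_art", "illustration", or "photography"
    sub_style: str       # e.g. "ink_and_wash", "pixel_art", "macro"
    prompt: str          # text prompt given to the image model
    model: str           # which of the 20+ tested models produced the image
    expert_score: float  # aggregated expert aesthetic judgment

item = BenchmarkItem("illustration", "pixel_art",
                     "a fox sleeping under cherry blossoms",
                     "model_a", expert_score=7.8)
```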

By testing 20+ frontier AI models, the VAB project provides a transparent leaderboard that reveals which models truly capture human artistic intent versus those that merely produce high-resolution noise.

The Game of Aesthetic Judgment

At the heart of the project’s public engagement is the VAB Arena. Designed as an interactive, gamified experience, the Arena invites users to participate in the scientific process. In a "blind taste test" format, participants are presented with side-by-side images generated by different AI models. Players select the "best" and "worst" based on their personal aesthetic preferences. These crowdsourced "human-in-the-loop" insights directly inform the benchmark, helping the AI community understand the subjective nature of beauty.
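
The announcement does not say how Arena votes are turned into rankings, but side-by-side voting of this kind is commonly aggregated with an Elo-style rating update. The sketch below assumes that approach; the model names, baseline rating of 1000, and K-factor of 32 are illustrative defaults, not VAB's published parameters.

```python
# Minimal sketch (not VAB's stated method): turning Arena-style
# side-by-side votes into model ratings with an Elo update.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def record_vote(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Update both models' ratings after one best-vs-worst vote."""
    expected_win = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * (1.0 - expected_win)
    ratings[loser] -= k * (1.0 - expected_win)

# Every model starts from the same baseline rating.
ratings = {"model_a": 1000.0, "model_b": 1000.0, "model_c": 1000.0}
record_vote(ratings, winner="model_a", loser="model_c")
print(sorted(ratings.items(), key=lambda kv: -kv[1]))  # leaderboard order
```

A Bradley-Terry fit over the full vote history is an equally common alternative; Elo is shown here only because it updates incrementally as each vote arrives, which suits a live arena.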

For ACTION, hardening an LLM as a judge of aesthetics directly supports our mission to improve LLMs in security operations. We want AI that recognizes intent and nuance, and gauging a response to art is a demanding test of both. If an LLM can be benchmarked to judge art the way a human does, the model likely has a workable understanding of human psychology and culture. For instance, the techniques VAB uses to evaluate AI judgment can be repurposed to build defensive AI that detects when an email "feels" artificially manufactured or "too perfect" to be human.
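
As a purely hypothetical sketch of that repurposing, an aesthetic-judge rubric could be pointed at email text instead of images. The prompt wording, the 0-100 scale, and the `ask_llm` callable below are illustrative assumptions, not an existing ACTION or VAB interface.

```python
# Hypothetical sketch: repurposing an LLM-as-judge rubric to score how
# "manufactured" an email feels. `ask_llm` is a stand-in for whatever
# chat-completion client is available; it is not a real VAB or ACTION API.

JUDGE_PROMPT = """You are rating whether an email reads as human-written.
Score 0 (clearly human) to 100 (clearly machine-manufactured), considering
tone, rhythm, and whether the text feels "too perfect". Reply with only
the number.

Email:
{email}
"""

def manufactured_score(email_text: str, ask_llm) -> float:
    """Return a 0-100 'feels artificial' score from the judge model.

    Assumes the judge replies with a bare number, as the rubric requests.
    """
    reply = ask_llm(JUDGE_PROMPT.format(email=email_text))
    return float(reply.strip())
```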

BakeLab is a research initiative at the University of Washington focused on the intersection of machine learning, human-computer interaction, and creative technologies.

Date:
Location: Seattle, Washington