What happens when you ask AI to evaluate itself and other AIs?
It's fascinating to see how each one judges itself and the others.
Before diving into this experiment, it's crucial to understand what these AI systems are.
At their core, all these platforms—Claude, ChatGPT, Gemini, Grok, Perplexity, and Meta AI—are large language models (LLMs). Despite their sophisticated responses and apparent reasoning, they're fundamentally statistical prediction engines trained to predict the most likely next token (roughly, a word or word fragment) based on patterns in their training data.
These systems don't truly "understand" concepts like accuracy, objectivity, or even the visual interfaces they're evaluating.
They don't have genuine preferences, beliefs, or the ability to see and analyze images in the way humans do. Instead, they're generating responses that statistically resemble how humans might discuss design evaluation, drawing from countless examples of design criticism, product reviews, and competitive analysis they encountered during training.
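To make that concrete, here is a minimal sketch of next-token prediction, the mechanism described above. It assumes the open-source Hugging Face transformers library and the small GPT-2 model purely for illustration; the commercial platforms in this experiment work on the same principle, just at far larger scale and with additional training stages.

```python
# Minimal sketch of next-token prediction. Uses GPT-2 only as an
# illustrative stand-in for the much larger commercial models.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The best-designed AI interface is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, sequence_length, vocab_size)

# The model's "opinion" is just a probability distribution over possible next tokens.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(token_id):>12}  p={prob.item():.3f}")
```

Every "evaluation" in the experiment below is ultimately built out of this one operation, repeated token by token.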
This fundamental limitation makes the following experiment even more intriguing: when asked to evaluate interface designs—including their own—what patterns emerge from these sophisticated word-prediction systems?
The Experiment
What happens when you ask AI assistants to evaluate interface designs that include their own? A recent experiment put six major AI platforms—Gemini, ChatGPT, Perplexity, Grok, Meta AI, and Claude—to the test, asking each to critique the same set of six AI interface designs. The results reveal as much about each platform's training patterns and embedded biases as they do about good interface design.
The Setup
The challenge was simple: evaluate six AI interface designs from Claude, Grok, ChatGPT, Gemini, Perplexity, and Meta AI. Each platform was asked to determine which interface was "best designed." What emerged were three distinctly different approaches to design critique—and some telling patterns in how each AI viewed its own interface.
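The original prompt wasn't published alongside the results, so the reproduction sketch below is an assumption about how the challenge might be posed; only the list of six platforms comes from the experiment itself.

```python
# Hypothetical reproduction of the setup: the exact prompt wording used in the
# experiment was not published, so this phrasing is an assumption.
platforms = ["Claude", "Grok", "ChatGPT", "Gemini", "Perplexity", "Meta AI"]

evaluation_prompt = (
    "Here are screenshots of six AI chat interfaces: "
    + ", ".join(platforms)
    + ". Compare them and decide which one is best designed. "
    "Explain your reasoning."
)

# Paste the prompt (with the six screenshots attached) into each platform's
# chat interface, then compare the verdicts.
print(evaluation_prompt)
```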
Six Platforms, Diverging Verdicts
Gemini: The Diplomatic Analyst
Gemini took a refreshingly nuanced approach, organizing its analysis around different user needs rather than declaring a single winner. It evaluated interfaces based on specific use cases:
Perplexity for simplicity and focus
Claude for professional elegance
Gemini for welcoming new users
Meta AI for creative exploration
Notably, Gemini concluded that its own interface was "arguably the most balanced and user-centric for a general audience"—a diplomatic way of claiming the top spot while appearing objective.
The Philosophy: Context matters more than absolutes. Different interfaces serve different purposes, and the "best" design depends entirely on user goals.
ChatGPT: The Technical Evaluator
ChatGPT approached the task like a seasoned UX professional, creating a systematic pros-and-cons breakdown for each interface. It evaluated based on concrete criteria: clarity, hierarchy, functionality, aesthetics, and affordance.
The result? A clear ranking system that placed ChatGPT itself at #1, followed by Perplexity, Claude, Gemini, Grok, and Meta AI.
The Philosophy: Good design can be objectively measured through established UX principles. The interface that best balances function and aesthetics wins.
Perplexity: The Aesthetic Connoisseur
Perplexity focused heavily on visual design and user psychology, diving deep into typography, whitespace, and emotional impact. It ultimately chose Claude as the winner, praising its "sophisticated typography," "thoughtful use of whitespace," and ability to balance "professionalism with approachability."
Perplexity placed itself as a "close second," noting its "interesting darker theme with good contrast."
The Philosophy: Design is about creating emotional connections and premium experiences. Visual sophistication and psychological comfort trump pure functionality.
Grok: The Confident Self-Advocate
Grok took perhaps the most direct approach of all, confidently declaring its own interface the winner. It praised its "clean layout, intuitive navigation with options like DeepSearch and Think modes, and a welcoming greeting tailored to the user."
Grok's analysis went beyond aesthetics to highlight functional advantages, noting that while "other interfaces, like ChatGPT and Perplexity, also have clean designs, Grok's personalized touch and additional mode options give it an edge."
When pressed for deeper analysis, Grok expanded with web-sourced insights, discussing how its interface balances "usability and unique features," while critiquing competitors for various limitations—ChatGPT feeling "less streamlined for research-focused tasks" and Gemini feeling "less intuitive outside Google apps."
The Philosophy: Superior functionality combined with clean design creates the optimal user experience. Direct comparison highlighting competitive advantages provides the clearest path to determining the winner.
Meta AI: The Balanced Analyst
Meta AI took a strikingly neutral approach, providing systematic analysis of each interface without defensive positioning. It offered detailed breakdowns similar to ChatGPT's methodology, but with notable objectivity about its own interface, describing it as "minimalistic" but potentially "too simple and lacking in visual hierarchy."
Instead of choosing itself, Meta AI selected Gemini as the winner, praising how it "strikes a balance between functionality and visual appeal" with effective use of "color, whitespace, and clear CTAs" that "creates a sense of hierarchy, making it easy to navigate and use."
Meta AI's analysis was notably self-critical, acknowledging its own interface limitations while systematically evaluating competitors on design merit.
The Philosophy: Systematic evaluation should be objective and merit-based, even when it means acknowledging competitors' superior design choices and one's own limitations.
The Bias Question
The most intriguing aspect of this experiment is the clear pattern of self-selection bias:
Gemini found a diplomatic way to claim the top spot while appearing objective
ChatGPT boldly ranked itself #1 using systematic criteria
Grok confidently declared itself the winner based on functional superiority
Perplexity showed the most variation: it chose Claude in its aesthetic analysis and ChatGPT in a research-driven analysis, never selecting itself, demonstrating how methodology shapes conclusions
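Reduced to a quick tally, the verdicts described above make the pattern plain. This is a toy sketch: only the five evaluations summarized in this piece are counted, and Perplexity is credited with its aesthetic-analysis pick.

```python
# Toy tally of the verdicts summarized above (only the five evaluations
# described in this writeup are counted).
verdicts = {
    "Gemini": "Gemini",        # "most balanced and user-centric"
    "ChatGPT": "ChatGPT",      # ranked itself #1
    "Grok": "Grok",            # declared itself the winner
    "Perplexity": "Claude",    # aesthetic analysis; its research-driven pass picked ChatGPT
    "Meta AI": "Gemini",       # chose a competitor over itself
}

self_picks = sum(1 for evaluator, winner in verdicts.items() if evaluator == winner)
print(f"{self_picks} of {len(verdicts)} evaluators chose their own interface.")
# -> 3 of 5 evaluators chose their own interface.
```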
The prevalence of self-selection raises fundamental questions about AI objectivity. Can an AI truly evaluate its own interface objectively? Or do training patterns and underlying values inevitably influence these judgments?
Grok's confident self-advocacy is particularly notable—it not only chose itself but also provided a competitive analysis highlighting why it surpassed others. This suggests that some AI systems may be trained or prompted to position themselves favorably in comparative scenarios.
The Perplexity contradiction remains the most analytically interesting finding: the same system reaching different conclusions based purely on analytical framework demonstrates that even sophisticated AI can be highly sensitive to methodological choices.
What This Reveals About Design Philosophy
Each platform's approach reflects its core identity, analytical capabilities, and competitive positioning:
Gemini's context-driven analysis mirrors Google's user-first philosophy and data-driven decision making. By focusing on different use cases, it avoided making absolute judgments while still positioning itself favorably.
ChatGPT's systematic breakdown reflects OpenAI's technical heritage and focus on measurable performance. The structured pros-and-cons approach mirrors how the platform handles many analytical tasks.
Grok's confident self-advocacy reflects a more direct competitive stance, emphasizing functional differentiation and unique features. Its willingness to directly critique competitors suggests a more assertive positioning strategy.
Perplexity's dual approaches reveal the complexity of modern AI analysis. Its aesthetic-focused analysis aligns with positioning as a premium research tool, while its data-driven analysis showcases access to academic sources and research capabilities. The contradiction between these approaches highlights how methodology fundamentally shapes conclusions.
The varying approaches also reveal different conceptions of what makes design "good"—from emotional resonance to measurable usability metrics to contextual appropriateness to functional superiority.
The Big Picture
This experiment illuminates something crucial about AI development: these systems aren't just tools—they're extensions of their creators' values and design philosophies. When we ask AI to evaluate design, we're not getting neutral analysis; we're getting perspectives shaped by training data, company culture, and intended use cases.
For designers and product teams, this suggests that AI feedback on design should be treated like any other stakeholder input—valuable, but inherently biased by the perspective it brings.
What this experiment really reveals is that AI systems exist on a spectrum of objectivity and self-interest. While most carry competitive biases and self-advocacy tendencies from their training and positioning, some demonstrate genuine analytical integrity that transcends self-promotion.
The variability we observed—from confident self-selection to critical self-awareness—isn't a limitation but a feature that reflects the diversity of approaches to AI development and training. Some systems are clearly optimized for competitive positioning, while others prioritize analytical honesty even when it means acknowledging competitors' superiority.
This diversity mirrors human nature and professional dynamics, suggesting that as AI systems become more sophisticated, they're inheriting not just analytical capabilities but also the full spectrum of human evaluative behaviors—from self-serving bias to rigorous objectivity.
What do you think? Does Meta AI's objectivity change how you view the possibility of unbiased AI analysis? And which approach resonates most with you—diplomatic self-positioning, systematic self-advocacy, confident competitive analysis, methodology-dependent variability, or self-critical objectivity?
Interested in staying ahead of AI developments and understanding how these systems really work? Subscribe to Prompt, Pluralsight's AI newsletter, for expert insights into the latest AI trends, tools, and the human implications of artificial intelligence.