dev Apr 12, 2026

Improving UI Design with a Generator–Evaluator Loop

Good UI design is still hard.

LLMs have made it easier to generate designs quickly but most outputs look the same. Without direction, you end up with generic layouts that resemble hundreds of AI-generated sites.

Earlier, designs would default to purple or dark themes without any prompting. With this experiment, the UI now goes beyond the standard LLM baseline.

Experiment

Goal

Design a personal blog UI.

Approach

Create a feedback loop using two agents:

Generator agent → produces design ideas and iterations.
Evaluator agent → scores designs based on defined criteria

Both agents can use a web search for inspiration.

Scoring Criteria

Each design is evaluated on:

Originality
Simplicity
Aesthetics
Use of CRAP principles (Contrast, Repetition, Alignment, Proximity)
Non-AI generated
Innovation
Creativity

Target: Score ≥ 9

Constraint: Prefer simple solutions over complex ones, unless complexity is necessary.

LLM Prompt

Can we create a generator and evaluator agent loop to create a blog design and iterate? The generator comes up with a design and evaluator scores to improve on these metrics...originality, simplicity, aesthetics, valuable, CRAP rules,innovative, creative, and non-ai-generated scores till we reach a score of 9

Results

Model: Sonnet

Sonnet

Model: Haiku

Haiku

Evaluator sample scoring screen

FINAL SCORES (Iteration 11)
──────────────────────────────────────────────────────────
originality  │ 8/10  — Margin-line metaphor, 3-register type system
simplicity   │ 8/10  — Orange system coherent, clean hierarchy
aesthetics   │ 8/10  — Asymmetric grid 680/280 creates spatial tension
valuable     │ 8/10  — Clear scannability, sparse sidebar
crap         │ 8/10  — All 4 CRAP principles operating coherently
non_ai       │ 8/10  — 680/280 specificity, deliberate design intent
creative     │ 8/10  — Ghost numerals + type contrast + asymmetry
innovative   │ 9/10  — Bold structural layout breaks conventions
──────────────────────────────────────────────────────────

JOURNEY: 0→11 iterations
─────────────────────────────────────────────────────────
Iter 0: Generic blog template                   (avg 5.9/10)
Iter 1: Added constraint metaphor                (avg 6.4/10)
Iter 2: Bold black + orange + monospace           (avg 7.0/10)
Iter 3: Collapsed margin, intentional spacing     (avg 7.1/10)
Iter 4: Added marginalia annotations              (avg 7.4/10)
Iter 5: Refined to scholarly minimalism           (avg 7.9/10)
Iter 6: Micro-refinements (plateau)               (avg 7.6/10)
Iter 7: System sans-serif body (breakthrough)     (avg 7.9/10)
Iter 8: Ghost numerals (hedged commit)            (avg 7.6/10)
Iter 9: Bolder numerals + Georgia serif           (avg 7.9/10)
Iter 10: Fixed CRAP (all 8/10)                    (avg 8.0/10)
Iter 11: Asymmetric grid 680/280 (complete)      (avg 8.1/10)

Learnings

Metrics drive quality : A strong model alone isn’t enough. Clear scoring criteria significantly improve outcomes.
Direction matters more than raw capability : Even average models perform better when guided with structured evaluation.
Evaluator > self-critique : Separate evaluator agents outperform self-evaluation by the generator.
Initial vision helps: Providing a clear starting direction improves both convergence speed and output quality. This also enhanced the styling of this blog post.
Model differences are real.
- Sonnet: best balance of cost, speed, and quality
- Haiku: fast but weaker design sense
- Opus: strong, but not significantly better than Sonnet for this use case

Next Steps

Refine scoring criteria to push designs further away from generic patterns.
Introduce reference designs to guide exploration.
Test additional models (Gemini, Codex)
Experiment with multi-model combinations (e.g., one generates, another evaluates)

Closing Thoughts

This pattern isn’t new. It mirrors human workflows like design reviews and code reviews. What’s new is the scale at which agents can automate these patterns in places we might not have anticipated earlier with humans.

With clear goals, metrics, and iteration loops, agents can:

Raise the baseline quality of design.
Help non-designers produce strong outputs.
Accelerate exploration in unfamiliar domains.

This feels like an early version of a broader shift: structured iteration + autonomous agents → better outcomes across domains.

References

Anthropic engineering post : https://www.anthropic.com/engineering/harness-design-long-running-apps
- Had versions of this earlier, but never formalized. This post gave words to formalize this approach and try this experiment.
AutoResearch (related idea: vision + metrics + iteration) : https://github.com/karpathy/autoresearch
- Though not related, but very much in line with this pattern of vision + metrics with agent autonomy.
Design principles : https://www.lewisu.edu/writingcenter/pdf/crapdesignprinciples.pdf
Claude models : https://platform.claude.com/docs/en/about-claude/models/overview#latest-models-comparison