
Get real human feedback on every prompt you ship.

Share one link. Experts, teammates, or customers annotate. The optimizer turns their feedback into a better prompt.

Getting human feedback on prompts is broken.

  • You asked your team in Slack. The thread died on Tuesday.

  • You tried a Google Doc. Feedback ended up scattered across 12 places.

  • You shipped it anyway. And hoped the prompt held up in prod.

So prompts ship on vibes: yours, not those of the people who actually read the output.

Who gives you feedback

The people you actually want feedback from.

Not another account signup. They open one link, annotate in the browser, and close the tab.

Experts
Your compliance lawyer gets a link by email, annotates three outputs on her phone, and closes the tab.
Teammates
Reviewers see outputs — not prompts, not authors. Their feedback lands without bias.
Customers
Embed an evaluator link in your product or onboarding emails. Real users, real reactions.

How Blind Bench works

Collect real feedback. Apply it. Ship a better prompt.

1. Set up

Write your prompt. Pick models to compare. Five minutes.

2. Share one link

Experts, teammates, and customers annotate in the browser. No account.

3. Apply the feedback

The optimizer rewrites your prompt from the annotations. Every edit cites the comment that drove it.
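For the technically curious, "every edit cites the comment that drove it" is easiest to picture as data. The TypeScript shapes below are an illustrative sketch, not Blind Bench's actual API: an annotation records what a reviewer said about which span, and each proposed edit carries the ids of the annotations that justified it. The v3 → v4 example further down this page is shown here as data.

```ts
// Illustrative data shapes only; Blind Bench's real API may differ.

interface Annotation {
  id: string;          // e.g. "ann_1" (hypothetical id scheme)
  output: "A" | "B";   // which blinded output the reviewer commented on
  quote: string;       // the span the reviewer highlighted
  comment: string;     // what they said about it
}

interface PromptEdit {
  before: string;             // text the optimizer removed from the system message
  after: string;              // text it wrote instead
  citedAnnotations: string[]; // every edit names the comments that drove it
}

const annotations: Annotation[] = [
  {
    id: "ann_1",
    output: "A",
    quote: "Dear valued customer, I appreciate your inquiry.",
    comment: "Reads like a corporate form letter.",
  },
  {
    id: "ann_2",
    output: "B",
    quote: "I understand your concern and will assist you.",
    comment: 'Too robotic — "I understand your concern" is filler.',
  },
];

const edit: PromptEdit = {
  before: "You are a helpful customer support agent. Answer the user's question.",
  after:
    "You are a customer support agent. Be friendly and professional. " +
    "Avoid formal openers and form-letter phrases.",
  citedAnnotations: ["ann_1", "ann_2"],
};
```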

This is what your reviewers see.

Two responses to the same customer. Pick the one you'd send.

Customer message

“Hi, I just noticed I was charged twice for my subscription this month — $49 on March 3rd and again on March 5th. I’ve been a customer for two years and this is really frustrating. Can you help me get this sorted out?”

Blind evaluation — Which response is better?
2 outputs · Blind mode · Click the response you'd send

That’s one blind test. Blind Bench makes it the default for every prompt you ship.

Start collecting feedback

Their feedback becomes your next prompt.

Two annotations. One optimizer pass. A cleaner system message — with every edit cited to the comment that drove it.

Optimizer · v3 → v4
v3 · Previous system message

You are a helpful customer support agent. Answer the user’s question.

Reviewer annotations from the v3 run
Output A

“Reads like a corporate form letter.”

on "Dear valued customer, I appreciate your inquiry."

Output B

“Too robotic — ‘I understand your concern’ is filler.”

on "I understand your concern and will assist you."

v4 · Proposed system message

You are a customer support agent. Be friendly and professional — like a helpful coworker, not a corporate robot. Avoid formal openers (“Dear valued customer”) and form-letter phrases (“I understand your concern”). Get to the point and be warm.

Why these changes

Both outputs drifted on tone. Output A went corporate; Output B defaulted to empty filler. The rewrite adds explicit prohibitions and a positive anchor (“helpful coworker, not a corporate robot”).

What Blind Bench does that others don’t.

Prompt tools edit. Eval tools measure. Blind Bench gets real humans in the loop — and turns their feedback into edits.

Prompt tools (Promptfoo, PromptLayer)

What went wrong: they edit prompts, but no humans are in the loop.
Blind Bench: one-link feedback from real humans.

Eval tools (LangSmith, Humanloop)

What went wrong: they measure performance, but collect no qualitative feedback.
Blind Bench: annotations, not just scores.

Google Docs / Slack (or email threads)

What went wrong: no blinding, scattered feedback, nothing applies it.
Blind Bench: blind by default; feedback becomes edits.
  • BYOK

    Your OpenRouter key. Your compute. Your costs.

  • Blind by design

    Metadata stripped at the API boundary, not the DOM (see the sketch after this list).

  • No reviewer accounts

    One link. They annotate. They close the tab.

  • Five-minute setup

    No credit card. No sales call.
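For developers who want to see what "BYOK" and "blind by design" could mean concretely, here is a minimal sketch. It assumes OpenRouter's OpenAI-compatible chat-completions endpoint and Node 18+ for the global fetch; the function names and payload shape are assumptions for illustration, not Blind Bench's actual code.

```ts
// A minimal sketch of BYOK + blinding at the API boundary.
// Assumes Node 18+ (global fetch) and OpenRouter's OpenAI-compatible API.
// Function names and payload shapes are illustrative, not Blind Bench's code.

const OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions";

async function complete(model: string, system: string, user: string): Promise<string> {
  // BYOK: the prompt author's own OpenRouter key pays for every call.
  const res = await fetch(OPENROUTER_URL, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model,
      messages: [
        { role: "system", content: system },
        { role: "user", content: user },
      ],
    }),
  });
  const data = await res.json();
  return data.choices[0].message.content;
}

// Blinding happens server-side, before any payload is built for the
// reviewer's browser: model names are dropped and the order is shuffled,
// so there is nothing to un-hide in the DOM.
async function buildBlindPayload(system: string, user: string, models: string[]) {
  const texts = await Promise.all(models.map((m) => complete(m, system, user)));
  for (let i = texts.length - 1; i > 0; i--) {
    // Fisher-Yates shuffle so output order leaks nothing about the model list.
    const j = Math.floor(Math.random() * (i + 1));
    [texts[i], texts[j]] = [texts[j], texts[i]];
  }
  return texts.map((text, i) => ({ label: String.fromCharCode(65 + i), text }));
}
```

The point of the sketch is where the stripping happens: the object returned to the browser contains only a label and text, so no amount of DOM inspection can reveal which model wrote which output.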

Stop shipping prompts blind.

Real humans in the loop. First setup takes five minutes.