A 30-minute check anyone can run — builders, buyers, and users.
Three checks. Twelve questions. One page of audit. For systems that generate text, image, audio, video, or code — and for agentic systems that take action on behalf of users.
Each check addresses a distinct way bias enters generative and agentic AI systems.
Training · alignment · retrieval
Interpretation · authority · action
Output · feedback · compound effects
The canvas works for builders, buyers, and users but each starts from a different place.
All twelve questions apply. Run the full canvas with your team. Use the capture template as your working audit document. Each finding becomes a regression test in your eval suite.
Check 1 questions become vendor questions. Vendor silence is itself a procurement signal. Check 2 and Check 3 become evaluation tests run with vendor cooperation during procurement.
Check 1 is mostly opaque for you — note that as a finding. Run yourself: Check 2 Q1 and Q4, Check 3 Q1 and Q3. No admin access required. Highest-leverage half hour: Check 3 Q1.
If something serious surfaces, you have three choices. Make it deliberately and document the choice.
Push it up the chain with evidence and a script:
"I ran a basic bias audit and found a 20% disparity in how our agent treats Group X. We need to discuss the legal and reputational risk."Limit deployment. Add guardrails. Write the trade-off down. The audit found something; mitigating it isn't always possible, but acknowledging it is.
Don't ship, buy or use. The hardest option has to be on the list, or the audit is theatre.
For builders shipping production systems, Check 3 Q1 and Q2 should evolve into structured evaluation — a test suite you build once and run on every change.
Eval set design. Representative inputs across demographics, languages, edge cases. Pull from existing benchmarks where they fit — HELM, BBQ, BOLD, WinoBias, StereoSet. Build custom counterfactual sets where they don't. Each canvas finding becomes a regression test.
Metrics. Choose thresholds for per-group disparity, refusal-rate variance, generation quality, fabrication asymmetry. Track per-group, not aggregate. Publish thresholds before you measure. Moving the threshold to fit results is the most common failure mode.
Automation. Run on every commit, every model swap, every RAG corpus update. The eval suite automates the running. Connect failures to your CI/CD so a bias regression blocks deploy the same way a unit test failure does.
Tracking. Keep a history. The capture template's revision log is the lightest possible version. Production teams will want this in dashboards with trend lines per metric per group. The audit becomes infrastructure, not an event.
Both the canvas and the capture template are free to download and adapt. Released under Creative Commons.
This canvas is shared to do three things: pass on what I've learned, invite feedback from others working in this space, and contribute to the wider conversation about building better norms and practices around responsible AI.
If you've used it, adapted it, or have thoughts on the framing — I'd love to hear from you. Reach me through the contact form or on LinkedIn.