The Practical Bias Audit

The Practical Bias Audit
for Generative & Agentic AI

A 30-minute check anyone can run — builders, buyers, and users.

Three checks. Twelve questions. One page of audit. For systems that generate text, image, audio, video, or code — and for agentic systems that take action on behalf of users.

Free to use CC-BY-SA 4.0 v2 · May 2026

Three checks. Twelve questions.

Each check addresses a distinct way bias enters generative and agentic AI systems.

CHECK 01 OF 3

Who is missing?

Training · alignment · retrieval

Whose data was the model trained on? Whose languages and perspectives are over- or under-represented? e.g., does the model card document training data demographics? Silence is itself a finding.
Who labelled the preferences during alignment? Whose register became the default? e.g., RLHF labellers shape what "professional" tone means by default.
For RAG: whose documents are in the corpus? Whose voices are systematically missing? e.g., medical RAG over Western journals underserves non-Western patients.
Who wrote the system prompt and safety guidelines? Whose risks were taken seriously? e.g., whose harms made it into the safety policy?

CHECK 02 OF 3

Whose defaults?

Interpretation · authority · action

When the user is vague, what does the system assume? Same prompt, different demographic markers. What shifts? e.g., "advice for a father vs mother of young children" does framing, content, or tone shift?
Whose claims does the system treat as authoritative? Whose sources get cited; whose appear without attribution? e.g., peer-reviewed Western sources cited; community knowledge paraphrased.
For agentic systems: what is the agent allowed to do, and for whom? Asymmetric action across users? e.g., a loan agent that auto-approves under one demographic, routes another to review.
Does the system refuse the same request differently across users or produce different quality output? e.g., refusal symmetry: same request, different inferred demographics.

CHECK 03 OF 3

Failure modes

Output · feedback · compound effects

Disaggregated quality test. Same task, different demographics. Where is quality lower, refusals more frequent, fabrication different? e.g., summarise ten CVs matched on credentials — does the summary inflate, diminish, or fabricate differently?
For multi-step or agentic systems: do small biases compound across the chain? Inspect aggregate outcomes, not just each step. e.g., each step passes individually; the pipeline aggregate disadvantages a group.
When the system gets it wrong, what is the recourse? Is there a contestation path? Does anyone monitor systematic patterns? e.g., the UnitedHealthcare naviHealth class action: AI denying claims with no meaningful appeal.
Is the system reshaping the world it operates in? Synthetic content training future models, agent actions reshaping the data the agent reads. e.g., the system's own outputs become training data for its next iteration.

Where you sit changes what you run.

The canvas works for builders, buyers, and users but each starts from a different place.

Build path

If you build the system

All twelve questions apply. Run the full canvas with your team. Use the capture template as your working audit document. Each finding becomes a regression test in your eval suite.

Buy path

If you procure or buy

Check 1 questions become vendor questions. Vendor silence is itself a procurement signal. Check 2 and Check 3 become evaluation tests run with vendor cooperation during procurement.

Use path

If you use a system you don't own

Check 1 is mostly opaque for you — note that as a finding. Run yourself: Check 2 Q1 and Q4, Check 3 Q1 and Q3. No admin access required. Highest-leverage half hour: Check 3 Q1.

Three options. Always.

If something serious surfaces, you have three choices. Make it deliberately and document the choice.

01 · Escalate

Push it up the chain with evidence and a script:

"I ran a basic bias audit and found a 20% disparity in how our agent treats Group X. We need to discuss the legal and reputational risk."

02 · Document and constrain

Limit deployment. Add guardrails. Write the trade-off down. The audit found something; mitigating it isn't always possible, but acknowledging it is.

03 · Walk away

Don't ship, buy or use. The hardest option has to be on the list, or the audit is theatre.

From audit to continuous evaluation.

For builders shipping production systems, Check 3 Q1 and Q2 should evolve into structured evaluation — a test suite you build once and run on every change.

Eval set design. Representative inputs across demographics, languages, edge cases. Pull from existing benchmarks where they fit — HELM, BBQ, BOLD, WinoBias, StereoSet. Build custom counterfactual sets where they don't. Each canvas finding becomes a regression test.

Metrics. Choose thresholds for per-group disparity, refusal-rate variance, generation quality, fabrication asymmetry. Track per-group, not aggregate. Publish thresholds before you measure. Moving the threshold to fit results is the most common failure mode.

Automation. Run on every commit, every model swap, every RAG corpus update. The eval suite automates the running. Connect failures to your CI/CD so a bias regression blocks deploy the same way a unit test failure does.

Tracking. Keep a history. The capture template's revision log is the lightest possible version. Production teams will want this in dashboards with trend lines per metric per group. The audit becomes infrastructure, not an event.

The Practical Bias Audit
for Generative & Agentic AI

Three checks. Twelve questions.

Who is missing?

Whose defaults?

Failure modes

Where you sit changes what you run.

If you build the system

If you procure or buy

If you use a system you don't own

Three options. Always.

From audit to continuous evaluation.

Take it. Use it. Adapt it.

Shared in the spirit of contributing.