AI Governance is a Requirements Problem. Treat It Like One.
Most enterprise AI governance frameworks describe the goal without specifying the work. Here’s what it actually takes to turn “human oversight” from a compliance posture into something that holds up in practice.
The previous piece in this series closed with a question worth sitting with: have you defined, in writing, what your human reviewers are supposed to evaluate, and the criteria they should apply? For most organizations running AI systems today, the honest answer is no. This piece is about what closing that gap actually looks like.
To make the prescription tangible rather than abstract, I want to work through a single example that we’ll carry through this entire piece. Consider an enterprise legal team that has deployed an AI-powered contract review system. The AI reads incoming vendor contracts, flags clauses that deviate from the company’s approved terms, categorizes each flag by risk level, and presents its analysis to a contracts manager for review before sign-off. This is a system where human oversight isn’t optional; the consequences of a missed liability clause or an unfavorable indemnification term are measured in real dollars. It’s also a system where the oversight is particularly vulnerable to fatigue, because most contract language is standard boilerplate, the AI is correct most of the time, and the reviewer’s attention degrades with every contract that passes without incident.
Here is an illustrative governance specification for that system, showing what oversight looks like when you treat it as a requirements problem rather than a compliance posture:
| Parameter | Specification |
|---|---|
| Reviewer | Contracts Manager, minimum 3 years vendor agreement experience |
| Output under review | AI-flagged contract clauses, categorized by risk level (high / medium / low), with AI’s reasoning for each flag |
| Review criteria | Evaluate flagged clauses against the company’s approved terms playbook; confirm AI’s risk categorization; assess whether recommended redline language is appropriate for contract type and counterparty |
| Authority | Accept AI recommendation, modify suggested redline, override AI’s risk classification with documented justification, or escalate to senior counsel |
| Load constraint | Maximum 15 contracts per reviewer per day; system auto-escalates to senior counsel if reviewer approval rate exceeds 90% over a rolling window of reviews |
| Time-to-respond monitoring | System tracks elapsed time between flag presentation and reviewer disposition; rapid-succession approvals (e.g., under 30 seconds per flag) trigger a pause prompt or auto-escalation |
| Cross-contract consistency | AI monitors for disposition inconsistencies across contracts (e.g., reviewer approved an indemnification clause in one contract but revised a similar clause in another); inconsistencies flagged for reviewer acknowledgment before final sign-off |
| Escalation path | Uncertain flags route to senior counsel (system does not default to approve); review backlog exceeding 48 hours triggers alert to Legal Ops manager |
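One virtue of a specification this concrete is that its guardrails can be encoded as executable policy rather than prose. Here is a minimal Python sketch of the load-constraint row: the class names, field names, and the ten-disposition minimum are my own illustrative choices, and the thresholds are the hypothetical values from the table above.

```python
from dataclasses import dataclass

@dataclass
class OversightPolicy:
    """Illustrative thresholds drawn from the specification table above."""
    max_contracts_per_day: int = 15
    approval_rate_threshold: float = 0.90  # auto-escalate above this rate
    rapid_approval_seconds: float = 30.0   # flags dispositioned faster than this
    backlog_alert_hours: int = 48

@dataclass
class ReviewerSession:
    """Running tally of one reviewer's dispositions in a session."""
    contracts_reviewed: int = 0
    approvals: int = 0
    dispositions: int = 0

    def record(self, approved: bool) -> None:
        self.dispositions += 1
        if approved:
            self.approvals += 1

    @property
    def approval_rate(self) -> float:
        return self.approvals / self.dispositions if self.dispositions else 0.0

def should_escalate(policy: OversightPolicy, session: ReviewerSession) -> bool:
    """True when the session has drifted past the policy's guardrails."""
    over_load = session.contracts_reviewed >= policy.max_contracts_per_day
    # Require a minimum sample so one early approval doesn't trip the alarm.
    rubber_stamping = (session.dispositions >= 10
                       and session.approval_rate > policy.approval_rate_threshold)
    return over_load or rubber_stamping
```

The point is not this particular code; it is that a threshold written down this precisely can be enforced by the system, while "oversight is required" cannot.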
If your reaction to that table is “we don’t have anything like this,” you’re in the majority. Most organizations deploying AI systems with human oversight have a policy that says oversight is required. Almost none have a specification that says what the oversight actually consists of.
Without it, the scenario is predictable. The contracts manager reviews the first five contracts carefully, evaluating each flagged clause against the playbook, confirming the AI’s reasoning, applying genuine judgment. By contract ten, the flags are starting to look familiar, because most vendor agreements use similar language and the AI has been correct on every flag so far. By contract twenty, the reviewer is scanning rather than reading, approving flags in batches, trusting the pattern. By contract thirty (if the system even allows them to get that far in a day), the liability clause buried in a non-standard vendor agreement gets the same reflexive approval as the boilerplate confidentiality flag twelve contracts earlier.
This is not a personnel failure. It is a design failure. The system was deployed without specifying the cognitive limits of the review task, without defining the escalation logic that protects against fatigue-driven drift, and without designing the AI’s output to make genuine scrutiny sustainable. Governance, when treated as requirements work, prevents exactly this. Governance treated as a posture just documents that a reviewer was assigned.
NIST’s AI Risk Management Framework, and its Generative AI Profile (NIST AI 600-1), are organized around four functions: Govern, Map, Measure, Manage. Most organizations treat them as a compliance checklist. My contention is that, taken seriously as a process, they form what is structurally a requirements workflow. Here is what each function produces when applied to the contract review system:
| NIST Function | Requirements Equivalent | Applied to Contract Review |
|---|---|---|
| Govern | Scope and constraints | Legal Operations owns the outcome if a bad clause passes review. Risk tolerance: zero for high-risk flags (indemnification, liability caps, IP assignment); medium for low-risk formatting deviations. Reviewer authority: override with documented justification. |
| Map | Use case analysis | Not all contracts need the same level of review. Standard NDAs with pre-approved counterparties can be auto-approved. Non-standard vendor agreements and contracts above a dollar threshold require full human review. This is where you decide what the reviewer’s attention is actually spent on. |
| Measure | Acceptance criteria | The reviewer applies the approved terms playbook, which specifies acceptable clause language by contract type. If the criteria aren’t specific enough to write down and hand to a new hire, the review is opinion, not oversight. |
| Manage | Escalation and workflow design | Maximum 15 contracts per day. Auto-escalation if approval rate exceeds 90%. Time-to-respond monitoring flags rapid-succession approvals. Cross-contract consistency checks catch disposition discrepancies. Uncertain flags route to senior counsel rather than defaulting to approve. These are the guardrails that prevent fatigue from becoming the system’s actual decision-maker. |
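The Map row is the one most often left implicit, so it is worth showing how little it takes to make it explicit. This is a hedged sketch only: the counterparty names, the `triage` function, and the $100,000 threshold are hypothetical stand-ins for whatever your own use case analysis produces.

```python
from enum import Enum, auto

class ReviewPath(Enum):
    AUTO_APPROVE = auto()
    FULL_HUMAN_REVIEW = auto()

# Hypothetical values illustrating the Map decision above.
PRE_APPROVED_COUNTERPARTIES = {"Acme Corp", "Globex"}
FULL_REVIEW_VALUE_THRESHOLD = 100_000  # dollars; illustrative only

def triage(contract_type: str, counterparty: str, value_usd: float) -> ReviewPath:
    """Route a contract: standard NDAs with pre-approved counterparties skip
    human review; anything non-standard or above the value threshold gets one."""
    if (contract_type == "standard_nda"
            and counterparty in PRE_APPROVED_COUNTERPARTIES
            and value_usd < FULL_REVIEW_VALUE_THRESHOLD):
        return ReviewPath.AUTO_APPROVE
    return ReviewPath.FULL_HUMAN_REVIEW
```

Every contract that routes to `AUTO_APPROVE` is attention the reviewer gets back for the contracts that actually need judgment.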
What this table represents is not a compliance artifact. It’s a requirements document for an oversight architecture, and every row answers a question that, left unanswered, becomes a gap where the rubber-stamping lives.
Here is something that seems obvious from where I sit but runs against how most enterprises are organized: the specification I’ve been building throughout this piece cannot be produced by a governance team working alone, or a technology team, or a legal team. It requires UX strategy and AI strategy in the same room, at the same table, before deployment.
Return to the contract review system. The AI design question is: given that the AI reads every clause in every contract, which flagged clauses genuinely require human review, and how should the AI structure and present those flags so that the review task is tractable rather than overwhelming? If the AI flags forty clauses per contract because its sensitivity is set too high, no amount of UX design can prevent the resulting fatigue. The AI’s scoping decisions are, functionally, a UX decision whether anyone frames them that way or not.
The UX design question runs in the other direction: given what we know about cognitive load and automation bias, how do we design the review interface so that genuine scrutiny is the path of least resistance? Does the interface show the flagged clause alongside the approved playbook language, so the reviewer can compare directly rather than switching between documents? Does it require the reviewer to select a specific disposition (accept, modify, or escalate) with a reason, rather than offering a single “approve” button? Does the system surface a different visual treatment for the first high-risk flag the reviewer encounters after a long run of low-risk approvals, to counteract the attentional drift that accumulates when nothing has required real judgment for a while?
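The "no single approve button" idea can be made concrete in the data model itself. As a sketch only, with hypothetical names throughout: if every decision record requires a disposition and a stated reason, the interface cannot offer a reflexive one-click approval even if a designer later wants to.

```python
from dataclasses import dataclass
from enum import Enum

class Disposition(Enum):
    ACCEPT = "accept"
    MODIFY = "modify"
    ESCALATE = "escalate"

@dataclass(frozen=True)
class FlagDecision:
    """A reviewer's decision on one flagged clause. There is deliberately no
    bare 'approve' path: every disposition must carry a stated reason."""
    flag_id: str
    disposition: Disposition
    reason: str

    def __post_init__(self) -> None:
        if not self.reason.strip():
            raise ValueError("A disposition requires a stated reason.")
```

A constraint enforced at this layer survives interface redesigns; a constraint enforced only by the current screen layout does not.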
And there is a third dimension that only becomes visible when both disciplines are at the same table: the AI can monitor the reviewer’s own behavior. If the system tracks time-to-respond and detects that the reviewer is approving flags in rapid succession, that’s a fatigue signal the system can act on, whether by prompting a break, auto-escalating the next batch, or surfacing a simple check: “You’ve approved the last twelve flags in under two minutes. Would you like to continue or pause?” Similarly, if the AI notices that the reviewer approved an indemnification clause in one contract but revised a nearly identical clause in another, it can flag the inconsistency before final sign-off rather than letting it pass silently. Most people would not think to build these safeguards without first understanding, through the UX lens, how cognitive fatigue manifests in repetitive review tasks. The safeguards that actually prevent fatigue-driven failure don’t come from governance, AI engineering, or UX design working in isolation; they come from the conversation among all three.
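Mechanically, the behavioral monitoring described above amounts to a pair of simple checks. This is a minimal Python illustration under stated assumptions: the function names, the twelve-flag window, and the shape of the disposition history are all hypothetical, standing in for whatever instrumentation your review system actually exposes.

```python
def fatigue_signal(decision_seconds: list[float],
                   window: int = 12,
                   rapid_threshold: float = 30.0) -> bool:
    """Hypothetical fatigue heuristic: True if each of the last `window`
    flags was dispositioned faster than `rapid_threshold` seconds."""
    recent = decision_seconds[-window:]
    return len(recent) == window and all(t < rapid_threshold for t in recent)

def inconsistent_dispositions(history: dict[str, set[str]]) -> list[str]:
    """Cross-contract consistency sketch: clause categories (e.g.
    'indemnification') that received conflicting dispositions across
    contracts within one session."""
    return sorted(cat for cat, dispositions in history.items()
                  if len(dispositions) > 1)
```

A `True` from the first check is what triggers the pause prompt; a non-empty list from the second is what gets surfaced for acknowledgment before final sign-off.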
These questions constrain each other, and that is the point. The answer to the AI question shapes what the UX question even needs to ask, and vice versa. The HBS finding I discussed in the previous piece, where narrative AI made evaluators 19 percentage points more likely to defer to AI recommendations, is a UX finding as much as it is an AI finding: the design of how the recommendation was presented shaped the quality of the decision. If the contract review system’s AI presents its analysis alongside a confident explanation, and the UX doesn’t counterbalance that confidence with a friction point that asks the reviewer to engage critically, the HBS result predicts exactly what will happen.
What produces meaningful oversight is the integration of these two design streams: specifying what the AI needs to do for the human, and what the human needs to do for the system, before either one is built. That integration is requirements work. The EU AI Act’s mandate for “people with diverse profiles from different parts of the company” in AI governance, enforceable as of August 2026, reflects this reality at a regulatory level: you can’t build a governance architecture that functions from inside a single discipline. You need UX designers at that table because the user experience of the oversight mechanism determines whether the oversight actually works.
The contract review example is one system. Your organization almost certainly has others: AI-generated reports distributed to leadership, customer-facing AI assistants, automated decisioning in procurement or hiring or fraud detection. The specifics will differ, but the underlying questions are the same, and they are worth asking about any AI deployment with a human oversight component.
What outputs carry enough consequence to require human review, and what qualifies someone to do that reviewing? What specific criteria should the reviewer apply, and are those criteria clear enough to hand to someone who hasn’t seen the system before? Who owns the outcome if the reviewer misses something, and what real authority does the reviewer have to override the AI’s output? And if the reviewer is fatigued, uncertain, or overloaded, what does the system do?
If you can answer all four in writing, you have the skeleton of a governance specification. If you can’t answer one of them, you’ve identified exactly where the gap is. The next step is not a better policy document; it’s getting the people designing the AI and the people designing the interaction into the same room, because that conversation is where the specification actually gets built.
Matt leads Planorama Design, a product acceleration firm for enterprise software teams. With nearly 30 years of engineering experience, he helps CTOs and VPs of Engineering structure requirements, validate AI feasibility, and ship better software faster.