A customer-facing bot quotes a discount the company has never offered. An internal copilot signs off on vendor spend that breaks procurement policy. A support agent promises a refund window that closed two quarters ago.
The model is not hallucinating in the textbook sense. It is filling a vacuum. There is no structured representation of the rule anywhere the agent can reach, so the model does what language models do — pattern-matches across training data and produces something that sounds plausible.
Better prompts will not fix this. Prompts are natural language, and natural language is ambiguous on purpose. The fix is encoding business rules in machine-readable formats that sit next to the model at inference time and constrain its outputs against logic the company has actually agreed to.
The second-order effect is uncomfortable: the more capable the model, the more confidently it invents business logic. A smarter model produces more convincing wrong answers. Model upgrades do not close this gap. They widen it.
What the Org Knows Implicitly, the Model Cannot See at All
The fight: tribal memory accumulated over years versus a stateless model that sees only what you hand it.
Every company runs on thousands of rules nobody has written down in structured form. Pricing tiers, approval thresholds, regional compliance variants, contractual carve-outs, escalation paths, SLA definitions. They live in employee heads, scattered Confluence pages, PDF policy manuals, and the onboarding conversations that never made it to a doc.
Humans navigate this fine. A senior account manager knows enterprise EU clients get net-60 and APAC clients get net-30. A compliance officer knows transactions above $10,000 need a second signature. The knowledge is implicit, contextual, and usually correct because humans have been running their own reinforcement loop against organizational feedback for years.
AI systems get a prompt, maybe some retrieved context from a vector store, and a generation step. If the vector store contains a two-year-old pricing PDF, the model quotes two-year-old prices with full confidence. There is no internal voice telling it the document is stale. There is no senior coworker shaking their head.
The gap between what humans carry implicitly and what the agent needs explicitly is one of the most common failure modes in deployed AI systems. RAG handles factual recall. It does not handle conditional logic. "The refund policy is 30 days" is not the same fact as "the refund policy is 30 days for consumer accounts, 90 days for enterprise with an active support contract, and 0 days for custom integrations after acceptance testing completes."
Branching, conditional, exception-laden logic needs structure. Not paragraphs. Structure.
Four Encoding Patterns. Each One Earns Its Overhead Differently.
Not every rule needs the same machinery. Match the format to the failure mode it has to stop.
Formalization has a cost. Pay it where it buys you something. A pricing tier needs a lookup table. Multi-jurisdiction compliance needs a policy engine. Treating both as the same problem is how rules-as-code projects collapse under their own ceremony. Below: the four patterns, ordered by complexity and the failure class each one is built to absorb.
| Pattern | Best For | Format | Runtime Cost | Maintenance Burden |
|---|---|---|---|---|
| Decision Tables | Pricing, eligibility, tier assignment | JSON / YAML / CSV | Microseconds | Low — business teams own the edits |
| Rule Engines | Multi-step logic, chained conditions | Drools / DMN / GoRules | Milliseconds | Medium — needs rule-authoring fluency |
| Policy-as-Code | Access control, compliance, guardrails | OPA Rego / Cedar / Cerbos | Milliseconds | Medium — needs policy engineering |
| Constraint Solvers | Scheduling, allocation, optimization | OR-Tools / Z3 / MiniZinc | Seconds | High — needs mathematical modeling |
Pattern 1: Decision Tables Solve Most of the Problem
The simplest encoding that buys most of the leverage. Start here. Graduate later.
Decision tables map input conditions to output actions. They are the oldest formalization of business rules — insurance carriers have been running them since the 1960s — and they remain the most practical starting point for AI-accessible logic.[3]
The mechanism is plain. Conditions are columns. Outputs are columns. Each row is a rule. At inference time the agent (or a middleware layer) evaluates the input against the table and returns the matching output. No ambiguity. No generation. No guessing.
rules/pricing-tiers.json
{
  "table": "pricing_tiers",
  "version": "2026-03-01",
  "conditions": ["account_type", "annual_spend", "region"],
  "outputs": ["discount_pct", "payment_terms", "support_tier"],
  "rules": [
    {
      "when": {
        "account_type": "enterprise",
        "annual_spend": ">= 500000",
        "region": "NA"
      },
      "then": {
        "discount_pct": 25,
        "payment_terms": "net-60",
        "support_tier": "dedicated"
      }
    },
    {
      "when": {
        "account_type": "enterprise",
        "annual_spend": "< 500000",
        "region": "NA"
      },
      "then": {
        "discount_pct": 15,
        "payment_terms": "net-45",
        "support_tier": "priority"
      }
    },
    {
      "when": {
        "account_type": "startup",
        "annual_spend": "*",
        "region": "*"
      },
      "then": {
        "discount_pct": 10,
        "payment_terms": "net-30",
        "support_tier": "standard"
      }
    }
  ]
}
The leverage point: tables can be injected into the agent's context window or called as a tool. The model does not need to understand the table. It needs to look up the right row given customer attributes and return the result. That is retrieval, not generation. Retrieval is the part LLMs are reliable at.
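When the table is called as a tool rather than pasted into context, the lookup itself can be a few lines of deterministic code. The sketch below is one possible evaluator for the `when`/`then` shape shown above, not the article's implementation. The names `TableRule`, `cellMatches`, and `lookupRow` are hypothetical, and the condition syntax (`*`, exact value, or a comparison such as `>= 500000`) is an assumption read off the sample table.

```typescript
// Hypothetical shapes mirroring the pricing_tiers table; not a real library API.
type Value = string | number;
type Facts = Record<string, Value>;

interface TableRule {
  when: Record<string, string>; // "*", an exact value, or a comparison like ">= 500000"
  then: Record<string, Value>;
}

// Evaluate one condition cell against one input value.
function cellMatches(condition: string, value: Value | undefined): boolean {
  if (condition === "*") return true; // wildcard matches anything
  if (value === undefined) return false; // a missing input never matches a concrete cell
  const cmp = /^(>=|<=|>|<)\s*(-?\d+(?:\.\d+)?)$/.exec(condition);
  if (cmp !== null) {
    if (typeof value !== "number") return false;
    const bound = Number(cmp[2]);
    switch (cmp[1]) {
      case ">=": return value >= bound;
      case "<=": return value <= bound;
      case ">": return value > bound;
      default: return value < bound;
    }
  }
  return String(value) === condition; // exact match
}

// First matching row wins. null means no rule covers the input, so the caller can escalate.
function lookupRow(
  rules: TableRule[],
  facts: Facts
): { row: number; then: Record<string, Value> } | null {
  for (let row = 0; row < rules.length; row++) {
    const rule = rules[row];
    const allMatch = Object.keys(rule.when).every((key) =>
      cellMatches(rule.when[key], facts[key])
    );
    if (allMatch) return { row, then: rule.then };
  }
  return null;
}

// The three rows from rules/pricing-tiers.json.
const pricingRules: TableRule[] = [
  {
    when: { account_type: "enterprise", annual_spend: ">= 500000", region: "NA" },
    then: { discount_pct: 25, payment_terms: "net-60", support_tier: "dedicated" },
  },
  {
    when: { account_type: "enterprise", annual_spend: "< 500000", region: "NA" },
    then: { discount_pct: 15, payment_terms: "net-45", support_tier: "priority" },
  },
  {
    when: { account_type: "startup", annual_spend: "*", region: "*" },
    then: { discount_pct: 10, payment_terms: "net-30", support_tier: "standard" },
  },
];
```

Returning `null` for uncovered inputs (a consumer account, or an enterprise account outside NA in this sample) is a deliberate choice in this sketch: the caller can route to a human instead of letting the model fill the gap.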
Pattern 2: When the Discount Depends on the Credit Score That Depends on the Account Age
Decision tables stop being enough the moment one rule's input is another rule's output.
Decision tables break when rules depend on each other. If the discount depends on the payment terms, which depend on the credit score, which depends on the account age — you need forward chaining or backward chaining through a dependency graph. That is what a rule engine does.
The Decision Model and Notation (DMN) standard, maintained by the Object Management Group, encodes those chains in a vendor-neutral way.[3] DMN uses a graphical notation for decision dependencies and FEEL (Friendly Enough Expression Language) for the rule logic itself. Treat it as SQL for business rules — constrained enough to be deterministic, expressive enough for real conditions.
rules/refund-eligibility.feel
// DMN decision in FEEL. Chained logic, not a single lookup.
// One source of truth for refund eligibility: the agent calls this, never invents it.
if account.type = "consumer" then
  if days_since_purchase <= 30 then "full_refund"
  else if days_since_purchase <= 90 then "store_credit"
  else "no_refund"
else if account.type = "enterprise" then
  if has_active_support_contract then
    if days_since_purchase <= 90 then "full_refund"
    else "prorated_refund"
  else
    if days_since_purchase <= 30 then "full_refund"
    else "no_refund"
else if account.type = "custom_integration" then
  if acceptance_testing_complete then "no_refund"
  else "full_refund"
else "escalate_to_manager"
When the agent has to answer "can this customer get a refund", it does not generate the answer. It calls the rule engine with the customer's attributes, gets a deterministic result, and uses the LLM only to communicate the result in natural language. Reasoning offloads to verified logic. The model handles tone.
This is the split that holds: rules for reasoning, LLMs for communication. IBM's work on combining rule-based engines with LLMs reports both higher accuracy and better explainability under this separation[5] — the system can cite the exact rule that produced the decision. The size of the accuracy gain depends on rule-set complexity and the model in use, but the direction is consistent.
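To make the split concrete, here is a minimal TypeScript sketch that mirrors the FEEL decision above and hands the model only a finished decision to phrase. It is an illustration, not a DMN engine: `decideRefund`, `buildCommunicationPrompt`, and the `ruleId` labels are hypothetical names introduced here so the decision can be cited in a trace.

```typescript
type RefundOutcome =
  | "full_refund"
  | "store_credit"
  | "prorated_refund"
  | "no_refund"
  | "escalate_to_manager";

interface RefundFacts {
  accountType: string;
  daysSincePurchase: number;
  hasActiveSupportContract?: boolean;
  acceptanceTestingComplete?: boolean;
}

interface RefundDecision {
  outcome: RefundOutcome;
  ruleId: string; // hypothetical label for the branch that fired
}

// Same branching as the FEEL expression; every path returns the rule that produced it.
function decideRefund(f: RefundFacts): RefundDecision {
  if (f.accountType === "consumer") {
    if (f.daysSincePurchase <= 30) return { outcome: "full_refund", ruleId: "consumer.within_30" };
    if (f.daysSincePurchase <= 90) return { outcome: "store_credit", ruleId: "consumer.within_90" };
    return { outcome: "no_refund", ruleId: "consumer.after_90" };
  }
  if (f.accountType === "enterprise") {
    if (f.hasActiveSupportContract === true) {
      if (f.daysSincePurchase <= 90) return { outcome: "full_refund", ruleId: "enterprise.contract.within_90" };
      return { outcome: "prorated_refund", ruleId: "enterprise.contract.after_90" };
    }
    if (f.daysSincePurchase <= 30) return { outcome: "full_refund", ruleId: "enterprise.no_contract.within_30" };
    return { outcome: "no_refund", ruleId: "enterprise.no_contract.after_30" };
  }
  if (f.accountType === "custom_integration") {
    if (f.acceptanceTestingComplete === true) return { outcome: "no_refund", ruleId: "custom.accepted" };
    return { outcome: "full_refund", ruleId: "custom.not_accepted" };
  }
  return { outcome: "escalate_to_manager", ruleId: "fallback.unknown_account_type" };
}

// The model receives a decided outcome and is asked only to word it.
function buildCommunicationPrompt(d: RefundDecision): string {
  return (
    `The refund decision is "${d.outcome}" (rule ${d.ruleId}). ` +
    "Explain this outcome to the customer politely. Do not change or reinterpret the decision."
  );
}
```

The `ruleId` on every return path is what lets a response cite the exact rule behind it.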
| Without encoded rules | With rules as code |
|---|---|
| Model invents discount percentages from training data | Discount pulled from a versioned decision table |
| Same customer, different phrasing, different answer | Identical inputs produce identical outputs every time |
| No audit trail for why a specific decision was made | Full trace: input -> matched rule -> output -> version |
| Updating a policy means rewriting prompts and hoping | Updating policy means changing one row in a table |
| Compliance team cannot inspect what the agent 'knows' | Compliance team reads the rules directly, in the table format |
Pattern 3: Encode What the Agent Must Not Do
Decision tables and rule engines say what to do. Policy-as-code is the layer that draws the hard lines.
Policy-as-code inverts the framing. Instead of encoding what the agent should do, it encodes what the agent must not do. It is a constraint system — allow/deny rules evaluated against every action the agent attempts.
Open Policy Agent (OPA) and its Rego language are the de facto standard for this layer.[2] OPA is a general-purpose policy engine, best known for Kubernetes admission control and API authorization. The architecture maps cleanly onto agent guardrails: evaluate a structured request against a policy bundle, return allow or deny, log the decision for audit. Evaluating a simple policy typically takes milliseconds or less. The semantics are deterministic. The prompt cannot override it.
policies/agent-guardrails.rego
package agent.guardrails

import rego.v1

# Block discounts above the maximum for the account tier.
# Note: an account_type missing from max_discount leaves this rule undefined,
# so unknown tiers are not denied here.
deny contains msg if {
    input.action == "apply_discount"
    input.discount_pct > max_discount[input.account_type]
    msg := sprintf("Discount %d%% exceeds max %d%% for %s accounts",
        [input.discount_pct, max_discount[input.account_type], input.account_type])
}

max_discount := {
    "enterprise": 30,
    "startup": 15,
    "consumer": 10,
}

# Block transactions above threshold without dual approval.
deny contains msg if {
    input.action == "approve_transaction"
    input.amount > 10000
    not input.has_second_signature
    msg := "Transactions above $10,000 require dual approval"
}

# Block PII disclosure in outbound responses.
deny contains msg if {
    input.action == "send_response"
    contains_pii(input.response_text)
    msg := "Response contains PII: must be redacted before sending"
}

# Assumption: minimal stand-in for the PII detector this policy relies on.
# It only flags email-like strings; swap in a real detector.
contains_pii(text) if {
    regex.match(`[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}`, text)
}
AWS Bedrock's Automated Reasoning checks demonstrate the same pattern at scale: they translate natural language policies into formal logic that validates AI outputs, with high verification accuracy on well-defined policy domains (AWS reports up to 99% in specific benchmark settings).[6] The mechanism works because formal logic can be checked exhaustively in a way prompt engineering cannot. For a well-defined domain you can verify that a policy covers every input combination. You cannot do that for a system prompt.
The runtime topology is simple: the agent generates a proposed action, the policy engine evaluates it against the current rule set, and only approved actions reach the user. Denied actions route to a fallback: human escalation, a safe default, or a re-generation under tighter constraints. As long as every action path goes through the policy engine, the prompt has no way around it.
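The propose, evaluate, deliver-or-fallback flow can be sketched in a few lines of TypeScript. This is an illustration under stated assumptions, not a client for any real policy engine: `checkPolicy` is an in-memory stand-in that reuses the discount and dual-approval limits from the Rego example, and `gate`, `ProposedAction`, and `GateResult` are hypothetical names.

```typescript
interface ProposedAction {
  action: string;
  [key: string]: unknown;
}

// A policy check returns the list of denial messages; an empty list means allow.
type PolicyCheck = (input: ProposedAction) => string[];

type GateResult =
  | { status: "delivered"; action: ProposedAction }
  | { status: "escalated"; reasons: string[] };

// Hypothetical in-memory stand-in for the policy engine.
const MAX_DISCOUNT: Record<string, number> = { enterprise: 30, startup: 15, consumer: 10 };

const checkPolicy: PolicyCheck = (input) => {
  const reasons: string[] = [];
  if (input.action === "apply_discount") {
    const tier = String(input.account_type);
    const max: number | undefined = MAX_DISCOUNT[tier];
    const pct = Number(input.discount_pct);
    if (max === undefined) {
      reasons.push(`Unknown account type: ${tier}`); // fail closed on unknown tiers
    } else if (!(pct <= max)) {
      reasons.push(`Discount ${pct}% exceeds max ${max}% for ${tier} accounts`);
    }
  }
  if (
    input.action === "approve_transaction" &&
    Number(input.amount) > 10000 &&
    input.has_second_signature !== true
  ) {
    reasons.push("Transactions above $10,000 require dual approval");
  }
  return reasons;
};

// Only approved actions are delivered; anything denied goes to the fallback path.
function gate(action: ProposedAction, check: PolicyCheck): GateResult {
  const reasons = check(action);
  if (reasons.length === 0) return { status: "delivered", action };
  return { status: "escalated", reasons };
}
```

This stand-in denies unknown account types and non-numeric discounts, which is a choice made for the sketch. Whether a given policy should fail open or closed on unexpected input is worth deciding explicitly.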
The Hardest Part Is Not the Encoding. It Is the Extraction.
Most orgs do not have a clean inventory of their own rules. They have artifacts. Workflow starts here.
Encoding format is a tractable problem. Extraction is the structural one. Most companies do not have a clean inventory of their own rules. They have policy documents, employee handbooks, Slack threads where exceptions were negotiated, and institutional memory carried by people who might leave next quarter.
The first time we ran this extraction, we interviewed the compliance team and got back a clean list of about 40 rules. We encoded them, deployed, and watched the system fail on roughly 30% of edge cases. The missing rules were not in any document. They were in the billing team's institutional memory — eight years of enterprise exceptions that had never been written down. The lesson is structural, not motivational: plan for at least two extraction passes. The first captures documented rules. The second captures the undocumented rules that surface as production errors.
Four steps that hold up across industries.
- [01] Sweep every artifact that contains conditional logic
  Collect everything that holds rules: pricing sheets, compliance manuals, SLA agreements, approval matrices, HR policy, vendor contracts. Do not formalize yet. Gather. Most teams find 3-5x more rule-bearing documents than they expected. The undercount is the point. Rules without an explicit owner accumulate in the places where work has to happen anyway.
- [02] Classify rules by volatility, not by domain
  Rules change at different cadences. Tax thresholds shift annually. Pricing shifts quarterly. Compliance shifts when regulations move. Sort each rule into stable (yearly or less), volatile (quarterly), and dynamic (weekly or per-transaction). The classification picks the encoding format. Skipping it forces every rule into the same machinery and overcomplicates the cheap ones.
- [03] Encode in the simplest format that handles the rule
  Pick the lightest encoding that contains the rule's complexity. Start with decision tables: most rules collapse to a condition-to-output map. Graduate to a rule engine only when there are real chained dependencies. Choosing heavy machinery for a simple rule is a maintenance tax you will pay every quarter.
- [04] Wire the rules into the agent's execution path
  Make the rules reachable at inference time. Either inject them as a tool the agent must call, or run them as middleware between the model and the output. Do not put business rules in system prompts. Long contexts swallow them, they cannot be versioned independently, and the model will reinterpret them under adversarial input.
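Steps 02 and 03 can be captured as a small inventory record plus two helper functions. The sketch below is one reading of those steps, not a prescribed schema. The numeric cut-offs (at most one change per year for stable, at most four for volatile) are assumptions that approximate "yearly or less" and "quarterly", and every name in it is hypothetical.

```typescript
type Volatility = "stable" | "volatile" | "dynamic";
type Encoding = "decision_table" | "rule_engine" | "policy_as_code";
type ReviewCadence = "annual" | "quarterly" | "every_deployment";

// Hypothetical inventory record produced by the artifact sweep.
interface RuleInventoryEntry {
  name: string;
  changesPerYear: number; // estimated from history
  dependsOnOtherRules: boolean; // output of one rule feeds another
  isHardConstraint: boolean; // must never be violated, regardless of other results
}

// Assumed cut-offs: <= 1 change/year is stable, <= 4 is volatile, anything faster is dynamic.
function classifyVolatility(changesPerYear: number): Volatility {
  if (changesPerYear <= 1) return "stable";
  if (changesPerYear <= 4) return "volatile";
  return "dynamic";
}

// Lightest format that contains the rule: table first, engine only for chained logic.
function chooseEncoding(entry: RuleInventoryEntry): Encoding {
  if (entry.isHardConstraint) return "policy_as_code";
  if (entry.dependsOnOtherRules) return "rule_engine";
  return "decision_table";
}

// Review cadence follows volatility.
function reviewCadence(v: Volatility): ReviewCadence {
  if (v === "stable") return "annual";
  if (v === "volatile") return "quarterly";
  return "every_deployment";
}
```

In this reading, volatility drives how often a rule is reviewed, while the shape of the rule (plain lookup, chained dependency, hard constraint) drives the encoding.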
Three Layers, Working Together at Runtime
Decision lookups, chained reasoning, and policy guardrails. None of them is optional. Each one catches what the layer above it lets through.
A production rules-as-code setup runs three layers in series. The decision layer resolves lookups — pricing, eligibility, categorization. The logic layer handles chained reasoning — multi-step approval workflows, eligibility with dependencies. The policy layer is the final guardrail — blocking actions that violate hard constraints regardless of what the layers above returned. Skip any one of them and a specific failure class walks straight through to production.
Rules Repository Layout
rules/
├── decision-tables/
│   ├── pricing-tiers.json
│   ├── support-eligibility.json
│   ├── shipping-rates.json
│   └── discount-matrix.json
├── rule-engine/
│   ├── refund-eligibility.dmn
│   ├── approval-workflow.dmn
│   └── credit-assessment.dmn
├── policies/
│   ├── agent-guardrails.rego
│   ├── pii-protection.rego
│   ├── spend-limits.rego
│   └── regional-compliance.rego
├── schemas/
│   ├── rule-schema.json
│   └── audit-log-schema.json
└── CHANGELOG.md
lib/rules-middleware.ts
import { evaluate } from './rule-engine';
import { checkPolicy } from './policy-engine';
import { lookupDecisionTable } from './decision-tables';

interface AgentAction {
  type: string;
  params: Record<string, unknown>;
  context: Record<string, unknown>;
}

interface RuleResult {
  allowed: boolean;
  values: Record<string, unknown>;
  appliedRules: string[];
  deniedBy?: string;
}

export async function enforceBusinessRules(
  action: AgentAction
): Promise<RuleResult> {
  // Layer 1: decision table lookup. Cheap, deterministic, no model in the path.
  const tableResult = await lookupDecisionTable(
    action.type,
    action.params
  );

  // Layer 2: rule engine for chained logic.
  const engineResult = await evaluate({
    ...action.params,
    ...tableResult.values,
  });

  // Layer 3: policy guardrail. Final check. The prompt cannot override this.
  const policyResult = await checkPolicy({
    action: action.type,
    ...action.params,
    ...engineResult.values,
  });

  if (policyResult.denied) {
    return {
      allowed: false,
      values: {},
      appliedRules: policyResult.violatedPolicies,
      deniedBy: policyResult.violatedPolicies[0],
    };
  }

  return {
    allowed: true,
    values: { ...tableResult.values, ...engineResult.values },
    appliedRules: [
      ...tableResult.matchedRules,
      ...engineResult.firedRules,
    ],
  };
}
Every Decision Needs a Traceable Chain Back to a Specific Rule Version
When a regulator asks why an application was denied, 'the AI decided' is not an answer. The audit trail is the answer.
When a customer disputes a charge or a regulator asks why an application was denied, "the AI decided" is not a defense. The trail has to be explicit: input data, the rule version that was active, the specific rule that matched, the output it produced.
This is where rules-as-code pays for itself past accuracy. Every evaluation produces an audit record a human can read, verify, and stand behind. The rule engine does not have a mood. It applied rule version 2026-03-01, row 1, which maps enterprise accounts at or above $500K annual spend in the NA region to a 25% discount with net-60 terms. That sentence is the artifact. It is also the deposition exhibit.
Store rules in Git. Treat every rule change like a code change — pull request, domain-owner review, automated tests, deployment through CI/CD. Rules that live in a database or a UI without version control destroy your ability to answer the question "what were the rules on March 3rd, when this decision was made." That question is the entire point of the audit trail.
The stronger implementations tag every agent response with the rule-version hash that was active during evaluation. If the rules have moved since, the system can flag the response as potentially stale — a pattern that becomes load-bearing for long-running conversations where a pricing update lands mid-session.
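A sketch of what such a tagged audit record and staleness check might look like follows. It is illustrative only. The field names are hypothetical, and the hash is a dependency-free FNV-1a over the serialized table, used here purely as a stand-in for the cryptographic hash (for example SHA-256, or the Git commit hash of the rules repository) a production system would use.

```typescript
interface RuleTable {
  version: string;
  rules: unknown[];
}

interface AuditRecord {
  evaluatedAt: string; // ISO 8601 timestamp
  ruleVersion: string;
  ruleHash: string;
  matchedRule: string;
  input: Record<string, unknown>;
  output: Record<string, unknown>;
}

// FNV-1a (32-bit) over UTF-16 code units. Stand-in only; not collision-resistant.
function fnv1a(text: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16).padStart(8, "0");
}

// Note: JSON.stringify depends on key order, so the table must be serialized consistently.
function hashTable(table: RuleTable): string {
  return fnv1a(JSON.stringify(table));
}

// Build the record that ties one decision to the exact rule set that produced it.
function recordEvaluation(
  table: RuleTable,
  matchedRule: string,
  input: Record<string, unknown>,
  output: Record<string, unknown>,
  now: Date
): AuditRecord {
  return {
    evaluatedAt: now.toISOString(),
    ruleVersion: table.version,
    ruleHash: hashTable(table),
    matchedRule,
    input,
    output,
  };
}

// A response is potentially stale when the active rule set no longer matches the recorded hash.
function isStale(record: AuditRecord, currentTable: RuleTable): boolean {
  return record.ruleHash !== hashTable(currentTable);
}
```

Calling `isStale` before each follow-up turn in a long conversation is one way to catch a pricing update that lands mid-session.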
Encode These First. They Have the Biggest Blast Radius When Wrong.
Not every rule deserves the same urgency. These categories produce the most expensive errors when the agent invents an answer.
Financial rules: biggest blast radius
- ✓ Pricing tiers, discount maximums, volume break points
- ✓ Payment terms by account type, region, contract status
- ✓ Approval thresholds: what dollar amounts require which sign-offs
- ✓ Tax calculation rules by jurisdiction
- ✓ Refund and credit policies, every exception path
Compliance and regulatory rules: non-negotiable enforcement
- ✓ Data residency requirements by region (GDPR, CCPA, PIPL)
- ✓ KYC/AML thresholds and documentation requirements
- ✓ Industry-specific regulations (HIPAA, SOX, PCI-DSS)
- ✓ Record retention periods and deletion obligations
- ✓ Mandatory disclosure and disclaimer language by jurisdiction
Operational rules: death by a thousand cuts
- ✓ SLA definitions: response, resolution, escalation
- ✓ Eligibility criteria for features, programs, services
- ✓ Routing logic: which team handles which request type
- ✓ Capacity limits: users, API rates, storage quotas
- ✓ Scheduling constraints: business hours, blackout periods, maintenance windows
Rules Are Code. Test Them Like Code.
Decision tables and policies need test suites, not spot checks. Boundary conditions are where the failure modes live.
The rigor you apply to application code applies to rules. Every decision table needs a test suite verifying outputs for known inputs, including the boundary rows. Every policy needs negative tests confirming that denied actions are actually denied. Spot checks miss the exact place rules fail — the boundaries.
OPA ships test tooling (opa test) that makes this straightforward. For decision tables in JSON, write a thin harness that evaluates every row against sample inputs and asserts the expected output. The cost is low. The signal is high.
- [01] Test every row in every decision table against known inputs

  // test/pricing-tiers.test.ts
  // Assumes Jest-style globals (test, expect). Results are read from `.values`
  // to match how lib/rules-middleware.ts consumes lookupDecisionTable.
  import { lookupDecisionTable } from '../lib/decision-tables';

  test('enterprise NA high-spend gets 25% discount', async () => {
    const result = await lookupDecisionTable('pricing_tiers', {
      account_type: 'enterprise',
      annual_spend: 750000,
      region: 'NA',
    });
    expect(result.values.discount_pct).toBe(25);
    expect(result.values.payment_terms).toBe('net-60');
  });

- [02] Test boundary conditions: the row edges where rules flip

  test('exactly $500K triggers high-spend tier', async () => {
    const result = await lookupDecisionTable('pricing_tiers', {
      account_type: 'enterprise',
      annual_spend: 500000, // boundary
      region: 'NA',
    });
    expect(result.values.discount_pct).toBe(25);
  });

  test('$499,999 stays in standard tier', async () => {
    const result = await lookupDecisionTable('pricing_tiers', {
      account_type: 'enterprise',
      annual_spend: 499999, // just below boundary
      region: 'NA',
    });
    expect(result.values.discount_pct).toBe(15);
  });

- [03] Test policy denials with OPA's built-in test runner

  # policies/agent-guardrails_test.rego
  package agent.guardrails

  import rego.v1

  # `deny` is a set, and an empty set still counts as a true expression,
  # so assert on its size rather than on `deny` alone.
  test_deny_excessive_discount if {
      count(deny) > 0 with input as {
          "action": "apply_discount",
          "account_type": "consumer",
          "discount_pct": 20
      }
  }

  test_allow_valid_discount if {
      count(deny) == 0 with input as {
          "action": "apply_discount",
          "account_type": "enterprise",
          "discount_pct": 25
      }
  }
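The "thin harness" for JSON decision tables can be table-driven itself: one list of cases per table, one loop that reports every mismatch. The sketch below is a framework-free illustration. `runCases`, `TableCase`, and the stand-in `sampleLookup` are hypothetical names, and in a real suite the lookup would be the project's own decision-table function.

```typescript
type Cell = string | number;
type Row = Record<string, Cell>;

interface TableCase {
  name: string;
  input: Row;
  expected: Row; // only the outputs this case cares about
}

// Any function that maps inputs to outputs (or null when no rule matches).
type TableLookup = (input: Row) => Row | null;

// Run every case and collect human-readable failures instead of stopping at the first one.
function runCases(lookup: TableLookup, cases: TableCase[]): string[] {
  const failures: string[] = [];
  for (const c of cases) {
    const actual = lookup(c.input);
    for (const key of Object.keys(c.expected)) {
      const want = c.expected[key];
      const got = actual === null ? undefined : actual[key];
      if (got !== want) {
        failures.push(`${c.name}: ${key} expected ${String(want)}, got ${String(got)}`);
      }
    }
  }
  return failures;
}

// Stand-in lookup covering only the enterprise NA rows of the sample pricing table.
const sampleLookup: TableLookup = (input) => {
  if (input.account_type !== "enterprise" || input.region !== "NA") return null;
  return Number(input.annual_spend) >= 500000
    ? { discount_pct: 25, payment_terms: "net-60" }
    : { discount_pct: 15, payment_terms: "net-45" };
};

// Happy path plus both sides of the $500K boundary.
const pricingCases: TableCase[] = [
  { name: "high spend", input: { account_type: "enterprise", annual_spend: 750000, region: "NA" }, expected: { discount_pct: 25 } },
  { name: "at boundary", input: { account_type: "enterprise", annual_spend: 500000, region: "NA" }, expected: { discount_pct: 25 } },
  { name: "below boundary", input: { account_type: "enterprise", annual_spend: 499999, region: "NA" }, expected: { discount_pct: 15 } },
];
```

Keeping the cases as data means a domain owner can add a boundary case without touching test code.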
Rules Drift Is the Default State of Any Encoded Rule Set Without an Owner
Encoding is a point-in-time act. Six months later the world has moved and the table has not. Drift is the silent killer.
Encoding a rule is a point-in-time activity. The rules are correct on the day you encode them. Six months later the pricing has shifted, a compliance threshold has moved, two new product tiers exist — and nobody updated the table.
Drift is the silent failure mode of rules-as-code. The agent is enforcing rules that are no longer accurate, and because enforcement is deterministic, nobody notices until a customer complains or an audit catches it. Three mechanisms keep drift from becoming the default.
Drift Prevention Mechanisms
Mandatory expiry dates on every rule
Every decision-table row and every policy rule carries an expires_at field. When a rule expires, the system fails loudly and routes the decision to a human — it does not silently keep using stale logic. This forces periodic review without depending on humans remembering.
Automated conflict detection between rules and observed behavior
Run a weekly job that compares rule outputs against actual outcomes. If the rule says the discount is 15% but the last 50 transactions averaged 22%, something is wrong. Either the rule is stale or it is being bypassed. Either way, flag it and assign an owner.
Domain-owner sign-off on a cadence matching volatility
Stable rules get annual review. Volatile rules get quarterly review. Dynamic rules get reviewed with every deployment. Assign a named owner — a person, not a team — to each rule category. Ownerless rules drift by default, because no one is on the hook for catching the change.
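The first two mechanisms are small enough to sketch directly. The code below is a hedged illustration, not a reference implementation. `assertNotExpired` and `detectDrift` are hypothetical names, `expires_at` is assumed to be an ISO 8601 UTC timestamp, and the drift tolerance is an arbitrary parameter the rule owner would have to choose.

```typescript
interface ExpiringRule {
  id: string;
  expires_at: string; // ISO 8601 UTC timestamp (assumed format)
  discount_pct: number;
}

// Fail loudly when a rule is past its expiry so the caller routes the decision to a human.
function assertNotExpired(rule: ExpiringRule, now: Date): void {
  const expiry = Date.parse(rule.expires_at);
  if (Number.isNaN(expiry)) {
    throw new Error(`Rule ${rule.id} has an unreadable expires_at: ${rule.expires_at}`);
  }
  if (now.getTime() >= expiry) {
    throw new Error(`Rule ${rule.id} expired at ${rule.expires_at}; route to a human`);
  }
}

// Compare what the rule says against what actually happened in recent transactions.
// Returns true when the observed mean is further from the rule value than the tolerance.
function detectDrift(ruleValue: number, observed: number[], tolerance: number): boolean {
  if (observed.length === 0) return false; // nothing to compare against
  const mean = observed.reduce((sum, v) => sum + v, 0) / observed.length;
  return Math.abs(mean - ruleValue) > tolerance;
}
```

With a rule value of 15 and recent discounts averaging 22, `detectDrift(15, observed, 2)` returns true, matching the weekly-job example above. Deciding whether the rule is stale or being bypassed still needs a named owner.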
Wiring Rules Into Agent Frameworks
Tool calls or middleware. Pick deliberately — the choice decides what can bypass the rules.
Most agent frameworks support tool calling. That is your integration point. Define the rules engine as a tool the agent can — and must — call before taking actions that involve business logic.
The live architectural call: should the agent call rules proactively, or should middleware intercept the agent's output and validate it. Both patterns have trade-offs. The trade-offs decide what failure mode you accept.
| Pattern | How It Works | Pros | Cons |
|---|---|---|---|
| Agent-calls-rules (tool) | Agent has a 'checkBusinessRules' tool and calls it before responding | Lower latency on simple queries; agent learns when to check | Agent can skip the call; depends on reliable tool-use behavior |
| Middleware interception | Every agent output passes through the rules engine before delivery | 100% coverage; the agent cannot bypass the layer | Latency on every response; some checks are unnecessary |
| Hybrid (recommended) | Middleware catches high-risk actions; the agent calls tools for lookups | Best coverage-to-latency ratio; defense in depth | More setup; two systems on the maintenance roster |
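The middleware half of the hybrid pattern can be sketched as a wrapper that always validates a fixed set of high-risk action types and lets everything else through unchecked. This is a minimal illustration under assumptions: the action names in `HIGH_RISK_ACTIONS` are hypothetical, and `Validator` stands in for whatever rules or policy call the system actually uses.

```typescript
// A validator returns denial reasons; an empty list means the action is allowed.
type Validator = (action: string, payload: Record<string, unknown>) => string[];

// Hypothetical list of action types that must never bypass the rules layer.
const HIGH_RISK_ACTIONS: readonly string[] = ["apply_discount", "approve_transaction", "issue_refund"];

interface Delivery {
  delivered: boolean;
  checked: boolean; // whether the middleware ran the validator
  reasons: string[];
}

// High-risk actions are always validated here, whether or not the agent called the tool itself.
function deliver(
  action: string,
  payload: Record<string, unknown>,
  validate: Validator
): Delivery {
  if (HIGH_RISK_ACTIONS.indexOf(action) === -1) {
    return { delivered: true, checked: false, reasons: [] };
  }
  const reasons = validate(action, payload);
  return { delivered: reasons.length === 0, checked: true, reasons };
}
```

The design choice is which side owns the risk list. Keeping it in the middleware rather than in the prompt means the agent skipping a tool call cannot remove coverage for those actions.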
tools/business-rules-tool.ts
import { tool } from 'ai';
import { z } from 'zod';
import { enforceBusinessRules } from '../lib/rules-middleware';

export const checkBusinessRules = tool({
  description: 'Check business rules before quoting prices, applying discounts, ' +
    'or making commitments to customers. ALWAYS call this before responding ' +
    'with any financial figures.',
  parameters: z.object({
    action: z.string().describe('The action type: pricing, discount, refund, approval'),
    account_type: z.string().describe('Customer account tier'),
    region: z.string().optional().describe('Customer region code'),
    amount: z.number().optional().describe('Transaction amount if applicable'),
    additional_context: z.record(z.unknown()).optional(),
  }),
  execute: async (params) => {
    const result = await enforceBusinessRules({
      type: params.action,
      params,
      context: params.additional_context ?? {},
    });
    return {
      allowed: result.allowed,
      values: result.values,
      rules_applied: result.appliedRules,
      denied_reason: result.deniedBy ?? null,
    };
  },
});
Four Metrics That Tell You the System Is Actually Working
Track these from day one. They justify the investment, and they surface drift before customers do.
Four metrics tell you whether rules-as-code is doing its job. Track them from day one. They are both the justification for the investment and the early-warning system for drift. If you cannot read these four numbers off a dashboard, the system is enforcing rules nobody is auditing.
What to Ship by Friday
A first-week plan. Boring. Concrete. Designed to remove the loudest source of invented answers before next sprint.
First Week: Business Rules as Code
List the 10 most common customer-facing decisions the agent makes — pricing, eligibility, refunds, routing
For each decision, name the current source of truth: a PDF, a spreadsheet, or a person
Encode the top 3 as JSON decision tables with version and effective_date fields
Write a test for each table covering the happy path and two boundary conditions
Wire one decision table into the agent as a tool call — start with pricing
Add a single OPA policy blocking discounts above the maximum for each account tier
Set up audit logging that records rule version, input, matched rule, and output for every evaluation
Schedule a monthly review with named domain owners — not a team, a person — to verify rules freshness
Can the LLM just interpret rules in natural language?
No. Use a separate engine. LLMs interpreting rules in prose will occasionally get them wrong, and 'occasionally wrong' on pricing or compliance is the failure mode that costs real money. The model's job is communication — taking the deterministic output from the rules engine and explaining it clearly. Reasoning and communication stay separate. The failure pattern to watch for: teams start hybrid ('the LLM will call the rules API when needed') and discover months later that the model has been skipping the call for 'simple' lookups and inventing the answer. For any action with financial or compliance implications, rule checks are mandatory middleware. Not optional tool calls.
What about rules with exceptions or that need human judgment?
Encode the rule. Encode the exception path explicitly. If the rule is 'net-30 except when VP of Sales approves an override,' the override is a separate rule that requires a human-approval flag in the input. The system never guesses at exceptions. It either applies the rule or routes to a human. Guessing is what got you here.
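The net-30 example can be written so the override is an explicit input rather than something the system infers. The sketch below assumes a hypothetical `overrideApprovedBy` field and a hypothetical approver role `vp_sales`. Anything that is neither the base rule nor an approved override is routed to a human.

```typescript
interface TermsRequest {
  requestedTerms: string; // e.g. "net-30", "net-60"
  overrideApprovedBy?: string; // set only by the approval workflow, never by the model
}

type TermsDecision =
  | { kind: "apply"; terms: string; ruleId: string }
  | { kind: "route_to_human"; reason: string };

const BASE_TERMS = "net-30";

function decidePaymentTerms(req: TermsRequest): TermsDecision {
  // Base rule: net-30 needs no approval.
  if (req.requestedTerms === BASE_TERMS) {
    return { kind: "apply", terms: BASE_TERMS, ruleId: "payment_terms.base" };
  }
  // Exception rule: a different term applies only with an explicit VP of Sales approval flag.
  if (req.overrideApprovedBy === "vp_sales") {
    return { kind: "apply", terms: req.requestedTerms, ruleId: "payment_terms.vp_override" };
  }
  // No matching rule and no approval: never guess, hand off.
  return { kind: "route_to_human", reason: `${req.requestedTerms} requires VP of Sales approval` };
}
```

Because the override has its own `ruleId`, the audit trail shows not just that an exception was applied but which rule permitted it.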
What about rules that change frequently — promotional pricing?
Run a deployment pipeline. Promotional rules go in a separate decision table with start and end dates. CI/CD validates the table, runs tests, deploys to the rules service. The agent always reads the current version. When the promo expires, the table reverts to base pricing automatically. No manual intervention. No rule to forget about.
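The "reverts automatically" behavior falls out of evaluating the promo window at lookup time instead of editing the base table. A minimal sketch, assuming hypothetical field names `starts_at` and `ends_at` holding ISO 8601 UTC timestamps, with the end bound treated as exclusive:

```typescript
interface PromoRow {
  id: string;
  discount_pct: number;
  starts_at: string; // inclusive, ISO 8601 UTC (assumed format)
  ends_at: string; // exclusive, ISO 8601 UTC (assumed format)
}

interface ResolvedDiscount {
  discount_pct: number;
  source: string; // promo id, or "base" when no promo is active
}

// The first promo whose window contains `now` wins; otherwise base pricing applies.
function resolveDiscount(basePct: number, promos: PromoRow[], now: Date): ResolvedDiscount {
  const t = now.getTime();
  for (const promo of promos) {
    if (t >= Date.parse(promo.starts_at) && t < Date.parse(promo.ends_at)) {
      return { discount_pct: promo.discount_pct, source: promo.id };
    }
  }
  return { discount_pct: basePct, source: "base" };
}
```

Nothing has to be deleted when the promotion ends: once `now` passes `ends_at`, the same call returns the base discount, and the expired row stays in Git history for the audit trail.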
How do decision tables scale to thousands of rules?
Partition by domain. One table per category — pricing, eligibility, routing, compliance. Each table stays small and maintainable. The middleware decides which table to query based on action type. Most organizations end up with 20-50 tables covering 500-2000 rules total. That is manageable with standard tooling. The failure mode is not size. It is putting unrelated rules in the same table.
Can an LLM help extract rules from policy documents?
Yes — with verification. Use the model to propose structured rules from unstructured policy text. Always have a domain expert validate the output before it enters production. Georgetown's Beeck Center research finds LLMs convert policy to code well when the logic is simple, and stumble on complex multi-condition rules.[7] Treat LLM extraction as a drafting tool, not a finalization tool. The bottleneck is review, not generation.
- [1] Digital Government Hub, "AI-Powered Rules as Code: Experiments with Public Benefits Policy" (digitalgovernmenthub.org)
- [2] Open Policy Agent, "Open Policy Agent Documentation" (openpolicyagent.org)
- [3] Object Management Group, "Decision Model and Notation (DMN) Standard" (omg.org)
- [4] GoRules, "Cloud-Native Rule Engine" (gorules.io)
- [5] IBM DecisionsDev, "Rule-Based LLMs" (github.com)
- [6] AWS, "Minimize AI Hallucinations And Deliver Up To 99% Verification Accuracy With Automated Reasoning Checks" (aws.amazon.com)
- [7] Georgetown Beeck Center, "AI-Powered Rules as Code: Experiments with Public Benefits Policy" (beeckcenter.georgetown.edu)
- [8] "Combining Rule-Based and LLM-Based Approaches for Decision Making" (arxiv.org)