How can I Train an AI to Detect Mistakes in Accounts Payable?

Alright - let's build an AP watchdog that never sleeps, sniffs out leaks, and explains itself without hand-waving. Below is a practical, engineer-first blueprint you can ship in phases. No fluff, just what works - with a touch of music in the margins.

1) Define the "oversights" (your label space)

Start with clear, auditable categories. You'll need these for both rules and ML targets.

  • Duplicates: exact & near-dupes (same vendor/date/amount, OCR-typos, different invoice #).
  • 3-way-match gaps: PO ↔ GRN ↔ invoice qty/price mismatches, UoM drift.
  • Price drift: gradually worsening unit price against contracted/benchmarked price.
  • Payment-term erosion: missed early-pay discounts; term changes without approval.
  • Vendor-master risks: sudden bank-detail changes, new vendors w/ thin history, address reuse across vendors.
  • Tax/VAT miscoding: wrong tax codes, non-recoverable VAT booked recoverable, reverse-charge mishaps.
  • Suspicious splits: invoices split just under approval thresholds.
  • Accrual staleness: old GR/IR not clearing; accrued items without eventual invoices.
  • FX/currency: wrong conversion date or rate source.
  • Posting anomalies: large round numbers, weekend/holiday postings, unusual approver chains.

2) Data you need (and how to stitch it)

  • Invoices/AP ledger: header + line items, tax lines, GL postings, payment history.
  • POs & receipts (GRN): quantities, prices, dates, UoM, contract refs.
  • Vendor master: bank details (IBAN/SWIFT), addresses, tax IDs, payment terms, approval history.
  • Reference: contract catalogs, FX rates (by date), tax tables, calendars/holidays.
  • Access: CDC from ERP (e.g., SAP/Oracle/D365); land in a lakehouse; enforce PII minimization.

3) Start simple: rules that pay for themselves

Rules are explainable, cheap, and produce weak labels for ML.

  • Near-duplicate detection: same vendor & amount within 3 days AND Jaro-Winkler(invoice_no, memo) > 0.9.
  • Missed discount: payment_date > discount_window AND terms include discount.
  • Threshold splits: N invoices from same vendor within 2 days with sum just under approver limit.
  • Bank change alert: payment to new bank account within 7 days of change without 2-factor vendor re-verification.
  • Benford outliers: first-digit distribution per vendor deviates beyond control limits.

Log every rule hit with: rule_id, evidence, monetary impact, and human-readable reason.
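
As a concrete sketch of that logging pattern, here is one rule (the missed-discount check above) emitting structured hits. It assumes a pandas frame with invoice_id, payment_date, discount_deadline, and discount_value columns; the names are illustrative and the date columns are assumed to be datetime-typed.

import pandas as pd

def missed_discount_hits(inv: pd.DataFrame) -> pd.DataFrame:
    # Rule: terms include a discount AND payment_date is past the discount window
    hits = inv[(inv["discount_value"] > 0) & (inv["payment_date"] > inv["discount_deadline"])].copy()
    hits["rule_id"] = "MISSED_DISCOUNT_001"
    hits["impact"] = hits["discount_value"]  # monetary impact = discount forgone
    hits["reason"] = (
        "Paid " + hits["payment_date"].dt.strftime("%Y-%m-%d")
        + ", after discount deadline " + hits["discount_deadline"].dt.strftime("%Y-%m-%d")
    )
    return hits[["invoice_id", "rule_id", "impact", "reason"]]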

4) Layer in ML where rules run out of road

Use ML to rank risk and catch the clever, fuzzy stuff.

A. Supervised (where you have labels)

  • Train a gradient-boosted model (LightGBM/XGBoost) per oversight type (e.g., duplicate, miscoded VAT).
  • Target = historical confirmed exceptions/recoveries/reversals.
  • Features: vendor-normalized z-scores, inter-invoice intervals, text embeddings from descriptions, approval path patterns, bank-detail change recency, price vs contract deltas, FX timing deltas.

B. Unsupervised/weakly supervised

  • Isolation Forest / Elliptic Envelope on vendor-wise numeric features (amounts, unit prices, qty).
  • Autoencoder for line-level reconstruction; rank lines by reconstruction error.
  • Similarity search (MinHash/LSH or embeddings) for near-dupes and recycled PDFs (see the sketch after this list).
  • Graph features: vendor↔bank↔address↔approver graph; flag many-to-one hubs and short odd cycles.
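
For the similarity-search bullet, a minimal sketch using the datasketch library (an assumption; any MinHash/LSH implementation will do) to surface near-duplicate memo/description text:

from datasketch import MinHash, MinHashLSH

def minhash_signature(text: str, num_perm: int = 128) -> MinHash:
    # Shingle on lowercase word tokens; real pipelines often use character n-grams instead
    m = MinHash(num_perm=num_perm)
    for token in set(text.lower().split()):
        m.update(token.encode("utf8"))
    return m

lsh = MinHashLSH(threshold=0.8, num_perm=128)
lsh.insert("INV-10233", minhash_signature("Consulting fee March advisory services"))
candidates = lsh.query(minhash_signature("Advisory services consulting March"))  # near-dupe candidate keys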

C. NLP / Document intelligence

  • OCR layout-aware extraction (invoice #, date, total, VAT id).
  • Sentence-BERT embeddings for memo/line descriptions to spot semantic dupes and miscodings.
  • Classification of tax code likelihood from text + vendor + jurisdiction.

5) Feature engineering that moves the needle

  • Per-vendor baselines: rolling medians, MADs for unit price/qty/discount utilization.
  • Contract anchors: distance to contract price & quantity tolerances.
  • Temporal: days from GRN to invoice to posting to payment; month-end pressure flags.
  • Approval path deltas: typical approver graph vs this document's path.
  • Bank detail recency: days since change; change frequency; country mismatch vs vendor.
  • Benford & roundness: first digit, terminal zeros, "nice" numbers.
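
A sketch of the per-vendor baseline features above (rolling median/MAD turned into a robust z-score); the column names are illustrative, not a fixed schema:

import pandas as pd

def add_unit_price_z(lines: pd.DataFrame, window: int = 20) -> pd.DataFrame:
    # Robust z-score of unit_price vs each vendor's rolling baseline (median / scaled MAD)
    lines = lines.sort_values(["vendor_id", "invoice_date"]).copy()
    grp = lines.groupby("vendor_id")["unit_price"]
    med = grp.transform(lambda s: s.rolling(window, min_periods=5).median())
    mad = grp.transform(
        lambda s: (s - s.rolling(window, min_periods=5).median()).abs()
                    .rolling(window, min_periods=5).median()
    )
    lines["unit_price_z"] = (lines["unit_price"] - med) / (1.4826 * mad + 1e-9)
    return lines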

6) Label strategy (don't wait for perfect truth)

  • Gold labels: confirmed duplicate recoveries, audit findings, credit memos reversing errors.
  • Silver labels: rule-hits reviewed & confirmed by AP.
  • Weak labels: high-precision rules you trust (e.g., exact duplicates).
  • Active learning: surface uncertain cases to reviewers; fold decisions back weekly.

7) Scoring, economics, and guardrails

  • Risk score 0-100 per document + expected value = probability × monetary impact.
  • Optimize precision at capacity: how many items your team can review/day.
  • Cost-weighted metrics: catching a £50K duplicate beats five £100 tax miscodes. Track £ saved / 1K invoices.
  • Human-in-the-loop: one-click disposition; every decision feeds the model.
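
A minimal sketch of expected-value triage under reviewer capacity; p_issue, est_impact, and confirmed_issue are assumed column names:

import pandas as pd

def triage_queue(scored: pd.DataFrame, daily_capacity: int = 50) -> pd.DataFrame:
    # Expected value = P(issue) * estimated monetary impact; review only what the team can handle
    scored = scored.assign(expected_value=scored["p_issue"] * scored["est_impact"])
    return scored.sort_values("expected_value", ascending=False).head(daily_capacity)

def precision_at_capacity(reviewed: pd.DataFrame) -> float:
    # Share of today's reviewed queue confirmed as real issues (the human-in-the-loop metric)
    return float(reviewed["confirmed_issue"].mean())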

8) Deployment pattern (clean and reversible)

  • Ingest: CDC → bronze (raw) → silver (validated) → gold (analytics).

  • Batch + streaming: batch backfills nightly; stream high-risk events on create/post/payment.

  • Explainability: SHAP for ML; crisp rule narratives for rules.

  • Interfaces:

    • AP queue sorted by expected value.
    • Vendor-level health dashboard (price drift, term erosion).
    • Change-control panel for vendor bank updates (attach evidence; dual-auth).
  • Feedback API: "confirm issue / false positive / not sure".

9) Security & compliance (the unglamorous bits that matter)

  • Least-privilege service accounts; column-level masking for bank/PII.
  • Immutable audit log of model versions, features, and decisions.
  • Data retention aligned with finance policy; pseudonymize where possible.
  • Country-specific VAT logic encapsulated & versioned.

10) 60-day build plan (aggressive but doable)

Weeks 1-2

  • Land data, entity map (invoice ↔ PO ↔ GRN ↔ payment ↔ vendor), baseline dashboards.
  • Ship the rules engine for: near-dupes, missed discounts, bank changes, threshold splits.

Weeks 3-4

  • Similarity service (MinHash + embedding) for fuzzy duplicates.
  • Unsupervised anomaly (Isolation Forest) per vendor for price/qty.

Weeks 5-6

  • First supervised model (duplicates), SHAP explanations.
  • Reviewer workbench + feedback loop; start expected-value ranking.

Weeks 7-8

  • Tax/VAT classifier + contract price-drift detector.
  • SLA/metrics: precision@capacity, $ saved, review time, false-positive rate.

11) Minimal code to get traction (Python snippets)

Near-duplicate candidate generation

import pandas as pd
from Levenshtein import ratio as lev_ratio

def near_dupe_flags(df):
    # df: invoice_id, vendor_id, invoice_date, amount, invoice_no, memo
    df = df.sort_values(['vendor_id','invoice_date'])
    # window join: same vendor, ±3 days, ±1 on amount
    idx = []
    for i, r in df.iterrows():
        w = df[(df.vendor_id==r.vendor_id)
               & (abs((df.invoice_date - r.invoice_date).dt.days) <= 3)
               & ((df.amount - r.amount).abs() <= 1)
               & (df.index != i)]
        for j, s in w.iterrows():
            sim = max(lev_ratio(str(r.invoice_no), str(s.invoice_no)),
                      lev_ratio(str(r.memo), str(s.memo)))
            if sim >= 0.9: idx.append((r.invoice_id, s.invoice_id, sim))
    return pd.DataFrame(idx, columns=['inv_a','inv_b','similarity'])

Isolation Forest for vendor-wise anomalies

from sklearn.ensemble import IsolationForest
import numpy as np

def vendor_anomaly_scores(df):
    # features: amount, unit_price_z, qty_z, days_to_pay, roundness
    feats = df[['amount','unit_price_z','qty_z','days_to_pay','roundness']].fillna(0.0)
    model = IsolationForest(n_estimators=300, contamination='auto', random_state=42)
    model.fit(feats)
    # score_samples: higher = more normal; invert and min-max scale into a 0-1 risk
    raw_risk = -model.score_samples(feats)
    df['anomaly_risk'] = (raw_risk - raw_risk.min()) / (raw_risk.max() - raw_risk.min() + 1e-9)
    return df[['invoice_id','anomaly_risk']]

12) How this saves time (and SEO-ish bonus for internal comms)

  • Time saved: triage only the top X expected-value cases daily; the rest pass silently.
  • Cash saved: quantify recoveries; report "£ saved/1K invoices" monthly.
  • Governance: every flag is explainable; auditors get reproducible evidence.
  • Future-proofing: models re-train weekly on fresh decisions; drift is monitored.

13) Common edge cases to bake in early

  • Credit memos and partials (avoid "false duplicate" with negative lines).
  • Blanket POs vs spot POs (use contract anchors for price checks).
  • Intercompany & one-time vendors (different baselines).
  • Multi-currency (rate-date correctness, not just amounts).
  • Scanned vs e-invoices (OCR confidence as a feature).

AI Auditor

Let's start at the beginning.

Is this an AI being trained on AP data or just a set of python rules referencing an LLM for interpretation?

Short answer: build a hybrid. Start with deterministic Python rules for the high-yield checks, use an LLM as a helper (extraction, normalization, explanations), and only train ML models on your AP data once you've accumulated enough reviewed outcomes. That gives you cash savings fast, clean audit trails, and a ramp toward real pattern-learning.

Here's the shape:

What each option really means

  1. Rules-only
  • Python rules over invoices/PO/GRN/vendor tables.
  • Pros: instant, explainable, auditor-friendly.
  • Cons: brittle against fuzzy dupes, vendor drift, and creative errors.
  2. Rules + LLM assist (recommended start)
  • Rules do the flagging.
  • LLM handles: OCR/layout extraction, text cleanup ("Inv# O/OO1" → "0001"), semantic similarity ("Consulting fee" ≈ "Advisory services"), and plain-English reasons for reviewers.
  • Pros: fast to ship, explainable, catches fuzzier texty stuff.
  • Cons: still not learning from your history (yet).
  3. Trained models on AP data
  • Gradient-boosted classifiers per issue (duplicates, VAT miscoding, price drift), plus anomaly models.
  • Needs labels (confirmed issues/false positives) and a feedback loop.
  • Pros: best recall on real-world mess.
  • Cons: slower to bootstrap; must manage drift and governance.
  4. Full hybrid (destination)
  • Rules for policy & hard constraints.
  • ML for ranking and subtle patterns.
  • LLM for extraction, normalization, and human-readable justifications.

Recommended path (pragmatic, auditable, forward-ready)

Phase 1 - Rules + LLM helper

  • Ship 8-10 rock-solid rules: near-dupes, bank-detail change, approval-threshold splits, missed discounts, 3-way match tolerances, Benford deviations, weekend/EOQ posts, term erosion.

  • Use the LLM only for:

    • Document intelligence: parse invoice #/date/total/VAT id from PDFs.
    • Normalization: fix OCR quirks, standardize UoM, expand abbreviations.
    • Similarity: embed invoice text/memos for near-dupe matching (vector search).
    • Explanations: "Flagged as near-duplicate of INV-10233; same vendor, 12,540 0.50, 2-day window; description similarity 0.93."

Phase 2 - Start learning from your data

  • Capture reviewer outcomes (true issue / false positive).
  • Train a LightGBM model for duplicate/overpayment risk using those decisions + rule hits as features.
  • Add Isolation Forest per vendor for price/qty anomalies.
  • Keep rules as guardrails; use models to rank by expected value.

Phase 3 - Mature

  • Add a VAT/tax code likelihood model, contract price-drift detector, and simple graph features (vendor↔bank↔address↔approver) to spot odd hubs.

Where the LLM fits (and where it doesn't)

Use LLMs for

  • Layout-aware extraction (or post-process outputs from an OCR engine).
  • Text standardization and semantic similarity (embeddings).
  • Generating crisp reviewer notes (explainability).
  • Schema harmonization between ERPs.

Avoid LLMs for

  • The actual "this is a duplicate" decision. Keep that in rules/ML you can version, test, and audit.
  • Anything that must be strictly deterministic or policy-bound (approval thresholds, tax table application).

Minimal stack

  • Data: invoices, POs, GRNs, vendor master, payments, FX/tax tables.

  • Engines:

    • Rules engine (Python/SQL).
    • Vector store for embeddings (FAISS/pgvector).
    • ML (LightGBM + Isolation Forest).
    • LLM (extraction/explanations) behind a strict prompt/template.
  • UX: Review queue sorted by expected value (prob × £ impact) with one-click outcomes feeding retraining.

Governance you'll thank yourself for

  • Version every rule/model, log feature values and decisions, keep a read-only audit trail.
  • PII minimization (mask vendor bank details in non-essential paths).
  • Deterministic fallbacks if the LLM is unavailable.

So: not just an LLM calling some Python, and not ML from day one either. Start with rules + LLM assist, earn labels and savings, then layer trained models where they clearly beat rules. Plain, sharp, and built to grow - like a ledger that sings.

AI Auditing Processes

OK but think about this:

I have a software that performs test for auditors to examine and returns likely duplicates etc. What would happen if we trained an AI on the results of these tests rather than the data?

Great question - and very workable. Training on the results of your existing tests (rather than raw AP data) is basically building a meta-model that learns how to prioritise, combine, and calibrate your current checks. Think "orchestra conductor," not a new instrument.

Here's what happens - the good, the bad, and the smart way to do it.

What you'd be training on

For each invoice (or vendor/event), you already have:

  • A vector of test outputs: rule flags, similarity scores, tolerance deltas, Benford deviations, "missed discount = yes/no", etc.
  • Auditor verdicts and (ideally) **£ impact** from past reviews.

This becomes a clean supervised dataset:

X = [test_1_score, test_2_flag, test_3_delta, ...]
y_class = {issue / no-issue}
y_value = £ recovered (or expected £ impact)

Why this is powerful

  • Fast lift: No OCR wrangling, no fragile joins. You're leveraging signal you already trust.
  • Calibrates thresholds: The model learns that "rule A + rule C together" is high risk even if each alone is weak.
  • Fewer false positives: It learns vendor/context patterns implicit in your tests' co-occurrence.
  • Explains itself: SHAP/feature importance still map to familiar test names.
  • Governance-friendly: You're not moving sensitive raw data around; you're modelling decisions about evidence.

Hard limits (be blunt about them)

  • Ceiling == your tests. The meta-model can't exceed the information content of your checks. If the root signal isn't in your tests, the model can't invent it.
  • Error reinforcement. Systematic blind spots in tests become systematic blind spots in the model.
  • Thresholding hurts. If your tests throw away magnitude/context (e.g., binary flags only), you lose learnable nuance.

Best-practice design (what I'd actually ship)

1) Keep the raw test magnitudes

Prefer continuous features over booleans:

  • Similarity scores, z-scores vs vendor baselines, price deltas, days offsets, OCR confidence, etc.
  • Include counts (e.g., "# of rules hit", "# hits in last 90 days for this vendor").

2) Train two targets

  • Classifier: P(issue | test outputs).
  • Regressor: E(£ impact | test outputs). Combine into Expected Value: EV = p × £ to rank the review queue.
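
A minimal sketch of that two-target setup, assuming scikit-learn and LightGBM, a feature frame X of test outputs, and labels y_issue (0/1) and y_value (£ recovered); all names are illustrative:

from lightgbm import LGBMClassifier, LGBMRegressor
from sklearn.calibration import CalibratedClassifierCV

def fit_ev_models(X, y_issue, y_value):
    # P(issue | test outputs), isotonic-calibrated so the probabilities are usable as probabilities
    clf = CalibratedClassifierCV(LGBMClassifier(), method="isotonic", cv=5).fit(X, y_issue)
    # E(£ impact | test outputs), fitted on confirmed issues only
    reg = LGBMRegressor().fit(X[y_issue == 1], y_value[y_issue == 1])
    return clf, reg

def expected_value(clf, reg, X_new):
    # EV = p * £ impact; sort the review queue by this, descending
    return clf.predict_proba(X_new)[:, 1] * reg.predict(X_new)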

3) Models (simple → strong)

  • Start with logistic/linear + isotonic calibration (auditor-friendly).
  • Move to LightGBM/XGBoost (handles sparse/heterogeneous test features).
  • Add pairwise learning-to-rank (XGBoost rank) to order items by EV under reviewer capacity constraints.

4) Calibrate & de-noise

  • Isotonic/Platt calibration per vendor segment (large vendors behave differently).
  • Use confident learning (e.g., cleanlab) to reduce label noise when auditors disagree or skip outcomes.

5) Don't just classify - optimise the workflow

  • Train for precision@K (K = daily review capacity), not just AUC.
  • Track **£ saved / 1k invoices** and time-to-decision as primary KPIs.

6) Protect against drift

  • Log the distribution of each test feature. If your upstream rules change (new tolerance, new vendor mix), retrain.
  • Use conformal prediction to attach "risk bands" and abstain when the test pattern is out-of-distribution.
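
One concrete way to watch those test-feature distributions is a population stability index per feature; a sketch assuming numpy arrays of training-time and live values:

import numpy as np

def population_stability_index(expected, actual, bins=10):
    # PSI between a feature's training distribution and its live distribution;
    # a common rule of thumb treats values above ~0.2 as drift worth investigating
    edges = np.histogram_bin_edges(np.concatenate([expected, actual]), bins=bins)
    e = np.histogram(expected, bins=edges)[0] / len(expected)
    a = np.histogram(actual, bins=edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)
    return float(np.sum((a - e) * np.log(a / e)))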

7) Keep humans in the loop (and learn)

  • One-click outcomes feed back nightly.
  • Active learning: sample uncertain cases for review to improve calibration fastest.

Concrete feature sketch (drop-in)

  • Meta features: count_rules_hit, max_score, sum_scores, entropy_of_hits, weekend/EOQ flag.
  • Pattern features: (near_dupe_score × bank_change_flag), (missed_discount × days_late), (3WM_qty_delta × UoM_mismatch).
  • History: vendor_issue_rate_90d, median_£_impact_vendor, time_since_last_issue.
  • Context: invoice_amount_band, currency, entity, month-end, new_vendor_flag.

Evaluation you should demand

  • Baseline = "current rules sorted by highest single score".

  • Report:

    • Precision@capacity (top N the team can review).
    • Lift curve vs baseline.
    • Calibration curve (predicted vs actual issue rate).
    • Net uplift and reviewer hours saved.
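
A sketch of precision@capacity and the lift comparison against that baseline, assuming numpy arrays of reviewer outcomes and scores (names illustrative):

import numpy as np

def precision_at_k(y_true, scores, k):
    # Precision among the top-k items the team has capacity to review
    top = np.argsort(-scores)[:k]
    return float(y_true[top].mean())

def lift_vs_baseline(y_true, meta_scores, baseline_scores, k):
    # Issues caught in the meta-model's top-k vs the current "highest single score" ordering
    caught_meta = y_true[np.argsort(-meta_scores)[:k]].sum()
    caught_base = y_true[np.argsort(-baseline_scores)[:k]].sum()
    return float(caught_meta / max(caught_base, 1))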

Where this leaves "full AI on data"

Use this meta-model to squeeze everything from your existing tests now. In parallel, identify misses/near-misses and add new tests (or minimal raw-data signals) where the meta-model's ceiling is obvious:

  • e.g., add a "price-vs-contract z-score" test if you don't have one,
  • add "bank-country vs vendor-country mismatch magnitude,"
  • include embedding similarity for memo/line text as a test feature.

That way, you keep your governance and explainability, but you expand the signal feeding the meta-model over time.

Minimal implementation plan (4 steps)

  1. Data mart: one row per document with all test outputs (continuous where possible) + auditor outcome + impact.
  2. Train: logistic (issue) + LightGBM (EV), calibrate, SHAP for top drivers.
  3. Serve: score new docs, sort by EV, show top drivers as the explanation ("Near-dupe 0.94 + Bank-change + Weekend post").
  4. Measure: weekly precision@K and £ saved; retrain when drift alarms fire.

In short: training on the test results gives you a sharp, auditable decision optimizer that immediately improves triage. It won't discover signals your tests never see, but it will combine them better, set smarter thresholds, and focus your reviewers where the money leaks. Then, when you're ready, add new tests (or selective raw-data features) to raise the ceiling. A clear melody, then the harmony.

But Could I Make an AI Auditor from Enough AP Data

But Could I Make an AI Auditor from Enough AP Data?

Short, true answer: yes — you can build an “AI auditor” for AP, but it won’t replace the statutory auditor. It can act as a tireless associate auditor that performs evidence gathering, control testing, anomaly hunting, and workpaper drafting — with a human signing off. Think iron lungs for the tedious parts, lungs of flesh for judgement.

Here’s how to make it real without kidding ourselves.

What an AI Auditor can do well

  • Document intelligence: parse invoices/POs/GRNs/contracts; reconcile 2-/3-way matches; extract key fields with confidence.

  • Control testing at scale: payment-term adherence, approval chains, tolerance breaches, vendor master changes, duplicate detection.

  • Substantive analytics: price drift vs contract, unusual timing (EOQ/month-end), FX/date mismatches, Benford/roundness tests.

  • Risk scoring & triage: rank items by expected £ impact; route to reviewers with evidence snippets.

  • Narratives: auto-draft workpapers (“Procedures performed… Results… Exceptions… Remediation…”), with links to source docs.

  • Continuous auditing: stream checks as entries are posted, not months later.

What it should not pretend to do

  • Issue audit opinions or assert “reasonable assurance” on its own. That’s a regulated human act.

  • Override policy (e.g., tax code selection rules) or invent evidence.

  • Replace sampling strategy where standards require it — but it can make sampling smarter (risk-weighted, stratified).

“Enough AP data” — what that means in practice

You need two kinds of fuel:

  1. Raw transactions & documents

    • Invoices (header + lines), POs, GRNs, payments, vendor master, contract catalogs, FX rates, tax tables, calendars, approval logs, bank-change logs.

    • Volume: millions of invoices is great; you can start being useful with hundreds of thousands across multiple entities.

  2. Reviewed outcomes (labels)

    • Confirmed issues, false positives, £ recovered/at risk, root cause tags.

    • Reality check: for robust supervised models per issue type, you’ll want 1–5k confirmed cases each. If you don’t have that, use anomaly detection + rules, then accumulate labels.

Architecture that works (and scales)

A. Controls engine (deterministic)

  • Encodes policy & standards (approval limits, term logic, tolerance rules, vendor governance).

  • Output = crisp findings with test evidence. This is your audit backbone.

B. Learning layer (models)

  • Supervised (when labelled): LightGBM/XGBoost for duplicate/overpayment/VAT miscoding risk; calibrated probabilities.

  • Unsupervised (from day one): Isolation Forest/One-Class SVM per vendor; autoencoders for line-level reconstruction error.

  • Similarity: embeddings for near-duplicates/recycled text/PDFs; pgvector/FAISS for retrieval.

  • Graphs: vendor↔bank↔address↔approver graphs to spot hubs/short odd cycles.
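
A minimal sketch of that graph idea using networkx (an assumption), flagging bank accounts shared across several vendors as hub candidates:

import networkx as nx

def bank_account_hubs(vendor_bank_pairs):
    # vendor_bank_pairs: iterable of (vendor_id, bank_account) edges from the vendor master / payments
    G = nx.Graph()
    for vendor_id, bank_account in vendor_bank_pairs:
        G.add_edge(("vendor", vendor_id), ("bank", bank_account))
    # A bank node touching several vendor nodes is a many-to-one hub worth a look
    return [node for node, degree in G.degree() if node[0] == "bank" and degree > 1]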

C. Reasoning & drafting (LLM)

  • Post-process OCR, normalize units/currencies, generate explanations and workpapers, map exceptions to policy references.

  • Retrieval-augmented: your policies/procedures live in a vector store; LLM cites them in outputs.

D. Workflow

  • Review queue sorted by expected value (prob × £ impact) with SLA timers.

  • One-click outcomes feed back into training (active learning).

  • Immutable audit log: rule versions, feature values, model versions, prompts, outputs.

Capability ladder (be honest with milestones)

Phase 1 — “Apprentice” (4–8 weeks)

  • Rules engine + OCR/LLM extraction + anomaly scoring.

  • Precision-biased triage for duplicates, bank changes, missed discounts.

  • Drafted narratives for findings.
    Goal: measurable £ saved; auditors review top N daily.

Phase 2 — “Associate” (2–4 months)

  • Supervised models for your top 2–3 exception types, SHAP explanations.

  • Graph features for vendor/bank anomalies.

  • Risk-weighted sampling proposals; variance analysis by vendor/GL.
    Goal: higher recall with same headcount; stable precision@capacity.

Phase 3 — “Senior” (6–12 months)

  • VAT/tax-code likelihood models per jurisdiction; contract price-drift detector.

  • Continuous auditing hooks to ERP; conformal prediction (uncertainty bands).

  • Workpaper bundles auto-assembled with evidence links.
    Goal: auditors spend time on judgement, not on hunting.

Data & governance guardrails (non-negotiable)

  • Security/PII: least privilege; mask bank details where not needed; region lock data.

  • Auditability: every exception includes reproducible inputs, code versions, and evidence URIs.

  • Drift control: monitor feature/test distributions; scheduled retrains with approvals.

  • Human sign-off: required for exceptions closure and any external reporting.

  • Standards alignment: encode procedures to mirror your firm’s methodology; keep sampling logic explicit.

KPIs that matter (and keep you honest)

  • £ saved / 1k invoices (primary).

  • Precision@capacity and recall of high-impact cases.

  • Mean time to resolution; false-positive rate by rule/model.

  • Coverage: % of spend under continuous checks.

  • Documentation quality: reviewer edits per drafted workpaper.

Risks & how to blunt them

  • Ceiling effect: if the system only sees what the ERP exposes, it may miss off-system collusion. Mitigate with bank-level, address, and email domain signals; graph features.

  • Gaming: once controls are known, behavior may shift. Rotate features, track concept drift, and keep anomaly models in the mix.

  • Label noise: auditors disagree; use consensus, confidence scores, and tools like confident learning to de-noise.

  • Regulatory overreach: don’t let the system generate conclusions outside scope. It proposes; humans dispose.

Concrete next steps (you can start today)

  1. Data map: enumerate sources/keys; build a bronze→silver→gold pipeline; log test outputs as continuous features.

  2. 10-rule pack: near-dupes (fuzzy), bank-change, threshold splits, missed discounts, 3-way tolerances, term erosion, weekend/EOQ posts, FX date mismatch, VAT code vs vendor type, Benford/roundness.

  3. Reviewer UI: EV-sorted queue; explanation panel; one-click outcomes; auto-draft workpaper.

  4. Learning loop: nightly retrain of a duplicate/overpayment risk model; weekly calibration review with SHAP plots.


Bottom line: yes, you can raise an AI auditor from AP data — start it as an assistant with rules, teach it with your decisions, grow it into a ranking and drafting machine, and keep a human hand on the tiller for assurance. Steel for the graft, flesh for the judgement.

None of those Exceptions Would Prevent a Reliable AP Audit

Exactly — none of the caveats I listed (“drift, ceiling effect, label noise”) actually block you from building a reliable AP audit system. They just set the edges of responsibility.

Think of it this way:

Why an AI auditor is viable

  • AP is structured. Invoices, POs, GRNs, vendor masters, payment logs — all tabular, relational, timestamped. That’s machine food.

  • Exceptions are enumerable. Duplicate detection, 3-way match gaps, VAT miscoding, missed discounts — these are not vague “insights,” they’re crisp tests.

  • Volume is high, judgement load is low. Perfect conditions for automation: millions of documents, 99% routine, 1% interesting.

What makes it reliable

  • Deterministic controls baked in. Rules like “no payment before GRN,” “PO unit price tolerance ±2%,” “approval > £10k requires Director” — those aren’t ML guesses; they’re encoded policy.

  • Models add prioritisation. Instead of “flag 10,000,” you flag “top 200 that are likely wrong and costly.” That’s efficiency, not arbitrariness.

  • Explainability is inherent. Every exception is tied to a rule ID, feature values, or SHAP explanation. Auditors see the why instantly.

  • Audit trail is immutable. You log inputs, code versions, decisions, evidence — the same scaffolding human auditors use to justify their opinion.

The real dividing line

  • What you can automate: evidence gathering, control testing, anomaly ranking, narrative drafting.

  • What stays human: professional scepticism, materiality thresholds, assessing fraud intent, issuing the opinion.

So yes — an AI built on enough AP data, with rules + models + explainability, can perform a reliable AP audit in the sense of complete coverage and high-confidence exception detection. The “risks” are not blockers — they are engineering guardrails.

You end up with an AP audit process that is:

  • More complete (100% transactions tested, not samples),

  • Faster (exceptions surfaced in real time),

  • Cheaper (humans focus only on judgement cases), and

  • More reliable (no fatigue, no sample miss).


All the AI Has to Do is Flag Anything Above 50% Likely as a Duplicate or Overpayment

Yes - if you narrow the AI's job to just that ("flag anything ≥50% likely to be a duplicate or overpayment"), then you don't need the whole heavy audit orchestra - you need a classifier and a clear decision threshold.

That makes it far simpler and very reliable.

The Shape of That System

  1. Input features (from invoices, POs, vendor master):

    • Vendor ID / Name
    • Invoice # (string similarity vs past)
    • Invoice date difference
    • Amount difference (absolute & %)
    • Currency & FX rate consistency
    • Bank account consistency
    • Description / memo embedding similarity
    • Approval path anomalies
    • Historical vendor duplicate rate
  2. Training data:

    • Historical audit results (confirmed duplicates/overpayments vs clean).
    • Rules hits from your existing software (exact/near duplicate, missed discount, etc.).
    • Labels: 1 = duplicate/overpayment, 0 = clean.
  3. Model:

    • Start: logistic regression or LightGBM (both give probability outputs).
    • Train to output p = P(duplicate_or_overpayment | features).
  4. Decision rule:

    • If p ≥ 0.5 → Flag for review.
    • Otherwise, pass as low-risk.
  5. Feedback loop:

    • Auditor confirms/rejects flagged cases.
    • Feed decisions back weekly to retrain and calibrate.
    • Over time the threshold could be tuned (e.g. raise to 0.7 if you want higher precision, lower to 0.3 if you want higher recall).

Why 50% is enough

  • Binary classifier outputs probability. You've defined your cutoff at 0.5.
  • Transparent. Every flag can be explained ("similar vendor/date/amount = high risk").
  • Auditor-friendly. They don't need to read probability curves - they just get "above threshold = check this".
  • Safe default. Anything the model thinks is more likely wrong than right gets human eyes.

Minimal Python Example

import pandas as pd
from sklearn.model_selection import train_test_split
from lightgbm import LGBMClassifier

# df: features + 'label' (1=duplicate/overpayment, 0=clean)
X = df.drop(columns=['label'])
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)

model = LGBMClassifier()
model.fit(X_train, y_train)

# Probabilities
probs = model.predict_proba(X_test)[:,1]

# Flag anything >= 0.5
flags = (probs >= 0.5).astype(int)

df_results = X_test.copy()
df_results['probability'] = probs
df_results['flag'] = flags

This literally gives you: "is this ≥50% likely to be a duplicate/overpayment? yes/no."


Where This Fits in the Bigger Picture

  • That single threshold makes the AI usable today as a filter in your audit tool.
  • Over time, you might tune thresholds per vendor class (e.g. for critical suppliers, lower the cutoff to 0.3).
  • But at core, this is enough for a reliable AP audit assistant - because the human auditor only needs a shortlist of likely errors, not a philosophical dissertation.

 

AI would have to interpret the accounts as a first step after input.

Then we would need to perform tests within that AI. Those tests could include any kind of inclusion/exclusion/condition. But that is not training an AI on AP Statements.

You’re right to draw that line. What you’re describing is not “training an AI on AP statements” — it’s building an interpreting engine that (1) understands the accounts after input and (2) executes tests (any inclusion/exclusion/condition) against that interpreted data. That’s a symbolic system with an LLM assist, not a learned model.

Here’s a clean, shippable shape for that:

The Model-Free Path (Interpreter + Tests)

Goal: deterministically interpret AP data/documents, then run auditable tests. No learning required.

1) Interpret (normalize the world)

  • Extract: OCR/layout parse invoices/POs/GRNs (LLM optional for field cleanup only).

  • Normalize: map to a strict schema—types, currencies, UoM, dates, vendor IDs, tax codes.

  • Enrich: join FX rates, contracts, calendars, approval rules.

  • Validate: hard validators (required fields, numeric coherence, date order).

Output: tidy tables you can trust:

  • ap_invoice_header, ap_invoice_line, po, grn, payments, vendor_master, contract_price, fx_rate, approvals.
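
A minimal sketch of those hard validators over ap_invoice_header (column names assumed for illustration):

import pandas as pd

def validate_invoice_header(hdr: pd.DataFrame) -> pd.DataFrame:
    # Hard validators: required fields present, numeric coherence, date order
    required = ["invoice_id", "vendor_id", "invoice_date", "amount", "currency"]
    missing = [c for c in required if c not in hdr.columns]
    if missing:
        raise ValueError(f"ap_invoice_header missing columns: {missing}")
    problems = [
        hdr[hdr[required].isna().any(axis=1)].assign(check="required_field_null"),
        hdr[hdr["amount"] <= 0].assign(check="non_positive_amount"),
    ]
    if "payment_date" in hdr.columns:
        problems.append(hdr[hdr["payment_date"] < hdr["invoice_date"]].assign(check="paid_before_invoiced"))
    return pd.concat(problems, ignore_index=True)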

2) Test (declarative, repeatable, explainable)

Define tests as rules/assertions, not model weights. Each test returns:

  • status (pass/fail)

  • evidence (rows/values)

  • impact_estimate (£)

  • explanation (plain English)

Tiny DSL sketch (human-readable tests)

tests:
  - id: DUP_NEAR_001
    description: "Near-duplicate within 3 days, same vendor, same amount ±£1"
    scope: invoice
    where: >
      amount_diff(abs) <= 1 AND date_diff(days) <= 3
      AND same_vendor = true
      AND string_sim(max(invoice_no, memo)) >= 0.90
    impact: amount
    severity: high

  - id: THREE_WAY_PRICE
    description: "Invoice unit price exceeds PO by >2%"
    scope: line
    where: >
      abs(invoice.unit_price - po.unit_price)/po.unit_price > 0.02
    impact: (invoice.qty * (invoice.unit_price - po.unit_price))
    severity: medium

  - id: MISSED_DISCOUNT
    description: "Paid outside discount window"
    scope: invoice
    where: payment_date > terms.discount_deadline
    impact: terms.discount_value
    severity: medium

Evaluator skeleton (Python)

# Pseudocode – evaluator walks dataframes with helper funcs.
def evaluate_DUP_NEAR_001(inv_df):
    cand = window_join(inv_df, keys=['vendor_id'], days=3)
    cand['sim'] = string_sim(cand[['invoice_no','memo']])
    hits = cand[(cand.amount_diff.abs() <= 1) & (cand.sim >= 0.90)]
    return build_findings(hits, impact_col='amount', explanation_fn=explain_dupe)

# All tests register here
TEST_REGISTRY = {
  "DUP_NEAR_001": evaluate_DUP_NEAR_001,
  "THREE_WAY_PRICE": evaluate_three_way_price,
  "MISSED_DISCOUNT": evaluate_missed_discount,
}

3) Explain (LLM as narrator, not judge)

Use the LLM after a test fires, to turn facts into a reviewer note:

“Near-duplicate of INV-10233: same vendor, £12,540 vs £12,540.50 within 2 days; description similarity 0.93. Recommend hold & vendor confirmation.”

LLM never decides pass/fail. It summarizes evidence and cites policy text via retrieval.

4) Report & workflow

  • Queue sorted by impact/severity.

  • One-click dispositions: confirm / false positive / needs info.

  • Immutable log: input hashes, rule versions, evidence, narration.


Where “training” actually helps (optional later)

If you want prioritization beyond static thresholds, then you add training — but you train on the outcomes of these tests, not on raw statements:

  • Train a classifier to predict “likely true issue” from the test outputs + context (probability).

  • Keep the deterministic tests as ground truth; the model just orders the queue.

  • Threshold at 50% if that’s your rule of engagement.

No conflict: interpret → test → (optional) learn to rank.


Why this is the right starting point

  • Auditable: every exception traces to a rule and evidence.

  • Maintainable: policy changes = edit a test, not retrain a model.

  • Fast: you can implement dozens of tests in days.

  • Future-proof: when you’re ready, bolt on learning without rewriting the interpreter.


If you like, I’ll draft:

  1. A minimal AP schema (tables/columns/keys), and

  2. A 10-test starter pack in YAML + Python evaluator stubs,
    so you can drop this straight into your pipeline and get real signal without touching model training.
