Banking Like It’s 1999: What Human Workflows Can Teach AI
Banking Like It’s 1999
In Taiwan, particularly when you’re a foreigner, there are still several banking tasks where you actually have to go to the bank. For me, it’s mainly been transferring money back to the U.S. to invest. At some point in the transfer workflow, the teller will call over a supervisor to reconfirm my identity. This “four eyes” rule before initiating a large transfer makes sense, because independent review is one of the most effective ways to catch errors or fraud.
AI Agents Make More Mistakes Than First-Line Bank Tellers
If LLM-based AI agents are to take over more and more of our daily workflows, software technologists face a major problem: hallucinations, blind spots, and systematic errors are baked into how LLMs work. And the stopgap measures people are trying today, like reasoning tricks, clever prompt engineering, and larger models, aren’t going to fix this completely. I can’t stress enough that LLMs are not “magic”. LLMs perform next-token prediction, prediction is stochastic, and stochastic output implies at least some probabilistic error.
So how can you trust an AI agent to autonomously book travel, shop online, make healthcare recommendations, or complete other high-stakes transactions? It’s discouraging to see engineers who should know better fall back on justifications like “the LLM answer is correct 99% of the time.” That might be fine for a demo, but 99%, or even 99.99%, isn’t enough for some of the use cases people claim are about to be automated. We need real solutions, not demo-quality full-stack engineering hacks.
Human Workflows To The Rescue
Going back to my banking example, it might occur to you that age-old human workflows can be adapted to AI agents. Thanks to the wonders of reliability math, some of these techniques can radically improve the reliability of our systems. Let’s look at a few.
Dual Control (“Four-Eyes Principle”)
Idea: Two (or more) people review and approve any high-stakes action.
AI Application: Two different AI models (or one model and a human) independently confirm an action plan before it’s executed.
Benefits: This approach dramatically reduces the risk of errors, as both reviewers must agree before an action proceeds.
Discussion: This is used in my banking transfer example. It should also be familiar to software engineers from code review and release workflows.
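To make this concrete, here is a minimal sketch of a dual-control gate in Python. The reviewers are placeholder callables, stand-ins for a second model, a rules engine, or a human approval step; nothing here assumes a specific LLM API, and the names and limits are purely illustrative.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ProposedAction:
    description: str   # human-readable summary, e.g. "wire $10,000 to account X"
    payload: dict      # structured parameters the agent wants to execute

# A reviewer inspects a proposed action and returns True (approve) or False (reject).
Reviewer = Callable[[ProposedAction], bool]

def dual_control_execute(action: ProposedAction,
                         reviewers: list[Reviewer],
                         execute: Callable[[ProposedAction], None]) -> bool:
    """Execute only if every independent reviewer approves ("four eyes")."""
    if all(review(action) for review in reviewers):
        execute(action)
        return True
    return False  # a single rejection blocks execution

# --- Example with stub reviewers (stand-ins for a second model or a human) ---
def model_reviewer(action: ProposedAction) -> bool:
    # In a real system: ask an independent model to re-derive and confirm the plan.
    return action.payload.get("amount", 0) <= 50_000

def human_reviewer(action: ProposedAction) -> bool:
    # In a real system: surface the action in an approval UI and wait for a decision.
    return True

if __name__ == "__main__":
    transfer = ProposedAction("wire transfer", {"amount": 10_000, "to": "US brokerage"})
    dual_control_execute(transfer, [model_reviewer, human_reviewer],
                         execute=lambda a: print("executed:", a.description))
```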
Ensemble Voting
Idea: Boards, juries, hiring committees, and the like combine multiple perspectives before arriving at a decision.
AI Application: Multiple AI models run independently on the same goal-oriented task, with the final plan of action decided by majority vote or weighted scoring.
Benefits: Setting multiple models on the same task mitigates hallucinations or idiosyncrasies from blind spots in a single AI model.
Discussion: Ensembles are a proven software reliability technique. The Space Shuttle famously ran redundant computers with a voting system, which in practice mitigated hardware failures. Software faults could be reduced via N-version programming, where multiple independent (or as independent as possible) implementations run side by side.
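As a sketch, majority voting over independent runs might look like the following. The agents here are placeholder callables standing in for different models or differently prompted runs, and the vote assumes their outputs can be normalized into comparable form.

```python
from collections import Counter
from typing import Callable, Optional

Agent = Callable[[str], str]   # takes a task, returns a proposed action (normalized string)

def ensemble_decide(task: str, agents: list[Agent],
                    min_votes: Optional[int] = None) -> Optional[str]:
    """Run each agent independently; return the majority proposal, or None if no consensus."""
    proposals = [agent(task) for agent in agents]
    winner, votes = Counter(proposals).most_common(1)[0]
    threshold = min_votes if min_votes is not None else (len(agents) // 2 + 1)
    return winner if votes >= threshold else None   # None -> escalate or abort

# --- Example with stub agents standing in for different model families ---
agents = [
    lambda task: "BOOK flight TPE->SFO on 2025-10-01",
    lambda task: "BOOK flight TPE->SFO on 2025-10-01",
    lambda task: "BOOK flight TPE->SFO on 2025-10-02",   # one disagreeing run
]

decision = ensemble_decide("book my usual flight home", agents)
print(decision or "no consensus, escalating")
```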
Checklists
Idea: Pilots, surgeons, mechanics, and others use checklists to catch predictable mistakes.
AI Application: Validate each action in an AI agent workflow against explicit rules, ideally independent of the AI itself.
Benefits: Step-by-step validation shifts part of the reliability burden off the AI agent and onto the checks.
Discussion: Checklists are a simple but time-tested way to prevent errors in high-stakes environments. The value isn’t just in catching mistakes but in enforcing a disciplined process where every step is consciously validated. Applied to AI agents, checklists act as an external layer of control to keep them from making obvious errors or going off the rails. Much like in aviation or medicine, a checklist doesn’t guarantee success—but it systematically blocks known failure modes.
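A minimal sketch of checklist-style validation might look like this. The individual checks are illustrative rules you would replace with your own; the important part is that they are plain code, not LLM calls.

```python
from typing import Callable

# A check inspects a proposed step and returns an error message, or None if the step passes.
Check = Callable[[dict], str | None]

def check_amount_limit(step: dict) -> str | None:
    if step.get("action") == "transfer" and step.get("amount", 0) > 25_000:
        return "amount exceeds single-transfer limit"
    return None

def check_known_recipient(step: dict) -> str | None:
    allowed = {"US brokerage", "rent account"}
    if step.get("action") == "transfer" and step.get("recipient") not in allowed:
        return "recipient is not on the approved list"
    return None

CHECKLIST: list[Check] = [check_amount_limit, check_known_recipient]

def run_checklist(step: dict) -> list[str]:
    """Run every check on a proposed agent step; any failure message blocks the step."""
    return [msg for check in CHECKLIST if (msg := check(step)) is not None]

# --- Example ---
step = {"action": "transfer", "amount": 40_000, "recipient": "unknown party"}
failures = run_checklist(step)
if failures:
    print("blocked:", failures)
else:
    print("step passes the checklist")
```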
Escalation
Idea: Junior staff escalate unusual or risky decisions to senior staff.
AI Application: Ask agents for a “confidence score” for every step and the overall workflow. If the score is low, escalate: run a stronger model or loop in a human.
Benefits: Escalation ensures high-risk workflows receive extra scrutiny, while routine actions remain efficient. (This was also part of my bank transfer example.) You’ve seen a version of this in the AI world with GPT-5’s automatic mode switching (though that’s probably more about cost savings than reliability). Still, it’s often useful to explicitly request a confidence score in prompts, and we can build on this in agent workflows.
Discussion: Escalation is a common safeguard in human organizations. Bringing this principle into AI agent workflows helps ensure that when a system is “uncertain” (measured by the aforementioned “confidence score” or anomaly detection or similar), additional scrutiny kicks in. This can be as simple as routing a task to a more capable AI model, or as complex as involving a human in the loop. Either way, escalation provides a lever that balances efficiency with reliability.
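Here is a sketch of confidence-based escalation. The confidence value is assumed to come from the agent itself (for example, by asking for a score in the prompt) or from an external signal; the tiers and threshold are placeholders to tune for your own workflow.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class AgentResult:
    answer: str
    confidence: float   # 0.0 - 1.0, however you choose to elicit or estimate it

def escalating_run(task: str,
                   fast_agent: Callable[[str], AgentResult],
                   strong_agent: Callable[[str], AgentResult],
                   ask_human: Callable[[str], str],
                   threshold: float = 0.8) -> str:
    """Try the cheap agent first; escalate to a stronger model, then a human, on low confidence."""
    result = fast_agent(task)
    if result.confidence >= threshold:
        return result.answer

    result = strong_agent(task)      # escalation tier 1: stronger (slower, pricier) model
    if result.confidence >= threshold:
        return result.answer

    return ask_human(task)           # escalation tier 2: human in the loop

# --- Example with stubs ---
answer = escalating_run(
    "should this claim be approved?",
    fast_agent=lambda t: AgentResult("approve", confidence=0.55),
    strong_agent=lambda t: AgentResult("approve", confidence=0.72),
    ask_human=lambda t: "sent to claims reviewer",
)
print(answer)
```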
Caveat
Earlier I suggested that we might use reliability math to combine several “two-nines” systems into a single three- or four-nines system. I may do a follow-up post diving into this math for LLMs specifically.
For now, keep in mind that techniques relying on multiple runs assume independence. That assumption can be a little shaky. All large models are trained on overlapping datasets, inheriting similar biases. Many also use distillation, where one model literally learns from another, again breaking independence.
There are ways to mitigate this: Diversify by using models from different families, add rule-based validation layers that don’t rely on LLMs (see above), or introduce “design diversity” by running multiple different prompts and workflows. None of these make independence perfect, but they reduce the risk of correlated failure enough to make the reliability math worth applying.
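For a rough sense of what’s at stake, here is a back-of-the-envelope sketch. It assumes each check independently misses a bad action 1% of the time; under full independence the miss rates multiply, while correlated failures erode most of that gain. The correlation model below is deliberately crude and only illustrative.

```python
# Back-of-the-envelope reliability math for stacked checks.
# Assumption: each reviewer/check independently misses a bad action 1% of the time.
p_miss = 0.01

for n_checks in (1, 2, 3):
    combined_miss = p_miss ** n_checks   # all checks must miss for the error to slip through
    print(f"{n_checks} independent check(s): miss rate {combined_miss:.6%}")

# With correlated failures the gain shrinks. Crude illustration: with probability rho the
# second check shares the first check's blind spot and fails whenever it does.
rho = 0.3
correlated_miss = p_miss * (rho + (1 - rho) * p_miss)
print(f"2 checks, correlation {rho}: miss rate {correlated_miss:.6%}")
```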
In Practice
By now, the astute software technologists among you have probably noticed that applying human workflows to AI agents sounds like a lot of extra design and implementation work. And you’d mostly be right.
But the point isn’t to overcomplicate every workflow with endless checks. It’s to raise awareness of the reliability limitations we inherit when using LLMs and to point out that we already know how to mitigate them. We’ve been doing it for decades with humans as well as more “traditional” software systems.
Reliability math is on our side. But only if we stop treating AI agents as autonomous black boxes, and start treating them as participants in structured workflows that we can design, monitor, and improve.
Let’s Build More Reliable AI Together
If your team is experimenting with AI agents and wants to make them more robust, I’d love to help. I work with engineering leaders to apply proven reliability techniques like the ones above to real-world AI systems. Get in touch if you’d like to talk about building safer, more trustworthy AI into your products and workflows.