Why Your Agent Won't Behave: Determinism, Drift, A...

Deterministic Systems vs. Non-Deterministic AI

What Determinism Actually Means

A deterministic system is one where the same input always produces the same output. No randomness, no ambiguity, no variation. The entire history of software engineering — compilers, databases, operating systems, network protocols — is built on this property.

Determinism gives you three things that engineers take for granted:

Reproducibility. You can replay a bug, bisect a regression, write a test that fails today and passes tomorrow only if the code changed.

Composability. You can chain predictable functions together and the result stays predictable. If each step behaves the same way every time, so does the whole pipeline.

Accountability. When something goes wrong, you can trace exactly why. The audit trail is complete.

These aren't luxuries. For most enterprise systems — billing, compliance, transaction processing, access control — determinism isn't optional. You need to know what happens next, and you need it to happen the same way every time.

Where Language Models Introduce Non-Determinism

A common misconception: setting temperature to zero should make an LLM deterministic. Mathematically, it looks like it should. Temperature zero means the model just picks the token with the highest probability at each step — no randomness, same input, same output.
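The "mathematically, it looks like it should" argument can be made concrete with a minimal sketch. The logits below are made-up numbers for three hypothetical candidate tokens, and the two functions are illustrative helpers, not any vendor's sampling API.

```python
import math

def softmax(logits, temperature):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_token(logits):
    """Idealized temperature zero: return the index of the highest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])

logits = [2.0, 1.0, 0.5]        # hypothetical scores, not real model output
warm = softmax(logits, 1.0)     # probability mass spread across candidates
cold = softmax(logits, 0.05)    # almost all mass on the top candidate
choice = greedy_token(logits)   # the limit case: plain argmax
```

On paper the argmax is a pure function of the logits. The catch is that the logits themselves are not guaranteed to be bit-identical from one run to the next.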

In practice, identical prompts can still produce different outputs. Why? Because the non-determinism doesn't come from the model's logic. It comes from how the arithmetic is executed on the hardware.

GPU floating-point arithmetic doesn't behave like textbook math. Floating-point addition is not associative, so when billions of operations run in parallel across thousands of cores, rounding errors accumulate differently depending on the order in which partial results are combined. Change which GPU you run on, which other requests you're batched with, or even which driver version is installed, and subtly different numbers percolate through the computation. When two candidate tokens are nearly tied, a difference in the last few bits of a logit is enough to flip which one the model chooses.
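A small CPU-only sketch shows the same effect without a GPU. The first part uses the standard double-precision example of non-associative addition. The second part is a deliberately exaggerated toy: the "logits", the rival score of 0.5, and the token names "A" and "B" are all hypothetical.

```python
def sum_in_order(values):
    """Accumulate left to right, as one particular reduction order would."""
    total = 0.0
    for v in values:
        total += v
    return total

# Same three numbers, two groupings: floating-point addition is not associative.
left_grouped = (0.1 + 0.2) + 0.3
right_grouped = 0.1 + (0.2 + 0.3)
grouping_matters = left_grouped != right_grouped

# Exaggerated toy: identical terms summed in two orders yield different "logits".
logit_order_a = sum_in_order([1e16, 1.0, -1e16])   # the 1.0 is lost to rounding
logit_order_b = sum_in_order([1e16, -1e16, 1.0])   # the 1.0 survives

def pick(logit_a, logit_b):
    """Greedy choice between two hypothetical tokens 'A' and 'B'."""
    return "A" if logit_a > logit_b else "B"

rival_logit = 0.5
choice_a = pick(logit_order_a, rival_logit)
choice_b = pick(logit_order_b, rival_logit)
```

Real inference differences are far smaller than this toy suggests, which is why they only matter when two candidates are nearly tied.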

Real-world benchmarks confirm this. When researchers tested major models like Claude, GPT-4o, and Gemini at temperature zero across repeated identical runs, short structured prompts stayed highly consistent — but open-ended reasoning prompts that generate hundreds of tokens diverged sharply, with some models producing a different output on nearly every run. Same prompt, different results.

The takeaway: this isn't a bug. It's a consequence of running floating-point computations on parallel, batched hardware. You can't eliminate it without trading away performance, for example by forcing a fixed reduction order or using slower, more expensive exact-arithmetic systems. It's an engineering tradeoff, not a flaw in the model.

How Chain-of-Thought Amplifies Variability

Chain-of-thought (CoT) reasoning works by breaking a complex problem into steps. Instead of jumping straight to the answer, the model generates intermediate reasoning: “First, let me consider the cost… then the timeline… therefore the answer is X.”

The catch: each step is probabilistic. The model picks reasoning tokens from a probability distribution, and each token becomes context for the next choice. This creates a branching structure. Pick “cost first” as your opening thought, and you've reshaped the entire downstream reasoning. Pick “timeline first,” and you're on a completely different path to (maybe) a different answer.

Recent research adds a further twist: the visible reasoning text might not be where the actual reasoning is happening. Some analyses of model internals suggest that LLMs often encode answers in their hidden layers before they write out the reasoning. The surface-level chain of thought might be more of a scaffold that structures the generation than a faithful record of how the model actually thinks.

The practical implication: CoT reasoning is valuable for improving accuracy, but it's not a deterministic scaffold. It's itself a source of variability. The longer and more complex the reasoning chain, the more opportunities for small divergences to compound.
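A back-of-the-envelope calculation shows why length matters. It assumes, purely for illustration, that two runs agree on each next token with a fixed, independent probability. Real models do not behave this uniformly, and the 0.999 figure is invented.

```python
def identical_run_probability(per_token_agreement, num_tokens):
    """Chance that two runs match on every one of num_tokens tokens,
    under the simplifying assumption of independent per-token agreement."""
    return per_token_agreement ** num_tokens

short_answer = identical_run_probability(0.999, 20)    # short structured reply
long_chain = identical_run_probability(0.999, 500)     # long reasoning chain
```

Under these assumptions the short reply matches about 98% of the time, while the 500-token chain matches only about 61% of the time, even though the per-token agreement is identical.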

Compounding Variance, Not Chaos

It's tempting to call this the “butterfly effect” — and some researchers do. But there's an important difference worth understanding.

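Chaotic systems (weather, turbulence) are deterministic: they follow fixed laws, but tiny differences in initial conditions lead to wildly different outcomes. Agentic AI is different. It's probabilistic at every step, so fresh variation is injected at each step rather than amplified from a single starting point. Small differences in early reasoning steps still compound into bigger divergences downstream, but the mechanism is accumulated randomness, not deterministic chaos.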

The good news: formal analysis suggests this compounding is sublinear, not explosive. Under the assumptions of that analysis, error doesn't grow exponentially. It grows more slowly, roughly like the square root of the number of steps. That means the system degrades predictably. You can engineer around it.
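The square-root intuition can be illustrated with a toy simulation. This is not the cited martingale analysis. It assumes each step adds independent, zero-mean Gaussian noise, which is the simplest setting where accumulated deviation scales with the square root of the step count. The function name and parameters are hypothetical.

```python
import random
import statistics

def drift_spread(checkpoints, trials=1000, step_noise=1.0, seed=0):
    """Simulate many noisy trajectories and report the spread of the
    accumulated deviation at each checkpoint (number of steps)."""
    rng = random.Random(seed)
    ordered = sorted(checkpoints)
    samples = {t: [] for t in ordered}
    for _ in range(trials):
        position = 0.0
        for step in range(1, ordered[-1] + 1):
            position += rng.gauss(0.0, step_noise)  # independent per-step noise
            if step in samples:
                samples[step].append(position)
    return {t: statistics.pstdev(vals) for t, vals in samples.items()}

spread = drift_spread([100, 400])
ratio = spread[400] / spread[100]  # 4x the steps, roughly 2x the spread
```

Quadrupling the number of steps roughly doubles the spread in this model, rather than multiplying it by four or more.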

How to Evaluate Non-Deterministic Systems

Testing a deterministic function is straightforward: assert that this input always produces that output. Non-deterministic systems need a different approach.

Run evaluations multiple times. One run tells you almost nothing. Run your eval 5-10 times and look at the distribution. Does it consistently hit 90%? Or does it swing between 70% and 100%? Both average to similar numbers, but they're completely different systems.
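A minimal sketch of that comparison, using only the standard library. The pass rates below are invented to mirror the two cases just described, and `summarize_runs` is a hypothetical helper name.

```python
import statistics

def summarize_runs(pass_rates):
    """Describe the distribution of per-run pass rates, not just the average."""
    return {
        "mean": statistics.mean(pass_rates),
        "stdev": statistics.stdev(pass_rates),
        "min": min(pass_rates),
        "max": max(pass_rates),
    }

# Two hypothetical systems, five eval runs each.
steady = summarize_runs([0.90, 0.90, 0.90, 0.90, 0.90])
swingy = summarize_runs([0.70, 1.00, 0.80, 1.00, 1.00])
```

Both summaries report a mean near 0.9, but only the spread and the minimum reveal that the second system occasionally drops to 70%.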

Define success as a range, not a point. Instead of “the output must be X,” say “the output must be one of these acceptable responses” or “must score above this semantic similarity threshold.” For code, run the tests. For math, check if the answer is correct. Let deterministic validators do the heavy lifting.
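One way to sketch a deterministic validator is shown below. It assumes the model is asked to reply with a JSON object. The field names ("currency", "total"), the allowed currencies, and the tolerance are all hypothetical choices made for the example.

```python
import json

def validate_invoice_reply(raw_text, expected_total, tolerance=0.01):
    """Accept any reply that parses as JSON, uses an allowed currency,
    and states a total within tolerance of the expected value."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    if data.get("currency") not in {"EUR", "USD"}:
        return False
    total = data.get("total")
    # bool is a subclass of int in Python, so exclude it explicitly
    if isinstance(total, bool) or not isinstance(total, (int, float)):
        return False
    return abs(total - expected_total) <= tolerance
```

Two replies that differ in key order, spacing, or number formatting pass the same check, so surface-level variation in the model's output never reaches the test verdict.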

Watch the variance. If your agent is wildly inconsistent on a task, it's often operating near a decision boundary, a place where tiny variations flip the outcome. That's a design problem worth fixing.
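A simple way to spot such cases is to repeat the same input and measure how often the runs agree with the most common answer. The sketch below assumes answers are already normalized to short labels. The labels and the helper name are hypothetical.

```python
from collections import Counter

def agreement_rate(answers):
    """Return the most common answer and the fraction of runs that gave it."""
    counts = Counter(answers)
    top_answer, top_count = counts.most_common(1)[0]
    return top_answer, top_count / len(answers)

# Ten hypothetical runs each on two different inputs.
stable = agreement_rate(["approve"] * 10)
borderline = agreement_rate(["approve"] * 5 + ["reject"] * 4 + ["escalate"])
```

An input whose agreement rate sits near 50% is a candidate for a clearer prompt, an explicit rule, or a human review gate.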

The Practitioner's Mental Model

Think of an agentic AI system as three concentric rings.

The outer ring is deterministic infrastructure: API gateways, authentication, rate limiting, database transactions, audit logging, workflow state machines. This is traditional software engineering and it must stay that way. These components need to know — with certainty — what happens next.

The middle ring is the control plane: orchestration logic, retry policies, timeout handling, output validation, schema enforcement, human-in-the-loop gates. This is where you manage stochastic behavior without eliminating it. As SAP has demonstrated with its Autonomous Enterprise architecture, governance composes as you add capabilities: when you combine two controlled agents, their governance constraints compose rather than dilute. The middle ring is deterministic in its structure (the rules are fixed) but adaptive in its responses (it reacts to variable AI outputs).

The inner ring is the AI model itself: reasoning, planning, generating, deciding. This is where non-determinism lives, and where it should be allowed to live. The model's value comes precisely from its ability to handle ambiguity, generalize from sparse examples, and produce contextually appropriate outputs that no rule system could enumerate in advance.

The mistake is letting the inner ring's properties leak into the outer ring. The skill is building a middle ring robust enough to prevent that.
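The three rings can be sketched in a few lines. This is an illustrative outline under stated assumptions, not a reference implementation: `run_with_guardrails`, `ALLOWED_ACTIONS`, and the fake model are hypothetical names, and the fake model stands in for a real LLM call.

```python
import random

ALLOWED_ACTIONS = {"approve", "reject", "escalate"}  # outer ring only accepts these

def run_with_guardrails(model_call, validate, max_attempts=3):
    """Middle ring: fixed rules (validate, bounded retries, human fallback)
    wrapped around a variable model output."""
    for attempt in range(1, max_attempts + 1):
        candidate = model_call()  # inner ring: non-deterministic
        if validate(candidate):
            return {"status": "accepted", "output": candidate, "attempts": attempt}
    # Nothing unvalidated ever leaves this function.
    return {"status": "needs_human_review", "output": None, "attempts": max_attempts}

def make_fake_model(seed):
    """Stand-in for an LLM: sometimes returns something outside the schema."""
    rng = random.Random(seed)
    replies = ["approve", "reject", "escalate", "maybe?", ""]
    return lambda: rng.choice(replies)

result = run_with_guardrails(make_fake_model(seed=7),
                             lambda out: out in ALLOWED_ACTIONS)
```

Whatever the model returns, the caller only ever sees one of two well-defined outcomes: a validated action or an explicit handoff to a human.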

Closing: An Enterprise Architecture Perspective

For enterprise architects evaluating agentic AI, the question isn't “deterministic or non-deterministic?” It's “where does each belong in my architecture?”

The ten patterns from Part 1 aren't academic abstractions — they map directly to familiar enterprise concerns. Agentic RAG maps to your existing content and knowledge management layer. HITL maps to your approval and governance workflows. State machine orchestration maps to your BPM and process automation layer. Tool-use routing maps to your API gateway and service mesh. The AI model itself is a new type of compute resource — powerful but probabilistic — that plugs into these existing layers rather than replacing them.

The enterprise systems that work will be the ones that treat determinism and non-determinism as complementary materials, not competing philosophies. Your transaction processing, your compliance rules, your audit trails — these remain deterministic, as they must. Your natural language understanding, your document reasoning, your adaptive decision-making — these benefit from the flexibility that probabilistic models provide.

But as SAP's leadership has emphasized in its Autonomous Enterprise messaging, simply plugging AI agents into your existing system landscape drives little value on its own. Moving to an autonomous enterprise requires serious change management — AI adoption goes hand in hand with business process change and end-user enablement.

The architecture that emerges looks less like “AI replaces the system” and more like “AI augments specific nodes in a well-governed system.” The workflow knows what happens next. The AI decides how it happens at the steps where rigid rules fall short. Humans provide strategic direction. Agents execute with guardrails.

That's not a compromise. That's good engineering.

Missed Part 1? “Tame Your Agents: 10 Design Patterns for Agentic AI That Actually Work in Production” covers the practical playbook. Next up, Part 3 maps everything onto SAP's AI landscape — Joule, Joule Studio, and the Skill-vs-Agent split that makes this whole determinism discussion concrete.

References

Non-Determinism in LLMs

Floating-Point Non-Associativity

Batch Variance and Infrastructure Heterogeneity

SugiV. (2025). Unmasking the True Culprit: Why Temperature=0 Doesn't Mean Deterministic LLM Inference. blog.sugiv.fyi (Empirical analysis of batch-size-dependent non-determinism via GPU kernel reduction order.)
Brenndoerfer, M. (2025). Why Temperature=0 Doesn't Guarantee Determinism in LLMs. mbrenndoerfer.com (Server heterogeneity, GPU driver variation, and MoE routing effects.)
QAnswer. (2026). Why LLMs Are Not Deterministic Even at Temperature 0. qanswer.ai (Benchmarks across Claude, GPT-4o, and Gemini showing determinism rates at temperature zero.)
Atil, B., et al. (2025). Non-Determinism of “Deterministic” LLM Settings. arXiv:2408.04667 (Systematic study of output variance across 5 models, 8 tasks, and 10 runs per configuration.)

Compounding Variance in Agentic Systems

Xu, Y., et al. (2026). Agentic Confidence Calibration. arXiv:2601.15778 (Holistic Trajectory Calibration for compounding errors across agent trajectories.)
Guo, A., et al. (2026). Information Fidelity in Tool-Using LLM Agents: A Martingale Analysis of the Model Context Protocol. arXiv:2602.13320 (Formal proof that error propagation grows sublinearly — O(√T), not exponentially.)
Singh, A., et al. (2025). Stochastic, Dynamic, Fluid Autonomy in Agentic AI: Implications for Authorship, Inventorship, and Liability. arXiv:2504.04058 (Analysis of compounded stochasticity across recursive agentic workflows.)

Statistical Evaluation of Agentic Systems

Rein, D., et al. (2024). Stochasticity in Agentic Evaluations: Quantifying Inconsistency with Intraclass Correlation. arXiv:2512.06710 (ICC-based framework for measuring trial-to-trial variance in agentic benchmarks.)

Chain-of-Thought Reasoning

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E. H., Le, Q. V., & Zhou, D. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS 2022. arXiv:2201.11903
Anonymous. (2026). LLM Reasoning Is Latent, Not the Chain of Thought. arXiv:2604.15726 (Position paper arguing reasoning should be studied as latent-state trajectory formation rather than faithful surface CoT.)

Survey and Overview Papers

Wang, L., Ma, C., Feng, X., et al. (2023). A Survey on Large Language Model Based Autonomous Agents. arXiv:2308.11432
Weng, L. (2023). LLM-Powered Autonomous Agents. Lil'Log. lilianweng.github.io
Xi, Z., et al. (2023). The Rise and Potential of Large Language Model Based Agents: A Survey. arXiv:2309.07864

This is Part 2 of a four-part series. Part 1 — “Tame Your Agents” is the builder's playbook: ten design patterns, when to use each, and the guardrails that keep agents in line. Part 2 (this post) goes under the hood: why LLMs are non-deterministic by nature, how that variability compounds across steps, how to evaluate systems that won't repeat themselves, and the three-ring architecture that puts it all together. Part 3 — “The SAP AI Landscape” maps these patterns onto SAP's agentic stack (Joule, Joule Studio, Skills vs. Agents, AI Core, MCP/A2A). Part 4 — “Hands-On” builds a simple Joule agent end to end.
