The Architecture of Agency: Engineering Transactional LLMs for the Real World

The leap from a conversational chatbot to an autonomous agent is measured in state changes. It is the difference between an LLM that can tell you about an available 2:00 PM slot, and an LLM that can confidently acquire that slot, modify a remote database, charge a credit card, and guarantee the operation's integrity against competing concurrent requests.

Building Ordina, an AI secretary designed for service businesses, forced us to confront the reality of production-grade AI. The core engineering challenge isn't prompt engineering; it is building a deterministic, highly concurrent, and secure infrastructure around an inherently probabilistic reasoning engine.

To deliver our core promise, a guarantee of zero double-bookings combined with natural, context-aware conversations, we had to move beyond standard API wrappers and build a robust agentic architecture. That required solving three fundamental computer science problems through the lens of modern AI: multi-tenant contextual retrieval, distributed concurrency control, and asynchronous idempotency.

1. Multi-Tenant Contextual Retrieval: Securing the RAG Pipeline

For an AI agent to act on behalf of a specific business, it must have perfect recall of that business's localized context—prices, service menus, operational hours, and specific rules. Fine-tuning models per tenant is economically and computationally unviable. Instead, we rely on Retrieval-Augmented Generation (RAG).

However, in a multi-tenant SaaS environment, a naïve RAG implementation is a critical security vulnerability.

If Tenant A (a high-end salon) and Tenant B (a budget barber) share a vector space without strict physical or logical separation, a slight semantic drift in the LLM's retrieval query could cause the agent to confidently quote Tenant B's prices to Tenant A's client.

The Isolation Architecture

To prevent "context bleeding," our RAG pipeline enforces isolation at both the vector database and application tiers:

Metadata Tagging and Filtering: During the ETL (Extract, Transform, Load) phase of document ingestion, parsed chunks are embedded and strictly tagged with a unique tenant_id within the vector store metadata.
Deterministic Pre-filtering: When a client queries the agent, the retrieval mechanism does not rely on the LLM to filter the results. The vector search is executed with a hard constraint: WHERE tenant_id = 'X'. The LLM only ever sees the context belonging to the specific business link the client interacted with.
Row-Level Security (RLS): This isolation extends to the relational database containing structured data (client histories, current calendar availability). The database engine itself enforces RLS policies based on the authenticated session context, so the agent's data access stays scoped to a single tenant.

This guarantees that the agent's "worldview" is entirely bounded by the specific tenant it represents at any given millisecond.
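
As a minimal sketch of deterministic pre-filtering, the example below uses a plain in-memory list in place of a real vector database. The `Chunk` type, the `retrieve` function, the tenant names, and the toy embeddings and prices are all illustrative assumptions, not Ordina's actual schema. The point it shows is that the tenant filter runs before any similarity scoring, so another tenant's chunks can never be ranked, however similar they are.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    tenant_id: str          # tag written at ingestion time
    text: str
    embedding: list[float]  # toy 2-d vectors; real embeddings are much larger


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(store: list[Chunk], tenant_id: str,
             query_embedding: list[float], k: int = 3) -> list[Chunk]:
    # Hard pre-filter: chunks from other tenants are dropped before scoring,
    # the in-memory equivalent of WHERE tenant_id = 'X'.
    candidates = [c for c in store if c.tenant_id == tenant_id]
    ranked = sorted(candidates,
                    key=lambda c: cosine(c.embedding, query_embedding),
                    reverse=True)
    return ranked[:k]


# Hypothetical data: two tenants with semantically identical price chunks.
store = [
    Chunk("salon_a", "Haircut: 25,000 NGN", [0.9, 0.1]),
    Chunk("barber_b", "Haircut: 3,000 NGN", [0.9, 0.1]),
    Chunk("salon_a", "Open 9am to 6pm", [0.1, 0.9]),
]
results = retrieve(store, "salon_a", [1.0, 0.0], k=2)
```

Both haircut chunks have identical embeddings here, which is the worst case for semantic search alone. Only the filter keeps the barber's price out of the salon's context.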

2. Event-Driven Concurrency: Solving the "Race to the Slot"

The most complex distributed systems challenge in scheduling is managing race conditions over shared, finite resources (time slots) across disparate systems (Ordina's DB and external calendars like Google or Outlook).

When an LLM is the actor negotiating the slot, the latency of the reasoning cycle widens the window between checking availability and writing the booking, which exacerbates the concurrency risk.

The Race Condition Scenario

Consider two clients chatting with the AI simultaneously, both requesting the 2:00 PM Friday slot.

T+0: Client 1 asks for 2:00 PM. The agent checks availability and sees it is open.
T+1: Client 2 asks for 2:00 PM. The agent checks availability and sees it is open.
T+5: Both clients say, "Yes, book it."

If the agent relies solely on synchronous API calls to Google Calendar, both bookings will attempt to write. Depending on network latency, either a double-booking occurs or the API rejects the second request, leaving the agent to surface a raw error or hallucinate a response.
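
The timeline above is a classic check-then-act race. The following sketch replays it deterministically in a single thread, with a dictionary standing in for the calendar; the names `is_open` and `naive_write` are hypothetical and exist only for this illustration.

```python
calendar: dict[str, list[str]] = {}  # slot -> clients booked into it


def is_open(slot: str) -> bool:
    return slot not in calendar


def naive_write(slot: str, client: str) -> None:
    # No lock and no re-check at write time: whoever calls this gets booked.
    calendar.setdefault(slot, []).append(client)


slot = "friday-14:00"

# T+0 and T+1: both sessions check before either has written.
client1_sees_open = is_open(slot)
client2_sees_open = is_open(slot)

# T+5: both sessions act on their now-stale check.
if client1_sees_open:
    naive_write(slot, "client_1")
if client2_sees_open:
    naive_write(slot, "client_2")

double_booked = len(calendar[slot]) > 1
```

Neither check was wrong when it ran. The failure comes from the gap between the check and the write, which is the gap the LLM's reasoning latency makes longer.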

Distributed Locking and State Machines

To solve this, Ordina implements an event-driven architecture utilizing distributed locks and explicit state management:

The Intent State: When the LLM decides to book, it does not execute the write operation directly. It emits an IntentToBook event containing the tenant_id, provider_id, and time_block.
The Distributed Lock: Our orchestration layer intercepts this event and attempts to acquire a distributed lock (via Redis) on that specific [provider_id]:[time_block].
The Resolution:
- Winner: Client 1's request acquires the lock. The system executes the transaction (writing to our DB and pushing the event to Google Calendar through its API) and returns a Success state to Client 1's agent session.
- Loser: Client 2's request fails to acquire the lock. The system returns a SlotUnavailable state to Client 2's agent session.
Graceful Degradation: The LLM receives the SlotUnavailable state before it generates its final response to Client 2. This allows the agent to pivot naturally: "I apologize, but that slot was just taken. I have 3:00 PM or 4:00 PM available. Would either work?"

By decoupling the LLM's reasoning from the transaction execution, we eliminate the race condition and guarantee calendar integrity.
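
The flow described in this section can be sketched as follows. This is an illustration under stated assumptions, not Ordina's implementation: `InMemoryLockStore` is a stand-in that mimics the semantics of Redis `SET key value NX PX ttl` so the example runs without a server, and `handle_intent`, the `lock:` key prefix, the 10-second TTL, and the `bookings` dictionary are hypothetical. With the redis-py client, the acquire step would correspond to `client.set(key, token, nx=True, px=ttl_ms)`, and the release would need an atomic compare-and-delete, typically a small Lua script.

```python
import time
import uuid
from dataclasses import dataclass


class InMemoryLockStore:
    """Stand-in for Redis, mimicking SET with NX (only if absent) and PX (TTL in ms)."""

    def __init__(self) -> None:
        self._data: dict[str, tuple[str, float | None]] = {}

    def set(self, key, value, nx=False, px=None):
        now = time.monotonic()
        entry = self._data.get(key)
        if entry is not None and entry[1] is not None and entry[1] <= now:
            entry = None  # the previous lock has expired
        if nx and entry is not None:
            return None   # someone else holds the lock
        expires_at = now + px / 1000 if px is not None else None
        self._data[key] = (value, expires_at)
        return True

    def release(self, key, token):
        # Only the holder (matching token) may release the lock.
        entry = self._data.get(key)
        if entry is not None and entry[0] == token:
            del self._data[key]
            return True
        return False


@dataclass(frozen=True)
class IntentToBook:
    tenant_id: str
    provider_id: str
    time_block: str
    client_id: str


def handle_intent(intent: IntentToBook, locks: InMemoryLockStore,
                  bookings: dict, ttl_ms: int = 10_000) -> str:
    key = f"lock:{intent.provider_id}:{intent.time_block}"
    token = uuid.uuid4().hex
    if not locks.set(key, token, nx=True, px=ttl_ms):
        return "SlotUnavailable"      # lost the race for the lock
    try:
        slot = (intent.provider_id, intent.time_block)
        if slot in bookings:          # re-check under the lock
            return "SlotUnavailable"
        bookings[slot] = intent.client_id  # stands in for DB write + calendar sync
        return "Success"
    finally:
        locks.release(key, token)


locks, bookings = InMemoryLockStore(), {}
first = handle_intent(IntentToBook("t1", "p1", "fri-14:00", "client_1"), locks, bookings)
second = handle_intent(IntentToBook("t1", "p1", "fri-14:00", "client_2"), locks, bookings)
```

The re-check under the lock matters in this sketch: the lock only serializes writers, so a request that arrives after the winner has released it must still find the slot recorded as taken. The TTL is there so that a crashed worker cannot hold a slot forever.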

3. Asynchronous Idempotency: The Payment Gateway

Ordina handles transactions via Flutterwave, allowing businesses to accept deposits or full payments in Naira directly in the chat flow.

Integrating LLMs with payment gateways introduces a severe risk: the "retry loop." If an agent encounters a network timeout while calling the payment API, it may simply call the tool again, much like an HTTP client configured with automatic retries. Without strict safeguards, this results in duplicate charges.

The Agent Payment Gateway Pattern

Agents cannot be trusted to manage idempotency natively. Therefore, Ordina utilizes a dedicated Agent Payment Gateway microservice.

Stable Idempotency Keys: When the agent generates a payment link, the orchestration layer assigns a deterministic idempotency_key based on the unique booking intent (e.g., payment:tenant123:booking456).
The Deduplication Window: This key is passed to the payment processor. If the agent—or the client's network connection—drops and retries the exact same request, the payment gateway recognizes the active idempotency key.
Cached Responses: Instead of initiating a new charge, the gateway returns the cached state of the initial transaction (e.g., IN_PROGRESS or COMPLETED).
Webhook Callbacks: We rely entirely on asynchronous webhooks from the payment provider to confirm success, rather than trusting the synchronous response of the agent's API call. Only when the webhook confirms the payment do we trigger the final booking confirmation and WhatsApp notifications.
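
A minimal sketch of this pattern is shown below, assuming an in-memory record store and an injected `charge_fn` that stands in for the call to the payment processor. The class and method names are hypothetical, and the `FAILED` status is an assumption added for completeness; only `IN_PROGRESS` and `COMPLETED` come from the description above. It does not use the Flutterwave SDK.

```python
class AgentPaymentGateway:
    def __init__(self, charge_fn):
        self._charge_fn = charge_fn  # hypothetical call into the payment processor
        self._records: dict[str, dict] = {}

    @staticmethod
    def idempotency_key(tenant_id: str, booking_id: str) -> str:
        # Deterministic: the same booking intent always yields the same key.
        return f"payment:{tenant_id}:{booking_id}"

    def request_payment(self, tenant_id: str, booking_id: str, amount: int) -> dict:
        key = self.idempotency_key(tenant_id, booking_id)
        if key in self._records:
            # Retry of a known request: return the cached state, charge nothing.
            return self._records[key]
        self._records[key] = {"key": key, "status": "IN_PROGRESS", "amount": amount}
        self._charge_fn(key, amount)
        return self._records[key]

    def on_webhook(self, key: str, succeeded: bool):
        # Only the provider's webhook moves a payment out of IN_PROGRESS.
        record = self._records.get(key)
        if record is None or record["status"] != "IN_PROGRESS":
            return record  # unknown key or duplicate webhook: no state change
        record["status"] = "COMPLETED" if succeeded else "FAILED"
        return record


charges = []  # records every call that would reach the processor
gateway = AgentPaymentGateway(lambda key, amount: charges.append((key, amount)))

initial = gateway.request_payment("tenant123", "booking456", 5000)
status_before_webhook = initial["status"]
retry = gateway.request_payment("tenant123", "booking456", 5000)  # agent retries
gateway.on_webhook("payment:tenant123:booking456", succeeded=True)
final = gateway.request_payment("tenant123", "booking456", 5000)
```

A production handler would also verify the webhook's authenticity before trusting it and would persist the records durably; both are omitted here to keep the deduplication logic visible.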

The Future of Agentic Infrastructure

Building a reliable AI secretary requires treating the Large Language Model not as a monolithic brain, but as a reasoning engine embedded within a highly structured, defensive architecture.

By enforcing strict data isolation in the RAG pipeline, managing state through distributed locks, and decoupling critical actions through asynchronous, idempotent gateways, we move AI out of the realm of novelty and into the critical path of business operations.
