From LLM Prompt to Production (3/9) - Architecture Patterns
This is part of a series of blogs:
- Introduction
- Choosing the Right Technology
- Architecture Patterns
- Multi-Prompt Chaining
- Additional Complexity
- Redundancy & Scaling
- Security & Compliance
- Performance Optimization
- Observability & Monitoring
When deploying LLM prompts as production APIs, choosing the right architecture pattern is crucial for balancing performance, cost, and user experience. Let me break down the main patterns and their trade-offs for AWS serverless deployments.
Let us first understand what each architecture pattern represents conceptually and how they fundamentally differ in their approach to handling client-server communication for LLM requests.
1. Synchronous REST API
The synchronous REST API pattern represents the most traditional approach to web service communication. In this model, when a client sends a request to your LLM API, it establishes a direct, blocking connection where the client waits until the server completely processes the LLM prompt and returns the full response. This follows the classic request-response cycle that most developers are familiar with from standard web APIs.
The fundamental characteristic of this pattern is its simplicity and predictability. The client sends a request, the server processes it entirely, and then sends back a complete response. There’s no intermediate state or partial responses - it’s an all-or-nothing transaction. This makes it excellent for scenarios where LLM responses are quick (under 30 seconds) and the client application can afford to wait for the complete result.
However, this pattern becomes problematic for LLM applications because language models can take significant time to generate responses, especially for complex prompts or longer content generation tasks. During this processing time, the client connection remains open, consuming server resources and potentially timing out if the processing takes too long.
Best for: Simple queries, quick responses, traditional integrations
func handleLLMRequest(ctx context.Context, event events.APIGatewayProxyRequest) (events.APIGatewayProxyResponse, error) {
    response, err := callLLM(event.Body)
    if err != nil {
        return errorResponse(500, "LLM processing failed"), nil
    }
    return successResponse(response), nil
}
What this code does: This is a straightforward Lambda function that receives an HTTP request through API Gateway, makes a direct call to the LLM service, waits for the complete response, and returns it immediately. The entire process is synchronous - the client waits until the LLM finishes processing before getting any response.
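For completeness, calling such an endpoint is just a plain HTTP request. A minimal client sketch - the URL and JSON shape below are placeholders, not the actual service contract:

// Minimal synchronous client sketch. Endpoint and payload are illustrative only.
package main

import (
    "bytes"
    "fmt"
    "io"
    "net/http"
)

func main() {
    body := bytes.NewBufferString(`{"prompt": "Summarize this document in one sentence."}`)
    resp, err := http.Post("https://api.example.com/v1/generate", "application/json", body)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    result, _ := io.ReadAll(resp.Body)
    fmt.Println(string(result)) // Complete LLM response, delivered only once processing finishes
}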
Pros:
- Simple implementation and testing
- Familiar REST semantics for developers
- Built-in API Gateway features (rate limiting, caching, authentication)
- Easy client integration with standard HTTP libraries
Cons:
- Timeout limits: API Gateway caps synchronous integrations at 29 seconds by default, well below Lambda's 15-minute maximum
- Expensive for long-running LLM calls (you pay for idle waiting time)
- Poor user experience for slow responses (users see loading screens)
- No progress feedback during processing
2. WebSocket Real-time API
WebSocket represents a fundamentally different communication paradigm that establishes a persistent, bidirectional connection between client and server. Unlike REST APIs where each request creates a new connection, WebSocket creates a long-lived communication channel that allows both parties to send messages at any time during the connection’s lifetime.
For LLM applications, this pattern shines when you want to provide real-time streaming responses. Instead of waiting for the entire LLM response to be generated before sending anything back, the server can send partial responses as they’re generated - token by token or chunk by chunk. This creates the familiar experience seen in modern AI chat interfaces where users see the response being “typed out” in real-time.
The bidirectional nature of WebSocket also enables advanced features like allowing users to interrupt long-running LLM generations, send follow-up questions without establishing new connections, or implement conversational interfaces where context is maintained throughout the session. This makes it particularly powerful for interactive AI applications where the conversation flow is as important as individual responses.
Best for: Interactive applications, streaming responses, real-time feedback
func handleWebSocketMessage(ctx context.Context, event events.APIGatewayWebsocketProxyRequest) (events.APIGatewayProxyResponse, error) {
    connectionID := event.RequestContext.ConnectionID

    // Stream inside the handler: a Lambda execution environment is frozen as soon
    // as the handler returns, so a detached goroutine would not finish its work.
    stream := callLLMStream(event.Body)
    for chunk := range stream {
        sendToWebSocket(connectionID, chunk)
    }
    sendToWebSocket(connectionID, "COMPLETE")

    return events.APIGatewayProxyResponse{StatusCode: 200}, nil
}
What this code does: API Gateway maintains the persistent WebSocket connection and invokes this handler for each incoming message. The handler calls the LLM in streaming mode and, as tokens are generated, pushes each chunk back to the client through API Gateway’s connection-management API. The client sees the response being built word by word, similar to ChatGPT’s interface. Note that the streaming loop runs inside the handler rather than in a background goroutine, because Lambda freezes the execution environment once the handler returns.
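The sendToWebSocket helper is left undefined above. A possible sketch using the API Gateway Management API from the AWS SDK for Go v2 - the callback endpoint URL is a placeholder you would replace with your deployed WebSocket API’s URL, and the option names assume a recent SDK version:

import (
    "context"
    "log"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/apigatewaymanagementapi"
)

var mgmtClient *apigatewaymanagementapi.Client

func init() {
    cfg, err := config.LoadDefaultConfig(context.Background())
    if err != nil {
        log.Fatalf("load AWS config: %v", err)
    }
    // Callback endpoint of the deployed WebSocket API (placeholder value).
    mgmtClient = apigatewaymanagementapi.NewFromConfig(cfg, func(o *apigatewaymanagementapi.Options) {
        o.BaseEndpoint = aws.String("https://example.execute-api.us-east-1.amazonaws.com/prod")
    })
}

// sendToWebSocket pushes a single chunk to the connected client.
func sendToWebSocket(connectionID, chunk string) {
    _, err := mgmtClient.PostToConnection(context.Background(), &apigatewaymanagementapi.PostToConnectionInput{
        ConnectionId: aws.String(connectionID),
        Data:         []byte(chunk),
    })
    if err != nil {
        log.Printf("post to connection %s failed: %v", connectionID, err)
    }
}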
Pros:
- Real-time streaming capabilities (see responses as they’re generated)
- Better user experience with progressive responses
- Bidirectional communication (client can interrupt or modify requests)
- Lower perceived latency
Cons:
- Complex connection management (handling disconnects, reconnects)
- Higher infrastructure costs (persistent connections)
- WebSocket scaling challenges in serverless environments
- Connection state management complexity
3. Webhook Callbacks (Async Pattern)
The webhook callback pattern fundamentally decouples the request acceptance from the actual processing, creating an asynchronous workflow that can handle long-running operations gracefully. When a client submits a request, the server immediately acknowledges receipt and provides a tracking identifier, but the actual LLM processing happens in the background.
This pattern operates on the principle of “fire and forget” from the client’s perspective, but with a notification system that ensures the client eventually receives the results. Once the background processing completes, the server “calls back” to a URL provided by the client (the webhook) to deliver the results. This callback mechanism is what gives the pattern its name.
The asynchronous nature makes this pattern particularly well-suited for scenarios where LLM processing times are unpredictable or potentially very long. It’s also excellent for batch processing scenarios where multiple requests can be queued and processed efficiently in the background without keeping client connections open. The pattern naturally supports retry mechanisms, error handling, and can scale processing independently from request acceptance.
Best for: Background processing, integrations, fire-and-forget scenarios
// Step 1: Accept request and queue it
func handleAsyncRequest(ctx context.Context, event events.APIGatewayProxyRequest) (events.APIGatewayProxyResponse, error) {
    var req LLMRequest
    if err := json.Unmarshal([]byte(event.Body), &req); err != nil {
        return errorResponse(400, "invalid request body"), nil
    }

    requestID := generateRequestID()
    // Store request details (prompt, callback URL, status) for later processing
    storeRequest(requestID, req.Prompt, req.CallbackURL)
    // Queue the work for background processing; the message body carries the request ID
    sendToSQS(requestID)
    // Immediately return to client
    return successResponse(map[string]string{
        "request_id": requestID,
        "status":     "processing",
    }), nil
}

// Step 2: Background processor handles the actual LLM call
func processLLMRequest(ctx context.Context, event events.SQSEvent) error {
    for _, record := range event.Records {
        requestID := record.Body
        request := getRequestFromDB(requestID)

        // Now do the actual LLM processing
        response, err := callLLM(request.Prompt)
        if err != nil {
            // Notify the client about the error via their callback URL
            notifyCallback(request.CallbackURL, requestID, "error", err.Error())
            continue
        }
        // Send the successful result to the client's callback URL
        notifyCallback(request.CallbackURL, requestID, "complete", response)
    }
    return nil
}
What this code does: This pattern splits the process into two phases. First, the API immediately accepts the request, generates a unique ID, stores the request details in DynamoDB, puts it in an SQS queue, and returns the ID to the client. The client can use this ID to check status or will receive results via a webhook. Separately, background Lambda functions process the SQS queue, perform the actual LLM calls, and send results back to the client’s specified callback URL.
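The notifyCallback helper is where the “callback” in the pattern’s name actually happens. A minimal sketch - the JSON field names are an illustrative convention, not a fixed contract:

import (
    "bytes"
    "encoding/json"
    "fmt"
    "net/http"
    "time"
)

// notifyCallback POSTs the outcome of a request to the client-supplied webhook URL.
func notifyCallback(callbackURL, requestID, status, result string) error {
    payload, err := json.Marshal(map[string]string{
        "request_id": requestID,
        "status":     status,
        "result":     result,
    })
    if err != nil {
        return err
    }

    client := &http.Client{Timeout: 10 * time.Second} // don't hang the processor on a slow webhook
    resp, err := client.Post(callbackURL, "application/json", bytes.NewReader(payload))
    if err != nil {
        return err
    }
    defer resp.Body.Close()

    if resp.StatusCode >= 300 {
        return fmt.Errorf("callback returned status %d", resp.StatusCode)
    }
    return nil
}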
Pros:
- No Lambda timeout concerns (can process for hours)
- Cost-effective for long operations (no idle waiting time)
- Natural error handling and retries through SQS
- Scales independently from request acceptance
Cons:
- Requires clients to implement callback endpoints
- More complex error scenarios (network failures during callbacks)
- Higher implementation complexity
- Delayed response delivery
4. Server-Sent Events (SSE)
Server-Sent Events represent a middle ground between the complexity of WebSocket and the limitations of traditional REST APIs. SSE establishes a one-way communication channel where the server can send multiple messages to the client over a single HTTP connection, but the client cannot send messages back through the same channel.
For LLM applications, SSE is particularly useful when you need to stream responses to the client but don’t require the bidirectional communication that WebSocket provides. The pattern works by keeping an HTTP connection open and sending specially formatted messages that browsers can automatically parse and handle. This makes it excellent for scenarios like live dashboards, progress updates, or streaming LLM responses where user interaction during the stream isn’t necessary.
SSE has the advantage of being built on standard HTTP protocols, making it simpler to implement than WebSocket while still providing real-time capabilities. It also includes built-in reconnection handling, so if the connection drops, browsers will automatically attempt to reconnect and resume receiving events.
Best for: One-way streaming, progress updates, live dashboards
func handleSSERequest(ctx context.Context, event events.APIGatewayProxyRequest) (events.APIGatewayProxyResponse, error) {
    headers := map[string]string{
        "Content-Type":  "text/event-stream",
        "Cache-Control": "no-cache",
        "Connection":    "keep-alive",
    }
    return events.APIGatewayProxyResponse{
        StatusCode: 200,
        Headers:    headers,
        // Simplified: with a standard buffered integration the body is returned in
        // one piece; true incremental delivery needs Lambda response streaming.
        Body: streamLLMResponse(event.Body),
    }, nil
}
What this code does: This creates an HTTP connection that stays open and continuously sends data to the client. Unlike WebSocket, it’s one-way communication from server to client. The special headers tell the browser this is a streaming connection. The LLM response is streamed chunk by chunk as it’s generated.
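For context, the SSE wire format itself is simple: each message is one or more data: lines terminated by a blank line. A small sketch of how LLM chunks could be framed as SSE events before being written out (the channel-based interface is an assumption for illustration):

import (
    "fmt"
    "strings"
)

// formatSSE frames a single chunk as a Server-Sent Event. Multi-line chunks
// become multiple data: lines; the trailing blank line terminates the event.
func formatSSE(eventType, chunk string) string {
    var b strings.Builder
    if eventType != "" {
        fmt.Fprintf(&b, "event: %s\n", eventType)
    }
    for _, line := range strings.Split(chunk, "\n") {
        fmt.Fprintf(&b, "data: %s\n", line)
    }
    b.WriteString("\n")
    return b.String()
}

// Example: collecting chunks from an LLM stream into an SSE-formatted body.
func buildSSEBody(chunks <-chan string) string {
    var body strings.Builder
    for chunk := range chunks {
        body.WriteString(formatSSE("token", chunk))
    }
    body.WriteString(formatSSE("done", "COMPLETE"))
    return body.String()
}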
Pros:
- Simple client implementation (built into modern browsers)
- Built-in reconnection handling
- HTTP/2 efficient
- Good browser support
Cons:
- One-way communication only
- Limited by Lambda streaming capabilities
- Proper streaming requires Lambda response streaming support (e.g., Lambda function URLs); standard API Gateway integrations buffer the response
5. Event-Driven Architecture
Event-driven architecture represents the most sophisticated and scalable approach to handling LLM requests at enterprise scale. Rather than directly processing requests in the API endpoint, this pattern treats incoming requests as events that trigger a cascade of loosely coupled processing services.
In this model, when a request arrives, it’s immediately converted into an event and published to an event bus (like AWS EventBridge). Various microservices subscribe to these events and handle different aspects of the processing pipeline. For example, one service might handle request validation, another might route requests based on priority or content type, and yet another might perform the actual LLM processing.
This pattern excels in complex scenarios where different types of requests need different processing workflows, where you need to integrate with multiple downstream systems, or where you’re building a platform that needs to be highly extensible. It allows you to add new processing capabilities by simply adding new event subscribers without modifying existing code.
The event-driven approach also naturally supports sophisticated features like request prioritization, content-based routing, audit trails, and complex error handling workflows. However, it comes with increased architectural complexity and is typically only justified for applications that need significant scale or flexibility.
Best for: High-scale applications, complex workflows, microservices
// Main handler publishes events
func handleEventDrivenRequest(ctx context.Context, event events.APIGatewayProxyRequest) (events.APIGatewayProxyResponse, error) {
    requestID := generateRequestID()
    llmEvent := LLMRequestEvent{
        RequestID: requestID,
        UserID:    getUserID(event),
        Prompt:    event.Body,
        Priority:  determinePriority(event), // Based on user tier, prompt complexity, etc.
    }
    // Publish to EventBridge - multiple services can react to this
    publishToEventBridge(llmEvent)
    return successResponse(map[string]string{
        "request_id": requestID,
        "status":     "queued",
    }), nil
}
// Different processors can handle different event types
func processHighPriorityLLM(ctx context.Context, event events.EventBridgeEvent) error {
    var llmEvent LLMRequestEvent
    if err := json.Unmarshal(event.Detail, &llmEvent); err != nil {
        return err
    }
    if llmEvent.Priority > 5 {
        // Route high-priority requests to faster/more expensive LLM instances
        response, err := callLLMWithHighPriority(llmEvent.Prompt)
        if err != nil {
            updateRequestStatus(llmEvent.RequestID, "failed", err.Error())
            return err
        }
        updateRequestStatus(llmEvent.RequestID, "complete", response)
    }
    return nil
}
What this code does: Instead of directly processing requests, this publishes events to EventBridge. Multiple Lambda functions can subscribe to these events and handle them differently based on the event data. For example, high-priority requests might go to faster (more expensive) LLM endpoints, while bulk requests might be batched together. This creates a flexible, decoupled system where you can add new processing logic without changing the main API.
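The publishToEventBridge helper is not shown above. A possible sketch using the AWS SDK for Go v2, where the bus name, source, and detail-type strings are placeholders:

import (
    "context"
    "encoding/json"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/eventbridge"
    "github.com/aws/aws-sdk-go-v2/service/eventbridge/types"
)

// publishToEventBridge serializes the event and puts it on a custom event bus.
func publishToEventBridge(llmEvent LLMRequestEvent) error {
    cfg, err := config.LoadDefaultConfig(context.Background())
    if err != nil {
        return err
    }
    client := eventbridge.NewFromConfig(cfg)

    detail, err := json.Marshal(llmEvent)
    if err != nil {
        return err
    }

    _, err = client.PutEvents(context.Background(), &eventbridge.PutEventsInput{
        Entries: []types.PutEventsRequestEntry{
            {
                EventBusName: aws.String("llm-requests"),        // placeholder bus name
                Source:       aws.String("llm.api"),             // placeholder source
                DetailType:   aws.String("LLMRequestSubmitted"), // rules can match on this
                Detail:       aws.String(string(detail)),
            },
        },
    })
    return err
}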
Performance Optimization Strategies
Batch Processing for Cost Efficiency
func batchProcessLLM(ctx context.Context, event events.SQSEvent) error {
    var requests []LLMRequest
    // SQS can deliver multiple messages in one Lambda invocation
    for _, record := range event.Records {
        var req LLMRequest
        json.Unmarshal([]byte(record.Body), &req)
        requests = append(requests, req)
    }

    // Process multiple requests concurrently within the same Lambda
    semaphore := make(chan struct{}, 5) // Limit to 5 concurrent LLM calls
    var wg sync.WaitGroup

    for _, req := range requests {
        wg.Add(1)
        go func(request LLMRequest) {
            defer wg.Done()
            semaphore <- struct{}{}        // Acquire semaphore
            defer func() { <-semaphore }() // Release semaphore

            response, err := callLLM(request.Prompt)
            updateRequestInDB(request.ID, response, err)
        }(req)
    }

    wg.Wait() // Wait for all requests to complete
    return nil
}
What this optimization does: Instead of processing one request per Lambda invocation, this processes multiple requests concurrently within a single Lambda function. This reduces the total Lambda execution time and cost. The semaphore limits concurrent LLM calls to prevent overwhelming the LLM service or hitting rate limits.
Error Handling Strategies
func processWithRetry(requestID string, prompt string, maxRetries int) error {
    for attempt := 1; attempt <= maxRetries; attempt++ {
        response, err := callLLM(prompt)
        if err == nil {
            updateRequestStatus(requestID, "complete", response)
            return nil
        }

        // Different errors need different handling
        if isRetryableError(err) { // Network timeouts, rate limits, temporary failures
            updateRequestStatus(requestID, "retrying", fmt.Sprintf("Attempt %d/%d", attempt, maxRetries))
            backoffTime := time.Duration(1<<attempt) * time.Second // Exponential backoff: 2s, 4s, 8s, ...
            time.Sleep(backoffTime)
            continue
        }

        // Non-retryable errors: invalid input, authentication failures
        updateRequestStatus(requestID, "failed", err.Error())
        publishErrorEvent(requestID, err) // Alert monitoring systems
        return err
    }

    // All retries exhausted
    updateRequestStatus(requestID, "failed", "Max retries exceeded")
    return errors.New("max retries exceeded")
}
What this error handling does: This implements intelligent retry logic that distinguishes between temporary failures (which should be retried) and permanent failures (which shouldn’t). It uses exponential backoff to avoid overwhelming failing services and updates the request status so clients can track progress.
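The classification done by isRetryableError is the heart of this strategy. A minimal sketch, assuming the LLM client surfaces HTTP status codes through a hypothetical LLMError type (the real error shape depends on the provider SDK you use):

import (
    "errors"
    "net"
)

// LLMError is a hypothetical error type carrying the provider's HTTP status code.
type LLMError struct {
    StatusCode int
    Message    string
}

func (e *LLMError) Error() string { return e.Message }

// isRetryableError distinguishes transient failures (worth retrying) from
// permanent ones (fail fast and report).
func isRetryableError(err error) bool {
    // Network-level timeouts are transient
    var netErr net.Error
    if errors.As(err, &netErr) && netErr.Timeout() {
        return true
    }
    // Rate limits and server-side errors are transient; other 4xx errors are not
    var llmErr *LLMError
    if errors.As(err, &llmErr) {
        return llmErr.StatusCode == 429 || llmErr.StatusCode >= 500
    }
    // Unknown errors: be conservative and do not retry
    return false
}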
My Recommendation: Hybrid Event-Driven Architecture
Each pattern represents a different trade-off between simplicity, performance, scalability, and user experience. Synchronous REST prioritizes simplicity, WebSocket prioritizes real-time interaction, webhooks prioritize reliability and cost-effectiveness, SSE provides a simple streaming solution, and event-driven architecture prioritizes flexibility and scale. Understanding these fundamental differences is crucial for making the right architectural choice for your specific LLM application requirements.
For most production LLM APIs, I recommend a hybrid event-driven architecture that adapts to different request types:
func handleLLMRequest(ctx context.Context, event events.APIGatewayProxyRequest) (events.APIGatewayProxyResponse, error) {
    requestType := determineRequestType(event) // Based on prompt length, user tier, etc.

    switch requestType {
    case "quick": // Short prompts, premium users
        return handleSynchronous(event)
    case "standard": // Most requests
        return handleAsynchronous(event)
    case "streaming": // Interactive applications
        return redirectToWebSocket(event)
    default:
        return handleAsynchronous(event)
    }
}
What this hybrid approach does: This intelligently routes requests based on their characteristics. Quick requests (short prompts, premium users) get immediate synchronous processing. Standard requests go through async processing. Interactive applications use WebSocket streaming. This maximizes both user experience and cost efficiency.
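The routing decision itself can start very simple. A sketch of determineRequestType - the thresholds, the X-Stream header, and the isPremiumUser check are assumptions you would replace with your own business rules:

import "github.com/aws/aws-lambda-go/events"

// determineRequestType picks a processing path from cheap request signals.
func determineRequestType(event events.APIGatewayProxyRequest) string {
    // Explicit client opt-in to streaming (hypothetical header)
    if event.Headers["X-Stream"] == "true" {
        return "streaming"
    }
    // Short prompts from premium users can afford to wait synchronously
    if len(event.Body) < 500 && isPremiumUser(event) {
        return "quick"
    }
    // Everything else goes through the async path
    return "standard"
}

// isPremiumUser is a stand-in for your real entitlement check (hypothetical header).
func isPremiumUser(event events.APIGatewayProxyRequest) bool {
    return event.Headers["X-User-Tier"] == "premium"
}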
Additional Intricacies
That is not all. When building an API for production, we should also look into the following intricacies.
Traffic Management Features:
- Rate limiting: a steady-state limit (e.g., 1000 RPS) with a burst allowance (e.g., 2000 RPS) protects against abuse and controls costs
- Data trace logging: Enables detailed request/response logging for debugging and compliance
- CORS configuration: Enables cross-origin requests for web applications
Validation Benefits (a minimal validation sketch follows this list):
- Early rejection: Invalid requests are rejected at API Gateway level, saving Lambda compute costs
- Security: Prevents injection attacks through input length limits and type validation
- Cost control: Limits maximum tokens to prevent unexpectedly expensive requests
- Quality assurance: Ensures temperature values are within valid ranges for consistent output quality
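To make these points concrete: API Gateway model validation handles the earliest rejection, and the Lambda can still apply complementary checks. A minimal sketch, where the field names, limits, and defaults are illustrative assumptions rather than recommendations:

import (
    "errors"
    "fmt"
)

// LLMRequestInput is an illustrative request shape.
type LLMRequestInput struct {
    Prompt      string  `json:"prompt"`
    MaxTokens   int     `json:"max_tokens"`
    Temperature float64 `json:"temperature"`
}

const (
    maxPromptLength = 8000 // reject oversized prompts early
    maxTokenCap     = 2000 // cost protection: cap output size
)

// validateAndDefault rejects invalid input and fills in sensible defaults.
func validateAndDefault(req *LLMRequestInput) error {
    if req.Prompt == "" {
        return errors.New("prompt is required")
    }
    if len(req.Prompt) > maxPromptLength {
        return fmt.Errorf("prompt exceeds %d characters", maxPromptLength)
    }
    if req.MaxTokens <= 0 {
        req.MaxTokens = 512 // default when the parameter is missing
    }
    if req.MaxTokens > maxTokenCap {
        req.MaxTokens = maxTokenCap // boundary enforcement
    }
    if req.Temperature < 0 || req.Temperature > 2 {
        req.Temperature = 0.7 // keep sampling within a sane range
    }
    return nil
}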
Lifecycle Tracking Benefits:
- Audit trail: Complete record of every request for compliance and debugging
- Real-time status: External systems can query request status during long-running operations
- Error correlation: Failed requests can be traced back to specific inputs and conditions
- Performance analytics: Processing times are tracked for optimization opportunities
Error Handling Strategy:
- Structured logging: Errors include context for debugging while maintaining security
- Environment-aware responses: Production hides internal error details from clients
- Persistent error state: Failed requests are recorded in DynamoDB for analysis
- Metric publishing: Failures trigger CloudWatch metrics for monitoring
Resilience Features:
- Timeout protection: Prevents indefinite hangs on slow API responses
- Automatic retries: Handles transient failures without manual intervention
- Secure credential management: API keys stored in environment variables, not code
Safety Mechanisms:
- Default values: Provides sensible defaults when parameters are missing
- Boundary enforcement: Ensures parameters stay within valid ranges
- Cost protection: Caps maximum tokens to prevent unexpectedly expensive requests
Data Strategy:
- Audit compliance: Maintains records of all API usage for regulatory requirements
- Performance analysis: Prompt lengths and parameters help identify optimization opportunities
- User behavior tracking: Enables usage pattern analysis and capacity planning
- Automatic cleanup: TTL prevents unlimited data growth and controls storage costs
Dynamic Update Pattern (an update sketch follows this list):
- Flexible schema: Allows storing arbitrary additional data without schema changes
- Atomic updates: Ensures consistent state even under concurrent access
- Timestamp tracking: Enables time-series analysis of request processing
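A sketch of such an update against DynamoDB with the AWS SDK for Go v2 - one possible shape of the updateRequestStatus helper referenced earlier, shown here with explicit context and client parameters. The table and attribute names are placeholders; status is aliased because it is a DynamoDB reserved word:

import (
    "context"
    "time"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/service/dynamodb"
    "github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
)

// updateRequestStatus atomically updates one request's status, result, and timestamp.
func updateRequestStatus(ctx context.Context, db *dynamodb.Client, requestID, status, result string) error {
    _, err := db.UpdateItem(ctx, &dynamodb.UpdateItemInput{
        TableName: aws.String("llm-requests"), // placeholder table name
        Key: map[string]types.AttributeValue{
            "request_id": &types.AttributeValueMemberS{Value: requestID},
        },
        UpdateExpression: aws.String("SET #s = :s, #r = :r, updated_at = :t"),
        ExpressionAttributeNames: map[string]string{
            "#s": "status", // reserved word, so aliased
            "#r": "result",
        },
        ExpressionAttributeValues: map[string]types.AttributeValue{
            ":s": &types.AttributeValueMemberS{Value: status},
            ":r": &types.AttributeValueMemberS{Value: result},
            ":t": &types.AttributeValueMemberS{Value: time.Now().UTC().Format(time.RFC3339)},
        },
    })
    return err
}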
Metrics Strategy (a publishing sketch follows this list):
- Multi-dimensional data: Enables filtering and grouping by model, status, etc.
- Real-time monitoring: Immediate visibility into system performance
- Alerting foundation: Metrics feed into CloudWatch alarms for proactive monitoring
- Cost tracking: Token usage metrics enable accurate billing and forecasting
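A sketch of publishing one multi-dimensional metric to CloudWatch with the AWS SDK for Go v2; the namespace, metric name, and dimensions are illustrative:

import (
    "context"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/service/cloudwatch"
    "github.com/aws/aws-sdk-go-v2/service/cloudwatch/types"
)

// publishTokenMetric records token usage, tagged by model and outcome,
// so it can be filtered, grouped, and alarmed on in CloudWatch.
func publishTokenMetric(ctx context.Context, cw *cloudwatch.Client, model, status string, tokens float64) error {
    _, err := cw.PutMetricData(ctx, &cloudwatch.PutMetricDataInput{
        Namespace: aws.String("LLMAPI"), // placeholder namespace
        MetricData: []types.MetricDatum{
            {
                MetricName: aws.String("TokensUsed"),
                Value:      aws.Float64(tokens),
                Unit:       types.StandardUnitCount,
                Dimensions: []types.Dimension{
                    {Name: aws.String("Model"), Value: aws.String(model)},
                    {Name: aws.String("Status"), Value: aws.String(status)},
                },
            },
        },
    })
    return err
}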
Financial Features (a cost-calculation sketch follows this list):
- Accurate pricing: Separate input/output token pricing for precise cost calculation
- Model-aware costs: Different models have different pricing structures
- Real-time calculation: Immediate cost feedback for each request
- Billing foundation: Enables usage-based billing and customer cost allocation
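A cost-calculation sketch. The per-token prices below are made-up placeholders; real pricing varies by provider and model and changes over time:

// modelPricing holds per-1K-token prices in USD. Values are placeholders only.
type modelPricing struct {
    InputPer1K  float64
    OutputPer1K float64
}

var pricingTable = map[string]modelPricing{
    "fast-model":    {InputPer1K: 0.0005, OutputPer1K: 0.0015},
    "quality-model": {InputPer1K: 0.0030, OutputPer1K: 0.0150},
}

// calculateCost returns the estimated cost of a single request in USD.
func calculateCost(model string, inputTokens, outputTokens int) float64 {
    p, ok := pricingTable[model]
    if !ok {
        return 0 // unknown model: surface this via metrics rather than guessing
    }
    return float64(inputTokens)/1000*p.InputPer1K + float64(outputTokens)/1000*p.OutputPer1K
}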
API Design Principles (an example response envelope follows this list):
- Consistent structure: Every response follows the same format for predictable client integration
- Rich headers: Provide debugging and monitoring information without parsing response body
- Metadata separation: Distinguishes between business data and operational metadata
- Success indicators: Clear success/failure indication for robust client error handling
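One way to express these principles in code is a shared response envelope - a fuller variant of the successResponse helper used in the snippets above. The field and header names are an illustrative convention, not a standard:

import (
    "encoding/json"

    "github.com/aws/aws-lambda-go/events"
)

// APIResponse is a consistent envelope separating business data from operational metadata.
type APIResponse struct {
    Success  bool        `json:"success"`
    Data     interface{} `json:"data,omitempty"`
    Error    string      `json:"error,omitempty"`
    Metadata Metadata    `json:"metadata"`
}

type Metadata struct {
    RequestID  string `json:"request_id"`
    Model      string `json:"model,omitempty"`
    LatencyMs  int64  `json:"latency_ms"`
    TokensUsed int    `json:"tokens_used,omitempty"`
}

// successResponse wraps business data in the envelope and mirrors key
// metadata into headers for cheap debugging without parsing the body.
func successResponse(data interface{}, meta Metadata) events.APIGatewayProxyResponse {
    body, _ := json.Marshal(APIResponse{Success: true, Data: data, Metadata: meta})
    return events.APIGatewayProxyResponse{
        StatusCode: 200,
        Headers: map[string]string{
            "Content-Type": "application/json",
            "X-Request-ID": meta.RequestID,
        },
        Body: string(body),
    }
}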
How to Choose the Right Strategy
Decision Framework:
1. Response Time Requirements
- Immediate (< 5 seconds): Synchronous REST
- Progressive feedback needed: WebSocket or SSE
- Can wait (> 30 seconds): Async with callbacks
2. Request Characteristics
- Simple, predictable: Synchronous REST
- Complex, variable duration: Async processing
- Interactive, conversational: WebSocket
3. Scale and Cost Constraints
- Low volume (< 100 req/min): Synchronous REST is fine
- High volume: Event-driven architecture
- Cost-sensitive: Async processing to minimize Lambda costs
4. Integration Requirements
- Browser-based: WebSocket or SSE for streaming, REST for simple
- Server-to-server: Webhooks or polling
- Mobile apps: Push notifications + async processing
Recommended Strategy by Use Case:
| Use Case | Recommended Pattern | Reasoning |
|---|---|---|
| Chatbots/Interactive AI | WebSocket + fallback to async | Real-time feedback crucial for UX |
| Content Generation | Async with webhooks | Long processing times, batch efficiency |
| API Integrations | Hybrid (sync for quick, async for complex) | Flexibility for different integration needs |
| Mobile Apps | Async + push notifications | Handle network interruptions gracefully |
| High-volume SaaS | Event-driven architecture | Scale different components independently |
| Simple MVP | Synchronous REST | Fastest to implement and test |
Implementation Roadmap:
- Phase 1 (MVP): Start with synchronous REST for simplicity
- Phase 2 (Scale): Add async processing for longer requests
- Phase 3 (Optimize): Implement event-driven architecture with intelligent routing
- Phase 4 (Advanced): Add streaming capabilities for interactive use cases
The key is to start simple and evolve your architecture as your requirements become clearer and your scale increases. Most successful LLM APIs begin with REST and gradually incorporate async patterns as they grow.
Building production-ready LLM systems requires navigating dozens of architectural decisions, each with far-reaching implications. At Yantratmika Solutions, we’ve helped organizations avoid the common pitfalls and build systems that scale. The devil, as always, is in the implementation details.