✍️ Yantratmika Solutions 📅 2025-11-21 ⏱️ 16 min read

From LLM Prompt to Production (3/9) - Architecture Patterns

This post is part of a series:

  1. Introduction
  2. Choosing the Right Technology
  3. Architecture Patterns
  4. Multi-Prompt Chaining
  5. Additional Complexity
  6. Redundancy & Scaling
  7. Security & Compliance
  8. Performance Optimization
  9. Observability & Monitoring

When deploying LLM prompts as production APIs, choosing the right architecture pattern is crucial for balancing performance, cost, and user experience. Let me break down the main patterns and their trade-offs for AWS serverless deployments.

Let us first understand what each architecture pattern represents conceptually and how they fundamentally differ in their approach to handling client-server communication for LLM requests.

1. Synchronous REST API

The synchronous REST API pattern represents the most traditional approach to web service communication. In this model, when a client sends a request to your LLM API, it establishes a direct, blocking connection where the client waits until the server completely processes the LLM prompt and returns the full response. This follows the classic request-response cycle that most developers are familiar with from standard web APIs.

The fundamental characteristic of this pattern is its simplicity and predictability. The client sends a request, the server processes it entirely, and then sends back a complete response. There’s no intermediate state or partial responses - it’s an all-or-nothing transaction. This makes it excellent for scenarios where LLM responses are quick (under 30 seconds) and the client application can afford to wait for the complete result.

However, this pattern becomes problematic for LLM applications because language models can take significant time to generate responses, especially for complex prompts or longer content generation tasks. During this processing time, the client connection remains open, consuming server resources and potentially timing out if the processing takes too long.

Best for: Simple queries, quick responses, traditional integrations

func handleLLMRequest(ctx context.Context, event events.APIGatewayProxyRequest) (events.APIGatewayProxyResponse, error) {
    response, err := callLLM(event.Body)
    if err != nil {
        return errorResponse(500, "LLM processing failed"), nil
    }
    return successResponse(response), nil
}

What this code does: This is a straightforward Lambda function that receives an HTTP request through API Gateway, makes a direct call to the LLM service, waits for the complete response, and returns it immediately. The entire process is synchronous - the client waits until the LLM finishes processing before getting any response.
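One practical guard worth adding here: API Gateway's default integration timeout is roughly 29 seconds, so the LLM call itself should carry its own deadline rather than relying on Lambda's timeout. Below is a minimal sketch of what callLLM might look like, assuming a hypothetical HTTP-based LLM endpoint; llmEndpoint and the response field name are placeholders, not part of the original example.

import (
    "bytes"
    "context"
    "encoding/json"
    "fmt"
    "net/http"
    "time"
)

// llmEndpoint is an assumed configuration value, e.g. read from an environment variable.
var llmEndpoint = "https://example.com/v1/generate"

// callLLM posts the prompt to the LLM endpoint and enforces its own deadline so the
// handler fails fast instead of running into API Gateway's ~29-second integration timeout.
func callLLM(prompt string) (string, error) {
    ctx, cancel := context.WithTimeout(context.Background(), 25*time.Second)
    defer cancel()

    payload, _ := json.Marshal(map[string]string{"prompt": prompt})
    req, err := http.NewRequestWithContext(ctx, http.MethodPost, llmEndpoint, bytes.NewReader(payload))
    if err != nil {
        return "", err
    }
    req.Header.Set("Content-Type", "application/json")

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return "", fmt.Errorf("LLM call failed: %w", err)
    }
    defer resp.Body.Close()

    var out struct {
        Completion string `json:"completion"` // field name is illustrative
    }
    if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
        return "", err
    }
    return out.Completion, nil
}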

Pros:

  - Simplest pattern to build, test, and integrate with
  - Predictable request-response flow with no intermediate state
  - Works with any HTTP client; no special tooling required

Cons:

  - The client blocks for the full duration of LLM processing
  - API Gateway's ~29-second integration timeout caps how long a prompt can run
  - Open connections consume resources and provide no partial results or progress feedback

2. WebSocket Real-time API

WebSocket represents a fundamentally different communication paradigm that establishes a persistent, bidirectional connection between client and server. Unlike REST APIs where each request creates a new connection, WebSocket creates a long-lived communication channel that allows both parties to send messages at any time during the connection’s lifetime.

For LLM applications, this pattern shines when you want to provide real-time streaming responses. Instead of waiting for the entire LLM response to be generated before sending anything back, the server can send partial responses as they’re generated - token by token or chunk by chunk. This creates the familiar experience seen in modern AI chat interfaces where users see the response being “typed out” in real-time.

The bidirectional nature of WebSocket also enables advanced features like allowing users to interrupt long-running LLM generations, send follow-up questions without establishing new connections, or implement conversational interfaces where context is maintained throughout the session. This makes it particularly powerful for interactive AI applications where the conversation flow is as important as individual responses.

Best for: Interactive applications, streaming responses, real-time feedback

func handleWebSocketMessage(ctx context.Context, event events.APIGatewayWebsocketProxyRequest) (events.APIGatewayProxyResponse, error) {
    connectionID := event.RequestContext.ConnectionID

    // Stream inside the handler: Lambda freezes the execution environment once the
    // handler returns, so a detached goroutine would never finish sending chunks.
    stream := callLLMStream(event.Body)
    for chunk := range stream {
        sendToWebSocket(connectionID, chunk)
    }
    sendToWebSocket(connectionID, "COMPLETE")

    return events.APIGatewayProxyResponse{StatusCode: 200}, nil
}

What this code does: API Gateway keeps the persistent WebSocket connection open and invokes this handler whenever a message arrives. The handler calls the LLM in streaming mode and, as tokens are generated, sends each chunk back to the client through the WebSocket connection before returning; Lambda freezes any work still running once the handler returns, so the streaming loop has to complete inside the handler. The client sees the response being built word-by-word, similar to ChatGPT's interface.
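The sendToWebSocket helper hides the AWS-specific part: pushing data to a connected client goes through the API Gateway Management API, not the original request. A possible sketch with aws-sdk-go-v2 is shown below; wsCallbackURL is an assumed configuration value, and in a real handler the client would be created once rather than per call.

import (
    "context"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/apigatewaymanagementapi"
)

// wsCallbackURL is assumed to hold the WebSocket API's callback endpoint, i.e.
// "https://{api-id}.execute-api.{region}.amazonaws.com/{stage}".
var wsCallbackURL = "https://example.execute-api.us-east-1.amazonaws.com/prod"

// sendToWebSocket pushes one chunk to the connected client via PostToConnection.
func sendToWebSocket(connectionID, chunk string) error {
    cfg, err := config.LoadDefaultConfig(context.TODO())
    if err != nil {
        return err
    }

    client := apigatewaymanagementapi.NewFromConfig(cfg, func(o *apigatewaymanagementapi.Options) {
        o.BaseEndpoint = aws.String(wsCallbackURL)
    })

    _, err = client.PostToConnection(context.TODO(), &apigatewaymanagementapi.PostToConnectionInput{
        ConnectionId: aws.String(connectionID),
        Data:         []byte(chunk),
    })
    return err
}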

Pros:

  - Streams partial responses token by token for a real-time, "typed out" experience
  - Bidirectional: clients can interrupt a generation or send follow-ups over the same connection
  - Connection-level session context makes conversational interfaces natural

Cons:

  - Connection lifecycle (connect, disconnect, stale connection IDs) must be managed explicitly
  - More complex clients and infrastructure than plain HTTP
  - API Gateway WebSocket connections are subject to idle and maximum-duration limits

3. Webhook Callbacks (Async Pattern)

The webhook callback pattern fundamentally decouples the request acceptance from the actual processing, creating an asynchronous workflow that can handle long-running operations gracefully. When a client submits a request, the server immediately acknowledges receipt and provides a tracking identifier, but the actual LLM processing happens in the background.

This pattern operates on the principle of “fire and forget” from the client’s perspective, but with a notification system that ensures the client eventually receives the results. Once the background processing completes, the server “calls back” to a URL provided by the client (the webhook) to deliver the results. This callback mechanism is what gives the pattern its name.

The asynchronous nature makes this pattern particularly well-suited for scenarios where LLM processing times are unpredictable or potentially very long. It’s also excellent for batch processing scenarios where multiple requests can be queued and processed efficiently in the background without keeping client connections open. The pattern naturally supports retry mechanisms, error handling, and can scale processing independently from request acceptance.

Best for: Background processing, integrations, fire-and-forget scenarios

// Step 1: Accept request and queue it
func handleAsyncRequest(ctx context.Context, event events.APIGatewayProxyRequest) (events.APIGatewayProxyResponse, error) {
    requestID := generateRequestID()

    // Parse the payload so we know where to deliver results later
    // (LLMRequest is assumed to carry a CallbackURL alongside the prompt)
    var request LLMRequest
    if err := json.Unmarshal([]byte(event.Body), &request); err != nil {
        return errorResponse(400, "invalid request body"), nil
    }

    // Store request details for later processing
    storeRequest(requestID, event.Body, request.CallbackURL)

    // Queue the work for background processing
    sendToSQS(requestID, event.Body)

    // Immediately return to client
    return successResponse(map[string]string{
        "request_id": requestID,
        "status":     "processing",
    }), nil
}

// Step 2: Background processor handles the actual LLM call
func processLLMRequest(ctx context.Context, event events.SQSEvent) error {
    for _, record := range event.Records {
        requestID := record.Body
        request := getRequestFromDB(requestID)

        // Now do the actual LLM processing
        response, err := callLLM(request.Prompt)
        if err != nil {
            // Notify the client about the error via their callback URL
            notifyCallback(request.CallbackURL, requestID, "error", err.Error())
            continue
        }

        // Send the successful result to the client's callback URL
        notifyCallback(request.CallbackURL, requestID, "complete", response)
    }
    return nil
}

What this code does: This pattern splits the process into two phases. First, the API immediately accepts the request, generates a unique ID, stores the request details in DynamoDB, puts it in an SQS queue, and returns the ID to the client. The client can use this ID to check status or will receive results via a webhook. Separately, background Lambda functions process the SQS queue, perform the actual LLM calls, and send results back to the client’s specified callback URL.
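notifyCallback itself is just an HTTP POST to whatever URL the client registered. A minimal sketch is below; the payload shape is illustrative, and production systems would typically also sign the payload and retry failed deliveries.

import (
    "bytes"
    "encoding/json"
    "fmt"
    "net/http"
    "time"
)

// notifyCallback delivers the final status and result to the client's webhook URL.
func notifyCallback(callbackURL, requestID, status, result string) error {
    payload, _ := json.Marshal(map[string]string{
        "request_id": requestID,
        "status":     status,
        "result":     result,
    })

    client := &http.Client{Timeout: 10 * time.Second}
    resp, err := client.Post(callbackURL, "application/json", bytes.NewReader(payload))
    if err != nil {
        return err
    }
    defer resp.Body.Close()

    if resp.StatusCode >= 300 {
        return fmt.Errorf("callback returned status %d", resp.StatusCode)
    }
    return nil
}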

Pros:

  - No client connection is held open, so long or unpredictable processing times are fine
  - Queueing gives natural support for retries, batching, and scaling workers independently of request acceptance
  - Fits integrations and fire-and-forget workloads well

Cons:

  - Clients must expose and secure a reachable callback endpoint
  - Results arrive with a delay; there is no real-time feedback during processing
  - More moving parts (queue, state store, callback delivery) to operate and monitor

4. Server-Sent Events (SSE)

Server-Sent Events represent a middle ground between the complexity of WebSocket and the limitations of traditional REST APIs. SSE establishes a one-way communication channel where the server can send multiple messages to the client over a single HTTP connection, but the client cannot send messages back through the same channel.

For LLM applications, SSE is particularly useful when you need to stream responses to the client but don’t require the bidirectional communication that WebSocket provides. The pattern works by keeping an HTTP connection open and sending specially formatted messages that browsers can automatically parse and handle. This makes it excellent for scenarios like live dashboards, progress updates, or streaming LLM responses where user interaction during the stream isn’t necessary.

SSE has the advantage of being built on standard HTTP protocols, making it simpler to implement than WebSocket while still providing real-time capabilities. It also includes built-in reconnection handling, so if the connection drops, browsers will automatically attempt to reconnect and resume receiving events.

Best for: One-way streaming, progress updates, live dashboards

func handleSSERequest(ctx context.Context, event events.APIGatewayProxyRequest) (events.APIGatewayProxyResponse, error) {
    headers := map[string]string{
        "Content-Type": "text/event-stream",
        "Cache-Control": "no-cache",
        "Connection": "keep-alive",
    }

    return events.APIGatewayProxyResponse{
        StatusCode: 200,
        Headers: headers,
        Body: streamLLMResponse(event.Body), // Streaming response body
    }, nil
}

What this code does: This creates an HTTP connection that stays open and continuously sends data to the client. Unlike WebSocket, it’s one-way communication from server to client. The special headers tell the browser this is a streaming connection. The LLM response is streamed chunk by chunk as it’s generated.
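For reference, SSE is a plain text protocol: each event is a "data:" line (or lines) terminated by a blank line. A sketch of how streamLLMResponse could assemble such a body is shown below, reusing the callLLMStream channel from the WebSocket example; note that with a standard buffered Lambda proxy integration the body is returned in one piece, so truly incremental delivery generally requires Lambda response streaming (e.g. via a Function URL).

import "strings"

// streamLLMResponse formats LLM output chunks as Server-Sent Events.
func streamLLMResponse(prompt string) string {
    var body strings.Builder

    for chunk := range callLLMStream(prompt) {
        // SSE wire format: "data: <payload>" followed by a blank line per event
        body.WriteString("data: " + chunk + "\n\n")
    }
    body.WriteString("data: [DONE]\n\n") // end-of-stream marker (illustrative convention)

    return body.String()
}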

Pros:

  - Built on plain HTTP, so it is simpler to implement and proxy than WebSocket
  - Browsers reconnect automatically if the stream drops
  - Good fit for one-way streams such as progress updates and live dashboards

Cons:

  - One-way only; clients cannot send messages over the same channel
  - Buffered API Gateway + Lambda integrations return the body all at once, so true streaming needs Lambda response streaming
  - Less suitable when mid-stream user interaction is required

5. Event-Driven Architecture

Event-driven architecture represents the most sophisticated and scalable approach to handling LLM requests at enterprise scale. Rather than directly processing requests in the API endpoint, this pattern treats incoming requests as events that trigger a cascade of loosely coupled processing services.

In this model, when a request arrives, it’s immediately converted into an event and published to an event bus (like AWS EventBridge). Various microservices subscribe to these events and handle different aspects of the processing pipeline. For example, one service might handle request validation, another might route requests based on priority or content type, and yet another might perform the actual LLM processing.

This pattern excels in complex scenarios where different types of requests need different processing workflows, where you need to integrate with multiple downstream systems, or where you’re building a platform that needs to be highly extensible. It allows you to add new processing capabilities by simply adding new event subscribers without modifying existing code.

The event-driven approach also naturally supports sophisticated features like request prioritization, content-based routing, audit trails, and complex error handling workflows. However, it comes with increased architectural complexity and is typically only justified for applications that need significant scale or flexibility.

Best for: High-scale applications, complex workflows, microservices

// Main handler publishes events
func handleEventDrivenRequest(ctx context.Context, event events.APIGatewayProxyRequest) (events.APIGatewayProxyResponse, error) {
    requestID := generateRequestID()

    llmEvent := LLMRequestEvent{
        RequestID: requestID,
        UserID: getUserID(event),
        Prompt: event.Body,
        Priority: determinePriority(event), // Based on user tier, prompt complexity, etc.
    }

    // Publish to EventBridge - multiple services can react to this
    publishToEventBridge(llmEvent)

    return successResponse(map[string]string{
        "request_id": requestID,
        "status":     "queued",
    }), nil
}

// Different processors can handle different event types
func processHighPriorityLLM(ctx context.Context, event events.EventBridgeEvent) error {
    var llmEvent LLMRequestEvent
    if err := json.Unmarshal(event.Detail, &llmEvent); err != nil {
        return err
    }

    if llmEvent.Priority > 5 {
        // Route high-priority requests to faster/more expensive LLM instances
        response, err := callLLMWithHighPriority(llmEvent.Prompt)
        if err != nil {
            updateRequestStatus(llmEvent.RequestID, "failed", err.Error())
            return err
        }
        updateRequestStatus(llmEvent.RequestID, "complete", response)
    }

    return nil
}

What this code does: Instead of directly processing requests, this publishes events to EventBridge. Multiple Lambda functions can subscribe to these events and handle them differently based on the event data. For example, high-priority requests might go to faster (more expensive) LLM endpoints, while bulk requests might be batched together. This creates a flexible, decoupled system where you can add new processing logic without changing the main API.
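A sketch of what publishToEventBridge might look like with aws-sdk-go-v2 follows; the bus name, source, and detail-type strings are placeholders that would have to match your EventBridge rules.

import (
    "context"
    "encoding/json"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/eventbridge"
    "github.com/aws/aws-sdk-go-v2/service/eventbridge/types"
)

// publishToEventBridge serializes the request event and puts it on an event bus,
// where rules fan it out to the interested processors.
func publishToEventBridge(llmEvent LLMRequestEvent) error {
    cfg, err := config.LoadDefaultConfig(context.TODO())
    if err != nil {
        return err
    }
    client := eventbridge.NewFromConfig(cfg)

    detail, err := json.Marshal(llmEvent)
    if err != nil {
        return err
    }

    _, err = client.PutEvents(context.TODO(), &eventbridge.PutEventsInput{
        Entries: []types.PutEventsRequestEntry{{
            EventBusName: aws.String("llm-requests"),    // placeholder bus name
            Source:       aws.String("llm-api"),         // placeholder source
            DetailType:   aws.String("LLMRequestEvent"), // matched by EventBridge rules
            Detail:       aws.String(string(detail)),
        }},
    })
    return err
}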

Performance Optimization Strategies

Batch Processing for Cost Efficiency

func batchProcessLLM(ctx context.Context, event events.SQSEvent) error {
    var requests []LLMRequest

    // SQS can deliver multiple messages in one Lambda invocation
    for _, record := range event.Records {
        var req LLMRequest
        json.Unmarshal([]byte(record.Body), &req)
        requests = append(requests, req)
    }

    // Process multiple requests concurrently within the same Lambda
    semaphore := make(chan struct{}, 5) // Limit to 5 concurrent LLM calls
    var wg sync.WaitGroup

    for _, req := range requests {
        wg.Add(1)
        go func(request LLMRequest) {
            defer wg.Done()
            semaphore <- struct{}{}        // Acquire semaphore
            defer func() { <-semaphore }() // Release semaphore

            response, err := callLLM(request.Prompt)
            updateRequestInDB(request.ID, response, err)
        }(req)
    }

    wg.Wait() // Wait for all requests to complete
    return nil
}

What this optimization does: Instead of processing one request per Lambda invocation, this processes multiple requests concurrently within a single Lambda function. This reduces the total Lambda execution time and cost. The semaphore limits concurrent LLM calls to prevent overwhelming the LLM service or hitting rate limits.

Error Handling Strategies

func processWithRetry(requestID string, prompt string, maxRetries int) error {
    for attempt := 1; attempt <= maxRetries; attempt++ {
        response, err := callLLM(prompt)

        if err == nil {
            updateRequestStatus(requestID, "complete", response)
            return nil
        }

        // Different errors need different handling
        if isRetryableError(err) { // Network timeouts, rate limits, temporary failures
            backoffTime := time.Duration(1<<attempt) * time.Second // Exponential backoff: 2s, 4s, 8s, ...
            time.Sleep(backoffTime)

            updateRequestStatus(requestID, "retrying", fmt.Sprintf("Attempt %d/%d", attempt, maxRetries))
            continue
        }

        // Non-retryable errors: invalid input, authentication failures
        updateRequestStatus(requestID, "failed", err.Error())
        publishErrorEvent(requestID, err) // Alert monitoring systems
        return err
    }

    // All retries exhausted
    updateRequestStatus(requestID, "failed", "Max retries exceeded")
    return errors.New("max retries exceeded")
}

What this error handling does: This implements intelligent retry logic that distinguishes between temporary failures (which should be retried) and permanent failures (which shouldn’t). It uses exponential backoff to avoid overwhelming failing services and updates the request status so clients can track progress.
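Exactly which errors count as retryable depends on the LLM provider, but isRetryableError might look something like the sketch below; the sentinel errors are illustrative stand-ins for whatever your callLLM wrapper actually returns.

import (
    "errors"
    "net"
    "strings"
)

// Illustrative sentinel errors a callLLM wrapper might return.
var (
    ErrRateLimited        = errors.New("llm: rate limited")
    ErrServiceUnavailable = errors.New("llm: service unavailable")
)

// isRetryableError returns true for transient failures worth retrying:
// network timeouts, throttling, and temporary upstream outages.
func isRetryableError(err error) bool {
    if errors.Is(err, ErrRateLimited) || errors.Is(err, ErrServiceUnavailable) {
        return true
    }

    var netErr net.Error
    if errors.As(err, &netErr) && netErr.Timeout() {
        return true
    }

    // Fallback for providers that only expose error strings.
    msg := strings.ToLower(err.Error())
    return strings.Contains(msg, "429") || strings.Contains(msg, "timeout")
}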

My Recommendation: Hybrid Event-Driven Architecture

Each pattern represents a different trade-off between simplicity, performance, scalability, and user experience. Synchronous REST prioritizes simplicity, WebSocket prioritizes real-time interaction, webhooks prioritize reliability and cost-effectiveness, SSE provides a simple streaming solution, and event-driven architecture prioritizes flexibility and scale. Understanding these fundamental differences is crucial for making the right architectural choice for your specific LLM application requirements.

For most production LLM APIs, I recommend a hybrid event-driven architecture that adapts to different request types:

func handleLLMRequest(ctx context.Context, event events.APIGatewayProxyRequest) (events.APIGatewayProxyResponse, error) {
    requestType := determineRequestType(event) // Based on prompt length, user tier, etc.

    switch requestType {
    case "quick":        // Short prompts, premium users
        return handleSynchronous(event)

    case "standard":     // Most requests
        return handleAsynchronous(event)

    case "streaming":    // Interactive applications
        return redirectToWebSocket(event)

    default:
        return handleAsynchronous(event)
    }
}

What this hybrid approach does: This intelligently routes requests based on their characteristics. Quick requests (short prompts, premium users) get immediate synchronous processing. Standard requests go through async processing. Interactive applications use WebSocket streaming. This maximizes both user experience and cost efficiency.
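determineRequestType is doing the interesting work here. One way it might classify requests, assuming a custom X-User-Tier header and a stream query parameter (both placeholders), is sketched below.

import (
    "github.com/aws/aws-lambda-go/events"
)

// determineRequestType buckets a request based on prompt size, the caller's tier,
// and whether the client asked for a streaming response.
func determineRequestType(event events.APIGatewayProxyRequest) string {
    promptLen := len(event.Body)
    userTier := event.Headers["X-User-Tier"] // assumed custom header
    wantsStream := event.QueryStringParameters["stream"] == "true"

    switch {
    case wantsStream:
        return "streaming"
    case promptLen < 500 && userTier == "premium": // short prompt, premium user
        return "quick"
    default:
        return "standard"
    }
}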

Additional Intricacies

That is not all. When building an API, you should also plan for these intricacies:

  - Traffic management features
  - Request validation
  - Request lifecycle tracking
  - Error handling strategy
  - Resilience features
  - Safety mechanisms
  - Data strategy
  - Dynamic update patterns
  - Metrics strategy
  - Financial features
  - API design principles

How to Choose the Right Strategy

Decision Framework:

  1. Response Time Requirements - can the client wait for a complete response, or does it need streaming or asynchronous delivery?
  2. Request Characteristics - prompt length, expected output size, and whether requests are interactive or batch-oriented.
  3. Scale and Cost Constraints - expected request volume, concurrency limits, and how much architectural complexity is justified.
  4. Integration Requirements - whether consumers are browsers, mobile apps, or backend systems that can receive webhooks.

| Use Case | Recommended Pattern | Reasoning |
| --- | --- | --- |
| Chatbots / Interactive AI | WebSocket + fallback to async | Real-time feedback crucial for UX |
| Content Generation | Async with webhooks | Long processing times, batch efficiency |
| API Integrations | Hybrid (sync for quick, async for complex) | Flexibility for different integration needs |
| Mobile Apps | Async + push notifications | Handle network interruptions gracefully |
| High-volume SaaS | Event-driven architecture | Scale different components independently |
| Simple MVP | Synchronous REST | Fastest to implement and test |

Implementation Roadmap:

Phase 1 (MVP): Start with synchronous REST for simplicity
Phase 2 (Scale): Add async processing for longer requests
Phase 3 (Optimize): Implement event-driven architecture with intelligent routing
Phase 4 (Advanced): Add streaming capabilities for interactive use cases

The key is to start simple and evolve your architecture as your requirements become clearer and your scale increases. Most successful LLM APIs begin with REST and gradually incorporate async patterns as they grow.


Building production-ready LLM systems requires navigating dozens of architectural decisions, each with far-reaching implications. At Yantratmika Solutions, we’ve helped organizations avoid the common pitfalls and build systems that scale. The devil, as always, is in the implementation details.