From LLM Prompt to Production (3/9) - Architecture Patterns
This is part of a series of blogs:
- Introduction
- Choosing the Right Technology
- Architecture Patterns
- Multi-Prompt Chaining
- Additional Complexity
- Redundancy & Scaling
- Security & Compliance
- Performance Optimization
- Observability & Monitoring
When deploying LLM prompts as production APIs, choosing the right architecture pattern is crucial for balancing performance, cost, and user experience. Let me break down the main patterns and their trade-offs for AWS serverless deployments.
Let us first understand what each architecture pattern represents conceptually and how they fundamentally differ in their approach to handling client-server communication for LLM requests.
1. Synchronous REST API
The synchronous REST API pattern represents the most traditional approach to web service communication. In this model, when a client sends a request to your LLM API, it establishes a direct, blocking connection where the client waits until the server completely processes the LLM prompt and returns the full response. This follows the classic request-response cycle that most developers are familiar with from standard web APIs.
The fundamental characteristic of this pattern is its simplicity and predictability. The client sends a request, the server processes it entirely, and then sends back a complete response. There’s no intermediate state or partial responses - it’s an all-or-nothing transaction. This makes it excellent for scenarios where LLM responses are quick (under 30 seconds) and the client application can afford to wait for the complete result.
However, this pattern becomes problematic for LLM applications because language models can take significant time to generate responses, especially for complex prompts or longer content generation tasks. During this processing time, the client connection remains open, consuming server resources and potentially timing out if the processing takes too long.
Best for: Simple queries, quick responses, traditional integrations
func handleLLMRequest(ctx context.Context, event events.APIGatewayProxyRequest) (events.APIGatewayProxyResponse, error) {
    response, err := callLLM(event.Body)
    if err != nil {
        return errorResponse(500, "LLM processing failed"), nil
    }
    return successResponse(response), nil
}
What this code does: This is a straightforward Lambda function that receives an HTTP request through API Gateway, makes a direct call to the LLM service, waits for the complete response, and returns it immediately. The entire process is synchronous - the client waits until the LLM finishes processing before getting any response.
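For completeness, calling such an endpoint is just a plain HTTP request. A minimal client sketch - the URL and JSON shape below are placeholders, not the actual service contract:

// Minimal synchronous client sketch. Endpoint and payload are illustrative only.
package main

import (
    "bytes"
    "fmt"
    "io"
    "net/http"
)

func main() {
    body := bytes.NewBufferString(`{"prompt": "Summarize this document in one sentence."}`)
    resp, err := http.Post("https://api.example.com/v1/generate", "application/json", body)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    result, _ := io.ReadAll(resp.Body)
    fmt.Println(string(result)) // Complete LLM response, delivered only once processing finishes
}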
Pros:
- Simple implementation and testing
- Familiar REST semantics for developers
- Built-in API Gateway features (rate limiting, caching, authentication)
- Easy client integration with standard HTTP libraries
Cons:
- Timeout limits: API Gateway caps synchronous integrations at 29 seconds by default, well below Lambda's 15-minute maximum
- Expensive for long-running LLM calls (you pay for idle waiting time)
- Poor user experience for slow responses (users see loading screens)
- No progress feedback during processing
2. WebSocket Real-time API
WebSocket represents a fundamentally different communication paradigm that establishes a persistent, bidirectional connection between client and server. Unlike REST APIs where each request creates a new connection, WebSocket creates a long-lived communication channel that allows both parties to send messages at any time during the connection’s lifetime.
For LLM applications, this pattern shines when you want to provide real-time streaming responses. Instead of waiting for the entire LLM response to be generated before sending anything back, the server can send partial responses as they’re generated - token by token or chunk by chunk. This creates the familiar experience seen in modern AI chat interfaces where users see the response being “typed out” in real-time.
The bidirectional nature of WebSocket also enables advanced features like allowing users to interrupt long-running LLM generations, send follow-up questions without establishing new connections, or implement conversational interfaces where context is maintained throughout the session. This makes it particularly powerful for interactive AI applications where the conversation flow is as important as individual responses.
Best for: Interactive applications, streaming responses, real-time feedback
func handleWebSocketMessage(ctx context.Context, event events.APIGatewayWebsocketProxyRequest) (events.APIGatewayProxyResponse, error) {
    connectionID := event.RequestContext.ConnectionID

    // Stream inside the handler: a Lambda execution environment is frozen as soon
    // as the handler returns, so a detached goroutine would not finish its work.
    stream := callLLMStream(event.Body)
    for chunk := range stream {
        sendToWebSocket(connectionID, chunk)
    }
    sendToWebSocket(connectionID, "COMPLETE")

    return events.APIGatewayProxyResponse{StatusCode: 200}, nil
}
What this code does: API Gateway maintains the persistent WebSocket connection and invokes this handler for each incoming message. The handler calls the LLM in streaming mode and, as tokens are generated, pushes each chunk back to the client through API Gateway’s connection-management API. The client sees the response being built word by word, similar to ChatGPT’s interface. Note that the streaming loop runs inside the handler rather than in a background goroutine, because Lambda freezes the execution environment once the handler returns.
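The sendToWebSocket helper is left undefined above. A possible sketch using the API Gateway Management API from the AWS SDK for Go v2 - the callback endpoint URL is a placeholder you would replace with your deployed WebSocket API’s URL, and the option names assume a recent SDK version:

import (
    "context"
    "log"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/apigatewaymanagementapi"
)

var mgmtClient *apigatewaymanagementapi.Client

func init() {
    cfg, err := config.LoadDefaultConfig(context.Background())
    if err != nil {
        log.Fatalf("load AWS config: %v", err)
    }
    // Callback endpoint of the deployed WebSocket API (placeholder value).
    mgmtClient = apigatewaymanagementapi.NewFromConfig(cfg, func(o *apigatewaymanagementapi.Options) {
        o.BaseEndpoint = aws.String("https://example.execute-api.us-east-1.amazonaws.com/prod")
    })
}

// sendToWebSocket pushes a single chunk to the connected client.
func sendToWebSocket(connectionID, chunk string) {
    _, err := mgmtClient.PostToConnection(context.Background(), &apigatewaymanagementapi.PostToConnectionInput{
        ConnectionId: aws.String(connectionID),
        Data:         []byte(chunk),
    })
    if err != nil {
        log.Printf("post to connection %s failed: %v", connectionID, err)
    }
}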
Pros:
- Real-time streaming capabilities (see responses as they’re generated)
- Better user experience with progressive responses
- Bidirectional communication (client can interrupt or modify requests)
- Lower perceived latency
Cons:
- Complex connection management (handling disconnects, reconnects)
- Higher infrastructure costs (persistent connections)
- WebSocket scaling challenges in serverless environments
- Connection state management complexity
3. Webhook Callbacks (Async Pattern)
The webhook callback pattern fundamentally decouples the request acceptance from the actual processing, creating an asynchronous workflow that can handle long-running operations gracefully. When a client submits a request, the server immediately acknowledges receipt and provides a tracking identifier, but the actual LLM processing happens in the background.
This pattern operates on the principle of “fire and forget” from the client’s perspective, but with a notification system that ensures the client eventually receives the results. Once the background processing completes, the server “calls back” to a URL provided by the client (the webhook) to deliver the results. This callback mechanism is what gives the pattern its name.
The asynchronous nature makes this pattern particularly well-suited for scenarios where LLM processing times are unpredictable or potentially very long. It’s also excellent for batch processing scenarios where multiple requests can be queued and processed efficiently in the background without keeping client connections open. The pattern naturally supports retry mechanisms, error handling, and can scale processing independently from request acceptance.
Best for: Background processing, integrations, fire-and-forget scenarios
// Step 1: Accept request and queue it
func handleAsyncRequest(ctx context.Context, event events.APIGatewayProxyRequest) (events.APIGatewayProxyResponse, error) {
    var req LLMRequest
    if err := json.Unmarshal([]byte(event.Body), &req); err != nil {
        return errorResponse(400, "invalid request body"), nil
    }

    requestID := generateRequestID()
    // Store request details (prompt, callback URL, status) for later processing
    storeRequest(requestID, req.Prompt, req.CallbackURL)
    // Queue the work for background processing; the message body carries the request ID
    sendToSQS(requestID)
    // Immediately return to client
    return successResponse(map[string]string{
        "request_id": requestID,
        "status":     "processing",
    }), nil
}

// Step 2: Background processor handles the actual LLM call
func processLLMRequest(ctx context.Context, event events.SQSEvent) error {
    for _, record := range event.Records {
        requestID := record.Body
        request := getRequestFromDB(requestID)

        // Now do the actual LLM processing
        response, err := callLLM(request.Prompt)
        if err != nil {
            // Notify the client about the error via their callback URL
            notifyCallback(request.CallbackURL, requestID, "error", err.Error())
            continue
        }
        // Send the successful result to the client's callback URL
        notifyCallback(request.CallbackURL, requestID, "complete", response)
    }
    return nil
}
What this code does: This pattern splits the process into two phases. First, the API immediately accepts the request, generates a unique ID, stores the request details in DynamoDB, puts it in an SQS queue, and returns the ID to the client. The client can use this ID to check status or will receive results via a webhook. Separately, background Lambda functions process the SQS queue, perform the actual LLM calls, and send results back to the client’s specified callback URL.
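The notifyCallback helper is where the “callback” in the pattern’s name actually happens. A minimal sketch - the JSON field names are an illustrative convention, not a fixed contract:

import (
    "bytes"
    "encoding/json"
    "fmt"
    "net/http"
    "time"
)

// notifyCallback POSTs the outcome of a request to the client-supplied webhook URL.
func notifyCallback(callbackURL, requestID, status, result string) error {
    payload, err := json.Marshal(map[string]string{
        "request_id": requestID,
        "status":     status,
        "result":     result,
    })
    if err != nil {
        return err
    }

    client := &http.Client{Timeout: 10 * time.Second} // don't hang the processor on a slow webhook
    resp, err := client.Post(callbackURL, "application/json", bytes.NewReader(payload))
    if err != nil {
        return err
    }
    defer resp.Body.Close()

    if resp.StatusCode >= 300 {
        return fmt.Errorf("callback returned status %d", resp.StatusCode)
    }
    return nil
}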
Pros:
- No Lambda timeout concerns (can process for hours)
- Cost-effective for long operations (no idle waiting time)
- Natural error handling and retries through SQS
- Scales independently from request acceptance
Cons:
- Requires clients to implement callback endpoints
- More complex error scenarios (network failures during callbacks)
- Higher implementation complexity
- Delayed response delivery
4. Server-Sent Events (SSE)
Server-Sent Events represent a middle ground between the complexity of WebSocket and the limitations of traditional REST APIs. SSE establishes a one-way communication channel where the server can send multiple messages to the client over a single HTTP connection, but the client cannot send messages back through the same channel.
For LLM applications, SSE is particularly useful when you need to stream responses to the client but don’t require the bidirectional communication that WebSocket provides. The pattern works by keeping an HTTP connection open and sending specially formatted messages that browsers can automatically parse and handle. This makes it excellent for scenarios like live dashboards, progress updates, or streaming LLM responses where user interaction during the stream isn’t necessary.
SSE has the advantage of being built on standard HTTP protocols, making it simpler to implement than WebSocket while still providing real-time capabilities. It also includes built-in reconnection handling, so if the connection drops, browsers will automatically attempt to reconnect and resume receiving events.
Best for: One-way streaming, progress updates, live dashboards
func handleSSERequest(ctx context.Context, event events.APIGatewayProxyRequest) (events.APIGatewayProxyResponse, error) {
    headers := map[string]string{
        "Content-Type":  "text/event-stream",
        "Cache-Control": "no-cache",
        "Connection":    "keep-alive",
    }
    return events.APIGatewayProxyResponse{
        StatusCode: 200,
        Headers:    headers,
        // Simplified: with a standard buffered integration the body is returned in
        // one piece; true incremental delivery needs Lambda response streaming.
        Body: streamLLMResponse(event.Body),
    }, nil
}
What this code does: This creates an HTTP connection that stays open and continuously sends data to the client. Unlike WebSocket, it’s one-way communication from server to client. The special headers tell the browser this is a streaming connection. The LLM response is streamed chunk by chunk as it’s generated.
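For context, the SSE wire format itself is simple: each message is one or more data: lines terminated by a blank line. A small sketch of how LLM chunks could be framed as SSE events before being written out (the channel-based interface is an assumption for illustration):

import (
    "fmt"
    "strings"
)

// formatSSE frames a single chunk as a Server-Sent Event. Multi-line chunks
// become multiple data: lines; the trailing blank line terminates the event.
func formatSSE(eventType, chunk string) string {
    var b strings.Builder
    if eventType != "" {
        fmt.Fprintf(&b, "event: %s\n", eventType)
    }
    for _, line := range strings.Split(chunk, "\n") {
        fmt.Fprintf(&b, "data: %s\n", line)
    }
    b.WriteString("\n")
    return b.String()
}

// Example: collecting chunks from an LLM stream into an SSE-formatted body.
func buildSSEBody(chunks <-chan string) string {
    var body strings.Builder
    for chunk := range chunks {
        body.WriteString(formatSSE("token", chunk))
    }
    body.WriteString(formatSSE("done", "COMPLETE"))
    return body.String()
}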
Pros:
- Simple client implementation (built into modern browsers)
- Built-in reconnection handling
- HTTP/2 efficient
- Good browser support
Cons:
- One-way communication only
- Limited by Lambda streaming capabilities
- Proper streaming requires Lambda response streaming support (e.g., Lambda function URLs); standard API Gateway integrations buffer the response
5. Event-Driven Architecture
Event-driven architecture represents the most sophisticated and scalable approach to handling LLM requests at enterprise scale. Rather than directly processing requests in the API endpoint, this pattern treats incoming requests as events that trigger a cascade of loosely coupled processing services.
In this model, when a request arrives, it’s immediately converted into an event and published to an event bus (like AWS EventBridge). Various microservices subscribe to these events and handle different aspects of the processing pipeline. For example, one service might handle request validation, another might route requests based on priority or content type, and yet another might perform the actual LLM processing.
This pattern excels in complex scenarios where different types of requests need different processing workflows, where you need to integrate with multiple downstream systems, or where you’re building a platform that needs to be highly extensible. It allows you to add new processing capabilities by simply adding new event subscribers without modifying existing code.
The event-driven approach also naturally supports sophisticated features like request prioritization, content-based routing, audit trails, and complex error handling workflows. However, it comes with increased architectural complexity and is typically only justified for applications that need significant scale or flexibility.
Best for: High-scale applications, complex workflows, microservices
// Main handler publishes events
func handleEventDrivenRequest(ctx context.Context, event events.APIGatewayProxyRequest) (events.APIGatewayProxyResponse, error) {
    requestID := generateRequestID()
    llmEvent := LLMRequestEvent{
        RequestID: requestID,
        UserID:    getUserID(event),
        Prompt:    event.Body,
        Priority:  determinePriority(event), // Based on user tier, prompt complexity, etc.
    }
    // Publish to EventBridge - multiple services can react to this
    publishToEventBridge(llmEvent)
    return successResponse(map[string]string{
        "request_id": requestID,
        "status":     "queued",
    }), nil
}
// Different processors can handle different event types
func processHighPriorityLLM(ctx context.Context, event events.EventBridgeEvent) error {
    var llmEvent LLMRequestEvent
    if err := json.Unmarshal(event.Detail, &llmEvent); err != nil {
        return err
    }
    if llmEvent.Priority > 5 {
        // Route high-priority requests to faster/more expensive LLM instances
        response, err := callLLMWithHighPriority(llmEvent.Prompt)
        if err != nil {
            updateRequestStatus(llmEvent.RequestID, "failed", err.Error())
            return err
        }
        updateRequestStatus(llmEvent.RequestID, "complete", response)
    }
    return nil
}
What this code does: Instead of directly processing requests, this publishes events to EventBridge. Multiple Lambda functions can subscribe to these events and handle them differently based on the event data. For example, high-priority requests might go to faster (more expensive) LLM endpoints, while bulk requests might be batched together. This creates a flexible, decoupled system where you can add new processing logic without changing the main API.
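The publishToEventBridge helper is not shown above. A possible sketch using the AWS SDK for Go v2, where the bus name, source, and detail-type strings are placeholders:

import (
    "context"
    "encoding/json"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/eventbridge"
    "github.com/aws/aws-sdk-go-v2/service/eventbridge/types"
)

// publishToEventBridge serializes the event and puts it on a custom event bus.
func publishToEventBridge(llmEvent LLMRequestEvent) error {
    cfg, err := config.LoadDefaultConfig(context.Background())
    if err != nil {
        return err
    }
    client := eventbridge.NewFromConfig(cfg)

    detail, err := json.Marshal(llmEvent)
    if err != nil {
        return err
    }

    _, err = client.PutEvents(context.Background(), &eventbridge.PutEventsInput{
        Entries: []types.PutEventsRequestEntry{
            {
                EventBusName: aws.String("llm-requests"),        // placeholder bus name
                Source:       aws.String("llm.api"),             // placeholder source
                DetailType:   aws.String("LLMRequestSubmitted"), // rules can match on this
                Detail:       aws.String(string(detail)),
            },
        },
    })
    return err
}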
Performance Optimization Strategies
Batch Processing for Cost Efficiency
func batchProcessLLM(ctx context.Context, event events.SQSEvent) error {
    var requests []LLMRequest
    // SQS can deliver multiple messages in one Lambda invocation
    for _, record := range event.Records {
        var req LLMRequest
        json.Unmarshal([]byte(record.Body), &req)
        requests = append(requests, req)
    }

    // Process multiple requests concurrently within the same Lambda
    semaphore := make(chan struct{}, 5) // Limit to 5 concurrent LLM calls
    var wg sync.WaitGroup

    for _, req := range requests {
        wg.Add(1)
        go func(request LLMRequest) {
            defer wg.Done()
            semaphore <- struct{}{}        // Acquire semaphore
            defer func() { <-semaphore }() // Release semaphore

            response, err := callLLM(request.Prompt)
            updateRequestInDB(request.ID, response, err)
        }(req)
    }

    wg.Wait() // Wait for all requests to complete
    return nil
}
What this optimization does: Instead of processing one request per Lambda invocation, this processes multiple requests concurrently within a single Lambda function. This reduces the total Lambda execution time and cost. The semaphore limits concurrent LLM calls to prevent overwhelming the LLM service or hitting rate limits.
Error Handling Strategies
func processWithRetry(requestID string, prompt string, maxRetries int) error {
    for attempt := 1; attempt <= maxRetries; attempt++ {
        response, err := callLLM(prompt)
        if err == nil {
            updateRequestStatus(requestID, "complete", response)
            return nil
        }

        // Different errors need different handling
        if isRetryableError(err) { // Network timeouts, rate limits, temporary failures
            updateRequestStatus(requestID, "retrying", fmt.Sprintf("Attempt %d/%d", attempt, maxRetries))
            backoffTime := time.Duration(1<<attempt) * time.Second // Exponential backoff: 2s, 4s, 8s, ...
            time.Sleep(backoffTime)
            continue
        }

        // Non-retryable errors: invalid input, authentication failures
        updateRequestStatus(requestID, "failed", err.Error())
        publishErrorEvent(requestID, err) // Alert monitoring systems
        return err
    }

    // All retries exhausted
    updateRequestStatus(requestID, "failed", "Max retries exceeded")
    return errors.New("max retries exceeded")
}
What this error handling does: This implements intelligent retry logic that distinguishes between temporary failures (which should be retried) and permanent failures (which shouldn’t). It uses exponential backoff to avoid overwhelming failing services and updates the request status so clients can track progress.
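The classification done by isRetryableError is the heart of this strategy. A minimal sketch, assuming the LLM client surfaces HTTP status codes through a hypothetical LLMError type (the real error shape depends on the provider SDK you use):

import (
    "errors"
    "net"
)

// LLMError is a hypothetical error type carrying the provider's HTTP status code.
type LLMError struct {
    StatusCode int
    Message    string
}

func (e *LLMError) Error() string { return e.Message }

// isRetryableError distinguishes transient failures (worth retrying) from
// permanent ones (fail fast and report).
func isRetryableError(err error) bool {
    // Network-level timeouts are transient
    var netErr net.Error
    if errors.As(err, &netErr) && netErr.Timeout() {
        return true
    }
    // Rate limits and server-side errors are transient; other 4xx errors are not
    var llmErr *LLMError
    if errors.As(err, &llmErr) {
        return llmErr.StatusCode == 429 || llmErr.StatusCode >= 500
    }
    // Unknown errors: be conservative and do not retry
    return false
}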
My Recommendation: Hybrid Event-Driven Architecture
Each pattern represents a different trade-off between simplicity, performance, scalability, and user experience. Synchronous REST prioritizes simplicity, WebSocket prioritizes real-time interaction, webhooks prioritize reliability and cost-effectiveness, SSE provides a simple streaming solution, and event-driven architecture prioritizes flexibility and scale. Understanding these fundamental differences is crucial for making the right architectural choice for your specific LLM application requirements.
For most production LLM APIs, I recommend a hybrid event-driven architecture that adapts to different request types:
func handleLLMRequest(ctx context.Context, event events.APIGatewayProxyRequest) (events.APIGatewayProxyResponse, error) {
    requestType := determineRequestType(event) // Based on prompt length, user tier, etc.

    switch requestType {
    case "quick": // Short prompts, premium users
        return handleSynchronous(event)
    case "standard": // Most requests
        return handleAsynchronous(event)
    case "streaming": // Interactive applications
        return redirectToWebSocket(event)
    default:
        return handleAsynchronous(event)
    }
}
What this hybrid approach does: This intelligently routes requests based on their characteristics. Quick requests (short prompts, premium users) get immediate synchronous processing. Standard requests go through async processing. Interactive applications use WebSocket streaming. This maximizes both user experience and cost efficiency.
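The routing decision itself can start very simple. A sketch of determineRequestType - the thresholds, the X-Stream header, and the isPremiumUser check are assumptions you would replace with your own business rules:

import "github.com/aws/aws-lambda-go/events"

// determineRequestType picks a processing path from cheap request signals.
func determineRequestType(event events.APIGatewayProxyRequest) string {
    // Explicit client opt-in to streaming (hypothetical header)
    if event.Headers["X-Stream"] == "true" {
        return "streaming"
    }
    // Short prompts from premium users can afford to wait synchronously
    if len(event.Body) < 500 && isPremiumUser(event) {
        return "quick"
    }
    // Everything else goes through the async path
    return "standard"
}

// isPremiumUser is a stand-in for your real entitlement check (hypothetical header).
func isPremiumUser(event events.APIGatewayProxyRequest) bool {
    return event.Headers["X-User-Tier"] == "premium"
}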
Additional Intricacies
That is not all. When building an API for production, we should also look into the following intricacies.
Traffic Management Features:
- Rate limiting: a steady-state limit (e.g., 1000 RPS) with a burst allowance (e.g., 2000 RPS) protects against abuse and controls costs
- Data trace logging: Enables detailed request/response logging for debugging and compliance
- CORS configuration: Enables cross-origin requests for web applications
Validation Benefits (a minimal validation sketch follows this list):
- Early rejection: Invalid requests are rejected at API Gateway level, saving Lambda compute costs
- Security: Prevents injection attacks through input length limits and type validation
- Cost control: Limits maximum tokens to prevent unexpectedly expensive requests
- Quality assurance: Ensures temperature values are within valid ranges for consistent output quality
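To make these points concrete: API Gateway model validation handles the earliest rejection, and the Lambda can still apply complementary checks. A minimal sketch, where the field names, limits, and defaults are illustrative assumptions rather than recommendations:

import (
    "errors"
    "fmt"
)

// LLMRequestInput is an illustrative request shape.
type LLMRequestInput struct {
    Prompt      string  `json:"prompt"`
    MaxTokens   int     `json:"max_tokens"`
    Temperature float64 `json:"temperature"`
}

const (
    maxPromptLength = 8000 // reject oversized prompts early
    maxTokenCap     = 2000 // cost protection: cap output size
)

// validateAndDefault rejects invalid input and fills in sensible defaults.
func validateAndDefault(req *LLMRequestInput) error {
    if req.Prompt == "" {
        return errors.New("prompt is required")
    }
    if len(req.Prompt) > maxPromptLength {
        return fmt.Errorf("prompt exceeds %d characters", maxPromptLength)
    }
    if req.MaxTokens <= 0 {
        req.MaxTokens = 512 // default when the parameter is missing
    }
    if req.MaxTokens > maxTokenCap {
        req.MaxTokens = maxTokenCap // boundary enforcement
    }
    if req.Temperature < 0 || req.Temperature > 2 {
        req.Temperature = 0.7 // keep sampling within a sane range
    }
    return nil
}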
Lifecycle Tracking Benefits:
- Audit trail: Complete record of every request for compliance and debugging
- Real-time status: External systems can query request status during long-running operations
- Error correlation: Failed requests can be traced back to specific inputs and conditions
- Performance analytics: Processing times are tracked for optimization opportunities
Error Handling Strategy:
- Structured logging: Errors include context for debugging while maintaining security
- Environment-aware responses: Production hides internal error details from clients
- Persistent error state: Failed requests are recorded in DynamoDB for analysis
- Metric publishing: Failures trigger CloudWatch metrics for monitoring
Resilience Features:
- Timeout protection: Prevents indefinite hangs on slow API responses
- Automatic retries: Handles transient failures without manual intervention
- Secure credential management: API keys stored in environment variables, not code
Safety Mechanisms:
- Default values: Provides sensible defaults when parameters are missing
- Boundary enforcement: Ensures parameters stay within valid ranges
- Cost protection: Caps maximum tokens to prevent unexpectedly expensive requests
Data Strategy:
- Audit compliance: Maintains records of all API usage for regulatory requirements
- Performance analysis: Prompt lengths and parameters help identify optimization opportunities
- User behavior tracking: Enables usage pattern analysis and capacity planning
- Automatic cleanup: TTL prevents unlimited data growth and controls storage costs
Dynamic Update Pattern (an update sketch follows this list):
- Flexible schema: Allows storing arbitrary additional data without schema changes
- Atomic updates: Ensures consistent state even under concurrent access
- Timestamp tracking: Enables time-series analysis of request processing
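A sketch of such an update against DynamoDB with the AWS SDK for Go v2 - one possible shape of the updateRequestStatus helper referenced earlier, shown here with explicit context and client parameters. The table and attribute names are placeholders; status is aliased because it is a DynamoDB reserved word:

import (
    "context"
    "time"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/service/dynamodb"
    "github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
)

// updateRequestStatus atomically updates one request's status, result, and timestamp.
func updateRequestStatus(ctx context.Context, db *dynamodb.Client, requestID, status, result string) error {
    _, err := db.UpdateItem(ctx, &dynamodb.UpdateItemInput{
        TableName: aws.String("llm-requests"), // placeholder table name
        Key: map[string]types.AttributeValue{
            "request_id": &types.AttributeValueMemberS{Value: requestID},
        },
        UpdateExpression: aws.String("SET #s = :s, #r = :r, updated_at = :t"),
        ExpressionAttributeNames: map[string]string{
            "#s": "status", // reserved word, so aliased
            "#r": "result",
        },
        ExpressionAttributeValues: map[string]types.AttributeValue{
            ":s": &types.AttributeValueMemberS{Value: status},
            ":r": &types.AttributeValueMemberS{Value: result},
            ":t": &types.AttributeValueMemberS{Value: time.Now().UTC().Format(time.RFC3339)},
        },
    })
    return err
}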
Metrics Strategy (a publishing sketch follows this list):
- Multi-dimensional data: Enables filtering and grouping by model, status, etc.
- Real-time monitoring: Immediate visibility into system performance
- Alerting foundation: Metrics feed into CloudWatch alarms for proactive monitoring
- Cost tracking: Token usage metrics enable accurate billing and forecasting
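A sketch of publishing one multi-dimensional metric to CloudWatch with the AWS SDK for Go v2; the namespace, metric name, and dimensions are illustrative:

import (
    "context"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/service/cloudwatch"
    "github.com/aws/aws-sdk-go-v2/service/cloudwatch/types"
)

// publishTokenMetric records token usage, tagged by model and outcome,
// so it can be filtered, grouped, and alarmed on in CloudWatch.
func publishTokenMetric(ctx context.Context, cw *cloudwatch.Client, model, status string, tokens float64) error {
    _, err := cw.PutMetricData(ctx, &cloudwatch.PutMetricDataInput{
        Namespace: aws.String("LLMAPI"), // placeholder namespace
        MetricData: []types.MetricDatum{
            {
                MetricName: aws.String("TokensUsed"),
                Value:      aws.Float64(tokens),
                Unit:       types.StandardUnitCount,
                Dimensions: []types.Dimension{
                    {Name: aws.String("Model"), Value: aws.String(model)},
                    {Name: aws.String("Status"), Value: aws.String(status)},
                },
            },
        },
    })
    return err
}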
Financial Features (a cost-calculation sketch follows this list):
- Accurate pricing: Separate input/output token pricing for precise cost calculation
- Model-aware costs: Different models have different pricing structures
- Real-time calculation: Immediate cost feedback for each request
- Billing foundation: Enables usage-based billing and customer cost allocation
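A cost-calculation sketch. The per-token prices below are made-up placeholders; real pricing varies by provider and model and changes over time:

// modelPricing holds per-1K-token prices in USD. Values are placeholders only.
type modelPricing struct {
    InputPer1K  float64
    OutputPer1K float64
}

var pricingTable = map[string]modelPricing{
    "fast-model":    {InputPer1K: 0.0005, OutputPer1K: 0.0015},
    "quality-model": {InputPer1K: 0.0030, OutputPer1K: 0.0150},
}

// calculateCost returns the estimated cost of a single request in USD.
func calculateCost(model string, inputTokens, outputTokens int) float64 {
    p, ok := pricingTable[model]
    if !ok {
        return 0 // unknown model: surface this via metrics rather than guessing
    }
    return float64(inputTokens)/1000*p.InputPer1K + float64(outputTokens)/1000*p.OutputPer1K
}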
API Design Principles (an example response envelope follows this list):
- Consistent structure: Every response follows the same format for predictable client integration
- Rich headers: Provide debugging and monitoring information without parsing response body
- Metadata separation: Distinguishes between business data and operational metadata
- Success indicators: Clear success/failure indication for robust client error handling
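One way to express these principles in code is a shared response envelope - a fuller variant of the successResponse helper used in the snippets above. The field and header names are an illustrative convention, not a standard:

import (
    "encoding/json"

    "github.com/aws/aws-lambda-go/events"
)

// APIResponse is a consistent envelope separating business data from operational metadata.
type APIResponse struct {
    Success  bool        `json:"success"`
    Data     interface{} `json:"data,omitempty"`
    Error    string      `json:"error,omitempty"`
    Metadata Metadata    `json:"metadata"`
}

type Metadata struct {
    RequestID  string `json:"request_id"`
    Model      string `json:"model,omitempty"`
    LatencyMs  int64  `json:"latency_ms"`
    TokensUsed int    `json:"tokens_used,omitempty"`
}

// successResponse wraps business data in the envelope and mirrors key
// metadata into headers for cheap debugging without parsing the body.
func successResponse(data interface{}, meta Metadata) events.APIGatewayProxyResponse {
    body, _ := json.Marshal(APIResponse{Success: true, Data: data, Metadata: meta})
    return events.APIGatewayProxyResponse{
        StatusCode: 200,
        Headers: map[string]string{
            "Content-Type": "application/json",
            "X-Request-ID": meta.RequestID,
        },
        Body: string(body),
    }
}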
How to Choose the Right Strategy
Decision Framework:
1. Response Time Requirements
- Immediate (< 5 seconds): Synchronous REST
- Progressive feedback needed: WebSocket or SSE
- Can wait (> 30 seconds): Async with callbacks
2. Request Characteristics
- Simple, predictable: Synchronous REST
- Complex, variable duration: Async processing
- Interactive, conversational: WebSocket
3. Scale and Cost Constraints
- Low volume (< 100 req/min): Synchronous REST is fine
- High volume: Event-driven architecture
- Cost-sensitive: Async processing to minimize Lambda costs
4. Integration Requirements
- Browser-based: WebSocket or SSE for streaming, REST for simple
- Server-to-server: Webhooks or polling
- Mobile apps: Push notifications + async processing
Recommended Strategy by Use Case:
| Use Case | Recommended Pattern | Reasoning |
|---|---|---|
| Chatbots/Interactive AI | WebSocket + fallback to async | Real-time feedback crucial for UX |
| Content Generation | Async with webhooks | Long processing times, batch efficiency |
| API Integrations | Hybrid (sync for quick, async for complex) | Flexibility for different integration needs |
| Mobile Apps | Async + push notifications | Handle network interruptions gracefully |
| High-volume SaaS | Event-driven architecture | Scale different components independently |
| Simple MVP | Synchronous REST | Fastest to implement and test |
Implementation Roadmap:
- Phase 1 (MVP): Start with synchronous REST for simplicity
- Phase 2 (Scale): Add async processing for longer requests
- Phase 3 (Optimize): Implement event-driven architecture with intelligent routing
- Phase 4 (Advanced): Add streaming capabilities for interactive use cases
The key is to start simple and evolve your architecture as your requirements become clearer and your scale increases. Most successful LLM APIs begin with REST and gradually incorporate async patterns as they grow.
Building production-ready LLM systems requires navigating dozens of architectural decisions, each with far-reaching implications. At Yantratmika Solutions, we’ve helped organizations avoid the common pitfalls and build systems that scale. The devil, as always, is in the implementation details.