From LLM Prompt to Production (6/9) - Redundancy and Scalability
This is part of a series of blogs:
- Introduction
- Choosing the Right Technology
- Architecture Patterns
- Multi-Prompt Chaining
- Additional Complexity
- Redundancy & Scaling
- Security & Compliance
- Performance Optimization
- Observability & Monitoring
In this section, we’ll dive deep into redundancy and scalability patterns for serverless LLM APIs, using AWS Lambda with Go and CDK-managed infrastructure. More importantly, we’ll explore practical solutions to the common pitfalls and provide actionable strategies based on real-world implementations.
Understanding Enterprise Scale Requirements
Enterprise scale isn’t just about handling more requests—it’s about handling unpredictable load patterns, maintaining sub-second response times under stress, and ensuring your API remains available even when dependencies fail.
Consider a financial services company using an LLM for real-time fraud detection. Their requirements might include:
- Processing 50,000+ requests per minute during peak hours
- 99.99% uptime (4.32 minutes of downtime per month)
- Sub-200ms response times even at peak load
- Compliance with SOC 2 and PCI DSS standards
- Global availability across multiple regions
Solution: Multi-Layered Infrastructure Planning
The key to meeting enterprise requirements is planning for failure at every layer. This means designing your infrastructure with redundancy built-in from day one, not bolted on later. Here’s how we approach this systematically:
// infrastructure/stacks/llm-api-stack.ts
import { Stack, Duration } from "aws-cdk-lib";
import { Construct } from "constructs";
import { Function, Runtime, Code } from "aws-cdk-lib/aws-lambda";
import { Queue } from "aws-cdk-lib/aws-sqs";

export class LLMApiStack extends Stack {
  constructor(scope: Construct, id: string, props: LLMApiStackProps) {
    super(scope, id, props);

    // Dead letter queue for failed asynchronous invocations
    const dlq = new Queue(this, "LLMApiDLQ", {
      visibilityTimeout: Duration.seconds(300),
    });

    // Lambda deployment with reserved concurrency
    const apiFunction = new Function(this, "LLMApiFunction", {
      runtime: Runtime.PROVIDED_AL2,
      handler: "bootstrap",
      code: Code.fromAsset("dist/lambda.zip"),
      timeout: Duration.seconds(30),
      memorySize: 1024,
      reservedConcurrentExecutions: 1000, // Prevents the function from consuming all account concurrency
      deadLetterQueue: dlq, // Captures failed invocations for analysis and replay
      environment: {
        REGION: this.region,
        // dynamoTable and redisCluster are created elsewhere in this stack
        DYNAMODB_TABLE: this.dynamoTable.tableName,
        REDIS_CLUSTER_ENDPOINT: this.redisCluster.attrRedisEndpointAddress,
      },
    });
  }
}
This CDK configuration demonstrates several critical enterprise patterns. The reservedConcurrentExecutions setting ensures your LLM API doesn’t consume all available Lambda concurrency in your account, preventing other services from being starved. The dead letter queue captures failed invocations for analysis and potential replay. The environment variables are externalized, making the function configurable across environments without code changes.
The real solution here goes beyond infrastructure code—it requires establishing SLAs for each component and implementing monitoring to track compliance. We recommend setting up CloudWatch dashboards that track not just response times and error rates, but also business metrics like “successful fraud detections per minute” to ensure your scaling efforts align with business objectives.
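As a minimal sketch of emitting such a business metric from the Go handler (assuming the AWS SDK for Go v2; the "LLMApi/Business" namespace and metric name are hypothetical):

// pkg/metrics/business.go (illustrative sketch, not from the original implementation)
package metrics

import (
    "context"
    "time"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/service/cloudwatch"
    "github.com/aws/aws-sdk-go-v2/service/cloudwatch/types"
)

// RecordFraudDetection publishes a single "SuccessfulFraudDetections" data point.
// A CloudWatch dashboard can then graph the SUM over one-minute periods to get
// "successful fraud detections per minute" alongside latency and error rates.
func RecordFraudDetection(ctx context.Context, cw *cloudwatch.Client) error {
    _, err := cw.PutMetricData(ctx, &cloudwatch.PutMetricDataInput{
        Namespace: aws.String("LLMApi/Business"), // hypothetical namespace
        MetricData: []types.MetricDatum{
            {
                MetricName: aws.String("SuccessfulFraudDetections"),
                Timestamp:  aws.Time(time.Now()),
                Value:      aws.Float64(1),
                Unit:       types.StandardUnitCount,
            },
        },
    })
    return err
}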
The Myth of Simple Redundancy
The biggest misconception in scaling APIs is that redundancy equals deploying multiple identical copies. In reality, true redundancy requires addressing cascading failures, data consistency, and graceful degradation. The problem isn’t just having backup systems—it’s ensuring those backups can actually take over seamlessly when needed.
Solution: Intelligent Redundancy with Context Management
The challenge with LLM APIs is that they often need context from previous interactions, user preferences, or conversation history. This creates a paradox: you need state for intelligence, but state destroys scalability. The solution is implementing a tiered context management system that degrades gracefully.
// pkg/context/manager.go
package context

import "time"

// ContextManager layers three storage tiers behind a single lookup path.
// cache.Client and storage.Client are thin interfaces defined elsewhere in the codebase.
type ContextManager struct {
    primary   cache.Client   // Redis cluster - fastest access
    secondary cache.Client   // Backup Redis or Memcached
    fallback  storage.Client // DynamoDB or RDS - persistent but slower
}

func (cm *ContextManager) GetContext(userID string) (*Context, error) {
    // Attempt primary cache (Redis) - sub-millisecond response
    if ctx, err := cm.primary.Get(userID); err == nil {
        return ctx, nil
    }

    // Fallback to secondary cache - still fast, maybe slightly stale
    if ctx, err := cm.secondary.Get(userID); err == nil {
        // Async repair primary cache - don't block the response
        go cm.primary.Set(userID, ctx, time.Hour)
        return ctx, nil
    }

    // Ultimate fallback: reconstruct from persistent storage.
    // This is slower but ensures we never completely lose context.
    return cm.reconstructContextFromStorage(userID)
}
This tiered approach solves multiple problems simultaneously. When the primary
cache fails, users don’t lose their conversation context—they just experience
slightly slower responses. The asynchronous repair pattern
(go cm.primary.Set(...)) ensures the primary cache is restored without
impacting the current request’s response time.
The key insight here is that not all failures require the same response. A Redis cluster outage doesn’t mean your API should fail—it means it should degrade gracefully to a backup system while transparently repairing itself.
Solution: Circuit Breaker Pattern for External Dependencies
When your API depends on external LLM providers (OpenAI, Anthropic, etc.), you need protection against their failures cascading to your users. The circuit breaker pattern is your first line of defense.
// pkg/llm/circuit_breaker.go
package llm

import (
    "errors"
    "sync"
    "time"
)

type CircuitBreaker struct {
    maxFailures int           // How many failures before opening the circuit
    timeout     time.Duration // How long to wait before trying again
    failures    int           // Current failure count
    lastFailure time.Time     // When the last failure occurred
    mutex       sync.Mutex    // Thread-safe access to the counters
}

func (cb *CircuitBreaker) Call(fn func() (interface{}, error)) (interface{}, error) {
    cb.mutex.Lock()
    // Circuit is open - reject requests immediately
    if cb.failures >= cb.maxFailures {
        if time.Since(cb.lastFailure) < cb.timeout {
            cb.mutex.Unlock()
            return nil, errors.New("circuit breaker open - external service unavailable")
        }
        // Half-open state - allow a probe request through to test recovery
        cb.failures = cb.maxFailures - 1
    }
    // Release the lock before the external call so slow requests don't serialize
    cb.mutex.Unlock()

    // Execute the actual call to the external service
    result, err := fn()

    cb.mutex.Lock()
    defer cb.mutex.Unlock()
    if err != nil {
        cb.failures++
        cb.lastFailure = time.Now()
        return nil, err
    }

    // Success - reset failure counter
    cb.failures = 0
    return result, nil
}
This circuit breaker implementation prevents your API from repeatedly calling failed external services, which would waste time and potentially make problems worse. When OpenAI’s API is down, instead of timing out on every request, your circuit breaker opens after a few failures and immediately returns an error.
The real solution extends beyond just implementing circuit breakers—you need a fallback strategy. This might mean using a different LLM provider, serving cached responses, or providing a simplified response that doesn’t require external LLM calls. The circuit breaker gives you the fast failure detection; your business logic determines what to do when it opens.
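As a sketch of that business logic (the Service wrapper, primaryProvider, cachedResponses store, and degradedResponse helper below are hypothetical, not taken from the original implementation), the breaker call can be wrapped so that an open circuit falls back to cached or simplified output:

// Illustrative fallback wrapper around the CircuitBreaker shown above.
func (s *Service) GenerateWithFallback(prompt string) (*Response, error) {
    result, err := s.breaker.Call(func() (interface{}, error) {
        return s.primaryProvider.GenerateResponse(prompt, RequestOptions{})
    })
    if err == nil {
        return result.(*Response), nil
    }

    // Circuit open or call failed: serve a cached answer for this prompt if we have one
    if cached, ok := s.cachedResponses.Get(prompt); ok {
        return cached, nil
    }

    // Last resort: a simplified response that needs no external LLM call
    return s.degradedResponse(prompt), nil
}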
Database Challenges and Practical Solutions
Databases become the bottleneck in most scaled LLM APIs because they handle both conversation context and analytics. The traditional approach of “just add read replicas” creates new problems around data consistency and connection management.
Solution: Smart Read Distribution with Consistency Guarantees
The challenge with read replicas is lag—your replica might be several seconds behind the primary, which means users could lose recent conversation context. The solution is implementing intelligent read routing that considers both performance and consistency requirements.
// pkg/database/consistent_reader.go
package database

import (
    "database/sql"
    "time"
)

type ConsistentReader struct {
    primary    *sql.DB
    replicas   []*sql.DB
    lagTracker *ReplicaLagTracker  // Monitors replica lag in real time (defined elsewhere)
    writeLog   *RecentWriteTracker // Tracks recent writes by user/session
}

func (cr *ConsistentReader) ReadWithConsistency(query string, userID string, maxLag time.Duration) (*sql.Rows, error) {
    // If this user has recent writes, always use the primary to ensure read-after-write consistency
    if cr.writeLog.HasRecentWrites(userID, time.Second*5) {
        return cr.primary.Query(query)
    }

    // Find the replica with the lowest lag that still meets requirements
    bestReplica := cr.findBestReplica(maxLag)
    if bestReplica != nil {
        return bestReplica.Query(query)
    }

    // No replica meets requirements - fall back to the primary
    return cr.primary.Query(query)
}

func (cr *ConsistentReader) findBestReplica(maxLag time.Duration) *sql.DB {
    var bestReplica *sql.DB
    bestLag := time.Hour // Start with an impossibly high lag
    for _, replica := range cr.replicas {
        currentLag := cr.lagTracker.GetCurrentLag(replica)
        if currentLag <= maxLag && currentLag < bestLag {
            bestReplica = replica
            bestLag = currentLag
        }
    }
    return bestReplica
}
This solution solves the fundamental read replica problem by tracking both replica lag and recent write activity per user. When a user adds a message to their conversation, we know to read their subsequent requests from the primary database for the next few seconds. For other users, we can safely use replicas, distributing load while maintaining consistency guarantees.
The RecentWriteTracker component (implementation not shown for brevity)
maintains a time-based cache of recent writes per user. This is typically
implemented using Redis with expiring keys, allowing fast lookups without adding
significant overhead.
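A minimal sketch of such a tracker, using a small key-value interface in place of a concrete Redis client (the interface, key prefix, and method names are illustrative assumptions, not the original implementation):

// pkg/database/recent_write_tracker.go (illustrative sketch)
package database

import "time"

// expiringKV stands in for any Redis-style client that supports SET-with-TTL and EXISTS.
type expiringKV interface {
    SetWithTTL(key, value string, ttl time.Duration) error
    Exists(key string) (bool, error)
}

// RecentWriteTracker marks users who wrote recently with a key that expires on
// its own, so their reads fall back to replicas automatically after the window.
type RecentWriteTracker struct {
    kv expiringKV
}

func (t *RecentWriteTracker) RecordWrite(userID string, window time.Duration) error {
    // The TTL defines the read-after-write window; overwriting extends it.
    return t.kv.SetWithTTL("recent-write:"+userID, "1", window)
}

func (t *RecentWriteTracker) HasRecentWrites(userID string, window time.Duration) bool {
    // An expired or missing key means replicas are safe for this user.
    // (The window itself is enforced by the TTL set in RecordWrite.)
    ok, err := t.kv.Exists("recent-write:" + userID)
    return err == nil && ok
}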
Solution: Addressing DynamoDB’s Hidden Scaling Issues
DynamoDB appears perfect for serverless applications, but it has subtle scaling challenges that can destroy performance. The most common issue is hot partitions—when your data access patterns concentrate on specific partition keys.
// infrastructure/dynamodb-optimized.ts
import { Table, AttributeType, BillingMode, ProjectionType } from "aws-cdk-lib/aws-dynamodb";

const conversationTable = new Table(this, "ConversationTable", {
  // Primary key design is crucial for avoiding hot partitions
  partitionKey: {
    name: "conversation_id", // Should be well distributed, not sequential
    type: AttributeType.STRING,
  },
  sortKey: {
    name: "timestamp",
    type: AttributeType.NUMBER,
  },
  billingMode: BillingMode.PAY_PER_REQUEST, // On-demand mode handles traffic spikes automatically
  pointInTimeRecovery: true, // Essential for production
});

// GSI to support per-user query patterns without creating hot partitions
conversationTable.addGlobalSecondaryIndex({
  indexName: "user-timestamp-index",
  partitionKey: {
    name: "user_id_shard", // Note: sharded to prevent hot partitions
    type: AttributeType.STRING,
  },
  sortKey: {
    name: "timestamp",
    type: AttributeType.NUMBER,
  },
  projectionType: ProjectionType.ALL,
});
The critical insight here is the user_id_shard field instead of plain
user_id. In your application code, you’d create this by appending a shard
suffix: ${user_id}#${timestamp % 100}. This distributes user queries across
100 partitions, preventing any single user from creating a hot partition.
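A rough sketch of those helpers in the Go data layer (the package, constant, and function names here are illustrative): the write path computes the shard suffix, and a per-user read fans out one query per shard and merges the results by timestamp.

// Illustrative sharding helpers, not from the original implementation.
package storage

import "fmt"

const userShards = 100

// shardedUserKey builds the GSI partition key value written with each item,
// e.g. "user-123#42" when timestamp%100 == 42.
func shardedUserKey(userID string, timestamp int64) string {
    return fmt.Sprintf("%s#%d", userID, timestamp%userShards)
}

// allShardKeys returns every shard value for a user. A per-user read fans out
// one Query per shard (ideally in parallel) and merges the results by timestamp.
func allShardKeys(userID string) []string {
    keys := make([]string, 0, userShards)
    for i := 0; i < userShards; i++ {
        keys = append(keys, fmt.Sprintf("%s#%d", userID, i))
    }
    return keys
}

The shard count is a trade-off: more shards spread writes further, but each per-user read fans out across more queries, so the count should be sized to the hottest user’s write rate rather than picked arbitrarily.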
The real solution involves designing your data model from the beginning with DynamoDB’s scaling characteristics in mind. This means understanding that DynamoDB distributes data based on partition key hash, not the actual value. Sequential IDs, timestamps, or user IDs can all create hot partitions if not properly sharded.
External API Dependencies: Turning Weakness into Strength
Your LLM API is only as reliable as its slowest dependency. Recent outages at major cloud providers have highlighted how external APIs can become single points of failure. The solution isn’t avoiding dependencies—it’s managing them intelligently.
Solution: Multi-Provider Strategy with Intelligent Failover
Instead of betting everything on a single LLM provider, implement a multi-provider strategy that can seamlessly switch between providers based on availability, performance, and cost.
// pkg/llm/multi_provider.go
package llm

import (
    "fmt"
    "log/slog"
    "sort"
    "time"
)

type LLMProvider interface {
    Name() string // Stable identifier used for metrics and circuit breakers
    GenerateResponse(prompt string, options RequestOptions) (*Response, error)
    HealthCheck() error
    GetLatency() time.Duration // Track average response time
    GetCost() float64          // Track cost per request
}

type MultiProviderLLM struct {
    providers []LLMProvider
    weights   []int                      // Load distribution weights
    circuits  map[string]*CircuitBreaker // Circuit breaker per provider
    metrics   *ProviderMetrics           // Track performance and reliability
}

func (mp *MultiProviderLLM) GenerateResponse(prompt string, options RequestOptions) (*Response, error) {
    // Get providers ordered by current health and performance
    orderedProviders := mp.getOrderedProviders()

    var lastError error
    for i, provider := range orderedProviders {
        circuit := mp.circuits[provider.Name()]

        // Attempt the request through the provider's circuit breaker
        result, err := circuit.Call(func() (interface{}, error) {
            startTime := time.Now()
            response, err := provider.GenerateResponse(prompt, options)
            // Track metrics for future routing decisions
            mp.metrics.RecordRequest(provider.Name(), time.Since(startTime), err == nil)
            return response, err
        })
        if err == nil {
            return result.(*Response), nil
        }

        lastError = err
        slog.Warn("Provider failed, trying next",
            "provider", provider.Name(),
            "error", err,
            "attempt", i+1,
            "total_providers", len(orderedProviders))
    }
    return nil, fmt.Errorf("all providers failed, last error: %v", lastError)
}

func (mp *MultiProviderLLM) getOrderedProviders() []LLMProvider {
    // Sort providers by health score (a combination of success rate, latency, and cost)
    providers := make([]LLMProvider, len(mp.providers))
    copy(providers, mp.providers)
    sort.Slice(providers, func(i, j int) bool {
        scoreI := mp.metrics.GetHealthScore(providers[i].Name())
        scoreJ := mp.metrics.GetHealthScore(providers[j].Name())
        return scoreI > scoreJ // Higher score = better health
    })
    return providers
}
This implementation goes beyond simple failover—it continuously monitors provider performance and adjusts routing accordingly. If OpenAI is responding slowly but Anthropic is fast, the system automatically shifts traffic. The health scoring algorithm considers success rate, latency, and cost, allowing you to optimize for different objectives.
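As a rough illustration of what such a score could look like (the stats struct, weights, and thresholds below are assumptions for the example, not the article’s actual formula):

// Illustrative scoring sketch, extending the pkg/llm listing above.
type providerStats struct {
    SuccessRate    float64       // 0.0 - 1.0 over a recent window
    AvgLatency     time.Duration // recent average response time
    CostPerRequest float64       // recent average cost in USD
}

func healthScore(s providerStats, costBudget float64) float64 {
    score := s.SuccessRate * 100 // success rate dominates (0-100)

    // Latency penalty, capped at 30 points once averages pass ~2s
    latencyPenalty := 30 * (s.AvgLatency.Seconds() / 2.0)
    if latencyPenalty > 30 {
        latencyPenalty = 30
    }

    // Cost penalty, capped at 10 points once the per-request budget is exceeded
    costPenalty := 10 * (s.CostPerRequest / costBudget)
    if costPenalty > 10 {
        costPenalty = 10
    }

    score -= latencyPenalty + costPenalty
    if score < 0 {
        score = 0
    }
    return score
}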
The key innovation here is treating external API dependencies not as single points of failure, but as a pool of interchangeable resources. This requires careful prompt standardization and response format normalization across providers, but the reliability gains are substantial.
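A thin per-vendor adapter is one way to do that normalization. In this sketch the vendorClient and vendorCompletion types stand in for a real provider SDK (which will differ in practice), and only the translation point of the LLMProvider interface is shown; the remaining methods are omitted.

// Illustrative normalization adapter, extending the pkg/llm listing above.
type vendorCompletion struct {
    OutputText string
}

type vendorClient interface {
    Complete(prompt string, maxTokens int) (*vendorCompletion, error)
}

type vendorAdapter struct {
    name   string
    client vendorClient
}

func (a *vendorAdapter) Name() string { return a.name }

func (a *vendorAdapter) GenerateResponse(prompt string, options RequestOptions) (*Response, error) {
    raw, err := a.client.Complete(prompt, options.MaxTokens)
    if err != nil {
        return nil, err
    }
    // Map the vendor-specific payload onto the provider-agnostic Response
    // that MultiProviderLLM and the health checks expect.
    return &Response{Text: raw.OutputText}, nil
}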
Solving the Stateless Context Paradox
LLM APIs inherently need context, but Lambda functions are stateless. This creates a fundamental tension that requires creative solutions. The traditional approach of storing everything in external systems creates latency and complexity.
Solution: Hybrid Context Architecture
The solution is implementing a hybrid approach that keeps frequently accessed context in memory while persisting long-term context externally.
// pkg/context/hybrid_manager.go
package context

import (
    "encoding/json"
    "sync"
    "time"

    "github.com/aws/aws-sdk-go-v2/service/dynamodb"
    "github.com/go-redis/redis/v7" // v7-style client; newer versions also take a context
)

type HybridContextManager struct {
    inMemoryCache    map[string]*ContextEntry // Hot context for active sessions
    distributedCache *redis.Client            // Warm context for recent sessions
    persistentStore  *dynamodb.Client         // Cold context for historical data
    maxMemoryEntries int                      // Prevent memory bloat
    entryTTL         time.Duration            // How long to keep context in memory
    mutex            sync.RWMutex             // Thread-safe access
}

type ContextEntry struct {
    Context     *ConversationContext
    LastAccess  time.Time
    AccessCount int
}

func (hcm *HybridContextManager) GetContext(sessionID string) (*ConversationContext, error) {
    // Check the in-memory cache first (fastest). A full lock is taken because
    // the lookup also updates the entry's LRU metadata.
    hcm.mutex.Lock()
    if entry, exists := hcm.inMemoryCache[sessionID]; exists {
        entry.LastAccess = time.Now()
        entry.AccessCount++
        hcm.mutex.Unlock()
        return entry.Context, nil
    }
    hcm.mutex.Unlock()

    // Check the distributed cache (fast)
    if contextData, err := hcm.distributedCache.Get(sessionID).Result(); err == nil {
        var context ConversationContext
        if err := json.Unmarshal([]byte(contextData), &context); err == nil {
            // Promote to the in-memory cache for future requests
            hcm.promoteToMemory(sessionID, &context)
            return &context, nil
        }
    }

    // Fall back to the persistent store (slower but reliable)
    return hcm.loadFromPersistentStore(sessionID)
}

func (hcm *HybridContextManager) promoteToMemory(sessionID string, context *ConversationContext) {
    hcm.mutex.Lock()
    defer hcm.mutex.Unlock()

    // Implement LRU eviction if the memory cache is full
    if len(hcm.inMemoryCache) >= hcm.maxMemoryEntries {
        hcm.evictLeastRecentlyUsed()
    }

    hcm.inMemoryCache[sessionID] = &ContextEntry{
        Context:     context,
        LastAccess:  time.Now(),
        AccessCount: 1,
    }
}
This hybrid approach solves the stateless context problem by creating a performance hierarchy. Frequently accessed conversations stay in Lambda’s memory between invocations (when possible), recent conversations are cached in Redis, and historical conversations are stored in DynamoDB.
The promoteToMemory function implements an intelligent caching strategy—active
conversations get the fastest access, while inactive ones gracefully degrade to
slower but more cost-effective storage tiers.
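The evictLeastRecentlyUsed helper is not shown in the listing above; a simple version consistent with it, where callers already hold the write lock and a linear scan picks the victim, might look like this:

// evictLeastRecentlyUsed drops the entry with the oldest LastAccess time.
// Callers must already hold hcm.mutex for writing, as promoteToMemory does.
// A linear scan is fine at the cache sizes a single Lambda instance keeps;
// a heap or container/list based LRU would be the next step if maxMemoryEntries grows large.
func (hcm *HybridContextManager) evictLeastRecentlyUsed() {
    var oldestID string
    var oldestAccess time.Time
    for sessionID, entry := range hcm.inMemoryCache {
        if oldestID == "" || entry.LastAccess.Before(oldestAccess) {
            oldestID = sessionID
            oldestAccess = entry.LastAccess
        }
    }
    if oldestID != "" {
        delete(hcm.inMemoryCache, oldestID)
    }
}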
Taming Concurrency: Lambda’s Double-Edged Sword
Lambda’s automatic scaling is both a blessing and a curse. While it handles traffic spikes gracefully, it can overwhelm downstream dependencies and create thundering herd problems.
Solution: Intelligent Backpressure Control
The solution is implementing backpressure that protects your dependencies while maintaining good user experience.
// pkg/throttle/intelligent_backpressure.go
package throttle

import (
    "sync"
    "time"
)

type BackpressureController struct {
    semaphore          chan struct{}      // Controls max concurrent requests
    queue              chan Request       // Buffer for temporary overload
    metrics            *CloudWatchMetrics // Monitors performance
    adaptiveLimit      int                // Dynamic concurrency limit
    responseTimeTarget time.Duration      // Target response time
    lastAdjustment     time.Time          // When the limit was last changed
    mutex              sync.Mutex         // Guards the adaptive fields and semaphore swaps
}

func NewBackpressureController(initialLimit int, queueSize int) *BackpressureController {
    return &BackpressureController{
        semaphore:          make(chan struct{}, initialLimit),
        queue:              make(chan Request, queueSize),
        metrics:            NewCloudWatchMetrics(),
        adaptiveLimit:      initialLimit,
        responseTimeTarget: 200 * time.Millisecond,
    }
}

func (bp *BackpressureController) HandleRequest(req Request) (*Response, error) {
    // Capture the current semaphore so the slot is released to the same
    // channel even if adjustSemaphore swaps in a new one mid-request.
    bp.mutex.Lock()
    sem := bp.semaphore
    bp.mutex.Unlock()

    // Check whether we should accept this request
    select {
    case sem <- struct{}{}: // Acquired a semaphore slot
        defer func() { <-sem }() // Release it when done

        startTime := time.Now()
        response, err := bp.processRequest(req)
        processingTime := time.Since(startTime)

        // Adapt the concurrency limit based on observed performance
        bp.adaptConcurrencyLimit(processingTime)
        bp.metrics.RecordRequest(processingTime, err == nil)
        return response, err

    case <-time.After(100 * time.Millisecond): // Quick timeout
        // System overloaded - try to queue the request
        select {
        case bp.queue <- req:
            bp.metrics.IncrementQueuedRequests()
            return bp.processQueuedRequest(req)
        default:
            // Queue full - reject with a helpful error
            bp.metrics.IncrementRejectedRequests()
            return nil, &OverloadError{
                Message:    "System temporarily overloaded, please retry with exponential backoff",
                RetryAfter: time.Second * 2,
            }
        }
    }
}

func (bp *BackpressureController) adaptConcurrencyLimit(responseTime time.Duration) {
    bp.mutex.Lock()
    defer bp.mutex.Unlock()

    // Only adjust limits every 30 seconds to avoid oscillation
    if time.Since(bp.lastAdjustment) < 30*time.Second {
        return
    }

    if responseTime > bp.responseTimeTarget*2 {
        // Response times too high - reduce concurrency
        if bp.adaptiveLimit > 10 {
            bp.adaptiveLimit = int(float64(bp.adaptiveLimit) * 0.8)
            bp.adjustSemaphore(bp.adaptiveLimit)
        }
    } else if responseTime < bp.responseTimeTarget {
        // Response times good - concurrency can grow
        if bp.adaptiveLimit < 1000 {
            bp.adaptiveLimit = int(float64(bp.adaptiveLimit) * 1.2)
            bp.adjustSemaphore(bp.adaptiveLimit)
        }
    }
    bp.lastAdjustment = time.Now()
}
This intelligent backpressure system solves multiple problems. It prevents your Lambda functions from overwhelming databases or external APIs, provides graceful degradation during traffic spikes, and automatically adapts to changing conditions.
The key innovation is the adaptive concurrency limit that adjusts based on actual response times rather than just request volume. If your database starts slowing down, the system automatically reduces concurrency to prevent cascading failures.
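The adjustSemaphore helper is left out of the listing above; since Go channels cannot be resized, one simple approach, sketched here with its trade-off noted, is to swap in a freshly sized channel:

// adjustSemaphore swaps in a new buffered channel sized to the new limit.
// This is an illustrative sketch: because in-flight requests release into the
// channel they acquired from (see HandleRequest), the effective limit can
// briefly overshoot right after a swap, which is acceptable for coarse adaptation.
// Callers must hold bp.mutex, as adaptConcurrencyLimit does.
func (bp *BackpressureController) adjustSemaphore(newLimit int) {
    bp.semaphore = make(chan struct{}, newLimit)
}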
Balancing Cost and Redundancy Effectively
True redundancy is expensive, but outages are even more expensive. The solution is implementing cost-effective redundancy that provides maximum protection for the investment.
Solution: Tiered Redundancy Strategy
// infrastructure/cost-optimized-redundancy.ts
import { Duration } from "aws-cdk-lib";

interface RedundancyTier {
  region: string;
  lambdaConcurrency: number;
  rdsConfig: {
    instances: number;
    instanceClass: string;
    multiAZ: boolean;
  };
  elasticacheConfig: {
    nodes: number;
    instanceType: string;
  };
}

const productionRedundancy = {
  // Tier 1: Primary region - full performance and redundancy
  primary: {
    region: "us-east-1",
    lambdaConcurrency: 1000,
    rdsConfig: {
      instances: 3, // Primary + 2 read replicas
      instanceClass: "db.r6g.xlarge",
      multiAZ: true, // Automatic failover
    },
    elasticacheConfig: {
      nodes: 3, // Redis cluster with replication
      instanceType: "cache.r6g.large",
    },
  },

  // Tier 2: Secondary region - reduced capacity for cost optimization
  secondary: {
    region: "us-west-2",
    lambdaConcurrency: 200, // 20% of primary capacity
    rdsConfig: {
      instances: 1, // Single instance, can scale up during failover
      instanceClass: "db.r6g.large", // Smaller than primary
      multiAZ: false, // Cost optimization
    },
    elasticacheConfig: {
      nodes: 1, // Single node, acceptable for DR
      instanceType: "cache.t4g.medium", // Small burstable node for standby duty
    },
  },

  // Configuration based on service tier
  enableActiveActive: process.env.TIER === "PREMIUM",
  enableWarmStandby: process.env.TIER !== "BASIC",
};

// Cost optimization: scale the secondary region dynamically during primary-region issues.
// (Sketch only - in real CDK this maps to a ScalableTarget plus a target-tracking
// policy from aws-applicationautoscaling, wired to the specific resources being scaled.)
const secondaryScaling = new ApplicationAutoScaling(this, "SecondaryScaling", {
  scalingPolicy: {
    targetValue: 70, // CPU utilization target
    scaleUpCooldown: Duration.minutes(5),
    scaleDownCooldown: Duration.minutes(15),
  },
});
This tiered approach provides substantial cost savings while maintaining protection against regional failures. The secondary region runs at reduced capacity during normal operations but can scale up automatically during outages.
The key insight is that most redundancy costs come from over-provisioning backup resources that sit idle. This configuration keeps backup resources minimal but ensures they can scale quickly when needed.
Learning from Recent Cloud Outages
The November 2024 AWS and Azure outages taught us valuable lessons about assumptions we make in “cloud-native” architectures. Many organizations discovered their “highly available” systems had subtle single points of failure.
Solution: True Multi-Cloud DNS Strategy
The Route 53 cascade during the AWS outage highlighted that DNS is often an overlooked single point of failure.
// infrastructure/federated-dns.ts
import { Stack } from "aws-cdk-lib";
import { Construct } from "constructs";
import {
  PublicHostedZone,
  CfnHealthCheck,
  RecordSet,
  RecordType,
  RecordTarget,
} from "aws-cdk-lib/aws-route53";
import { LoadBalancerTarget } from "aws-cdk-lib/aws-route53-targets";

export class FederatedDNSStack extends Stack {
  constructor(scope: Construct, id: string) {
    super(scope, id);

    // Primary DNS in AWS Route 53
    const primaryDNS = new PublicHostedZone(this, "PrimaryDNS", {
      zoneName: "api.example.com",
    });

    // Secondary DNS in a different provider (e.g., Cloudflare via external config).
    // This requires manual setup in Cloudflare but provides true independence.

    // Health check-based failover, evaluated from multiple checker regions
    const healthCheck = new CfnHealthCheck(this, "APIHealthCheck", {
      healthCheckConfig: {
        type: "HTTPS",
        resourcePath: "/health",
        fullyQualifiedDomainName: "primary-api-lb.us-east-1.elb.amazonaws.com",
        requestInterval: 30,
        failureThreshold: 3,
        // Multiple health check regions so a single-region issue cannot cause false positives
        regions: ["us-east-1", "us-west-2", "eu-west-1"],
      },
    });

    // Primary weighted record (primaryALB and secondaryALB are defined elsewhere).
    // Associate healthCheck with this record so Route 53 withdraws it when checks fail.
    new RecordSet(this, "PrimaryRecord", {
      zone: primaryDNS,
      recordName: "api",
      recordType: RecordType.A,
      target: RecordTarget.fromAlias(new LoadBalancerTarget(primaryALB)),
      setIdentifier: "primary-us-east-1",
      weight: 100,
    });

    // Backup record: weight 0 means it only receives traffic once the
    // primary record is pulled from rotation by its failing health check
    new RecordSet(this, "BackupRecord", {
      zone: primaryDNS,
      recordName: "api",
      recordType: RecordType.A,
      target: RecordTarget.fromAlias(new LoadBalancerTarget(secondaryALB)),
      setIdentifier: "backup-us-west-2",
      weight: 0,
    });
  }
}
This DNS configuration solves the single provider problem by implementing health check-based failover within AWS, while also providing instructions for setting up secondary DNS with a different provider entirely. The multiple health check regions ensure that a regional AWS issue doesn’t cause false positives in health monitoring.
Solution: Container Registry Independence
Many organizations discovered their “serverless” Lambda functions had dependencies on centralized container registries. When these failed, deployments became impossible.
// pkg/deployment/registry_strategy.go
package deployment

import (
    "errors"
    "fmt"
    "log/slog"
    "sort"
)

type MultiRegistryDeployment struct {
    registries []RegistryConfig
    localCache string // Path to locally cached images
    verifier   *ImageVerifier
}

type RegistryConfig struct {
    URL          string
    Priority     int // Lower value = higher priority
    HealthCheck  func() error
    AuthProvider AuthProvider
}

func (mrd *MultiRegistryDeployment) DeployFunction(imageTag string) error {
    // Try registries in priority order
    for _, registry := range mrd.getAvailableRegistries() {
        imagePath := fmt.Sprintf("%s:%s", registry.URL, imageTag)

        // Check whether the image exists in this registry
        if mrd.imageExists(imagePath) {
            slog.Info("Deploying from registry", "registry", registry.URL)
            return mrd.deployWithImage(imagePath)
        }
    }

    // Fallback: try to use a locally cached image
    if mrd.localCacheHasImage(imageTag) {
        slog.Warn("All registries unavailable, using cached image")
        return mrd.deployFromCache(imageTag)
    }

    return errors.New("no available container image found in any registry")
}

func (mrd *MultiRegistryDeployment) getAvailableRegistries() []RegistryConfig {
    var available []RegistryConfig
    for _, registry := range mrd.registries {
        if err := registry.HealthCheck(); err == nil {
            available = append(available, registry)
        } else {
            slog.Warn("Registry unavailable", "url", registry.URL, "error", err)
        }
    }

    // Sort by priority
    sort.Slice(available, func(i, j int) bool {
        return available[i].Priority < available[j].Priority
    })
    return available
}
This multi-registry strategy ensures that registry outages don’t prevent deployments. The system maintains container images in multiple registries (AWS ECR, Docker Hub, GitHub Container Registry) and can fall back to locally cached images in extreme cases.
Implementing Seamless Switching: Beyond Health Checks
True seamless switching requires more than simple health checks—it needs intelligent monitoring that can detect degraded performance before complete failure occurs.
Solution: Comprehensive Health Assessment
// pkg/health/intelligent_health.go
package health

import (
    "errors"
    "sync"
    "time"
)

type HealthChecker struct {
    checks                map[string]HealthCheck
    cache                 map[string]HealthResult
    degradationThresholds map[string]float64
    mutex                 sync.RWMutex
    notifier              *AlertNotifier
}

type HealthCheck struct {
    Check            func() HealthResult
    Interval         time.Duration
    Timeout          time.Duration
    CriticalityLevel int // 1=critical, 2=important, 3=nice-to-have
}

type HealthResult struct {
    Healthy      bool
    Score        float64 // 0-100, allows for degraded states
    ResponseTime time.Duration
    Error        error
    Timestamp    time.Time
    Details      map[string]interface{}
}

// StartMonitoring launches one monitoring goroutine per registered check.
// All checks should be registered before this is called.
func (hc *HealthChecker) StartMonitoring() {
    for name, check := range hc.checks {
        go hc.monitorComponent(name, check)
    }
}

func (hc *HealthChecker) monitorComponent(name string, check HealthCheck) {
    ticker := time.NewTicker(check.Interval)
    defer ticker.Stop()

    for range ticker.C {
        // Run the health check with a timeout. The buffered channel lets the
        // check goroutine finish and exit even if we stop waiting for it.
        resultChan := make(chan HealthResult, 1)
        go func() {
            resultChan <- check.Check()
        }()

        select {
        case result := <-resultChan:
            hc.processHealthResult(name, result, check.CriticalityLevel)
        case <-time.After(check.Timeout):
            // Health check timed out
            hc.processHealthResult(name, HealthResult{
                Healthy:      false,
                Score:        0,
                ResponseTime: check.Timeout,
                Error:        errors.New("health check timeout"),
                Timestamp:    time.Now(),
            }, check.CriticalityLevel)
        }
    }
}

func (hc *HealthChecker) processHealthResult(name string, result HealthResult, criticality int) {
    hc.mutex.Lock()
    defer hc.mutex.Unlock()

    previousResult, exists := hc.cache[name]
    hc.cache[name] = result

    // Detect state changes and degradation
    if exists {
        hc.detectStateChanges(name, previousResult, result, criticality)
    }

    // Check for degradation even if technically "healthy"
    if result.Healthy && result.Score < hc.degradationThresholds[name] {
        hc.notifier.SendDegradationAlert(name, result.Score, "Performance degrading")
    }
}

func (hc *HealthChecker) detectStateChanges(name string, previous, current HealthResult, criticality int) {
    // Component recovered
    if !previous.Healthy && current.Healthy {
        hc.notifier.SendRecoveryAlert(name, current.Score)
    }

    // Component failed
    if previous.Healthy && !current.Healthy {
        if criticality <= 2 { // Critical or important component
            hc.notifier.SendCriticalAlert(name, current.Error)
        } else {
            hc.notifier.SendWarningAlert(name, current.Error)
        }
    }

    // Significant performance degradation
    if current.Healthy && previous.Healthy {
        if current.ResponseTime > previous.ResponseTime*2 {
            hc.notifier.SendPerformanceAlert(name, current.ResponseTime, previous.ResponseTime)
        }
    }
}

// Example health check implementation for an external LLM provider
func (hc *HealthChecker) RegisterLLMProviderCheck(providerName string, provider LLMProvider) {
    hc.checks[providerName] = HealthCheck{
        Check: func() HealthResult {
            start := time.Now()

            // Probe the provider with a trivially small prompt
            response, err := provider.GenerateResponse("Say 'OK'", RequestOptions{MaxTokens: 5})
            responseTime := time.Since(start)

            if err != nil {
                return HealthResult{
                    Healthy:      false,
                    Score:        0,
                    ResponseTime: responseTime,
                    Error:        err,
                    Timestamp:    time.Now(),
                }
            }

            // Calculate a health score based on response time and content quality
            score := calculateHealthScore(responseTime, response)
            return HealthResult{
                Healthy:      score > 50, // Threshold for "healthy"
                Score:        score,
                ResponseTime: responseTime,
                Timestamp:    time.Now(),
                Details: map[string]interface{}{
                    "response_length": len(response.Text),
                    "provider_id":     providerName,
                },
            }
        },
        Interval:         30 * time.Second,
        Timeout:          10 * time.Second,
        CriticalityLevel: 1, // Critical component
    }
}
This comprehensive health monitoring system goes far beyond simple “up/down” checks. It monitors performance degradation, tracks recovery patterns, and provides early warning of issues before they become outages. The scoring system allows for nuanced understanding of component health: a slow but functional LLM provider might score 60/100, indicating it’s working but may need attention.
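The calculateHealthScore helper referenced in the listing is not shown; a plausible shape for it, with latency bands and content checks that are example assumptions rather than the article’s actual scoring rules, might be:

// Illustrative calculateHealthScore sketch, extending the pkg/health listing above.
func calculateHealthScore(responseTime time.Duration, response *Response) float64 {
    score := 100.0

    // Penalize latency in bands: fast probes keep the full score,
    // slow-but-working providers land in the 40-70 range.
    switch {
    case responseTime > 5*time.Second:
        score -= 60
    case responseTime > 2*time.Second:
        score -= 40
    case responseTime > 500*time.Millisecond:
        score -= 15
    }

    // Penalize empty probe responses, which suggest a degraded provider
    if len(response.Text) == 0 {
        score -= 30
    }

    if score < 0 {
        return 0
    }
    return score
}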
Conclusion: Building Production-Ready LLM APIs
Converting an LLM prompt to a production-ready, enterprise-scale API involves challenges that extend far beyond the initial implementation. Each layer—from infrastructure to application logic to monitoring—requires careful consideration of failure modes and recovery strategies.
The solutions we’ve explored address real-world problems encountered in production environments:
- Intelligent redundancy that goes beyond simple replication to provide graceful degradation
- Multi-provider strategies that turn external dependencies from weaknesses into strengths
- Hybrid context management that solves the stateless/context paradox efficiently
- Adaptive concurrency control that protects downstream systems while maintaining performance
- Cost-effective redundancy that provides maximum protection for the investment
- Comprehensive monitoring that detects problems before they become outages
The key insight is that production-ready LLM APIs require systems thinking—understanding how components interact under stress and designing for graceful degradation rather than perfect operation.
Recent cloud outages have reinforced these lessons, showing that even “cloud-native” architectures can have subtle single points of failure. The organizations that weathered these outages best were those that had planned for failure at every layer and implemented truly independent redundancy.
Building these systems requires deep expertise in distributed systems, careful planning, and extensive testing under realistic failure conditions. The complexity isn’t just technical—it’s operational, requiring monitoring, alerting, and incident response procedures that match the sophistication of the underlying architecture.
Building production-ready LLM systems requires navigating dozens of architectural decisions, each with far-reaching implications. At Yantratmika Solutions, we’ve helped organizations avoid the common pitfalls and build systems that scale. The devil, as always, is in the implementation details.