From LLM Prompt to Production (4/9) - Multi-Prompt Design
This is part of a series of blogs:
- Introduction
- Choosing the Right Technology
- Architecture Patterns
- Multi-Prompt Chaining
- Additional Complexity
- Redundancy & Scaling
- Security & Compliance
- Performance Optimization
- Observability & Monitoring
In this deep dive, we’ll explore how AWS Step Functions provides an elegant solution for building production-ready multi-prompt chain APIs using Go, addressing the unique challenges that emerge when LLM workflows need to scale beyond the laboratory.
Understanding Multi-Prompt Chain Architecture
Multi-prompt chains represent a sophisticated approach to LLM applications where complex tasks are decomposed into sequential or parallel subtasks, each handled by specialized prompts. Consider a document analysis system that needs to extract information, classify content, summarize findings, generate recommendations, and format output according to business requirements.
Common Multi-Prompt Scenarios:
- Content Generation Pipelines: Research → Outline → Writing → Editing → Final Review
- Data Analysis Workflows: Data Extraction → Cleaning → Analysis → Visualization → Report Generation
- Conversational Systems: Intent Recognition → Context Analysis → Response Generation → Fact Checking
- Quality Assurance Pipelines: Initial Response → Quality Check → Refinement → Approval
Core Technical Challenges and Solutions
The naive POC approach typically chains these operations sequentially without proper error handling, state management, or observability. Production systems require orchestration that can handle partial failures, maintain context across steps, provide detailed monitoring, and scale efficiently.
Step Functions: Orchestrating Complex Workflows
AWS Step Functions provides the ideal orchestration platform for multi-prompt LLM workflows because it offers:
- Visual Workflow Management: Clear representation of complex business logic
- Error Handling: Built-in retry, catch, and fallback mechanisms
- State Management: Automatic persistence of workflow state
- Integration: Native integration with Lambda, EventBridge, and other AWS services
- Observability: Built-in execution history and monitoring
Context Management and Token Optimization
LLM APIs impose strict token limits, and naive context passing between prompt chain steps can quickly exceed these boundaries while accumulating significant costs. A document analysis workflow might generate 5,000 tokens in the extraction step, 3,000 in classification, and 4,000 in initial summarization - roughly 12,000 tokens of accumulated context, which on a model with a modest context window puts you near the limit before you even reach the recommendation phase.
Implement intelligent context summarization that selectively preserves only the most relevant information for each subsequent step. This requires understanding the semantic dependencies between workflow steps and implementing priority-based context extraction.
func (cm *ContextManager) extractRelevantContext(outputs []StepOutput, maxTokens int, nextStepType string) string {
	// Define what each step type needs from previous steps
	contextPriorities := map[string][]string{
		"classification": {"extracted_entities", "document_type"},
		"summarization":  {"key_findings", "classification_result"},
		"recommendation": {"summary", "business_context"},
	}
	relevantKeys := contextPriorities[nextStepType]

	var contextBuilder strings.Builder
	tokenCount := 0
	for _, output := range outputs {
		for _, key := range relevantKeys {
			value := output.GetField(key)
			if value == "" {
				continue
			}
			// Approximate token count with whitespace-delimited words; swap in a
			// real tokenizer if exact budgeting matters.
			valueTokens := len(strings.Fields(value))
			if tokenCount+valueTokens > maxTokens {
				continue // Skip values that would exceed the budget
			}
			tokenCount += valueTokens
			contextBuilder.WriteString(value + " ")
		}
	}
	return strings.TrimSpace(contextBuilder.String())
}
This approach maintains a mapping of what information each step type requires from previous steps. Instead of passing the entire context forward, it intelligently selects only the relevant portions. The token counting ensures you stay within API limits while preserving the most important semantic information for each step.
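To see how this slots into a workflow step, here is a minimal usage sketch: a hypothetical summarization handler trims the accumulated outputs to an illustrative 2,000-token budget before building its prompt. The SummarizationHandler type, the budget, and the prompt wording are assumptions for this example, not part of the system above.
// Hypothetical step handler in the same package as ContextManager.
func (h *SummarizationHandler) buildPrompt(outputs []StepOutput, document string) string {
	cm := &ContextManager{}
	// Illustrative budget: keep roughly 2,000 tokens of upstream context.
	priorContext := cm.extractRelevantContext(outputs, 2000, "summarization")
	return fmt.Sprintf("Context from earlier steps:\n%s\n\nSummarize the key findings of:\n%s",
		priorContext, document)
}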
Dynamic Prompt Template Management
Hardcoding prompts directly in your application code creates several critical issues: it’s impossible to A/B test different prompt approaches, you can’t update prompts without deploying new code, and you have no versioning or rollback capability when a prompt change degrades performance.
Implement a dynamic prompt template system with versioning stored in DynamoDB, allowing for runtime prompt updates, A/B testing, and immediate rollback capabilities.
func (pm *PromptManager) BuildDynamicPrompt(ctx context.Context, templateID, version string, variables map[string]string) (string, error) {
	template, err := pm.getTemplate(ctx, templateID, version)
	if err != nil {
		return pm.getFallbackTemplate(templateID) // Graceful degradation
	}

	// Validate all required variables are present
	if err := pm.validateVariables(template.RequiredVars, variables); err != nil {
		return "", fmt.Errorf("missing required variables: %w", err)
	}

	// Perform safe variable substitution
	prompt := template.Content
	for key, value := range variables {
		prompt = strings.ReplaceAll(prompt, "{{"+key+"}}", pm.sanitizeVariable(value))
	}

	go pm.logTemplateUsage(templateID, version) // Async analytics
	return prompt, nil
}
This system separates prompt content from application logic by storing templates in DynamoDB with version control. Each template specifies required variables, enabling validation before substitution. The fallback mechanism ensures your system remains operational even when the template service is unavailable. Analytics tracking helps you understand which prompts perform best.
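To make the storage side concrete, here is a rough sketch of what getTemplate could look like using the AWS SDK for Go v2. The PromptTemplate fields, the prompt_templates table name, its templateId/version key schema, and the injected pm.dynamoClient are all assumptions for this example.
import (
	"context"
	"fmt"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/feature/dynamodb/attributevalue"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
)

// PromptTemplate mirrors one item in a hypothetical prompt_templates table.
type PromptTemplate struct {
	TemplateID   string   `dynamodbav:"templateId"`
	Version      string   `dynamodbav:"version"`
	Content      string   `dynamodbav:"content"`
	RequiredVars []string `dynamodbav:"requiredVars"`
}

// getTemplate fetches one versioned template; pm.dynamoClient (*dynamodb.Client)
// is assumed to be injected at construction time.
func (pm *PromptManager) getTemplate(ctx context.Context, templateID, version string) (*PromptTemplate, error) {
	out, err := pm.dynamoClient.GetItem(ctx, &dynamodb.GetItemInput{
		TableName: aws.String("prompt_templates"),
		Key: map[string]types.AttributeValue{
			"templateId": &types.AttributeValueMemberS{Value: templateID},
			"version":    &types.AttributeValueMemberS{Value: version},
		},
	})
	if err != nil {
		return nil, fmt.Errorf("fetching template %s@%s: %w", templateID, version, err)
	}
	if out.Item == nil {
		return nil, fmt.Errorf("template %s@%s not found", templateID, version)
	}

	var tpl PromptTemplate
	if err := attributevalue.UnmarshalMap(out.Item, &tpl); err != nil {
		return nil, fmt.Errorf("unmarshaling template: %w", err)
	}
	return &tpl, nil
}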
LLM Provider Abstraction and Failover
Relying on a single LLM provider creates vendor lock-in and single points of failure. Different providers have varying capabilities, pricing models, and availability characteristics. Your production system needs to intelligently route requests and handle provider failures without manual intervention.
Implement a provider abstraction layer that can intelligently route requests based on provider health, cost optimization, and capability matching, with automatic failover when providers experience issues.
func (lpm *LLMProviderManager) GenerateCompletion(ctx context.Context, prompt string, config CompletionConfig) (*CompletionResult, error) {
	providers := lpm.getProviderSequence(config.PreferredProvider)

	for attemptNum, providerName := range providers {
		if !lpm.isProviderHealthy(ctx, providerName) {
			continue // Skip unhealthy providers
		}

		provider := lpm.providers[providerName]
		result, err := lpm.executeWithCircuitBreaker(ctx, provider, prompt, config)
		if err != nil {
			lpm.recordFailure(providerName, err)
			continue // Try next provider
		}

		return &CompletionResult{
			Content:  result.Content,
			Provider: providerName,
			Attempt:  attemptNum + 1,
			Cost:     lpm.calculateCost(result, providerName),
		}, nil
	}

	return nil, fmt.Errorf("all LLM providers failed")
}
The system maintains health status for each provider through periodic health checks. When a request comes in, it attempts providers in order of preference, skipping any that have failed recent health checks. Circuit breakers prevent cascading failures by temporarily blocking requests to consistently failing providers. Cost calculation and attempt tracking provide observability into provider performance and expenses.
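As an illustration of the health-check half, the sketch below assumes each provider exposes a cheap Ping(ctx) call and that the manager keeps a mutex-guarded health map; the healthMu, healthy, and lastChecked fields are invented for this sketch.
// startHealthChecks probes every provider on a fixed interval and records the result.
func (lpm *LLMProviderManager) startHealthChecks(ctx context.Context, interval time.Duration) {
	ticker := time.NewTicker(interval)
	go func() {
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				for name, provider := range lpm.providers {
					checkCtx, cancel := context.WithTimeout(ctx, 5*time.Second)
					err := provider.Ping(checkCtx) // assumed lightweight liveness call
					cancel()

					lpm.healthMu.Lock()
					lpm.healthy[name] = (err == nil)
					lpm.lastChecked[name] = time.Now()
					lpm.healthMu.Unlock()
				}
			}
		}
	}()
}

// isProviderHealthy reads the most recent health-check result.
func (lpm *LLMProviderManager) isProviderHealthy(_ context.Context, name string) bool {
	lpm.healthMu.RLock()
	defer lpm.healthMu.RUnlock()
	return lpm.healthy[name]
}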
Production-Ready Step Functions Implementation
Step Functions workflows for production must handle complex error scenarios, maintain detailed execution state, support parallel processing where beneficial, and provide comprehensive observability. Unlike POC workflows that might simply chain Lambda functions, production workflows need sophisticated retry logic, error handling, and state management.
Structure the Step Functions definition to include validation gates, parallel processing opportunities, and comprehensive error handling, with different strategies for different failure types.
{
  "Comment": "Production Multi-Prompt Chain for Document Analysis",
  "StartAt": "ValidateAndPrepare",
  "States": {
    "ValidateAndPrepare": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:${region}:${account}:function:ValidateInput",
      "Retry": [
        {
          "ErrorEquals": [
            "Lambda.ServiceException",
            "Lambda.AWSLambdaException"
          ],
          "IntervalSeconds": 2,
          "MaxAttempts": 3,
          "BackoffRate": 2.0
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["ValidationException"],
          "Next": "ValidationFailure",
          "ResultPath": "$.error"
        }
      ],
      "Next": "ParallelExtractionPhase"
    },
    "ParallelExtractionPhase": {
      "Type": "Parallel",
      "ResultPath": "$.parallel_results",
      "Branches": [
        {
          "StartAt": "ExtractEntities",
          "States": {
            "ExtractEntities": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:${region}:${account}:function:EntityExtraction",
              "Parameters": {
                "prompt_template": "entity_extraction_v2",
                "document.$": "$.document",
                "max_tokens": 1000
              },
              "End": true
            }
          }
        }
      ],
      "Next": "ConsolidateAndSummarize"
    }
  }
}
This structure separates concerns by having dedicated validation, parallel processing, and consolidation steps. The retry logic distinguishes between transient AWS service errors (which should be retried) and validation errors (which should not). The parallel branches (only the entity-extraction branch is shown above) can process different aspects of the document simultaneously, reducing total execution time.
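For completeness, kicking off this workflow from your Go API layer is a single StartExecution call. The sketch below uses the AWS SDK for Go v2 Step Functions client; the function name, input type, and error messages are placeholders.
import (
	"context"
	"encoding/json"
	"fmt"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/sfn"
)

// StartDocumentAnalysis starts one execution of the multi-prompt workflow.
func StartDocumentAnalysis(ctx context.Context, client *sfn.Client, stateMachineArn string, req any) (string, error) {
	payload, err := json.Marshal(req)
	if err != nil {
		return "", fmt.Errorf("marshaling workflow input: %w", err)
	}

	out, err := client.StartExecution(ctx, &sfn.StartExecutionInput{
		StateMachineArn: aws.String(stateMachineArn),
		Input:           aws.String(string(payload)),
	})
	if err != nil {
		return "", fmt.Errorf("starting execution: %w", err)
	}

	// The execution ARN is what you poll (or subscribe to via EventBridge) for status.
	return aws.ToString(out.ExecutionArn), nil
}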
Comprehensive Monitoring and Observability
Production LLM systems require monitoring that goes far beyond typical application metrics. You need to track token usage patterns, cost per customer, prompt performance, provider reliability, and quality metrics like output coherence and relevance. This monitoring must be real-time enough to enable immediate response to issues.
We should implement a comprehensive metrics collection system that captures both technical performance and business-relevant quality indicators.
func (m *LLMMetrics) RecordComprehensiveMetrics(ctx context.Context, execution ExecutionMetrics) {
	// Performance metrics
	m.putMetric("PromptLatency", float64(execution.Latency.Milliseconds()),
		map[string]string{"step": execution.StepName, "provider": execution.Provider})

	// Cost and usage metrics
	m.putMetric("TokensConsumed", float64(execution.TokensUsed),
		map[string]string{"customer": execution.CustomerID, "step": execution.StepName})
	m.putMetric("ExecutionCost", execution.Cost,
		map[string]string{"customer": execution.CustomerID})

	// Quality metrics (when available)
	if execution.QualityScore > 0 {
		m.putMetric("OutputQuality", execution.QualityScore,
			map[string]string{"step": execution.StepName, "model": execution.Model})
	}

	// Error tracking
	if execution.Error != nil {
		m.putMetric("ErrorRate", 1.0,
			map[string]string{"error_type": execution.ErrorType, "provider": execution.Provider})
	}
}
This approach creates a comprehensive picture of system health by tracking multiple dimensions simultaneously. CloudWatch custom metrics enable real-time dashboards and alerting. The dimensional approach allows you to drill down into specific issues (e.g., “why is the entity extraction step failing for OpenAI specifically?”).
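A putMetric helper along these lines would publish those datapoints as CloudWatch custom metrics. The LLMPipeline namespace and the m.cloudwatch client field are assumptions in this sketch, and in practice you would batch datums (or use the Embedded Metric Format) rather than make one API call per metric.
import (
	"context"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/cloudwatch"
	"github.com/aws/aws-sdk-go-v2/service/cloudwatch/types"
)

// putMetric publishes one custom metric with its dimensions to CloudWatch.
func (m *LLMMetrics) putMetric(name string, value float64, dims map[string]string) {
	var dimensions []types.Dimension
	for k, v := range dims {
		dimensions = append(dimensions, types.Dimension{
			Name:  aws.String(k),
			Value: aws.String(v),
		})
	}

	// Fire-and-forget: metrics publishing should never block or fail a request.
	_, _ = m.cloudwatch.PutMetricData(context.Background(), &cloudwatch.PutMetricDataInput{
		Namespace: aws.String("LLMPipeline"),
		MetricData: []types.MetricDatum{{
			MetricName: aws.String(name),
			Value:      aws.Float64(value),
			Timestamp:  aws.Time(time.Now()),
			Dimensions: dimensions,
		}},
	})
}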
Advanced Error Handling and Circuit Breaker Implementation
LLM APIs are inherently unreliable - they can experience rate limiting, transient failures, and gradual performance degradation. Simple retry logic isn’t sufficient; you need sophisticated error handling that can distinguish between different failure types and respond appropriately. Circuit breakers prevent cascading failures when a provider is experiencing systemic issues.
Implement circuit breakers with configurable thresholds that can prevent requests to failing providers while allowing gradual recovery testing.
type CircuitBreaker struct {
	failureThreshold int
	successThreshold int
	timeout          time.Duration
	state            CircuitState
	failures         int
	successes        int
	lastFailure      time.Time
	mu               sync.RWMutex
}

func (cb *CircuitBreaker) Execute(fn func() (interface{}, error)) (interface{}, error) {
	cb.mu.Lock()
	// Check if circuit should transition from OPEN to HALF_OPEN
	if cb.state == StateOpen && time.Since(cb.lastFailure) > cb.timeout {
		cb.state = StateHalfOpen
		cb.successes = 0
	}
	if cb.state == StateOpen {
		cb.mu.Unlock()
		return nil, &CircuitOpenError{timeout: cb.timeout}
	}
	cb.mu.Unlock()

	// Run the call outside the lock so a slow provider doesn't serialize every request
	result, err := fn()

	cb.mu.Lock()
	defer cb.mu.Unlock()
	if err != nil {
		cb.recordFailure()
		return nil, err
	}
	cb.recordSuccess()
	return result, nil
}
The circuit breaker tracks failure rates and automatically blocks requests when failure thresholds are exceeded. After a timeout period, it enters a half-open state to test if the provider has recovered. This prevents overwhelming failing providers while allowing automatic recovery when they become healthy again.
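To round out the picture, here is one plausible shape for the recordFailure and recordSuccess transitions referenced above, assuming a StateClosed constant alongside StateOpen and StateHalfOpen, and that both methods are called while cb.mu is held, as in Execute.
// recordFailure counts the failure and trips the breaker once the threshold is
// reached, or immediately if we were probing in the half-open state.
func (cb *CircuitBreaker) recordFailure() {
	cb.failures++
	cb.lastFailure = time.Now()
	if cb.state == StateHalfOpen || cb.failures >= cb.failureThreshold {
		cb.state = StateOpen
	}
}

// recordSuccess resets the failure count; while half-open, enough consecutive
// successes close the breaker again.
func (cb *CircuitBreaker) recordSuccess() {
	cb.failures = 0
	if cb.state == StateHalfOpen {
		cb.successes++
		if cb.successes >= cb.successThreshold {
			cb.state = StateClosed
		}
	}
}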
Cost Optimization and Billing Management
LLM costs can spiral out of control quickly, especially with multi-prompt chains that might use 3-5x the tokens of simple single-prompt systems. You need real-time cost tracking, budget enforcement, and detailed cost attribution by customer, workflow step, and provider.
Implement a cost management system that tracks usage in real-time and can enforce budget limits before expensive operations execute.
func (cm *CostManager) EnforceAndTrackCost(ctx context.Context, customerID string, operation CostOperation) error {
	// Estimate cost before execution
	estimatedCost := cm.calculateEstimatedCost(operation)

	// Check against budget limits
	currentUsage, err := cm.getCurrentMonthlyUsage(ctx, customerID)
	if err != nil {
		return fmt.Errorf("failed to retrieve current usage: %w", err)
	}

	budgetLimit := cm.getBudgetLimit(customerID)
	if currentUsage+estimatedCost > budgetLimit {
		return &BudgetExceededException{
			CustomerID: customerID,
			Requested:  estimatedCost,
			Available:  budgetLimit - currentUsage,
		}
	}

	// Execute and record actual cost
	actualCost, err := operation.Execute()
	if err != nil {
		return err
	}

	// Async cost recording to avoid blocking the response; consider a detached
	// context here if the request context may be cancelled before it completes.
	go cm.recordCostAsync(ctx, customerID, operation.StepName, actualCost)
	return nil
}
This system provides cost control at multiple levels: pre-execution budget checking prevents expensive operations from running when budgets are exhausted, real-time usage tracking enables immediate cost attribution, and asynchronous cost recording ensures that billing data is captured without impacting response times.
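The pre-execution estimate can stay deliberately simple. The sketch below assumes the operation carries rough prompt and completion token estimates and a model name; the model keys and prices shown are placeholders, not real rates, and belong in configuration rather than code.
// Placeholder per-1K-token prices; real values belong in configuration.
var pricePer1KTokens = map[string]float64{
	"model-a": 0.003,
	"model-b": 0.015,
}

// calculateEstimatedCost turns a rough token estimate into a dollar figure
// before the operation runs. EstimatedPromptTokens, EstimatedCompletionTokens,
// and Model are assumed fields on CostOperation for this sketch.
func (cm *CostManager) calculateEstimatedCost(op CostOperation) float64 {
	estimatedTokens := op.EstimatedPromptTokens + op.EstimatedCompletionTokens
	price, ok := pricePer1KTokens[op.Model]
	if !ok {
		price = 0.02 // Conservative default for unknown models
	}
	return float64(estimatedTokens) / 1000.0 * price
}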
Implementing Redundancy and High Availability
Production LLM systems must remain available even when individual AWS regions experience issues. Step Functions workflows need to be deployable across multiple regions with automatic failover capabilities. This is particularly challenging because Step Functions state machines are region-specific resources.
Create a multi-region orchestration layer that can deploy identical workflows across regions and intelligently route requests based on regional health.
func (mrs *MultiRegionService) ExecuteWorkflowWithFailover(ctx context.Context, input WorkflowInput) (*WorkflowResult, error) {
	regions := mrs.getOrderedRegions(input.CustomerLocation)

	for attempt, region := range regions {
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		default:
		}

		client := mrs.stepFunctionsClients[region]
		stateMachineArn := mrs.buildRegionalArn(region, input.WorkflowType)

		result, err := mrs.executeInRegion(ctx, client, stateMachineArn, input)
		if err == nil {
			if attempt > 0 {
				// Log failover for monitoring
				go mrs.logFailoverEvent(mrs.primaryRegion, region, input.WorkflowType)
			}
			return result, nil
		}

		log.Printf("Region %s failed (attempt %d): %v", region, attempt+1, err)
	}

	return nil, fmt.Errorf("all regions failed for workflow execution")
}
This approach maintains Step Functions deployments in multiple regions and routes requests based on customer location and regional health. When the primary region fails, requests automatically fail over to secondary regions. The ordering of regions can be optimized based on customer location to minimize latency while maintaining reliability.
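One plausible implementation of getOrderedRegions, assuming the service knows its deployed regions and keeps a coarse customer-location-to-region preference map (both invented for this sketch):
// getOrderedRegions puts the region closest to the customer first and the
// remaining deployed regions after it as failover targets.
func (mrs *MultiRegionService) getOrderedRegions(customerLocation string) []string {
	// Simplified location-to-region preferences; illustrative only.
	preferred := map[string]string{
		"eu": "eu-west-1",
		"us": "us-east-1",
		"ap": "ap-southeast-1",
	}

	primary, ok := preferred[customerLocation]
	if !ok {
		primary = mrs.primaryRegion
	}

	ordered := []string{primary}
	for _, region := range mrs.deployedRegions {
		if region != primary {
			ordered = append(ordered, region)
		}
	}
	return ordered
}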
Security Considerations
LLM systems are particularly vulnerable to prompt injection attacks, where malicious users attempt to manipulate the model’s behavior through carefully crafted inputs. Additionally, these systems often process sensitive data that requires protection both in transit and when passed between workflow steps.
Implement comprehensive input validation and output sanitization that can detect and prevent common attack patterns.
func (validator *SecurityValidator) ValidateAndSanitize(input UserInput) (*ValidationResult, error) {
	result := &ValidationResult{IsSafe: true, RiskLevel: "low"}

	// Check for prompt injection patterns
	for _, pattern := range validator.injectionPatterns {
		if pattern.MatchString(input.Content) {
			result.IsSafe = false
			result.RiskLevel = "high"
			result.DetectedThreats = append(result.DetectedThreats, "prompt_injection")

			// Log security incident for analysis
			go validator.logSecurityIncident(input.CustomerID, "prompt_injection", input.Content)
			break
		}
	}

	// PII detection and redaction
	if piiMatches := validator.detectPII(input.Content); len(piiMatches) > 0 {
		if result.RiskLevel == "low" {
			result.RiskLevel = "medium" // Never downgrade an already high-risk result
		}
		result.RedactedContent = validator.redactPII(input.Content, piiMatches)
		result.PIIDetected = true
	}

	return result, nil
}
This security layer examines all user inputs before they reach the LLM providers. Pattern matching identifies common injection techniques, while PII detection protects sensitive information. The validation result guides downstream processing - high-risk inputs are rejected, medium-risk inputs might be processed with additional safeguards.
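A starting point for the injection pattern list might look like the sketch below. The regexes are deliberately small and illustrative; in practice you would pair them with a model-based classifier rather than rely on pattern matching alone.
import "regexp"

// A small, illustrative set of prompt-injection patterns; regex matching is a
// first line of defense, not a complete solution.
var defaultInjectionPatterns = []*regexp.Regexp{
	regexp.MustCompile(`(?i)ignore (all |any |the )?(previous|prior|above) instructions`),
	regexp.MustCompile(`(?i)disregard (the |your )?(system )?prompt`),
	regexp.MustCompile(`(?i)you are now in developer mode`),
	regexp.MustCompile(`(?i)reveal (your|the) system prompt`),
}

func NewSecurityValidator() *SecurityValidator {
	return &SecurityValidator{injectionPatterns: defaultInjectionPatterns}
}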
Performance Optimization and Caching
Multi-prompt chains can be computationally expensive and slow. You need intelligent caching that can recognize when similar requests can reuse previous results, connection pooling to optimize HTTP overhead, and request batching where possible.
Implement semantic caching that goes beyond exact string matches to identify conceptually similar requests.
func (cache *SemanticCache) GetOrCompute(ctx context.Context, prompt string, config LLMConfig, computeFn func() (*Response, error)) (*Response, error) {
	// Try exact cache hit first
	if cached, found := cache.exactCache.Get(cache.cacheKey(prompt, config)); found {
		cache.metrics.RecordHit("exact", 1.0) // score 1.0 keeps RecordHit to a single signature
		return cached.(*Response), nil
	}

	// Try semantic similarity match
	embedding, err := cache.embeddingService.GetEmbedding(ctx, prompt)
	if err == nil {
		if similar, score := cache.findSimilarCached(embedding, cache.similarityThreshold); similar != nil {
			cache.metrics.RecordHit("semantic", score)
			return similar, nil
		}
	}

	// Cache miss - compute new result
	cache.metrics.RecordMiss()
	result, err := computeFn()
	if err != nil {
		return nil, err
	}

	// Cache the result with both exact and semantic indexing
	go cache.storeResult(prompt, config, embedding, result)
	return result, nil
}
This caching system uses multiple strategies: exact string matching for identical prompts, and semantic similarity matching for conceptually similar prompts. Embeddings enable the system to recognize that “Summarize this document” and “Create a brief summary of this text” are functionally equivalent requests that can share cached results.
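Behind findSimilarCached, a simple in-memory version could compute cosine similarity over stored embeddings, as sketched below. The entries field and its Embedding/Response shape are assumptions, and a real deployment would swap the linear scan for a vector index once the cache grows.
import "math"

// cosineSimilarity: 1.0 means the two embedding vectors point the same way.
func cosineSimilarity(a, b []float64) float64 {
	if len(a) != len(b) {
		return 0
	}
	var dot, normA, normB float64
	for i := range a {
		dot += a[i] * b[i]
		normA += a[i] * a[i]
		normB += b[i] * b[i]
	}
	if normA == 0 || normB == 0 {
		return 0
	}
	return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}

// findSimilarCached linearly scans cached entries for the closest match above
// the threshold; replace with a vector index (e.g. OpenSearch, pgvector) at scale.
func (cache *SemanticCache) findSimilarCached(embedding []float64, threshold float64) (*Response, float64) {
	var best *Response
	bestScore := threshold
	for _, entry := range cache.entries {
		if score := cosineSimilarity(embedding, entry.Embedding); score >= bestScore {
			best, bestScore = entry.Response, score
		}
	}
	return best, bestScore
}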
Conclusion
Converting an LLM POC to production using Go and AWS Step Functions requires addressing challenges across multiple dimensions: reliability, observability, cost management, security, and performance. The key insight is that production LLM systems are fundamentally different from traditional APIs - they’re more expensive, less predictable, and require sophisticated orchestration.
The Go language provides excellent tooling for building these systems: strong concurrency support for handling multiple LLM requests efficiently, comprehensive AWS SDK integration, and the type safety that’s crucial for managing complex workflow state.
Success in production LLM systems comes from treating each challenge as a first-class concern rather than an afterthought. Building robust error handling, comprehensive monitoring, and intelligent cost management from the beginning creates a foundation that can scale and adapt as your requirements evolve.
The patterns and approaches outlined here provide a blueprint for building production-ready systems that can handle enterprise workloads while maintaining the reliability, security, and cost-effectiveness that business operations demand.
A comprehensive multi-prompt chain management system provides:
- Robust Orchestration: Step Functions manage complex workflows with automatic state persistence
- Real-time Monitoring: EventBridge enables real-time status updates across the system
- Error Handling: Comprehensive retry and fallback mechanisms for reliability
- Quality Assurance: Built-in quality checks with automatic refinement loops
- Progress Tracking: Detailed progress calculation and completion time estimates
- Flexible Client Support: Both real-time and polling-based status checking options
Building production-ready LLM systems requires navigating dozens of architectural decisions, each with far-reaching implications. At Yantratmika Solutions, we’ve helped organizations avoid the common pitfalls and build systems that scale. The devil, as always, is in the implementation details.