
Scaling AI Applications: Architecture Decisions & Trade-offs

· 6 min read
Uday Tamma
Building AI-Powered Applications

Building the AI Ingredient Safety Analyzer taught me valuable lessons about scaling LLM-powered applications. This post covers the key architectural decisions and trade-offs, framed as interview-ready explanations for scaling the system in production.

The Challenge

Our Ingredient Analysis API processes requests that require:

  • Multiple LLM calls (Research → Analysis → Critic validation)
  • Vector database queries (Qdrant)
  • Real-time web search (Google Search grounding)

Current Performance:

  • Average response time: ~47 seconds per request
  • Throughput: ~1 request/second
  • Target: Handle a roughly 10x increase in traffic (see Q1) without degradation

Key Scaling Questions & Answers

Q1: How would you scale this API to handle 10x more traffic?

Answer:

I'd implement a three-pronged approach:

  1. Response Caching (Redis/Memcached)

    • Cache ingredient research data (24-72 hour TTL)
    • Cache full analysis reports by ingredient+profile hash (1-6 hour TTL)
    • Expected improvement: 5x throughput for cached requests
  2. API Key Load Balancing

    • Pool multiple Gemini API keys
    • Implement rate-aware key selection
    • Each key has its own rate limit; N keys ≈ N× capacity
  3. Async Processing with Queue

    • Move to job queue (Celery/Redis Queue)
    • Return job ID immediately, poll for results
    • Prevents timeout issues on slow requests

Trade-off: Caching introduces a stale-data risk. Mitigation: invalidate cached entries whenever safety data is updated and use appropriate TTLs (a minimal sketch of the caching layer follows).
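To make the caching idea in item 1 concrete, here is a minimal sketch of a Redis-backed cache keyed by an ingredient+profile hash. The key scheme, TTL, connection details, and the run_analysis call are illustrative assumptions, not the production implementation:

import hashlib
import json

import redis  # assumes a reachable Redis instance; host/port below are placeholders

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)

def cache_key(ingredients: list[str], profile: dict) -> str:
    # Deterministic key from the sorted ingredient list and the user profile
    payload = json.dumps({"ingredients": sorted(ingredients), "profile": profile}, sort_keys=True)
    return "analysis:" + hashlib.sha256(payload.encode()).hexdigest()

def get_or_analyze(ingredients: list[str], profile: dict, ttl_seconds: int = 6 * 3600):
    key = cache_key(ingredients, profile)
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)  # Cache hit: skip the LLM pipeline entirely
    result = run_analysis(ingredients, profile)  # Hypothetical call into the agent pipeline
    cache.set(key, json.dumps(result), ex=ttl_seconds)
    return result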


Q2: Why did you choose Qdrant over other vector databases?

Answer:

| Factor | Qdrant | Pinecone | Weaviate | ChromaDB |
|---|---|---|---|---|
| Self-hosted option | Yes | No | Yes | Yes |
| Cloud managed | Yes | Yes | Yes | No |
| Filtering capability | Excellent | Good | Good | Basic |
| Python SDK | Native | Native | Native | Native |
| Cost | Free tier + pay-as-you-go | Expensive | Moderate | Free |

Decision rationale:

  • Qdrant Cloud offers generous free tier (1GB)
  • Excellent hybrid search (vector + payload filtering)
  • Can self-host later for cost optimization
  • Simple REST API for debugging

Trade-off: Qdrant is less mature than Pinecone. Mitigation: Qdrant's active development and good documentation offset this.
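To illustrate the hybrid-search point above, here is a rough sketch of a filtered vector query with the qdrant-client SDK. The collection name, payload field, and connection details are placeholders, not the project's actual schema:

from qdrant_client import QdrantClient, models

qdrant = QdrantClient(url="https://YOUR-CLUSTER.qdrant.io", api_key="...")  # placeholders

# Nearest neighbours by vector, restricted to points whose payload matches the filter
hits = qdrant.search(
    collection_name="ingredients",   # hypothetical collection name
    query_vector=query_embedding,    # embedding computed elsewhere
    query_filter=models.Filter(
        must=[
            models.FieldCondition(key="category", match=models.MatchValue(value="preservative"))
        ]
    ),
    limit=5,
)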


Q3: How do you handle API rate limits from Gemini?

Answer:

Current approach:

# Single key - limited capacity
client = genai.Client(api_key=settings.google_api_key)

Scaled approach:

import time

class RateLimitedKeyPool:
    def __init__(self, api_keys: list[str], rpm_limit: int = 15):
        self.keys = api_keys
        self.rpm_limit = rpm_limit
        self.request_times = {key: [] for key in api_keys}

    def get_available_key(self) -> str | None:
        now = time.time()
        for key in self.keys:
            # Drop timestamps older than one minute (sliding-window rate check)
            self.request_times[key] = [
                t for t in self.request_times[key]
                if t > now - 60
            ]
            if len(self.request_times[key]) < self.rpm_limit:
                self.request_times[key].append(now)
                return key
        return None  # All keys exhausted

Trade-off: Multiple keys increase cost and complexity. Consider: Is the traffic worth the operational overhead?
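For completeness, a caller might wrap the pool like this; the plural google_api_keys setting, the wait strategy, and the model name are assumptions for the sketch:

import time

pool = RateLimitedKeyPool(api_keys=settings.google_api_keys, rpm_limit=15)

def generate_with_pool(prompt: str):
    key = pool.get_available_key()
    while key is None:   # every key is at its per-minute limit
        time.sleep(1)    # simple wait; a real system might enqueue the job instead
        key = pool.get_available_key()
    client = genai.Client(api_key=key)
    return client.models.generate_content(model="gemini-2.0-flash", contents=prompt)  # model name is illustrative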


Q4: Why use a multi-agent architecture instead of a single LLM call?

Answer:

Single-call approach:

  • Pros: Faster (one LLM round-trip), simpler
  • Cons: Less accurate, no self-correction, monolithic prompt

Multi-agent approach (Research → Analysis → Critic):

  • Pros:
    • Separation of concerns (research vs analysis vs validation)
    • Self-correction loop (Critic can reject and retry)
    • Better accuracy through validation gates
    • Easier to debug and improve individual agents
  • Cons: 3x LLM calls, higher latency, more complex

Decision rationale: For safety-critical information, accuracy trumps speed. The Critic agent catches ~15% of issues that would otherwise reach users.
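As a rough sketch of that control flow (the agent functions and the retry cap are illustrative; the real pipeline may be wired together with an orchestration framework rather than a plain loop):

MAX_RETRIES = 3

def run_pipeline(ingredients: list[str], profile: dict) -> dict:
    research = research_agent(ingredients)          # gather grounded sources
    report = {}
    for attempt in range(MAX_RETRIES + 1):
        report = analysis_agent(research, profile)  # draft the safety report
        verdict = critic_agent(report, research)    # validation gate
        if verdict.approved:
            return report                           # Critic accepted the report
    return {**report, "escalated": True}            # never approved: flag for human review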


Q5: How do you ensure consistency between mobile and web clients?

Answer:

Architecture decisions:

  1. Single REST API - Both clients call the same /analyze endpoint
  2. Shared response schema - Pydantic models define the contract
  3. API versioning - /api/v1/analyze allows future breaking changes

from pydantic import BaseModel

class AnalysisResponse(BaseModel):
    success: bool
    product_name: str
    overall_risk: str
    average_safety_score: int
    summary: str
    allergen_warnings: list[str]
    ingredients: list[IngredientDetail]

Trade-off: A single API means both clients get the same data, even if one needs less. We accept slight over-fetching for consistency.
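A minimal sketch of how the versioned route and the shared schema fit together (the HTTP method and the run_analysis helper are assumptions):

from fastapi import FastAPI

app = FastAPI()

@app.post("/api/v1/analyze", response_model=AnalysisResponse)
async def analyze(request: AnalysisRequest) -> AnalysisResponse:
    # Mobile and web clients both call this endpoint; FastAPI validates
    # the response against the shared Pydantic contract above.
    return await run_analysis(request)  # hypothetical orchestration call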


Q6: How would you add real-time updates for long-running requests?

Answer:

Options considered:

| Approach | Pros | Cons |
|---|---|---|
| Polling | Simple, works everywhere | Inefficient, delayed |
| WebSockets | Real-time, bidirectional | Complex, stateful |
| Server-Sent Events | Real-time, simple | One-way only |
| Webhooks | Decoupled | Requires client endpoint |

Recommendation for this API:

Server-Sent Events (SSE) for progress updates:

import json

from fastapi.responses import StreamingResponse

@app.get("/analyze/stream")
async def analyze_stream(request: AnalysisRequest):
    async def event_generator():
        yield f"data: {json.dumps({'stage': 'research', 'progress': 0})}\n\n"
        # ... research phase
        yield f"data: {json.dumps({'stage': 'analysis', 'progress': 33})}\n\n"
        # ... analysis phase
        yield f"data: {json.dumps({'stage': 'validation', 'progress': 66})}\n\n"
        # ... validation phase
        yield f"data: {json.dumps({'stage': 'complete', 'result': result})}\n\n"

    return StreamingResponse(event_generator(), media_type="text/event-stream")
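On the consuming side, the web client would typically use the browser's EventSource API; for a script or test, a streaming HTTP client works too. A rough sketch with httpx (request payload omitted for brevity):

import httpx

with httpx.stream("GET", "http://localhost:8000/analyze/stream", timeout=None) as response:
    for line in response.iter_lines():
        if line.startswith("data: "):
            print(line[len("data: "):])  # each event carries a JSON stage/progress payload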

Q7: What's your testing strategy for LLM-based features?

Answer:

Testing pyramid for LLM apps:

  1. Unit tests - Mock LLM responses, test business logic
  2. Integration tests - Test agent orchestration with fixtures
  3. Contract tests - Verify LLM output schema compliance
  4. Evaluation tests - Test accuracy on labeled datasets
  5. Load tests - Verify performance under stress

Key insight: LLM outputs are non-deterministic. Solutions:

  • Use temperature=0.1 for more consistent outputs
  • Test for schema compliance, not exact text matching
  • Build evaluation datasets with expected categories

def test_analysis_returns_valid_risk_level():
    # Assert on the category, not the exact wording of the LLM output
    result = analyze_ingredients(test_state)
    assert result["analysis_report"]["overall_risk"] in ["low", "medium", "high"]
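At the bottom of the pyramid, a unit test can patch the LLM-backed call so the surrounding logic is tested deterministically; the patch target and the shape of the fake report here are hypothetical:

from unittest.mock import patch

def test_allergen_warning_surfaces_for_profile_match():
    fake_report = {"overall_risk": "medium", "allergen_warnings": ["peanut"]}
    # Patch the (hypothetical) LLM-backed helper so the test never calls the API
    with patch("app.agents.analysis.generate_report", return_value=fake_report):
        result = analyze_ingredients(test_state)
    assert "peanut" in result["analysis_report"]["allergen_warnings"]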

Q8: How do you handle failures gracefully?

Answer:

Failure modes and handling:

| Failure | Detection | Recovery |
|---|---|---|
| LLM timeout | Request timeout (120s) | Retry with exponential backoff |
| Rate limit | 429 response | Switch to backup API key |
| Qdrant down | Connection error | Fall back to Google Search only |
| Invalid input | Pydantic validation | Return 422 with details |
| Critic rejection | Validation loop | Retry up to 3x, then escalate |

# Critic agent retry logic
max_retries = 3

if result == ValidationResult.REJECTED:
    if retry_count < max_retries:
        return {"retry_count": retry_count + 1}  # Retry the analysis
    else:
        return {"result": ValidationResult.ESCALATED}  # Give up gracefully
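For the first row of the table, a small backoff helper is enough; the exception type below is a stand-in for whatever the SDK actually raises on timeout:

import random
import time

def call_with_backoff(fn, max_attempts: int = 4, base_delay: float = 2.0):
    # Retry a flaky LLM call with exponential backoff plus jitter
    for attempt in range(max_attempts):
        try:
            return fn()
        except TimeoutError:  # stand-in for the SDK's timeout exception
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) + random.random())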

Q9: What would you do differently if starting over?

Answer:

  1. Start with async from day one - Easier to add concurrency later
  2. Implement caching earlier - Would have saved development API costs
  3. Use structured outputs - Gemini's JSON mode for reliable parsing (see the sketch after this list)
  4. Add observability first - LangSmith integration should be from start
  5. Design for horizontal scaling - Stateless API from the beginning
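For item 3, a hedged sketch of what structured output could look like with the google-genai SDK (config fields per my reading of the SDK; the model name is illustrative, so verify against the installed version):

from google import genai
from google.genai import types

client = genai.Client(api_key=settings.google_api_key)

response = client.models.generate_content(
    model="gemini-2.0-flash",  # illustrative
    contents=prompt,
    config=types.GenerateContentConfig(
        response_mime_type="application/json",
        response_schema=AnalysisResponse,  # reuse the shared Pydantic contract
    ),
)
report = AnalysisResponse.model_validate_json(response.text)  # parse instead of scraping free text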

Q10: How do you balance cost vs performance?

Answer:

Cost breakdown per request:

  • Gemini API: ~$0.01-0.05 (depending on tokens)
  • Qdrant Cloud: Included in free tier
  • Railway hosting: ~$5/month
  • Google Search: Included in Gemini grounding

Optimization strategies:

  1. Cache common ingredients - 80% of requests hit top 100 ingredients
  2. Use smaller models for validation - the Critic doesn't need the full model (see the routing sketch below)
  3. Batch embeddings - Reduce API calls for multiple ingredients
  4. Set appropriate TTLs - Balance freshness vs cost

Trade-off: Aggressive caching reduces costs but may serve stale safety data. Our mitigation: 24-hour TTL with manual invalidation for critical updates.
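To make strategy 2 concrete, model selection can be a simple per-agent lookup; the model names below are placeholders for whatever cheap and strong tiers are current:

# Route each agent to the cheapest model that is good enough for its job
AGENT_MODELS = {
    "research": "gemini-2.0-flash",     # grounding-heavy but formulaic
    "analysis": "gemini-1.5-pro",       # highest-stakes reasoning step
    "critic": "gemini-2.0-flash-lite",  # schema/consistency checks are cheap
}

def model_for(agent: str) -> str:
    return AGENT_MODELS[agent]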


Summary

Scaling LLM applications requires balancing:

  • Latency vs Accuracy - More agents = better results, slower response
  • Cost vs Freshness - Caching saves money, risks stale data
  • Simplicity vs Resilience - More fallbacks = more complexity
  • Speed vs Safety - Fast responses vs thorough validation

The key is making intentional trade-offs based on your specific requirements, then documenting the reasoning for future reference.


This post is part of the interview preparation series for the AI Ingredient Safety Analyzer project.