Scaling AI Applications: Architecture Decisions & Trade-offs
Building the AI Ingredient Safety Analyzer taught me valuable lessons about scaling LLM-powered applications. This post covers the key architectural decisions, trade-offs, and interview-ready explanations for production scaling.
The Challenge
Our Ingredient Analysis API processes requests that require:
- Multiple LLM calls (Research → Analysis → Critic validation)
- Vector database queries (Qdrant)
- Real-time web search (Google Search grounding)
Current Performance:
- Average response time: ~47 seconds per request
- Throughput: ~1 request/second
- Target: Handle increased load without degradation
Key Scaling Questions & Answers
Q1: How would you scale this API to handle 10x more traffic?
Answer:
I'd implement a three-pronged approach:
1. Response Caching (Redis/Memcached)
   - Cache ingredient research data (24-72 hour TTL)
   - Cache full analysis reports by ingredient+profile hash (1-6 hour TTL)
   - Expected improvement: 5x throughput for cached requests
   - A minimal caching sketch is shown after this answer
2. API Key Load Balancing
   - Pool multiple Gemini API keys
   - Implement rate-aware key selection
   - Each key has rate limits; N keys = N× capacity
3. Async Processing with Queue
   - Move to a job queue (Celery/Redis Queue)
   - Return a job ID immediately, poll for results
   - Prevents timeout issues on slow requests

Trade-off: Caching introduces stale-data risk. Mitigation: invalidate the cache on safety data updates and use appropriate TTLs.
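A minimal sketch of the report-level cache, assuming a local Redis instance and a hypothetical `analyze_fn` wrapper around the multi-agent pipeline (the key scheme and TTL are illustrative):

```python
import hashlib
import json

import redis  # assumes a reachable Redis instance; host/port are placeholders

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)

REPORT_TTL_SECONDS = 6 * 3600  # upper end of the 1-6 hour report TTL above


def cache_key(ingredients: list[str], profile: dict) -> str:
    """Build a deterministic key from the ingredient list plus user profile."""
    payload = json.dumps(
        {"ingredients": sorted(ingredients), "profile": profile}, sort_keys=True
    )
    return "analysis:" + hashlib.sha256(payload.encode()).hexdigest()


def get_or_analyze(ingredients: list[str], profile: dict, analyze_fn) -> dict:
    """Return a cached report if present; otherwise run the pipeline and cache the result."""
    key = cache_key(ingredients, profile)
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    report = analyze_fn(ingredients, profile)  # the expensive multi-agent call
    cache.setex(key, REPORT_TTL_SECONDS, json.dumps(report))
    return report
```

The same pattern applies to the ingredient-research cache, just with the longer 24-72 hour TTL.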
Q2: Why did you choose Qdrant over other vector databases?
Answer:
| Factor | Qdrant | Pinecone | Weaviate | ChromaDB |
|---|---|---|---|---|
| Self-hosted option | Yes | No | Yes | Yes |
| Cloud managed | Yes | Yes | Yes | No |
| Filtering capability | Excellent | Good | Good | Basic |
| Python SDK | Native | Native | Native | Native |
| Cost | Free tier + pay-as-you-go | Expensive | Moderate | Free |
Decision rationale:
- Qdrant Cloud offers generous free tier (1GB)
- Excellent hybrid search (vector + payload filtering)
- Can self-host later for cost optimization
- Simple REST API for debugging
Trade-off: Qdrant is less mature than Pinecone. Mitigation: Qdrant's active development and good documentation offset this.
Q3: How do you handle API rate limits from Gemini?
Answer:
Current approach:
```python
# Single key - limited capacity
client = genai.Client(api_key=settings.google_api_key)
```
Scaled approach:
```python
import time


class RateLimitedKeyPool:
    def __init__(self, api_keys: list[str], rpm_limit: int = 15):
        self.keys = api_keys
        self.rpm_limit = rpm_limit
        self.request_times = {key: [] for key in api_keys}

    def get_available_key(self) -> str | None:
        now = time.time()
        for key in self.keys:
            # Clean requests older than 1 minute from the sliding window
            self.request_times[key] = [
                t for t in self.request_times[key]
                if t > now - 60
            ]
            if len(self.request_times[key]) < self.rpm_limit:
                self.request_times[key].append(now)
                return key
        return None  # All keys exhausted
```
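Building directly on the pool above, a brief usage sketch (the key names and the 1-second sleep are placeholders):

```python
pool = RateLimitedKeyPool(api_keys=["key-1", "key-2", "key-3"])  # placeholder keys


def acquire_key() -> str:
    """Block until some key in the pool has rate-limit budget left."""
    while True:
        key = pool.get_available_key()
        if key is not None:
            return key
        time.sleep(1)  # every key is at its RPM limit; wait for the window to slide


# Use the acquired key exactly like the single-key client above:
# client = genai.Client(api_key=acquire_key())
```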
Trade-off: Multiple keys increase cost and complexity. Consider: Is the traffic worth the operational overhead?
Q4: Why use a multi-agent architecture instead of a single LLM call?
Answer:
Single-call approach:
- Pros: Faster (one LLM round-trip), simpler
- Cons: Less accurate, no self-correction, monolithic prompt
Multi-agent approach (Research → Analysis → Critic):
- Pros:
  - Separation of concerns (research vs analysis vs validation)
  - Self-correction loop (Critic can reject and retry)
  - Better accuracy through validation gates
  - Easier to debug and improve individual agents
- Cons: 3x LLM calls, higher latency, more complex
Decision rationale: For safety-critical information, accuracy trumps speed. The Critic agent catches ~15% of issues that would otherwise reach users.
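To make the control flow concrete, here's a minimal sketch of the validation loop; `run_research`, `run_analysis`, and `run_critic` are hypothetical stand-ins for the real agents:

```python
MAX_RETRIES = 3


def run_pipeline(ingredients: list[str], profile: dict) -> dict:
    """Research -> Analysis -> Critic, with the Critic gating what reaches users."""
    research = run_research(ingredients)  # hypothetical: vector DB + web search
    feedback = None
    for _ in range(MAX_RETRIES + 1):
        report = run_analysis(research, profile, feedback)  # hypothetical: draft the report
        verdict = run_critic(report, research)              # hypothetical: validate claims
        if verdict.approved:
            return report
        feedback = verdict.feedback  # feed rejection reasons into the retry
    return {"status": "escalated", "reason": feedback}  # give up gracefully, as in Q8
```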
Q5: How do you ensure consistency between mobile and web clients?
Answer:
Architecture decisions:
- Single REST API - both clients call the same `/analyze` endpoint
- Shared response schema - Pydantic models define the contract
- API versioning - `/api/v1/analyze` allows future breaking changes
```python
from pydantic import BaseModel


class AnalysisResponse(BaseModel):
    success: bool
    product_name: str
    overall_risk: str
    average_safety_score: int
    summary: str
    allergen_warnings: list[str]
    ingredients: list[IngredientDetail]  # IngredientDetail is defined alongside this model
```
Trade-off: A single API means both clients get the same data, even if one needs less. We accept slight over-fetching for consistency.
Q6: How would you add real-time updates for long-running requests?
Answer:
Options considered:
| Approach | Pros | Cons |
|---|---|---|
| Polling | Simple, works everywhere | Inefficient, delayed |
| WebSockets | Real-time, bidirectional | Complex, stateful |
| Server-Sent Events | Real-time, simple | One-way only |
| Webhooks | Decoupled | Requires client endpoint |
Recommendation for this API:
Server-Sent Events (SSE) for progress updates:
```python
import json

from fastapi.responses import StreamingResponse


@app.get("/analyze/stream")
async def analyze_stream(request: AnalysisRequest):
    async def event_generator():
        yield f"data: {json.dumps({'stage': 'research', 'progress': 0})}\n\n"
        # ... research phase
        yield f"data: {json.dumps({'stage': 'analysis', 'progress': 33})}\n\n"
        # ... analysis phase
        yield f"data: {json.dumps({'stage': 'validation', 'progress': 66})}\n\n"
        # ... validation phase
        yield f"data: {json.dumps({'stage': 'complete', 'result': result})}\n\n"

    return StreamingResponse(event_generator(), media_type="text/event-stream")
```
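A client can consume the stream with any SSE-capable HTTP library; here's a hedged Python sketch using httpx (the URL and query parameters are placeholders):

```python
import json

import httpx

# Endpoint URL and query parameters are placeholders; adjust to the real request model.
with httpx.stream(
    "GET",
    "http://localhost:8000/analyze/stream",
    params={"product_name": "Granola Bar"},
    timeout=None,
) as response:
    for line in response.iter_lines():
        if line.startswith("data: "):
            event = json.loads(line[len("data: "):])
            print(event["stage"], event.get("progress"))
```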
Q7: What's your testing strategy for LLM-based features?
Answer:
Testing pyramid for LLM apps:
- Unit tests - Mock LLM responses, test business logic
- Integration tests - Test agent orchestration with fixtures
- Contract tests - Verify LLM output schema compliance
- Evaluation tests - Test accuracy on labeled datasets
- Load tests - Verify performance under stress
Key insight: LLM outputs are non-deterministic. Solutions:
- Use `temperature=0.1` for more consistent outputs
- Test for schema compliance, not exact text matching
- Build evaluation datasets with expected categories
```python
def test_analysis_returns_valid_risk_level():
    result = analyze_ingredients(test_state)
    assert result["analysis_report"]["overall_risk"] in ["low", "medium", "high"]
```
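For the unit-test layer, here's a sketch of mocking the LLM call so only the orchestration logic is exercised; the patch target and fixture shape are assumptions about the project layout, and `analyze_ingredients`/`test_state` are reused from the test above:

```python
from unittest.mock import patch

# Hypothetical fixture: the shape the Analysis agent is expected to return.
FAKE_LLM_OUTPUT = {
    "overall_risk": "low",
    "average_safety_score": 92,
    "summary": "No high-risk ingredients detected.",
}


def test_analysis_logic_with_mocked_llm():
    # "app.agents.analysis.call_llm" is an assumed patch target for the LLM wrapper.
    with patch("app.agents.analysis.call_llm", return_value=FAKE_LLM_OUTPUT):
        result = analyze_ingredients(test_state)
    assert result["analysis_report"]["overall_risk"] == "low"
```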
Q8: How do you handle failures gracefully?
Answer:
Failure modes and handling:
| Failure | Detection | Recovery |
|---|---|---|
| LLM timeout | Request timeout (120s) | Retry with exponential backoff |
| Rate limit | 429 response | Switch to backup API key |
| Qdrant down | Connection error | Fall back to Google Search only |
| Invalid input | Pydantic validation | Return 422 with details |
| Critic rejection | Validation loop | Retry up to 3x, then escalate |
```python
# Critic agent retry logic
max_retries = 3

if result == ValidationResult.REJECTED:
    if retry_count < max_retries:
        return {"retry_count": retry_count + 1}  # Retry
    else:
        return {"result": ValidationResult.ESCALATED}  # Give up gracefully
```
Q9: What would you do differently if starting over?
Answer:
- Start with async from day one - Easier to add concurrency later
- Implement caching earlier - Would have saved development API costs
- Use structured outputs - Gemini's JSON mode for reliable parsing (sketched after this list)
- Add observability first - LangSmith integration should have been there from the start
- Design for horizontal scaling - Stateless API from the beginning
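For the structured-outputs point, a sketch of how that could look with the google-genai SDK's JSON-mode config (the model name and `RiskSummary` schema are illustrative, not the project's real models):

```python
from google import genai
from google.genai import types
from pydantic import BaseModel


class RiskSummary(BaseModel):  # illustrative schema, not the project's real response model
    overall_risk: str
    summary: str


client = genai.Client(api_key="YOUR_API_KEY")

response = client.models.generate_content(
    model="gemini-2.0-flash",  # placeholder model name
    contents="Assess the overall risk of these ingredients: water, sugar, red 40",
    config=types.GenerateContentConfig(
        response_mime_type="application/json",
        response_schema=RiskSummary,
    ),
)

report = RiskSummary.model_validate_json(response.text)  # parse JSON instead of scraping free text
```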
Q10: How do you balance cost vs performance?
Answer:
Cost breakdown per request:
- Gemini API: ~$0.01-0.05 (depending on tokens)
- Qdrant Cloud: Included in free tier
- Railway hosting: ~$5/month
- Google Search: Included in Gemini grounding
Optimization strategies:
- Cache common ingredients - 80% of requests hit top 100 ingredients
- Use smaller models for validation - Critic doesn't need full model
- Batch embeddings - Reduce API calls for multiple ingredients
- Set appropriate TTLs - Balance freshness vs cost
Trade-off: Aggressive caching reduces costs but may serve stale safety data. Our mitigation: 24-hour TTL with manual invalidation for critical updates.
Summary
Scaling LLM applications requires balancing:
- Latency vs Accuracy - More agents = better results, slower response
- Cost vs Freshness - Caching saves money, risks stale data
- Simplicity vs Resilience - More fallbacks = more complexity
- Speed vs Safety - Fast responses vs thorough validation
The key is making intentional trade-offs based on your specific requirements, then documenting the reasoning for future reference.
This post is part of the interview preparation series for the AI Ingredient Safety Analyzer project.
