
Scaling AI Applications: Architecture Decisions & Trade-offs

· 6 min read
Uday Tamma
Building AI-Powered Applications

Building the AI Ingredient Safety Analyzer taught me valuable lessons about scaling LLM-powered applications. This post covers the key architectural decisions and trade-offs, framed as interview-ready explanations for scaling the system in production.

The Challenge

Our Ingredient Analysis API processes requests that require:

  • Multiple LLM calls (Research → Analysis → Critic validation)
  • Vector database queries (Qdrant)
  • Real-time web search (Google Search grounding)

Current Performance:

  • Average response time: ~47 seconds per request
  • Throughput: ~1 request/second
  • Target: Handle a roughly 10x increase in traffic (see Q1) without degradation

Key Scaling Questions & Answers

Q1: How would you scale this API to handle 10x more traffic?

Answer:

I'd implement a three-pronged approach:

  1. Response Caching (Redis/Memcached)

    • Cache ingredient research data (24-72 hour TTL)
    • Cache full analysis reports by ingredient+profile hash (1-6 hour TTL)
    • Expected improvement: 5x throughput for cached requests
  2. API Key Load Balancing

    • Pool multiple Gemini API keys
    • Implement rate-aware key selection
    • Each key has its own rate limit; N keys ≈ N× capacity
  3. Async Processing with Queue

    • Move to job queue (Celery/Redis Queue)
    • Return job ID immediately, poll for results
    • Prevents timeout issues on slow requests

Trade-off: Caching introduces a stale-data risk. Mitigation: invalidate cached entries whenever safety data is updated and use appropriate TTLs (a minimal sketch of the caching layer follows).
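To make the caching idea in item 1 concrete, here is a minimal sketch of a Redis-backed cache keyed by an ingredient+profile hash. The key scheme, TTL, connection details, and the run_analysis call are illustrative assumptions, not the production implementation:

import hashlib
import json

import redis  # assumes a reachable Redis instance; host/port below are placeholders

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)

def cache_key(ingredients: list[str], profile: dict) -> str:
    # Deterministic key from the sorted ingredient list and the user profile
    payload = json.dumps({"ingredients": sorted(ingredients), "profile": profile}, sort_keys=True)
    return "analysis:" + hashlib.sha256(payload.encode()).hexdigest()

def get_or_analyze(ingredients: list[str], profile: dict, ttl_seconds: int = 6 * 3600):
    key = cache_key(ingredients, profile)
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)  # Cache hit: skip the LLM pipeline entirely
    result = run_analysis(ingredients, profile)  # Hypothetical call into the agent pipeline
    cache.set(key, json.dumps(result), ex=ttl_seconds)
    return result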


Q2: Why did you choose Qdrant over other vector databases?

Answer:

| Factor | Qdrant | Pinecone | Weaviate | ChromaDB |
|---|---|---|---|---|
| Self-hosted option | Yes | No | Yes | Yes |
| Cloud managed | Yes | Yes | Yes | No |
| Filtering capability | Excellent | Good | Good | Basic |
| Python SDK | Native | Native | Native | Native |
| Cost | Free tier + pay-as-you-go | Expensive | Moderate | Free |

Decision rationale:

  • Qdrant Cloud offers generous free tier (1GB)
  • Excellent hybrid search (vector + payload filtering)
  • Can self-host later for cost optimization
  • Simple REST API for debugging

Trade-off: Qdrant is less mature than Pinecone. Mitigation: Qdrant's active development and good documentation offset this.
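To illustrate the hybrid-search point above, here is a rough sketch of a filtered vector query with the qdrant-client SDK. The collection name, payload field, and connection details are placeholders, not the project's actual schema:

from qdrant_client import QdrantClient, models

qdrant = QdrantClient(url="https://YOUR-CLUSTER.qdrant.io", api_key="...")  # placeholders

# Nearest neighbours by vector, restricted to points whose payload matches the filter
hits = qdrant.search(
    collection_name="ingredients",   # hypothetical collection name
    query_vector=query_embedding,    # embedding computed elsewhere
    query_filter=models.Filter(
        must=[
            models.FieldCondition(key="category", match=models.MatchValue(value="preservative"))
        ]
    ),
    limit=5,
)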


Q3: How do you handle API rate limits from Gemini?

Answer:

Current approach:

# Single key - limited capacity
client = genai.Client(api_key=settings.google_api_key)

Scaled approach:

import time

class RateLimitedKeyPool:
    def __init__(self, api_keys: list[str], rpm_limit: int = 15):
        self.keys = api_keys
        self.rpm_limit = rpm_limit
        self.request_times = {key: [] for key in api_keys}

    def get_available_key(self) -> str | None:
        now = time.time()
        for key in self.keys:
            # Drop timestamps older than one minute (sliding-window rate check)
            self.request_times[key] = [
                t for t in self.request_times[key]
                if t > now - 60
            ]
            if len(self.request_times[key]) < self.rpm_limit:
                self.request_times[key].append(now)
                return key
        return None  # All keys exhausted

Trade-off: Multiple keys increase cost and complexity. Consider: Is the traffic worth the operational overhead?
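For completeness, a caller might wrap the pool like this; the plural google_api_keys setting, the wait strategy, and the model name are assumptions for the sketch:

import time

pool = RateLimitedKeyPool(api_keys=settings.google_api_keys, rpm_limit=15)

def generate_with_pool(prompt: str):
    key = pool.get_available_key()
    while key is None:   # every key is at its per-minute limit
        time.sleep(1)    # simple wait; a real system might enqueue the job instead
        key = pool.get_available_key()
    client = genai.Client(api_key=key)
    return client.models.generate_content(model="gemini-2.0-flash", contents=prompt)  # model name is illustrative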


Q4: Why use a multi-agent architecture instead of a single LLM call?

Answer:

Single-call approach:

  • Pros: Faster (one LLM round-trip), simpler
  • Cons: Less accurate, no self-correction, monolithic prompt

Multi-agent approach (Research → Analysis → Critic):

  • Pros:
    • Separation of concerns (research vs analysis vs validation)
    • Self-correction loop (Critic can reject and retry)
    • Better accuracy through validation gates
    • Easier to debug and improve individual agents
  • Cons: 3x LLM calls, higher latency, more complex

Decision rationale: For safety-critical information, accuracy trumps speed. The Critic agent catches ~15% of issues that would otherwise reach users.
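As a rough sketch of that control flow (the agent functions and the retry cap are illustrative; the real pipeline may be wired together with an orchestration framework rather than a plain loop):

MAX_RETRIES = 3

def run_pipeline(ingredients: list[str], profile: dict) -> dict:
    research = research_agent(ingredients)          # gather grounded sources
    report = {}
    for attempt in range(MAX_RETRIES + 1):
        report = analysis_agent(research, profile)  # draft the safety report
        verdict = critic_agent(report, research)    # validation gate
        if verdict.approved:
            return report                           # Critic accepted the report
    return {**report, "escalated": True}            # never approved: flag for human review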


Q5: How do you ensure consistency between mobile and web clients?

Answer:

Architecture decisions:

  1. Single REST API - Both clients call the same /analyze endpoint
  2. Shared response schema - Pydantic models define the contract
  3. API versioning - /api/v1/analyze allows future breaking changes

from pydantic import BaseModel

class AnalysisResponse(BaseModel):
    success: bool
    product_name: str
    overall_risk: str
    average_safety_score: int
    summary: str
    allergen_warnings: list[str]
    ingredients: list[IngredientDetail]

Trade-off: A single API means both clients get the same data, even if one needs less. We accept slight over-fetching for consistency.
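A minimal sketch of how the versioned route and the shared schema fit together (the HTTP method and the run_analysis helper are assumptions):

from fastapi import FastAPI

app = FastAPI()

@app.post("/api/v1/analyze", response_model=AnalysisResponse)
async def analyze(request: AnalysisRequest) -> AnalysisResponse:
    # Mobile and web clients both call this endpoint; FastAPI validates
    # the response against the shared Pydantic contract above.
    return await run_analysis(request)  # hypothetical orchestration call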


Q6: How would you add real-time updates for long-running requests?

Answer:

Options considered:

| Approach | Pros | Cons |
|---|---|---|
| Polling | Simple, works everywhere | Inefficient, delayed |
| WebSockets | Real-time, bidirectional | Complex, stateful |
| Server-Sent Events | Real-time, simple | One-way only |
| Webhooks | Decoupled | Requires client endpoint |

Recommendation for this API:

Server-Sent Events (SSE) for progress updates:

import json

from fastapi.responses import StreamingResponse

@app.get("/analyze/stream")
async def analyze_stream(request: AnalysisRequest):
    async def event_generator():
        yield f"data: {json.dumps({'stage': 'research', 'progress': 0})}\n\n"
        # ... research phase
        yield f"data: {json.dumps({'stage': 'analysis', 'progress': 33})}\n\n"
        # ... analysis phase
        yield f"data: {json.dumps({'stage': 'validation', 'progress': 66})}\n\n"
        # ... validation phase
        yield f"data: {json.dumps({'stage': 'complete', 'result': result})}\n\n"

    return StreamingResponse(event_generator(), media_type="text/event-stream")
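On the consuming side, the web client would typically use the browser's EventSource API; for a script or test, a streaming HTTP client works too. A rough sketch with httpx (request payload omitted for brevity):

import httpx

with httpx.stream("GET", "http://localhost:8000/analyze/stream", timeout=None) as response:
    for line in response.iter_lines():
        if line.startswith("data: "):
            print(line[len("data: "):])  # each event carries a JSON stage/progress payload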

Q7: What's your testing strategy for LLM-based features?

Answer:

Testing pyramid for LLM apps:

  1. Unit tests - Mock LLM responses, test business logic
  2. Integration tests - Test agent orchestration with fixtures
  3. Contract tests - Verify LLM output schema compliance
  4. Evaluation tests - Test accuracy on labeled datasets
  5. Load tests - Verify performance under stress

Key insight: LLM outputs are non-deterministic. Solutions:

  • Use temperature=0.1 for more consistent outputs
  • Test for schema compliance, not exact text matching
  • Build evaluation datasets with expected categories

def test_analysis_returns_valid_risk_level():
    # Assert on the category, not the exact wording of the LLM output
    result = analyze_ingredients(test_state)
    assert result["analysis_report"]["overall_risk"] in ["low", "medium", "high"]
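At the bottom of the pyramid, a unit test can patch the LLM-backed call so the surrounding logic is tested deterministically; the patch target and the shape of the fake report here are hypothetical:

from unittest.mock import patch

def test_allergen_warning_surfaces_for_profile_match():
    fake_report = {"overall_risk": "medium", "allergen_warnings": ["peanut"]}
    # Patch the (hypothetical) LLM-backed helper so the test never calls the API
    with patch("app.agents.analysis.generate_report", return_value=fake_report):
        result = analyze_ingredients(test_state)
    assert "peanut" in result["analysis_report"]["allergen_warnings"]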

Q8: How do you handle failures gracefully?

Answer:

Failure modes and handling:

| Failure | Detection | Recovery |
|---|---|---|
| LLM timeout | Request timeout (120s) | Retry with exponential backoff |
| Rate limit | 429 response | Switch to backup API key |
| Qdrant down | Connection error | Fall back to Google Search only |
| Invalid input | Pydantic validation | Return 422 with details |
| Critic rejection | Validation loop | Retry up to 3x, then escalate |

# Critic agent retry logic
max_retries = 3

if result == ValidationResult.REJECTED:
    if retry_count < max_retries:
        return {"retry_count": retry_count + 1}  # Retry the analysis
    else:
        return {"result": ValidationResult.ESCALATED}  # Give up gracefully
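For the first row of the table, a small backoff helper is enough; the exception type below is a stand-in for whatever the SDK actually raises on timeout:

import random
import time

def call_with_backoff(fn, max_attempts: int = 4, base_delay: float = 2.0):
    # Retry a flaky LLM call with exponential backoff plus jitter
    for attempt in range(max_attempts):
        try:
            return fn()
        except TimeoutError:  # stand-in for the SDK's timeout exception
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) + random.random())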

Q9: What would you do differently if starting over?

Answer:

  1. Start with async from day one - Easier to add concurrency later
  2. Implement caching earlier - Would have saved development API costs
  3. Use structured outputs - Gemini's JSON mode for reliable parsing (see the sketch after this list)
  4. Add observability first - LangSmith integration should be from start
  5. Design for horizontal scaling - Stateless API from the beginning
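For item 3, a hedged sketch of what structured output could look like with the google-genai SDK (config fields per my reading of the SDK; the model name is illustrative, so verify against the installed version):

from google import genai
from google.genai import types

client = genai.Client(api_key=settings.google_api_key)

response = client.models.generate_content(
    model="gemini-2.0-flash",  # illustrative
    contents=prompt,
    config=types.GenerateContentConfig(
        response_mime_type="application/json",
        response_schema=AnalysisResponse,  # reuse the shared Pydantic contract
    ),
)
report = AnalysisResponse.model_validate_json(response.text)  # parse instead of scraping free text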

Q10: How do you balance cost vs performance?

Answer:

Cost breakdown per request:

  • Gemini API: ~$0.01-0.05 (depending on tokens)
  • Qdrant Cloud: Included in free tier
  • Railway hosting: ~$5/month
  • Google Search: Included in Gemini grounding

Optimization strategies:

  1. Cache common ingredients - 80% of requests hit top 100 ingredients
  2. Use smaller models for validation - the Critic doesn't need the full model (see the routing sketch below)
  3. Batch embeddings - Reduce API calls for multiple ingredients
  4. Set appropriate TTLs - Balance freshness vs cost

Trade-off: Aggressive caching reduces costs but may serve stale safety data. Our mitigation: 24-hour TTL with manual invalidation for critical updates.
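To make strategy 2 concrete, model selection can be a simple per-agent lookup; the model names below are placeholders for whatever cheap and strong tiers are current:

# Route each agent to the cheapest model that is good enough for its job
AGENT_MODELS = {
    "research": "gemini-2.0-flash",     # grounding-heavy but formulaic
    "analysis": "gemini-1.5-pro",       # highest-stakes reasoning step
    "critic": "gemini-2.0-flash-lite",  # schema/consistency checks are cheap
}

def model_for(agent: str) -> str:
    return AGENT_MODELS[agent]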


Summary

Scaling LLM applications requires balancing:

  • Latency vs Accuracy - More agents = better results, slower response
  • Cost vs Freshness - Caching saves money, risks stale data
  • Simplicity vs Resilience - More fallbacks = more complexity
  • Speed vs Safety - Fast responses vs thorough validation

The key is making intentional trade-offs based on your specific requirements, then documenting the reasoning for future reference.


This post is part of the interview preparation series for the AI Ingredient Safety Analyzer project.