Vector Database

The AI Ingredient Scanner uses Qdrant Cloud for semantic ingredient search, enabling fast and accurate lookups even with variations in ingredient naming.

Why Vector Search?

Traditional keyword search fails with ingredient names because:

Spelling variations: "Glycerine" vs "Glycerin" vs "Glycerol"
Abbreviations: "Sodium Lauryl Sulfate" vs "SLS"
Common vs. scientific names: "Vitamin E" vs "Tocopherol"

Vector search matches by meaning, not exact text.
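
To make "matching by meaning" concrete, the sketch below ranks stored names by cosine similarity, the metric the collection is configured with. The three-dimensional vectors are made-up stand-ins for real embeddings, and `TOY_VECTORS` and `nearest` are hypothetical names used only for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings; real ones have hundreds of dimensions.
TOY_VECTORS = {
    "glycerin": [0.90, 0.10, 0.05],
    "tocopherol": [0.10, 0.95, 0.10],
}


def nearest(query_vector: list[float]) -> tuple[str, float]:
    """Return the stored name whose vector is most similar to the query."""
    scored = (
        (name, cosine_similarity(query_vector, vector))
        for name, vector in TOY_VECTORS.items()
    )
    return max(scored, key=lambda pair: pair[1])


# A query vector close to "glycerin", as an embedding of "glycerine" might be.
name, score = nearest([0.88, 0.12, 0.07])
print(name, round(score, 3))
```

Two spellings of the same ingredient need not share any characters to match; they only need embeddings that point in nearly the same direction.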

Architecture

Configuration

Collection Setup

COLLECTION_NAME = "ingredients"
VECTOR_SIZE = 768  # requested via output_dimensionality; gemini-embedding-001 defaults to 3072
EMBEDDING_MODEL = "gemini-embedding-001"
CONFIDENCE_THRESHOLD = 0.7

Vector Parameters

from qdrant_client.models import Distance, VectorParams

client.create_collection(
    collection_name=COLLECTION_NAME,
    vectors_config=VectorParams(
        size=VECTOR_SIZE,
        distance=Distance.COSINE,  # Cosine similarity
    ),
)

Data Schema

Payload Structure

Each vector point stores ingredient metadata:

{
  "name": "Glycerin",
  "purpose": "Humectant, moisturizer",
  "safety_rating": 9,
  "concerns": "No known concerns",
  "recommendation": "SAFE",
  "allergy_risk_flag": "low",
  "allergy_potential": "Rare allergic reactions",
  "origin": "Natural",
  "category": "Both",
  "regulatory_status": "FDA approved, EU compliant",
  "regulatory_bans": "No",
  "aliases": ["Glycerine", "Glycerol", "E422"]
}

TypeScript Interface

interface IngredientData {
  name: string;
  purpose: string;
  safety_rating: number;      // 1-10
  concerns: string;
  recommendation: string;     // SAFE | CAUTION | AVOID
  allergy_risk_flag: string;  // high | low
  allergy_potential: string;
  origin: string;             // Natural | Synthetic | Semi-synthetic
  category: string;           // Food | Cosmetics | Both
  regulatory_status: string;
  regulatory_bans: string;    // Yes | No
  source: string;             // qdrant | google_search
  confidence: number;         // 0.0 - 1.0
}
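
The Python snippets under Operations index `IngredientData` like a dictionary, so one plausible Python counterpart to this interface is a `TypedDict`. The sketch below mirrors the fields above and adds a hypothetical `validate_ingredient` helper that checks the value ranges noted in the comments; neither the helper nor the `ALLOWED` table comes from the project itself.

```python
from typing import TypedDict


class IngredientData(TypedDict):
    name: str
    purpose: str
    safety_rating: float      # 1-10
    concerns: str
    recommendation: str       # SAFE | CAUTION | AVOID
    allergy_risk_flag: str    # high | low
    allergy_potential: str
    origin: str               # Natural | Synthetic | Semi-synthetic
    category: str             # Food | Cosmetics | Both
    regulatory_status: str
    regulatory_bans: str      # Yes | No
    source: str               # qdrant | google_search
    confidence: float         # 0.0 - 1.0


# Allowed values taken from the comments in the interface above.
ALLOWED = {
    "recommendation": {"SAFE", "CAUTION", "AVOID"},
    "allergy_risk_flag": {"high", "low"},
    "origin": {"Natural", "Synthetic", "Semi-synthetic"},
    "category": {"Food", "Cosmetics", "Both"},
    "regulatory_bans": {"Yes", "No"},
    "source": {"qdrant", "google_search"},
}


def validate_ingredient(data: dict) -> list[str]:
    """Return a list of problems; an empty list means the record looks valid."""
    problems = []
    for field, allowed in ALLOWED.items():
        if data.get(field) not in allowed:
            problems.append(f"{field}: expected one of {sorted(allowed)}")
    rating = data.get("safety_rating")
    if not isinstance(rating, (int, float)) or not 1 <= rating <= 10:
        problems.append("safety_rating: expected a number from 1 to 10")
    confidence = data.get("confidence")
    if not isinstance(confidence, (int, float)) or not 0.0 <= confidence <= 1.0:
        problems.append("confidence: expected a number from 0.0 to 1.0")
    return problems
```

Running a check like this before upserting search results would keep malformed records out of the collection.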

Operations

Lookup Ingredient

def lookup_ingredient(ingredient_name: str) -> IngredientData | None:
    """Look up ingredient in Qdrant vector database."""

    # Generate embedding
    embedding = get_embedding(ingredient_name.lower().strip())

    # Query Qdrant
    results = client.query_points(
        collection_name=COLLECTION_NAME,
        query=embedding,
        limit=1,
    )

    if not results.points:
        return None

    top_result = results.points[0]
    confidence = top_result.score

    if confidence < CONFIDENCE_THRESHOLD:
        return None  # Will trigger Google Search

    return _parse_payload(top_result.payload, confidence)

Upsert Ingredient

When Google Search finds new ingredient data, it's saved to Qdrant:

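import uuid

from qdrant_client.models import PointStruct

def upsert_ingredient(ingredient_data: IngredientData) -> bool:
    """Add or update an ingredient in the database."""

    name = ingredient_data["name"]
    # Normalize exactly as lookup_ingredient does so both paths embed the same text
    normalized = name.lower().strip()

    # Create embedding
    embedding = get_embedding(normalized)

    # Create point. Python's built-in hash() is salted per process for strings,
    # so it cannot give a stable ID; uuid5 is deterministic across runs, which
    # lets a repeated upsert overwrite the existing point instead of duplicating it.
    point = PointStruct(
        id=str(uuid.uuid5(uuid.NAMESPACE_DNS, normalized)),
        vector=embedding,
        payload={
            "name": name,
            "purpose": ingredient_data["purpose"],
            "safety_rating": ingredient_data["safety_rating"],
            # ... other fields
        },
    )

    client.upsert(
        collection_name=COLLECTION_NAME,
        points=[point],
    )

    return True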

Generate Embedding

from google import genai
from google.genai import types

def get_embedding(text: str) -> list[float]:
    """Get embedding vector using Google AI Studio."""

    client = genai.Client(api_key=settings.google_api_key)

    result = client.models.embed_content(
        model=EMBEDDING_MODEL,
        contents=text,
        config=types.EmbedContentConfig(
            task_type="RETRIEVAL_QUERY",
            output_dimensionality=VECTOR_SIZE,
        ),
    )

    return result.embeddings[0].values

Self-Learning Pipeline

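The system automatically improves over time: when a lookup misses in Qdrant (no result, or a score below CONFIDENCE_THRESHOLD), the ingredient is resolved through Google Search and the result is upserted, so the next lookup for that ingredient is a Qdrant hit.

Benefits:

First lookup of an unknown ingredient uses Google Search (~2-3 seconds)
Future lookups are served from Qdrant (~50-100ms per query, ~200ms including embedding generation)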
Knowledge base grows automatically
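
The lookup-then-fallback flow can be sketched as follows. This is a minimal illustration, not the project's implementation: `resolve_ingredient` is a hypothetical name, and the `fake_*` functions are in-memory stand-ins for the Qdrant lookup, Google Search, and upsert steps.

```python
from typing import Callable, Optional

Record = dict  # stand-in for IngredientData


def resolve_ingredient(
    name: str,
    lookup: Callable[[str], Optional[Record]],
    search: Callable[[str], Optional[Record]],
    save: Callable[[Record], None],
) -> Optional[Record]:
    """Try the vector store first; on a miss, search and cache the result."""
    hit = lookup(name)
    if hit is not None:
        return hit
    found = search(name)
    if found is not None:
        save(found)  # later lookups for this name become store hits
    return found


# In-memory stand-ins for Qdrant and Google Search, for illustration only.
store: dict[str, Record] = {}
search_calls: list[str] = []


def fake_lookup(name: str) -> Optional[Record]:
    return store.get(name.lower().strip())


def fake_search(name: str) -> Optional[Record]:
    search_calls.append(name)
    return {"name": name, "source": "google_search"}


def fake_save(record: Record) -> None:
    store[record["name"].lower().strip()] = {**record, "source": "qdrant"}


first = resolve_ingredient("Tocopherol", fake_lookup, fake_search, fake_save)
second = resolve_ingredient("tocopherol", fake_lookup, fake_search, fake_save)
print(first["source"], second["source"], len(search_calls))
```

Only the first call reaches the search stand-in; the second is answered from the store, which is the behavior the latency figures below rely on.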

Performance

Operation	Typical Latency
Embedding generation	100-200ms
Qdrant query	50-100ms
Google Search	2-3 seconds
Total (cached)	~200ms
Total (uncached)	~3 seconds
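
How much the cache helps depends on how often lookups hit Qdrant. As a rough back-of-the-envelope model using the two totals above, the expected latency is a weighted average of the cached and uncached paths. The hit rates in the loop are assumptions for illustration, not measurements.

```python
CACHED_SECONDS = 0.2    # "Total (cached)" from the table
UNCACHED_SECONDS = 3.0  # "Total (uncached)" from the table


def expected_latency(hit_rate: float) -> float:
    """Average seconds per lookup when `hit_rate` of lookups are Qdrant hits."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * CACHED_SECONDS + (1.0 - hit_rate) * UNCACHED_SECONDS


# Assumed hit rates, chosen only to show how the average moves.
for rate in (0.5, 0.9, 0.99):
    print(f"hit rate {rate:.0%}: ~{expected_latency(rate):.2f}s")
```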

Qdrant Cloud Setup

1. Create Cluster

Go to Qdrant Cloud Console
Create a new cluster (free tier available)
Note your cluster URL and API key

2. Configure Environment

QDRANT_URL=https://your-cluster.qdrant.io
QDRANT_API_KEY=your_api_key_here
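
The snippets on this page read these values through a `settings` object whose definition is not shown. One way such an object could be built from the environment is sketched below; the `Settings` class and `load_settings` function are assumptions for illustration, and only the two variables above are covered.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    qdrant_url: str
    qdrant_api_key: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from environment variables, failing early if any is missing."""
    env = os.environ if env is None else env
    required = ("QDRANT_URL", "QDRANT_API_KEY")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return Settings(
        qdrant_url=env["QDRANT_URL"],
        qdrant_api_key=env["QDRANT_API_KEY"],
    )


# Passing a mapping explicitly keeps the example independent of the real environment.
settings = load_settings(
    {"QDRANT_URL": "https://your-cluster.qdrant.io", "QDRANT_API_KEY": "your_api_key_here"}
)
print(settings.qdrant_url)
```

Failing at startup when a variable is missing is easier to diagnose than a connection error on the first query.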

3. Verify Connection

from qdrant_client import QdrantClient

client = QdrantClient(
    url=settings.qdrant_url,
    api_key=settings.qdrant_api_key,
)

# Check collections
collections = client.get_collections()
print(collections)

Troubleshooting

"Collection not found"

The collection is auto-created on first use:

def ensure_collection_exists(client: QdrantClient) -> None:
    collections = client.get_collections()
    exists = any(c.name == COLLECTION_NAME for c in collections.collections)

    if not exists:
        client.create_collection(
            collection_name=COLLECTION_NAME,
            vectors_config=VectorParams(
                size=VECTOR_SIZE,
                distance=Distance.COSINE,
            ),
        )

Low Match Confidence

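If ingredients aren't matching well:

Check that ingredient names are normalized the same way at lookup and upsert time (case, surrounding whitespace)
Verify that the same embedding model and output dimensionality are used for stored vectors and for queries
Consider lowering CONFIDENCE_THRESHOLD (0.7 by default), keeping in mind that a lower value also admits more false matches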
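
For the first check, it helps to route every name through a single normalization function before embedding. The sketch below goes somewhat further than the `.lower().strip()` used in the snippets above; `normalize_ingredient_name` and its specific rules are assumptions, not the project's actual behavior.

```python
import re
import unicodedata


def normalize_ingredient_name(raw: str) -> str:
    """Canonicalize a label string before it is embedded."""
    text = unicodedata.normalize("NFKC", raw)  # fold full-width and other compatibility forms
    text = text.lower().strip()
    text = re.sub(r"\s+", " ", text)           # collapse runs of whitespace
    return text.strip(" .,;:")                 # drop punctuation left over from label lists


print(normalize_ingredient_name("  Sodium   Lauryl\tSulfate. "))
```

Whatever rules are chosen, applying the same function on both the lookup and the upsert path matters more than the rules themselves.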

Connection Timeout

client = QdrantClient(
    url=settings.qdrant_url,
    api_key=settings.qdrant_api_key,
    timeout=30,  # Increase timeout
)
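
If timeouts are intermittent, retrying with exponential backoff is a common complement to a longer timeout. The helper below is a generic sketch that is not tied to the Qdrant client: `with_retries` is a hypothetical name, and `retry_on` should be set to whichever exception types your client actually raises on timeouts.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    retry_on: tuple[type[BaseException], ...] = (TimeoutError, ConnectionError),
) -> T:
    """Call `operation`, retrying on the given exceptions with doubling delays."""
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

A call might look like `with_retries(lambda: client.get_collections())`, using the client constructed above.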
