Vector Database

The AI Ingredient Scanner uses Qdrant Cloud for semantic ingredient search, enabling fast and accurate lookups even with variations in ingredient naming.

Traditional keyword search fails with ingredient names because:

  • Spelling variations and synonyms: "Glycerine" vs "Glycerin" vs "Glycerol"
  • Abbreviations of chemical names: "Sodium Lauryl Sulfate" vs "SLS"
  • Aliases: "Vitamin E" vs "Tocopherol"

Vector search matches by meaning, not exact text.
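
To make "matching by meaning" concrete, here is a minimal sketch of cosine similarity, the distance metric the collection on this page is configured with. The three-dimensional vectors are made-up toy values, not real embeddings; real vectors come from the embedding model.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors (illustrative values only, not real embeddings).
glycerin = [0.9, 0.1, 0.2]
glycerol = [0.85, 0.15, 0.25]
sodium_chloride = [0.1, 0.9, 0.3]

print(cosine_similarity(glycerin, glycerol))         # close to 1: similar direction
print(cosine_similarity(glycerin, sodium_chloride))  # much lower: different direction
```

Names that an embedding model places close together score near 1, which is why a spelling variant can still clear the confidence threshold.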

Architecture

Configuration

Collection Setup

COLLECTION_NAME = "ingredients"
VECTOR_SIZE = 768  # dimensions requested from gemini-embedding-001 via output_dimensionality
EMBEDDING_MODEL = "gemini-embedding-001"
CONFIDENCE_THRESHOLD = 0.7

Vector Parameters

from qdrant_client.models import Distance, VectorParams

client.create_collection(
    collection_name=COLLECTION_NAME,
    vectors_config=VectorParams(
        size=VECTOR_SIZE,
        distance=Distance.COSINE,  # Cosine similarity
    ),
)

Data Schema

Payload Structure

Each vector point stores ingredient metadata:

{
  "name": "Glycerin",
  "purpose": "Humectant, moisturizer",
  "safety_rating": 9,
  "concerns": "No known concerns",
  "recommendation": "SAFE",
  "allergy_risk_flag": "low",
  "allergy_potential": "Rare allergic reactions",
  "origin": "Natural",
  "category": "Both",
  "regulatory_status": "FDA approved, EU compliant",
  "regulatory_bans": "No",
  "aliases": ["Glycerine", "Glycerol", "E422"]
}

TypeScript Interface

interface IngredientData {
  name: string;
  purpose: string;
  safety_rating: number; // 1-10
  concerns: string;
  recommendation: string; // SAFE | CAUTION | AVOID
  allergy_risk_flag: string; // high | low
  allergy_potential: string;
  origin: string; // Natural | Synthetic | Semi-synthetic
  category: string; // Food | Cosmetics | Both
  regulatory_status: string;
  regulatory_bans: string; // Yes | No
  source: string; // qdrant | google_search
  confidence: number; // 0.0 - 1.0
}
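
The Python snippets on this page index `IngredientData` like a dictionary, so a Python counterpart of the interface could be a `TypedDict`. The following is a sketch under that assumption, not the project's actual definition; the `is_valid_rating` helper is hypothetical.

```python
from typing import Literal, TypedDict


class IngredientData(TypedDict):
    """Assumed Python mirror of the TypeScript interface."""

    name: str
    purpose: str
    safety_rating: int  # 1-10
    concerns: str
    recommendation: Literal["SAFE", "CAUTION", "AVOID"]
    allergy_risk_flag: Literal["high", "low"]
    allergy_potential: str
    origin: Literal["Natural", "Synthetic", "Semi-synthetic"]
    category: Literal["Food", "Cosmetics", "Both"]
    regulatory_status: str
    regulatory_bans: Literal["Yes", "No"]
    source: Literal["qdrant", "google_search"]
    confidence: float  # 0.0 - 1.0


def is_valid_rating(data: IngredientData) -> bool:
    """Hypothetical check: rating and confidence fall in the documented ranges."""
    return 1 <= data["safety_rating"] <= 10 and 0.0 <= data["confidence"] <= 1.0
```

A `TypedDict` is an ordinary `dict` at runtime, so the literal types are only enforced by a static type checker; range checks such as `is_valid_rating` still have to run explicitly.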

Operations

Lookup Ingredient

def lookup_ingredient(ingredient_name: str) -> IngredientData | None:
    """Look up ingredient in Qdrant vector database."""

    # Generate embedding
    embedding = get_embedding(ingredient_name.lower().strip())

    # Query Qdrant
    results = client.query_points(
        collection_name=COLLECTION_NAME,
        query=embedding,
        limit=1,
    )

    if not results.points:
        return None

    top_result = results.points[0]
    confidence = top_result.score

    if confidence < CONFIDENCE_THRESHOLD:
        return None  # Will trigger Google Search

    return _parse_payload(top_result.payload, confidence)

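Upsert Ingredient

When Google Search finds new ingredient data, it's saved to Qdrant:

import uuid

from qdrant_client.models import PointStruct


def upsert_ingredient(ingredient_data: IngredientData) -> bool:
    """Add or update an ingredient in the database."""

    name = ingredient_data["name"]
    # Normalize exactly as lookup_ingredient does so both embed the same text
    key = name.lower().strip()

    # Create embedding
    embedding = get_embedding(key)

    # Create point. The ID is a deterministic UUID derived from the name:
    # built-in hash() is salted per process for strings, so it would give a
    # different ID on each run and create duplicates instead of updating.
    point = PointStruct(
        id=str(uuid.uuid5(uuid.NAMESPACE_DNS, key)),
        vector=embedding,
        payload={
            "name": name,
            "purpose": ingredient_data["purpose"],
            "safety_rating": ingredient_data["safety_rating"],
            # ... other fields
        },
    )

    client.upsert(
        collection_name=COLLECTION_NAME,
        points=[point],
    )

    return True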

Generate Embedding

from google import genai
from google.genai import types


def get_embedding(text: str) -> list[float]:
    """Get embedding vector using Google AI Studio."""

    # Named genai_client to avoid confusion with the Qdrant `client` used elsewhere
    genai_client = genai.Client(api_key=settings.google_api_key)

    result = genai_client.models.embed_content(
        model=EMBEDDING_MODEL,
        contents=text,
        config=types.EmbedContentConfig(
            task_type="RETRIEVAL_QUERY",
            output_dimensionality=VECTOR_SIZE,  # must match the collection's vector size
        ),
    )

    return result.embeddings[0].values

Self-Learning Pipeline

The system automatically improves over time:

Benefits:

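  1. The first lookup of an unknown ingredient falls back to Google Search (~2-3 seconds)
  2. Later lookups of the same ingredient are served from Qdrant (~50-100ms for the query, ~200ms including embedding generation)
  3. The knowledge base grows automatically as search results are saved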
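
The lookup-then-learn flow described above can be sketched as a single function. The three callables are injected so the sketch runs without network access; in the real system they would presumably be `lookup_ingredient`, a Google Search helper, and `upsert_ingredient`. The name `resolve_ingredient` and the fake helpers are hypothetical.

```python
from typing import Callable, Optional

Ingredient = dict  # stand-in for IngredientData in this sketch


def resolve_ingredient(
    name: str,
    lookup: Callable[[str], Optional[Ingredient]],
    search: Callable[[str], Optional[Ingredient]],
    upsert: Callable[[Ingredient], bool],
) -> Optional[Ingredient]:
    """Try the vector store first; fall back to search and cache the result."""
    cached = lookup(name)
    if cached is not None:
        return cached  # fast path: hit above the confidence threshold

    found = search(name)  # slow path: external search
    if found is not None:
        upsert(found)  # the next lookup for this name can take the fast path
    return found


# Usage with in-memory fakes standing in for Qdrant and Google Search.
store: dict[str, Ingredient] = {}


def fake_lookup(name: str) -> Optional[Ingredient]:
    return store.get(name.lower())


def fake_search(name: str) -> Ingredient:
    return {"name": name, "source": "google_search"}


def fake_upsert(item: Ingredient) -> bool:
    store[item["name"].lower()] = item
    return True


first = resolve_ingredient("Tocopherol", fake_lookup, fake_search, fake_upsert)
second = resolve_ingredient("tocopherol", fake_lookup, fake_search, fake_upsert)
```

The first call misses the store and goes through search; the second is answered from the store without searching again.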

Performance

Operation | Typical Latency
Embedding generation | 100-200ms
Qdrant query | 50-100ms
Google Search | 2-3 seconds
Total (cached) | ~200ms
Total (uncached) | ~3 seconds

Qdrant Cloud Setup

1. Create Cluster

  1. Go to Qdrant Cloud Console
  2. Create a new cluster (free tier available)
  3. Note your cluster URL and API key

2. Configure Environment

QDRANT_URL=https://your-cluster.qdrant.io
QDRANT_API_KEY=your_api_key_here
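
The snippets on this page read these values through a `settings` object whose definition is not shown. One way such an object might be built, as a sketch only, is a frozen dataclass populated from the environment. The `Settings` class and `load_settings` function are assumptions, and only the two Qdrant variables listed above are covered.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings container for the Qdrant connection."""

    qdrant_url: str
    qdrant_api_key: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read required variables and fail early with a clear message if one is missing."""
    missing = [key for key in ("QDRANT_URL", "QDRANT_API_KEY") if not env.get(key)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return Settings(qdrant_url=env["QDRANT_URL"], qdrant_api_key=env["QDRANT_API_KEY"])
```

Failing at startup with the names of the missing variables tends to be easier to debug than an authentication error on the first query.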

3. Verify Connection

from qdrant_client import QdrantClient

client = QdrantClient(
    url=settings.qdrant_url,
    api_key=settings.qdrant_api_key,
)

# Check collections
collections = client.get_collections()
print(collections)

Troubleshooting

"Collection not found"

The collection is auto-created on first use:

def ensure_collection_exists(client: QdrantClient) -> None:
    collections = client.get_collections()
    exists = any(c.name == COLLECTION_NAME for c in collections.collections)

    if not exists:
        client.create_collection(
            collection_name=COLLECTION_NAME,
            vectors_config=VectorParams(
                size=VECTOR_SIZE,
                distance=Distance.COSINE,
            ),
        )

Low Match Confidence

If ingredients aren't matching well:

  1. Check ingredient name normalization
  2. Verify embedding model is consistent
  3. Consider lowering CONFIDENCE_THRESHOLD
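
For the first item, applying one normalization helper before both lookup and upsert keeps the embedded text consistent. Here is a sketch, assuming Unicode NFKC folding and whitespace collapsing are acceptable for ingredient names; `normalize_name` is a hypothetical helper, not part of the project.

```python
import re
import unicodedata


def normalize_name(raw: str) -> str:
    """Case-fold, Unicode-normalize, and collapse whitespace in an ingredient name."""
    text = unicodedata.normalize("NFKC", raw)
    text = text.casefold()
    # Collapse runs of whitespace (including newlines) to single spaces.
    text = re.sub(r"\s+", " ", text).strip()
    # Drop trailing list punctuation such as a comma after the name.
    return text.rstrip(".,;")


print(normalize_name("  Sodium   Lauryl\nSulfate, "))  # prints: sodium lauryl sulfate
```

If the rule changes after data has been stored, existing points were embedded from differently normalized text, so re-embedding them may be needed for scores to stay comparable.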

Connection Timeout

client = QdrantClient(
    url=settings.qdrant_url,
    api_key=settings.qdrant_api_key,
    timeout=30,  # Increase timeout
)
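
If a longer timeout alone is not enough, transient failures can also be retried. The following is a generic sketch with exponential backoff; `with_retries` is a hypothetical helper, and the exception types the Qdrant client raises on timeout should be checked against the installed version before narrowing `retry_on`.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retries(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, retrying with exponential backoff on the given exceptions."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
    raise AssertionError("unreachable")
```

A call such as `with_retries(lambda: client.get_collections())` would then wrap the connection check; the `sleep` parameter exists so tests can skip the real delay.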