Vector Database
The AI Ingredient Scanner uses Qdrant Cloud for semantic ingredient search, enabling fast and accurate lookups even with variations in ingredient naming.
Why Vector Search?
Traditional keyword search fails with ingredient names because:
- Spelling variations: "Glycerine" vs "Glycerin" vs "Glycerol"
- Abbreviations: "Sodium Lauryl Sulfate" vs "SLS"
- Aliases: "Vitamin E" vs "Tocopherol"
Vector search matches by meaning, not exact text.
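To make "matches by meaning" concrete: each name is mapped to a vector, and closeness between two vectors is scored with cosine similarity. The sketch below uses made-up three-dimensional vectors (real embeddings have hundreds of dimensions), so the values are illustrative only and are not model output.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors invented for illustration; not real embeddings.
glycerin = [0.9, 0.1, 0.0]
glycerol = [0.8, 0.2, 0.1]
tocopherol = [0.1, 0.2, 0.9]

# Related names should score higher than unrelated ones.
related = cosine_similarity(glycerin, glycerol)
unrelated = cosine_similarity(glycerin, tocopherol)
print(related > unrelated)
```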
Architecture
Configuration
Collection Setup
COLLECTION_NAME = "ingredients"
VECTOR_SIZE = 768  # requested from gemini-embedding-001 via output_dimensionality (the model's default is 3072)
EMBEDDING_MODEL = "gemini-embedding-001"
CONFIDENCE_THRESHOLD = 0.7
Vector Parameters
from qdrant_client.models import Distance, VectorParams

# `client` is the QdrantClient instance (see "Verify Connection")
client.create_collection(
    collection_name=COLLECTION_NAME,
    vectors_config=VectorParams(
        size=VECTOR_SIZE,
        distance=Distance.COSINE,  # Cosine similarity
    ),
)
Data Schema
Payload Structure
Each vector point stores ingredient metadata:
{
  "name": "Glycerin",
  "purpose": "Humectant, moisturizer",
  "safety_rating": 9,
  "concerns": "No known concerns",
  "recommendation": "SAFE",
  "allergy_risk_flag": "low",
  "allergy_potential": "Rare allergic reactions",
  "origin": "Natural",
  "category": "Both",
  "regulatory_status": "FDA approved, EU compliant",
  "regulatory_bans": "No",
  "aliases": ["Glycerine", "Glycerol", "E422"]
}
TypeScript Interface
interface IngredientData {
  name: string;
  purpose: string;
  safety_rating: number;      // 1-10
  concerns: string;
  recommendation: string;     // SAFE | CAUTION | AVOID
  allergy_risk_flag: string;  // high | low
  allergy_potential: string;
  origin: string;             // Natural | Synthetic | Semi-synthetic
  category: string;           // Food | Cosmetics | Both
  regulatory_status: string;
  regulatory_bans: string;    // Yes | No
  source: string;             // qdrant | google_search
  confidence: number;         // 0.0 - 1.0
}
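The Python functions in this section annotate values as IngredientData. One way to express the same shape on the Python side is a TypedDict. This is only a sketch mirroring the interface above, not necessarily how the project defines the type; the Literal value sets are taken from the comments in the interface, and treating safety_rating as an int is an assumption based on the sample payload.

```python
from typing import Literal, TypedDict


class IngredientData(TypedDict):
    """Python-side mirror of the TypeScript interface (illustrative sketch)."""

    name: str
    purpose: str
    safety_rating: int  # 1-10 (int is an assumption from the sample payload)
    concerns: str
    recommendation: Literal["SAFE", "CAUTION", "AVOID"]
    allergy_risk_flag: Literal["high", "low"]
    allergy_potential: str
    origin: Literal["Natural", "Synthetic", "Semi-synthetic"]
    category: Literal["Food", "Cosmetics", "Both"]
    regulatory_status: str
    regulatory_bans: Literal["Yes", "No"]
    source: Literal["qdrant", "google_search"]
    confidence: float  # 0.0 - 1.0


# Example value built from the sample payload; source and confidence are made up.
example: IngredientData = {
    "name": "Glycerin",
    "purpose": "Humectant, moisturizer",
    "safety_rating": 9,
    "concerns": "No known concerns",
    "recommendation": "SAFE",
    "allergy_risk_flag": "low",
    "allergy_potential": "Rare allergic reactions",
    "origin": "Natural",
    "category": "Both",
    "regulatory_status": "FDA approved, EU compliant",
    "regulatory_bans": "No",
    "source": "qdrant",
    "confidence": 0.93,
}
```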
Operations
Lookup Ingredient
def lookup_ingredient(ingredient_name: str) -> IngredientData | None:
    """Look up ingredient in Qdrant vector database."""
    # Generate embedding
    embedding = get_embedding(ingredient_name.lower().strip())

    # Query Qdrant
    results = client.query_points(
        collection_name=COLLECTION_NAME,
        query=embedding,
        limit=1,
    )
    if not results.points:
        return None

    top_result = results.points[0]
    confidence = top_result.score
    if confidence < CONFIDENCE_THRESHOLD:
        return None  # Will trigger Google Search

    return _parse_payload(top_result.payload, confidence)
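Upsert Ingredient
When Google Search finds new ingredient data, it is saved to Qdrant:
import uuid

from qdrant_client.models import PointStruct

def upsert_ingredient(ingredient_data: IngredientData) -> bool:
    """Add or update an ingredient in the database."""
    name = ingredient_data["name"]
    # Normalize the same way lookup_ingredient does so vectors and IDs line up
    normalized = name.lower().strip()

    # Create embedding
    embedding = get_embedding(normalized)

    # Create point. Python's built-in hash() is salted per process for strings,
    # so it cannot give a stable ID; a UUIDv5 of the normalized name is
    # deterministic across runs (the namespace is an arbitrary fixed choice).
    point = PointStruct(
        id=str(uuid.uuid5(uuid.NAMESPACE_DNS, normalized)),
        vector=embedding,
        payload={
            "name": name,
            "purpose": ingredient_data["purpose"],
            "safety_rating": ingredient_data["safety_rating"],
            # ... other fields
        },
    )
    client.upsert(
        collection_name=COLLECTION_NAME,
        points=[point],
    )
    return True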
Generate Embedding
from google import genai
from google.genai import types

def get_embedding(text: str) -> list[float]:
    """Get embedding vector using Google AI Studio."""
    # Named genai_client so it is not confused with the Qdrant `client`
    genai_client = genai.Client(api_key=settings.google_api_key)
    result = genai_client.models.embed_content(
        model=EMBEDDING_MODEL,
        contents=text,
        config=types.EmbedContentConfig(
            task_type="RETRIEVAL_QUERY",
            output_dimensionality=VECTOR_SIZE,  # must match the collection's vector size
        ),
    )
    return result.embeddings[0].values
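Self-Learning Pipeline
The system automatically improves over time: an ingredient that misses in Qdrant (no result, or a score below CONFIDENCE_THRESHOLD) is resolved through Google Search, and the result is upserted so that later lookups are served from the vector database.
Benefits: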
- First lookup uses Google Search (~2-3 seconds)
- Future lookups use Qdrant (~50-100ms)
- Knowledge base grows automatically
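The flow behind these benefits can be sketched as a small function. The name resolve_ingredient and its three callable parameters are hypothetical; they stand in for lookup_ingredient, the Google Search step, and upsert_ingredient, and are passed in here so the sketch runs without any external service.

```python
from typing import Callable, Optional

Record = dict  # stand-in for the ingredient record type


def resolve_ingredient(
    name: str,
    lookup: Callable[[str], Optional[Record]],
    search: Callable[[str], Optional[Record]],
    upsert: Callable[[Record], object],
) -> Optional[Record]:
    """Try the vector database first; fall back to search and cache the result."""
    cached = lookup(name)
    if cached is not None:
        return cached  # fast path: served from the vector database
    found = search(name)
    if found is not None:
        upsert(found)  # save so the next lookup takes the fast path
    return found


# In-memory fakes, for illustration only.
store: dict = {}
search_calls: list = []


def fake_lookup(name: str) -> Optional[Record]:
    return store.get(name.lower())


def fake_search(name: str) -> Optional[Record]:
    search_calls.append(name)
    return {"name": name, "source": "google_search"}


def fake_upsert(record: Record) -> None:
    store[record["name"].lower()] = record


first = resolve_ingredient("Squalane", fake_lookup, fake_search, fake_upsert)
second = resolve_ingredient("Squalane", fake_lookup, fake_search, fake_upsert)
print(len(search_calls))  # prints 1: the second call was served from the store
```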
Performance
| Operation | Typical Latency |
|---|---|
| Embedding generation | 100-200ms |
| Qdrant query | 50-100ms |
| Google Search | 2-3 seconds |
| Total (cached) | ~200ms |
| Total (uncached) | ~3 seconds |
Qdrant Cloud Setup
1. Create Cluster
- Go to Qdrant Cloud Console
- Create a new cluster (free tier available)
- Note your cluster URL and API key
2. Configure Environment
QDRANT_URL=https://your-cluster.qdrant.io
QDRANT_API_KEY=your_api_key_here
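The code in this section reads these values as settings.qdrant_url and settings.qdrant_api_key. How the project builds its settings object is not shown; one minimal way to load and validate the two variables, using only the standard library, is sketched below. QdrantSettings and load_qdrant_settings are hypothetical names.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class QdrantSettings:
    """Hypothetical container for the two Qdrant environment values."""

    qdrant_url: str
    qdrant_api_key: str


def load_qdrant_settings(env: Optional[Mapping[str, str]] = None) -> QdrantSettings:
    """Read QDRANT_URL and QDRANT_API_KEY, failing early if either is missing."""
    env = os.environ if env is None else env
    missing = [key for key in ("QDRANT_URL", "QDRANT_API_KEY") if not env.get(key)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return QdrantSettings(
        qdrant_url=env["QDRANT_URL"],
        qdrant_api_key=env["QDRANT_API_KEY"],
    )


# Usage with an explicit mapping; pass nothing to read from os.environ instead.
demo = load_qdrant_settings(
    {"QDRANT_URL": "https://your-cluster.qdrant.io", "QDRANT_API_KEY": "your_api_key_here"}
)
```

Failing at load time keeps a missing key from surfacing later as a less obvious connection error.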
3. Verify Connection
from qdrant_client import QdrantClient

client = QdrantClient(
    url=settings.qdrant_url,
    api_key=settings.qdrant_api_key,
)

# Check collections
collections = client.get_collections()
print(collections)
Troubleshooting
"Collection not found"
The collection is auto-created on first use:
def ensure_collection_exists(client: QdrantClient) -> None:
    collections = client.get_collections()
    exists = any(c.name == COLLECTION_NAME for c in collections.collections)
    if not exists:
        client.create_collection(
            collection_name=COLLECTION_NAME,
            vectors_config=VectorParams(
                size=VECTOR_SIZE,
                distance=Distance.COSINE,
            ),
        )
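Low Match Confidence
If ingredients aren't matching well:
- Check that ingredient names are normalized the same way when stored and when looked up
- Verify that the same embedding model and vector size are used for stored vectors and queries
- Consider lowering CONFIDENCE_THRESHOLD (0.7 in the configuration above)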
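For the normalization check, it helps to route every name through one shared helper before embedding. The function below is a sketch of such a helper; normalize_ingredient_name is a hypothetical name, and the Unicode and whitespace handling goes beyond the lower().strip() used in lookup_ingredient, so treat those extra steps as assumptions.

```python
import re
import unicodedata


def normalize_ingredient_name(name: str) -> str:
    """Canonicalize an ingredient name before embedding it."""
    # Fold compatibility characters (e.g. full-width letters) to plain forms.
    text = unicodedata.normalize("NFKC", name)
    # Case-insensitive matching, no leading or trailing whitespace.
    text = text.lower().strip()
    # Collapse runs of spaces, tabs, and newlines into a single space.
    return re.sub(r"\s+", " ", text)


print(normalize_ingredient_name("  Sodium   Lauryl\tSulfate "))  # sodium lauryl sulfate
```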
Connection Timeout
client = QdrantClient(
    url=settings.qdrant_url,
    api_key=settings.qdrant_api_key,
    timeout=30,  # Increase timeout (seconds)
)