There's a growing gap between 'AI-enhanced' products and products where AI is actually load-bearing. Most teams are in the first camp — they've added a ChatGPT wrapper and called it AI-native. The real architecture challenge begins when an AI feature is the primary reason users pay for your product.
After building several production AI SaaS systems, we've found that the patterns that make or break them are consistent. This isn't a tutorial; it's a set of architectural decisions we'd make again, and a few we wouldn't.
Start with the data model
The biggest mistake teams make is treating LLM calls like REST API calls. They're not. LLMs introduce non-determinism, latency variance (50ms to 30s on the same query), token cost accumulation, and failure modes that don't exist in conventional services. Your schema needs to account for this from day one.
-- Track every LLM interaction for debugging and cost attribution
CREATE TABLE ai_interactions (
  id                UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  tenant_id         UUID NOT NULL REFERENCES tenants(id),
  user_id           UUID REFERENCES users(id),
  feature           VARCHAR(100) NOT NULL, -- 'chat', 'summarize', 'classify'
  model             VARCHAR(50) NOT NULL,
  prompt_tokens     INTEGER NOT NULL,
  completion_tokens INTEGER NOT NULL,
  latency_ms        INTEGER NOT NULL,
  cached            BOOLEAN DEFAULT FALSE,
  error             TEXT,
  created_at        TIMESTAMPTZ DEFAULT NOW()
);
-- Index for cost analysis per tenant per day.
-- created_at is TIMESTAMPTZ, so a bare created_at::date depends on the session
-- time zone and PostgreSQL rejects it in an index expression. Pinning the zone
-- makes the expression immutable; queries must use the same expression.
CREATE INDEX idx_ai_interactions_tenant_day
  ON ai_interactions (tenant_id, ((created_at AT TIME ZONE 'UTC')::date));
Logging every interaction seems excessive until you get a surprise API bill or need to debug why one tenant's chat responses are degrading. Without this table you're flying blind.
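The cost-attribution side of this can be sketched in a few lines. This is a minimal illustration, not a billing implementation: `InteractionRow`, `ModelPrice`, and `costByTenantDay` are hypothetical names, the row shape mirrors only a subset of the `ai_interactions` columns, and prices are supplied by the caller rather than taken from any provider's price list.

```typescript
// Hypothetical row shape mirroring a subset of the ai_interactions columns.
interface InteractionRow {
  tenantId: string;
  model: string;
  promptTokens: number;
  completionTokens: number;
  createdAt: Date;
}

// Price per one million tokens, in US cents. Values come from the caller;
// nothing here reflects any provider's real pricing.
interface ModelPrice {
  promptCentsPerMTok: number;
  completionCentsPerMTok: number;
}

// Sums estimated spend (in cents) per tenant per UTC day.
// Keys have the form "<tenantId>|YYYY-MM-DD".
function costByTenantDay(
  rows: InteractionRow[],
  prices: Record<string, ModelPrice>
): Map<string, number> {
  const totals = new Map<string, number>();
  for (const row of rows) {
    const price = prices[row.model];
    if (!price) continue; // unknown model: skip rather than guess a price
    const cents =
      (row.promptTokens * price.promptCentsPerMTok +
        row.completionTokens * price.completionCentsPerMTok) /
      1_000_000;
    const day = row.createdAt.toISOString().slice(0, 10); // YYYY-MM-DD in UTC
    const key = `${row.tenantId}|${day}`;
    totals.set(key, (totals.get(key) ?? 0) + cents);
  }
  return totals;
}
```

In practice the same aggregation would more likely run as a `GROUP BY` over the table; the in-memory version just makes the arithmetic explicit.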
Multi-tenancy and data isolation
If you're building a multi-tenant SaaS with RAG, you need strict isolation at the vector database level. A system that searches across all tenant data is a serious security and privacy problem. Each tenant's embeddings must be namespace-separated — a query from Tenant A should never surface Tenant B's documents.
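One gap worth guarding against: in some vector stores an empty or missing namespace resolves to a shared default namespace, so a dropped tenant ID can fall through silently. A minimal sketch of a guard, assuming tenant IDs are UUIDs as in the schema above; `tenantNamespace` and `UUID_PATTERN` are hypothetical names, not part of any SDK.

```typescript
// Hypothetical helper: validates a tenant ID before it is used as a namespace.
// Assumes tenant IDs are UUIDs, matching tenant_id in the ai_interactions schema.
const UUID_PATTERN =
  /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Returns a normalized namespace string, or throws if the ID is missing or malformed.
function tenantNamespace(tenantId: unknown): string {
  if (typeof tenantId !== "string" || !UUID_PATTERN.test(tenantId)) {
    throw new Error("Refusing vector-store access without a valid tenant ID");
  }
  return tenantId.toLowerCase();
}
```

Routing every namespace lookup through one function like this keeps the isolation rule in a single place instead of relying on each call site to remember it.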
// Pinecone namespace isolation: never query without the tenantId namespace
import { Pinecone } from "@pinecone-database/pinecone";
// `embed` (text -> number[]) and the `Document` type are application-defined.
// The environment variable name is an assumption.
const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
const index = pinecone.index("knowledge-base");
export async function queryKnowledgeBase(
  query: string,
  tenantId: string,
  topK = 5
) {
  const embedding = await embed(query);
  // The namespace ensures complete tenant isolation
  return index.namespace(tenantId).query({
    vector: embedding,
    topK,
    includeMetadata: true,
  });
}
// When indexing documents, always namespace by tenant
export async function indexDocument(
  doc: Document,
  tenantId: string
) {
  const embedding = await embed(doc.content);
  await index.namespace(tenantId).upsert([{
    id: doc.id,
    values: embedding,
    metadata: { title: doc.title, source: doc.url },
  }]);
}
Streaming is not optional
Users tolerate 15 seconds of a streaming response appearing word by word. They don't tolerate 15 seconds of a blank screen. Wire streaming into your API routes at the start — retrofitting it later is genuinely painful because it changes the client/server contract.
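On the client, that contract mostly comes down to parsing frames that can arrive split across network chunks. A minimal sketch, assuming the server emits `data: <json>` frames separated by a blank line and finishes with a `data: [DONE]` sentinel; `createSSEParser` is a hypothetical helper, not a browser or library API.

```typescript
// Hypothetical incremental parser for "data: ...\n\n" frames.
// Feed it decoded text chunks; each call returns the text payloads completed so far.
function createSSEParser() {
  let buffer = "";
  let done = false;
  return {
    push(chunk: string): string[] {
      const out: string[] = [];
      if (done) return out;
      buffer += chunk;
      let boundary = buffer.indexOf("\n\n");
      while (boundary !== -1) {
        const frame = buffer.slice(0, boundary);
        buffer = buffer.slice(boundary + 2);
        boundary = buffer.indexOf("\n\n");
        if (!frame.startsWith("data: ")) continue; // ignore comments and other fields
        const payload = frame.slice("data: ".length);
        if (payload === "[DONE]") {
          done = true; // sentinel: stop parsing anything after it
          break;
        }
        const parsed = JSON.parse(payload) as { text?: string };
        if (typeof parsed.text === "string") out.push(parsed.text);
      }
      return out;
    },
    get finished(): boolean {
      return done;
    },
  };
}
```

In a browser the chunks would typically come from `response.body.getReader()`, decoded with a `TextDecoder` in streaming mode so multi-byte characters split across chunks are handled.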
// Next.js App Router: streaming AI response
import OpenAI from "openai";
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
export async function POST(req: Request) {
  const { messages, tenantId } = await req.json();
  const stream = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages,
    stream: true,
  });
  const encoder = new TextEncoder();
  return new Response(
    new ReadableStream({
      async start(controller) {
        for await (const chunk of stream) {
          const text = chunk.choices[0]?.delta?.content ?? "";
          if (text) {
            // Each SSE frame ends with a blank line, hence the double newline
            controller.enqueue(
              encoder.encode(`data: ${JSON.stringify({ text })}\n\n`)
            );
          }
        }
        controller.enqueue(encoder.encode("data: [DONE]\n\n"));
        controller.close();
      },
    }),
    {
      headers: {
        "Content-Type": "text/event-stream",
        "Cache-Control": "no-cache",
        Connection: "keep-alive",
      },
    }
  );
}
Fallback behavior from day one
OpenAI goes down. Claude has capacity issues. Your embedding service will have latency spikes. If your product's core functionality depends on an external AI API with no fallback, you have a reliability problem that will manifest at the worst possible moment.
A fallback doesn't have to be another AI model. It can be a deterministic response, a cached answer from a similar past query, or an honest 'I'm not able to help with that right now' with a way to contact support. Users tolerate known limitations far better than silent failures.
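One way to structure this is an ordered chain of strategies, each with its own time budget, ending in a deterministic message. The sketch below is illustrative only: `Strategy`, `withTimeout`, and `withFallbacks` are hypothetical names, and the timeout values are left to the caller.

```typescript
type Strategy = {
  name: string;
  run: () => Promise<string>;
  timeoutMs: number;
};

// Rejects if `promise` does not settle within `ms` milliseconds.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`timed out after ${ms}ms`)),
      ms
    );
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); }
    );
  });
}

// Tries each strategy in order and returns the first success. If every
// strategy fails, returns the deterministic last-resort message instead
// of throwing, along with the collected errors for logging.
async function withFallbacks(
  strategies: Strategy[],
  lastResort: string
): Promise<{ text: string; source: string; errors: string[] }> {
  const errors: string[] = [];
  for (const s of strategies) {
    try {
      const text = await withTimeout(s.run(), s.timeoutMs);
      return { text, source: s.name, errors };
    } catch (err) {
      errors.push(`${s.name}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  return { text: lastResort, source: "last-resort", errors };
}
```

Returning the `source` alongside the text lets the UI tell the user when they are seeing a cached or degraded answer, which is what keeps the failure from being silent.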
Token cost management
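AI API costs scale with tokens rather than requests, and token volume tends to grow faster than user count because longer conversations and larger retrieved contexts are resent on every call. A product that costs $50/month to run in development can generate surprising bills at even modest scale. Two patterns that pay for themselves quickly: semantic caching and routing by complexity.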
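The similarity lookup behind a semantic cache can be sketched in memory as follows. `CacheEntry`, `cosineSimilarity`, and `findMostSimilar` are hypothetical names; a production system would more likely delegate the search to a vector index, and the threshold is something to tune against real traffic rather than a fixed rule. Entries are filtered by tenant so cached answers never cross tenants.

```typescript
interface CacheEntry {
  tenantId: string;
  embedding: number[];
  response: string;
}

// Cosine similarity of two equal-length vectors; 0 if either has zero norm.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("embedding dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Linear scan over one tenant's entries. Returns the best match at or
// above the threshold, or null on a cache miss.
function findMostSimilar(
  entries: CacheEntry[],
  queryEmbedding: number[],
  tenantId: string,
  threshold: number
): CacheEntry | null {
  let best: CacheEntry | null = null;
  let bestScore = threshold;
  for (const entry of entries) {
    if (entry.tenantId !== tenantId) continue; // never serve another tenant's answer
    const score = cosineSimilarity(queryEmbedding, entry.embedding);
    if (score >= bestScore) {
      best = entry;
      bestScore = score;
    }
  }
  return best;
}
```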
// Semantic cache: return cached answer for semantically similar queries.
// embed, findSimilarCachedResponse, isSimpleQuery, callLLM, storeInCache,
// and logInteraction are application-defined helpers.
async function getAIResponse(query: string, tenantId: string) {
  const queryEmbedding = await embed(query);
  // Check cache first (threshold: 0.95 = very high similarity)
  const cached = await findSimilarCachedResponse(
    queryEmbedding,
    tenantId,
    0.95
  );
  if (cached) {
    await logInteraction({ cached: true, tenantId });
    return { response: cached.response, fromCache: true };
  }
  // Route simple queries to a cheaper/faster model
  const model = isSimpleQuery(query) ? "gpt-4o-mini" : "gpt-4o";
  const response = await callLLM(query, model);
  await storeInCache(queryEmbedding, response, tenantId);
  await logInteraction({ cached: false, model, tenantId });
  return { response, fromCache: false };
}
Evaluation before shipping
The hardest part of AI SaaS isn't building the feature — it's knowing when it's good enough to ship. Build an evaluation suite before the feature goes live: a set of test prompts with expected outputs, measured against your pipeline. Run it after any model change, prompt change, or chunking strategy change. Without this, you're shipping blind.
- Define what 'correct' means for each AI feature before building it
- Build a test set of 20–50 representative queries with expected outputs
- Track retrieval quality (did the context contain the answer?) separately from generation quality (did the LLM use it correctly?)
- Re-evaluate after every meaningful change to the pipeline
- Log production failures — user feedback is the best evaluation data you have
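The checklist above can be sketched as a small scorer that keeps retrieval and generation results separate. Everything here is an illustrative assumption: the `EvalCase` and `PipelineOutput` shapes, the `scoreRun` name, and especially the use of case-insensitive substring matching as the correctness check, which is a crude stand-in for whatever definition of 'correct' fits the feature.

```typescript
interface EvalCase {
  query: string;
  expectedFact: string; // text that both the context and the answer should contain
}

// Recorded output of one pipeline run for one query.
interface PipelineOutput {
  contexts: string[];
  answer: string;
}

interface EvalReport {
  total: number;
  retrievalHits: number; // retrieved context contained the expected fact
  generationHits: number; // final answer contained the expected fact
  lostInGeneration: string[]; // queries where retrieval succeeded but the answer missed
}

function containsFact(haystack: string, fact: string): boolean {
  return haystack.toLowerCase().includes(fact.toLowerCase());
}

// Scores recorded outputs against test cases, pairing them by position.
function scoreRun(cases: EvalCase[], outputs: PipelineOutput[]): EvalReport {
  if (cases.length !== outputs.length) {
    throw new Error("each case needs exactly one recorded output");
  }
  const report: EvalReport = {
    total: cases.length,
    retrievalHits: 0,
    generationHits: 0,
    lostInGeneration: [],
  };
  cases.forEach((testCase, i) => {
    const output = outputs[i];
    const retrieved = output.contexts.some((c) =>
      containsFact(c, testCase.expectedFact)
    );
    const answered = containsFact(output.answer, testCase.expectedFact);
    if (retrieved) report.retrievalHits++;
    if (answered) report.generationHits++;
    if (retrieved && !answered) report.lostInGeneration.push(testCase.query);
  });
  return report;
}
```

The `lostInGeneration` list is the useful part: those are cases where chunking and retrieval did their job and the prompt or model is what needs attention.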
Auravon AI
Engineering Studio