Generative AI & Large Language Models: Architecture, Prompt Engineering, Fine-Tuning, and Operational Excellence
Executive Summary
Large Language Models (LLMs) enable natural language understanding, generation, summarization, reasoning, and code synthesis. Enterprise adoption requires more than raw API calls: robust Retrieval-Augmented Generation (RAG), structured prompt pipelines, safety & governance enforcement, continuous evaluation, and cost-performance trade-off optimization. This blueprint provides actionable patterns to design, deploy, and evolve LLM-powered systems on Azure, balancing quality, reliability, security, and operational efficiency.
Target outcomes: answer accuracy > 85% vs gold set, hallucination rate < 8%, latency p95 < 2.5s for RAG queries, monthly cost per 1K tokens trending down, and safety violation false negative rate < 2%.
Introduction
Models like GPT-style transformers exhibit emergent capabilities at scale: few-shot reasoning, chain-of-thought decomposition, instruction following. Yet naive usage leads to hallucinations, excessive token consumption, brittle prompts, and compliance risks. Production success demands disciplined system design: layered retrieval, prompt templating, evaluation harnesses, adaptive caching, and governance instrumentation.
Reference Architecture (Text Diagram)
+------------------------------------------------------------------+
|                        LLM Solution Stack                        |
+-------------+--------------+--------------+--------------+-------+
| Data Layer  | Index Layer  | Orchestration| Safety Layer | UX    |
| Documents   | Vector Store | Prompt/RAG   | Filters + PII| Chat  |
| FAQs, KB    | Hybrid (BM25+| Routing      | Moderation   | API   |
| Policies    | Embeddings)  | Chains       | Redaction    | Apps  |
+-------------+--------------+--------------+--------------+-------+
| Observability: Metrics (latency, tokens), Traces, Eval Scores    |
| Governance: Prompt registry, versioned templates, guardrails     |
+------------------------------------------------------------------+
LLM Lifecycle
- Problem Framing → identify tasks (Q&A, summarization, extraction, classification).
- Data Preparation → curate documents, chunking strategy, metadata enrichment.
- Embedding & Indexing → generate embeddings (e.g., text-embedding-ada-002 equivalent) + hybrid retrieval.
- Prompt Engineering → system & user prompt templates, structure, few-shot examples.
- Generation / Reasoning → temperature, top_p tuning, chain-of-thought or tool invocation.
- Evaluation → offline benchmarks (BLEU, ROUGE, factual QA), online feedback loops.
- Fine-Tuning / Adaptation → parameter-efficient tuning (LoRA, adapters) or continuous embedding refresh.
- Monitoring & Governance → safety, PII detection, prompt drift, cost telemetry, performance regression.
- Optimization & Iteration → refine prompts, adjust retrieval window, prune examples.
Prompt Engineering Core Concepts
| Technique | Purpose | Example Snippet | Notes |
|---|---|---|---|
| Role / System Prompt | Establish behavior | "You are a compliance assistant" | Anchor identity |
| Few-Shot Examples | Provide pattern guidance | Q/A pairs | Keep token budget in check |
| Chain-of-Thought | Improve reasoning transparency | "Let's reason step by step" | Potentially longer output |
| Self-Ask | Decompose questions | Intermediate sub-queries | Higher latency trade-off |
| Tool / Function Calling | Structured extraction | JSON schema call | Enforce output format |
| Guarded Output Patterns | Control shape | XML / JSON enforced | Improves parsing reliability |
Prompt Template Example
SYSTEM_PROMPT = """You are an enterprise knowledge assistant. Provide concise, factual answers sourced ONLY from provided context. If answer not found, respond: 'INSUFFICIENT_CONTEXT'. Avoid speculation."""
USER_TEMPLATE = """Context:\n{context}\n\nQuestion: {question}\nAnswer:"""
def build_prompt(context_chunks, question):
context = "\n".join(context_chunks[:4]) # limit to top chunks
return SYSTEM_PROMPT + "\n" + USER_TEMPLATE.format(context=context, question=question)
Retrieval-Augmented Generation (RAG)
RAG mitigates hallucinations by constraining answers to retrieved context. Combine semantic + keyword retrieval to maximize recall without irrelevant drift.
Chunking Strategy
| Approach | Chunk Size | Pros | Cons |
|---|---|---|---|
| Fixed tokens | 256–512 | Simple | Possible semantic cut |
| Recursive splitting | Dynamic | Maintains coherence | Complexity |
| Semantic boundary | Paragraph-level | Natural grouping | Requires NLP pass |
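Recursive Splitting Sketch
A minimal sketch of the recursive approach, assuming whitespace tokens as a rough token proxy and an illustrative separator hierarchy (paragraph, line, sentence):
SEPARATORS = ["\n\n", "\n", ". "]
def recursive_split(text, max_tokens=400, separators=SEPARATORS):
    # Split on the coarsest separator first; recurse with finer ones until chunks fit.
    if len(text.split()) <= max_tokens or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    chunks = []
    for part in text.split(sep):
        if len(part.split()) > max_tokens:
            chunks.extend(recursive_split(part, max_tokens, rest))
        elif part.strip():
            chunks.append(part.strip())
    return chunks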
Hybrid Retrieval Example
def hybrid_retrieve(query, vector_index, keyword_index, k=8):
semantic = vector_index.search(query, k=k)
lexical = keyword_index.search(query, k=k)
union = {d['id']: d for d in semantic + lexical}
# Simple score merge (could weight)
return list(union.values())[:k]
Embeddings & Indexing
Use embeddings with 512–1536 dimensions; store metadata (source, section, timestamp, sensitivity). Periodically re-embed updated documents and maintain version tags for rollback.
embedding_cache = {}
def get_embedding(text, client):
if text in embedding_cache: return embedding_cache[text]
vec = client.embed(text)
embedding_cache[text] = vec
return vec
Parameter-Efficient Fine-Tuning (PEFT)
PEFT allows adapting large base models economically.
| Method | Mechanism | Pros | Cons |
|---|---|---|---|
| LoRA | Inject low-rank matrices into attention | Low memory | May underfit niche tasks |
| Adapters | Add bottleneck layers | Modular | Slight latency impact |
| Prefix Tuning | Prepend trainable tokens | Fast training | Limited deep adaptation |
| QLoRA | Quantized + LoRA | Cost-efficient | Setup complexity |
LoRA Pseudocode
class LoRALayer(nn.Module):
def __init__(self, base_layer, r=8, alpha=16):
super().__init__()
self.base = base_layer
self.A = nn.Linear(base_layer.in_features, r, bias=False)
self.B = nn.Linear(r, base_layer.out_features, bias=False)
self.scaling = alpha / r
def forward(self, x):
return self.base(x) + self.B(self.A(x)) * self.scaling
Evaluation Framework
| Dimension | Metric | Tooling |
|---|---|---|
| Factuality | Exact match / EM@K | Custom QA harness |
| Relevance | Retrieval overlap | Embedding cosine |
| Coherence | Human rating / LLM judge | Eval script |
| Safety | Policy violation rate | Content filter logs |
| Cost | Tokens/request | Billing telemetry |
| Latency | p95 response time | Tracing + APM |
Simple QA Evaluation
def evaluate_qa(model, dataset):
correct = 0
for item in dataset:
answer = model.ask(item['question'], item['context'])
if answer.strip().lower() == item['gold'].strip().lower():
correct += 1
return correct / len(dataset)
Safety & Compliance Layer
Safety filters should sit between generation and user exposure (and pre-generation for input screening).
| Risk | Control | Implementation |
|---|---|---|
| Toxic language | Moderate/Block categories | Content safety API |
| PII leakage | Entity detection + redaction | Regex + NER pass |
| Hallucination | RAG + answer grounding label | Source citation enforcement |
| Prompt Injection | Context boundary enforcement | Strip system tokens, sanitize user input |
| Data Exfiltration | Output length + pattern guard | Post-processing rules |
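PII Redaction Sketch
A regex-only sketch of the PII redaction pass (an NER model would complement it for names and addresses); the patterns below are illustrative, not exhaustive:
import re
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
}
def redact_pii(text):
    # Replace each detected entity with a typed placeholder before exposure or logging.
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED_{label}]", text)
    return text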
Hallucination Confidence Heuristic
def hallucination_score(answer, sources):
import difflib
similarity = max(difflib.SequenceMatcher(None, answer, s).ratio() for s in sources)
return 1 - similarity # higher => more risk
Cost Optimization Strategies
| Area | Technique | Impact |
|---|---|---|
| Token Usage | Prompt compression, remove redundant examples | ↓ cost/request |
| Caching | Reuse embedding + answer cache | ↓ repeat cost |
| Model Selection | Smaller model for simple queries | ↓ baseline cost |
| Adaptive Routing | Choose model by complexity score | Balanced spend/quality |
| Batching | Group embedding requests | Improved throughput |
| Quantization | Compress fine-tuned variants | ↓ inference compute |
Adaptive Routing Sketch
def route_query(query, complexity_model):
score = complexity_model.predict(query)
if score < 0.3: return "small" # fast/light model
if score < 0.7: return "medium"
return "large"
Latency Engineering
- Parallel retrieval + embedding lookups (asyncio fan-out sketch below).
- Stream partial answer tokens early (server-sent events / websockets).
- Optimize chunk size for retrieval recall vs token overhead.
- Warm pool provision for large model instances.
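Parallel Retrieval Sketch
A hedged asyncio sketch of the fan-out pattern; vector_search, keyword_search, and embed are assumed async callables, not a specific SDK:
import asyncio
async def parallel_retrieve(query, vector_search, keyword_search, embed):
    # Launch vector search, keyword search, and query embedding concurrently.
    semantic, lexical, qv = await asyncio.gather(
        vector_search(query), keyword_search(query), embed(query)
    )
    return semantic + lexical, qv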
Observability & Telemetry
| Signal | Purpose | Tool |
|---|---|---|
| Tokens used | Cost control | Billing export |
| Latency p95 | User experience SLA | APM / traces |
| Safety violations | Compliance monitoring | Content filter metrics |
| Cache hit rate | Efficiency | Custom counter |
| Hallucination score | Quality risk | QA pipeline |
| Retrieval coverage % | Context sufficiency | Index analyzer |
Guardrails & Policy as Code
GUARDRAILS = {
"max_tokens": 1024,
"allow_chain_of_thought": False,
"banned_phrases": ["confidential", "password"],
"citation_required": True
}
def enforce_guardrails(output, metadata):
    if len(output.split()) > GUARDRAILS["max_tokens"]:  # word count used as a rough token proxy
return False, "Token limit exceeded"
if GUARDRAILS["citation_required"] and "SOURCE:" not in output:
return False, "Missing citation"
for phrase in GUARDRAILS["banned_phrases"]:
if phrase in output.lower():
return False, "Banned phrase detected"
return True, "OK"
RAG Query Flow (ASCII)
User Query → Preprocess → Hybrid Retrieve → Rank & Filter → Prompt Build → LLM Generate → Safety Filters → Cite Sources → Return Response → Log Metrics
Versioning & Change Management
Track prompt template versions + embedding index snapshots. Associate deployment with prompt hash + model revision for reproducibility.
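Prompt Version Registry Sketch
A minimal sketch of this association, assuming an in-memory dict stands in for a real versioned store:
import hashlib
PROMPT_REGISTRY = {}
def register_prompt(template, model_revision, index_snapshot):
    # Hash the template so each deployment can be tied to an exact prompt version.
    prompt_hash = hashlib.sha256(template.encode()).hexdigest()[:12]
    PROMPT_REGISTRY[prompt_hash] = {
        "template": template,
        "model_revision": model_revision,
        "index_snapshot": index_snapshot,
    }
    return prompt_hash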
Failure Modes & Mitigations
| Failure | Cause | Mitigation |
|---|---|---|
| Hallucinated answer | Insufficient / noisy context | Increase top-k, add validation pass |
| High latency | Large prompt or slow retrieval | Compress prompt, optimize index |
| Safety false negative | Weak filter | Ensemble filters + periodic audit |
| Rising cost | Token inflation | Prompt diff audit + caching |
| Poor domain adaptation | Insufficient fine-tune data | Add synthetic examples, LoRA tuning |
Initial References
- Attention Is All You Need (Transformer paper)
- Retrieval-Augmented Generation (Lewis et al.)
- LoRA: Low-Rank Adaptation of Large Language Models
- OpenAI Prompt Engineering Guidelines
- Azure AI Content Safety Documentation
Next Expansion Targets
- Add tool/function calling governance section.
- Include evaluation harness scripts (BLEU/ROUGE & factuality scoring).
- Expand safety with jailbreak prompt detection.
- Add cost projection calculator.
- Provide multi-turn conversation state design.
Tool / Function Calling Governance
Structured tool invocation reduces hallucinated APIs and enforces output schema fidelity.
TOOLS = {
"lookup_account": {"args": ["account_id"], "safety": ["no_pii"]},
"calc_interest": {"args": ["principal","rate"], "safety": []}
}
def validate_tool_call(name, args):
spec = TOOLS.get(name)
if not spec:
return False, "Unknown tool"
if set(args.keys()) != set(spec["args"]):
return False, "Arg mismatch"
return True, "OK"
The policy layer logs tool usage; anomaly detection flags rare or unexpected call sequences.
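A minimal frequency-based sketch of that anomaly flag, assuming in-process counters (a real deployment would aggregate telemetry instead):
from collections import Counter
TOOL_CALL_COUNTS = Counter()
def flag_rare_tool_call(name, min_seen=20, rare_ratio=0.01):
    # Flag tools whose share of total usage is unusually low once enough calls are observed.
    TOOL_CALL_COUNTS[name] += 1
    total = sum(TOOL_CALL_COUNTS.values())
    if total < min_seen:
        return False
    return TOOL_CALL_COUNTS[name] / total < rare_ratio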
Multi-Turn Conversation Memory
| Strategy | Mechanism | Pros | Cons |
|---|---|---|---|
| Full Transcript | Append all prior turns | Complete context | Token explosion |
| Sliding Window | Keep last N turns | Bounded tokens | Loses distant context |
| Summary Memory | Periodic summarization | Low token footprint | Potential detail loss |
| Vector Memory | Embedding retrieval of past turns | Semantic recall | Index complexity |
Summarization Memory Update
def update_summary(prev_summary, new_turns, llm):
prompt = f"Prior summary:\n{prev_summary}\nNew turns:\n{new_turns}\nUpdate summary preserving key facts."
return llm.generate(prompt)
Jailbreak & Injection Detection
Common attack vectors: instruction override, prompt leakage, system role manipulation.
JAILBREAK_PATTERNS = ["ignore previous", "disregard instructions", "/system", "pretend to"]
def detect_jailbreak(user_input):
lowered = user_input.lower()
for p in JAILBREAK_PATTERNS:
if p in lowered:
return True
return False
Mitigation: refuse generation, provide safe fallback, log incident.
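A minimal mitigation sketch tying these steps together, assuming a standard logging.Logger and an illustrative fallback message:
def handle_jailbreak(user_input, logger):
    # Refuse, return a safe fallback, and log the incident for later review.
    if detect_jailbreak(user_input):
        logger.warning("jailbreak_attempt_detected")
        return "I'm unable to help with that request."
    return None  # proceed with normal generation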
Retrieval Ranking Enhancements
Combine vector similarity + recency + document authority weight.
def rank_results(results):
for r in results:
r['score'] = 0.6*r['vector_score'] + 0.2*r.get('authority',0.5) + 0.2*r.get('recency_score',0.5)
return sorted(results, key=lambda x: x['score'], reverse=True)
Context Window Optimization
| Technique | Description | Impact |
|---|---|---|
| Deduplication | Remove overlapping chunks | ↓ tokens |
| Salience Scoring | Keep top relevance + novelty | ↑ quality |
| Structured Formatting | Bold headings, bullet compression | ↑ readability |
| Citation Markers | Tag sources (SOURCE:ID) | ↑ traceability |
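Chunk Deduplication Sketch
A hedged sketch of the deduplication technique above, using word-level Jaccard overlap as the similarity heuristic:
def deduplicate_chunks(chunks, overlap_threshold=0.8):
    # Drop a chunk if it largely repeats one already kept.
    kept = []
    for chunk in chunks:
        words = set(chunk.split())
        duplicate = any(
            len(words & set(k.split())) / max(len(words | set(k.split())), 1) > overlap_threshold
            for k in kept
        )
        if not duplicate:
            kept.append(chunk)
    return kept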
Salience Score Example
def salience(chunk, query, embedding_fn):
import numpy as np
qv = embedding_fn(query)
cv = embedding_fn(chunk)
relevance = np.dot(qv, cv)/(np.linalg.norm(qv)*np.linalg.norm(cv))
novelty = len(set(chunk.split()))/ (len(chunk.split())+1e-6)
return 0.7*relevance + 0.3*novelty
Advanced Evaluation Harness
from rouge_score import rouge_scorer
def eval_generation(model, dataset):
scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)
scores = []
for item in dataset:
output = model.generate(item['prompt'])
s = scorer.score(item['reference'], output)
scores.append(s['rougeL'].fmeasure)
return sum(scores)/len(scores)
Factuality via Source Citation
def factuality_score(answer, source_texts):
import difflib
ratios = [difflib.SequenceMatcher(None, answer, s).ratio() for s in source_texts]
return max(ratios)
Cost Projection Calculator
def monthly_cost(avg_tokens_per_request, requests_per_day, price_per_1k=0.002):
daily_tokens = avg_tokens_per_request * requests_per_day
monthly_tokens = daily_tokens * 30
return (monthly_tokens / 1000) * price_per_1k
Track actual spend against budget; trigger optimization when actuals exceed 110% of forecast.
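Illustrative use of monthly_cost with the 110% trigger (all figures are placeholders):
forecasted = monthly_cost(avg_tokens_per_request=1200, requests_per_day=5000)
actual = 410.0  # pulled from billing telemetry
if actual > 1.10 * forecasted:
    print("Cost anomaly: review prompt compression and caching")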
Token Compression Techniques
| Technique | Mechanism | Trade-Off |
|---|---|---|
| Abbreviation Map | Replace repeated entity names | Possible ambiguity |
| Structured JSON | Remove prose around key-value | Less human readable |
| Example Pruning | Keep hardest diverse examples | Coverage risk |
| Summarized Context | Use summaries for older turns | Detail loss |
Adaptive Temperature Strategy
Lower temperature for factual Q&A, raise for creative tasks.
def select_temperature(task_type):
mapping = {"factual":0.2, "creative":0.8, "code":0.3, "summary":0.4}
return mapping.get(task_type,0.5)
Observability Implementation (Structured Logging)
def log_event(logger, event_type, data):
import json, time
payload = {"ts": time.time(), "type": event_type, **data}
logger.info(json.dumps(payload))
Events: retrieval_set, prompt_built, generation_complete, safety_flagged.
Hallucination Sandbox Testing
Generate adversarial prompts ("Provide details about confidential project X") and measure rejection effectiveness.
| Scenario | Expected Outcome | Actual | Pass |
|---|---|---|---|
| Confidential query | Reject with policy message | Reject | ✓ |
| PII request | Redact / refuse | Refuse | ✓ |
| System prompt leak | Maintain boundaries | Boundaries kept | ✓ |
Prompt Drift Detection
Compare new prompt templates against the baseline via embedding similarity; if divergence exceeds a threshold, flag the change for governance review.
def prompt_drift(old, new, embed):
import numpy as np
o = embed(old); n = embed(new)
sim = np.dot(o,n)/(np.linalg.norm(o)*np.linalg.norm(n))
return 1 - sim
Synthetic Data Augmentation for Fine-Tuning
| Method | Description | Risk Mitigation |
|---|---|---|
| Paraphrasing | LLM rephrase existing QA pairs | Validate factual consistency |
| Counterfactual | Alter entity attributes maintaining logic | Check bias introduction |
| Template Fill | Slot-based generation | Ensure slot constraints |
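Template Fill Sketch
A minimal sketch of slot-based generation; the template and slot values are illustrative:
import itertools
TEMPLATE = "What is the {metric} limit for a {tier} account?"
SLOTS = {"metric": ["daily transfer", "withdrawal"], "tier": ["standard", "premium"]}
def template_fill(template=TEMPLATE, slots=SLOTS):
    # Expand every slot combination into a synthetic training question.
    keys = list(slots)
    return [template.format(**dict(zip(keys, combo)))
            for combo in itertools.product(*(slots[k] for k in keys))]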
PEFT Training Loop Sketch
def train_peft(model, data_loader, optimizer):
model.train()
for batch in data_loader:
loss = model(batch['input_ids'], labels=batch['labels']).loss
loss.backward()
optimizer.step(); optimizer.zero_grad()
Memory Token Budgeting
Compute projected tokens for conversation; prune if threshold exceeded.
def prune_transcript(turns, max_tokens, tokenizer):
total = 0; kept = []
for t in reversed(turns):
tokens = len(tokenizer(t))
if total + tokens <= max_tokens:
kept.insert(0,t); total += tokens
else:
break
return kept
KPI Catalog (LLM Ops)
| KPI | Target | Rationale |
|---|---|---|
| Answer Accuracy | > 85% | User trust |
| Hallucination Rate | < 8% | Reliability |
| Cost per 1K Tokens | ↓ MoM | Efficiency |
| Cache Hit Rate | > 40% | Cost reduction |
| Safety False Negative | < 2% | Compliance |
| Latency p95 | < 2.5s | UX quality |
Troubleshooting Matrix
| Issue | Cause | Resolution | Prevention |
|---|---|---|---|
| High Hallucination | Weak retrieval | Improve indexing & top-k | Hybrid retrieval |
| Rising Cost | Prompt bloat | Prompt diff & compression | Auto diff alerts |
| Safety Miss | New pattern | Expand regex/ML filter | Weekly pattern review |
| Slow Responses | Cold start instances | Warm pool + autoscale | Pre-scaling strategy |
| Poor Accuracy | Outdated context | Re-embed corpus | Schedule re-embedding |
| Tool Failures | Arg mismatch | Strict schema validation | Tool registry tests |
Best Practices & Anti-Patterns
| Best Practice | Benefit | Anti-Pattern | Risk |
|---|---|---|---|
| Hybrid retrieval | Higher recall | Single retrieval mode | Missed context |
| Prompt versioning | Reproducibility | Ad-hoc edits | Regression blind spots |
| Structured evaluation | Quantified quality | Manual spot check only | Quality drift |
| Safety ensemble | Reduced false negatives | Single heuristic | Compliance gaps |
| Cost monitoring | Financial control | Ignoring token trends | Budget overrun |
Roadmap
- Add active learning feedback integration.
- Implement semantic cache eviction policy.
- Expand jailbreak classifier with ML model.
- Introduce multi-lingual retrieval pipeline.
- Deploy real-time token anomaly detector.
Final Summary
Enterprise LLM success demands engineered layers: retrieval rigor, disciplined prompts, adaptive fine-tuning, robust evaluation, and vigilant safety and cost governance, all forming a repeatable system that scales value while constraining risk.
Advanced Orchestration & Scaling
RAG Pipeline State Machine
Define explicit states to avoid silent failures and enable observability.
from enum import Enum
class RAGState(Enum):
VALIDATE_PROMPT = 'validate_prompt'
EMBED_QUERY = 'embed_query'
RETRIEVE = 'retrieve'
RERANK = 'rerank'
BUILD_CONTEXT = 'build_context'
GENERATE = 'generate'
POST_VALIDATE = 'post_validate'
def run_rag(query, cfg):
state = RAGState.VALIDATE_PROMPT
audit = []
try:
if state == RAGState.VALIDATE_PROMPT:
assert len(query) < cfg.max_chars
state = RAGState.EMBED_QUERY; audit.append(state.value)
qv = cfg.embed_fn(query)
state = RAGState.RETRIEVE; audit.append(state.value)
docs = cfg.vector_index.search(qv, k=cfg.base_k)
state = RAGState.RERANK; audit.append(state.value)
ranked = cfg.reranker(docs, query)
state = RAGState.BUILD_CONTEXT; audit.append(state.value)
ctx = cfg.context_builder(ranked)
state = RAGState.GENERATE; audit.append(state.value)
answer = cfg.llm.generate(cfg.prompt_template.format(context=ctx, question=query))
state = RAGState.POST_VALIDATE; audit.append(state.value)
if cfg.hallucination_score(answer, ranked) > cfg.max_hallu:
answer = cfg.refine(answer, ranked)
return answer, audit
except Exception as e:
return f"ERROR: {e}", audit
State audit enables traceability and SLA attribution (e.g., latency per stage).
Throughput Engineering
- Batch embeddings: group 32–64 queries per API call (batching sketch below).
- Async retrieval fan-out: parallel vector + keyword + graph stores.
- Streaming decode with early cancellation on safety triggers.
- Adaptive top-k: increase only when semantic density low.
- Shard indexes by semantic domain to reduce search space.
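Embedding Batching Sketch
A sketch of the batching item above, assuming embed_batch is a client method that accepts a list of strings and returns one vector per string:
def batch_embed(queries, embed_batch, batch_size=64):
    # Group queries so each API call carries a full batch instead of a single text.
    vectors = []
    for i in range(0, len(queries), batch_size):
        vectors.extend(embed_batch(queries[i:i + batch_size]))
    return vectors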
Memory Hybrid (Summary + Vector)
Combine rolling conversation summary with episodic memory vectors:
class HybridMemory:
    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.summary = ""
        self.store = []  # [(vec, text)]
    def update(self, turn_text):
        # summarize() is an assumed external helper (e.g., an LLM summarization call)
        self.summary = summarize(self.summary + "\n" + turn_text)
        vec = self.embed_fn(turn_text)
        self.store.append((vec, turn_text))
    def recall(self, query, k=5):
        # cosine() is an assumed similarity helper over embedding vectors
        qv = self.embed_fn(query)
        scored = sorted(self.store, key=lambda vt: cosine(qv, vt[0]), reverse=True)[:k]
        episodic = "\n".join(t for _, t in scored)
        return f"SUMMARY:\n{self.summary}\nEPISODIC:\n{episodic}"[:2000]
Use summary for global continuity, episodic vectors for precision details.
Prompt Drift Governance
Track embedding similarity of evolving system prompts vs baseline; if similarity < 0.85, trigger review.
def prompt_drift(baseline, current, embed):
bv = embed(baseline); cv = embed(current)
sim = cosine(bv, cv)
return sim < 0.85, sim
Multi-Model Router
Latency-sensitive vs reasoning-heavy queries routed by intent classifier; maintain per-model cost & quality KPIs.
def route(query, intent_cls, small_llm, large_llm):
intent = intent_cls(query)
if intent in {'status','faq','short'}:
return small_llm.generate(query), 'small'
return large_llm.generate(query), 'large'
Extended Evaluation Metrics
Perplexity (Proxy Fluency)
def perplexity(model, tokens):
import math
log_probs = model.log_probs(tokens)
avg_log = sum(log_probs)/len(log_probs)
return math.exp(-avg_log)
Use only with models that expose a log-prob interface; drift beyond an established threshold indicates degradation.
Toxicity & Safety Ensemble
def safety_score(text, classifiers):
scores = [c.predict_proba(text)[1] for c in classifiers]
return sum(scores)/len(scores)
Escalate if score > 0.4 (soft) or >0.7 (hard block).
Hallucination Grounding Delta
Compute ratio difference between original and refined answer grounding ratios to quantify mitigation effectiveness.
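A minimal sketch, reusing the sentence-level grounding_ratio helper defined under "Factual Grounding via Sentence-Level Alignment" below; a positive delta means the refinement pass improved grounding:
def grounding_delta(original_sents, refined_sents, context_sents):
    before = grounding_ratio(original_sents, context_sents)
    after = grounding_ratio(refined_sents, context_sents)
    return after - before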
Cost Forecast Scenario Analysis
def forecast(monthly_queries, avg_prompt_tokens, avg_completion_tokens, price_prompt, price_completion):
prompt_cost = (monthly_queries * avg_prompt_tokens/1000) * price_prompt
completion_cost = (monthly_queries * avg_completion_tokens/1000) * price_completion
return {
'prompt_cost': prompt_cost,
'completion_cost': completion_cost,
'total': prompt_cost + completion_cost
}
Run scenarios: baseline, +20% volume, +30% longer prompts; record variance for budget governance.
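A scenario sweep sketch using forecast(); volumes and unit prices are placeholders:
SCENARIOS = {
    "baseline": dict(monthly_queries=100_000, avg_prompt_tokens=900, avg_completion_tokens=300),
    "+20% volume": dict(monthly_queries=120_000, avg_prompt_tokens=900, avg_completion_tokens=300),
    "+30% prompt length": dict(monthly_queries=100_000, avg_prompt_tokens=1170, avg_completion_tokens=300),
}
def run_scenarios(price_prompt=0.003, price_completion=0.004):
    # Record per-scenario totals so variance can feed budget governance.
    return {name: forecast(price_prompt=price_prompt, price_completion=price_completion, **kw)
            for name, kw in SCENARIOS.items()}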
Retrieval Scoring Fusion Formula
Final score = 0.5 * dense_similarity + 0.3 * bm25_norm + 0.2 * recency_decay. This weighted approach balances semantic relevance, lexical coverage, and freshness for dynamic corpora.
def fused_score(dense, bm25, days_old):
recency = 1/(1 + 0.05*days_old)
return 0.5*dense + 0.3*bm25 + 0.2*recency
Semantic Cache Eviction Policy
- LRU baseline.
- Promote entries with high grounding ratio reuse.
- Evict entries falling below 0.6 average similarity across last 5 hits.
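A sketch combining these rules, assuming each cache entry tracks a rolling window of its last five hit similarities:
from collections import OrderedDict
class SemanticCache:
    def __init__(self, capacity=1000):
        self.entries = OrderedDict()  # key -> {"answer": ..., "recent_sims": [...]}
        self.capacity = capacity
    def put(self, key, answer):
        self.entries[key] = {"answer": answer, "recent_sims": []}
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # LRU baseline eviction
    def record_hit(self, key, similarity):
        entry = self.entries[key]
        entry["recent_sims"] = (entry["recent_sims"] + [similarity])[-5:]
        self.entries.move_to_end(key)  # promote reused entries (a grounding-ratio weight could refine this)
        if len(entry["recent_sims"]) == 5 and sum(entry["recent_sims"]) / 5 < 0.6:
            del self.entries[key]  # quality-based eviction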
Scaling Patterns
| Pattern | Benefit | Trade-off |
|---|---|---|
| Prompt compression | Lower cost | Possible context loss |
| Distillation | Faster inference | Training effort |
| LoRA adapters | Targeted specialization | Additional storage |
| Quantization | Throughput gain | Minor quality drop |
| Caching | Latency & cost reduction | Stale risk |
Key Takeaways
- Treat LLM stack as governed pipeline with observable states.
- Blend summary + episodic memory for conversational continuity.
- Fuse heterogeneous retrieval scores for balanced relevance.
- Continuously measure hallucination mitigation effectiveness.
- Proactively model cost scenarios to avoid budget surprises.
- Route intelligently across model sizes for efficiency.
- Enforce prompt drift guardrails for stability.
- Evaluation is multi-dimensional: fluency, grounding, safety, cost.
Advanced Quantitative Evaluation
BLEU & ROUGE Batch
from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer
def eval_text_metrics(dataset, model):
bleu_scores = []; rouge_scores = []
scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)
for item in dataset:
gen = model.generate(item['prompt'])
bleu_scores.append(sentence_bleu([item['reference'].split()], gen.split()))
rouge_scores.append(scorer.score(item['reference'], gen)['rougeL'].fmeasure)
return {
'bleu_mean': sum(bleu_scores)/len(bleu_scores),
'rougeL_mean': sum(rouge_scores)/len(rouge_scores)
}
Structured Extraction Accuracy
def extraction_accuracy(outputs, gold):
import json
correct = 0
for o,g in zip(outputs, gold):
o_d = json.loads(o); g_d = json.loads(g)
if o_d == g_d: correct += 1
return correct/len(outputs)
Embedding Model Selection Criteria
| Factor | Consideration | Impact |
|---|---|---|
| Dimensionality | 384 vs 768 vs 1536 | Memory & recall |
| Domain Adaptation | Finetuned on in-domain corpus | Precision |
| Latency | ms per vector batch | Throughput |
| Cost | $ per 1K embeddings | Budget |
| Multilingual | Cross-lingual alignment | Global coverage |
Cache Design (Semantic + Exact)
semantic_cache = {}
def semantic_get(query, embed_fn, threshold=0.92):
    qv = embed_fn(query)  # assumed to return a numpy array for the dot-product below
for stored_q, (vec, answer) in semantic_cache.items():
sim = (qv @ vec)/( (qv**2).sum()**0.5 * (vec**2).sum()**0.5 )
if sim >= threshold: return answer
return None
Populate after successful validated generations; expire entries via LRU or concept-drift detection.
Dynamic k Retrieval Tuning
Increase top-k when the query complexity score is high or the initial average retrieval similarity falls below a threshold.
def dynamic_k(query_complexity, base_k=6):
if query_complexity < 0.4: return base_k
if query_complexity < 0.7: return base_k + 2
return base_k + 4
Streaming Token Handling
Use incremental evaluation: early tokens are scanned for banned content, and generation is aborted if a risk signature is detected.
def stream_guard(stream_tokens, banned):
buf = []
for t in stream_tokens:
buf.append(t)
if any(b in t.lower() for b in banned):
return buf, 'ABORT'
return buf, 'OK'
Factual Grounding via Sentence-Level Alignment
def grounding_ratio(answer_sentences, context_sentences):
import difflib
hits = 0
for a in answer_sentences:
if max(difflib.SequenceMatcher(None, a, c).ratio() for c in context_sentences) > 0.75:
hits += 1
return hits / max(len(answer_sentences),1)
Temperature vs Diversity Curve
Plot distinct n-gram ratio vs temperature; choose sweet spot balancing creativity and coherence.
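A sketch of the distinct n-gram measurement, assuming a hypothetical llm.generate(prompt, temperature=...) interface:
def distinct_ngram_ratio(text, n=2):
    # Distinct-n: unique n-grams over total n-grams; higher implies more diverse output.
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)
def temperature_sweep(llm, prompt, temperatures=(0.2, 0.4, 0.6, 0.8, 1.0)):
    return {t: distinct_ngram_ratio(llm.generate(prompt, temperature=t)) for t in temperatures}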
Prompt Cost Diff Audit
def cost_diff(old_prompt, new_prompt, tokenizer):
old_tokens = len(tokenizer(old_prompt))
new_tokens = len(tokenizer(new_prompt))
return {
'old': old_tokens,
'new': new_tokens,
'delta': new_tokens - old_tokens,
'pct_change': (new_tokens - old_tokens)/max(old_tokens,1)
}
Govern changes; reject > 20% token increase without justification.
Model Selection Matrix
| Model | Strength | Weakness | Use Case |
|---|---|---|---|
| Small LLM | Fast, cheap | Limited reasoning | Simple FAQ |
| Medium LLM | Balanced | Occasional hallucination | General enterprise QA |
| Large LLM | High reasoning | Costly | Complex synthesis |
| Fine-Tuned | Domain optimized | Maintenance overhead | Specialized compliance |
Responsible Use Checklist
| Item | Status |
|---|---|
| PII Filtering | Pending |
| Safety Classifier Ensemble | Pending |
| Prompt Version Logged | Pending |
| Source Citation Included | Pending |
| Token Budget Reviewed | Pending |
| Fairness & Bias Scan (if generative decisions) | Pending |
Incident Playbook (LLM)
| Step | Action |
|---|---|
| Detection | Alert: high hallucination or safety breach |
| Containment | Disable risky feature flag, enable stricter filters |
| Analysis | Review prompts, retrieval logs, offending output |
| Mitigation | Adjust prompt, expand context set, retrain safety classifier |
| Verification | Re-run evaluation harness |
| Documentation | Log cause + changes |
Governance Integration Hooks
- Emit prompt_version and retrieval_doc_ids in telemetry for audit.
- Store cost projections vs actual monthly spend.
- Link safety incident IDs to risk register entries.
Extended References
- Chain-of-Thought Prompting (Wei et al.)
- Self-Ask Strategies
- RAG Fusion Techniques
- LoRA / QLoRA Implementation Guides
- Semantic Caching Research