Here's a framework for prompt optimization:

Defining Success: Metrics and Evaluation Criteria

Before collecting any data, establish what success looks like for your specific use case. Choose a primary metric that directly reflects business value: accuracy for classification, F1 for imbalanced datasets, BLEU/ROUGE for generation tasks, or custom domain-specific measures like "percentage of correctly extracted invoice fields" or "customer issue resolution rate." This primary metric drives optimization decisions.

Alongside your primary metric, define auxiliary constraints that you won't compromise on. These include output format compliance (does the JSON parse?), latency requirements (under 2 seconds per request), cost bounds ($0.01 per query), and safety requirements (no PII leakage, no harmful content). Treat these as pass/fail gates rather than metrics to optimize.

If you're using LLM judges for evaluation, which is common for subjective tasks like writing quality or helpfulness, implement careful controls: randomize the order of responses being compared, normalize for length biases, use structured rubrics rather than open-ended judgments, and periodically validate against human evaluation. Remember that LLM judges can be gamed, so never use them as the sole evaluation method for high-stakes deployments.

Data for Evaluation

Once metrics are defined, determine how much data you need for statistically valid comparisons. To estimate a metric to within a ±3 percentage point margin of error at 95% confidence, you'll need approximately 1,000 labeled examples; for a ±5 point margin, around 400 examples suffice.

Random sampling is critical: your evaluation data must represent the true distribution of inputs your system will face in production. Stratified sampling is cheap and typically tightens standard errors, especially when important subgroups are rare. Split your data thoughtfully: for small datasets (<1k examples), use K-fold cross-validation on the combined train and dev sets, reserving a single held-out test set for final evaluation.
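The rough sample-size figures above come from the standard normal-approximation formula for a proportion, n = z²·p(1−p)/e², evaluated at the worst case p = 0.5. A minimal sketch (the function name is illustrative, not from the original):

```python
import math

def required_sample_size(margin: float, p: float = 0.5, z: float = 1.96) -> int:
    """Labeled examples needed to estimate a proportion metric to within
    +/- `margin` at the confidence level implied by `z` (1.96 for 95%).
    p = 0.5 is the worst case, since p*(1-p) is maximized there."""
    return math.ceil(z**2 * p * (1 - p) / margin**2)

# ~1,068 examples for a +/-3 point margin; ~385 for +/-5 points,
# matching the rough figures of 1,000 and 400 quoted above.
print(required_sample_size(0.03), required_sample_size(0.05))
```

If you have a prior estimate of the metric (say, accuracy near 0.9), passing it as `p` shrinks the required sample size, since variance falls as p moves away from 0.5.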
For la...