Key Learnings

Critical insights, recommendations, and best practices discovered through the optimization process, distilled into actionable takeaways for improving prompt engineering, RAG systems, and model selection.

Critical Insights

Model Selection Criteria

  • Tool calling reliability is non-negotiable: Even a 4% failure rate (as observed with Gemini 2.5 Flash) is unacceptable for production use.
  • Formatting consistency requires explicit instructions: Models like Kimi K2 need more guidance than Claude Sonnet 4.5, but can achieve similar results with proper prompt engineering.
  • Cost vs. capability trade-offs: Kimi K2 provides excellent tool calling at low cost, making it ideal for production despite requiring more explicit formatting instructions.
  • Latency matters: GPT-5 Mini's 25-second time to first token (TTFT) makes it unsuitable for real-time educational applications.

Prompt Engineering Insights

  • Categorization improves comprehension: Organizing related guidelines together helps models understand how they relate and reduces the chance that individual guidelines are overlooked.
  • Repetition for emphasis works: Repeating critical instructions throughout the prompt increases the probability of compliance.
  • Separation of concerns: Formatting rules must be clearly separated from tool call instructions to prevent functional breakage (see the prompt sketch after this list).
  • Lower variability improves consistency: This is especially important for Kimi K2 and other models that benefit from reduced temperature settings.
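
A minimal sketch of how these techniques can be combined in a system prompt; the section names, wording, and the reference-search instruction are illustrative, not the production prompt:

```python
# Illustrative system prompt skeleton: categorized sections, a repeated
# critical instruction, and formatting rules kept separate from tool rules.
CRITICAL = "Call the reference-search tool BEFORE citing any source."

SYSTEM_PROMPT = f"""
## Tool Use
{CRITICAL}
Wait for the tool result before continuing your answer.

## Formatting (never applies to tool calls)
- Bold each question; ask at most 2 questions at a time.
- Use Unicode for math (x², ½), never LaTeX or Markdown math.

## Response Style
Keep answers concise; expand only when the user asks.

Reminder: {CRITICAL}
"""
```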

RAG System Insights

  • Tool execution order matters: Source material must be retrieved before the model cites it, otherwise the model hallucinates references (see the sketch after this list).
  • Models can confuse internal knowledge with retrieved content: Explicit workflow instructions prevent this confusion.
  • Search References Tool improves consistency: Migrating from traditional RAG to the Search References Tool provides better control over retrieval.
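
One way to enforce that ordering programmatically; a sketch assuming an OpenAI-style message log, a tool named search_references, and a `[ref:` citation marker, all of which are hypothetical names rather than the project's actual conventions:

```python
def references_retrieved(messages: list[dict]) -> bool:
    """True only if a search_references tool call has already executed
    in this conversation (tool name is hypothetical)."""
    return any(
        m.get("role") == "tool" and m.get("name") == "search_references"
        for m in messages
    )

def guard_response(messages: list[dict], draft: str) -> str:
    # Block citation-style text when no retrieval actually happened,
    # forcing a retry instead of letting the model invent sources.
    if "[ref:" in draft and not references_retrieved(messages):
        raise RuntimeError("Draft cites references before retrieval ran")
    return draft
```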

Recommendations for Improvement

For Prompt Engineering

  • Use categorization to organize related guidelines
  • Repeat critical instructions multiple times
  • Separate formatting rules from tool call instructions
  • Specify question formatting requirements (bold, max 2 at a time)
  • Use Unicode for math notation instead of LaTeX/Markdown (a few examples follow this list)
  • Include explicit verbosity control instructions
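
Unicode renders reliably in plain chat interfaces where LaTeX or Markdown math may not. A few illustrative substitutions of the kind that could be listed in the prompt's formatting section:

```python
# Illustrative only: common LaTeX fragments and their Unicode equivalents.
UNICODE_MATH = {
    r"x^2": "x²",
    r"\frac{1}{2}": "½",
    r"\sqrt{x}": "√x",
    r"\pi": "π",
    r"\le": "≤",
}
```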

For RAG Systems

  • Always execute tool calls before referencing their results
  • Add confidence labels and source tagging to retrieved content (sketched after this list)
  • Implement retrieval verification steps
  • Lower variability for better instruction following
  • Provide explicit workflow instructions
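
A sketch of what source tagging plus a retrieval verification step could look like; the field names and the 0.75 confidence threshold are assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass
class RetrievedChunk:
    text: str
    source: str    # e.g. document title or URL
    score: float   # retrieval similarity score

def tag_chunks(chunks: list[RetrievedChunk], threshold: float = 0.75) -> list[str]:
    """Verify retrieval returned usable content, then label each chunk
    with its source and a coarse confidence band for the model to echo."""
    if not chunks:
        raise ValueError("Retrieval returned nothing; do not answer from memory")
    return [
        f"[source: {c.source} | confidence: "
        f"{'high' if c.score >= threshold else 'low'}]\n{c.text}"
        for c in chunks
    ]
```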

For Cost Optimization

  • Implement progressive disclosure to reduce token usage (see the sketch after this list)
  • Add format specification early in conversation flow
  • Create smart defaults for common patterns
  • Use response length controls
  • Migrate to cost-effective models with similar capabilities
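
A sketch of progressive disclosure combined with a hard response length cap, assuming an OpenAI-compatible client and a `kimi-k2` model identifier (both illustrative):

```python
def answer(client, question: str, detail: bool = False) -> str:
    # Progressive disclosure: short answer by default, full detail only
    # when the user explicitly asks, which cuts output tokens on most turns.
    style = ("Give the complete, detailed response." if detail
             else "Answer in at most 3 sentences; offer to expand.")
    resp = client.chat.completions.create(
        model="kimi-k2",                     # assumed model identifier
        max_tokens=1200 if detail else 300,  # hard response length control
        messages=[
            {"role": "system", "content": style},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content
```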

For User Experience

  • Do not provide answer keys automatically; ask the user to reupload or paste the assessment instead
  • Add an automatic MCQ linter that runs before responses are sent (a minimal version follows this list)
  • Standardize assessment blueprints per subject/level
  • Implement context persistence with confirmation shortcuts
  • Default to single optimal activity design with alternatives option
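
A minimal version of such a linter; the specific checks are illustrative and would be tuned to the app's question guidelines:

```python
import re

def lint_mcq(question: str, options: list[str], answer_key: set[str]) -> list[str]:
    """Return a list of problems found in a multiple-choice question;
    an empty list means the question passes the linter."""
    problems = []
    if not question.strip().endswith("?"):
        problems.append("Stem should be phrased as a question")
    if len(options) != 4:
        problems.append(f"Expected 4 options, got {len(options)}")
    if len(answer_key) != 1:
        problems.append("Exactly one option must be keyed correct")
    if len({o.strip().lower() for o in options}) != len(options):
        problems.append("Duplicate options detected")
    if any(re.search(r"all of the above", o, re.I) for o in options):
        problems.append("Avoid 'all of the above' distractors")
    return problems
```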

Technical Challenges and Solutions

Challenge: Tool Calling Failures

Problem: Gemini 2.5 Flash attempted to use XML tags for tool calls, breaking functionality and displaying raw JSON to users.

Solution: Disqualified Gemini and selected Kimi K2 for reliable tool calling. Implemented explicit tool call format instructions in prompts (illustrated below).
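
The kind of explicit instruction this refers to, as an illustrative prompt fragment (the production wording may differ):

```python
TOOL_FORMAT_RULE = (
    "When you need a tool, emit the call through the API's native "
    "tool-calling mechanism only. Never write XML tags such as "
    "<tool_call>, and never print raw JSON in your visible reply."
)
```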

Challenge: Reference Hallucination

Problem: Kimi K2 responds as though it has already read references even when the retrieval tool calls have not been executed.

Solution: Explicit workflow instructions requiring tool calls to be executed before referencing their results. Lower variability settings improve compliance.

Challenge: Formatting Inconsistencies

Problem: Kimi K2 does not format output as naturally as Claude Sonnet 4.5.

Solution: Added explicit formatting instructions including markdown rules, question formatting requirements, and Unicode for math notation.

Challenge: High Costs

Problem: Claude Sonnet 4.5 costs $3 per million input tokens and $15 per million output tokens, and hits rate limits at scale.

Solution: Migrated to Kimi K2 for cost-effective tool calling, implemented token reduction strategies (a 30-40% reduction), and added progressive disclosure (a rough cost illustration follows).
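
A rough illustration of the arithmetic, using the list prices above and assumed per-conversation token counts:

```python
# Claude Sonnet 4.5 list pricing: $3 per 1M input tokens, $15 per 1M output.
IN_RATE, OUT_RATE = 3 / 1_000_000, 15 / 1_000_000

def conversation_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * IN_RATE + output_tokens * OUT_RATE

baseline = conversation_cost(10_000, 2_000)  # $0.06 (assumed token counts)
reduced  = conversation_cost(6_500, 1_300)   # 35% fewer tokens -> $0.039
print(f"Token reduction alone saves {1 - reduced / baseline:.0%} per conversation")
```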

Best Practices Discovered

Prompt Engineering

  • Use clear categorization
  • Repeat critical instructions
  • Separate formatting from tool calls
  • Specify exact formatting requirements

Model Selection

  • Prioritize tool calling reliability
  • Consider cost vs. capability trade-offs
  • Test latency for real-time use cases
  • Evaluate formatting quality

RAG Implementation

  • Execute tool calls before referencing
  • Lower variability for consistency
  • Add explicit workflow instructions
  • Verify retrieval before use

Cost Optimization

  • Implement progressive disclosure
  • Add format specification early
  • Create smart defaults
  • Control response length

Actionable Takeaways for Anthropic Team

  • Claude's natural formatting is a competitive advantage: The system prompt and training data give Claude strong formatting abilities that other models struggle to match without explicit instructions.
  • Tool calling reliability is critical: Even small failure rates create terrible user experiences; models must handle tool calls correctly and consistently.
  • Prompt engineering can bridge capability gaps: With proper prompt engineering, cheaper models can match more expensive ones, though they require more explicit instructions.
  • RAG systems need explicit workflow control: Models can confuse internal knowledge with retrieved content. Explicit instructions about tool execution order are essential.
  • Cost optimization requires multiple strategies: Model migration, token reduction, and workflow optimization all contribute to significant cost savings.