Qwen

under evaluation

Still under evaluation. Qwen generally performs well but needs prompt refinement to control unwanted behaviors such as emoji overuse and sycophancy.

Tool Calling Reliability

Good

Strengths

  • Generally performs well
  • Successfully follows most instructions

Weaknesses

  • Excessive use of emojis and emoticons
  • Tendency toward sycophantic behavior
  • Not as naturally aligned with desired formatting

Key Notes

  • Requires additional prompt engineering to control emoji usage
  • Needs refinement for professional tone
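
One way to check whether prompt changes actually reduce emoji usage is to scan responses automatically. The following is a minimal sketch, not part of the evaluation itself: the function name `find_emoji_violations`, the Unicode ranges, and the emoticon pattern are all assumptions chosen for illustration, and the ranges do not cover every emoji.

```python
import re

# Assumed ranges: common pictographic emoji blocks plus Misc Symbols/Dingbats.
# This is an illustrative subset, not a complete definition of "emoji".
EMOJI_PATTERN = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

# Simple Western-style emoticons such as :) ;-) =D
# The lookbehind avoids matching ratios like "3:1" or URLs like "http://".
EMOTICON_PATTERN = re.compile(r"(?<![\w:])[:;=]-?[)(DP](?!\w)")


def find_emoji_violations(text: str) -> list[str]:
    """Return every emoji, then every emoticon, found in a model response."""
    return EMOJI_PATTERN.findall(text) + EMOTICON_PATTERN.findall(text)


if __name__ == "__main__":
    sample = "Great job! \U0001F389 :)"
    print(find_emoji_violations(sample))
```

Running a check like this over a batch of responses before and after a prompt change gives a rough count of how often the unwanted behavior still appears.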

Detailed Notes & Findings

Strengths

  • Clarification: Qwen handles nonsensical inputs very well. When given input designed to stump it, it asks for clarification before moving on.
  • Natural Formatting: Qwen's formatting is already very good out of the box.

Areas for Improvement

  • Over-eager Execution: Qwen sometimes works through too many steps in a single response. It needs explicit prompting to limit this.
  • Verbose Steps: Needs explicit instructions not to label its output with "STEP 1" or "STEP 2".
  • Over-engineering: Because Qwen's formatting is naturally good, applying strict formatting constraints (like those written for Gemini) results in over-engineered output.
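
The pacing and step-label issues above could be addressed with a short prompt addendum plus an automated check on the output. The sketch below is hypothetical: the rule wording, `QWEN_PACING_RULES`, `build_pacing_addendum`, and `has_step_labels` are illustrative names and text, not the prompts used in this evaluation.

```python
import re

# Hypothetical rule wording; adjust to match the app's actual system prompt.
QWEN_PACING_RULES = [
    "Complete only the current step, then wait for the user's reply.",
    'Do not label your output with "STEP 1", "STEP 2", or similar headings.',
]

# Matches a step label at the start of a line, optionally preceded by
# markup characters such as "**", "#", ">", or "-".
STEP_LABEL_PATTERN = re.compile(
    r"^[\s*#>-]*STEP\s+\d+\b", re.IGNORECASE | re.MULTILINE
)


def build_pacing_addendum(rules=QWEN_PACING_RULES) -> str:
    """Format the rules as a bulleted block to append to a system prompt."""
    return "\n".join(f"- {rule}" for rule in rules)


def has_step_labels(response: str) -> bool:
    """Return True if any line of the response starts with a step label."""
    return STEP_LABEL_PATTERN.search(response) is not None


if __name__ == "__main__":
    print(build_pacing_addendum())
    print(has_step_labels("STEP 1: Read the passage."))
```

The pattern only looks at the start of each line, so a sentence that mentions "step 2" in passing is not flagged.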

Formatting Observations

Applying Gemini-specific constraints to Qwen degrades its performance through over-engineering. Qwen handles formatting well natively and shouldn't be constrained as heavily.
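
One way to keep Gemini-oriented constraints from being applied to Qwen by accident is to select the formatting constraint level per model family. This is a minimal sketch under stated assumptions: the level names `"strict"` and `"light"`, the mapping, and the function `formatting_level` are hypothetical and are not defined anywhere in these notes.

```python
# Hypothetical constraint levels per model family; the names "strict" and
# "light" are illustrative only.
FORMATTING_LEVEL_BY_MODEL = {
    "gemini": "strict",
    "qwen": "light",
}


def formatting_level(model_name: str, default: str = "light") -> str:
    """Pick a formatting constraint level from the model family prefix."""
    key = model_name.strip().lower()
    for family, level in FORMATTING_LEVEL_BY_MODEL.items():
        if key.startswith(family):
            return level
    # Assumption: unknown models get the lighter level by default.
    return default


if __name__ == "__main__":
    print(formatting_level("Qwen"))
    print(formatting_level("Gemini"))
```

The prompt builder can then include the heavier formatting rules only when the returned level is `"strict"`.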

Formatting Issues Reference

QWEN Formatting issues from overengineeering.pdf (See attachments)

Prototype Apps

Embedded prototypes demonstrating Qwen performance across different subjects.

English App

Economics App

Intervention App

Social Studies App

Mathematics App
