# Longform Creative Writing Benchmark

This benchmark evaluates several abilities:

- Brainstorming & planning out a short story/novella from a minimal prompt
- Reflecting on the plan & revising it
- Writing a short story/novella over 8x 1000-word turns

Models are typically evaluated via OpenRouter, using temp=0.7 and min_p=0.1 as the generation settings. Outputs are evaluated against a scoring rubric by Claude Sonnet 3.7.

## Length

The average chapter length (chars).

## Slop Score

The Slop column measures the frequency of words/phrases typically overused by LLMs ("GPT-isms") in each completed chapter. Lower is better.

## Repetition Metric

The Repetition column measures how strongly a model repeats words/phrases across multiple tasks. Higher means more repetition.

## Degradation

A mini-sparkline of the 8 chapter scores (averages), used to visualize whether the model's chapter quality drops off as it continues writing. The degradation score is the absolute value of the trendline's gradient.

## Score (0-100)

The overall final rating assigned by the judge LLM, scaled to 0-100. Higher is better.
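To illustrate the generation settings, here is a minimal sketch of a single chapter-writing call through OpenRouter's OpenAI-compatible chat completions endpoint, with temperature and min_p set to the benchmark's values. The model name, `max_tokens` value, and prompt handling are placeholders rather than the benchmark's actual harness.

```python
# Minimal sketch of querying a model through OpenRouter with the benchmark's
# sampling settings (temperature=0.7, min_p=0.1). The payload follows
# OpenRouter's OpenAI-compatible chat completions API; the model name,
# max_tokens, and API key handling are placeholders (assumptions).
import os
import requests

def generate_chapter(prompt: str, model: str = "anthropic/claude-3.7-sonnet") -> str:
    response = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.7,  # benchmark generation setting
            "min_p": 0.1,        # benchmark generation setting
            "max_tokens": 2000,  # roomy enough for a ~1000-word turn (assumption)
        },
        timeout=300,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]
```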
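The slop score can be thought of as a normalized count of flagged phrases per chapter. The sketch below uses a tiny illustrative phrase list and a simple per-word normalization; the benchmark's actual slop list and normalization are not reproduced here.

```python
# Minimal sketch of a slop-style metric: count occurrences of overused
# words/phrases ("GPT-isms") in a chapter and normalize by chapter length.
# SLOP_PHRASES is a small illustrative list, not the benchmark's actual list.
import re

SLOP_PHRASES = ["tapestry", "testament to", "delve", "shivers down", "palpable"]

def slop_score(chapter: str) -> float:
    text = chapter.lower()
    hits = sum(len(re.findall(re.escape(phrase), text)) for phrase in SLOP_PHRASES)
    words = max(len(text.split()), 1)
    return hits / words  # lower is better
```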
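One plausible way to compute a cross-task repetition measure is to count n-grams that recur in the model's outputs for different tasks. The benchmark's exact formula is not specified here; the sketch below only illustrates the idea.

```python
# Minimal sketch of a repetition-style metric: the share of distinct trigrams
# that appear in more than one task's output. This is an illustrative
# definition, not the benchmark's actual formula.
from collections import Counter

def repetition_score(task_outputs: list[str], n: int = 3) -> float:
    per_task_ngrams = []
    for text in task_outputs:
        tokens = text.lower().split()
        per_task_ngrams.append({tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)})
    counts = Counter(ng for ngrams in per_task_ngrams for ng in ngrams)
    repeated = sum(1 for c in counts.values() if c > 1)
    return repeated / max(len(counts), 1)  # higher means more repetition
```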
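The degradation score itself is straightforward: fit a linear trendline to the eight per-chapter scores and take the absolute value of its slope. A minimal sketch using numpy's least-squares fit:

```python
# Minimal sketch of the degradation score: fit a linear trendline to the
# 8 per-chapter scores and take the absolute value of its gradient.
import numpy as np

def degradation_score(chapter_scores: list[float]) -> float:
    x = np.arange(len(chapter_scores))  # chapter index 0..7
    slope, _intercept = np.polyfit(x, chapter_scores, deg=1)
    return abs(slope)

# Example: steadily declining chapter scores give a larger degradation score.
print(degradation_score([78, 77, 75, 74, 72, 71, 70, 68]))  # ~1.42
```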