# Longform Creative Writing Benchmark

This benchmark evaluates several abilities:

- Brainstorming & planning out a short story/novella from a minimal prompt
- Reflecting on the plan & revising it
- Writing a short story/novella over 8x 1000-word turns

Models are typically evaluated via OpenRouter, using temp=0.7 and min_p=0.1 as the generation settings. Outputs are evaluated against a scoring rubric by Claude Sonnet 3.7.

## Length

The average chapter length (chars).

## Slop Score

The Slop column measures the frequency of words/phrases typically overused by LLMs ("GPT-isms") in each completed chapter. Lower is better.

## Repetition Metric

The Repetition column measures how strongly a model repeats words/phrases across multiple tasks. Higher means more repetition.

## Degradation

A mini-sparkline of the 8 chapter scores (averages), used to visualize whether the model's chapter quality drops off as it continues writing. The degradation score is the absolute value of the trendline's gradient.

## Score (0-100)

The overall final rating assigned by the judge LLM, scaled to 0-100. Higher is better.
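To illustrate the generation settings, here is a minimal sketch of a single chapter-writing call through OpenRouter's OpenAI-compatible chat completions endpoint, with temperature and min_p set to the benchmark's values. The model name, `max_tokens` value, and prompt handling are placeholders rather than the benchmark's actual harness.

```python
# Minimal sketch of querying a model through OpenRouter with the benchmark's
# sampling settings (temperature=0.7, min_p=0.1). The payload follows
# OpenRouter's OpenAI-compatible chat completions API; the model name,
# max_tokens, and API key handling are placeholders (assumptions).
import os
import requests

def generate_chapter(prompt: str, model: str = "anthropic/claude-3.7-sonnet") -> str:
    response = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.7,  # benchmark generation setting
            "min_p": 0.1,        # benchmark generation setting
            "max_tokens": 2000,  # roomy enough for a ~1000-word turn (assumption)
        },
        timeout=300,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]
```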
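The slop score can be thought of as a normalized count of flagged phrases per chapter. The sketch below uses a tiny illustrative phrase list and a simple per-word normalization; the benchmark's actual slop list and normalization are not reproduced here.

```python
# Minimal sketch of a slop-style metric: count occurrences of overused
# words/phrases ("GPT-isms") in a chapter and normalize by chapter length.
# SLOP_PHRASES is a small illustrative list, not the benchmark's actual list.
import re

SLOP_PHRASES = ["tapestry", "testament to", "delve", "shivers down", "palpable"]

def slop_score(chapter: str) -> float:
    text = chapter.lower()
    hits = sum(len(re.findall(re.escape(phrase), text)) for phrase in SLOP_PHRASES)
    words = max(len(text.split()), 1)
    return hits / words  # lower is better
```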
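One plausible way to compute a cross-task repetition measure is to count n-grams that recur in the model's outputs for different tasks. The benchmark's exact formula is not specified here; the sketch below only illustrates the idea.

```python
# Minimal sketch of a repetition-style metric: the share of distinct trigrams
# that appear in more than one task's output. This is an illustrative
# definition, not the benchmark's actual formula.
from collections import Counter

def repetition_score(task_outputs: list[str], n: int = 3) -> float:
    per_task_ngrams = []
    for text in task_outputs:
        tokens = text.lower().split()
        per_task_ngrams.append({tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)})
    counts = Counter(ng for ngrams in per_task_ngrams for ng in ngrams)
    repeated = sum(1 for c in counts.values() if c > 1)
    return repeated / max(len(counts), 1)  # higher means more repetition
```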
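The degradation score itself is straightforward: fit a linear trendline to the eight per-chapter scores and take the absolute value of its slope. A minimal sketch using numpy's least-squares fit:

```python
# Minimal sketch of the degradation score: fit a linear trendline to the
# 8 per-chapter scores and take the absolute value of its gradient.
import numpy as np

def degradation_score(chapter_scores: list[float]) -> float:
    x = np.arange(len(chapter_scores))  # chapter index 0..7
    slope, _intercept = np.polyfit(x, chapter_scores, deg=1)
    return abs(slope)

# Example: steadily declining chapter scores give a larger degradation score.
print(degradation_score([78, 77, 75, 74, 72, 71, 70, 68]))  # ~1.42
```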