Evaluating LLMs for my personal use case

https://news.ycombinator.com/rss Hits: 11

Summary

It’s great that AI can win maths Olympiads, but that’s not what I’m doing. I mostly ask basic Rust, Python, Linux and life questions. So I did my own evaluation. I gathered 130 real prompts from my bash history (I use command line tool llm). I had Qwen3 235B Thinking and Gemini 2.5 Pro group them into categories. They both chose very similar ones, broadly (with examples): Programming - “Write a bash script to ..” Sysadmin - “With curl how do I ..” Technical explanations - “Explain underlay networks in a data center” General knowledge and creative tasks - “Recipe for blackened seasoning” Then I had GPT-OSS-120B and GLM 4.5 pick three queries for each category from the 130 prompts. I used that to help me pick three entries per category, they are listed at the end. I use Open Router everyday, and I used it for these evals. It’s the only place I know that has all the models, great prices, low latency, and a very sane API. I use my own fast and simple Rust CLI called ort. The set of models I chose to evaluate was based on my past experience with them, various leaderboards and their cost on Open Router. It is a mixture of reasoning, non-reasoning and hybrid models. I evaluated: anthropic/claude-sonnet-4 without reasoning anthropic/claude-sonnet-4 with reasoning (I didn’t realise Sonnet can think!) deepseek/deepseek-chat-v3-0324 deepseek/deepseek-r1-0528 google/gemini-2.5-flash google/gemini-2.5-pro moonshotai/kimi-k2 openai/gpt-oss-120b qwen/qwen3-235b-a22b-2507 qwen/qwen3-235b-a22b-thinking-2507 z-ai/glm-4.5 with reasoning For the three programming questions I added these at the last minute, mostly because I was enjoying the process: inception/mercury-coder-small-beta mistralai/devstral-medium-2507 qwen/qwen3-coder-480b-a35b-07-25 z-ai/glm-4.5-air without reasoning These extra coding models are not included in most of the results. Mercury Coder was decent and very fast, Qwen3 Coder was surprisingly bad. I wrote a Rust eval script to run the prompts against the models and...

First seen: 2025-08-24 03:56

Last seen: 2025-08-24 14:10

Read Full Article More from this Source

Evaluating LLMs for my personal use case

Summary

Related News

Microsoft PowerToys

Show HN: TailGuard – Bridge your WireGuard router into Tailscale via a container

The Scam Called "You Don't Have to Remember Anything"

E-Paper Display Refresh Rate Reaches New Heights

PKM apps need to get better at resurfacing information