Visual Reasoning Is Coming Soon

https://news.ycombinator.com/rss Hits: 11
Summary

I gotta say, I love living in exponential times. I can just wish that something existed, and within a month it does! This time it happened with OpenAI's 4o image generation release. In this blog post I'll briefly cover the release and why I think it's pretty cool. Then I'll dive into a new opportunity that I think is even more exciting: visual reasoning.

Rather watch than read?

Hey, I get it - sometimes you just want to kick back and watch! Check out this quick video where I walk through everything in this post. Same great content, just easier on the eyes!

Why Image Manipulation with LLMs Stinks

Working with images in multimodal LLMs has been a mostly one-sided affair. On one hand, it's really cool that you can drop an image into an LLM conversation and get the model to reason about it. But when you ask the model to generate an image, there is a disconnect: all the model can do is describe the image in text and then call out to an external image generation tool to produce an image from that text. Text is a poor communication medium for images, and the result is often far from what you expected, because the short description that the LLM hands to the image generation tool rarely captures the full context of the conversation.

The problem is most pronounced when you go back and forth working on an idea for an image. You can show the LLM an image of your cat and then say "make this cat wear a detective hat and a monocle". The best the model can do is put a detective hat and monocle on some cat, not the one in your image.

To make matters worse, the model can't even see the image that it has just created. So if you ask for a modification to the first generation attempt, the subsequent generations are really just starting over from scratch and hoping that a more detailed description to the image generation tool will make things better... it won't.
Left: OpenAI's pet c...

First seen: 2025-04-09 16:37

Last seen: 2025-04-10 02:41