There are quite a few "AI apologists" in the comments, but I think the title is fair when these models are marketed towards low-vision people ("Be My Eyes", https://www.youtube.com/watch?v=Zq710AKC1gg) as the equivalent of human vision. These models are implied to be human-level equivalents when they are not.
This paper demonstrates that there are still some major gaps where simple problems confound the models in unexpected ways. This is important work to elevate; otherwise people may start to believe that these models are suitable for general application when they still need safeguards and copious warnings.
| "maybe I missed the totally rigorous case study on how X company became Y percent more profitable by doing Z thing with LLMs (or similar), and if so I'd be grateful for citations, but neither Google nor any of the big models seem to know about it."
Goldman Sachs recently issued a report. https://www.goldmansachs.com/intelligence/pages/gs-research/... "We estimate that the AI infrastructure buildout will cost over $1tn in the next several years alone, which includes spending on data centers, utilities, and applications. So, the crucial question is: What $1tn problem will AI solve? Replacing low- wage jobs with tremendously costly technology is basically the polar opposite of the prior technology transitions I’ve witnessed in my thirty years of closely following the tech industry" |
Be My Eyes user here. I disagree with your uninformed opinion. Be My Eyes is more often than not more useful than a human. And I am reporting from personal experience. What experience do you have?
There's definitely something interesting to be learned from the examples here - it's valuable work in that sense - but "VLMs are blind" isn't it. That's just clickbait.
LLMs are bad at counting things just in general. It’s hard to say whether the failures here are vision-based or just an inherent weakness of the language model.
If so, it's better than any other OCR on the market.
I think they just train it on a bunch of text. Maybe counting squares in a grid wasn't considered important enough to train for.
Why do you think it's probable? The much smaller LLaVA that I can run on my consumer GPU can also do "OCR", yet I don't believe anyone has hidden an OCR engine inside llama.cpp.
I had a remarkable experience with GPT-4o yesterday. Our garage door started to fall down recently, so I inspected it and found that our landlord had installed the wire rope clips incorrectly, leading to the torsion cables losing tension. I didn't know what that piece of hardware was called, so I asked ChatGPT and it identified the part as I expected it to. As a test, I asked if there was anything notable about the photo. ChatGPT correctly identified that the cables were installed backwards, with the side of the cable that was (previously) under tension on top of the slack end, instead of sandwiched securely in the middle. To diagnose that requires tracing the cable through space and inferring which end is under tension from the geometry, though I can't rule out an educated guess.
What was really remarkable though was that it failed to notice that one of the two nuts was obviously missing, even after I told it there was a second problem with the installation. Screenshot: https://imgur.com/a/QqCNzOM
Vision Transformers do a shocking amount of compression in the tokenizer. In the [Chameleon paper](https://arxiv.org/pdf/2405.09818) they say the tokenizer "encodes a 512 × 512 image into 1024 discrete tokens from a codebook of size 8192". That's 256 pixels per token (512 * 512 / 1024). If we assume that a pixel is 24 bits (3x 8-bit channels), this implies that they've compressed 256 * 24 = 6144 bits into 13 bits (log2(8192) = 13). [An Image is Worth 32 Tokens for Reconstruction and Generation](https://yucornetto.github.io/projects/titok.html) pushes this even further. If these models work similarly, it's no wonder they struggle with some vision tasks.
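A quick back-of-the-envelope check of those figures (a sketch only, using the 512 × 512 image, 1024-token, 8192-entry codebook numbers quoted above):

```python
import math

# Figures quoted above from the Chameleon paper's tokenizer description.
image_side = 512        # pixels per side
num_tokens = 1024       # discrete tokens per image
codebook_size = 8192    # codebook entries
bits_per_pixel = 24     # assuming 3 channels x 8 bits each

pixels_per_token = image_side * image_side / num_tokens   # 256.0
bits_in = pixels_per_token * bits_per_pixel                # 6144.0
bits_out = math.log2(codebook_size)                        # 13.0

print(f"{pixels_per_token:.0f} pixels/token")
print(f"{bits_in:.0f} bits -> {bits_out:.0f} bits per token "
      f"(~{bits_in / bits_out:.0f}x compression)")
```

Roughly a 470x reduction per token before the language model ever sees the image, which fits the point that fine spatial detail is discarded very early.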
For some reason I started thinking about trying to describe the taste of a fruit to someone who has never tried it, as a kind of non-visual sensory analogue of this in humans.
It is embarrassingly, shockingly bad, because these models are advertised and sold as being capable of understanding images.
Evidently, all these models still fall short.
Isn't this the attention mechanism, the reason we're using transformers for these things? Maybe not greater resolution per se, but focusing on a region with greater neural connectivity.
Lots and lots of eye-tracking data, paired with what was being looked at, in order to emulate human attention processing might be one of the lower-hanging fruits for improving it.
I hope you are aware of the fact that LLMs do not have direct access to the stream of words/characters. It is one of the most basic things to know about their implementation.
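For a concrete illustration (a minimal sketch, assuming the tiktoken library is installed; exact token boundaries vary by tokenizer), the model receives integer token IDs rather than a character stream:

```python
import tiktoken

# cl100k_base is the encoding used by GPT-4-era OpenAI models; the model sees
# the integer IDs, not the individual characters that make up each word.
enc = tiktoken.get_encoding("cl100k_base")
text = "How many r's are in strawberry?"
ids = enc.encode(text)
pieces = [enc.decode([i]) for i in ids]

print(ids)     # a short list of integers
print(pieces)  # multi-character chunks, not letters
```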
We see through thoughts and memories. We see when we desire; the vision just adds on a world of thoughts and a consciousness of being conscious.
Vision links thoughts with reality.
They asked the AI a question with a yes/no response. If the AI chose randomly, it would be correct 50% of the time. That’s what “random baseline accuracy” means.
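A tiny illustration of that baseline (a sketch; the numbers are made up): guessing uniformly at random on yes/no questions lands at about 50% accuracy no matter how the true answers are distributed.

```python
import random

random.seed(0)
n = 100_000
# Arbitrary, skewed ground truth: 70% "yes".
truths = ["yes" if random.random() < 0.7 else "no" for _ in range(n)]
# Uniform random guessing, i.e. the "random baseline".
guesses = [random.choice(["yes", "no"]) for _ in range(n)]

accuracy = sum(g == t for g, t in zip(guesses, truths)) / n
print(f"{accuracy:.3f}")  # ~0.5
```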
This could easily be fixed with training and fine-tuning. Simply generate 100,000 examples or so, train against the ground truth for however long you want, and it's a solved problem.
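A minimal sketch of what generating that kind of synthetic data could look like for the grid-counting task mentioned upthread (hypothetical code: it uses Pillow to draw grids with a known square count as the ground-truth label; the file names and JSONL layout are made up):

```python
import json
import random
from PIL import Image, ImageDraw

def make_grid_example(idx, cell=40, max_rows=8, max_cols=8):
    """Draw a simple grid of squares and record the true count as the label."""
    rows, cols = random.randint(1, max_rows), random.randint(1, max_cols)
    img = Image.new("RGB", (cols * cell + 1, rows * cell + 1), "white")
    draw = ImageDraw.Draw(img)
    for r in range(rows):
        for c in range(cols):
            draw.rectangle(
                [c * cell, r * cell, (c + 1) * cell, (r + 1) * cell],
                outline="black",
            )
    path = f"grid_{idx:06d}.png"
    img.save(path)
    return {
        "image": path,
        "question": "How many squares are in this grid?",
        "answer": str(rows * cols),
    }

# Small batch here; scale the range up toward 100,000 for real fine-tuning data.
with open("grid_counting_train.jsonl", "w") as f:
    for i in range(100):
        f.write(json.dumps(make_grid_example(i)) + "\n")
```

The image path plus question/answer pairing is just one plausible shape for the data; the exact format would depend on the fine-tuning pipeline of whichever model you're targeting.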
They interact with pixel buffers as a mathematical array. To call them blind is to confuse what they are doing with the experience of sight...
> their vision is, at best, like that of a person with myopia seeing fine details as blurry
is a crazy thing to write in an abstract. Did they try to probe that hypothesis at all? I could (well, actually I can't) share some examples from my job of GPT-4V doing some pretty difficult fine-grained visual tasks that invalidate this.
Personally, I rate this paper [1], which makes the argument that these huge GenAI models are pretty good at things - assuming they have seen a LOT of that type of data during training (which is true of a great many things). If you make up tasks like this, then yes, they can be REALLY bad at them, and initial impressions of AGI get harder to justify. But in practice, we aren't just making up tasks to trip up these models. They can be very performant on some tasks, and the authors haven't presented any real evidence about which of these two modes applies here.
[1] https://arxiv.org/abs/2404.04125