There are quite a few "AI apologists" in the comments, but I think the title is fair when these models are marketed towards low-vision people ("Be My Eyes", https://www.youtube.com/watch?v=Zq710AKC1gg) as the equivalent of human vision. These models are implied to be human-level equivalents when they are not.
This paper demonstrates that there are still some major gaps where simple problems confound the models in unexpected ways. This is important work to elevate; otherwise people may start to believe that these models are suitable for general application when they still need safeguards and copious warnings.
| "maybe I missed the totally rigorous case study on how X company became Y percent more profitable by doing Z thing with LLMs (or similar), and if so I'd be grateful for citations, but neither Google nor any of the big models seem to know about it."
Goldman Sachs recently issued a report. https://www.goldmansachs.com/intelligence/pages/gs-research/... "We estimate that the AI infrastructure buildout will cost over $1tn in the next several years alone, which includes spending on data centers, utilities, and applications. So, the crucial question is: What $1tn problem will AI solve? Replacing low- wage jobs with tremendously costly technology is basically the polar opposite of the prior technology transitions I’ve witnessed in my thirty years of closely following the tech industry" |
Be My Eyes user here. I disagree with your uninformed opinion. Be My Eyes is more often than not more useful than a human. And I am reporting from personal experience. What experience do you have?
There's definitely something interesting to be learned from the examples here - it's valuable work in that sense - but "VLMs are blind" isn't it. That's just clickbait.
LLMs are bad at counting things just in general. It’s hard to say whether the failures here are vision-based or just an inherent weakness of the language model.
If so, it's better than any other OCR on the market.
I think they just train it on a bunch of text. Maybe counting squares in a grid wasn't considered important enough to train for.
Why do you think it's probable? The much smaller LLaVA that I can run on my consumer GPU can also do "OCR", yet I don't believe anyone has hidden an OCR engine inside llama.cpp.
I had a remarkable experience with GPT-4o yesterday. Our garage door started to fall down recently, so I inspected it and found that our landlord had installed the wire rope clips incorrectly, leading to the torsion cables losing tension. I didn't know what that piece of hardware was called, so I asked ChatGPT and it identified the part as I expected it to. As a test, I asked if there was anything notable about the photo. ChatGPT correctly identified that the cables were installed backwards, with the side of the cable that was (previously) under tension on top of the slack end, instead of sandwiched securely in the middle. To diagnose that requires tracing the cable through space and inferring which end is under tension from the geometry, though I can't rule out an educated guess.
What was really remarkable though was that it failed to notice that one of the two nuts was obviously missing, even after I told it there was a second problem with the installation. Screenshot: https://imgur.com/a/QqCNzOM
Vision Transformers do a shocking amount of compression in the tokenizer. In the [Chameleon paper](https://arxiv.org/pdf/2405.09818) they say the tokenizer "encodes a 512 × 512 image into 1024 discrete tokens from a codebook of size 8192". That's 256 pixels per token (512 * 512 / 1024). If we assume that a pixel is 24 bits (3x 8-bit channels), this implies that they've compressed 256 * 24 = 6144 bits into 13 bits (log2(8192) = 13). [An Image is Worth 32 Tokens for Reconstruction and Generation](https://yucornetto.github.io/projects/titok.html) pushes this even further. If these models work similarly, it's no wonder they struggle with some vision tasks.
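A quick back-of-the-envelope check of those figures (a sketch only, using the 512 × 512 image, 1024-token, 8192-entry codebook numbers quoted above):

```python
import math

# Figures quoted above from the Chameleon paper's tokenizer description.
image_side = 512        # pixels per side
num_tokens = 1024       # discrete tokens per image
codebook_size = 8192    # codebook entries
bits_per_pixel = 24     # assuming 3 channels x 8 bits each

pixels_per_token = image_side * image_side / num_tokens   # 256.0
bits_in = pixels_per_token * bits_per_pixel                # 6144.0
bits_out = math.log2(codebook_size)                        # 13.0

print(f"{pixels_per_token:.0f} pixels/token")
print(f"{bits_in:.0f} bits -> {bits_out:.0f} bits per token "
      f"(~{bits_in / bits_out:.0f}x compression)")
```

Roughly a 470x reduction per token before the language model ever sees the image, which fits the point that fine spatial detail is discarded very early.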
For some reason I started thinking about trying to describe the taste of a fruit to someone who has never tried it, as a kind of non-visual sensory analogue of this in humans.
It is embarrassingly, shockingly bad, because these models are advertised and sold as being capable of understanding images.
Evidently, all these models still fall short.
Isn't this the attention mechanism, the reason we're using transformers for these things? Maybe not greater resolution per se, but focusing on a region with greater neural connectivity.
Lots and lots of eye-tracking data, paired with what was being looked at, in order to emulate human attention processing might be one of the lower-hanging fruits for improving it.
I hope you are aware of the fact that LLMs do not have direct access to the stream of words/characters. It is one of the most basic things to know about their implementation.
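For a concrete illustration (a minimal sketch, assuming the tiktoken library is installed; exact token boundaries vary by tokenizer), the model receives integer token IDs rather than a character stream:

```python
import tiktoken

# cl100k_base is the encoding used by GPT-4-era OpenAI models; the model sees
# the integer IDs, not the individual characters that make up each word.
enc = tiktoken.get_encoding("cl100k_base")
text = "How many r's are in strawberry?"
ids = enc.encode(text)
pieces = [enc.decode([i]) for i in ids]

print(ids)     # a short list of integers
print(pieces)  # multi-character chunks, not letters
```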
We see through thoughts and memories. We see when we desire; the vision just adds on a world of thoughts and a consciousness of being conscious.
Vision links thoughts with reality.
They asked the AI a question with a yes/no response. If the AI chose randomly, it would be correct 50% of the time. That’s what “random baseline accuracy” means.
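A tiny illustration of that baseline (a sketch; the numbers are made up): guessing uniformly at random on yes/no questions lands at about 50% accuracy no matter how the true answers are distributed.

```python
import random

random.seed(0)
n = 100_000
# Arbitrary, skewed ground truth: 70% "yes".
truths = ["yes" if random.random() < 0.7 else "no" for _ in range(n)]
# Uniform random guessing, i.e. the "random baseline".
guesses = [random.choice(["yes", "no"]) for _ in range(n)]

accuracy = sum(g == t for g, t in zip(guesses, truths)) / n
print(f"{accuracy:.3f}")  # ~0.5
```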
This could easily be fixed with training and fine-tuning. Simply generate 100,000 examples or so, train against the ground truth for however long you want, and it's a solved problem.
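A minimal sketch of what generating that kind of synthetic data could look like for the grid-counting task mentioned upthread (hypothetical code: it uses Pillow to draw grids with a known square count as the ground-truth label; the file names and JSONL layout are made up):

```python
import json
import random
from PIL import Image, ImageDraw

def make_grid_example(idx, cell=40, max_rows=8, max_cols=8):
    """Draw a simple grid of squares and record the true count as the label."""
    rows, cols = random.randint(1, max_rows), random.randint(1, max_cols)
    img = Image.new("RGB", (cols * cell + 1, rows * cell + 1), "white")
    draw = ImageDraw.Draw(img)
    for r in range(rows):
        for c in range(cols):
            draw.rectangle(
                [c * cell, r * cell, (c + 1) * cell, (r + 1) * cell],
                outline="black",
            )
    path = f"grid_{idx:06d}.png"
    img.save(path)
    return {
        "image": path,
        "question": "How many squares are in this grid?",
        "answer": str(rows * cols),
    }

# Small batch here; scale the range up toward 100,000 for real fine-tuning data.
with open("grid_counting_train.jsonl", "w") as f:
    for i in range(100):
        f.write(json.dumps(make_grid_example(i)) + "\n")
```

The image path plus question/answer pairing is just one plausible shape for the data; the exact format would depend on the fine-tuning pipeline of whichever model you're targeting.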
They interact with pixel buffers as a mathematical array. To call them blind is to confuse what they are doing with the experience of sight...
> their vision is, at best, like that of a person with myopia seeing fine details as blurry
is a crazy thing to write in an abstract. Did they try to probe that hypothesis at all? I could (well, actually I can't) share some examples from my job of GPT-4V doing some pretty difficult fine-grained visual tasks that invalidate this.
Personally, I rate this paper [1], which makes the argument that these huge GenAI models are pretty good at things - assuming they have seen a LOT of that type of data during training (which is true of a great many things). If you make up tasks like this, then yes, they can be REALLY bad at them, and initial impressions of AGI get harder to justify. But in practice, we aren't just making up tasks to trip up these models. They can be very performant on some tasks, and the authors haven't presented any real evidence about which of these two modes applies here.
[1] https://arxiv.org/abs/2404.04125