![]() |
|
![]() |
| It was just to show that you can run it locally, in opposition to "cloud APIs" referred in the thread, but you are right, the more correct term is local |
![]() |
| > There's certainly smaller and even better models for OCR
Could you please list some? I am developing a tool that relies on OCR and everything I've found refers to tesseract as being the best choice |
![]() |
| The API: https://developer.apple.com/documentation/vision/recognizing...
In my experience it works remarkably well for features like scanning documents in Notes and in copying or translating text embedded in images in Safari. It is not open source, but free to use locally. Someone has written a Python wrapper (apple-ocr) around it if you want to use it in other workflows. The model files might be in /System/Library/PrivateFrameworks/TextRecognition.framework if you wanted to port them to other platforms. |
![]() |
| Do you know of any guides or tutorials to doing this? I tried using the MiniCPM model for this task, but it just OCRed a tiny bit of information then told me that it couldn't extract the rest. |
![]() |
| Love this curious and open-minded exploration of how this stuff works.
The pyramid strategy loosely tracks with renormalization group theory, which has been formally studied for years as a method of interpreting machine learning models: https://arxiv.org/abs/1410.3831 I love the convergence we're seeing in the use of models from different fields to understand machine learning, fundamental physics, and human consciousness. What a time to be alive. |
![]() |
| Something I don't get is why OpenAI don't provide clear, comprehensive documentation as to how this actually works,
I get that there's competition from other providers now so they have an instinct to keep implementation details secret, but as someone building on their APIs this lack of documentation really holds me back. To make good judgements about how to use this stuff I need to know how it works! I had a hilarious bug a few weeks ago where I loaded in a single image representing multiple pages of a PDF and GPT-4 vision effectively hallucinated the contents of the document when asked to OCR it, presumably because the image was too big and was first resized to a point where the text was illegible: https://simonwillison.net/2024/Apr/17/ai-for-data-journalism... If OpenAI had clear documentation about how their image handling works I could avoid those kinds of problems much more effectively. |
![]() |
| For document text/table extraction, nothing beats the quality from the cloud providers. It can get costly but the accuracy is much higher than what you will find using an openai API. |
![]() |
| > At some point we're going to go from tokens to embeddings for everything.
Yes, I agree. Further down the road, I imagine we will end up finding interesting connections to the symbolic approaches of GOFAI, given that the embedding of a token, object, concept, or other entity in some vector space is basically a kind of symbol that represents that token, object, concept, or entity in that vector space. Interestingly, old terms like "representation" and "capsule," which didn't become as widely adopted as "embedding," tried more explicitly to convey this idea of using vectors/matrices of feature activations to stand in for objects, concepts, and other entities. For example, see Figure 1 in this paper from 2009-2012: http://www.cs.princeton.edu/courses/archive/spring13/cos598C... -- it's basically what we're talking about! |
![]() |
| The way this tests GPT-4o performance by feeding in a 7x7 grid of colored shapes and requesting them back as JSON (about half way down the page) is really clever. |
![]() |
| Great article. Perhaps some part of this magic number simply factors in the amount of compute necessary to run the image through the CNN (proportional to compute use per token in the LM). |
![]() |
| > It's easy to get GPT to divulge everything from system prompt to tool details,
It's easy enough to get it to hallucinate those things. It doesn't actually tell them to you. |
![]() |
| I'm well aware of that, but there are plenty of ways to induce verbatim quoting from "hidden" information, and mostly verify it (through sampling a large number of times in separate runs).
Models are improving in truly hiding or ignoring information these days though. As the author of the article states, you'll have a hard time tricking GPT-4o to read text in images as instructions, most likely thanks to this research: https://openai.com/index/the-instruction-hierarchy/ I do feel pretty confident that when the model is happily spitting out its system prompt, and all metadata around the image, but not its pixel dimensions, that probably those dimensions were not provided in any system/assistant/tool message. So maybe part of the image embeddings also encode the pixel dimensions somehow (it would also help the model not think of the image as a squished square for non-1:1 images that have been resized to 512x512). |
![]() |
| It would be interesting to see what happens when you slightly shift the grid of objects until they're split across multiple tiles, and how that affects accuracy. |
![]() |
| I've always wondered how Text to Image LLMs like stable diffusion work, do they just encode RGB values into a matrix and then have a helper tool convert that data into a jpg? |
![]() |
| > CLIP embeds the entire image as a single vector, not 170 of them.
Single token? > GPT-4o must be using a different, more advanced strategy internally Why |
![]() |
| How do we know it generates the images itself and isn’t passing the text to dalle? It’s supposedly how the current gpt4 model does listen mode (with whisper but same idea) |
![]() |
| Because no one knows how to prep the images. With the right file type and resolution I get under a single character error per 10 pages and it's been that good since the late 00s. |
![]() |
| I really hope we see improvements to the resolutions large multimodal models can handle. Right now this patchwork approach leads to lots of unwieldly workarounds in applications. |
![]() |
| I'm assuming that the tokens used to encode an image are entirely distinct from the tokens used to encode text. Does anyone know if this is actually the case? |
We desperately need a modern open source replacement for tesseract built on current SoTA ML tech. It is insane that we are resorting to using LLMs — which aside from being the wrong tool and far too overpowered for the job also are prone to hallucinations, have insanely expensive training and inference costs, etc — for this purpose because the “best” non-LLM solution is so bad it can’t even correctly ocr monospaced hi-res scans of ascii text with sufficient accuracy.