I work in a health-insurance-adjacent field. I can see my work going the way of the dodo as soon as VLMs take off at interpreting historical health records with physicians’ handwriting.
The problem is that, regardless of the confidence number, you can scan traditional OCR output and mark documents for grammatical errors. In VLM/LLM-powered methods, the missing/misread data will be hallucinated, and you can't know whether something scanned correctly or not. I personally scan and OCR tons of personal documents, and I prefer "gibberish" to "hallucinations", because gibberish is easier to catch.

We had this problem before [0], on some Xerox scanners and copiers. The results will be disastrous; it's not a question of if, but when. I personally tried Gemini's and OpenAI's models for OCR, and no, I won't continue using them.

[0]: https://www.theregister.com/2013/08/06/xerox_copier_flaw_mea...
That's not OCR.

It is an absolute miracle. It is transmuting a picture into JSON. I never thought this would be possible in my lifetime. But that is different from what your interlocutor is discussing.
> I never thought this would be possible in my lifetime.

I used to work in Computer Vision and Image Processing. These days I utter this sentence on an almost daily basis. :-D
Kind of. Tesseract's confidence is just a raw model probability output. You could easily use the entropy associated with each token coming out of an LLM to do the same thing.
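A minimal sketch of that idea, assuming you can get per-token logits out of the model (the threshold here is an arbitrary illustration, not a calibrated value):

```python
import torch
import torch.nn.functional as F

def token_entropies(logits: torch.Tensor) -> torch.Tensor:
    """Shannon entropy (nats) per token from a [seq_len, vocab_size] logit matrix."""
    log_probs = F.log_softmax(logits, dim=-1)
    return -(log_probs.exp() * log_probs).sum(dim=-1)

def flag_uncertain(tokens: list[str], logits: torch.Tensor, threshold: float = 2.0):
    """Flag high-entropy tokens, analogous to low-confidence words in Tesseract output."""
    return [(tok, h.item())
            for tok, h in zip(tokens, token_entropies(logits))
            if h.item() > threshold]
```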
Saw your benchmark, looks great. We'll run our models against those benchmarks and share some of our learnings.

As you mentioned, there are a few caveats to VLMs that folks are typically unaware of (not at all exhaustive, but the ones you highlighted):

1. Long-form (dense) text: output token limits of 4/8K mean that dense pages may go over what the LLM can emit. This requires some careful work to make VLMs as seamless as OCR (a common workaround is sketched below).

2. Visual grounding, a.k.a. bounding boxes, is definitely one of those things that VLMs aren't natively good at (partly because the cross-entropy losses used aren't really geared for bounding-box regression). We're definitely making strides here [1] to improve that, so you'll get an experience that is almost as good as native bounding-box regression (all within the same VLM).

[1]: https://colab.research.google.com/github/vlm-run/vlmrun-cook...
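For the first caveat, one common workaround is to tile a dense page into overlapping strips and transcribe each strip separately. A rough sketch, where `vlm_ocr` is a hypothetical stand-in for whatever model call you use (stitching the overlapping text back together cleanly is the hard part this glosses over):

```python
from PIL import Image

def ocr_dense_page(path: str, strip_height: int = 1024, overlap: int = 64) -> str:
    """Split a tall page into overlapping horizontal strips so each
    model call stays under its output-token limit, then join the text."""
    page = Image.open(path)
    width, height = page.size
    chunks, top = [], 0
    while top < height:
        strip = page.crop((0, top, width, min(top + strip_height, height)))
        chunks.append(vlm_ocr(strip))  # hypothetical VLM transcription call
        top += strip_height - overlap
    return "\n".join(chunks)
```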
The tool could include all known open-source fonts, and for the rest, maybe a model could recreate the missing fonts for non-patented typefaces: while font files (.ttf, .otf, .woff, etc.) are copyrighted, the styles themselves usually do not have design patents, so tracing and re-creating them is usually not an issue as far as I'm aware (not a lawyer). [1]

Though if it accidentally "traces" one of the few exceptions, then you've potentially committed a crime, and the big difficulty in typeface detection you mention increases those odds. That said, there are so few exceptions that even if the model couldn't properly identify a font, it might be able to identify whether a font is likely to have a design patent. I do think getting an AI to create a high-quality vector font from a potentially low-res raster graphic is going to be quite challenging, though. Raster-to-vector tools I've tried in the past left a bit to be desired.

1. https://www.copyright.gov/comp3/chap900/ch900-visual-art.pdf

> As a general rule, typeface, typefont, lettering, calligraphy, and typographic ornamentation are not registrable. 37 C.F.R. § 202.1(a), (e). These elements are mere variations of uncopyrightable letters or words, which in turn are the building blocks of expression. See id. The Office typically refuses claims based on individual alphabetic or numbering characters, sets or fonts of related characters, fanciful lettering and calligraphy, or other forms of typeface. This is true regardless of how novel and creative the shape and form of the typeface characters may be.

> There are some very limited cases where the Office may register some types of typeface, typefont, lettering, or calligraphy, such as the following:

> • Pictorial or graphic elements that are incorporated into uncopyrightable characters or used to represent an entire letter or number may be registrable. Examples include original pictorial art that forms the entire body or shape of the typeface characters, such as a representation of an oak tree, a rose, or a giraffe that is depicted in the shape of a particular letter.

> • Typeface ornamentation that is separable from the typeface characters is almost always an add-on to the beginning and/or ending of the characters. To the extent that such flourishes, swirls, vector ornaments, scrollwork, borders and frames, wreaths, and the like represent works of pictorial or graphic authorship in either their individual designs or patterned repetitions, they may be protected by copyright. However, the mere use of text effects (including chalk, popup papercraft, neon, beer glass, spooky-fog, and weathered-and-worn), while potentially separable, is de minimis and not sufficient to support a registration.

> The Office may register a computer program that creates or uses certain typeface or typefont designs, but the registration covers only the source code that generates these designs, not the typeface, typefont, lettering, or calligraphy itself. For a general discussion of computer programs that generate typeface designs, see Chapter 700, Section 723.
Cool to see; I may use this locally for OCR in some cases. But I think the "handwriting" example is a little misleading. That's a font, not a scan of handwritten material.
The AI OCR built into the Snipping Tool in Windows is better than Tesseract, albeit more inconvenient than something like PowerToys or Capture2Text, which use a quick shortcut.
I wonder what the speed of this approach is versus traditional OCR techniques. Also, I'm curious whether this could be used for text detection (finding a bounding box containing text within an image).
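For the detection part, classical OCR already exposes word-level boxes. A small sketch with pytesseract (assumes the Tesseract binary is installed; `page.png` is a placeholder):

```python
import pytesseract
from PIL import Image

# image_to_data returns per-word text, confidence, and bounding boxes.
data = pytesseract.image_to_data(Image.open("page.png"),
                                 output_type=pytesseract.Output.DICT)
for text, conf, x, y, w, h in zip(data["text"], data["conf"], data["left"],
                                  data["top"], data["width"], data["height"]):
    if text.strip() and float(conf) > 0:  # conf is -1 for non-word boxes
        print(f"{text!r} at x={x}, y={y}, {w}x{h}, conf={conf}")
```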
We've seen so many different schemas and ways of prompting the VLMs. We're just standardizing it here, and making it dead-simple to try it out across model providers.
Wait, but we're doing that already, and it works well (Qwen 2.5 VL)? If need be, you can always resort to structured generation to enforce schema conformity?
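For anyone unfamiliar: the cheap approximation of structured generation is to validate the model's JSON against a schema and retry on failure (true constrained decoding enforces the schema during sampling instead). A minimal sketch with pydantic, where `call_vlm` and the `Invoice` schema are illustrative stand-ins:

```python
from pydantic import BaseModel, ValidationError

class Invoice(BaseModel):  # example target schema
    vendor: str
    total: float
    currency: str

def extract(image_bytes: bytes, retries: int = 3) -> Invoice:
    """Ask the VLM for JSON matching the schema; retry until it validates."""
    prompt = f"Extract the invoice as JSON matching this schema: {Invoice.model_json_schema()}"
    for _ in range(retries):
        raw = call_vlm(prompt, image_bytes)  # hypothetical model call
        try:
            return Invoice.model_validate_json(raw)
        except ValidationError:
            continue
    raise RuntimeError("model never produced schema-conformant JSON")
```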
VLMs can't replace OCR one-to-one. Most hosted multimodal models seem to have a classical OCR (Tesseract-based) step in their inference loop.
People really only started talking about the cost of running things when LLMs came out. Almost everything before that was too cheap to be a serious consideration.
I'd rather see machine learning used to help OCR by:

- recognizing/recreating the exact font used
- helping align/rotate the source (see the deskew sketch below)

not by hallucinating gibberish when the source lacks enough data.
I tried using a VLM to recognize handwritten text in genealogical sources, and it made up names and dates that sort of fit the vibe of the document when it couldn’t read the text! They sounded right for the ethnicity and time period but were entirely fake. There’s no way to ground the model using the source text when the model is your OCR.
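On the align/rotate point above, a classic (non-ML) deskew sketch with OpenCV, for reference; minAreaRect's angle convention varies across OpenCV versions, so treat this as illustrative rather than definitive:

```python
import cv2
import numpy as np

def deskew(path: str) -> np.ndarray:
    """Estimate a scan's skew angle from its ink pixels and rotate it upright."""
    img = cv2.imread(path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Binarize so text pixels become foreground.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    coords = np.column_stack(np.where(binary > 0)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]  # tightest rotated box around the ink
    if angle > 45:                       # OpenCV >= 4.5 reports angles in (0, 90]
        angle -= 90
    h, w = img.shape[:2]
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(img, M, (w, h), flags=cv2.INTER_CUBIC,
                          borderMode=cv2.BORDER_REPLICATE)
```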