I've been trying out alternative versions of this that pass images through to e.g. the Claude 3 vision models, but they're harder to share with people because they need an API key!
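
For reference, passing an image to one of those models is a single call to the Anthropic Messages API. This is only a minimal sketch, not the code from my experiments; `apiKey` and `base64Png` are placeholders you'd have to fill in yourself, which is exactly why it's hard to share:

```javascript
// Minimal sketch: send a base64-encoded PNG to a Claude 3 vision model.
// `apiKey` and `base64Png` are placeholders - every user needs their own key.
const response = await fetch("https://api.anthropic.com/v1/messages", {
  method: "POST",
  headers: {
    "x-api-key": apiKey,
    "anthropic-version": "2023-06-01",
    "content-type": "application/json",
  },
  body: JSON.stringify({
    model: "claude-3-haiku-20240307",
    max_tokens: 1024,
    messages: [{
      role: "user",
      content: [
        { type: "image", source: { type: "base64", media_type: "image/png", data: base64Png } },
        { type: "text", text: "Transcribe all of the text in this image." },
      ],
    }],
  }),
});
const message = await response.json();
console.log(message.content[0].text);
```
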
Yes it does, but I've not dug into the more sophisticated parts of the API at all yet. I'm using it in the most basic way possible right now:
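
Something like the simplest `Tesseract.recognize()` call. A sketch of that most basic usage, where `image` stands in for whatever input you hand it (a URL string, File, Blob or canvas element):

```javascript
// Simplest Tesseract.js call: one function, image in, text out.
// `image` can be a URL string, File, Blob, or canvas element.
const { data: { text } } = await Tesseract.recognize(image, "eng", {
  logger: (m) => console.log(m), // optional progress reporting
});
console.log(text);
```
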
My s3-ocr tool (https://github.com/simonw/s3-ocr) could do that with quite a bit of extra configuration. You would need to upload them all to S3 first though, which is a bit of a pain just to run OCR (that's Textract's fault). You could try stitching together a bunch of scripts to run the CLI version of Tesseract locally instead.
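
As a rough sketch of that local approach, assuming the `tesseract` binary is installed and on your PATH (the directory and file extension here are illustrative), a short Node script can loop over images and shell out to the CLI:

```javascript
// Sketch: batch OCR by shelling out to the Tesseract CLI from Node.
// For each PNG in the current directory, tesseract writes <name>.txt.
const { execFileSync } = require("node:child_process");
const { readdirSync } = require("node:fs");

for (const file of readdirSync(".").filter((f) => f.endsWith(".png"))) {
  const outputBase = file.replace(/\.png$/, ""); // tesseract appends .txt
  execFileSync("tesseract", [file, outputBase]);
  console.log(`OCR'd ${file} -> ${outputBase}.txt`);
}
```
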
Why would that suggest I'm being paid by anyone? Oh, I think I see. No, I'm not being paid to promote LLMs. The point of my blog post was twofold: first, to introduce the OCR tool I built; and second, to provide yet another documented example of how I use LLMs in my daily development work. The tool doesn't use LLMs itself; they were just a useful speed-up in building it. It's part of a series of posts, see also: https://simonwillison.net/tags/aiassistedprogramming/
For now, I'm still using OCRmyPDF: it may be slow, but it's incredibly useful. The files become big, but it just works. If an alternative is quicker or lighter I'll use it, but it has to just work.
Not saying the article was misleading about this, just that the LLM part is basically some standard interfacing and HTML/CSS/JS around that core engine, which wasn't immediately obvious to me when scanning the screenshots.