I'm also curious about this! I've been learning about scraping, but I've had a hard time finding good info about how to deal with user auth effectively.

That's not literally his e-mail :D. He means that you have to replace it with his HN username. It would have been better to write it like this: [HN username]@gmail.com

Awesome product!

I was just a bit confused that the sign-up buttons for the Hobby and Scale plans are grey; I thought they were disabled until I happened to hover over them.

I've seen another website like this with this feature on Hacker News, but it was in a retrospective. These websites have a nasty habit of ceasing operations.

Yeah, this was a phenomenal decision on their part. I wish some of the other cloud tools like Azure would offer the same thing; it just makes so much sense!

I wrote an in-house one for Ribbon. If there's interest, I'll open source it. It's amazing how much better our LLM outputs are with the reducer.

I strongly second trafilatura. Extracting just the text and sending that to the LLM saves a huge amount of money.

I used it on this recent project (shameless plug): https://github.com/philippe2803/contentmap. It's a simple Python library that creates a vector store for any website, using a domain's XML sitemap as a starting point. The challenge is that each domain has its own HTML structure, and to create a vector store we need the actual content, with HTML tags removed, etc. Trafilatura basically does that for any URL, in just a few lines of code.

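For anyone who hasn't used it, a minimal sketch of the extraction step described above (the URL is a placeholder; `fetch_url` and `extract` are trafilatura's real entry points):

```python
import trafilatura

url = "https://example.com/some-article"  # placeholder URL

# Download the page and strip it down to the main text content,
# dropping navigation, boilerplate, and markup before it goes to an LLM.
downloaded = trafilatura.fetch_url(url)
text = trafilatura.extract(downloaded)

print(text)
```
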
If you haven't considered it, you can also use the direct wikitext markup, from which the HTML is derived.

Depending on how you use it, the wikitext may or may not be more ingestible if you're passing it through to an LLM anyway. You may also be able to pare it down a bit by heading/section, so that you can reduce it to only sections that are likely to be relevant (e.g. "Life and career"-type sections). You can also download full dumps [0] from Wikipedia and query them via SQL to make your life easier if you're processing them.

[0] https://en.wikipedia.org/wiki/Wikipedia:Database_download#Wh...?

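For reference, a hedged sketch of pulling a page's raw wikitext through the MediaWiki parse API with requests; the page title and User-Agent string are just examples:

```python
import requests

# The MediaWiki "parse" action can return the raw wikitext of a page.
resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={
        "action": "parse",
        "page": "Johann Sebastian Bach",  # example page title
        "prop": "wikitext",
        "format": "json",
        "formatversion": 2,
    },
    headers={"User-Agent": "wikitext-example/0.1"},
)
wikitext = resp.json()["parse"]["wikitext"]
print(wikitext[:500])
```
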
> reduce it to only sections that are likely to be relevant (e.g. "Life and career")

True, but I also managed to do this from HTML. I tried getting a page's wikitext through the API but couldn't find out how. Just querying the HTML page was less friction, and fast enough that I didn't need a dump (although when AI becomes cheap enough, there are probably a lot of things to do with a Wikipedia dump!). One advantage of using online Wikipedia instead of a dump is that I have a pipeline on GitHub Actions where I just enter a composer's name and it automagically scrapes the web and adds the composer to the database (it takes exactly one minute from the click of the button!).

This doesn't directly address your issue, but since this caused me some pain I'll share it: if you want to parse structured information from Wikipedia infoboxes, the npm module wtf_wikipedia works.

Funnily enough, web scraping was actually the motivating use case that started my co-founder and me building what is now openpipe.ai. GPT-4 is really good at it, but extremely expensive. It's actually pretty easy, though, to distill its skill at scraping a specific class of site down to a fine-tuned model that's way cheaper and still reliably good at scraping that class of site.

I'm working on a Chrome extension to do web scraping using OpenAI, and I've been impressed by what ChatGPT can do. It can scrape complicated text/HTML and usually returns the correct results.

It's still very early, but check it out at https://FetchFoxAI.com

One of the cool things is that you can scrape non-uniform pages easily. For example, I helped someone scrape auto dealer leads from different websites: https://youtu.be/QlWX83uHgHs. This would be a lot harder with a "traditional" scraper.

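As an illustration of this kind of LLM-based extraction (not FetchFox's actual code), here is a minimal sketch using the OpenAI Python client; the model name, prompt, and output fields are assumptions:

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_listings(html: str) -> list[dict]:
    """Ask the model to pull structured records out of raw page HTML."""
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system",
             "content": "Extract every dealer listing from the HTML. Return JSON like "
                        '{"listings": [{"name": ..., "phone": ..., "address": ...}]}.'},
            {"role": "user", "content": html},
        ],
    )
    return json.loads(response.choices[0].message.content)["listings"]
```
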
Companies like Instagram (Facebook/Meta/Garbage) abuse their users' data day in and day out. Who cares what their TOS says? Let them spend millions of dollars trying to block you; it's a drop in the bucket.

Instead, don't do it because it's disrespectful to people. A lot of people weren't made aware of that TOS change, or didn't have the option to object to it. Saying "well, THOSE guys do it! Why can't I?" isn't a mature stance. Don't take their images; that's the right thing to do.

To scale such an approach you could have the LLM generate JS to walk the DOM and extract content, caching the JS for each page.

We've had lots of success with this at Rastro.sh, but the biggest unlock came when we used this as benchmark data to build scraping code. Sonnet 3.5 is able to do this. It reduced our cost and improved accuracy for our use case (extracting e-commerce products), since some of these models are not reliable at extracting lists of 50+ items.

I would definitely approach this problem by having the LLM write code to scrape the page. That would address the cost and accuracy problems, and also give you testable code.

GPT-4o mini is 33x cheaper than GPT-4o, or 66x cheaper in batch mode. But the article says:

> I also tried GPT-4o mini but yielded significantly worse results so I just continued my experiments with GPT-4o.

It would be interesting to compare with the other inexpensive top-tier models, Claude 3 Haiku and Gemini 1.5 Flash.

Author here: I'm working on a follow-up post where I benchmark pre-processing techniques (to reduce the token count). It turns out that removing all HTML works well (much cheaper, and it doesn't impact accuracy). So far I've only tried GPT-4o and the mini version, but trying other models would be interesting!

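A rough sketch of that kind of pre-processing (stripping markup before the LLM sees the page), here done with BeautifulSoup rather than whatever the benchmark actually uses:

```python
from bs4 import BeautifulSoup

def html_to_text(html: str) -> str:
    """Drop scripts, styles, and tags, keeping only the visible text."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    # Collapse the remaining markup into whitespace-separated text.
    return soup.get_text(" ", strip=True)
```
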
As others have mentioned, converting HTML to markdown works pretty well.

That said, we've noticed that for some sites with nested lists or tables, we get better results by reducing those elements to simplified HTML instead of markdown, essentially providing context for where the structures start and stop. It's also been helpful for chunking docs, to ensure that lists and tables aren't broken apart across chunks.

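One way to do the HTML-to-markdown step mentioned above is the markdownify package; this is only a sketch, and which elements to keep as simplified HTML is left out:

```python
from markdownify import markdownify as md

def simplify(html: str) -> str:
    # Convert to markdown: headings become #/## and links become [text](url),
    # which keeps the structure while dropping most of the tag noise.
    # Tables or nested lists could be passed through as raw HTML instead.
    return md(html, heading_style="ATX")
```
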
As a PoC, we first took a screenshot of the page, cropped it to the part we needed, and then passed it to GPT. One of the things we do is compare prices from different suppliers for the same product (i.e. airline tickets), and we sometimes need to do that manually. While the approach may look expensive, it is generally cheaper than a real person, and it frees the real person to do more meaningful work, so it's a win-win. I'm looking forward to putting this in production, hopefully.

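A sketch of that screenshot-to-GPT flow using the OpenAI vision-capable chat API; the cropping step is omitted, and the question asked is an assumption:

```python
import base64
from openai import OpenAI

client = OpenAI()

def price_from_screenshot(png_path: str) -> str:
    """Send a (pre-cropped) screenshot and ask the model to read the price."""
    with open(png_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "What is the total ticket price shown in this screenshot? "
                         "Reply with just the number and currency."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```
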
GPT-4 (and Claude) are definitely the top models out there, but Llama, even the 8B, is more than capable of handling extraction like this. I've pumped absurd batches through it via vLLM.

With serverless GPUs, the cost has been basically nothing.

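For context, a minimal sketch of batch extraction through vLLM; the model name and prompt template are assumptions:

```python
from vllm import LLM, SamplingParams

# Load a small instruct model once; vLLM batches the prompts internally.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")  # assumed model
params = SamplingParams(temperature=0, max_tokens=512)

pages = ["<html>...page 1...</html>", "<html>...page 2...</html>"]
prompts = [
    f"Extract the product name and price from this HTML as JSON:\n{page}"
    for page in pages
]

for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```
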
Can you explain a bit more about what "serverless GPUs" are, exactly? Is there a specific cloud provider you're thinking of? E.g., is there a GPU product on AWS? Google gives me SageMaker, which is perhaps what you are referring to?

There are a few companies out there that provide it, Runpod and Replicate being the two I've used. If you've ever used AWS Lambda (or any other FaaS), it's essentially the same thing.

You ship your code as a container, wrapped in a library they provide that allows them to execute it, and then you're billed per second of execution time. Like most FaaS, if your load is steady-state it's more expensive than just spinning up a GPU instance. If your use case is more on-demand, with a lot of peaks and troughs, it's dramatically cheaper, particularly if your trough frequently goes to zero. Think small-scale chatbots and the like.

Runpod, for example, would cost $3.29/hr, or ~$2,400/mo, for a single H100. I can use their serverless offering instead for $0.00155/second. I get the same H100 performance, but it's not sitting around idle (read: costing me money) all the time.

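The programming model is roughly a handler function you register with the provider's SDK. A hedged sketch in the shape of RunPod's Python worker follows; treat the exact SDK calls as an assumption and check the provider's docs:

```python
import runpod  # RunPod's serverless worker SDK

def handler(event):
    # "input" carries whatever JSON the caller sent to the endpoint.
    prompt = event["input"]["prompt"]
    # ... run the model here (e.g. a vLLM or transformers pipeline) ...
    return {"output": f"echo: {prompt}"}

# Start the worker loop; the platform invokes handler() per request and
# bills only for the seconds it actually runs.
runpod.serverless.start({"handler": handler})
```
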
There’s a lot of data that we should have programmatic access to but don’t.

The fact that I can’t get my own receipt data from online retailers is unacceptable. I built a CLI Puppeteer scraper to scrape sites like Target, Amazon, Walmart, and Kroger for precisely this reason. Any website that has my data and doesn’t give me access to it is a great target for scraping.

I'd say scrapers have always been popular, but I imagine they're even more popular nowadays with all the tools (AI but also non-AI) readily available to do cool stuff on a lot of data.

Bingo. During the pandemic, I started a project to keep myself busy by trying to scrape stock market ticker data and then do some analysis and make some pretty graphs out of it. I know there are paid services for this, but I wanted to pull it from various websites for free. It took me a couple months to get it right. There are so many corner cases to deal with if the pages aren't exactly the same each time you load them. Now with the help of AI, you can slap together a scraping program in a couple of hours.

I'm sure it was profitable in keeping him busy during the pandemic. Not everything has to derive monetary value; you can do something for experience, for fun, to kick the tyres, or for open-source and/or philanthropic reasons.

Besides, it's a low-margin, heavily capitalized, and heavily crowded market you'd be entering, and not worth the negative-monetary investment in the short and medium term (unless you wrote AI in the title and we're going to the mooooooon baby).

Building scrapers sucks. It's generally not because it's conceptually very difficult, or because it requires extremely high-level reasoning. It sucks because when someone changes the page, everything breaks.

LLMs have enough common sense to deal with these things, and they take almost no time to work with. I can throw HTML at something with a vague description and pull out structured data with no engineer required, and it'll probably still work when the page changes. There's a huge number of one-off jobs people will do where perfect isn't the goal, and a fast solution plus a bit of cleanup is hugely beneficial.

Another approach is to use a regexp scraper. These are very "loose" and tolerant of changes. For example, RNSAFFN.com uses regular expressions to scrape the Commitments of Traders report from the Commodity Futures Trading Commission every week.

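As a toy illustration of the regex style (not RNSAFFN's actual code), here is a pattern that tolerates markup changes because it keys off the surrounding text rather than the DOM structure; the report snippet is imagined:

```python
import re

page_text = """
...Noncommercial Positions: Long 412,345 Short 298,760...
"""  # imagined report snippet

# Match "Long <number> Short <number>" regardless of surrounding markup.
match = re.search(
    r"Long\s+([\d,]+)\s+Short\s+([\d,]+)",
    page_text,
    flags=re.IGNORECASE,
)
if match:
    long_pos = int(match.group(1).replace(",", ""))
    short_pos = int(match.group(2).replace(",", ""))
    print(long_pos, short_pos)
```
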
My experience has been the opposite: regex scrapers are usually incredibly brittle, and also harder to debug when something DOES change.

My preferred approach for scraping these days is Playwright for Python plus CSS selectors to select things from the DOM. It's still prone to breakage, but it's reasonably pleasant to debug using the browser DevTools.

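A minimal sketch of that Playwright-plus-CSS-selectors approach (the URL and selectors are placeholders):

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com/listings")  # placeholder URL

    # CSS selectors are easy to test interactively in the browser DevTools
    # before copying them here.
    titles = page.locator(".listing .title").all_text_contents()
    prices = page.locator(".listing .price").all_text_contents()

    for title, price in zip(titles, prices):
        print(title, price)

    browser.close()
```
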
I don't know if many people have the same use case, but I'm heavily relying on this right now because my daughter started school. The school board, the school, and the teacher each use a different app to communicate important information to parents. I'm just trying to make one feed with all of them. Before AI it would have been hell to scrape, because, as you can imagine, those apps are terrible.

Fun aside: the worst one of them is a public Facebook page. The school board is making it their official communication channel, which I find horrible. Facebook makes it so hard to scrape. And if you don't know, you can't even use Facebook's API for this anymore, unless you have a verified business account and go through a review just for this permission.

Scrapers have always been notoriously brittle and prone to breaking completely when pages make even the smallest structural changes.

Scraping with LLMs bypasses that pitfall because it's more of a summarization task over the whole document, rather than working against a hard-coded document structure to extract specific data.

Personally, I find it's better for archiving, as most sites don't provide a convenient way to save their content directly. Occasionally I do it just to make a better interface over the data.

There's been a large push to do server-side rendering for web pages, which means that companies no longer have a publicly facing API to fetch the data they display on their websites.

Parsing the rendered HTML is the only way to extract the data you need.

We've been doing something similar for VLM Run [1]. A lot of websites with obfuscated HTML/JS or rendered charts/tables tend to be hard to parse via the DOM. Taking screenshots is definitely more reliable and future-proof, as these web pages are built for humans to interact with.

That said, the costs can be high, as the OP says, but we're building cheaper and more specialized models for web screenshot -> JSON parsing. Also, it turns out you can do a lot more than just web scraping [2].

[1] https://vlm.run

What do you think all this LLM stuff will evolve into? Of course, it's moving on from chitchat over stale information and into an "automate the web" kind of phase, like it or not.

Instead of directly scraping with GPT-4o, what you could do is have GPT-4o write a script for a simple web scraper, and then use a prompt loop when something breaks or goes wrong.

I have the same opinion about a man and his animals crossing a river on a boat: instead of spending tokens on trying to solve a word problem, have it create a constraint solver and then run that. Same thing.

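A rough sketch of that generate-then-repair loop; everything here (prompts, the use of exec, the retry count) is an assumption, and running model-generated code should be sandboxed in practice:

```python
from openai import OpenAI
from bs4 import BeautifulSoup  # made available to the generated code

client = OpenAI()

def build_scraper(html_sample: str, goal: str, max_attempts: int = 3):
    """Ask the model for a scrape(html) function; re-prompt with the error on failure."""
    prompt = (
        f"Write a Python function scrape(html) that {goal}. "
        "Use BeautifulSoup. Return only raw Python code, no markdown fences."
    )
    for _ in range(max_attempts):
        code = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content.strip()
        try:
            namespace = {"BeautifulSoup": BeautifulSoup}
            exec(code, namespace)              # CAUTION: sandbox generated code in real use
            result = namespace["scrape"](html_sample)
            if result:                         # crude sanity check on the output
                return namespace["scrape"], code
            prompt += "\nThe previous attempt returned nothing. Fix it."
        except Exception as err:
            prompt += f"\nThe previous attempt failed with: {err!r}. Fix it."
    raise RuntimeError("could not generate a working scraper")
```
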
The author claims that attempting to retrieve XPaths with the LLM proved to be unreliable. I've been curious about this approach because it seems like the best "bang for your buck" with regards to cost. I bet if you experimented more, you could probably improve your results.

What people mentioned above is pretty much what they did at Octabear [0], and as an extension of the idea, it's also what a lot of startup applicants did for other types of media: video scraping, podcast scraping, audio scraping, etc.

[0] https://www.octabear.com/

I think that LLM costs, even for GPT-4o, are probably low compared to the proxy costs usually required for web scraping at scale. Residential/mobile proxies cost a few dollars per GB. If I were to process the cleaned data obtained using 1 GB of residential/mobile proxy transfer, I wouldn't pay more than that for the LLM.

This looks super useful, but from what I've heard, if you try to do this at any meaningful scale your scrapers will be blocked by Cloudflare and the like.

I used to do a lot of web scraping. Cloudflare is an issue, as are a few Cloudflare competitors, but scraping can still be useful. We had contracts with the companies we scraped that allowed us to scrape their sites, specifically so that they didn't need to do any integration work to partner with us. The most anyone had to do on the company side was allowlist us in Cloudflare.

I'd recommend web scraping as a "growth hack" in that way; we got a lot of partnerships that we wouldn't otherwise have gotten.

Not sure why the author didn't use 4o-mini. Use 4o for reasoning, but things like parsing/summarizing can be done by cheaper models with little loss in quality.

I’m curious to know more about your product. Currently, I’m using visualping.io to keep an eye on the website of my local housing community. They share important updates there, and it’s really helpful for me to get an email every few months instead of having to check their site every day.

Asking for XPaths is clever!

Plus, you can probably use them until they fail (the website changes) and then just re-scrape with an LLM request.

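Sketching that cache-until-it-breaks flow (the prompt, the cache, and the use of lxml are all assumptions):

```python
from lxml import html
from openai import OpenAI

client = OpenAI()
xpath_cache: dict[str, str] = {}   # site key -> last known-good XPath

def _as_text(node) -> str:
    # tree.xpath() can return elements or plain strings depending on the expression.
    return node if isinstance(node, str) else node.text_content()

def ask_llm_for_xpath(page_html: str, field: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"Reply with only an XPath expression that selects the {field} "
                       f"in this HTML:\n{page_html[:20000]}",
        }],
    )
    return resp.choices[0].message.content.strip()

def extract(site: str, page_html: str, field: str) -> str:
    tree = html.fromstring(page_html)
    cached = xpath_cache.get(site)
    if cached:
        nodes = tree.xpath(cached)
        if nodes:                    # cached XPath still matches: no LLM call needed
            return _as_text(nodes[0])
    # First visit, or the site changed: pay for one LLM call and re-cache.
    xpath_cache[site] = ask_llm_for_xpath(page_html, field)
    return _as_text(tree.xpath(xpath_cache[site])[0])
```
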
Is it really so hard to look at a couple of XPaths in Chrome? It's insane that people actually use an LLM when trying to do this for real. We're headed where automakers are now: just put in idiot lights; no one knows how to work on any parts anymore. Suit yourself, I guess.

I was thinking of adding a feature to my app that uses LLMs to extract XPaths and generate RSS feeds from sites that don't support them. The section on XPaths is unfortunate.

On this note, does anyone know how Cursor scrapes websites? Is it just fetching locally and then feeding in the raw HTML, or doing some type of preprocessing?

Isn't Ollama an answer to this? Or is there something inherent to OpenAI that makes it significantly better for web scraping?

GPT-4o (and the other top-tier models like Claude 3.5 Sonnet and Gemini 1.5 Pro) is massively more capable than the models you can run on your own machine using Ollama, unless you can run something truly monstrous like Llama 3.1 405B, but that requires hundreds of GBs of GPU RAM, which is very expensive.

I understand that if the web scraping activity has some ROI, it is perfectly affordable. If it's just for fun it doesn't make sense, but the author of the article is already paying for a service and looking for a goal.

Can anyone recommend an AI vision web browsing automation framework, rather than just scraping? My use case: automate the monthly task of logging into a website and downloading the latest invoice PDF.

Most of the discussion I've found on this topic is about how to extract information. Is there any technique for extracting interactive elements? I reckon listing all of the inputs/controls would not be hard, but finding the corresponding labels/articles might be tricky.

Another thing I wonder, regarding text extraction: would it be a crazy idea to just snapshot the page and ask the model to OCR it and generate a bare-minimum HTML table layout? That way both the content and the spatial relationships of elements are maintained (not sure how useful that is, but I'd like to keep it anyway).

I just want something that can take all my bookmarks, log into all my subscriptions using my credentials, and archive all those articles. I can then feed them to an LLM of my choice to ask questions later, but having the raw archive is the important part. I don't know if there are any easy-to-use tools to do this, though, especially with paywalled, subscription-based websites.

Off topic: what are some good frameworks for web scraping and PDF document processing? Some sources are public and some are behind a login; some require multiple clicks before the sites display the relevant data. We need to ingest a wide variety of data sources for one solution. Very few of those sources supply data as an API/JSON.

I'm starting to think that LLMs are a solution in need of a problem, like crypto and the blockchain were. Have we not already solved web scraping?

New technologies can solve problems better.

"I'm starting to think computers are a solution in need of a problem. Have we not already solved doing math?"

Do you handle authentication? We have lots of users that want to automate some part of their daily workflow but the pages are often behind a login and/or require a few clicks to reach the desired content.
Happy to chat: [email protected]