Tracking supermarket prices with Playwright

Original link: https://www.sakisv.net/2024/08/tracking-supermarket-prices-playwright/

To build a website that monitors price changes in Greek supermarkets, the author used the Python library Playwright. This let him automate visiting multiple web pages, simulating user interactions, and parsing the relevant data. Having initially run into problems caused by JavaScript, he used Playwright to navigate the pages, inspect their elements, and intercept requests, keeping the interaction reliable despite their reliance on JavaScript. He goes on to discuss how he worked around obstacles such as infinite scrolling, product loading, and restrictions imposed by some of the stores. He also covers scaling and automating the process, deploying it on a cloud server with suitable specs that runs faster and costs less than the equivalent on Amazon Web Services. He further highlights the considerations involved in avoiding IP restrictions, and the optimisation strategies he implemented to minimise the downtime and cost associated with scraping. Finally, he offers some thoughts on the project's scalability and future direction.

The discussion around the article also covers a commenter's personal experience building a contact-lens price-comparison website operating in roughly 30 countries. Because naming conventions differ between sites, matching products was difficult, so he built custom scrapers and infrastructure while handling regular updates and potential errors during scraping; the hardest part, he notes, is maintaining the scrapers and working out whether a product is genuinely gone or the scraper is broken. A similar project, SuperPrijzenVergelijker NL, is also mentioned, which focuses on comparing prices across Dutch supermarkets; its creator uses Playwright instances, Haskell, AWS ECS and NextJS for the web side, and its main challenge is linking products across supermarkets so prices are easier to compare. Finally, one reader questions the intent behind the original article, having expected such projects to be about taking power back from large companies rather than another commercial venture, but concludes that making money from such a project is the creator's right. Overall, the discussion offers insightful perspectives and experience on building scrapers for comparison sites and overcoming the obstacles involved.

Original article

Back in Dec 2022, when inflation was running high, I built a website for tracking price changes in the three largest supermarkets of my home country, Greece.

Screenshot of pricewatcher.gr showing a summary of the last 30 days of price changes across 3 supermarkets

In the process I ran into a few interesting hurdles that I had to clear before being able to "go live". There were also several lessons I learned and faulty assumptions that I made.

In this post I will talk about scraping: What I use, where it runs, how I got around some restrictions and how I got it to the point that it can run for months without intervention.


Scraping js sites

The main problem was, what else? Javascript.

All three eshops are rendered with js and even in the few spots where they don't, they still rely on js to load products as you scroll, similar to infinite scrolling in social media. This means that using plain simple curl or requests.get() was out of the question; I needed something that could run js.

Enter Playwright.

Playwright allows you to programmatically control a web browser, providing an API to handle almost everything, from opening a new tab and navigating to a URL, to inspecting the DOM, retrieving element details, and intercepting and inspecting requests.

It also supports Chromium, Safari and Firefox browsers and you can use it with Node, Java, .NET and, most importantly, Python.
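Getting started takes very little code; a minimal sketch with the sync Python API (the URL here is just a placeholder) is enough to open a headless browser, render a js-heavy page and query its DOM:

from playwright.sync_api import sync_playwright

with sync_playwright() as pw:
    browser = pw.chromium.launch(headless=True)   # also works with pw.firefox / pw.webkit
    page = browser.new_page()
    page.goto("https://example.com")              # rendered by a real browser, js included
    print(page.title())
    print(page.locator("h1").inner_text())        # query the rendered DOM
    browser.close()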

With this under my belt, the scrolling and the loading of the products looked like this:


def visit_links(self):
    #
    # <truncated>
    #

    # do the infinite scrolling
    load_more_locator = self.page.get_by_test_id(
        "vertical-load-more-wrapper"
    ).all()
    while len(load_more_locator) == 1:
        load_more_locator[0].scroll_into_view_if_needed()
        load_more_locator = self.page.get_by_test_id(
            "vertical-load-more-wrapper"
        ).all()

    # once the infinite scroll is finished, load all the products
    products = self.page.locator("li.product-item").all()

    # and then ensure that we only keep products that don't have the "unavailable" text
    products = [
        p
        for p in products
        if len(p.get_by_test_id("product-block-unavailable-text").all()) == 0
    ]

    #
    # <truncated>
    #

By the end of this method, I have a list of product <div>s, each containing the product's name, price, photo, link, etc, that I get to parse to extract this information. Then I do the same for the next product category, and so on.
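The parsing itself isn't shown in the post, but with Playwright locators it boils down to something like the sketch below. The test ids and selectors are hypothetical, for illustration only, since each store has its own markup:

def parse_product(product):
    # product is a Playwright Locator for one product element;
    # the test ids and selectors below are made up for illustration
    return {
        "name": product.get_by_test_id("product-block-name").inner_text(),
        "price": product.get_by_test_id("product-block-price").inner_text(),
        "link": product.locator("a").first.get_attribute("href"),
        "image": product.locator("img").first.get_attribute("src"),
    }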

Once I wrote the scrapers for all 3 supermarkets, the next step was to run them in an automated way, so that every day I would have the new prices and could compare them with the day before.

Automating

Running this whole process for an entire supermarket on my M1 MBP would take anywhere from 50m to 2h 30m, depending on the supermarket. I could also run all 3 scrapers in parallel with no noticeable difference.

Using my laptop was fine for development and testing, but I needed a more permanent solution.

Use an old laptop?

The first attempt at a solution came in the form of an old laptop, dating all the way back to 2013.

It had a dual-core M-family processor clocked at 2.20GHz and 4GB of RAM, which is on the low side, especially by today's standards, but then again I only needed to open a single tab in a headless browser and navigate to a bunch of different URLs. Surely a dual-core CPU with 4GB of RAM could handle that, right?

Well, not quite.

Even after bumping its memory to 12GB, the performance was far too disappointing, coming in at more than 2h even for the "fast" supermarket.

Use the cloud?

So the next option was to use the cloud ☁️

My first thought was to use AWS, since that's what I'm most familiar with, but looking at the prices for a moderately powerful EC2 instance (i.e. 4 cores and 8GB of RAM), it was going to cost much more than I was comfortable spending on a side project.

At the time of writing, a c5a.xlarge instance is at $0.1640 per hour in eu-north-1. That comes to a nice $118.08 per month, or $1,416.96 per year. Not so cheap.

So I turned to Hetzner.

The equivalent server there is the cpx31, which costs $17.22 (€15.72) per month, or $206.64 per year. That is roughly 7 times cheaper than AWS, which makes it a no-brainer for my use case.
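As a quick back-of-the-envelope check of those numbers (assuming 24/7 on-demand pricing and a 30-day month):

aws_hourly = 0.1640                   # c5a.xlarge in eu-north-1
aws_monthly = aws_hourly * 24 * 30    # $118.08
hetzner_monthly = 17.22               # cpx31

print(aws_monthly * 12)               # ~$1,416.96 per year
print(hetzner_monthly * 12)           # ~$206.64 per year
print(aws_monthly / hetzner_monthly)  # ~6.9x cheaper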

So Hetzner it is.

Use an old laptop AND the cloud!

Having settled on the cloud provider to run the scraping from, the next step was to finally automate it and run it once per day.

The old laptop may not be powerful enough to run the scraping itself, but it was more than capable of running a CI server that just delegates scraping to the much beefier server from Hetzner.

My CI of choice is Concourse which describes itself as "a continuous thing-doer". While it has a bit of a learning curve, I appreciate its declarative model for the pipelines and how it versions every single input to ensure reproducible builds as much as it can.

Here's what the scraping pipeline looks like:

Screenshot of the pricewatcher scraping pipeline in Concourse

It has a nightly trigger which creates the scraping server and then triggers the three scraping jobs in parallel. When all of them are finished, it tears down the server to keep the costs down, and the raw output of each scraper is passed onto a job that transforms it to a format which is then loaded into pricewatcher.gr. Last, but definitely not least, when any of these steps fails I get an email alert.
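The post doesn't show how the scraping server is actually created and torn down; one way a pipeline task could do it is by calling the Hetzner Cloud API directly. The sketch below is an illustration under that assumption, not the actual pipeline code, and the server name and image are made up:

import os
import requests

API = "https://api.hetzner.cloud/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['HCLOUD_TOKEN']}"}

def create_server():
    # spin up a cpx31 server for the night's scraping run
    resp = requests.post(f"{API}/servers", headers=HEADERS, json={
        "name": "scraper-nightly",       # hypothetical name
        "server_type": "cpx31",
        "image": "ubuntu-22.04",         # assumed image
    })
    resp.raise_for_status()
    return resp.json()["server"]["id"]

def delete_server(server_id):
    # tear it down once all scraping jobs are finished, to keep costs down
    requests.delete(f"{API}/servers/{server_id}", headers=HEADERS).raise_for_status()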

Avoiding IP restrictions

But of course nothing is ever that simple.

While the supermarket that I was using to test things every step of the way worked fine, one of them didn't. The reason? It was behind Akamai and they had enabled a firewall rule which was blocking requests originating from non-residential IP addresses.

I had to find a way to do a "reverse vpn" kind of thing, so that the requests would appear to originate from my home IP.

Enter tailscale.

Tailscale allows you to put your devices on the same network even if, in reality, they are all over the place.

Once they are in the same network, then you can designate one of them to act as an "exit node". You can then tell the other devices to use the designated exit node and, from that point on, all their requests flow through there.

And if you're lucky like me, your ISP uses CGNAT, which basically means that your public IP address is not tied to you specifically but is shared with other customers of the ISP. This gives some flexibility and peace of mind that your IP won't be blindly blocked, since doing so would affect other customers and not just you.

So after all that, my good ol' laptop just got one more duty for itself:

Acting as an exit node for the scraping traffic.
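In practice the laptop advertises itself as an exit node (tailscale up --advertise-exit-node, plus approving it in the Tailscale admin console), and the freshly created scraping server joins the tailnet and routes its traffic through it. A sketch of how the server side could be wired up; the auth key and exit-node address are placeholders:

import subprocess

def route_through_home(auth_key: str, exit_node: str) -> None:
    # join the tailnet and send all traffic through the laptop at home
    subprocess.run(
        ["tailscale", "up", f"--authkey={auth_key}", f"--exit-node={exit_node}"],
        check=True,
    )

# route_through_home("tskey-...", "100.64.0.1")  # placeholder values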

How and when does it fail?

I've been running this setup for a year and a half now, and it has proved to be quite reliable.

Of course, as with any scraping project, you're always at the mercy of the developers of the website you're scraping, and pricewatcher is no exception.

There are 2 kinds of failures that I've faced: breaking changes and non-breaking changes.

The first one is easy: they changed something and your scraper failed. It may be something as simple as a survey popping up so you have to click one more button, or something more complex where they completely change the layout and you have to do a larger refactor of your scraper.

The second kind is nastier.

They change things in a way that doesn't make your scraper fail. Instead the scraping continues as before, visiting all the links and scraping all the products. However the way they write the prices has changed and now a bag of chips doesn't cost €1.99 but €199, because they now use <sup> to separate the decimal part:

screenshot of a ruffles chips bag

screenshot of the html showing the <sup> tag being used to separate the decimal part

To catch these changes I rely on my transformation step being as strict as possible with its inputs.
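In other words, the transformation refuses to guess: if a price doesn't look exactly like a price, the run fails loudly instead of silently storing €199 bags of chips. A minimal sketch of that idea, assuming prices are written like "1,99 €" (the exact format is an assumption):

import re

PRICE_RE = re.compile(r"^(\d+),(\d{2})\s*€$")  # e.g. "1,99 €"; anything else is rejected

def parse_price(raw: str) -> float:
    match = PRICE_RE.match(raw.strip())
    if match is None:
        # fail loudly so the daily run errors out and an alert is sent
        raise ValueError(f"unexpected price format: {raw!r}")
    euros, cents = match.groups()
    return int(euros) + int(cents) / 100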

In both cases however, the crucial part is to get feedback as soon as possible. Because this runs daily, it gives me some flexibility to look at it when I have time. It can also be a source of anxiety in case something breaks when I'm off for 2 weeks for holidays. (Luckily they don't change their sites that often and that dramatically)

Optimising

While the overall architecture has remained pretty much the same since day 1, I've made several changes to almost all other aspects of how scraping works, to improve its reliability and to reduce the amount of work I had to do.

I configured email alerts for when things fail, heuristics that alert me when a scraping yields too many or too few products for that supermarket, timeouts, retries that don't start from the beginning, etc.
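The product-count heuristic, for example, can be as simple as comparing today's count against the previous run and refusing to accept a big swing; the 20% threshold below is made up for illustration:

def check_product_count(supermarket: str, count: int, previous: int, tolerance: float = 0.2) -> None:
    # fail the run when today's count deviates too much from the previous one,
    # so the pipeline's normal failure alerting kicks in
    if previous and abs(count - previous) > previous * tolerance:
        raise RuntimeError(
            f"{supermarket}: scraped {count} products, last run had {previous}"
        )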

All of these could be a post of their own, and maybe they will (write a comment if you want!), but I want to highlight two other optimisations here.

The main bottleneck in this entire process is the duration of the scraping. The longer it takes, the more I end up paying and the more annoying it is if something fails and it needs to be retried from the beginning.

After lots of trial and error, two changes improved the times significantly: A bigger server and reducing the amount of data that I fetch.

Use a bigger server

I went from 4 vCPUs and 16GB of RAM to 8 vCPUs and 16GB of RAM, which reduced the duration by about 20%, making it comparable to the performance I get on my MBP. Also, because I'm only using the scraping server for ~2h, the difference in price is negligible.

Fetch fewer things

Part of the truncated snippet above is this bit, taken straight from the playwright docs:

def visit_links(self):
    # abort image requests before navigating, so they are blocked from the very first page load
    images_regexp = re.compile(r"(\.png?.*$)|(\.jpg.*$)")
    self.page.route(images_regexp, lambda route: route.abort())

    for index, link in enumerate(self.links):
        self.page.goto(link)
    #
    # <truncated>
    #

This aborts all requests to fetch images when I'm loading products. This not only makes the scraping quicker, but also saves bandwidth and, presumably, a few cents, for the websites.

While both of these are obvious in hindsight, sometimes it takes conscious effort to take a step back and re-examine the problem at hand. It also helps to talk about it out loud, either to a friend who knows or, even better, to a friend who doesn't know, for some rubber-duck debugging.

Cost

So how much does the scraping cost?

According to the last invoice from Hetzner: €4.94 for the 31 servers I spun up and €0.09 for the 31 IPv4 addresses they received.

The data from the scraping is saved in Cloudflare's R2, which has a pretty generous 10GB free tier that I have not hit yet, so that's another €0.00 there.

Conclusion

These are the main building blocks of how I built a scraping pipeline for pricewatcher in order to keep track of how supermarkets change their prices.

Leave a comment below if it piqued your interest and/or you'd like me to dive a bit deeper in any area.

/playwright/ /python/ /concourse/ /pricewatcher/ /tailscale/
