I liked the analogy to Gabe Newell's "piracy is a service problem" adage, embodied in the Virgin API consumer vs. Chad third-party scraper meme: https://x.com/gf_256/status/1514131084702797827

Make it easier to get the data and put fewer roadblocks in the way of legitimate access, and you'll find fewer scrapers. Even if you make scraping _very_ hard, people will still prefer it if legitimate access is even more cumbersome, or if you refuse to offer a legitimate option at all. Admittedly, we are talking here because some people are scraping OSM when they could get the entire dataset for free... but I hope those people are outliers, and that most consume the non-profit org's data in the way it asks.

---

Arguably, trying to scale everything to the whole planet is the root cause of most of these problems. So "that won't scale to the whole planet" might, in the long view, be a feature and not a bug.

---

Many would oppose the idea, but if any service (e.g. eBay, LinkedIn, Facebook) were to dump a snapshot to S3 every month, that could be a solution. You can't prevent scraping anyway.
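For illustration, a minimal sketch of what such a monthly dump job could look like with boto3; the bucket name and dump path are hypothetical, and the export step that produces the file is assumed to exist:

```python
import datetime

import boto3  # AWS SDK for Python

# Hypothetical monthly job: take an already-exported snapshot file and
# publish it to a public S3 bucket so nobody has to scrape the site.
BUCKET = "example-public-dumps"             # hypothetical bucket name
DUMP_PATH = "/var/dumps/snapshot.jsonl.gz"  # produced by a prior export step

key = f"snapshots/{datetime.date.today():%Y-%m}.jsonl.gz"
boto3.client("s3").upload_file(DUMP_PATH, BUCKET, key)
print(f"published s3://{BUCKET}/{key}")
```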

---

Yeah, you can get dumps of Wikipedia and Stack Overflow / Stack Exchange that way.

(Not sure if they're created by the admins or a third party, but doing it once for many is better than overlapping individual efforts.)

---

> What's the endgame here?

We've had good success with:

- Cloudflare Turnstile
- Rate limiting (be careful here, as some of these scrapers use large numbers of IP addresses and user agents)
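To illustrate the rate-limiting part, a minimal sketch of a per-IP token bucket (names and numbers are hypothetical); note the caveat above: scrapers that rotate IPs and user agents will slip past a purely per-IP bucket:

```python
import time
from collections import defaultdict

# Hypothetical per-IP token bucket: each client may burst up to `capacity`
# requests, then regains `rate` tokens per second.
class TokenBucketLimiter:
    def __init__(self, rate: float = 1.0, capacity: float = 10.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = defaultdict(lambda: capacity)
        self.last_seen = defaultdict(time.monotonic)

    def allow(self, client_ip: str) -> bool:
        now = time.monotonic()
        elapsed = now - self.last_seen[client_ip]
        self.last_seen[client_ip] = now
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens[client_ip] = min(self.capacity,
                                     self.tokens[client_ip] + elapsed * self.rate)
        if self.tokens[client_ip] >= 1.0:
            self.tokens[client_ip] -= 1.0
            return True   # serve the request
        return False      # reject, e.g. with HTTP 429

limiter = TokenBucketLimiter(rate=0.5, capacity=5)
print(limiter.allow("203.0.113.7"))  # True until the bucket drains
```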

---

You can't cache this stuff for bot consumption. Humans only want to see the popular stuff; bots download everything. The size of your cache then equals the size of your content database.
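A toy simulation of this (all numbers hypothetical): an LRU cache holding 1% of the catalogue does fine against skewed, human-like traffic and almost nothing against a uniform bot crawl:

```python
import random
from collections import OrderedDict

N_PAGES, CACHE_SIZE, N_REQUESTS = 100_000, 1_000, 200_000  # cache = 1% of pages

def lru_hit_rate(requests):
    cache, hits = OrderedDict(), 0
    for page in requests:
        if page in cache:
            hits += 1
            cache.move_to_end(page)        # mark as most recently used
        else:
            cache[page] = True
            if len(cache) > CACHE_SIZE:
                cache.popitem(last=False)  # evict least recently used
    return hits / len(requests)

# Human-like traffic: heavily skewed towards a few popular pages.
human = [min(int(random.paretovariate(1.2)), N_PAGES) for _ in range(N_REQUESTS)]
# Bot crawl: every page is equally likely to be requested.
bot = [random.randrange(N_PAGES) for _ in range(N_REQUESTS)]

print(f"human-like hit rate: {lru_hit_rate(human):.0%}")  # high
print(f"bot-crawl hit rate:  {lru_hit_rate(bot):.0%}")    # near zero
```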

---

OpenStreetMap Foundation chairperson here.

OpenStreetMap's data is available for free in bulk from https://planet.openstreetmap.org. We encourage using these dumps instead of scraping our site. Scraping puts a high load on our donated resources. We block scraping IPs, but even that takes work and time. Respecting our time and resources helps us keep the service free and accessible for everyone.
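A minimal sketch of doing exactly that: streaming the published "latest" planet file (a PBF of tens of gigabytes) to disk rather than scraping the site:

```python
import shutil
import urllib.request

# Stream the latest full planet dump (PBF) to disk instead of scraping
# the website. Expect a file in the tens of gigabytes.
URL = "https://planet.openstreetmap.org/pbf/planet-latest.osm.pbf"

with urllib.request.urlopen(URL) as resp, \
        open("planet-latest.osm.pbf", "wb") as out:
    shutil.copyfileobj(resp, out, length=1024 * 1024)  # 1 MiB chunks
```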

---

If you're talking about the new-ish data dumps provided in protobuf format, that's a heavily optimised binary format. OrganicMaps uses these files directly to store and look up whole countries locally. In this format, the dump for France is only 4.3 GB at the time of writing.

Also, instead of downloading the whole map, you can use one of the numerous mirrors like Geofabrik [0] to download only the part you're interested in.

[0] https://download.geofabrik.de/
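As a sketch of how directly usable these PBF files are, here's a small pyosmium handler counting highway ways in a regional extract (the local file name for the Geofabrik France extract is assumed):

```python
import osmium  # pip install osmium (pyosmium)

# Count ways tagged as highways in a regional PBF extract, e.g. one
# downloaded from https://download.geofabrik.de/europe/france.html
class HighwayCounter(osmium.SimpleHandler):
    def __init__(self):
        super().__init__()
        self.count = 0

    def way(self, w):
        if "highway" in w.tags:
            self.count += 1

handler = HighwayCounter()
handler.apply_file("france-latest.osm.pbf")  # assumed local file name
print(f"highway ways: {handler.count}")
```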

---

On https://www.openstreetmap.org/, click "Export" (upper left). It lets you choose a small rectangle (click "Manually select a different area") and gives you a .osm file right from the browser.

For a single point, among the map icons on the right there's an arrow with a question mark ("Query features"). With it you can click on individual features and get their data.
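The same rectangle export is also exposed by the public API's map call (API v0.6); a sketch, assuming a bounding box small enough for the server to accept:

```python
import urllib.request

# Fetch raw OSM XML for a small bounding box via the public API.
# bbox order is min_lon,min_lat,max_lon,max_lat; the server rejects
# rectangles that are too large.
bbox = "2.3488,48.8526,2.3540,48.8560"  # a few blocks of central Paris
url = f"https://api.openstreetmap.org/api/0.6/map?bbox={bbox}"

with urllib.request.urlopen(url) as resp:
    with open("extract.osm", "wb") as out:
        out.write(resp.read())
```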

---

> they might care, because they have more respect for the rules

Do they? Didn't OpenAI scrape everything regardless of licences forbidding reuse without attribution or for commercial purposes?

---

Someone recently pointed out that Aaron Swartz was threatened with prison for scraping, while today there are hundreds of billions of dollars invested in AI LLMs built from... scraping.

---

Anything they can, judging by the fact that they're hitting random endpoints instead of using the ones offered to developers. A similar thing happened to Read the Docs [1], causing a surge of costs that nobody wants to answer for.

In the Read the Docs case, one bugged crawler repeatedly scraped the same HTML files to the tune of 75 TB; something similar could (partially) be happening here with OSM.

[1] https://about.readthedocs.com/blog/2024/07/ai-crawlers-abuse...

---

I can hear the AI groaning about regular humans suddenly caring a lot about IP protection and discussing ways to add DRM to protect it.

I really hope the irony isn't lost on everyone.

---

The joke is that you can already download it for free, no donation or bandwidth reimbursement needed: https://wiki.openstreetmap.org/wiki/Planet.osm

I guess since this was posted to the osm.town Mastodon, that's assumed to be known. I was surprised to see it without context here on HN, so I can understand the confusion; apparently most people here are already aware that one can download the full OpenStreetMap data without scraping.

---

I've never heard of any NTA, or of business expenses being converted into income to me when the money never touches my account. Looking up what an NTA is in relation to taxes: they're a nonprofit themselves and don't appear to have any authority beyond PR: https://en.m.wikipedia.org/wiki/National_Tax_Association

Regardless, if this were a concern in Germany, I'm sure our boss/director would have mentioned it on that call as a simple reason, rather than finding excuses and saying we could pick a different nonprofit that needs it more to donate to. Companies donate all the time... the argument that it would be considered income makes no sense, and even if it were, just donate {income tax rate}% less if the company can't afford more, and there's no problem either.

---

How come, in this era of no privacy, pixel trackers, data brokers, etc., they can't easily stop scraping? Somehow bots have a way to stay anonymous online, yet consumers have to fight an uphill battle?

---

If.

(Unfortunately the historical example is a poor one, since Philip II proved both able and willing to make it happen, whereas AI has yet to demonstrate a path to utility.)

---

Is it really? It all boils down to power play, really. And lately lots of powerful entities, from nations to corporations, seem poised to disrupt this institution.

---

What's the endgame here? AI can already solve captchas, so the arms race for bot protection is pretty much lost.