![]() |
| Assume everyone is familiar with this project, dating back to 1996:
https://en.wikipedia.org/wiki/WWWOFFLE
https://ftp.netbsd.org/pub/pkgsrc/distfiles/wwwoffle-2.9j.tg...
The way the www is going, it seems like downloading a copy of libgen, i.e., nonfiction books, and scimag, i.e., academic journals, via torrent, would be more valuable than archiving websites, in general. These primary sources are part of the material used to train so-called "AI" anyway. The problem is that this so-called "AI" also includes all the garbage from the www. Worst case is eventually these books and journals will again become publicly inaccessible but "AI" will be offered as a bogus substitute; a future where few people will do research using primary materials anymore, they will just submit questions to a remote "AI" server. Truth will be decimated. |
![]() |
| Missed the edit window, but here's the command I use. Newlines added here for clarity.
Some notes:
- This command hits servers as fast as possible. Not sorry. I have encountered a very small number of sites-I-care-to-mirror that have any sort of mitigation for this. The only site I'm IP banned from right now is http://elm-chan.org/ and that's just because I haven't cared to power-cycle my ISP box or bother with VPN. If you want to be a better neighbor than me, look into wget's `--wait`/`--waitretry`/`--random-wait`.
- The only part of this I'm actively unhappy with is the fixed version number in my fake User-Agent string. I go in and increment it to whatever version's current every once in a while. I am tempted to try automating it with an additional call to `date` assuming a six-week major-version cadence.
- The `--reject-regex` is a hack to work around lots of CMS I've encountered where it's possible to build up links with an infinite number of path separators, e.g. a `www.example.com///whatever` containing a link to `www.example.com////whatever` containing a link to…
- I am using wget1 aka wget. There is a wget2 project, but last time I looked into it wget2 did not support something I needed. I don't remember what that something was lol
- I have avoided WARC because I usually prefer the ergonomics of having separate files and because WARC seems more focused on use cases where one does multiple archives over time (as is the case for Wayback Machine or a search engine) where my archiving style is more one-and-done. I don't tend to back up sites that are actively changing/maintained.
- However I do like to wrap my mirrored files in a store-only Zip archive when there are a great number of mostly-identical pages, like for web forums. I back up to a ZFS dataset with ZSTD compression, and the space savings can be quite substantial for certain sites. A TAR compresses just as well, but a `zip -0` will have a central directory that makes it much easier to browse later.
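For anyone wanting to try the two ideas in the notes above (a date-derived User-Agent version and a reject pattern for runaway path separators), here is a minimal sketch. It is not the command from this comment. The function names, the baseline version `100`, the baseline date (2024-01-01 UTC as epoch seconds), and the pattern `[^:]//` are all assumptions made up for illustration, and the pattern is only checked here with `grep -E`, not with wget itself.

```shell
#!/bin/sh
# Hypothetical helpers; none of these names or values come from the original command.

# Estimate a browser major version from a known (version, date) baseline,
# assuming one major release every CADENCE_DAYS days (default 42 = six weeks).
# Usage: estimate_major_version BASE_VERSION BASE_EPOCH NOW_EPOCH [CADENCE_DAYS]
estimate_major_version() {
    base_version=$1
    base_epoch=$2
    now_epoch=$3
    cadence_days=${4:-42}
    elapsed=$(( now_epoch - base_epoch ))
    # Integer division: only whole release cycles bump the version.
    echo $(( base_version + elapsed / (cadence_days * 86400) ))
}

# Assumed pattern: a doubled slash that is NOT directly preceded by a colon,
# so the "//" in "https://" is left alone but "host//path" is caught.
REJECT_REGEX='[^:]//'

# Succeeds (exit status 0) when the URL contains a repeated path separator.
has_repeated_separator() {
    printf '%s\n' "$1" | grep -Eq "$REJECT_REGEX"
}

# Demo: no network access, just prints what would be used or rejected.
now=$(date +%s)
echo "estimated major version: $(estimate_major_version 100 1704067200 "$now")"
for url in 'https://www.example.com/whatever' 'https://www.example.com///whatever'; do
    if has_repeated_separator "$url"; then
        echo "reject: $url"
    else
        echo "keep:   $url"
    fi
done
```

The estimated number could be spliced into the string passed to wget's `--user-agent`, and the pattern passed to `--reject-regex`. Whether wget's default regex type treats the pattern exactly like `grep -E` does is an assumption worth testing on a small site first.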
Here is an example of the file usage for http://preserve.mactech.com with separate files vs plain TAR vs DEFLATE Zip archive vs store-only Zip archive. These are all on the same ZSTD-compressed dataset and the DEFLATE example is here to show why one would want store-only when fs-level compression is enabled.
Also I lied and don't have a full TiB yet ;)
Some of this could stand to be re-organized. Since I've gotten more into it I've gotten better at anticipating an ideal directory depth/specificity at archive time instead of trying to come back to them later. Like `DIY` (i.e. home improvement) there should go into `Hobby` which did not exist at the time, `SA` (SomethingAwful) should go into `Communities` which did not exist at the time, `Cars` into `Transportation`, etc.
`Personal` is the directory that's been hardest to sort because personal sites are one of my fav things to back up but also one of the hardest things to try and organize when they reflect diverse interests. For now I've settled on a hybrid approach. If a site is geared toward one particular interest or subculture, it gets sorted into `Personal/ |
![]() |
| Why, what's the point in doing such nonsense? Unless it's someone with lots of money, contacts in the dark web, and some historic Barbra Streisand type chip on the shoulder. |
![]() |
| That is exactly the problem. These services are constantly at war with each other and are attacked by competitors. Cloudflare provides DDoS protection to the DDoS providers so they can keep their services online, which directly benefits Cloudflare by DDoS being a bigger problem than if they were all busy attacking each other.
This is a sampling of currently available services and who they use for DDoS protection:
Just for fun head over to Cloudflare's abuse reporting site and try to figure out how to get one of these taken down. https://abuse.cloudflare.com/ |
![]() |
| What's the significance of that?
(Googling "Jason Scott TIA" gives me "Dr Jason Scott is a Senior Research Fellow in the Tasmanian Institute of Agriculture" which doesn't explain much to me) |
![]() |
| The beauty of acronyms/initialisms that people are too lazy to spell out!
TIA = The Internet Archive (i.e. the victim of the DDoS).
> The user you're responding to is Jason Scott of The Internet Archive |
![]() |
| Seriously? People do this shit for fun. There used to be a program (LOIC) popular on 4chan used for DDoS attacks all the time, it's the origin of the "firin mah lazer" meme. |
![]() |
| > Similar systems were proposed in the late '90s/early 2000s (hashcash/micropayments) to combat spam.
These ideas have indeed existed for decades.
> The big problem isn't a technological one, it's that it presupposes some "sweet spot" price (negligible for legitimate users, yet prohibitive for abusers) that has never been shown to exist in reality.
Advances in technology, both software and hardware, make it easier and easier for that sweet spot to exist. It certainly didn't exist in the past, but we are close right now.
One example that I think is useful here is aluminum cans for fizzy drinks. Aluminum, a strong metal compared to cardboard or plastic or glass, is better at withstanding pressurized gases without exploding. The downside is that it's more expensive. Once manufacturing prices dropped a lot, it became feasible to drink half a liter of liquid and just throw away the metal. Aluminum is still not free, but the small price is worth it. It's also a huge waste of energy to smelt all that metal and throw it away after 10 minutes of drinking, but it is economically viable. One could manufacture titanium cans and drink even more fizzy drinks, but that's not economically viable as of today.
> you're arguably just moving the problem to DDoSing the payment processing / firewall mechanism.
Yes, the problem is moved elsewhere; that's the weak link in the scheme I described. The thing is that a flood of transactions still costs money. Blockchains cannot be flooded with mere requests; they have to be flooded with transactions. Take a look at the article [1], which outlines some ideas. I don't agree with a lot of things in there, but it states the problem and gives some numbers.
The theory, when it comes to blockchains deterring DDoS attacks (and other kinds of attacks), is that there are no bad guys in general, just rational economic actors who use dirty tricks. When a dirty trick starts to cost money and the profit disappears from an attack, the rational economic actor will stop the attack.
A real bad guy would resume the attack regardless of profit, but one of the axioms of the theory is that there are no bad guys.
[1] https://www.dlnews.com/articles/defi/ddos-attacks-are-an-inc... |
![]() |
| Making the news is the goal. As long as a few people can verify it was you, word will get around about the person who can take down big targets, and will cite the news articles as part of the proof. |
![]() |
| Anyone who doesn't like the availability and accessibility of history and documents.
Lots of people want to rewrite or erase history. Quoting a story I wrote about this a few years ago: "Everything you speak, all ideas, all things, all thoughts, they are all of the past. Society and knowledge is a composite of the shadows of former presents. When people lie or misrepresent knowledge they speak of a past they wish to change. What if people who have the most to gain from deceit had a tool to actually change the past and make these lies the truth?" Here it is if you're curious https://kristopolous.medium.com/stephen-hawking-had-a-time-t... |
![]() |
| > Sorry to say, archive.org is under a ddos attack. The data is not affected, but most services are unavailable. We are working on it & will post updates in comments. |
![]() |
| Yes. This is why it's maddening that most people don't take computer security seriously. Virus infected devices are what give these botnets their scale and wide distribution. |
I have a `wget-mirror` shell function invoking wget with all the trimmings that takes care of 99% of sites. I’ll edit the full command into this comment when I get home if anybody else wants to start doing the same :)