![]() |
| You can apply statistical techniques to anything you want. Embeddings are just vectors of numbers which capture some meaning, so statistical analysis of them will work fine. |
![]() |
| Embeddings have structure, or they wouldn't be very useful. E.g. cosine similarity works because (many) embeddings are designed to support it. |
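To make the cosine similarity point concrete, here is a minimal sketch in plain Python. The three toy vectors and their names (`invoice`, `receipt`, `novel`) are made up for illustration; they are not real embeddings and do not come from the article.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (hypothetical values, for illustration only).
invoice = [0.9, 0.1, 0.0]
receipt = [0.8, 0.2, 0.1]
novel = [0.0, 0.2, 0.9]

sim_close = cosine_similarity(invoice, receipt)  # vectors pointing the same way
sim_far = cosine_similarity(invoice, novel)      # nearly orthogonal vectors
```

Because the score depends only on direction and not on vector length, it is only meaningful when the embedding model was trained so that direction encodes meaning, which is the point made above.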
![]() |
| Hi snats, great article. You mention the accuracy of the various techniques you used. Could you explain more about how you calculated it? Were the PDFs already categorized?
Thanks! |
![]() |
| Back in 2006 there were multiple 1 TB collections of textbooks as torrents. I imagine the size and number have only grown since then. |
![]() |
| Just wondering, what do you collect? Is it mainly mirroring things like libgen?
I have a decent collection of ebooks/pdfs/manga from reading. But I can’t imagine how large a 20TB library is. |
![]() |
| > I bet the total number of PDFs is close to a petabyte if not more.
That's a safe bet. I've seen PDFs in the GBs from users treating it like a container format (which it is). |
![]() |
| It's probably tens of petabytes if not more, if you count PDFs that'd be private. Invoices, order confirmations, contracts. There's just so so much. |
![]() |
| I have >10TB of magazines I've collected so far, and I could probably source another 50TB if I had the time. I'm working on uploading them, but I've had too much on my plate lately: https://en.magazedia.wiki/
There is a significant issue with copyright, though. I'll remove anything with a valid DMCA, but 99.9% of the world's historical magazine issues are now in IP limbo as their ownership is probably unknown. Most of the other .1% aren't overly concerned as distribution is their goal and their main income is advertising, not sales. |
![]() |
| Classification is just a start. Wondering if it's worth doing something more -- like turning all of the text into Markdown or HTML? Would anyone find that interesting? |
![]() |
| My first thought on seeing the PCA embeddings scatterplot was "I wonder what PDFs are at the centre of those two clusters?" The most typical PDFs on the internet. |
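One way to answer that question, sketched under assumptions: given the 2D points from the scatterplot and a cluster label for each point (for example from k-means; the article does not say how clusters would be assigned), pick the point closest to each cluster's centroid. The function name `most_typical` and the sample data are hypothetical.

```python
import math
from collections import defaultdict

def most_typical(points, labels):
    """Return {label: index of the point closest to that cluster's centroid}."""
    groups = defaultdict(list)
    for idx, label in enumerate(labels):
        groups[label].append(idx)

    result = {}
    for label, indices in groups.items():
        dims = len(points[indices[0]])
        # Centroid = per-dimension mean of the cluster's points.
        centroid = [
            sum(points[i][d] for i in indices) / len(indices)
            for d in range(dims)
        ]
        # The "most typical" member is the one nearest the centroid.
        result[label] = min(indices, key=lambda i: math.dist(points[i], centroid))
    return result

# Hypothetical projected points and cluster labels, for illustration only.
points = [(0.0, 0.0), (1.0, 0.0), (0.5, 0.1),
          (10.0, 10.0), (11.0, 10.0), (10.4, 10.1)]
labels = ["a", "a", "a", "b", "b", "b"]

typical = most_typical(points, labels)
```

The returned indices can then be mapped back to the source documents. Doing this in the original embedding space rather than the 2D projection would likely be more faithful, since PCA discards most dimensions.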
![]() |
| I've been playing with https://www.aryn.ai/ for Partitioning. Curious if anyone has tried these tools for better data extraction from PDFs. Any other suggestions?
(I'm a bit disappointed that most of the discussion is about estimating the size of PDFs on the internet, I'd love to hear more about different approaches to extracting better data from the PDFs.) |
![]() |
| >I feel like RTBF is kind of a lost battle these days
For those of us who aren't familiar with this random acronym, I think RTBF = right to be forgotten. |
![]() |
| >RTBF isn't about having your information wiped from the internet.
Your take is misleading enough to be considered wrong. It's "don't use public information about me in search engines, I don't want people to find that information about me", not simply "don't use my information for marketing purposes". From the first paragraph of https://en.wikipedia.org/wiki/Right_to_be_forgotten:
> The right to be forgotten (RTBF) is the right to have private information about a person be removed from Internet searches and other directories in some circumstances. The issue has arisen from desires of individuals to "determine the development of their life in an autonomous way, without being perpetually or periodically stigmatized as a consequence of a specific action performed in the past". The right entitles a person to have data about them deleted so that it can no longer be discovered by third parties, particularly through search engines. |
![]() |
| There is a whole business sector for "online reputation fixers": https://www.mycleanslate.co.uk/
What they usually do:
- Spam Google with the name to bury content
- Send legal threats and use GDPR
They have legit use cases, but are often used by convicted or shady businessmen, politicians, and scammers to hide their earlier misdeeds. |
![]() |
| I upvoted this comment because, though the number is wrong, it proves the point. The fact that the correct number proves the point even more is a reason _not_ to downvote the comment. |
![]() |
| I haven't downvoted you but it is presumably because of your hasty typing or lack of proofreading/research.
It's 33TB (first Google result from 5 years ago), not 33GB. More recent figures are larger. |
Newton, G., A. Callahan & M. Dumontier. 2009. Semantic Journal Mapping for Search Visualization in a Large Scale Article Digital Library. Second Workshop on Very Large Digital Libraries at the European Conference on Digital Libraries (ECDL) 2009. https://lekythos.library.ucy.ac.cy/bitstream/handle/10797/14...
I am the first author.