![]() |
| Human me just thought it was a good word for this. It implies some irreversible process of mixing, I think that characterizes this process really well. |
![]() |
| if you are a sales & marketing intern, have a potato laptop and $100 budget to spend on seo, you aren't going to be self hosting anything even if you know what that means. |
![]() |
| I don’t think they were talking about the quality of Google search results. I believe they were talking about how the data was processed by the wordfreq project. |
![]() |
| I personally am yet to see this beyond some slop on youtube. And I am here for the AI meme videos. I recognize the dangers of this, all I am saying is that I don't feel the effect, yet. |
![]() |
| Or check out "money printer" on github: a tongue in cheek mashup of various tools to take a keyword as input and produce a youtube video with subtitles and narration as output. |
![]() |
| I love the table-diagrams at the end. I've never seen anything like that until now and it really seems useful for visualization of the recipe and the sequence of steps. |
![]() |
| I’m not sure what you’ve done to get that level of spam, but I get about 10 spam emails a day at most and that’s across multiple accounts including one that I’ve used for almost 30 years and had used on Usenet which was the uber-spam magnet. A couple newer (10–15 year old) addresses which I’ve published on webpages with mailto links attract maybe one message a week and one that I keep for a specialized purpose (fiction and poetry submissions) gets maybe one to two messages per year, mostly because it’s of the form [email protected] so easily guessed by enterprising spammers.
Looking at the last days’ spam¹ I have three 419-style scams (widows wanting to give away their dead husbands’ grand piano or multi-million euro estate) and three phishing attempts. There are duplicate messages in each category. About fifteen years ago, I did a purge of mailing list subscriptions and there’s very little that comes in that I don’t want, most notably a writer who’s a nice guy, but who interpreted my question about a comment he made on a podcast as an invitation to be added to his manually managed email list and given that it’s only four or five messages a year, I guess I can live with that. ⸻ 1. I cleaned out spam yesterday while checking for a confirmation message from a purchase. |
![]() |
| Books printed before 2018, right?
I already find myself mentally filtering out audible releases after a certain date unless they're from an author I recognize. |
![]() |
| Steel without nuclear contamination is sought after, and only available from pre-war / pre-atomic sources.
The analogy is that data is now contaminated with AI like steel is now contaminated with nuclear fallout.
https://en.wikipedia.org/wiki/Low-background_steel
> Low-background steel, also known as pre-war steel[1] and pre-atomic steel,[2] is any steel produced prior to the detonation of the first nuclear bombs in the 1940s and 1950s. Typically sourced from ships (either as part of regular scrapping or shipwrecks) and other steel artifacts of this era, it is often used for modern particle detectors because more modern steel is contaminated with traces of nuclear fallout.[3][4] |
![]() |
| IMO HN actually scores quite highly on health, politics, and similar content because both mainstream and fringe ideas get shown and get pushback.
In a vaping discussion, someone brought up that the glycerin used was safe and the same thing used in smoke machines, and someone else brought up a study showing that smoke machines are an occasional safety issue. Nowhere near every discussion goes that well, but stick around and you’ll see in-depth discussion. Go to a public health website by comparison and you’ll see warnings without context and possibly a positive spin compared to smoking. https://www.cdc.gov/tobacco/e-cigarettes/index.html I suspect most people get basically nothing from looking at it. |
![]() |
| As a software engineer married to a healthcare professional, I disagree strongly about the quality of the healthcare discussions here. A whole lot of the conversation is software engineers who think that they can reason from first principles in two minutes about this thing that professionals dedicate their whole lives to mastering, and who therefore don't understand the most basic concepts of the field.
Sometimes I try and engage, but honestly, mostly I think it's not worth it. Otherwise you end up doing this with your life: https://xkcd.com/386/ |
![]() |
| Regarding the Human Genome Project specifically: it was research, and no matter what was claimed (give us all of these medical breakthroughs), we (as the public) should understand there is no guarantee. It's similar to how most tech startups propose plans that lead to huge scale and ROI, but nobody is amazed when 3-4 years later they have modest revenue (the lucky ones).
The benefits of understanding more about genomes are growing (e.g., a list of adverse effects based on genotype: https://go.drugbank.com/pharmaco/genomics), but the field is (or was) so chaotic (just one example: there was not one standard for how to count: https://tidyomics.com/blog/2018/12/09/2018-12-09-the-devil-0...) and so lacking in data that it will take many years to reap the benefits (e.g., one of the largest studies, UK Biobank, gave access to researchers only in 2017: https://en.wikipedia.org/wiki/UK_Biobank). |
![]() |
| Obviously on an objective scale HN isn’t good, but nobody is doing a good job here.
I’ve worked on the government side of this stuff and find it disheartening. |
![]() |
| Until N ad views are worth more than $X account creation fee. Then the spammers will just sell ad posts for $X*1.5.
I can’t find it, but there’s someone selling sock puppet posts on HN even. |
![]() |
| They likely never started critically thinking, so they never had to get started on not doing so.
(If children are never taught to think critically, then...) |
![]() |
| > is artificially created
You imply that thousands of years ago everybody was thinking critically? Thinking critically is hard, stressful, and might take some joy from your life. |
![]() |
| > Good will prevail in the end.
Even if so, this is a dangerous thought: it discourages the decisive action that is likely to be necessary for this to happen. |
![]() |
| Tangentially related, but Marx also predicted in 1894 that crypto and NFTs would exist [1], and I only bring it up because it's kind of wild how we keep crossing these "red lines" without even blinking. It's like that meme:
Sci-fi author: I created the Torment Nexus to serve as a cautionary tale...
Tech Company: Alas, we have created the Torment Nexus from the classic Sci-fi novel "Don't Create the Torment Nexus"
[1] https://www.marxists.org/archive/marx/works/1894-c3/ch25.htm |
![]() |
| Can vouch for this. It’s the first non-Google search alternative I’ve used that has 100% replaced Google. I don’t need Google as a fallback like I did with others. |
![]() |
| I've been slowly detaching myself from the web for the past 10 years. These days I mostly build offline apps using native technologies. Those capabilities are still around. They just receded for a while because they'd gotten so polluted with toolbars and malware. But now the malware is on the other side, and native apps are cool again. If you know where to look. Here's my shingle: https://akkartik.name/freewheeling-apps
On the other hand, what you call "The Web" seems to be just what you can get at through search engines. There's still the old web, the thing that's mediated by relationships and reputation rather than aggregation services with billions of users. Like the link I shared above. Or this heroically moderated site we're using right now. |
![]() |
| Ah, you mean the web version of https://en.wikipedia.org/wiki/Blinkers_(horse_tack) . I don't think that helps when you're stopped in your tracks by an upsell. Dominos won't let you order a pizza online until you've declined garlic bread, cinnamon rolls and a liter of pepsi three times. And you can't just click "pepperoni pizza near me", you have to build your pepperoni pizza, after putting in your zip code, selecting the store, carry out, then click build again, sure you don't want buffalo wings too?, ....
|
![]() |
| I suppose it is just an Amazon problem. I have never lived in an area where Amazon is prevalent. Where I live, search engines still can't find synonyms or process misspellings. |
![]() |
| There is a series of challenges like:
https://www.nytimes.com/interactive/2024/09/09/technology/ai...
https://www.nytimes.com/interactive/2024/01/19/technology/ar...
These are a little bit unfair, in that we're comparing handpicked examples, but I don't think many experts will pass a test like this. Technology only moves forward (and seemingly at an accelerating pace).
What's a little shocking to me is the speed of progress. Humanity is almost 3 million years old. Homo sapiens is around 300,000 years old. Cities, agriculture, and civilization are around 10,000. Metal is around 4000. Industrial revolution is 500. Democracy? 200. Computation? 50-100. The revolutions shorten in time, seemingly exponentially. Comparing the world of today to that of my childhood....
One revolution I'm still coming to grips with is automated manufacturing. Going on aliexpress, so much stuff is basically free. I bought a 5-port 120W (total) charger for less than 2 minutes of my time. It literally took less time to earn the money to buy it than to find it. I'm not quite sure where this is all headed. |
![]() |
| > so much stuff is basically free
It really isn't. Have a look at daily median income statistics for the rest of the planet: https://ourworldindata.org/grapher/daily-median-income?tab=t...
And more generally:
I looked around on Ali, and the cheapest charger that doesn't look too dangerous costs around five bucks. So it's roughly equal to one day's income of at least half the population of our planet. |
![]() |
| There are smaller, gated communities that are still very valuable. You're posting in one. But yes, the open Internet is basically useless now, thanks ultimately to advertising as a business model. |
![]() |
| Sure, there's bad actors everywhere, but there's really no incentive to do it here so I don't think it's a problem in the same way it is on the open internet, where slop is actively rewarded. |
![]() |
| That's a nice analogy. Fortunately (un)real estate is easier to manufacture out of thin air online. We have lost some valuable spaces like Twitter and Reddit to some degree though. |
![]() |
| That went off the rails quickly. Calm down dude: my mother-in-law isn't going to forget words because of AI; she's gonna forget words because she's 3 glasses of crappy Texas wine into the evening. |
![]() |
| I guess it would be interesting, but differentiating pollution from language evolution seems very tricky, since getting a non-polluted corpus gets harder and harder. |
![]() |
| Arguably it is a form of language evolution. I bet humans have started using "delve" more too, on average. I think the best we can do is look at the trends and think about potential causes. |
![]() |
| >AI didn't just occur in 2021. Nobody knows how much text was machine generated prior to 2021
But we do know that now it's a lot more, with a big LOT. |
![]() |
| Twitter was accused of being full of bots long before ChatGPT appeared. For 140 characters, a template with synonyms would be enough to create mass-generated content. |
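To illustrate the template-with-synonyms point above, here is a minimal sketch in Python. The template, the slot names, and the `expand` helper are all invented for illustration; nothing here is taken from any real bot.

```python
from itertools import product

# Hypothetical template and synonym slots, made up for this example.
TEMPLATE = "{greeting} {product} is {adjective}!"
SLOTS = {
    "greeting": ["Wow,", "Honestly,", "Folks,"],
    "product": ["this app", "this tool"],
    "adjective": ["great", "amazing", "fantastic", "superb"],
}


def expand(template, slots, limit=140):
    """Return every filling of the template that fits in `limit` characters."""
    names = list(slots)
    posts = []
    # Cartesian product over the synonym lists: one post per combination.
    for combo in product(*(slots[name] for name in names)):
        text = template.format(**dict(zip(names, combo)))
        if len(text) <= limit:
            posts.append(text)
    return posts


posts = expand(TEMPLATE, SLOTS)
print(len(posts))  # 3 * 2 * 4 = 24 variants
print(posts[0])
```

Three short synonym lists already give 24 distinct posts from one sentence, and the count multiplies with every extra slot or synonym, which is roughly why such a scheme would have been enough for short messages.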
![]() |
| >our AIs may not be very good at learning but our brains are
Brains aren't nearly as good at slightly adjusting the statistical properties of a text corpus as computers are. |
![]() |
| Maybe, but if you’re studying the way humans use language you’re still getting human-made data from rubbish. There isn’t any value in AI-generated content if what you’re cataloging is human language. |
![]() |
| It might be fun to collect the same data, if for no other reason than to note the changes, but with the caveat that it doesn’t represent human output.
Might even change the tool name. |
![]() |
| Exactly, like how "mindful" and "demure" recently became more popular for seemingly no reason. Humans do this all the time.
And language in general stagnates and shrinks in vocabulary over time ( https://evoandproud.blogspot.com/2019/09/why-is-vocabulary-s... ). (Link that ChatGPT helped me find :P) I think AI will increase the average person's vocabulary, since it appears in general to be better and more professionally written than a lot of what the average person is exposed to online. |
![]() |
| Okay but how big of a sample size do we even actually need for word frequencies? Like what’s the goal here? It looks like the initial project isn’t even stratified per year/decade |
![]() |
| I've wondered from time to time why I collect history books, keep my encyclopedias, when I could just google it. Now I know why. They predate AI and are unpolluted by generated bilge. |
![]() |
| Sad to see wordfreq halted, it was a real party for linguistics enthusiasts. For those seeking new tools, keep expanding your knowledge with socialsignalai. |
![]() |
| Hah! I'm trying to figure out the exact date that crossed from "plausible line from a Stross or Sterling novel" [1] to "of course they did".
[1] Or maybe Sheckley or Lem, now that I think about it. |
![]() |
| Most of the "random" bot content pre-2021 was low-quality Markov-generated text. If anything, these generative AI tools would improve the accuracy of scraping large corpora of text from the web. |
![]() |
| "Sure, there was spam in the wordfreq data sources, but it was manageable and often identifiable."
How sure can we be about that? |
![]() |
| I don't know what difference you are referring to. I was agreeing with you.
And also agreed: many trumpet the merits of "unassisted" human output. However, they're suffering from ancestor veneration: human writing has always been a vast mine of worthless rock (slop) with a few gems of high-IQ analysis hidden here and there. For instance, upon the invention of the printing press, it was immediately and predominantly used for promulgating religious tracts. And even when you got to Newton, who created for us some valuable gems, much of his output was nevertheless deranged and worthless. [1] It follows that, whether we're a human or an LLM, if we achieve factual grounding and the capacity to reason, we achieve it despite the bulk of the information we ingest. Filtering out sludge is part of the required skillset for intellectual growth, and LLM slop qualitatively changes nothing. [1] https://www.newtonproject.ox.ac.uk/view/texts/diplomatic/THE... |
![]() |
| This has to be the most annoying hacker news comment section I've ever seen. It's just the same ~4 viewpoints rehashed again, and again, and again. Why don't folks just upvote other comments that say the same thing instead of repeating the same things?
And now a hopefully new comment: having a word frequency measure of the internet as we're going into AI being more used would be IMMENSELY useful specifically _because_ more of the internet is being AI generated! I could see such a dataset being immensely useful to researchers who are looking for the impacts of AI on language, and to test empirically a lot of claims the author has made in this very post! What a shame that they stopped measuring. Also: as to the claims that AI will cause stagnation and a reduction of the variance of English vocabulary used, this is a trend in English that's been happening for over 100 years ( https://evoandproud.blogspot.com/2019/09/why-is-vocabulary-s... ). I believe the opposite will happen: AI will increase the average person's vocabulary, since chat AIs tend to be more professionally written than a lot of the internet. It's like being able to chat with someone that has an infinite vocabulary. It also makes it possible for people to read complicated documents well out of their domain, since they can ask not just for definitions but for more in-depth explanations of what words/sections mean. Here's to a comment that will never be read because of all the noise in this thread :/ |
![]() |
| They want to display how they’re truly intelligent (unlike LLMs) by *checks notes* rehashing opinions that they’ve read millions of times online.
Sound familiar to anyone? |
![]() |
| I wonder whether future generations will be ingrained with a Truman Show fear that maybe only the few thousand people they meet are real and everything else is generated background noise. |
![]() |
| Man the AI folks really wrecked everything. Reminds me of when those scooter companies started just dumping their scooters everywhere without asking anybody if they wanted this. |
![]() |
| Are you saying that the Overton window can never move right as long as we maintain more human rights and tolerance than we had decades before whatever the current time period is? |
![]() |
| They said nothing about it being “only” on the left.
I somewhat expect authoritarianism on the right and therefore would hold the left (to which I belong) at a higher standard. |
![]() |
| The problems are well known and highly documented. You should leave the determination of (b) up to those who know and understand (a), which includes the author. |
![]() |
| Reddit shut their API access down only very recently, after the AI craze went off. Twitter did so right after Musk took over, way before Reddit, way before AI ever went nuts. |
It also made the web a less than ideal source for training. And yet LLMs were still fed articles written for Googlebot, not humans. ML/LLM is the second iteration of writing pollution. The first was humans writing for corporate bots, not other humans.