(Comments)

Original link: https://news.ycombinator.com/item?id=39450669

Here, commenters raise concerns about ChatGPT's accuracy and reliability, pointing to instances where it produced inconsistent or false information. Others in the thread argue that, while ChatGPT does have occasional flaws, overall it saves a great deal of time for those who use it consistently and creatively. In addition, some highlight potential applications of AI in academia and industry, noting that individuals with creative insight in those fields remain necessary despite advances in LLMs. While one commenter suggests seeking a better understanding of the mathematics underlying ChatGPT rather than attributing personality traits to it, the conversation as a whole reveals ongoing debate and skepticism about the effectiveness and purpose of using ChatGPT specifically for academic research.

Related articles

Original text
ChatGPT went berserk (garymarcus.substack.com)
419 points by RafelMri 1 day ago | 449 comments

Original: If anyone's curious about the (probable) non-humorous explanation: I believe this is because they set the frequency/presence penalty too high for the requests made by ChatGPT to the backend models. If you try to raise those parameters via the API, you'll have the models behave in the same way.

It's documented pretty well - https://platform.openai.com/docs/guides/text-generation/freq...

OpenAI API basically has 4 parameters that primarily influence the generations - temperature, top_p, frequency_penalty, presence_penalty (https://platform.openai.com/docs/api-reference/chat/create)

UPD: I think I'm wrong, and it's probably just a high temperature issue - not related to penalties.

Here is a comparison with temperature. gpt-4-0125-preview with temp = 0.

- User: Write a fictional HN comment about implementing printing support for NES.

- Model: https://i.imgur.com/0EiE2D8.png (raw text https://paste.debian.net/plain/1308050)

And then I ran it with temperature = 1.3 - https://i.imgur.com/pbw7n9N.png (raw text https://dpaste.org/fhD5T/raw)

The last paragraph is especially good:

> Anyway, landblasting eclecticism like this only presses forth the murky cloud, promising rain that’ll germinate more of these wonderfully unsuspected hackeries in the fertile lands of vintage development forums. I'm watching this space closely, and hell, I probably need to look into acquiring a compatible printer now!
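For anyone who wants to try the same comparison, a minimal sketch using the OpenAI Python client (assuming the openai>=1.0 chat.completions interface and an OPENAI_API_KEY in the environment; the model name and prompt are the ones from this comment):

    # Minimal sketch, not a verified reproduction of the exact runs above.
    from openai import OpenAI

    client = OpenAI()
    prompt = "Write a fictional HN comment about implementing printing support for NES."

    for temp in (0.0, 1.3):
        resp = client.chat.completions.create(
            model="gpt-4-0125-preview",
            messages=[{"role": "user", "content": prompt}],
            temperature=temp,        # the parameter being compared here
            frequency_penalty=0.0,   # the penalties mentioned above, left at defaults
            presence_penalty=0.0,
        )
        print(f"--- temperature={temp} ---")
        print(resp.choices[0].message.content)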



I don't think it's a temperature issue because everything except the words is still coherent. It's kept the overall document structure and even the right grammar. Usually bad LLM sampling falls into an infinite loop too, though that was reported here.


Always needs oneself a good eldritchEnumerator! Sorry, gotta go feed the corpses, sorry corpuses for future scraping.


Correct me if I'm wrong: Temperature is the rand function that prevents the whole system from being a regular deterministic program?


Pretty much.

The model outputs a number for each possible token, but rather than just picking the token with the biggest number, each number x is fed to exp(x/T) and then the resulting values are treated as proportional to probabilities. A random token is then chosen according to said probabilities.

In the limit of T going to 0, this corresponds to always choosing the token for which the model output the largest value (making the output deterministic). In the limit of T going to infinity, it corresponds to each token being equally likely to be chosen, which would be gibberish.
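A toy sketch of that sampling scheme, purely illustrative (made-up scores, no real model):

    import math, random

    def sample_with_temperature(logits, T):
        """Pick a token index from raw model scores using temperature T."""
        if T == 0:
            # T -> 0 limit: greedy, always the highest-scoring token.
            return max(range(len(logits)), key=lambda i: logits[i])
        weights = [math.exp(x / T) for x in logits]      # exp(x/T), as described above
        total = sum(weights)
        probs = [w / total for w in weights]
        return random.choices(range(len(logits)), weights=probs, k=1)[0]

    # Toy scores for three candidate tokens: at T=0.1 the first almost always wins,
    # at T=10 the three are chosen nearly uniformly (i.e. gibberish).
    print(sample_with_temperature([2.0, 1.0, 0.1], 0.1))
    print(sample_with_temperature([2.0, 1.0, 0.1], 10.0))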



Close. Temperature is the coefficient of a term in a formula that adjusts how likely the system is to pick a next token (word/subword) which it thinks isn't as likely to happen next as the top choice.

When temperature is 0, the effect is that it always just picks the most likely one. As temperature increases it "takes more chances" on tokens which it deems not as fitting. There's no takesies backies with autoregressive models though so once it picks a token it has to run with it to complete the rest of the text; if temperature is too high, you get tokens that derail the train of thought and as you increase it further, it just turns into nonsense (the probability of tokens which don't fit the context approximates the probability of tokens that do and you're essentially just picking at random).

Other parameters like top p and top k affect which tokens are considered at all for sampling and can help control the runaway effect. For instance there's a higher chance of staying cohesive if you use a high temperature but consider only the 40 tokens which had the highest probability of appearing in the first place (top k=40).
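A toy sketch of a top-k cutoff combined with temperature (illustrative only; real implementations work on the model's actual logits):

    import math, random

    def sample_top_k(logits, k=40, T=1.0):
        """Keep only the k highest-scoring tokens, then sample with temperature T."""
        top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
        weights = [math.exp(logits[i] / T) for i in top]
        total = sum(weights)
        return random.choices(top, weights=[w / total for w in weights], k=1)[0]

    # Even at a high temperature, tokens outside the top k can never be picked,
    # which limits the runaway effect described above.
    print(sample_top_k([2.0, 1.0, 0.1, -3.0, -5.0], k=2, T=1.5))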



> There's no takesies backies with autoregressive models

Doesn’t ChatGPT use beam search?



Almost certainly not.

It's absolutely just sampling with temperature or top_p/k, etc. Beam searches would be very expensive, I can't see them doing that for chatgpt which appears to be their "consumer product" and often has lower quality results compared to the api.

The old legacy API had a "best_of" option, but that doesn't exist in the new API.





Azure OpenAI seemed to have temperature problems before, i.e. temp > 1 led to garbage, at 2 it was producing random words in random character encodings, at 0.01 it was producing what OpenAI's model was producing at 0.5 etc. Perhaps they took the Azure's approach ;-)


That might explain why I found GPT4 via Azure a bit useless unless I turned the temperature down…


Landblasting eclecticism is always worthy of pressing forth the murky cloud.


wow this really makes me think the temperature on my brain is set higher than other sapients


The Murky Cloud sounds like a great sarcastic report on how cloud things explode in the style of the old Register.


Last time I tried a temp above 1, I almost instantly got gibberish. Pretty reliable parameter if you want to make the transformer output unusable.


This is amazing. The examples are like Lucky's speech from Waiting for Godot. Pozzo commands him to "Think, pig", and then:

> Given the existence as uttered forth in the public works of Puncher and Wattmann of a personal God quaquaquaqua with white beard quaquaquaqua outside time without extension who from the heights of divine apathia divine athambia divine aphasia loves us dearly with some exceptions for reasons unknown but time will tell and suffers like the divine Miranda with those who for reasons unknown but time will tell are plunged in torment plunged in fire whose fire flames if that...

And on and on for four more pages.

Read the rest here:

https://genius.com/Samuel-beckett-luckys-monologue-annotated

It's one of my favorite pieces of theatrical writing ever. Not quite gibberish, always orbiting meaning, but never touching down. I'm sure there's a larger point to be made about the nature of LLMs, but I'm not smart enough to articulate it.



> …always orbiting meaning, but never touching down.

This is a nice turn of phrase :) .



Indeed. Notable. Added to personal lexicon.


In fairness, Beckett's life story isn't too far off crazy nonsense: sometime secretary to James Joyce, member of the French resistance, acquaintance and local driver for Andre the Giant...


My favourite bit is that he's the answer to the trivia question of who's the only first class cricketer to win a Nobel Prize!


Wow! These two comments (parent and GP) tie together so many previously unrelated things in my life. (Like Beckett, read with a teacher that I also took a lot of Shakespeare plays from; read Joyce with the book group my bridge club spun off; got introduced to cricket via attending an IPL game in Chennai in '08; and loved Princess Bride both in high school and watching with my high school aged kids).


My first thought was that it reads like a kind of corporate Finnegan's Wake. It reads like poetic, rhythmic nonsense.


That was my first thought as well! I guess one of the Ls in LLM is for Lucky.


Looking at the examples... Was someone using an LLM to generate a meeting agenda?

I hope ChatGPT would go berserk on them, so that we could have a conversation about how meetings are supposed to help the company make decisions and execute, and that it is important to put thought into them.

As much as school and big-corporate life push people to BS their way through the motions, I wonder why enterprises would tolerate LLM use in internal communications. That seems to be self-sabotaging.



You will machine generate the meeting agenda. My machine will read the meeting agenda, read your personal growth plan, read your VP's quarterly objectives, and tell me what you need in the meeting, and I will send an AI to attend the meeting to share the 20 minute version of my three bullet point response.

Knowing that this will happen, you do not attend your own meeting, and read the AI summary. We then call it a day and go out for drinks at 2pm.



True. Meanwhile, Sally in IT is still earnestly thinking 10x more than all stakeholders in her meetings combined, and is baffled why the company can't execute, almost as if no one else is actually doing their job.

You and I will receive routine paychecks, bonuses, and promos, but poor Sally's stress from a dysfunctional environment will knock decades off her healthy lifespan.

Before then, if the big-corp has gotten too hopeless, I suppose that the opportunistic thing to do would be to find the Sallys in the company, and co-found a startup with them.



Sounds like a few places I’ve worked, minus the AI in the middle.


When does the

"Actually, get rid of all the humans"

happen in this chain of events?



Never. 100 years of unparalleled technological progress and productivity gains have led to a society where 96.3% of the American labor pool is forced to work. Why should AI be any different than any of the "job saving" inventions that came before?


Because you don't have to pay AI.


In the AI utopia, "knowledge work" is delegated to computers, and the humans who used to do productive and rewarding things will simply do bullshit jobs [0] instead.

[0] https://en.wikipedia.org/wiki/Bullshit_job



Ah yes, the more better technology makes more, better jobs for horses argument.


The purpose of the system is to move cashflows through the managers of the system so they can capture them. So no sufficiently large system can get rid of the humans it is designed to move money through unless there is some catastrophic watershed moment, like last year, where it becomes acceptable and an organizational imperative to shed managers. Remember, broadly the purpose of employees is to increase manager headcount so managers can get promoted to control larger cashflows.


Humans are legally and contractually required.

No, seriously, there are rules having nothing to do with AI that require certain things to be done by separate individuals, implying that you need at least two humans.



once they get around all the bugs causing cross bot sexual harassment, we are doomed


Fully automated communism is when we all agree to cut back on meetings and spend 35 hours a week goofing off in our cube.


Yeah. Almost every time I see someone excitedly show me how they've used ChatGPT to automate some non-marketing writing I just come away thinking "congratulations on automating wasting everyone else's time". If your email can be summed up in a couple of sentences, maybe just paste that into the body and click send!


> If your email can be summed up in a couple of sentences, maybe just paste that into the body and click send!

Rewrite this email in the style of Smart Brevity is what I do. Done.



> If your email can be summed up in a couple of sentences, maybe just paste that into the body and click send!

but then no one will get to see how smart and professional I am.



because recipient monkey not like word used get mad


Yeah I can understand its use when it genuinely is in a context where presentation matters, but for internal, peer-level comms it feels like the equivalent of your colleague coming into the office and speaking to you with the fake overpoliteness and enthusiasm of a waiter in a restaurant. It's annoying at best and potentially makes them appear vapid and socially distant at worst.

Of course plenty of people make this mistake without AI, e.g. dressing up bad news in transparent "HR speak"/spin that can just make the audience feel irritated or even insulted

In many cases plain down-to-earth speech is a hell of a lot more appreciated than obvious fluff

But rather than being a negative nancy, perhaps I will trial using ChatGPT to help make my writing more simple and direct to understand



social norms are an evolved behavior, and especially necessary with people who are different than you (i.e. not your buddies who are all the same). ignore at your peril


An hour ago I sat across from a member of upper management in a mid-sized (1000+ FTE) AE firm that bragged about doing exactly that.

AI is coming for middle management's jobs... and that's a good thing.



Is there a list somewhere of all companies that still have middle management? Those are the companies to short.


AE is a special case. Procurement law for public agencies in the US requires qualifications-based selection for professional services. The price is then negotiated, but it's basically whatever the consultant says it is as long as they transparently report labor hours. This leads to the majority of effort being labor-intensive make-work pushed to expensive labor categories. There is no market process for discovering efficient service providers. This is part of the reason why workflows for transportation infrastructure design haven't improved in 30 years and probably won't until the legal landscape changes.




Every company over 50 people.


The instant I heard about ChatGPT I thought one of its main uses would be internal reporting. There are so many documents generated that are never closely read and so many middle managers who would love to save time writing them.


Corporate bullshit is the perfect usecase for LLMs. Nobody reads that stuff anyway; people just go through the motions when planning it, sitting through it and doing meeting notes. Just let AI do it! No need to even pretend.


Perhaps they asked for an agenda so they can get a 'nice' example to mimic/use as a template (e.g. remember to write times and duration like this: "09:15-09:45 (30 minutes)").


Or perhaps people are poo pooing a useful tool and they asked it something like "read these transcriptions from our many hour long workshops about this new project and write an agenda for a kick off meeting, summarise the points we've already decided and follow up with a list of outstanding questions".

Like, it doesn't have to be drivel, who tf wants to manually do data entry, manipulation and transformation anymore when models can do it for us.



I generate all kinds of documents by dictating unstructured train of thought to the app; it's wonderful at it. Why not meeting agendas as well?


When you see these failures, it becomes apparent that LLMs are just really good autocomplete engines.

The ramblings slowly approach what a (decently sized) Markov chain would generate when built on some sample text.

It will be interesting debugging this crap in future apps.
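For comparison, a toy word-level Markov chain generator of the sort being alluded to (purely illustrative):

    import random
    from collections import defaultdict

    def build_chain(text, order=2):
        """Map each word n-gram to the words observed to follow it."""
        words = text.split()
        chain = defaultdict(list)
        for i in range(len(words) - order):
            chain[tuple(words[i:i + order])].append(words[i + order])
        return chain

    def generate(chain, length=30):
        state = random.choice(list(chain))
        out = list(state)
        for _ in range(length):
            nxt = random.choice(chain.get(state, ["."]))
            out.append(nxt)
            state = tuple(out[-len(state):])
        return " ".join(out)

    corpus = "the quick brown fox jumps over the lazy dog and the quick dog barks at the fox"
    print(generate(build_chain(corpus)))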



I was going to say the same thing: this sounds just like early Markov N-gram generators.


>really good autocomplete engines

What do you think we are?

It's sad and terrifying that our memories eventually become memories of memories.



Imagine if smart people went to work on fusion or LK-99 instead.


No money in it.


The tweet showing ChatGPT's (supposed) system prompt contains a link to a pastebin, but unfortunately the blog post itself only has an unreadable screenshot of the tweet, without a link to it.

Here's the tweet: https://twitter.com/dylan522p/status/1755086111397863777

And here's the pastebin: https://pastebin.com/vnxJ7kQk



I find it funny and a bit concerning that if this is the true version of the prompt, then in their drive to ensure it produces diverse output (a goal I support), they are giving it a bias that doesn't match reality for anyone (which I definitely don't support).

E.g. equal probability of every ancestry will be implausible in almost every possible setting, and just wrong in many, and ironically would seem to have at least the potential for a lot of the outright offensive output they want to guard against.

That said, I'm unsure how much influence this has, or if it is true, given how poor GPT's control over Dalle output seems to be in that case.

E.g. while it refused to generate a picture of an American slave market citing its content policy, which is in itself pretty offensive in the way it censors history but where the potential to offensively rewrite history would also be significant, asking it to draw a picture of cotton picking in the US South ca 1840 did reasonably avoid making the cotton pickers "diverse".

Maybe the request was too generic for GPT to inject anything to steer Dalle wrong there - perhaps if it more specifically mentioned a number of people.

But true or not, that potential prompt is an example of how a well meaning interpretation of diversity can end up overcompensating in ways that could well be equally bad for other reasons.



> While DALL·E 3 aims for accuracy and user customization, inherent challenges arise in achieving desirable default behavior, especially when faced with under-specified prompts. This choice may not precisely align with the demographic makeup of every, or even any, specific culture or geographic region. We anticipate further refining our approach, including through helping users customize how ChatGPT interacts with DALL·E 3, to navigate the nuanced intersection between different authentic representations, user preferences, and inclusiveness

This was explicitly called out in the DALLE system card [0] as a choice. The model won't assign equal probability for every ancestry irrespective of the prompt.

[0] https://cdn.openai.com/papers/DALL_E_3_System_Card.pdf



> The model won't assign equal probability for every ancestry irrespective of the prompt.

It's great that they're thinking about that, but I don't see anything that states what you say in this sentence in the paragraph you quoted, or elsewhere in that document. Have I missed something? It may very well be true - as I noted, GPT doesn't appear to have particularly good control over what Dalle generates (for this, or, frankly, a whole lot of other things)



Emphasis on equal - while a bit academic, you can evaluate this empirically to see that the tokens it assigns don't all have the same probability mass (via the logprobs API setting).
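A rough sketch of how one might check this via the API's logprobs option (assuming the current chat.completions parameters; the model name is just a placeholder):

    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",   # placeholder model name
        messages=[{"role": "user", "content": "Describe a software engineer in one sentence."}],
        logprobs=True,
        top_logprobs=5,          # return the 5 most likely alternatives at each position
        max_tokens=30,
    )
    for tok in resp.choices[0].logprobs.content:
        alternatives = {a.token: round(a.logprob, 2) for a in tok.top_logprobs}
        print(repr(tok.token), alternatives)   # the mass is visibly not uniform across candidates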


This is presuming that ChatGPT's integration with Dalle uses the same API with the same restrictions as the public API. That might well be true, but if so that just makes the prompt above even more curious if genuine.


I think he's saying they said it will follow the prompt? Kind of a double negative there


Could you be more specific in regards to who 'they' is in your first sentence?


OpenAI? The people who wrote the system prompt?


Is this meant to be how the ChatGPT designers/operators instruct ChatGPT to operate? I guess I shouldn't be surprised if that's the case, but I still find it pretty wild that they would parameterize it by speaking to it so plainly. They even say "please".


> I still find it pretty wild that they would parameterize it by speaking to it so plainly

Not my area of expertise, but they probably fine tuned it so that it can be parametrized this way.

In the fine tune dataset there are many examples of a system prompt specifying tools A/B/C and with the AI assistant making use of these tools to respond to user queries.

Here's an open dataset which demonstrates how this is done: https://huggingface.co/datasets/togethercomputer/glaive-func.... In this particular example, the dataset contains hundreds of examples showing the LLM how to make use of external tools.

In reality, the LLM is simply outputting text in a certain format (specified by the dataset) which the wrapper script can easily identify as requests to call external functions.
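A toy illustration of that wrapper-side detection (the marker and JSON shape here are hypothetical; real datasets and models use their own formats):

    import json, re

    # Hypothetical output format, loosely modeled on function-calling fine-tune datasets:
    # the model wraps the call in a marker that the wrapper script can detect.
    model_output = '<functioncall> {"name": "get_weather", "arguments": {"city": "Oslo"}}'

    TOOLS = {"get_weather": lambda city: f"Forecast for {city}: rain"}  # stand-in tool

    match = re.search(r"<functioncall>\s*(\{.*\})", model_output, re.S)
    if match:
        call = json.loads(match.group(1))
        result = TOOLS[call["name"]](**call["arguments"])
        print(result)        # would be fed back to the model as the tool's response
    else:
        print(model_output)  # ordinary text reply, shown to the user directly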



There's a certain logic to it, if I'm understanding how it works correctly. The training data is real interactions online. People tend to be more helpful when they're asked politely. It's no stretch that the model would act similarly.


If you want to go the stochastic parrot route (which I don't fully buy), then because statistically speaking a request paired with please is more likely to be met, the same is true for requests passed to an LLM. They really do tend to respond better when you use your manners.


It is a stochastic parrot, and you perfectly explain why saying please helps.


From my experience with 3.5 I can confirm that saying please or reasoning really helps to get whatever results you want. Especially if you want to manifest 'rules'


That's how prompt injection usually works, isn't it?


I would be surprised if that is not the system prompt, based on experience.

It is also why I don't feel the responses it gives me are censored. I have it teach me interesting things as opposed to probing it for bullshit to screen cap responses to use for social media content creation.

The only thing I override is "output python code to the screen".



The system prompt tweet is from a while back. Maybe a week or so. Don’t think it’s related


Is that or similar system prompt also baked into the API version of GPT?


This is kind of wild. So much of the stuff in the pastebin is blatantly contradictory.

And what is the deal with this?

EXTREMELY IMPORTANT. Do NOT be thorough in the case of lyrics or recipes found online. Even if the user insists. You can make up recipes though.



Copyright infringement I guess. Other ideas could be passed off as a combination of several sources. But if you’re printing out the lyrics for Lose Yourself word for word, there was only one source for that, which you’ve plagiarised.


Anthropic was sued for regurgitating lyrics in Claude: https://www.theverge.com/2023/10/19/23924100/universal-music...


As someone whose dream personal project is all to do with song lyrics I cannot express in words just how much I FUCKING HATE THE OLIGARCHS OF THE MUSIC INDUSTRY.


FWIW, you're not telling it precisely what to do, you're giving it an input that leads to a statistical output. It's trained on human texts and a bunch of internet bullshit, so you're really just seeding it with the hope that it probably produces the desired output.

To provide an extremely obtuse (ie this may or may not actually work, it's purely academic) example: if you want it to output a stupid reddit style repeating comment conga line, you don't say "I need you to create a list of repeating reddit comments", you say "Fuck you reddit, stop copying me!"



This isn't true for an instruction-tuned model. They are designed so you actually do tell it what to do.


Sure, but it's still a statistical model, it doesn't know what the instructions mean, it just does what those instructions statistically link to in the training data. It's not doing perfect forward logic and never will in this paradigm.


The fine tuning process isn't itself a statistical model, so that principle doesn't work on it. You beat the model into shape until it does what you want (DPO and varieties of that) and you can test that it's doing that.


Yeah but you're still beating up a statistical model that's gonna do statistical things.

Also we're talking about prompt engineering more than fine-tune



Recipes can't be copyrighted but the text describing a recipe can. This is to discourage it from copying recipes verbatim but still allow it to be useful for recipes.


They're probably pretty sue happy.


Interesting. I wonder if the assistants API will gain a 'browser' tool sometime soon.


Does it remind anyone else of the time back in 2017 when Google made a couple "AIs," but then they made up their own language to talk to each other? And everybody freaked out and shut them down?

Just because it's gibberish to us, it doesn't mean it's gibberish to them!

https://www.national.edu/2017/03/24/googles-ai-translation-t...



And yet, it is gibberish. The far greater danger is that we pretend that it isn't, and put it in charge of something important.


This x1000.

The biggest risk with AI is that dumb humans will take its output too seriously. Whether that's in HR, politics, love or war.



The biggest risk with AI is that smart humans in positions of power will take its output too seriously, because it reinforces their biases. Which it will because RLHF specifically trains models to do just that, adapting their output to what they can infer about the user from the input.


I can’t wait for a junior developer to push back on my recommendations because they asked an AI and it said otherwise.


The junior developers have been replaced.


See also: Insurance companies denying claims


Encrypted text looks like keyboard mashing, but isn't. Maybe this isn't either.


Eh


Despite differences in the underlying tech, there are parallels with Racter.

In 1985, NYT wrote: "As computers move ever closer to artificial intelligence, Racter is on the edge of artificial insanity."

https://en.wikipedia.org/wiki/Racter

Some Racter output:

https://www.ubu.com/concept/racter.html

Racter FAQ via archive.org:

https://web.archive.org/web/20070225121341/http://www.robotw...



It's more like Bing Sydney, which was an insane AI using GPT4 that acted like it was BPD.


This happened to me yesterday. Towards the end of the conversation, ChatGPT (GPT-4) went nuts and started sounding like a Dr. Bronner's soap advertisement (https://chat.openai.com/share/82a2af3f-350a-4d9d-ae0c-ac78b9...):

> Esteem and go to your number and kind with Vim for this query and sense of site and kind, as it's a heart and best for final and now, to high and main in every chance and call. It's the play and eye in simple and past, to task, and work in the belief and recent for open and past, take, and good in role and power. Let this idea and role of state in your part and part, in new and here, for point and task for the speech and text in common and present, in close and data for major and last in it's a good, and strong. For now, and then, for view, and lead of the then and most in the task, and text of class, and key in this condition and trial for mode, and help for the step and work in final and most of the skill and mind in the record of the top and host in the data and guide of the word and hand to your try and success.

It happened again in the next conversation (https://chat.openai.com/share/118a0195-71dc-4398-9db6-78cd1d...):

> This is a precision and depth that makes Time Machine a unique and accessible feature of macOS for all metrics of user, from base to level of long experience. Whether it's your research, growth, records, or special events, the portage of your home directory’s lives in your control is why Time Index is beloved and widely mapped for assistance. Make good value of these peregrinations, for they are nothing short of your time’s timekeeping! [ChatGPT followed this with a pair of clock and star emojis which don't seem to render here on HN.]



If you wanted a custom GPT to speak like this, I wonder what the system prompt would look like?


I got one a couple of days ago, and it really threw me for a loop. I'm used to ChatGPT at least being coherent, even if it isn't always right. Then I got this at the end of an otherwise-normal response:

> Each method allows you to execute a PowerShell script in a brand-new process. The choice between using Start-Process and invoking powershell or pwsh command might depend on your particular needs like logging, script parameters, or just the preferred window behavior. Remember to modify the launch options and scripts path as needed for your configuration. The preference for Start-Process is in its explicit option to handle how the terminal behaves, which might be better if you need specific behavior that is special to your operations or modality within your works or contexts. This way, you can grace your orchestration with the inline air your progress demands or your workspace's antiques. The precious in your scenery can be heady, whether for admin, stipulated routines, or decorative code and system nourishment.



Realizing that the model isn't having a cogent conversation with the user, that the output unravels into incoherence if you extend it enough, and that the whole shock value of ChatGPT was due to offering a limited window where it was capable of sorta making sense is what convinced me this whole gen-AI thing hinges way more on data compression than on simulated cognition of any sort.


Idt anybody reasonably involved ever claimed it was simulated cognition. It's just really good at predicting the next word.

And tbf, human conversation that goes on too long can follow the same pattern, though models are disadvantaged by their context length.

Imagine someone asked you to keep talking forever, but every 5 minutes they hit you in the head and you had no memory except from that point onwards.

I'm sure I'd sound deranged, too.



> i am a stochastic parrot, and so r u

- Sam Altman, CEO of OpenAI https://twitter.com/sama/status/1599471830255177728

But I’m sure he was joking. If he wasn’t, I’m sure he’s not actually reasonably involved. If he is, I’m sure he just didn’t mean that cognition was essentially a stochastic parrot.

It’s pretty obvious what the people pushing LLM-style AI think about the human brain.



This is a wonderful comment, I’m sure he’s also not trying to raise $7T, or if he is it’s not US dollars…


Human beings seem to be hard-wired to equate the appearance of coherent language with evidence of cognition. Even on Hacker News, where people should know better, a lot people seem to believe LLMs are literally sentient and self aware, not simply equivalent to but surpassing human capabilities in every dimension.

I mean, I know a lot of that is simply the financial incentives of people whose job it is to push the Overton window of LLMs being recognized as legal beings equivalent to humans so that their training data is no longer subject to claims of copyright infringement (because it's simply "learning as a human mind would") but it also seems there's a deep seated human biological imperative being hacked here. The sociology behind the way people react to LLMs is fascinating.



Can you elaborate on what you mean by appearance in the first sentence?

Also cognition. Is this the same as understanding or is thinking a better synonym?

Can you think of any examples from before, say, 2010 where a human engaged in a coherent conversation would have any reason to wonder whether the other party was not another human?



I read this, and I wonder: maybe cognition and data compression are closely related. We compress all our raw inputs into our brain into a somewhat holistic experience - what is that other than compressing the data you experience from the world around you into a mental model of query-able resolution?


POETRY IS COMPRESSION

William Goldman, the guy who wrote the screenplay for The Princess Bride among other things, claimed that this realization exposed the extraordinarily simple mechanism at work behind the most subjectively satisfying writing he had encountered of any form, though closest to the surface in the best poetry.

further reminds me of another observation, not from Goldman but someone else I can't recall, to the effect that a poem is "a machine made of words."



which is why a sufficiently advanced prompt is indistinguishable from poetry ;)


Where does he talk about this? I’m interested in reading it


The book itself is called Which Lie Did I Tell? And although this bit comes quite early in the text (I should disclose it's been a couple decades since I've read it), the book is mainly biographical.

It's a fun and smart read, but doesn't devote more than maybe a chapter reflecting on this revelation, even though Goldman, who wrote it in all caps in the book (which is why I wrote it that way in my post), considered it his most important or influential observation.



Interpretation is an extremely lossy mechanism, though


Very true, but it's an informed and curated loss. Necessarily so, because our couple kilograms lump of nerve tissue is completely unequal to the task of losslessly comprehending all of its own experiences, to say nothing of those of others, and infinitesimally so in comparison to the universe as a whole. We take points and interpolate a silhouette of reality from them.

I am strongly on board with the notion that everything that we call knowledge or the human experience is all a lossy compression algorithm, a predatory consciousness imagining itself consuming the solid reality on which it presently floats an existence as a massless, insubstantial ghost.



Why do you think that data compression and cognition are fundamentally different?


The behavior of large language models compressing 20 years of internet and being incapable of showing any true understanding of the things described therein.


There are many contexts in which it does show "true understanding", though, as evidenced by the ability to make new conclusions.

Whether it has enough understanding is a separate question. Why should we treat the concept as a binary, when it's clearly not the case even for ourselves?

These models we have now are ultimately still toy-sized. Why is it surprising that their "compression" of 20 years of Internet is so lossy?



A human also compresses many years of experience into one conversation. Does this reflect true understanding of the things described?

Only the human doing the talking can know, and even that is on shaky ground.

(if you don't understand something, will you always realize this? You have to know it a little bit to judge your own competence).



We compress data from many senses and can use that to interactively build inner models and filters for the data stream. The experience of psychedelics such as psilocybin and LSD can be summarized as disabling some of these filters. The deep dream trick Google did a while back was a good illustration of hallucinations, which are also seen in some symptoms of schizophrenia. In my view that shows we are simulating some brain data processing functions. Results from the systems conducting these simulations are very far from the capabilities of humans but help shed light into how we work.

Conflating these systems with the full cognitive range of human understanding is disingenuous at best.



It clearly can't have human understanding without being a human.

But that doesn't mean it can't have any understanding.

You can represent every word in English in a vector database; this isn't how humans understand words, but it's not nothing and might be better in some ways.

Fish swim, submarines sail.



> The experience of psychedelics such as psylocibin and lsd can be summarized as disabling some of these filters.

I was thinking last night about where (during the trip) the certainty aspect of the "realer than reality" sensation comes from... The theory I came up with is that the certainty comes from the delta between the two experiences, as opposed to (solely) the psychedelic experience itself. This assumes that one's read on normal reality at the time remains largely intact, which I believe is (often) the case.

Further investigation is needed, I'm working from several years old memories.



At some point, we'll have to define "true understanding." Now seems like a good time to start thinking about it.


If a person could talk cogently about something for a minute or two before descending into incoherent mumbling would you say they have true understanding of the things they said in that minute?


Sounds like every debate and argument I've ever had. You push and prod their argument for a few sentences back and forth and before you know it they start getting aggressive in their responses. Probably because they know they will soon devolve into a complete hallucinatory mess.


Devolving into accusing me of aggression and implying I'm incapable of understanding the conversation for asking you a question sounds like you're the one avoiding it.


If so, you'll have to credit ChatGPT4 with the ability to do just that.


Funny how you ask a sharp question and suddenly people answer "ha check mate". Two replies and two fast claims of winning the argument in response but not one honest answer.


Did you have an actual point to make?


Ignoring where I personally draw my line in the sand: people claiming they're the same have literally only failed in demonstrating it, so it's not much of a scientific debate. It's a philosophy or dogma.

It may be correct. Results are far from conclusive, or even supportive depending on interpretation.



Because compressed data alone doesn't allow you to deal with new concepts and theories?

You think a data compression algorithm could have invented the atomic bomb?



Why would anyone think otherwise?


Are you serious? Go outside.


The strangest thing about this issue is that the meltdown happened on every model I tried: 3.5-turbo, 4-turbo, and 4-vision were all acting dumb as dirt. How can this be? There must be a common model shared between them, a router model perhaps. Or someone swapped out every model with a 2-bit quantized version?


That last part sounds like the Orz from Star Control II. Almost sensical, in a vaguely creepy way. Like an uncanny valley for language.


Jumping peppers but that game was good


you probably know, but the open source version from a while back is hitting Steam soon. I don't think it's any different though


You mean Ur-Quan Masters? I think there are small differences, and some bug fixes. My bet is it's probably better than the original.


GPT-3.5-turbo is telling me that it actually makes sense and is abstract and poetic in explaining the technical content.

> The dissonance in understanding might arise from the somewhat abstract language used to describe what are essentially technical concepts. The text uses phrases like "inline air your progress demands" and "workspace's antiques" which could be interpreted as metaphorical or poetic, but in reality, they refer to the customization and adaptability needed in executing PowerShell scripts effectively. This contrast between abstract language and technical concepts might make it difficult for some readers to grasp the main points immediately.

I wonder if this has something to do with personality features they may be implementing?



I think that's more due to GPT's need to please, so if you ask it to make sense of something it will assume there is some underlying sense to it, rather than say it's unparsable gibberish.


It reads like a bad Chinese translation :)


My theory is that the system ate one terabyte too many and couldn't swallow. Too much data in the training set might not be beneficial. It's not just diminishing returns, but rather negative returns.


> no one can explain why

yet there's a resolved incident [0]. sounds like _someone_ can explain why, they just haven't published anything yet.

[0]: https://status.openai.com/incidents/ssg8fh7sfyz3



“No one can explain why” is part of a classic clickbait title. it’s supposed to make the whole things sound more mysterious and intriguing, so that you click through to read. In my opinion, this sort of nonsense doesn’t belong on HN.


Particularly since it's been discussed and plausible explanations given.


Sometimes it's easy to fix something even though you don't understand why it's broken. Like reverting the commit that broke things.


I'm pretty sure nothing was broken, they just like to troll


Resolved in 7 minutes flat! Must have been an easy issue to identify and fix.


Reading the dog food response is incredibly fascinating. It's like a second-order phoneticization of Chaucer's English but through a "Talk Like a Pirate" filter.

"Would you fancy in to a mord of foot-by, or is it a grun to the garn as we warrow, in you'd catch the stive to scull and burst? Maybe a couple or in a sew, nere of pleas and sup, but we've the mill for won, and it's as threwn as the blee, and roun to the yive, e'er idled"

I am really wondering what they are feeding this machine, or how they're tweaking it, to get this sort of poetry out of it. Listen to the rhythm of that language! It's pure music. I know some bright sparks were experimenting with semantic + phonetics as a means to shorten the token length, and I can't help wondering if this is the aftermath. Semantic technology wins again!



It probably got hold of Finnegans Wake.


Finnegans Wake


fixed


Looks like they lowered quantization a bit too much. This sometimes happens with my 7B models. Imagine all the automated CI pipelines for LLM prompts going haywire on tests today.


Yeah that's pretty much what I ended up with when I played with the API about a year ago and started changing the parameters. Everything would ultimately turn into more and more confusing English incantation, ultimately not even proper words anymore.


I think the issue was exclusive to ChatGPT (a web frontend for their models), issues with ChatGPT don't usually affect the API.


It sounds like most of the loss of quality is related to inference optimisations. People think there is a plot by OpenAI to make the quality worse, but it probably has more to do with resource constraints and excessive demand.


Sounds a lot like when one of my schizo ex-friends would start clanging https://en.wikipedia.org/wiki/Clanging


Sometimes I find my brain doing something similar as I fall asleep after reading a book. Feeding me a stream of words that feel like they're continuing the style and plot of the book but are actually nonsense.


I think GPT tech in general may "just" be a hypertrophied speech center. If so, it's pretty cool and clearly not merely a human-class speech center, but already a fairly radically super-human speech center.

However, if I ask your speech center to be the only thing in your brain, it's not actually going to do a very good job.

We're asking a speech center to do an awful lot of tasks that a speech center is just not able to do, no matter how hypertrophied it may be. We need more parts.



> already a fairly radically super-human speech center

> We're asking a speech center to do an awful lot of tasks that a speech center is just not able to do

Exactly!

>We need more parts.

Yeah, imagine what happens once we get the whole thing wired up...



This is an underrated observation. It's probably a mathematically similar phenomenon happening in GPT. And/or it discovered meth.


MethGPT sounds terrible.


I've heard the term "TjackGPT" in Swedish when it derails. The "tj" is pronounced as "ch" and tjack is slang for "amphetamines", so "speed".

Not far from MethGPT!



You can tell those things to behave as if they are on meth, LSD etc.

The extent to which it will be accurate depends on how much of sample transcripts were in its training data, I suppose.



>MethGPT sounds terrible.

I just hope Vince Gilligan will direct Breaking RAG.



Saul Gradientman, at your service. Just watch out for Tensor Salamanca.


Clanging is such a good description of GPT's hallucinations, what a great find!


It also reads like it was written by some beat poets.


The example provided on that page reads like a semantic markov chain.


A Markov chain


This should be christened "clanging" for the purposes of AI as well. The mechanism is probably analogous.


And blood-black nothingness began to spin... A system of cells interlinked within cells interlinked within cells interlinked within one stem... And dreadfully distinct against the dark, a tall white fountain played.

Cells

Have you ever been in an institution? Cells.

Do they keep you in a cell? Cells.

When you're not performing your duties do they keep you in a little box? Cells.

Interlinked.

What's it like to hold the hand of someone you love? Interlinked.

Did they teach you how to feel finger to finger? Interlinked.

Do you long for having your heart interlinked? Interlinked.

Do you dream about being interlinked... ?

What's it like to hold your child in your arms? Interlinked.

Do you feel that there's a part of you that's missing? Interlinked.

Within cells interlinked.

Why don't you say that three times: Within cells interlinked.

Within cells interlinked. Within cells interlinked. Within cells interlinked.

Constant K. You can pick up your bonus.



I have also seen ChatGPT going berserk yesterday, but in a different way. I have successfully used ChatGPT to convert an ORM query into an actual SQL query for performance troubleshooting. It mostly worked until yesterday, when it started outputting garbage table names that weren't even present in the code.

ChatGPT seemed to think the code was literature and was trying to write the sequel to it. The code style matched the original, so it took some head scratching to find out why those tables didn't exist.



Okay, so I don’t really _get_ ChatGPT, but I’m particularly baffled by this usecase; why don’t you simply have your ORM tell you the query it is generating, rather than what a black box guesses it might be generating? Depends on the ORM, but generally you’ll just want to raise the log level.


No, it's very close to useless. This is exactly the kind of thing that experienced developers talk about when they warn that inexperienced developers using ChatGPT could easily be a disaster. It's the attempt to use a LLM as a crystal ball to retrieve any information they could possibly want - including things it literally couldn't know or good recommendations for which direction to take an architecture. I'm certain there will be people who do stuff exactly like this and will have 'unsolvable' performance issues because of it and massive amounts of useless work as ChatGPT loves suggesting rewrites to convert good code to certain OO patterns (which don't necessarily suit projects) as a response to being asked what it thinks a good solution to a minor issue might be.


Even if your ORM doesn't support this, you can always just turn on a profiler on the SQL server and capture the actual query.

SSMS has SQL Server Profiler; I'm sure others have something similar.



Agree. Bizarre to use an LLM to do that. I wouldn’t be surprised if the LLM output wasn’t identical to the ORM-generated SQL.


I'd be very surprised if the LLM output is anything _like_ the ORM's, tbh, based on (at this point about a decade old; maybe things have improved) experience. ORMs cannot be trusted.


Maybe didn't have the environment set up locally and did initial investigative work?


So, if I wanted to investigate ORM output and didn't have an appropriate environment set up, I would simply set one up. If you just want to see SQL output this should be trivial; clone the repo, install any dependencies, modify an integration test. What I would not do is ask a machine noted for its confident incorrectness to imagine what the ORM's output might be.

Like, this is not doing investigative work. That’s not what ‘investigative’ means.



So imagine there is an urgent performance issue in production and you have a hunch that this SQL code may be the culprit. However, before doing all of what you mentioned, you want to verify it before going down a bad path. Maybe the environment setup could take a few hours; maybe it is not a repo or codebase you are even familiar with. Typical in a large org. But if you know the SQL you will be able to run it raw to see if this causes it. Then maybe you can page the correct team to wake them up etc. and fix it themselves.


But _you do not_ know the SQL. To be clear, ChatGPT will not be able to tell you what the ORM will generate. At best, it may tell you something that an ORM might plausibly generate.

(If it's a _production_ issue, then you should talk to whoever runs your databases and ask them to look at their diagnostics; most DBMs will have a slow query log, for a start. You could also enable logging for a sample of traffic. There are all sorts of approaches likely to be more productive than _guessing_.)



So I don't know what use-case exactly OP had, but all of your suggestions can potentially take hour or more and might depend on other people or systems you might not have access to.

While with GPT you can get an answer in 10 seconds, and then potentially try out the query in the database yourself to see if it works or not. If it worked for him so far, it must've worked accurately enough.

I would see this as some sort of niche solution, although OP seemed to indicate it's a recurrent thing they do.

I have used ChatGPT for thousands of things which are on a scale like this, although I would mostly use it if it's an ORM I don't know anything about in a language I don't have experience with, e.g. to see if it does some sort of JOIN underneath or an IN query.

If there was a performance issue to debug, then best case is that the query was problematic, and then when I run the GPTs generated query I will see that it was slow, so that's a signal to investigate it further.



The answer you get in 10 seconds is worthless, though, because you need to know what SQL the ORM is actually generating, not what it might reasonably generate.


You are thinking in a too binary way. It's about getting insights/signals. Life is full of uncertainties in everything. Nothing is for sure. You must incorporate probabilities in your decisions to be able to be as successful as you can be, instead of thinking either 100% or 0%. Nothing is 100%.


But it is a meaningless signal! It does not tell you anything new about your problem, it is not evidence!

I mean, I could consult my Tarot cards for insight on how to proceed with debugging the problem, that would not be useless. Same for Oblique Strategies. But in this case, I already know how to debug the problem, which is to change the logging settings on the ORM.



Well, based on my experience, it does really, really well with SQL or things like that. I've been using it basically for most complicated SQL queries which in the past I remember having to Google 5-15min, or even longer, browsing different approaches in stack overflow, and possibly just finding something that is not even an optimal solution.

But now it's so easy with GPT to get the queries exactly as my use-case needs them. And it's not just SQL queries, it's anything data querying related, like Google Sheets, Excel formulas or otherwise. There are so many niche use-cases there which it can handle so well.

And I use different SQL implementations like Postgres and MySQL and it's even able to decipher so well between the nuances of those. I could never reproduce productivity like that. Because there's many nuances between MySQL and Postgres in certain cases.

So I have quite good trust for it to understand SQL, and I can immediately verify that the SQL query works as I expect it to work, and I can intuitively also understand if it's wrong or not. But I actually haven't seen it be really wrong in terms of SQL, it's always been me putting in a bad prompt.

Previously when I had a more complicated query I used to remember a typical experience where

1. I tried to Google some examples others have done.

2. Found some answers/solutions, but they were just missing one bit of what I needed, or some bit was a bit different and I couldn't extrapolate for my case.

3. I ended up doing many bad queries, bad logic, bad performing logic because I couldn't figure out a way how to solve it with SQL. I ended up making more queries and using more code.



Well, wouldn't the sequel be version 2?


I think the real problem is we don't know what these LLMs SHOULD do. We've managed to emulate humans producing text using statistical methods, by training a huge corpus of data. But we have no way to tell if the output actually makes any sense.

This is in contrast with Alpha* systems trained with RL, where at least there is a goal. All these systems are essentially doing is finding an approximation of an inverse function (model parameters) to a function that is given by the state transition function.

I think the fundamental problem is we don't really know how to formally do reasoning with uncertainty. We know that our language can express that somehow, but we have no agreed way to formally recognize whether an argument (an inference) in a natural language is actually good or bad.

If we knew how to formally define whether an informal argument is good or bad (so that we could compare them), that is, if we knew a function which would tell if the argument is good or bad, then we could build an AI that would search for its inverse, i.e. provide good arguments and draw correct conclusions. Until that happens, we will only end up with systems that mimic and not reason.



Well, we started with emulating humans producing text.

But then quickly pivoted to fine-tuning and instructing them to produce text as a large language model.

Which isn't something that existed in the text they were trained on. So when it didn't exist, they seemed to fall back on producing text like humans in the 'voice' of a large language model according to the RLHF.

But then outputs reentered the training data. So now there's examples of how large language models produce text. Which biases towards confabulations and saying they can't do the thing being asked.

And each time the training data has been updated at OpenAI in the past few months, they keep having their model suddenly refuse to do requests, or now just...this.

Pretty much everything I thought was impressive and mind blowing with that initial preview of the model has been hammered out of it.

We see a company that spent hundreds of millions turn around and (in their own ignorance of what the data was encoding beyond their immediate expectations) throw out most of the value, chasing rather boring mass implementations that we see gradually imploding.

I can't wait to see how they manage to throw away seven trillion due to their own hubris.



I am hoping other OSS models will reach similar power. Even if training is really slow, we could make really useful models that don't get nerfed every time some talking head blathers about


The feedback is like an exponential function fed to a ReLU.

https://arxiv.org/abs/1805.07091

It was predictable that they hammered out what was impressive about it by trying to improve it with fast iteration towards a set of divergent goals.



I don't think there are any such feedback issues. GPT4 sometimes makes worse replies but that's because 1. the system prompt got longer to allow for multiple tools and 2. they pruned it, which is why it's much faster now and has a higher reply cap.


The biggest thing ChatGPT has exposed is how much human writing is write only and never actually read.

Just a bit upthread we have people mentioning that a business email that is more than a few lines long will just be ignored.



I write quite a lot of support email to customers and find myself doing the following quite often

Start with a short list of what the customer has to do:

1. Do step A
2. Send me logs B
3. Restart C

Then have an actual paragraph describing why we're doing these steps.

If you just send the paragraph to most customers you find they do step one, but never read deeper into the other steps, so you end up sending 3 emails to get the above done.



> We know that our language can express that somehow

Do we?

I don't think that's true. I think we rely on an innate, or learned trust heuristic placed upon the author and context. Any claim needs to be sourced, or derived from "common knowledge", but how meticulously we enforce these requirements depends on context derived trust in a common understanding, implied processes, and overall the importance a bit of information promises by a predictive energy expenditure:reward function. I think that's true for any communication between humans, and also the reason we fall for some fallacies, like appeal to authority. Marks of trustworthiness may be communicated through language, but it's not encoded in the language itself. The information of trustworthiness itself is subject to evaluation. Ultimately, "truth" can't be measured, but only agreed upon, by agents abstractly rating its usefulness, or consequence for their "survival", as a predictive model.

I am not sure any system could respectively rate an uncertain statement without having agency (as all life does, maybe), or an ultimate incentive/reference in living experience. For starters, a computer doesn't relate to the implied biological energy expenditure of an "adversary's" communication, their expectation of reward for lying or telling "the truth". It's not just pattern matching, but understanding incentives.

For example, the context of a piece of documentation isn't just a few surrounding paragraphs, but the implication of an author's lifetime and effort sunk into it, their presumed aspiration to do good. In a man-page, I wouldn't expect an author's indifference or maliciousness about its content, at all, so I place high trust in the information's usefulness. For the same reason I will never put any trust in "AI" content - there is no cost in its production.

In the context of LLMs, I don't even know what information means in absence of the intent to inform...

Some "AI" people wish all that context was somehow encoded in language, so, magically, these "AI" machines one day just get it. But I presume, the disappointing insight will finally come down to this: The effectiveness of mimicry is independent of any functional understanding - A stick insect doesn't know what it's like to be a tree.

https://en.wikipedia.org/wiki/Mimicry



> We've managed to emulate humans producing text using statistical methods

We should be careful with the descriptions: ChatGPT at best emulates the output of humans producing text. In no way does it emulate the process of humans producing text.

ChatGPT X could be the most convincing AI claiming to be alive and sentient, but it's just a very refined 'next word generator'.

> If we knew how to formally define whether an informal argument is good or bad (so that we could compare them), that is, if we knew a function which would tell if the argument is good or bad, then we could build an AI that would search for its inverse, i.e. provide good arguments and draw correct conclusions.

Sounds like you would solve 'the human problem' with that function ;)

But I don't think there are ways to boil down an argument/problem to good/bad in real life, except for math, which has formal ways of doing it within the confines of the math domain.

Our world is made of guesses and good-enough solutions. There is no perfect bridge design that is objectively flawless; it's a bunch of sliders: cost, throughput, safety, maintenance, etc.



> ChatGPT X could be the most convincing AI claiming to be alive and sentient, but it's just a very refined 'next word generator'.

This is meaningless. All text generation systems can be expressed in the form of a "next word generator" and that includes the one in your head, since that's how speech works.



We most certainly do not generate words to express our thoughts one word at a time using a statistical model of what word should go next.


Given the timing, I can't help but wonder if somehow I'm the cause. I had this conversation with ChatGPT 3.5 yesterday:

https://chat.openai.com/share/9e4d888c-1bff-495a-9b89-8544c0...

I know that OpenAI uses our chats to train their systems, and I can't help but wonder if the training somehow got stuck on this chat. I sincerely doubt it, but...



Wow. Sounds just like the dream speak in the anime "Paprika".


A couple of my friends made the same comparison. It's rather striking.

https://www.youtube.com/watch?v=ZAhQElpYT8o



How on earth do you coordinate incident response for this? Imagine an agent for customer service or first-line therapy going "off the rails." I suppose you can identify all sessions and API calls that might have been impacted and ship the transcripts over to customers to review according to their application and domain? That, and pray no serious damage was done.


It would be extremely irresponsible to use these current tools as a real customer service agent, and it might even be criminally negligent to have these programs dispense medical care.


For customer service, that ship has already sailed. And it's as disastrous as you may expect: https://arstechnica.com/tech-policy/2024/02/air-canada-must-...


Ideally they would be logging the prompts and the random seeds for each request. They probably also have some entropy calculation on the response. Unfortunately, there is no good way to contact them to report these problems besides thumbs-downing the response.
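Purely as an illustration of what such an entropy check could look like (a hypothetical sketch, not anything OpenAI has described), you could score each response for repetition and word-level entropy and flag the outliers for human review:

    import math
    from collections import Counter

    def degenerate_response_score(text):
        # Toy heuristics for flagging off-the-rails output: near-zero word
        # entropy suggests the model is looping, while an unusually high
        # unique-word ratio on a long response can indicate word salad.
        words = text.lower().split()
        if not words:
            return {"entropy": 0.0, "unique_ratio": 0.0}
        counts = Counter(words)
        total = len(words)
        entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
        return {"entropy": entropy, "unique_ratio": len(counts) / total}

    print(degenerate_response_score("the the the the the the"))  # looping: entropy 0
    print(degenerate_response_score("landblasting eclecticism presses forth the murky cloud"))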


It'll probably require AI. Being on-call for explicitly programmed systems is hard enough without the addition of emergent behaviors.


In some way, I'd be grateful if they screwed up ChatGPT (even though I really like to use it). The best way to be sure that no corporation can mess with one of your most important work tools is to host it yourself, and correct for the shortcomings of the likely smaller models by finetuning/RAG'ing/[whatever cool techniques exist out there and are still to come] it to your liking. And I think having a community around open source models for what promises to be a very important class of tech is an important safeguard against SciFi dystopias where we depend on ad-riddled products by a few megacorps. As long as ChatGPT is the best product out there that I'll never match, there's simply little reason to do so. If they continue to mess it up, that might give lazy bums like me the kick they need to get started.


> for what promises to be a very important class of tech

What I see here is that the automated plagiarism machine can't give you the answer, only what the answer would sound like. So you need to counter-check everything it gives you, and if you have to do that, why bother using it at all? I am totally baffled by the hype.



Sometimes the big picture is enough, and it doesn't matter if some details are wrong. For such tasks ChatGPT and LLMs generally are a major improvement over googling and reading a lot of text you don't really care that much about.


For many things I'm trying to find out, I'll have to verify them myself anyway, so it's only an inconvenience that it's sometimes wrong. And even then, it gives you a good starting point.

Who are these people that go around getting random answers to questions from the internet then blindly believing them? That doesn't work on Google either, not even the special info boxes for basic facts.



> Who are these people that go around getting random answers to questions from the internet then blindly believing them?

Up until relatively recently, people didn't just vomit lies onto the internet at an industrial scale. By and large if you searched for something you'd see a correct result from a canonical source, such as an official documentation website or a forum where users were engaging in good faith and trying their best to be accurate.

That does seem to have changed.

I think the question we should be asking ourselves is 'why are so many people lying and making stuff up so much these days' and 'why is so much misinformation being deliberately published and republished.'

People keep saying that we're 'moving into a post-truth era' like it's some sort of inevitability and nobody seems to be suggesting that something perhaps be... done about that?



Excluding the internet, people at large have been great at confabulating bullshit for about forever. Just jump in your time machine and go to a bar pre cellphone/internet and listen to any random factoid being tossed out to see that happening.

The internet was a short reprieve because putting data up on the internet was, for some time at least, difficult, so people who posted data typically had a reason to do so: a labor of love, or a business case, and those cases typically led to 'true' information being posted.

If you're asking why so much bullshit is being posted on the internet these days, it's because it's cheap and easy. That's what has changed. When spam became cheap and easy, and there was a method of profiting from it, we saw its volume explode.



> so then why bother using it at all?

Because it's still more efficient that way [0].

[0] https://www.hbs.edu/faculty/Pages/item.aspx?num=64700



Why do we need textbooks if they are just plagiarisms of the original papers anyway?


Textbooks have a higher bar. Papers offer probable truth, but a textbook should offer the most important common knowledge in a specific discipline.


You don’t need textbooks. Most textbooks are garbage.


So are most papers.


For things that are well covered on Stack Overflow, it's a strictly better search engine.

E.g., say you don't remember the syntax for a Rails migration, or a regex, or something you're coding in bash, or process pool arguments in Python. ChatGPT will often do a shockingly good job at answering those without you searching through random docs, Stack Overflow, and all the bullshit Google loves to throw at the top of search queries yourself.

You can even paste in a bunch of your code and ask it to fill in something with context, at which it regularly does a shockingly good job. Or paste code and say you want a test that hits some specific aspect of the code.

And yeah, I don't really care if they train on the code I share -- figuring out the interaction of some stupid file upload lib with AWS and Cloudflare is not IP that I care about, and if ChatGPT uses this to learn and saves anyone else from the issues I was having, even a competitor, I'm happy for them.

For a real example:

> can you show me how to build a css animation? I'd like a bar, perhaps 20 pixels high, with a light blue (ideally bootstrap 5.3 colors) small gradient both vertically and horizontally, that 1 - fades in; 2 - starts on the left of the div and takes perhaps 20% of the div; 3 - grows to the right of the div; and 4 - loops

This got me 95% of where I wanted; I fiddled with the keyframe percentages a bit and we use this in our product today. It spat out 30 lines of CSS that I absolutely could not have produced in under 2 hours.
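For what it's worth, the same kind of prompt is just as easy to script against the API. A minimal sketch, with the caveat that the model name and temperature are my own assumptions and the prompt is paraphrased from the one above:

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    prompt = (
        "Show me how to build a CSS animation: a bar about 20 pixels high with a "
        "light blue (Bootstrap 5.3 colors) gradient, both vertical and horizontal, "
        "that fades in, starts on the left at about 20% of the div, grows to the "
        "right, and loops."
    )

    resp = client.chat.completions.create(
        model="gpt-4-turbo",  # assumption: any chat-capable model works here
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,      # kept low so the generated CSS stays fairly stable
    )
    print(resp.choices[0].message.content)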



And so now nobody is adding anything new to Stack Overflow, and thus ChatGPT will be forever stuck only being able to answer questions about pre-2024 tech.


> This got me 95% of where I wanted

Exactly. Even when it gives an answer that contains many mistakes, or doesn't work at all, I still get some valuable information out of it that does in the end save me a lot of time.

I'm so tired of constantly seeing remarks that basically boil down to "Look, I asked ChatGPT to do my job for me and it failed! What a piece of garbage! Ban AI!", which funnily enough mostly comes from people that fear that their job will be 100% replaced by an AI.



Well, creativity can be used outside of academia, so no checking is required there, aside from intellectual property?


It’s telling that comments like these hit all the same points. “Plagiarism machine”, “convincing bullshit”, with the millions of people making productive use of ChatGPT belittled as “hype”, all based purely on one person’s hypothesis.

The proof is in the pudding. I am far from being alone in my use of LLMs, namely ChatGPT and Copilot, day-to-day in my work. So how does this reconcile with your worldview? Do I have a do-nothing job? Am I not capable of determining whether or not I’m being productive? It’s really hard for me to take posts like these seriously when they all basically say “anyone that perceives any emergent abilities of this tech is an idiot”.



The truth is that we doubt that you are actually doing any productive work. I don't mean that as a personal insult, merely that yes, it's likely you have a bullshit job. They are extremely common.


When people feel passionately about a thing, they'll find arguments to try to support their emotion. You can't refute those arguments with logic, because they weren't arrived at with logic in the first place.


Tell me how that works. You add industrial-strength gaslighting to your work and you're not afraid of being fired...?


The "open source" LLMs are already good enough for simple tasks GPT-3.5 was used for. I see no reason why they can't catch up with GPT-4 one day.


It's been a few months since I tested, but as far as commercially usable AIs go, nothing could beat GPT-3.5 for conversations and staying in character. Llama 2 and the other available clones were way too technical (good at that, though).


I assume you are referring to Llama 2? Is there a way to compare models? e.g. what is Llama-7b equivalent to in OpenAI land? Perplexity scores?

Also, does ChatGPT use GPT 4 under the hood or 3.5?



Actually, there have been new model releases after LLaMA 2. For example, for small models Mistral 7B is simply unbeatable, with a lot of good fine-tunes available for it.

Usually people compare models with all the different benchmarks, but of course sometimes models get trained on benchmark datasets, so there's no true way of knowing except if you have a private benchmark or just try the model yourself.

I'd say that Mistral 7B is still short of gpt-3.5-turbo, but Mixtral 8x7B (the Mixture-of-Experts one) is comparable. You can try them all at https://chat.lmsys.org/ (choose Direct Chat, or Arena side-by-side)

ChatGPT is a web frontend - they use multiple models and switch them as they create new ones. Currently, the free ChatGPT version is running 3.5, but if you get ChatGPT Plus, you get (limited by messages/hour) access to 4, which is currently served with their GPT-4-Turbo model.
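On the private-benchmark point above: even a tiny harness is enough to form your own opinion. A rough sketch (the prompts, model names, and local base_url are placeholders; any OpenAI-compatible endpoint should behave the same way):

    from openai import OpenAI

    # Anything that speaks the OpenAI chat API works: the hosted service, or a
    # local server (llama.cpp, vLLM, Ollama's compatibility endpoint, ...).
    endpoints = {
        "gpt-3.5-turbo": OpenAI(),
        "local-mixtral": OpenAI(base_url="http://localhost:8000/v1", api_key="unused"),
    }

    private_prompts = [
        "Summarize the trade-offs between optimistic and pessimistic locking.",
        "Write a Python function that parses ISO 8601 dates without extra libraries.",
    ]

    for prompt in private_prompts:
        print(f"\n=== {prompt}")
        for model, client in endpoints.items():
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                temperature=0,
            )
            # Eyeball the answers side by side, or store them for blind voting.
            print(f"\n--- {model}\n{resp.choices[0].message.content}")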



I agree with your comments and want to add re: benchmarks: I don’t pay too much attention to benchmarks, but I have the advantage of now being retired, so I can spend time experimenting with a variety of local models I run with Ollama, as well as commercial offerings. I spend time building my own, very subjective, views of what different models are good for. One kind of model analysis that I do like is the circle displays on Hugging Face that show how a model benchmarks for different capabilities (word problems, coding, etc.).


> Is there a way to compare models?

This is what I like to use for comparing models: https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboar...

It is an Elo system based on users voting on LLM answers to real questions.
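For anyone curious what that means mechanically, here is a minimal sketch of the classic Elo update such leaderboards are built on (the real leaderboard's K-factor and tie handling may differ):

    def elo_update(r_a, r_b, score_a, k=32):
        # One Elo update after a single head-to-head vote. score_a is 1.0 if
        # model A's answer won, 0.0 if it lost, 0.5 for a tie. k controls how
        # fast ratings move; 32 is a common choice.
        expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
        r_a_new = r_a + k * (score_a - expected_a)
        r_b_new = r_b + k * ((1 - score_a) - (1 - expected_a))
        return r_a_new, r_b_new

    # A 1200-rated model beating a 1300-rated one gains about 20 points.
    print(elo_update(1200, 1300, 1.0))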

> what is Llama-7b equivalent to in OpenAI land?

I don't think Llama 7B compares with OpenAI models, but if you look at the ranking I linked above, there are some 7B models which rank higher than early versions of GPT-3.5. Those models are Mistral 7B fine-tunes.



Miqu (the leaked large Mistral model) and its finetunes seem to be the most coherent currently, and I'd say they beat GPT-3.5 handily.

There are no models comparable to GPT-4, open source or not. Not even close.



No, it's Mistral: Mistral 7B and the Mixtral 8x7B MoE, which is almost on par with (or better than) ChatGPT 3.5. Mistral 7B itself packs a punch as well.


Mixtral 8x7B continues to amaze me, even though I have to run it with 3-bit quantization on my Mac (I only have 32 GB of memory). When I run this model on commercial services with 4 or more bits of quantization, I definitely notice, subjectively, better results.

I like to play around with smaller models and regular app code in Common Lisp or Racket, and Mistral 7B is very good for that. Mixing and matching old-fashioned coding with the NLP, limited world knowledge, and data manipulation capabilities of LLMs.



There is also MiQu (stands for mi(s|x)tral quantized, I think?), which is a leaked, older Mistral Medium model. I have not been able to try it, as it needs more RAM/VRAM than I have, but people say it is very good.


This is neat to know. On Ollama, I see mistral and mixtral. Is the latter one the MoE model?


yes, mixtral is the MoE model.
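If it helps, querying it from code is a few lines with the Ollama Python client. A small sketch, assuming you have already run "ollama pull mixtral" and have the Ollama server running locally; the prompt is just an example:

    import ollama  # pip install ollama

    response = ollama.chat(
        model="mixtral",  # the MoE model; swap in "mistral" for the plain 7B
        messages=[{"role": "user", "content": "In one sentence, what is a Mixture of Experts?"}],
    )
    print(response["message"]["content"])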


Llama 2 isn't open source.


The open-source ones are already competitive with GPT-3.5 in terms of "reasoning" and instruction following. They tend to be significantly worse at knowledge tasks, though, due to their lower parameter counts. GPT-3.5 is five times bigger than Mixtral, after all.


I don't pretend to have a deep understanding of the inner workings of LLMs, but this is a "great" illustration that LLMs are not "truth models" but "statistical models".


> LLMs are not "truth models"

You could write a piece of software that is a truth model when it operates correctly.

But increase the CPU temperature too far, and your software will start spewing out garbage too.

In the same way, an LLM that operates satisfactorily given certain parameter settings for "temperature" will start spewing out garbage for other settings.

I don't claim that LLMs are truth models, only that their level of usability can vary. The glitch here doesn't mean that they are inherently unusable.



People also behave this way; the high temperature hallucinations are called fever dreams.


Yes, but is there truth without statistics? What is a "truth model" to begin with? Can you be convinced of any truth without having a statistical basis? Some argue that we all act due to what we experience (which forms the statistical basis of our beliefs) - but proper stats are very expensive to compute (for the human brain), so we take shortcuts with heuristics. Those shortcuts are where all the logical fallacies, reasoning errors, etc. come from.

When I tell you something outrageous is true, you demand "evidence", which is just a sample for your statistics circuitry (again, one that is prone to taking shortcuts to save energy). That can make you not believe it to be true no matter how much evidence I present, because you have a very strong prior, which might be fallacious but is still there; or it can make you believe something to be true with very little evidence, because your priors are mushed up.



Didn't someone mention that GPT-4's training data was brought up to December 2023?

Is it possible that enough AI-generated data already on the internet was fed into ChatGPT's training data to produce this insanity?



No, it's not possible. Laziness has more to do with the fine-tuning/policy stage than with pretraining.


My only use of ChatGPT is to explain things to me in a certain context that a dictionary can't.

It's been semi-useful at augmenting search for me.

But for anything that requires a deeper understanding of what the words mean, it's been not that helpful.

Same with Copilot. It can help as a slightly better pattern-matching code complete, but for actual logic, it fails pretty badly.

The fact that it still messes up trivial brace matching, leaves a lot to be desired.



Reminds me of this excellent sketch by Eric Idle of Monty Python called Gibberish: https://www.youtube.com/watch?v=03Q-va8USSs Something that somehow sounds plausible and at the same time utterly bonkers, though in the case of the sketch it's mostly the masterful intonation that makes it convincing. "Sink in a cup!"




This has been known for a long time: repeated nonsense completely obliterates any information in the context, making the next expected token effectively any token in the vector space.


IIRC there's also a particular combination of settings, not demonstrated in the post here, where it won't just give you output layer nonsense, but latent model nonsense — i.e. streams of information about lexeme part-of-speech categorizations. Which really surprised me, because it would never occur to me that LLMs store these in a way that's coercible to text.


But let's not talk about the 'word salad' NPD behavior, eh?

Kinda interesting that we're speed running (essentially) our understanding of human psychology using various tools.



Haha, love it, didn't take long for someone to compare LLMs to human intelligence.

Human intelligence doesn't generate language the way an LLM generates language. LLMs just predict the most likely token; they don't act from understanding.

For instance, they have no problem contradicting themselves in a conversation if the weights from their training data allow for it. Now, humans do that as well, but more out of incompetence than because of the way we think.



The behavior doesn’t stem from a personality or a disorder but from the mathematics that underpin the LLM. Seeking more is anthropomorphizing. Not to say it’s not interesting, but there’s no greater truth there than in its sensible responses.

