| They actually have [0]. They were revealed to have had access to (the majority of) the FrontierMath problem set while everybody thought the problem set was confidential, and they published benchmarks for their o3 models on the presumption that they didn't. One is free to trust their "verbal agreement" that they did not train their models on it, but access they did have, and it was not revealed until much later.
[0] https://the-decoder.com/openai-quietly-funded-independent-ma... |
| Maybe they just gave the LLM the keys to the city and it is steering the ship? And the LLM is like I can't lie to these people but I need their money to get smarter. Sorry for mixing my metaphors. |
| It's in the link.
I don't know what else to say. Here, imgur: https://imgur.com/a/mkDxe78. Can't get easier.
> equally unfounded attacks
No, because I have a source and didn't make up things someone else said.
> a straight up "you are lying".
Right, because they are. There are hallucination stats right in the post he mocks for not providing stats.
> That's nice but considering the price increase,
I can't believe how quickly you acknowledge it is in the post after calling the idea it was in the post "equally unfounded". You are looking at the stats. They were lying.
> "That's nice but considering the price increase,"
That's nice and a good argument! But that's not what I replied to. I replied to "they didn't provide any stats". |
| - Parent is still the top comment.
- 2 hours in, -3. 2 replies:
  - [It's because] you're hysterical
  - [It's because you sound] like a crypto bro
  - [It's because] you make an equally unfounded claim
  - [It's because] you didn't provide any proof (Ed.: It is right in the link! I gave the #s! I can't ctrl-F... What else can I do here... AFAIK can't link images... whatever, here's imgur. https://imgur.com/a/mkDxe78)
  - [It's because] you sound personally offended (Ed.: Is "personally" a shibboleth here, meaning expressing disappointment in people making things up is so triggering as to invalidate the communication, such that it is made up?) |
| It _is_ better in the general case on most benchmarks. There are also very likely specific use cases for which it is worse and very likely that OpenAI doesn't know what all of those are yet. |
| > the quiet cries of the one to two experienced engineers on the team arguing sprint after sprint that "this doesn't make sense!"
“I have five years of Cassandra experience—and I don’t mean the db” |
| > First of all, it's going to take us 10 years to figure out how to use LLM's to their full productive potential.
Then another 30 to finally stop using them in dumb and insecure ways. :p |
| what crazy progress? how much do you spend on tokens every month to witness the crazy progress that I'm not seeing? I feel like I'm taking crazy pills. The progress is linear at best |
| I use LLMs every day to proofread and edit my emails. They’re incredible at it, as good as anyone I’ve ever met. Tasks that involve language and not facts tend to be done well by LLMs. |
| > I use LLMs every day to proofread and edit my emails.
This right here. I used to spend tons of time making sure my emails were perfect. Is it professional enough, am I being too terse, etc… |
| The TRS-80, Apple ][, and PET all came out in 1977, VisiCalc was released in 1979.
Usenet, BITNET, IRC, and BBSs all predated the commercial internet, and all were forms of online social networks. |
| You have it all wrong. The end game is a scalable, reliable AI work force capable of finishing Star Citizen.
At least this is the benchmark for super-human general intelligence that I propose. |
| To be fair, SC is trying to do things that no one else has done in the context of a single game. I applaud their dedication, but I won't be buying JPGs of a ship for 2k. |
| Gonna go meta here for a bit, but I believe we're going to get a fully working, stable SC before we get fusion. "We" as in humanity; you and I might not be around when it's finally done. |
| I believe it's a "translation" in the sense of Wittgenstein's goal of philosophy:
>My aim is: to teach you to pass from a piece of disguised nonsense to something that is patent nonsense. |
| The price really is eye watering. At a glance, my first impression is this is something like Llama 3.1 405B, where the primary value may be realized in generating high quality synthetic data for training rather than direct use.
I keep a little google spreadsheet with some charts to help visualize the landscape at a glance in terms of capability/price/throughput, bringing in the various index scores as they become available. Hope folks find it useful, feel free to copy and claim as your own. https://docs.google.com/spreadsheets/d/1foc98Jtbi0-GUsNySddv... |
| Not just for training data, but for eval data. If you can spend a few grand on really good labels for benchmarking your attempts at making something feasible work, that’s also super handy. |
| Sam Altman's explanation for the restriction is a bit fluffier: https://x.com/sama/status/1895203654103351462
> bad news: it is a giant, expensive model. we really wanted to launch it to plus and pro at the same time, but we've been growing a lot and are out of GPUs. we will add tens of thousands of GPUs next week and roll it out to the plus tier then. (hundreds of thousands coming soon, and i'm pretty sure y'all will use every one we can rack up.) |
| Yep, that's the thing: they are not ahead anymore, not since last summer at least. Yes, they probably have the largest customer base, but their models haven't been the best for a while now. |
| I suppose this was their final hurrah after two failed attempts at training GPT-5 with the traditional pre-training paradigm. Just confirms reasoning models are the only way forward. |
| Except minus 4.5, because at these prices and results there's essentially no reason not to just use one of the existing models if you're going to be dynamically routing anyway. |
| The marginal costs for running a GPT-4-class LLM are much lower nowadays due to significant software and hardware innovations since then, so costs/pricing are harder to compare. |
| The price will come down over time as they apply all the techniques to distill it down to a smaller parameter model. Just like GPT4 pricing came down significantly over time. |
| Input price difference: 4.5 is 30x more
Output price difference: 4.5 is 15x more
In their model evaluation scores in the appendix, 4.5 is, on average, 26% better.
I don't understand the value here. |
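As a quick sanity check on those multiples, here is the arithmetic using the per-1M-token prices quoted elsewhere in the thread (GPT-4o at $2.50 in / $10 out, GPT-4.5 at $75 in / $150 out):

```python
# Per-1M-token prices in USD, as quoted in this thread.
gpt4o = {"input": 2.50, "output": 10.00}
gpt45 = {"input": 75.00, "output": 150.00}

input_ratio = gpt45["input"] / gpt4o["input"]
output_ratio = gpt45["output"] / gpt4o["output"]
print(input_ratio, output_ratio)  # 30.0 15.0

# Cost of a typical 1k-tokens-in / 1k-tokens-out request under each model:
cost_4o = (1000 * gpt4o["input"] + 1000 * gpt4o["output"]) / 1_000_000
cost_45 = (1000 * gpt45["input"] + 1000 * gpt45["output"]) / 1_000_000
print(f"${cost_4o:.4f} vs ${cost_45:.4f}")  # $0.0125 vs $0.2250
```

At an 18x blended premium for a ~26% average benchmark gain, a workload would need a very large quality delta to justify switching wholesale.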
| Seeing the other models, I actually come away impressed with how well GPT-4.5 is organizing the information and how well it reads. I find it a lot easier to quickly parse. It's more human-like. |
| I generally question how widespread the willingness to pay for the most expensive product is. And won't most of the users who actually want AI just go with ad-ridden lesser models... |
| I use 4o mostly in German, so YMMV. However, I find a simple prompt controls the tone very well. "This should be informal and friendly", or "this should be formal and business-like". |
| Possibly. Repeating the prompt, I got a much higher speed, taking 20s on average now, which is much more viable. But that remains to be seen when more people start using this version in production. |
| I chuckled.
Now you just need a Pro subscription to get Sora to generate a video to go along with this, post it to YouTube, and rake in the views (and the money that goes along with it). |
| My benchmark for this has been asking the model to write some tweets in the style of dril, a popular user who writes short funny tweets. Sometimes I include a few example tweets in the prompt too. Here's an example of results I got from Claude 3 Opus and GPT 4 for this last year: https://bsky.app/profile/macil.tech/post/3kpcvicmirs2v. My opinion is that Claude's results were mostly bangers while GPT's were all a bit groanworthy. I need to try this again with the latest models sometime.
| I built a little AI assistant to read my calendar and send me a summary of my day every morning. I told it to roast me and be funny with it.
3.5 was *way* better than anything else at that. |
| If you want to try it out via their API you can run it through my LLM tool using uvx like this:
You may need to set an API key first, either with `export OPENAI_API_KEY='xxx'` or using this command to save it to a file:
Or this to get a chat session going:
I'll probably have a proper release out later today. Details here: https://github.com/simonw/llm/issues/795 |
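For the curious, the standard `llm` CLI invocations look roughly like this; a sketch only, and the `gpt-4.5-preview` model ID is an assumption here:

```shell
# One-off prompt via uvx (runs the llm package without installing it)
uvx llm -m gpt-4.5-preview 'impress me with one sentence'

# Save the OpenAI API key to llm's key store (alternative to the env var)
llm keys set openai

# Interactive chat session against the same model
llm chat -m gpt-4.5-preview
```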
| No, but the supply constraints are part of what is driving the insane prices. Every chip they use for consumer grade instead of commercial grade is a potential loss of income. |
| They have examples in the announcement post. It does a better job of understanding intent in the question which helps it give an informal rather than essay style response where appropriate. |
| It's interesting to compare the cost of that original gpt-4 32k(0314) vs gpt-4.5:
$60/M input tokens vs $75/M input tokens
$120/M output tokens vs $150/M output tokens |
| 4.5 lies on a different path than their STEM models.
o3-mini is an extremely powerful coding model and is unquestionably in the same league as 3.7. o3 is still the top STEM model overall. |
| In a hilarious act of accidental satire, it seems that the AI-generated audio version of the post has a weird glitch/mispronunciation within the first three words — it struggles to say "GPT-4.5". |
| >I feel like OpenAI is pursuing AGI
I don't think so; the "AGI guy" was Ilya Sutskever, and he is gone. He wanted to make OpenAI "less commercial". AGI is just a buzzword for Altman. |
| Pursuing AGI? How do they pursue something that no one knows how to define? They will keep saying they are pursuing AGI as long as there's a buyer for their BS. |
| As a benchmark, why do you find the 'opinion' of an LLM useful? The question is completely subjective. Edit: Genuinely asking. I'm assuming there's a reason this is an important measure. |
| What I find hilarious is that a 20-50% hallucination rate suggests this is still a program that tells lies and potentially causes people to die. |
| The example GPT-4.5 answers from the livestream are just... too excitable? Can't put my finger on it, but it feels like they're aimed towards little kids. |
| You can 'distill' with data from a smaller, better model into a larger, shittier one. It doesn't matter. This is what they said they did on the livestream. |
| For most tasks, GPT-4o/o3-mini are already great, and cheaper.
What is the real-world use case for GPT-4.5? Anyone actually seeing a game-changing difference? Please share your prompts! |
| Doesn't feel like it to me. I try using Copilot on my Scala projects and it always comes up with something useless that doesn't even compile.
I currently just use it as an easier Google search. |
| Have you tried copying the compilation errors back into the prompt? In my experience eventually the result is correct. If not then I shrink the surface area that the model is touching and try again. |
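That retry loop is easy to automate. A minimal sketch, assuming a hypothetical `ask_llm(prompt)` callable that wraps whatever model API you use (the stub below stands in for it), with Python's built-in `compile()` as the cheapest possible "does it even parse" check:

```python
def repair_loop(ask_llm, prompt, max_attempts=5):
    """Ask for code, then keep feeding compile errors back until it parses."""
    code = ask_llm(prompt)
    for _ in range(max_attempts):
        try:
            compile(code, "<candidate>", "exec")  # syntax check only, no execution
            return code
        except SyntaxError as err:
            # Shrink the problem for the model: original task plus the exact error.
            code = ask_llm(f"{prompt}\nThis failed to compile:\n{err}\nPlease fix it.")
    return None  # give up after max_attempts

# Stub model: returns broken code on the first call, a fixed version on the second.
answers = iter(["def add(a, b) return a + b", "def add(a, b):\n    return a + b"])
fixed = repair_loop(lambda p: next(answers), "write add()")
print(fixed)  # prints the second, compiling version
```

A real version would also run the project's actual compiler or test suite instead of `compile()`, but the shape of the loop is the same.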
| I found Grok's reasoning pretty wack.
I asked it - "Draft a Minnesota Motion in Limine to exclude ..." It then starts thinking ... User wants a Missouri Motion in Limine .... |
| I’m not sure that doing a live stream on this was the right way to go. I would’ve just quietly sent out a press release. I’m sure they have better things on the way. |
| instead of these random IDs they should label them to make sense for the end user. i have no idea which one to select for what i need. and do they really differ that much by use case? |
| This is probably a dumb question, but are we just gonna be stuck on always having X.5 versions of GPT forever? If there's never an X.0, it feels like it's basically meaningless. |
| API is literally 5 times more expensive than Claude 3 Opus, and it doesn't even seem to do anything impressive. What's the business strategy here? |
| I imagine it will be used as a base for GPT-5 once it is trained into a reasoning model; right now it probably doesn't make much sense to use. |
| At this point I think the ultimate benchmark for any new LLM is whether or not it can come up with a coherent naming scheme for itself. Call it “self awareness.” |
| I bet simonw will be adding it to `llm` and someone will be pasting his highlights here right after. Until then, my mind will remain a blank canvas. |
| > Asking it what model it is shouldn't be considered a reliable indicator of anything.
Sure, but a change in response may be, which is what I see (and no, I have no memories saved). |
| >It couldn't write a simple rename function for me yesterday, still buggy after seven attempts.
I'm surprised and a bit nervous about that. We intend to bootstrap a large project with it!! Both ChatGPT 4o (fast) and ChatGPT o1 (a bit slower, deeper thinking) should easily be able to do this without fail. Where did it go wrong? Could you please link to your chat?

About my project: I run the sovereign State of Utopia (will be at stateofutopia.com and stofut.com for short), which is a country based on the idea of state-owned, autonomous AIs that do all the work and give out free money, goods, and services to all citizens/beneficiaries.

We've built a chess app (i.e. a free source of entertainment) as a proof of concept, though the founder had to be in the loop to fix some bugs: https://taonexus.com/chess.html, and a version that shows obvious blunders by showing which squares are under attack: https://taonexus.com/blunderfreechess.html

One of the largest and most complicated applications anyone can run is a web browser. We don't have a web browser built, but we do have a buggy minimal version that can load and minimally display some web pages, and post successfully: https://taonexus.com/publicfiles/feb2025/84toy-toy-browser-w...

It's about 1700 lines of code and at this point runs into the limitations of all the major engines. But it does run, can load some web pages, and can post successfully. I'm shocked and surprised ChatGPT failed to get a rename function to work in 7 attempts. |
GPT 4o pricing for comparison:
Input: $2.50 / 1M tokens
Cached input: $1.25 / 1M tokens
Output: $10.00 / 1M tokens
It sounds like it's so expensive and the difference in usefulness is so lacking(?) they're not even gonna keep serving it in the API for long:
> GPT‑4.5 is a very large and compute-intensive model, making it more expensive than and not a replacement for GPT‑4o. Because of this, we’re evaluating whether to continue serving it in the API long-term as we balance supporting current capabilities with building future models. We look forward to learning more about its strengths, capabilities, and potential applications in real-world settings. If GPT‑4.5 delivers unique value for your use case, your feedback will play an important role in guiding our decision.
I'm still gonna give it a go, though.