Please say what you tested, so that we can understand your effort without generalizing your conclusions beyond what you actually tried.
You can ask its knowledge cutoff and it will respond November 2023.
It has no idea of the big events of early 2024, like the earthquake in Japan.
There is no OpenAI GPT-4 model with a November 2023 knowledge cutoff.
You can also test its knowledge, like I did, to validate that it doesn't know anything past November 2023.
The heuristic of "is this task suitable to be worked on by an entity who is incredibly knowledgeable about language and impossibly well read" has been working for me.
Perhaps an open-source GPT-3.5/4? I remember OpenAI had that in its plans; if so, it would make sense for them to push alignment higher than with their closed models.
*Assuming you don't mean mathematically prove.* I can't test the bot right now, because it seems to have been hugged to death. But there are quite a lot of simple tests LLMs fail: basically anything where the answer is both precise/discrete and unlikely to be directly in its training set. There are lots of examples in this [1] post, which oddly enough ended up flagged. In fact, this guy [2] is offering $10k to anybody who can create a prompt that gets an LLM to solve a simple replacement problem he's found they fail at. They also tend to be incapable of playing even basic-level chess, in spite of there being undoubtedly millions of pages of material on the topic in their training base. If you do play, take the game out of theory ASAP (1. a3!? 2. a4!!) so that the bot can't just recite 30 moves of the Ruy Lopez or whatever.

[1] - https://news.ycombinator.com/item?id=39959589
[2] - https://twitter.com/VictorTaelin/status/1776677635491344744
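For context, the replacement problem in [2] is, as best I remember it, a tiny term-rewriting system: four tokens, and any adjacent pair matching a rule gets rewritten, repeatedly, until no rule applies. A conventional solver is a few lines, which is what makes the LLM failures interesting. A minimal sketch, with the rule set recalled from the challenge (so treat the details as approximate):

```typescript
// Deterministic solver for the A::B rewriting puzzle, as I recall it.
// Rules (possibly off in detail): "A# #A" and "B# #B" annihilate;
// "A# #B" and "B# #A" swap. Apply until no rule matches.
type Tok = "A#" | "#A" | "B#" | "#B";

function reduce(prog: Tok[]): Tok[] {
  const out = [...prog];
  let changed = true;
  while (changed) {
    changed = false;
    for (let i = 0; i + 1 < out.length; i++) {
      const a = out[i], b = out[i + 1];
      if (a === "A#" && b === "#A") { out.splice(i, 2); changed = true; break; }
      if (a === "B#" && b === "#B") { out.splice(i, 2); changed = true; break; }
      if (a === "A#" && b === "#B") { out.splice(i, 2, "#B", "A#"); changed = true; break; }
      if (a === "B#" && b === "#A") { out.splice(i, 2, "#A", "B#"); changed = true; break; }
    }
  }
  return out;
}

console.log(reduce(["B#", "A#", "#B", "#A", "B#"])); // ["B#"]
```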
I do massive amounts of zero-shot document classification, and the performance keeps getting better. It's also a domain with less of a hallucination issue, since these aren't open-ended requests.
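Not the parent's pipeline, but for anyone wondering what this looks like in practice: zero-shot classification with an LLM mostly amounts to constraining the model to a fixed label set. A minimal sketch against OpenAI's chat completions endpoint; the label set and prompt are made up for illustration:

```typescript
// Zero-shot document classification sketch. The labels and prompt are
// illustrative assumptions, not the commenter's actual setup.
const LABELS = ["invoice", "contract", "resume", "other"];

async function classify(doc: string, apiKey: string): Promise<string> {
  const res = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify({
      model: "gpt-4",
      messages: [{
        role: "user",
        content: `Classify the document into exactly one of: ${LABELS.join(", ")}.` +
          ` Reply with the label only.\n\n---\n${doc}`,
      }],
      temperature: 0, // keep the output deterministic enough to parse as a label
    }),
  });
  const data = await res.json();
  return data.choices[0].message.content.trim();
}
```

Because the output space is a handful of known strings rather than open-ended prose, a wrong answer is at least detectable, which is roughly why hallucination bites less here.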
That is not the reality today. If you want good results from an LLM, then you do need to speak LLM. Just because they appear to speak English doesn't mean they act like a human would.
llama3 on groq hits the sweet spot of being so fast that I now avoid going back to waiting on gpt4 unless I really need it, and being smart enough that for 95% of the cases I won't need to.
I like your interpretation, but how would they refer back to a plan if it isn’t stored in the input/output? Wouldn’t this be lost/recalculated with each token?
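For what it's worth, that intuition matches how plain autoregressive decoding works: the only state that persists across steps is the token sequence itself (the attention K/V cache is derived from exactly those tokens), so any "plan" has to be re-derivable from the visible prefix at every step. A toy sketch of the loop; `nextToken` is a hypothetical stand-in for a model forward pass:

```typescript
// Toy autoregressive decoding loop: nothing is carried between iterations
// except the growing token sequence itself, which `nextToken` re-reads in full.
// `nextToken` is a hypothetical stand-in for a model forward pass.
function nextToken(prefix: number[]): number {
  return prefix.length < 10 ? prefix.length : -1; // -1 plays the role of end-of-sequence
}

const tokens: number[] = [0]; // the "prompt"
while (tokens[tokens.length - 1] !== -1) {
  tokens.push(nextToken(tokens)); // recomputed from the full prefix each step
}
console.log(tokens); // [0, 1, 2, ..., 9, -1]
```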
Test it with small cities, preferably outside of the US. It hallucinates street names and directions no worse than other models. Had a good laugh.
Apparently much of ChatGPT's purple prose and occasional rare word usage comes from it speaking African-accented English, because OpenAI used Kenyan/Nigerian workers for training.
I'm impressed. I gave the same prompt to Opus, GPT-4, and this model, and I'm very impressed with the quality. I feel like it addresses my ask better than the other two models.

GPT2-Chatbot: https://pastebin.com/vpYvTf3T
Claude: https://pastebin.com/SzNbAaKP
GPT-4: https://pastebin.com/D60fjEVR

Prompt: I am a Senate aide; my political affiliation does not matter. My goal is to once and for all fix the American healthcare system. Give me a very specific breakdown of the root causes of the issues in the system, and a pie-in-the-sky solution to fixing the system. Don't copy another country's system; think from first principles and design a new system.
Prompt: code up an analog clock in html/js/css. Make sure the clock is ticking exactly on the second change. Second hand red, other hands black, all 12 hours marked with numbers.

ChatGPT-4: https://jsbin.com/giyurulajo/edit?html,css,js,output
GPT2-Chatbot: https://jsbin.com/dacenalala/2/edit?html,css,js,output
Claude 3 Opus: https://jsbin.com/yifarinobo/edit?html,css,js,output

None is correct. Styling is off in all of them, each in a different way, and all made the mistake of not ticking when the second actually changes.
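For what it's worth, the "tick exactly on the second change" part has a standard fix that all three apparently missed: schedule each redraw for the next wall-clock second boundary rather than using setInterval(fn, 1000), which starts at an arbitrary phase and drifts. A sketch; `drawHands` is a hypothetical stand-in for the actual rendering:

```typescript
// Hypothetical stand-in for the rendering: a real version would set CSS
// transforms on the hour/minute/second hand elements.
function drawHands(now: Date): void {
  console.log(now.toLocaleTimeString());
}

// Re-arm a timeout for the next second boundary so each tick lands right
// after the displayed second actually changes.
function tick(): void {
  const now = new Date();
  drawHands(now);
  setTimeout(tick, 1000 - now.getMilliseconds());
}
tick();
```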
My very first response from gpt2-chatbot included a fictional source :(

> A study by Lucon-Xiccato et al. (2020) tested African clawed frogs (Xenopus laevis) and found that they could discriminate between two groups of objects differing in number (1 vs. 2, 2 vs. 3, and 3 vs. 4), but their performance declined with larger numerosities and closer numerical ratios.

It appears to be referring to this [1] 2018 study from the same author on a different species of frog, but it is also misstating the conclusion. I could not find any studies from Lucon-Xiccato that matched gpt2-chatbot's description. Later, gpt2-chatbot went on about continuous shape discrimination vs. quantity discrimination without citing a source. Its information flatly contradicted the 2018 study; maybe it was relying on another study, but Occam's razor suggests it's a confabulation.

Maybe I just ask chatbots weird questions. But I am already completely unimpressed.

[1] https://www.researchgate.net/profile/Tyrone-Lucon-Xiccato/pu...
All of the fact-based queries I have asked so far have not been 100% correct on any LLM, including this one. Here are some examples of the worst performing:

"What platform front rack fits a Stromer ST2?": The answer is the Racktime ViewIt. Nothing, not even Google, seems to get this one. Discord gives the right answer.

"Is there a pre-existing controller or utility to migrate persistent volume claims from one storage class to another in the open source Kubernetes ecosystem?": It said no (wrong) and then provided another approach that partially used Velero and wasn't correct, if you know what Velero does in those particular commands. Discord communities give the right answer, such as `pvmigrate` (https://github.com/replicatedhq/pvmigrate).

Here is something more representative: "What alternatives to Gusto would you recommend? Create a table showing the payroll provider in a column, the base monthly subscription price, the monthly price per employee, and the total cost for 3 full-time employees, considering that the employees live in two different states." This and Claude do a good job, but do not correctly retrieve all the prices. Claude omitted Square Payroll, which is really the "right answer" to this query. Google would never be able to answer this "correctly." Discord gives the right answer.

The takeaway is pretty obvious, right? And there's no good way to "scrape" Discord, because there's no feedback, implicit or explicit, for what is or is not correct. So to a certain extent their data-gathering approach - paying Kenyans - is sort of fucked for these long-tail questions. Another interpretation is that for many queries, people are asking in the wrong places.
> Discord gives the right answer.

People on Discord give the right answer (if the people with the specific knowledge feel like it and are online at the time).
I don't think a magical prompt is suddenly going to make any current public model draw an ASCII unicorn like this thing does. (Besides, it already leaks a system prompt, which seems very basic.)
Why would they use LMSYS rather than A/B testing with the regular ChatGPT service? Randomly send 1% of ChatGPT requests to the new prototype model and see what the response is?
How do you measure the response? Also, it might be underaligned, so it is safer (from the legal point of view) to test it without formally associating it with OpenAI.
This frustrates me: just because I'm copying code doesn't mean it's the better choice. I actually want to try both and then vote, but instead I accidentally vote each time.
ChatGPT was configured not to output lyrics and recipes [1]. Later they removed this from the prompt (seemingly moving it somewhere else), but it's highly unlikely that such censorship will ever be removed: any citation of songs is a direct step toward a DMCA complaint, and it would be extremely difficult for OpenAI to "delete" text scattered throughout the model. Other, simpler models have no such censorship and can easily output both songs and their translations without crediting the authors and translators.

[1] https://medium.com/@adrian.punga_29809/chatgpt-system-prompt...
The site seems to use Gradio (which, from what I've experienced in the past, is quite slow). All those domains can be avoided just by using an ad blocker; I suggest uBlock Origin.
I'm seeing this as well. I don't quite understand how it's doing that in the context of LLMs to date being a "next token predictor". It is writing code, then adding more code in the middle.
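One mechanism that reconciles mid-sequence insertion with next-token prediction, assuming that's what's happening here rather than the model simply rewriting the whole block, is fill-in-the-middle (FIM) training: documents are rearranged with sentinel tokens so the "middle" comes last, and an ordinary left-to-right predictor learns to emit it after seeing both prefix and suffix. A sketch of the transform; the sentinel strings are illustrative and vary by model:

```typescript
// Fill-in-the-middle (FIM) document transform: move the middle span to the
// end so a left-to-right model learns to generate it conditioned on both
// the prefix and the suffix. Sentinel names here are illustrative.
function toFIMExample(prefix: string, middle: string, suffix: string): string {
  return `<PRE>${prefix}<SUF>${suffix}<MID>${middle}`;
}

// At inference time, prompt with "<PRE>prefix<SUF>suffix<MID>" and let the
// model continue; whatever it emits is the inserted middle.
console.log(toFIMExample("function add(a, b) {\n", "  return a + b;\n", "}\n"));
```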
I've been told by several users here that GPT4 provides perfect programming answers for all languages, as long as you prompt it correctly. Maybe you need to work on your prompts?
That's just not true: GPT-4 is pretty bad at relatively niche languages, e.g. Nim, Zig, and Crystal as some random examples (and those do at least have some popularity).
I'd be interested to see how this does on Nicholas Carlini's benchmark: https://nicholas.carlini.com/writing/2024/my-benchmark-for-l...

I've tried out some of my own little test prompts, but most of those are tricky rather than practical. At least for my inputs, it doesn't seem to do better than other top models, but I'm hesitant to draw conclusions before seeing outputs on more realistic tasks. It does feel like it's at least in the ballpark of GPT-4/Claude/etc. Even if it's not actually GPT-4.5 or whatever, it's still an interesting mystery what this model is and where it came from.
If you ask it about the model name/cutoff date, it claims to be "ChatGPT, based on GPT-4" with a knowledge cutoff of "Nov 2023." It claims that consistently, so I think it might be accurate.

I run a dying forum. I first prompted with "Who is at ?" and it gave me a very endearing, weirdly knowledgeable bio of myself and my contributions to the forum, including various innovations I made in the space back in the day. It summarized my role on my own forum better than I could have written it myself.

And then I asked "who are other notable users at" and it gave me a list of some mods but also stand-out users. It knew the types of posts they wrote and the subforums they spent time in. And without a single hallucination.