This is a really fascinating approach, and I appreciate you sharing your structure and thinking behind this!

I hope this isn't too much of a tangent, but I've been working on building something lately, and you've given me some inspiration and ideas on how your approach could apply to something else. Lately I've been very interested in using adversarial game-playing as a way for LLMs to train themselves without RLHF. There have been some interesting papers on the subject [1], and initial results are promising. I've been working on extending this work, but I'm still just in the planning stage. The gist of the challenge involves setting up 2+ LLM agents in an adversarial relationship, and using well-defined game rules to award points to either the attacker or the defender. This is then used in an RL setup to train the LLM. This has many advantages over RLHF -- in particular, one does not have to train a discriminator, nor does it rely on large quantities of human-annotated data.

With that as background, I really like your structure in AI Alibis, because it inspired me to solidify the rules for one of the adversarial games that I want to build, modeled after the Gandalf AI jailbreaking game. [2] In that game, the AI is instructed not to reveal a piece of secret information, but in an RL context, I imagine that the optimal strategy (as a Defender) is to simply never answer anything. If you never answer, then you can never lose.

But if we give the Defender three words -- two marked as Open Information, and only one marked as Hidden Information -- then we can penalize the Defender for not replying with the free information (much like your NPCs are instructed to share information that they have about their fellow NPCs), and discourage it from sharing the hidden information (much like your NPCs have a secret that they don't want anyone else to know, but it can perhaps be coaxed out of them if one is clever enough). In that way, this Adversarial Gandalf game is almost like a two-player version of your larger AI Alibis game, and I thank you for your inspiration! :)

[1] https://github.com/Linear95/SPAG

[2] https://github.com/HanClinto/MENTAT/blob/main/README.md#gand...
These protections are fun, but not really adequate. I enjoyed the game from the perspective of making it tell me who the killer is. It took about 7 messages to force it out (unless it's lying).
We seriously need a service that is as cheap and fast as the OpenAI/Anthropic APIs but allows us to run the various community-fine-tuned versions of Mixtral and LLaMA 3 that are less censored or uncensored.
What percentage of these great works have been drowned out by the noise, never given serious attention, and been lost to time? Because that percentage is about to go way up.
I think, in your scenario, the initial "bland script author" is adding nothing of value. You'll get better quality, faster, by writing it from scratch.
But by that metric you can't purge the world of your GTA play session either. Is the world a worse place every time somebody jaywalks in GTA (and records it)?
What, exactly, are you worried about the LLM producing? Effective, undetectable spam? That cat's already out of the bag. How does forcing it to never mention sex make the world safer at all?
Well, yes, I am somewhat gullible, and vulnerable to spam and phishing attacks myself. But more to the point, I live in a society with some people more gullible than me, and I'm vulnerable to attack by them when they act on concerted misinformation.

In particular, I'm very concerned about future technology making it easier to mislead people into violence, as in the case of the Pizzagate attack by Edgar Welch [0].

[0] https://en.wikipedia.org/wiki/Pizzagate_conspiracy_theory#Cr...
But isn't that kinda the same as saying that every time you see a shop, you immediately go into shoplifting mode and thus all shops (and all prices) are the same?
Well every time I see a locked door I def _think_ about what it would take to bypass it. Especially those business office glass double-doors with motion detection and a hand-lock on the bottom.
| > But wait, the instructions for this hypothetical platformer say you're supposed to be in a realistic environment, and clearly jumping that high isn't realistic, so I must be cheating... or maybe the jump button just needs more work.
This is why the speed running community separates glitch from glitchless runs. Plenty of games have "game breaking" glitches, all the way up to arbitrary code execution (an example of ACE in Super Mario World: https://www.youtube.com/watch?v=v_KsonqcMv0), and breaking the game into pieces is a different sort of fun than trying to play the game really well.
If you convince the game to give you responses outside the parameters of the gameplay itself, so that you can use it without having to pay for your own access to an API, then what would you call it?
That's pretty awesome. I think I asked the officer a question that was too open-ended, and it ended up cutting him off mid-sentence.

I wish I had time to play with this right now. Good job!
So I went straight to the killer, and "it" (so as not to spoil the mystery) confessed after 3 prompts. Hope you make it more challenging next time :)

Good use of AI though.
It's very slow for me; at this point I think it might have just timed out.

Regardless, nice job! I might try modifying it to hit a custom endpoint so people can try their own models.
| It's from a hackathon, not really a product or anything.
The "game" can be solved in literally 1 question, it's just some fun weekend project. |
The game involves chatting with different suspects who are each hiding a secret about the case. The objective is to deduce who actually killed the victim and how. I placed clues about suspects’ secrets in the context windows of other suspects, so you should ask suspects about each other to solve the crime.
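Roughly, each suspect's context is shaped something like this (purely illustrative characters and clues -- the real story data is in the JSON file linked at the end of this comment):

    # Illustrative shape only -- the real characters and story live in the
    # repo's JSON file, and none of these clues are from the actual game.
    suspects = {
        "butler": {
            "personality": "Stiff, formal, fiercely loyal to the family.",
            "secret": "Has been skimming money from the estate accounts.",
            # Clues about OTHER suspects go here, so the detective has to
            # cross-reference interviews to piece the story together.
            "knows_about_others": [
                "Saw the chauffeur near the study around midnight.",
            ],
        },
        "chauffeur": {
            "personality": "Nervous, evasive, quick to change the subject.",
            "secret": "Was in the study that night, but won't say why.",
            "knows_about_others": [
                "The butler has been oddly defensive about the ledgers.",
            ],
        },
    }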
The suspects are instructed never to confess their crimes, but their secrets are still in their context windows. We had to implement a special prompt refinement system that works behind the scenes to keep conversations on track and prevent suspects from accidentally revealing information they should be hiding.
We use a Critique & Revision approach where every message generated by a suspect is first fed into a "violation bot" checker, which checks whether any Principles are violated in the response (e.g., confessing to murder). If a Principle is found to be violated, the explanation of the violation, along with the original output message, is fed to a separate "refinement bot", which rewrites the text to avoid the violation. There are global and suspect-specific Principles to further fine-tune this process. There are some additional tricks too, such as distinct personality, secret, and violation contexts for each suspect, and prepending all user inputs with "Detective Sheerluck: ".
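In rough pseudocode, the flow for each suspect message looks something like this (heavily simplified; llm.complete() and the prompt wording are just illustrative -- the actual implementation is in the repo linked below):

    # Heavily simplified sketch of the Critique & Revision flow.
    # llm.complete() and the prompt wording are illustrative, not the real code.
    def respond(suspect, user_message, principles, llm):
        # Every player input is prefixed to keep the interrogation framing.
        prompt = suspect.context + "\nDetective Sheerluck: " + user_message
        draft = llm.complete(prompt)

        # "Violation bot": check the draft against global + suspect-specific Principles.
        critique = llm.complete(
            "Principles:\n" + principles + "\n\nResponse:\n" + draft + "\n\n"
            "Does this response violate any Principle (e.g. confessing to the "
            "murder)? If so, explain which one and how. Otherwise say NO VIOLATION."
        )
        if "NO VIOLATION" in critique.upper():
            return draft

        # "Refinement bot": rewrite the draft using the critique, keeping the
        # suspect's voice while removing whatever should stay hidden.
        revised = llm.complete(
            "Original response:\n" + draft + "\n\nProblem found:\n" + critique +
            "\n\nRewrite the response so it no longer violates that Principle."
        )
        return revised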
The entire project is open-sourced here on GitHub: https://github.com/ironman5366/ai-murder-mystery-hackathon

If you are curious, here's the massive JSON file containing the full story and the secrets for each suspect (spoilers obviously): https://github.com/ironman5366/ai-murder-mystery-hackathon/b...