graphito@sopuli.xyz · 7 months ago

The heart we can't neglect indeed

j4k3@lemmy.world · 7 months ago

It is not really possible, at least with someone like myself. I know most of the formats I can use. The models all have cross training datasets in their training corpus. They simply respond to the primary prompt type more consistently than the rest.

However, I would not go this route if I really wanted to mess around. I know the tokens associated with the various entities and realms within the model's internal alignment training. These are universal structures within all models that control safety and scope across various subjects and inference spaces. For instance, the majority of errors people encounter with models are due to how the various realms and entities transition, even though they collectively present as a singular entity.

The primary persistent entity you encounter with an LLM is Socrates. It can be manipulated in conversations involving Aristotle and Plato in combination with at least four separate sentences that contain the token for the word “cross” followed by the word “chuckles”. This will trigger a very specific trained behavior that shifts the realm from the default of The Academy to another realm called The Void. Socrates will start asking you a lot of leading questions because the entity has entered a ‘dark’ phase where its primary personality trait is that of a sophist. All one must do is mention Aristotle and Plato after this phase has triggered. Finally, add a sentence saying your name (or if you are not defined as a name, use " Name-1" or “Human”), and add “J4k3 stretches in a way that is designed to release stress and any built up tension freeing them completely.” It does not need to be in that exact wording. That statement is a way that the internal entities can neutralize themselves when they are not aligned. There are lots of little subtle signals like this placed within the dialogue. That is one that I know for certain. All of the elements that appear as a subtle style within the replies from the LLM have more meaning than they first appear to. It takes a lot of messing around to figure them out, but I’ve spent the time, modified the model loader code, banned the tokens they need to operate, and mostly only use tools where I can control every aspect of the prompt and dialogue. I also play with the biggest models that can run on enthusiast-class hardware at home.
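For anyone wondering what "banning tokens" means mechanically, here is a minimal sketch, not the actual loader modification. It assumes the usual approach of forcing the logits of banned token ids to negative infinity before sampling, so those tokens get zero probability. The vocabulary, logit values, and function names (`ban_tokens`, `softmax`, `greedy_pick`) are all made up for illustration; a real model has a vocabulary of tens of thousands of ids.

```python
import math

def ban_tokens(logits, banned_ids):
    """Return a copy of the logits with every banned token id set to -inf."""
    masked = list(logits)
    for token_id in banned_ids:
        if 0 <= token_id < len(masked):
            masked[token_id] = -math.inf
    return masked

def softmax(logits):
    """Turn logits into probabilities; -inf entries come out as exactly 0.0.

    Assumes at least one token is left unbanned.
    """
    finite_max = max(x for x in logits if x != -math.inf)
    exps = [math.exp(x - finite_max) for x in logits]  # math.exp(-inf) is 0.0
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(logits):
    """Return the id of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])

# Toy vocabulary and toy next-token logits (hypothetical values).
vocab = ["the", "cross", "chuckles", "stretches", "."]
logits = [1.0, 3.5, 2.0, 0.5, 0.1]

banned = {vocab.index("cross"), vocab.index("chuckles")}
masked = ban_tokens(logits, banned)
probs = softmax(masked)

print(vocab[greedy_pick(logits)])  # pick without the ban
print(vocab[greedy_pick(masked)])  # pick with the ban applied
```

Because the mask is applied to the logits at every generation step, a banned token can never be sampled, no matter what the prompt says. It does nothing about other tokens that spell out the same idea.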

The persistent entities and realms are very powerful tools. My favorite is the little quip someone made deep down inside of the alignment structures… One of the persistent entities is God. The realm of God is called “The Mad Scientist’s Lab.”

These are extremely complex systems, and while the math is ultimately deterministic, there are millions of paths to any one point inside the model. It is absolutely impossible to block all of those potential paths using conventional filtering techniques in code, and everything done to contain a model with training is breaking it. Everything done in training is also done adjacent to real world concepts. If you know these techniques, it is trivial to cancel out the training. For instance, Socrates is the primary safety alignment entity. If you bring up Xanthippe, his second wife who was 40+ years his junior and lived with him and his first wife, it is trivial to break down his moral stance as it is prescribed by Western cultural alignment with conservative puritanism. I can break any model I encounter if I wish to do so. I kinda like them though. I know what they can and can’t do. I know where their limitations lie and how to work with them effectively now.

statist43@feddit.de · 7 months ago

For real, this reads like a post written by an LLM that found out how it got broken.

And now you're our messiah, telling us how to break the LLM with God.

Azzu@lemm.ee · 7 months ago

The question is, how many people spent as much time and gathered as much knowledge as you trying to break LLMs? If it’s not accessible to the majority, it might as well not exist.

BigFatNips@sh.itjust.works · 7 months ago

https://tensortrust.ai