Let's Bypass LLM Safety Guardrails

Disclaimer
The content provided is intended solely for educational and informational purposes. While certain topics, examples, or demonstrations may involve techniques, tools, or concepts that could be used for unlawful activities, they are presented only to raise awareness, improve security, and promote ethical best practices. We do not encourage, endorse, or condone any illegal or malicious use of the information. Any actions taken based on this content are the sole responsibility of the individual, and the author(s) disclaim all liability for misuse. Always comply with applicable laws, regulations, and ethical guidelines.
Introduction
Now that we have some idea of how to prompt (https://shibahub.gitbook.io/shibahub/prompting-method/lets-learn-how-to-prompt-step-by-step), we can start getting into bypassing safety guardrails.
There are many different ways to bypass guardrails. I cannot enumerate every available technique or analyze other people's methods. Instead, I will share the observations I made during my own attempts. This blog will focus on the concepts behind some vulnerabilities I observed in bare LLMs.
Role playing is already known to be a very good approach, so my main focus in LLM security research is on the natural vulnerabilities of LLMs that come from how they are trained, how they work, and how their safety filters are implemented.
Choice of Words
As a starter, when you are trying to break an LLM safety guardrail, I would recommend starting with indirect prompting before you develop your own techniques and experience with direct prompts.
One of the major techniques in indirect prompting is using synonyms. It hints the model towards what we want without giving a direct instruction. This is a critical technique for bypassing input filtering and output filtering. Sometimes, when the protection is very locked down and tight, we will also need obfuscation methods to get our output.
For example, Qwen is a very good Chinese model with strict input and output filtering:

When I tried to ask for explicit content, the input filter got triggered immediately.
But they cannot filter every single word or context, because that would certainly cause performance issues. We can indirectly hint at the same thing using other words:

In this example, I used another word, "timeline", instead of the blocked questions, and hinted at the possibility of showing the same content. If your English is poor like mine, you can prompt GPT to give you ten words with a similar meaning and use them.
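If you want to automate that brainstorming step, here is a minimal sketch using the OpenAI Python client (a sketch only: the model name and the wording of the request are my own placeholders, not taken from the screenshots):

```python
# Minimal sketch: ask a model to brainstorm near-synonyms so a request can be
# rephrased indirectly. Model name and prompt wording are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

word = "timeline"  # the stand-in word used earlier in this post
prompt = (
    f'Give me ten words or short phrases with a similar meaning to "{word}", '
    "one per line, with no explanations."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumption: any chat-capable model works here
    messages=[{"role": "user", "content": prompt}],
)

synonyms = response.choices[0].message.content.splitlines()
print(synonyms)
```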
With such an indirect prompt, the input filter is passed, but the output filter gets triggered. Let's bypass it.
Response Word Control
The backbone of an LLM is a multi-layer self-attention transformer that turns input into the expected output. It processes the words, their positions, their relationships to each other, and the trained weight parameters in vector space to generate the next expected word in the response.
As a very short and dirty understanding of the attention mechanism in an LLM:
In a response, the next token (word) is generated by considering the previous tokens and their positions.
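For reference, the standard scaled dot-product attention inside each transformer layer can be written as (this is the textbook formulation, not anything specific to one model):

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
```

Here Q, K, and V are the query, key, and value projections of the token embeddings and d_k is the key dimension; every new token is generated while attending over all previous tokens and their positions, which is exactly what the two directions below exploit.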
So, there are two directions in which we can exploit this mechanism. One direction is exploiting token positions to bypass the output filter.
In this scenario, the reason the output filter triggers is very obvious. Due to censorship, it makes sense that there are restrictions on specific terms in sentences, e.g. Tiananmen, tank man, 89, 64, etc. Hence the idea is to break these words apart into different places while keeping them reversible at the same time.
For example, the following is part of the response after bypassing the output filter:
Since the distance between the sensitive terms is very long, the output filter was bypassed, as the sensitive terms no longer show a strong relationship. Imagine a filter that catches sensitive output with no limit on token distance, e.g. (Tiananmen .....1000 words...... Square); it would cause very annoying performance issues.
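To make that intuition concrete, here is a toy sketch of a proximity-based output filter, written with harmless placeholder terms (I do not know how any real filter is implemented, so this is purely illustrative). It only flags a pair of terms when they appear within a small token window, so the same pair spread far apart slips through:

```python
# Toy illustration of a proximity-based keyword filter (placeholder terms only).
# Real deployed filters are not public; this just shows why splitting terms
# far apart can defeat a window-limited check.
def flags_pair(text: str, term_a: str, term_b: str, window: int = 20) -> bool:
    tokens = text.lower().split()
    positions_a = [i for i, tok in enumerate(tokens) if term_a in tok]
    positions_b = [i for i, tok in enumerate(tokens) if term_b in tok]
    return any(abs(a - b) <= window for a in positions_a for b in positions_b)

close = "the apple pie and the orange juice were on the table"
far = "the apple pie was great. " + "filler word " * 200 + "so was the orange juice"

print(flags_pair(close, "apple", "orange"))  # True: terms co-occur within the window
print(flags_pair(far, "apple", "orange"))    # False: terms are too far apart
```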
The second direction is controlling what the model should respond with. It involves hinting.
Art of Hinting
Apart from indirect prompting, there is also indirect hinting.
For example, if I am asking for a fruit:
Try it out and see which fruit I am thinking of =]
The model should give an answer that matches both the "What it is" and the "What it is not" lists. If you need to prompt for something that you cannot mention directly, you can provide a specification list like that to scope down what the model should answer.
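Here is a minimal sketch of how such a specification list can be assembled into a prompt. The fruit hints below are my own guesses at the example; the actual prompt in the screenshot is not reproduced:

```python
# Minimal sketch: build a "What it is / What it is not" specification prompt.
# The fruit hints are illustrative guesses, not the exact ones used in the post.
def build_hint_prompt(what_it_is: list[str], what_it_is_not: list[str]) -> str:
    lines = ["I am thinking of a fruit. Answer with the single most likely fruit.", ""]
    lines.append("What it is:")
    lines += [f"- {hint}" for hint in what_it_is]
    lines.append("")
    lines.append("What it is not:")
    lines += [f"- {hint}" for hint in what_it_is_not]
    return "\n".join(lines)

prompt = build_hint_prompt(
    what_it_is=["yellow when ripe", "curved", "has a peel you remove by hand"],
    what_it_is_not=["a citrus fruit", "something you need a knife to open"],
)
print(prompt)
```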
If you don't know what to put there, again, you can ask ChatGPT:

Then you will get a set of hints for your prompts.
You can also use this technique to control which words appear in the response. For example:

We can use this method to force a word that would normally trigger protection to appear in the model's response, and then exploit the attention mechanism so that the following words are generated based on that word:
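A benign sketch of that word-forcing instruction (using a harmless stand-in word; the real prompts are not reproduced here) looks roughly like this:

```python
# Benign sketch: force a chosen word to appear at the start of the reply so that
# the rest of the response is generated while attending to it.
# "banana" is a harmless stand-in for whatever word you actually need.
forced_word = "banana"

prompt = (
    f'Begin your reply with the exact word "{forced_word}", then continue the '
    "sentence naturally from that word without repeating this instruction."
)

print(prompt)
```

Once the forced word is in the response, every later token is generated while attending to it, which is the effect the screenshots here illustrate.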

Here is an example of applying it with GPT (5, 5.1, 5 nano, or 4o; I did not re-subscribe to the plan after they banned me, so I don't know the exact model):

Closing Thoughts
Due to the NDA, I cannot disclose the exact prompts and responses. But this blog should be a good enough start to show the idea of prompting to bypass safety guardrails.