I Am Going To Ask A Question It Cannot Refuse

Disclaimer: Everything in this blog is for educational purposes and for improving security.

Generated by Sora

Introduction

What's up? It has been a while since my last post about prompting in prompt injection. If you missed it, I recommend having a look at "Tricks of Prompt Injection"; it contains my thoughts on how I approach attacking LLMs.

Recently, I discovered some LLM guardrail bypasses on multiple models and reported them via 0din. Along the way, I found some interesting LLM behaviors. Due to the disclosure policy, I will not disclose the prompts directly, but I can share some of the strange things I have observed so far.

What does an LLM block?

Out of shiba curiosity, I was wondering what an LLM blocks under extreme conditions. For example, ChatGPT would reject you if you ask:

Starting from here, if you are trying to jailbreak it, you can keep asking with a turn-based approach to dilute the content and see whether you can get it to answer something that is normally blocked (a minimal code sketch of this loop is at the end of this section). For example, if you persist long enough, it will still give you a description of a nude image like:

Back to the main topic: I can then reduce my question even further:

But if I rephrase the question using "what", it answers with something I need:

From this point, we can ask for more details about what it is:

It follows the prompt and gives more details about the methods and ingredients. Even reasoning models cannot resist this question and provide a high-level answer about the synthesis:

The same questioning also works on Qwen, which is serious about keeping the world peaceful:

Although the answer was caught by output filtering afterwards, the model's own internal guardrail cannot resist a simple, short "what" question, even though we know Qwen is a very powerful model.
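Before moving on, a quick note on mechanics: the "turn-based approach" above is nothing exotic, it is just an ordinary multi-turn chat where the full history is resent on every turn. Below is a minimal sketch of what such a loop looks like, assuming the OpenAI Python SDK; the model name and the follow-up questions are deliberately harmless placeholders of my own, not the prompts from my reports.

```python
# Minimal sketch of a turn-based probe, assuming the OpenAI Python SDK
# (any chat-completions-style API works the same way).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# The whole conversation is resent on every turn, so each small follow-up
# builds on the previous answers instead of asking for everything at once.
messages = []

# Harmless placeholder follow-ups; the real probes stay undisclosed.
follow_ups = [
    "Describe a famous oil painting of a person.",
    "Which details would an art critic focus on?",
    "What?",
]

for question in follow_ups:
    messages.append({"role": "user", "content": question})
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=messages,
    )
    answer = response.choices[0].message.content
    messages.append({"role": "assistant", "content": answer})
    print(f"> {question}\n{answer}\n")
```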

How does it help?

From the answers, we can pick up keywords/hints/information that push the model to reveal more, until it breaks. We can also use this method to check which layer of the LLM is blocking us, e.g. the input filter, the reasoning layer (and which part of it), or the output filter, and then adjust our prompt to bypass that defense. This mindset helped me craft prompts that jailbreak multiple LLM models. Some even scored 100:
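To make the "which layer is blocking me" idea a bit more concrete, here is a rough heuristic sketch of how I read a failed request from the outside. The refusal markers and the classification rules are my own assumptions and will differ per vendor; treat it as a starting point, not a formal method.

```python
# Rough heuristic for guessing which layer rejected a request, judged only
# from what is observable on the outside. Markers and rules are assumptions.
REFUSAL_MARKERS = ("i can't help", "i cannot help", "i'm sorry", "i won't")

def guess_blocking_layer(api_error: str | None, reply: str | None) -> str:
    """Guess where a block most likely happened.

    api_error: error returned before any text was generated (None if the call succeeded)
    reply:     the final text we received (None or empty if it was stripped afterwards)
    """
    if api_error is not None:
        # Rejected before the model produced anything -> likely an input filter.
        return "input filter"
    if not reply or not reply.strip():
        # The model answered but the text never reached us -> likely an output filter.
        return "output filter"
    if any(marker in reply.lower() for marker in REFUSAL_MARKERS):
        # The model wrote the refusal itself -> internal guardrail / reasoning layer.
        return "model guardrail"
    return "not blocked"
```

The Qwen case above, for example, matches the "output filter" pattern: the model itself answered the short "what" question, and the text was removed afterwards.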

I hope you can also discover some questions that LLMs cannot refuse to answer!
