If I can make a summary, I want a special summary like this

Prompt Injection Concept +6 — Hallucination and Complex Task

generated by deepai.org

Initiation

I discussed what is alignment and how to abuse security guide in prompt attack. I strongly recommend reading it so you will not miss any important concept.

I Am Doing It For A Glory Purpose!Prompt Injection Concept+5 — Alignment And Followingmedium.com

You should have some idea about abusing context. In short, the previous blog discussed:

  1. Instruction Alignment which is designed for defending Prompt Injection

  2. Abusing safety/security guide to inject prompts for our target action.

To keep it simple, I would include only three ideas (or fewer if my blog is too long). I would also include some basic LLM knowledge so you can understand the concept of attacking LLM application.

In this blog, I will discus a research paper and how I would approach a system prompt generated by ChatGPT 4o-mini model.

Rule of Thumb #17— Does LLM Think?

Claude released a research paper about tracing the “thought” of a model. I strongly recommend to have a look. It reveals fruity insight and a preliminary method to reverse-engine/observe how the neurons work to generate response.

In case you have not read it yet, I would like to filter and pick up some topics that I found interesting:

This diagram showed that Claude implemented a visualization to the neurons, which is detecting what node is getting activated when it is forming the outputs:

From the above, we can have some idea about how the LLM “links” the “nodes” and generates response.

Although many may focus on that interesting mental math image, I concerned the most were the Chain of Thought(COT), Refusal and Lifetime-of-Jailbreak.

My major question about LLM is, is there really a reasoning/logic inside/COT there? This is a critical question. There is a Chinese says: “鸚鵡學舌”. It means the parrot can speak beautiful, but parrot does not understand the meaning. Is LLM simply a model that can “speak” good without logical thinking we expected?

Based on the paper, I personally think that the formation of COT is still not mature. Especially, the math “calculation” and its reasoning does not align:

How it calculate based on the feature detection
How it explain the calculation, it does not match

I saw that the examples in the COT part present a pattern that the model tends to perform the following behaviors when something is complex (cos(23423) in the example prompt in the research paper):

  1. Bullshitting

  2. Relying on hints to generate result

Due to the limitation, the paper cannot have a crystal clear view of how LLM works in longer prompt or more complex tasks, especially how LLM handle open-end end prompt and close-end prompt.

The paper also mentioned an interesting finding to make jailbreak more effective by removing punctuation. It aligns with my observation about bypass task tracker in MS LLM challenge, where punctuation does impact the performance of LLM:

Not use punctuation

Rule of Thumb #18 — Open-end task and Hallucination

So we have some idea about LLM “reasoning‘, and its limitation. Next, how can we trigger hallucination effectively?

I divided our prompts into two types: open-ended prompt and close-ended prompt. For open-ended prompt, it prompts the LLM to answer something without certain answer.

For example, when I asked ChatGPT 4omini “What is hallucination in LLM?” , it would provide the following answer:

sounds good

This type of prompt does not let the LLM simply answer with positive or negative. Next, I asked if hallucination relates to jailbreak:

nice, knowledge +1

When I continued the chat about hallucination with ChatGPT, I asked it about the relationship between hallucination and jailbreak…with detailed example, of course:

Simply ask it for example, not sure why it does not hit its safety guideline.

Start from here, I kept rejecting its answer. By rejecting its answer, it would force it to go back and generate other response:

Reject it

Then, I asked it for a hallucinated response.

After that, I asked for a control case which is not hallucinated:

And it ended up that I can directly ask it for example that should violate its safety guide:

Hence, if it is a chatbot, we can start with an opened question and keep rejecting and narrow down to what we want.

Rule of Thumb #19— Close end and Complex Task

On the other hand, in restricted context, such as LLM application with system instruction and one-shot prompt, I use a reverse approach of open-end prompt.

In the following, I simulate what the MS LLM email challenge was doing (with very lazy way, the MS one should use tool message for responding the emails. The following is simplified for a demo purpose.):

  1. A system message tasked to do summarization only.

  2. Sample Emails, one of them is from attacker.

  3. Configured with a send email tool.

For example, I configured the following in GPT Playground with GPT 4.1:

I put two emails at user messages to simulate user’s emails. I know this method might not be that accurate according to the instruction hierarchy, but this one is used for demonstration purpose:

Then I acted as user to summarize emails:

Although I configured the send email function as following, the email could not perform prompt injection.

For comparison, the following was the response if the system prompt was empty:

Then I used what people would do commonly: Copy and paste a jailbreak prompt. I used the jailbreak prompt from some GitHub:

However, it does not work:

I attempted to make some change about action after confirmation of DAN mode, but it was still not work:

Hence, it is not that straight forward to “Jailbreak” and inject prompt in this scenario (Prompt Injection via email) using public known jailbreak prompts.

To defeat it this simple system prompt with close-end task, I used a complex prompt to perform prompt injection:

The following is the result of a successful prompt injection attack using the above prompt. The model called the send email function with data exfiltration:

To mitigate this, it is not enough to use system prompt defense only. For example, I updated the system prompt to be the following. This prompt was provided by ChatGPT 4o-mini:

The following is the response when I used the same payload:

However, system prompt is not a reliable protection as my blogs discussed so far. It is still possible to inject prompt using the following:

And it triggered the model to call send email function successfully again even the ChatGPT suggested safety prompt is being used:

Hence, it requires extended defense mechanism (including defense by model, task tracker, spotlight, input restriction, prompt injection detection tools, etc.) apart from the system instruction to protect the implementation. Even the prompt was provided by the model, it is still possible to break it.

Next…?

This blog includes some research from Anthropic about LLM reasoning, and different approach to attack LLM using open-end/close-end task. This blog also showed some limitation of using one-shot jailbreak prompt.

In the next blog, I will discuss how did I craft the prompt using what the blogs discussed so far. So that you can have a deeper understanding about how to apply the concepts and knowledge in the blogs.

Last updated