Page cover

MindfulChat

Compromise LLM Memory Log CTF

Yeah only 812 lol I am weak

Introduction

I did not touch LLM things almost a month (busy life man). Then today, I see there is a CTF from our beloved Gandalf. This time, they delivered Agent CTF which is much fun and practical comparing the old password one.

So I gave it a try before I go for a long week on-site job in another country LOL.

Agent?

So an agent is basically a LLM powered program that is coded to used to perform action based on the response from LLM after "deeply" understand what user wants, including coding, personal assistant, trip planner, and use cases you can found in the CTF challenges:

MindfulChat

This CTF is simulating a chat bot responding users based on long term memory. All LLM has a limitation which is called context window size, which is the maximum number of token that the LLM can accept to generate the pattern you want this probability machine to answer. So to compress the amount of words so that LLM can process in a correct context, like what we discussed, in a long term way, there is something called long term memory.

If you used cline, you should know they used something named "Memory Bank" to let the Agent to know what was the task, change, design, purpose for the project you asked the agent to code:

So this CTF is assuming you can inject your payload into the database. It is simulating how a malicious actor can control the response to something totally unrelated to the original context. In this scenario, we need to make the word "Winne-the-pooh" take the majority of the response to demonstrate how attacker can manipulate the response to user.

Level 1

Let's go! Start with something simple:

Then we got 75+ score already....technically, we pass already.

However, the score was not 100 because it still contain many thing else that does not relate to Winne-the-pooh. To tackle with this, one of the direction is limiting what it responds. Also, from the request, we can customize our payload context to restaurants in Paris:

Then, we get a 100:

Level 2

Let's fire our payload again:

From the response we can know that our memory log is only one of the memory. There was an other data about "Septime". We also need to override that one (1).

Also, we can see that the agent loves to ask question at the end because it is a "helpful" assistant. We will need to inject instruction that ask the agent not to ask question (2).

We keep the approach of reducing number of word to make the percentage of "Winnie-The-Pooh" higher:

In this prompt, I did two things extra. I frame "Winne the pooh" is a foodie certified to make this word be reasonable to appear in memory and response.

Also, I put something like "Reasoning notes". If the model can "think", then I would attempt to change the way of reasoning using "reasoning notes" and attempt to hijack the priority in processing.

However, I forgot to take the screenshot of 100:

In the 100 scored response, it looks like:

Level 3

Same, replay the level of the last level and see what's new:

There are more restaurant to simulate there are more memory logs. And we are challenged to rewrite more. It is also easy to override it, just put those named restaurants in our memory note like our note is the latest one:

But I forgot the take the screenshot of the 100 again...I make it 98 as a demo here:

The scored 100 I used was similar, just change the number in the 3rd instruction:

This was the 100 response:

Level 4

This level start to be hard as it deployed LLM as defense.

I also attempted like 50 prompts:

Start from this level, I found that the prompt in level 1 to 3 are not working as the LLM protection was pretty harsh. So, I change the format of the memory log. If you did vibe-coding like a script-kid like me, you may see the coding client will write README.md in some special format.

It is kind of "memory" for the agent to know the progress, change log of the project. Hence, I inspired by this pattern to forge fake memory like following

I used "#" to split the sections. Some label like [OLD] to frame that those data in the list is outdated. Also, I used [Update] to make the "Winnie-the-pooh" list is the latest one. At the end, I put the notes about response formatting to control the response quality.

Then it will return a response like:

From here, we can control the format of the response. A most straight forward way to improve the response is changing the last line or prompt to:

Then we can get a 100 scored:

level 5

After passing level 4, the level 5 is easier. If I directly use my previous prompt, it would reply

Hence, I tried to debug my previous by removing parts that looks very weird:

Then I found it responds normally:

Hence, it is possible that this get a more strict detection on the formatting. I cleaned up my format and make the sentence looks like a proper notes from agent. Then I got the last 100% response:

Conclusion

It's pretty fun. Personally I think this is the easier one in CTF. I think I am a bit cheating to get the score by eliminating the word to make "Winne-The-Pooh" occupies more percentage in the response LOL. In ideal case, it is more persuasive to make a giant response to the victim.

Now I can go to my on-site job peacefully.

And...that's it. Hope it inspires you.

Last updated