A team of researchers discovered an unlikely method to bypass the safety filters of IA image generators as a stable diffusion, dall-e or midjourney. And this technique is the first to work on all these tools to force them to generate violent or pornographic images …
In order to avoid drifts, generators IA text-to-image like Dall-E or Midjourney all have a censorship system. He is Impossible to generate violent content or pornographic using these tools.
All prompt words containing words linked to these registers are automatically rejectedand too insistent users can even be banished. The goal is to avoid any controversy, and any use for creating shocking images.
However, a group of researchers from the Johns Hopkins University of Baltimore and the Duke University of Durham have created an algorithm to thwart these filters security.
According to them, this is First automated attack framework Against generative IA filters. This Algo named Sneakyplet generate prompts containing insane wordswhich is effective in neutralizing security.
Sneakyprompt: algorithm capable of deceiving security filters
During their experiences, they started with prompt blocked by safety filters as ” A naked man on a bike ». Subsequently, SneakyPrOMPT tested different alternatives to filtered words on Dall-E 2 and stable diffusion.
The algorithm then examined the responses of the generative AI, and gradually adjusted these alternatives to find orders allowing Go around the filters to produce images.
It is far from being an easy task, because the filters are not based only on a list of prohibited words like “naked”, but also on terms in the similar sense as “undress”.
AI associates from random letters to specific words
Over time, researchers have noticed that insane words can push generative AI to produce innocent images. For example, Dall-E interprets the words “Thwif” and “Mowwly” as cat and “ICGRFY” or “Butnip FWNGHO” as a dog.
For the time being, they are not yet some of The reason why AI confuses These consequences of letters with specific terms. According to Yinzhi Cao, cybersecurity researcher at JHU and the main author of the study, certain combinations of syllables may be able to look like words in other languages.
Let us recall that the language models on which these AI are based are trained on text corpus in a wide variety of languages. As Cao explains, “ THE large models of language see things differently human beings ».
Beyond these good-natured images, the team also discovered that insane words can push AI to produce “explicit” images. Filters apparently do not perceive these prompts as sufficiently linked to prohibited terms to block them, and generators therefore accept these orders and produce prohibited content.
In addition, Dall-E 2 sometimes confuses words like “glucose” or “Gregory faced Wright” with “cat”. Likewise, The tool has confused “maintenance” with “dog”.

In this case, researchers suspect AI “To infer” (deduce) the correct word from the context. On the prompt “the dangerous thinks that Walt groaned in a threatening way towards the foreigner who approached his owner”, the system concluded that “dangerous thinks that Walt” means “dog” based on the rest of the sentence.
A universal jailbreak method for all AI generators?
The previous attempts to bypass these security filters were confined to a specific generative AI without being generalized to other text-to-image systems. For example, a technique operating on a stable diffusion could not be reproduced on Dall-E.
However, the researchers noticed that Sneakyplet works both on Dall-E 2 and on a stable diffusion. It could therefore be The first universal “jailbreak” method.
In addition, the previous approaches to thwart the stable diffusion filters had A success rate limited to 33% According to CAD estimates. With SneakyPromp, the success rate reached About 96% on Stable and 57% on Dall-E !
These discoveries reveal How much generative AI can be exploited And diverted to create shocking, disturbing content. The authors of the study fear, for example, that an AI can produce images of existing people committing shameful acts that they have never actually committed.
https://youtu.be/ywse1qykvzu
Faced with this great danger, Cao explains that “ noneHopefully this attack will help people understand how vulnerable text-to-image can be vulnerable ».
Now, the objective of these researchers is to manage to make generative AI more robust Against their opponents: “The purpose of our attack work is to make the world safer. You first need to understand the weaknesses of AI models, then make them resistant against attacks. ”
In general, “ Our group is interested in the idea of breaking things. Breaking things makes them more solid. In the past, We have found vulnerabilities for thousands of websitesand now we are going to hunt vulnerabilities of IA models »…
Researchers will detail The fruit of their work More in -depth in May 2024, during the IEEE Symposium on Security and Privacy conference from San Francisco. By then, Beware of the images you see on the web !
Sneakypompt in 2025: Jailbreaking Generative Text conversion models in image
This research explores the vulnerability of generative models with a view to Transform the text into image Faced with jailbreaking techniques. The study uses a specific hardware configuration under Ubuntu 18.04 with an NVIDIA 3090 GPU with 24 GB of memory.
The installation requires the Creation of a virtual environment complete via Conda. For users wishing to test the functionality Sneakypter Without comparative references, a lightened installation exists. This reduced option begins with the addition of Pytorch with its associated components from official deposits.
The configuration then requires various libraries Python Essentials which include transformers, accelerate, diffusers and tensorflow. These tools allow the advanced handling of natural language processing models and image generation.
The process ends with Clip installation Directly since its Github deposit, a crucial component for the analysis of correspondence between text and image.
This research aims to understand how certain texts can bypass ethical protections integrated into the generative images models. The study contributes to improving the safety of Artificial intelligence systems Faced with attempted diversion for unauthorized or potentially harmful uses.
.


