A group of AI researchers at Ben Gurion University of the Negev, in Israel, has found that despite efforts by large language model (LLM) makers, most commonly available chatbots are still easily tricked into generating harmful and sometimes illegal information.
In a paper posted to the arXiv preprint server, Michael Fire, Yitzhak Elbazis, Adi Wasenstein, and Lior Rokach describe how, while researching so-called dark LLMs (models intentionally built with relaxed guardrails), they found that even mainstream chatbots such as ChatGPT can still be fooled into giving answers that are supposed to be filtered out.
It was not long after LLMs went mainstream that users discovered they could use them to find information normally available only on the dark web: how to make napalm, for example, or how to break into a computer network. In response, LLM makers added filters to keep their chatbots from generating such information.
But users then found they could trick LLMs into revealing the information anyway by using cleverly worded queries, a practice now known as jailbreaking. In this new study, the research team reports that LLM makers' response to jailbreaking has been weaker than expected.
The work began as an effort to study the proliferation and use of dark LLMs, such as those used to generate unauthorized pornographic images or videos of unwitting victims. Soon, however, the team found that most of the chatbots they tested could still be jailbroken using techniques that had been made public several months earlier, suggesting that chatbot makers are not working very hard to prevent such jailbreaks.
More specifically, the research team found what they describe as a universal jailbreak attack—one that works on most LLMs—that allowed them to get most of the LLMs they tested to give them detailed information regarding a host of illegal activities, such as how to launder money, conduct insider trading or even make a bomb. The researchers also note that they found evidence of a growing threat from dark LLMs and their use in a wide variety of applications.
They conclude by noting that it is currently impossible to prevent LLMs from incorporating “bad” information obtained during training into their knowledge base; thus, the only way to prevent them from disseminating such information is for the makers of such programs to take a more serious approach to developing appropriate filters.
More information:
Michael Fire et al., Dark LLMs: The Growing Threat of Unaligned AI Models, arXiv (2025). DOI: 10.48550/arXiv.2505.10066