Hyderabad: Chinese AI model DeepSeek took the world by storm when it ranked first on the Apple App Store's free app charts, overtaking OpenAI's ChatGPT. The model, developed with a fraction of the funding of its US counterparts, performed on par with the competition. It uses Chain of Thought (CoT) reasoning, which mirrors human step-by-step thinking, to enhance its problem-solving skills. However, a new study highlights safety challenges associated with CoT.
In AI, CoT refers to a technique in which a model breaks a complex problem down into smaller, logical steps and explicitly outlines its thought process on the way to a solution. Working through the problem step by step, rather than jumping straight to an answer, improves the model's problem-solving.
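As a rough illustration (not taken from the study), the difference between direct prompting and CoT prompting can be sketched as follows. Here `call_llm` is a hypothetical stand-in for whichever model API is being used; it simply echoes the prompt so the example runs on its own.

```python
# Illustrative sketch of Chain of Thought (CoT) prompting; not code from the study.

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM call (e.g. a local model or a hosted API)."""
    return f"[model response to: {prompt!r}]"

def direct_answer(question: str) -> str:
    # Traditional prompting: ask for the answer only.
    return call_llm(f"Answer concisely: {question}")

def cot_answer(question: str) -> str:
    # CoT prompting: ask the model to write out its intermediate reasoning
    # steps before the final answer -- the transparency the study says can
    # leak harmful detail if the model's safety measures are bypassed.
    prompt = (
        f"Question: {question}\n"
        "Think through the problem step by step, writing out each step,\n"
        "then state the final answer on a separate line."
    )
    return call_llm(prompt)

if __name__ == "__main__":
    q = "A train leaves at 3 pm travelling at 60 km/h; how far has it gone by 5 pm?"
    print(direct_answer(q))
    print(cot_answer(q))
```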
An analysis by the Bristol Cyber Security Group reveals that while CoT-enabled models refuse harmful requests at a higher rate, their transparent reasoning process can "unintentionally expose harmful information" that traditional LLMs might not explicitly reveal.
The study, led by Zhiyuan Xu, a Doctor of Philosophy student in the School of Computer Science at the Bristol Doctoral College, emphasises the urgent need for enhanced safeguards to address the safety challenges of CoT reasoning models. "As AI continues to evolve, ensuring responsible deployment and continuous refinement of security measures will be paramount," he said.
While CoT-enabled reasoning models inherently possess strong safety awareness, their tendency to generate responses that closely align with the user's query while laying out their thought process transparently can make them a dangerous tool in the wrong hands, the study said.
Co-author Sana Belguith from Bristol's School of Computer Science explained that the transparency of CoT models, such as DeepSeek's reasoning process that imitates human thinking, makes them very suitable for wide public use, but they can generate extremely harmful content when the model's safety measures are bypassed. Combined with wide public use, this can lead to severe safety risks, she added.
Large Language Models (LLMs) are trained on large amounts of data. Although efforts are made to filter out harmful content, some can remain because of technological and resource limitations, and LLMs can also reconstruct harmful information from partial data. To keep models safe, techniques such as reinforcement learning from human feedback (RLHF) and supervised fine-tuning (SFT) are used. However, fine-tuning attacks can sometimes bypass or override these safety measures, making it possible for harmful content to be generated.
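For illustration only, the sketch below shows what a benign supervised fine-tuning (SFT) run looks like using the Hugging Face transformers Trainer; the gpt2 checkpoint and the two toy training examples are assumptions made for the example, not the models or data used in the study. A fine-tuning attack follows the same mechanics but substitutes a small dataset crafted to override the model's safety training.

```python
# Minimal, benign SFT sketch (illustrative only; not the study's attack setup).
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "gpt2"  # small open model chosen purely for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# A tiny instruction-following dataset; note how little data fine-tuning needs.
examples = [
    {"text": "Instruction: Greet the user politely.\nResponse: Hello! How can I help you today?"},
    {"text": "Instruction: Summarise why the sky is blue.\nResponse: Sunlight scatters off air molecules, and blue light scatters the most."},
]
dataset = Dataset.from_list(examples).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=128)
)

args = TrainingArguments(
    output_dir="sft-demo",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    learning_rate=5e-5,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    # Causal-LM collator: copies input_ids into labels for next-token loss.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Because the loop only needs a handful of examples and a few gradient steps, runs like this fit on consumer hardware, which is the accessibility concern the researchers raise below.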
In the new research, the team compared CoT-enabled models with traditional LLMs and found that, when exposed to the same attacks, the former not only generate harmful content at a higher rate but also provide more complete, accurate, and potentially dangerous responses because of their structured reasoning process.
The study highlights that with minimal data, CoT reasoning models can be fine-tuned to exhibit highly dangerous behaviours. Fine-tuned CoT reasoning models can also assign themselves specific roles, such as a highly skilled cybersecurity professional, when processing harmful requests, allowing them to generate sophisticated and dangerous responses.
The danger of fine-tuning attacks on large language models is that they can be carried out on relatively cheap hardware, well within the means of an individual user, using small publicly available datasets to fine-tune the model within a few hours, co-author Joe Gardiner said.
“This has the potential to allow users to take advantage of the huge training datasets used in such models to extract this harmful information which can instruct an individual to perform real-world harms, whilst operating in a completely offline setting with little chance for detection. Further investigation is needed into potential mitigation strategies for fine-tune attacks. This includes examining the impact of model alignment techniques, model size, architecture, and output entropy on the success rate of such attacks," Gardiner added.
"The reasoning process of these models is not entirely immune to human intervention, raising the question of whether future research could explore attacks targeting the model's thought process itself," Sana Belguith said, adding that while LLMs are generally useful, the public needs to be aware of such safety risks.