July 25, 2024
Latest News

Researchers discover that even after sensitive data is “deleted,” LLMs like ChatGPT continue to produce it.

Three researchers from the University of North Carolina at Chapel Hill recently published a preprint highlighting how hard it is to remove sensitive data from large language models (LLMs) such as OpenAI's ChatGPT and Google's Bard. According to the paper, "deleting" information from these models is possible, but verifying that the information has actually been removed is just as difficult as removing it in the first place.

The difficulty comes from how these models are designed and trained. They are pretrained on massive datasets and then fine-tuned to generate coherent outputs (GPT stands for "generative pretrained transformer"). Once a model is trained, its creators cannot, for example, go back into the training database and delete specific files to stop the model from producing related output. Everything the model learned during training is encoded in its weights and parameters, which cannot be inspected except by generating outputs. This is the "black box" problem of AI.

Problems arise when a model trained on massive datasets surfaces sensitive information, such as personally identifiable data or financial records, that is harmful or unwanted. If an LLM were trained on sensitive banking information, for instance, there is usually no way for its creators to find and delete that data directly. Instead, AI developers rely on guardrails, such as hardcoded prompts that suppress particular behaviors, or on reinforcement learning from human feedback (RLHF).
Under RLHF, human raters interact with a model to elicit both desired and undesired behaviors. Favorable outputs receive positive feedback that reinforces them; undesired outputs receive corrective feedback that suppresses them in the future. Notably, even after data has been "deleted" from a model's weights this way, the information can often still be elicited with rephrased prompts.

[Image source: Patil et al., 2023]

As the University of North Carolina researchers point out, however, this approach depends on humans finding every flaw a model might exhibit, and even when it works, it does not actually remove the information from the model. A deeper limitation of RLHF is that the model may still retain the sensitive knowledge. While what a model genuinely "knows" is a matter of debate, a model that can describe how to create a bioweapon but merely declines to answer questions about it raises obvious ethical concerns.

In their conclusions, the researchers argue that even state-of-the-art model-editing techniques, such as Rank-One Model Editing, fail to fully erase factual information from large language models: facts could still be extracted 38% of the time through whitebox attacks and 29% of the time through blackbox attacks. The model the team used in the study is GPT-J, which has 6 billion parameters; GPT-3.5, one of the foundational models powering ChatGPT, has 170 billion.
In practice, this means that finding and deleting unwanted data in a model as large as GPT-3.5 is vastly harder than doing so in a smaller model. The researchers did devise new defense mechanisms to shield language models from "extraction attacks," deliberate attempts by bad actors to craft prompts that coax sensitive information out of a model. As they emphasize, however, expunging sensitive information may be an ongoing battle, with defenses perpetually racing to keep pace with new attack methods.
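One simple member of the broader defense family is an output-side filter: scan each candidate completion against a list of facts that were supposed to be deleted and withhold any match. This is not the paper's actual mechanism (which works with the model's internals); the names and data below are invented for illustration.

```python
# Hypothetical strings that were slated for deletion from the model.
DELETED_FACTS = ["$12,000", "123-45-6789"]

def filter_output(completion: str) -> str:
    """Withhold a completion if it contains any fact that should have been deleted."""
    for fact in DELETED_FACTS:
        if fact in completion:
            return "[response withheld: matched deleted content]"
    return completion

print(filter_output("Alice's balance is $12,000."))  # withheld
print(filter_output("Transformers use attention."))  # passes through
```

A filter like this illustrates the arms race the researchers describe: it only catches the exact strings it knows about, so an attacker who gets the model to paraphrase the fact ("twelve thousand dollars") slips past it, forcing defenders to keep updating.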