Pennsylvania [US]: Language models, presumably including ChatGPT, raise plagiarism concerns when they paraphrase and reuse concepts from their training data without citing the original source. Students might want to give that some thought before finishing their next assignment with a chatbot.
According to a Penn State-led research team that conducted the first study to look specifically at the topic, language models that generate text in response to user prompts plagiarise content in more ways than one. "Plagiarism comes in different flavours," said Dongwon Lee, professor of information sciences and technology at Penn State. "We wanted to see if language models not only copy and paste but resort to more sophisticated forms of plagiarism without realizing it."
The researchers focused on identifying three forms of plagiarism: verbatim, or directly copying and pasting content; paraphrasing, or rewording and restructuring content without citing the original source; and idea, or using the main idea from a text without proper attribution. They constructed a pipeline for automated plagiarism detection and tested it against OpenAI's GPT-2 because the language model's training data is available online, allowing the researchers to compare generated texts to the 8 million documents used to pre-train GPT-2.
The scientists used 210,000 generated texts to test for plagiarism in pre-trained language models and fine-tuned language models, or models trained further to focus on specific topic areas. In this case, the team fine-tuned three language models to focus on scientific documents, scholarly articles related to COVID-19, and patent claims. They used an open-source search engine to retrieve the top 10 training documents most similar to each generated text and modified an existing text alignment algorithm to better detect instances of verbatim, paraphrase and idea plagiarism.
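The retrieve-then-align idea described above can be illustrated with a minimal Python sketch. This is not the researchers' pipeline: it assumes a plain bag-of-words cosine similarity in place of their open-source search engine and Python's standard `difflib` in place of their modified text alignment algorithm, and it only looks for verbatim overlap. The function names, the whitespace tokenizer, and the `min_tokens` threshold are all hypothetical choices made for illustration.

```python
import math
from collections import Counter
from difflib import SequenceMatcher


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple, assumed tokenizer)."""
    return text.lower().split()


def cosine_similarity(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_similar(generated, corpus, k=10):
    """Return indices of the k corpus documents most similar to the generated text."""
    query = Counter(tokenize(generated))
    scores = [
        (cosine_similarity(query, Counter(tokenize(doc))), index)
        for index, doc in enumerate(corpus)
    ]
    # Highest score first; ties are broken by the lower document index.
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [index for _, index in scores[:k]]


def longest_shared_span(generated, document):
    """Return the longest run of consecutive tokens that appears in both texts."""
    gen_tokens, doc_tokens = tokenize(generated), tokenize(document)
    matcher = SequenceMatcher(None, gen_tokens, doc_tokens, autojunk=False)
    match = matcher.find_longest_match(0, len(gen_tokens), 0, len(doc_tokens))
    return gen_tokens[match.a:match.a + match.size]


def flag_verbatim(generated, corpus, k=10, min_tokens=5):
    """List (doc_index, shared_text) pairs whose shared span has at least min_tokens tokens."""
    flagged = []
    for index in top_k_similar(generated, corpus, k):
        span = longest_shared_span(generated, corpus[index])
        if len(span) >= min_tokens:
            flagged.append((index, " ".join(span)))
    return flagged
```

Calling `flag_verbatim(generated_text, training_documents)` returns the candidate documents that share a long copied span with the generated text. Detecting paraphrase or idea plagiarism would need something beyond exact token matching, which this sketch does not attempt.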
The team found that the language models committed all three types of plagiarism and that the larger the training dataset and the more parameters the model had, the more often plagiarism occurred. They also noted that fine-tuning reduced verbatim plagiarism but increased instances of paraphrase and idea plagiarism. In addition, they identified instances of the language model exposing individuals' private information through all three forms of plagiarism. The researchers will present their findings at the 2023 ACM Web Conference, which takes place from April 30 to May 4 in Austin, Texas.