Understanding AIGC and Its Impact on Academic Integrity

Graduation Season

As graduation season approaches, are you still struggling to write your thesis?

However, writing the thesis is only the first step; it must also pass a plagiarism check. Students have long used various methods to lower the similarity rate, such as back-translation, synonym replacement, and rearranging sentence structures.

After successfully lowering the similarity rate, do you think you’re done? Not quite! Some schools have introduced AIGC detection to prevent AI from being used to write theses.

What is AIGC?

You may not have heard of AIGC, but you have almost certainly used it. AIGC stands for “Artificial Intelligence Generated Content,” which refers to the use of AI technology to generate various forms of content, including text, music, images, and videos. So when we use tools like ChatGPT or DeepSeek to generate text, we are using AIGC technology.

AIGC is regarded as a new mode of content production, following “Professionally Generated Content (PGC)” and “User Generated Content (UGC),” in which AI technology generates content automatically. Its emergence marks a new era in AI development. AIGC rests on three key components: data, hardware, and algorithms. High-quality data, such as audio, text, and images, is the cornerstone for training algorithms. The size of the dataset strongly affects the accuracy of the trained model; in general, the more high-quality samples available, the more accurate the model. Training therefore requires hardware capable of processing massive amounts of data and running complex models with millions or even billions of parameters. High-performance chips and cloud computing platforms together provide the necessary computational power.

The performance of algorithms directly determines the quality of generated content. The widespread application of AIGC today is due to advancements in machine learning, deep learning, and Generative Adversarial Networks (GANs). Here are some of the main algorithms used in AIGC:

Based on Generative Adversarial Networks (GAN)

GAN technology enables AI to generate realistic images, audio, and text. A GAN consists of two “competing” neural networks: the generator and the discriminator. The generator takes random noise vectors as input and outputs data that resembles samples from the real data distribution. The discriminator judges the authenticity of the data it sees, trying to distinguish real samples from generated ones. Training is a game between the two: the generator keeps improving so that its output can deceive the discriminator, while the discriminator keeps improving its ability to tell the two apart. As a result, the generator produces increasingly realistic content.
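The game described above is usually written as a minimax objective. The formula below is the standard formulation from the GAN literature rather than something specific to this article, and the symbols are introduced here for illustration: G is the generator, D is the discriminator, D(x) is the discriminator's estimated probability that x is real, p_data is the real data distribution, and p_z is the noise distribution.

```latex
\min_{G} \max_{D} \; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)} \big[ \log D(x) \big]
  + \mathbb{E}_{z \sim p_{z}(z)} \big[ \log \big( 1 - D(G(z)) \big) \big]
```

The discriminator tries to make both terms large (high D(x) on real data, low D(G(z)) on generated data), while the generator tries to make the second term small by pushing D(G(z)) toward 1.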

Based on Autoencoders

An autoencoder is a type of neural network trained with backpropagation to reproduce its input at its output. It consists of an encoder and a decoder. The encoder compresses the input data into a lower-dimensional latent representation, and the decoder reconstructs the original data from that latent representation, which is how the network achieves data reconstruction and generation. Autoencoders are mainly used for two purposes: data denoising and dimensionality reduction for data visualization.
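To make the encoder/decoder idea concrete, here is a minimal sketch of a linear autoencoder trained with hand-written backpropagation in NumPy. It is only an illustration under stated assumptions: the function names, the toy data, the learning rate, and the epoch count are all made up for this example, and real autoencoders normally use nonlinear layers and a deep learning framework.

```python
import numpy as np


def make_toy_data(n_samples=200, seed=1):
    """Toy data (an assumption for this sketch): 4 features driven by 2 hidden factors."""
    rng = np.random.default_rng(seed)
    hidden = rng.normal(size=(n_samples, 2))
    mixing = rng.normal(size=(2, 4))
    data = hidden @ mixing + 0.01 * rng.normal(size=(n_samples, 4))
    # Standardize each feature so a single learning rate behaves reasonably.
    return (data - data.mean(axis=0)) / data.std(axis=0)


def train_linear_autoencoder(data, latent_dim=2, lr=0.05, epochs=2000, seed=0):
    """Train a linear encoder/decoder pair to reconstruct `data`."""
    rng = np.random.default_rng(seed)
    n_features = data.shape[1]
    w_enc = rng.normal(scale=0.1, size=(n_features, latent_dim))  # encoder weights
    w_dec = rng.normal(scale=0.1, size=(latent_dim, n_features))  # decoder weights
    losses = []
    for _ in range(epochs):
        latent = data @ w_enc      # encode: compress to latent_dim values
        recon = latent @ w_dec     # decode: reconstruct the input
        error = recon - data
        losses.append(float(np.mean(error ** 2)))  # mean squared reconstruction error
        # Backpropagation of the mean squared error through both layers.
        grad_recon = 2.0 * error / error.size
        grad_w_dec = latent.T @ grad_recon
        grad_w_enc = data.T @ (grad_recon @ w_dec.T)
        w_enc -= lr * grad_w_enc
        w_dec -= lr * grad_w_dec
    return w_enc, w_dec, losses


if __name__ == "__main__":
    toy = make_toy_data()
    _, _, history = train_linear_autoencoder(toy)
    print("first loss:", history[0], "last loss:", history[-1])
```

The reconstruction error should fall as training proceeds, because the four features in the toy data really depend on only two hidden factors, which is exactly the kind of structure a two-dimensional latent representation can capture.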

Based on Transformers

Transformer models are widely used in natural language processing (NLP) tasks such as text generation and machine translation. In recent years, transformer architectures have also been applied to image generation and other multimodal tasks. Their core is the self-attention mechanism, which captures dependencies between features at any two positions in the input sequence, rather than only local context. This allows transformers to excel at processing long sequences. The original transformer consists of an encoder and a decoder: the encoder converts the input sequence into a hidden representation, and the decoder generates the output sequence from that hidden representation. Many text-generation models, such as the GPT family, use only the decoder part.
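The self-attention mechanism mentioned above can be sketched in a few lines of NumPy. This is single-head scaled dot-product attention only, without masking, multiple heads, or positional encoding; the function names and the matrix shapes in the demo are assumptions chosen for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x has shape (seq_len, d_model); w_q, w_k, w_v project it to
    queries, keys, and values.
    """
    q = x @ w_q
    k = x @ w_k
    v = x @ w_v
    # Every position scores every other position, so distant tokens
    # can influence each other directly.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(5, 8))  # 5 positions, 8-dimensional embeddings
    w_q, w_k, w_v = (rng.normal(size=(8, 4)) for _ in range(3))
    output, attn = self_attention(tokens, w_q, w_k, w_v)
    print(output.shape, attn.shape)
```

Each row of the attention weight matrix says how much one position draws on every other position, which is why the mechanism is not limited to neighboring tokens.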

How is AIGC Detection Done?

Given the powerful capabilities of AIGC, one might wonder if writing papers with it would be effortless. To prevent such academic misconduct, many platforms have begun to offer AI-generated content detection features, and some universities have made AIGC detection results a requirement for thesis approval. How do computers determine whether a text is AI-generated or human-written?

First, it’s essential to understand that no AI detection method can guarantee 100% accuracy in identifying whether a piece of text was written by a machine or a human. Typically, an AIGC score is provided, indicating the likelihood that a segment of text was generated by AI.

Current AIGC Detection Algorithms

The current AIGC detection algorithms can be categorized into three main types:

Trained Classifiers (fine-tuned pre-trained models on human vs. machine text): This method is based on deep learning binary classification models and is the mainstream approach for AIGC detection. It involves collecting a large amount of AI-generated text and human-written text, feeding them into the same model for training. Through continuous optimization and iteration, a classifier is obtained. By inputting a segment of text into the classifier, it can output the probability that the text was generated by AI. Since the detector does not know which AI model was used for generation, this is considered a black-box detection method with performance limited by the coverage of the training data. If the training data encompasses multiple models and domains, the detection accuracy and generalization will be stronger; otherwise, data bias may lead to missed detections or false positives.
Zero-Shot Detectors (self-detection using inherent properties of large language models): As the name suggests, zero-shot detection does not require a large dataset to train a discriminator. Instead, it exploits the inherent differences between AI-generated text and human-written text, so the detector can classify without training. Its advantage is that it needs no additional data collection or model adjustment, which makes it much more adaptable to new data distributions. AI-generated text often differs statistically from human writing in language style, sentence complexity, and repetition rate, and AIGC detection leverages these distinguishing features. AI-generated text tends to have rigidly structured sentences, high local repetition rates, and low information entropy, and it often uses templated expressions such as “In summary” or “Based on the above analysis.”
Watermarking Techniques (embedding traceable identifiers in generated text): We are familiar with watermarking images, but text can also be watermarked. The watermark here is not human-readable; it is a statistical pattern. For example, the frequency distribution of a word’s occurrence in the text can serve as a watermark. However, in practice, watermark algorithm design is more complex. One key challenge is embedding the watermark without distorting the original text’s meaning or readability. Traditional methods, such as synonym replacement, syntactic tree manipulation, and paragraph restructuring, often struggle to maintain semantic integrity while modifying the text. The advent of large language models (LLMs) has changed this situation. Their core advantage lies in achieving a balance between semantic preservation and watermark embedding through deep learning. Depending on the target, watermarking can be categorized into two types: embedding watermarks into existing text and embedding watermarks into large models. Currently, text watermarking technology is widely used in copyright protection, maintaining academic integrity, and detecting fake news.
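As a rough illustration of the statistical signals mentioned for zero-shot detection, the sketch below computes two surface statistics for a piece of English text: word-level Shannon entropy and the share of repeated two-word sequences. This is an assumption-heavy simplification. Practical zero-shot detectors typically rely on token probabilities from a language model rather than raw word counts, and the function names and tokenization rule here are invented for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens (English only)."""
    return re.findall(r"[a-z']+", text.lower())


def word_entropy(tokens):
    """Shannon entropy (in bits) of the word frequency distribution."""
    total = len(tokens)
    if total == 0:
        return 0.0
    counts = Counter(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def repeated_bigram_rate(tokens):
    """Fraction of adjacent word pairs that occur more than once in the text."""
    bigrams = list(zip(tokens, tokens[1:]))
    if not bigrams:
        return 0.0
    counts = Counter(bigrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(bigrams)


if __name__ == "__main__":
    sample = "In summary, the method works. In summary, the method is simple."
    words = tokenize(sample)
    print("entropy (bits):", word_entropy(words))
    print("repeated bigram rate:", repeated_bigram_rate(words))
```

Lower entropy and a higher repetition rate point in the direction the text describes for AI-generated writing, but numbers like these are only weak hints and cannot classify a text on their own.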
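To show what a watermark as a statistical pattern can look like, here is a toy sketch loosely modeled on the "green list" family of schemes, in which a hash of the previous token pseudo-randomly marks part of the vocabulary as preferred, and detection counts how often the preferred words appear. This is not the method of any particular product mentioned in this article. The hashing rule, the function names, the made-up vocabulary, and the 0.5 green fraction are all assumptions for illustration.

```python
import hashlib
import math


def is_green(prev_token, token, green_fraction=0.5):
    """Pseudo-randomly decide whether `token` is on the green list after `prev_token`."""
    digest = hashlib.sha256(f"{prev_token}|{token}".encode("utf-8")).digest()
    return digest[0] / 256.0 < green_fraction


def pick_watermarked(prev_token, candidates, green_fraction=0.5):
    """Embedding step: among interchangeable candidates, prefer a green one."""
    for candidate in candidates:
        if is_green(prev_token, candidate, green_fraction):
            return candidate
    return candidates[0]  # no green candidate available; fall back


def watermark_z_score(tokens, green_fraction=0.5):
    """Detection step: z-score of the green-token count against chance."""
    n = len(tokens) - 1  # number of (previous, current) pairs
    if n <= 0:
        return 0.0
    green = sum(is_green(p, t, green_fraction) for p, t in zip(tokens, tokens[1:]))
    expected = green_fraction * n
    std = math.sqrt(n * green_fraction * (1.0 - green_fraction))
    return (green - expected) / std


if __name__ == "__main__":
    vocabulary = [f"w{i}" for i in range(20)]  # hypothetical stand-in vocabulary
    marked = ["start"]
    for _ in range(50):
        marked.append(pick_watermarked(marked[-1], vocabulary))
    print("z-score of watermarked sequence:", watermark_z_score(marked))
```

Text written without knowledge of the hashing rule should land near a z-score of zero, while watermarked text scores far above it. The pattern is invisible to a reader because each individual word choice still looks natural.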

Is AIGC Detection Reliable?

As AI develops, workers in many fields have begun to use it as a work assistant, and the use of AI tools by students in thesis writing has become a pressing issue for universities. Consequently, many institutions have introduced AIGC assessment standards for graduation theses. Many well-known plagiarism-detection services, such as CNKI, Wanfang, Weipu, and Turnitin, have launched AIGC detection features.

Is AIGC detection truly reliable? Some students have reported that theses they wrote entirely by themselves were given an AI-generated rate as high as 60%, forcing them to rewrite logically coherent sentences into awkward ones to meet graduation requirements. Some even tested classic works such as Zhu Ziqing’s “Moonlight Over the Lotus Pond” and Liu Cixin’s “The Wandering Earth,” and found that the AIGC detector reported AI-generated likelihoods of 62.88% and 52.88%, respectively. Such results have led to widespread concern among students about having their own writing misjudged as AI-generated. On social media, the topic of “ridiculously high AI rates in theses” has become a hot discussion.

As mentioned earlier, no AIGC detection can guarantee 100% accuracy in distinguishing machine-written from human-written text. If your thesis contains many standard expressions or your writing style closely resembles AI patterns, it may be falsely flagged. Conversely, if AI-generated text is cleverly polished, it may lead to missed detections. Here are some tips to reduce AI detection rates; please ensure compliance with the “Degree Law” to maintain the authenticity of data, charts, and text in your thesis.

Translation Method

Simply put, translate your written text into another language and then back to the original language. If the results are unsatisfactory, you can increase the number of intermediate translations. After several translation conversions, the AIGC detection rate can be significantly lowered.

Change Sentence Structures

AI-generated content often exhibits certain similar characteristics in sentence structure. You will notice that AI tends to use words like “regardless,” “along with,” “in addition,” “in summary,” and “at the same time,” and prefers a format of numbered points + title + colon + response, with similar lengths for each short sentence and paragraph. To reduce AI detection, avoid using common vocabulary and sentence patterns from AI models, merge unnecessary short sentences and paragraphs, or use inverted sentences, questions, or colloquial expressions to enhance variability.

Enrich Text Content

AI-written theses often appear reasonable but lack substantial content and concrete examples. Therefore, to lower AI detection rates, include valuable insights and examples to ensure the paper does not resemble AI-generated text.

Use AI to Reduce AI

An AI model may well capture the patterns that AI detectors look for better than a human can, so “using magic to defeat magic” is an interesting approach. As for how effective it is to use AI to reduce AI detection rates, I have not tried it, but interested students can give it a shot.

Conclusion

The application of AI-generated content detection technology in thesis review remains a contentious issue. The original intention of this technology is to uphold academic integrity, but its practical effectiveness is often unsatisfactory. The fundamental goal of education is to cultivate students’ innovative thinking and problem-solving abilities, rather than overly focusing on the means of tool usage. In the rapidly evolving landscape of artificial intelligence, we need to establish mechanisms to prevent the misuse of technology while fundamentally reconstructing the educational evaluation system to create a diversified assessment standard that genuinely reflects students’ academic qualities.