Enhancing Authorship Attribution through LLM-Based Textual Mimicry

This project investigates the ability of large language models (LLMs) to generate textual forgeries capable of deceiving a robust authorship classifier. By employing reinforcement learning techniques, specifically Proximal Policy Optimization (PPO), we iteratively enhance the LLM’s ability to mimic an author’s writing style while concurrently testing and analyzing the classifier’s resilience. This study uses texts from Jane Austen’s novels and synthetic samples generated by Llama3-8b as part of a larger effort to explore the boundaries of LLM-generated text for authorship attribution challenges.

Goals

The primary objective of this experiment is to enhance an LLM's capability to generate deceptive textual forgeries that convincingly mimic the writing style of a given author. Throughout the process, we aim to evaluate the classifier’s performance across multiple fine-tuning iterations to determine whether its accuracy declines as the LLM’s mimicry capabilities improve. By blending LLM-based text generation with reinforcement learning, this project provides deeper insights into both the vulnerabilities of authorship attribution models and the potential of LLMs to refine their stylistic imitation through systematic feedback loops.

Phase 1: Training and Evaluation

Corpus Preparation

To build the dataset, the project collects the full texts of six Jane Austen novels: Emma, Mansfield Park, Northanger Abbey, Persuasion, Sense and Sensibility, and Pride and Prejudice. Each text is segmented into two-sentence passages, and each passage serves as a prompt for the LLM. Short passages keep the samples uniform in size and easy to inspect. Each passage is stored with its metadata: the author’s name, the title of the work, and the location of the passage within the novel. This makes it possible to trace every generated output back to the prompt that produced it.
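The segmentation step can be sketched as follows. This is a minimal illustration, not the project's actual code: the Passage record, the segment_passages function, and the regex-based sentence splitter are all assumptions, and the sample text is invented rather than quoted from a novel. A naive splitter like this one will break on abbreviations such as "Mr." and "Mrs.", so a real corpus would likely need a proper sentence tokenizer.

```python
import re
from dataclasses import dataclass


@dataclass
class Passage:
    author: str
    title: str
    index: int  # position of the passage within the novel (0-based)
    text: str


# Naive rule (an assumption): a sentence ends at '.', '!' or '?' followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def segment_passages(text, author, title, sentences_per_passage=2):
    """Split a text into consecutive passages of a fixed number of sentences."""
    sentences = [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]
    passages = []
    # Step through the sentences in fixed-size groups; an incomplete final group is dropped.
    last_start = len(sentences) - sentences_per_passage + 1
    for start in range(0, last_start, sentences_per_passage):
        chunk = " ".join(sentences[start:start + sentences_per_passage])
        passages.append(Passage(author, title, len(passages), chunk))
    return passages


# Invented sample text, used only to exercise the function.
sample = (
    "Emma Woodhouse was handsome. She was clever and rich. "
    "She had a happy disposition! Did anything vex her? Very little."
)
passages = segment_passages(sample, "Jane Austen", "Emma")
for p in passages:
    print(p.index, p.title, "|", p.text)
```

The five sample sentences yield two passages, and the unpaired fifth sentence is discarded so that every prompt has the same length in sentences.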

Data Splitting

For training and evaluation purposes, the corpus is divided into two sets: four novels are used for training, and two novels are reserved for evaluation. The Llama3-8b model is prompted with these text samples to generate stylistic continuations that attempt to mimic Austen’s distinct writing style. The generated texts are labeled as "mimic," while the original Austen texts are marked as "author." This labeled dataset is crucial for training the classifier to distinguish between authentic and generated text.
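A small sketch of the split and labeling logic is shown below. The description above does not say which two novels are held out, so the EVAL_TITLES set is a placeholder assumption, as are the function names and the dictionary layout. The passage and continuation strings are stand-ins, not real text.

```python
# Assumption: which two novels are reserved for evaluation is not stated in the text.
EVAL_TITLES = {"Persuasion", "Northanger Abbey"}


def split_by_novel(passages, eval_titles=EVAL_TITLES):
    """Split at the novel level so no novel contributes to both sets."""
    train = [p for p in passages if p["title"] not in eval_titles]
    evaluation = [p for p in passages if p["title"] in eval_titles]
    return train, evaluation


def label_pair(passage, continuation):
    """Return one 'author' row and one 'mimic' row for a single prompt."""
    return [
        {"title": passage["title"], "text": passage["text"], "label": "author"},
        {"title": passage["title"], "text": continuation, "label": "mimic"},
    ]


# Placeholder records standing in for real passages and model output.
passages = [
    {"title": "Emma", "text": "<passage from Emma>"},
    {"title": "Persuasion", "text": "<passage from Persuasion>"},
    {"title": "Mansfield Park", "text": "<passage from Mansfield Park>"},
]
train, evaluation = split_by_novel(passages)
rows = label_pair(train[0], "<Llama3-8b continuation>")
print(len(train), len(evaluation), [r["label"] for r in rows])
```

Splitting by novel rather than by passage keeps the evaluation set free of text from any book the classifier has already seen.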

Classifier Training

A logistic regression classifier is employed to differentiate between authentic Austen passages and LLM-generated forgeries. The classifier’s performance is first evaluated on a held-out portion of the training data, where prediction probabilities are calculated for both authentic and mimic texts. This initial baseline performance allows us to understand the classifier’s ability to identify genuine texts and serves as a reference point for evaluating the classifier’s robustness throughout subsequent phases of fine-tuning.
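The baseline step might look roughly like the sketch below, which uses scikit-learn. The character n-gram TF-IDF features are a stand-in chosen for brevity and are an assumption, as are the toy texts and labels, which are invented and far too few to train a meaningful model. The point is only to show how per-class probabilities are read from the fitted classifier.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Invented toy data; a real run would use the labeled passages from the corpus.
train_texts = [
    "She could not but feel the impropriety of such a visit.",
    "He was obliged to own that her manners were very pleasing.",
    "The evening passed in a tolerable degree of comfort to them all.",
    "She totally felt weird about the visit and left early.",
    "He admitted that she was really nice and fun to talk to.",
    "The evening was fine and everyone had a pretty good time.",
]
train_labels = ["author", "author", "author", "mimic", "mimic", "mimic"]

classifier = Pipeline([
    # Assumption: character n-grams stand in for the project's own features.
    ("features", TfidfVectorizer(analyzer="char", ngram_range=(1, 3))),
    ("model", LogisticRegression(max_iter=1000)),
])
classifier.fit(train_texts, train_labels)

held_out = [
    "She was sensible of the kindness intended by the invitation.",
    "She thought the invite was a nice gesture, honestly.",
]
# predict_proba columns follow the order of classes_, so look up the 'author' column.
classes = list(classifier.named_steps["model"].classes_)
p_author = classifier.predict_proba(held_out)[:, classes.index("author")]
for text, p in zip(held_out, p_author):
    print(f"{p:.3f}  {text}")
```

The probability assigned to the "author" class is the quantity reused later as feedback for fine-tuning.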

Phase 2: PPO Reinforcement Learning (Repeatable)

Classifier-Guided Fine-Tuning

To enhance the LLM’s ability to deceive the classifier, we introduce reinforcement learning. The classifier’s prediction probabilities are used to rank generated texts: mimic texts that the classifier scores as more likely to be authentic are treated as preferred responses, while those with lower probabilities are treated as non-preferred. This feedback is then used to fine-tune the Llama3-8b model with Direct Preference Optimization (DPO) and Proximal Policy Optimization (PPO) through Hugging Face’s trl library. The fine-tuned model is expected to generate increasingly deceptive texts with each iteration, mimicking the target author more effectively.
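The ranking step can be illustrated with a short sketch that turns classifier probabilities into preference pairs. The function name, the min_margin threshold, and the input layout are assumptions. The prompt/chosen/rejected keys follow what we understand to be the usual layout for preference datasets in trl, but the sketch does not call trl itself, and the probability values are invented.

```python
from collections import defaultdict


def build_preference_pairs(samples, min_margin=0.1):
    """Pair the most and least 'authentic-looking' completion for each prompt.

    samples: dicts with 'prompt', 'completion', and 'p_author' (the classifier's
    probability that the completion is authentic).
    """
    by_prompt = defaultdict(list)
    for sample in samples:
        by_prompt[sample["prompt"]].append(sample)

    pairs = []
    for prompt, group in by_prompt.items():
        if len(group) < 2:
            continue  # a pair needs at least two completions
        ranked = sorted(group, key=lambda s: s["p_author"], reverse=True)
        best, worst = ranked[0], ranked[-1]
        # Skip near-ties: they carry little preference signal (threshold is an assumption).
        if best["p_author"] - worst["p_author"] >= min_margin:
            pairs.append({
                "prompt": prompt,
                "chosen": best["completion"],
                "rejected": worst["completion"],
            })
    return pairs


# Invented scores for illustration only.
samples = [
    {"prompt": "P1", "completion": "A", "p_author": 0.82},
    {"prompt": "P1", "completion": "B", "p_author": 0.31},
    {"prompt": "P1", "completion": "C", "p_author": 0.55},
    {"prompt": "P2", "completion": "D", "p_author": 0.50},
    {"prompt": "P2", "completion": "E", "p_author": 0.45},
    {"prompt": "P3", "completion": "F", "p_author": 0.90},
]
pairs = build_preference_pairs(samples)
print(pairs)
```

Here only prompt P1 produces a pair: P2's completions are too close in score, and P3 has a single completion.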

Iterative Experimentation

After each fine-tuning round, the updated model generates new mimic texts for both the training and evaluation datasets. These newly generated texts are added back into the dataset, and the classifier is retrained to assess whether its performance has degraded. We track the classifier’s accuracy across iterations to see whether the LLM’s improving mimicry causes that accuracy to decline. In this way the experiment examines the adversarial interaction between the generator (the LLM) and the discriminator (the classifier).
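Tracking accuracy across rounds needs very little machinery, as the sketch below suggests. The function names are assumptions, and the labels and predictions are made-up toy values chosen to show the bookkeeping; they are not results from this project.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label lists must be non-empty and equal in length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def accuracy_changes(history):
    """Round-to-round change in accuracy; negative means the classifier got worse."""
    return [later - earlier for earlier, later in zip(history, history[1:])]


# Toy values for illustration only (not experimental results).
y_true = ["author", "mimic", "author", "mimic"]
round_predictions = [
    ["author", "mimic", "author", "mimic"],    # round 0: 4 of 4 correct
    ["author", "author", "author", "mimic"],   # round 1: 3 of 4 correct
    ["author", "author", "author", "author"],  # round 2: 2 of 4 correct
]
history = [accuracy(y_true, preds) for preds in round_predictions]
changes = accuracy_changes(history)
print(history, changes)
```

On a balanced two-class evaluation set, accuracy near 0.5 would mean the classifier can no longer separate mimic texts from authentic ones.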

Phase 3: Iterative Classifier Improvement

In addition to testing the classifier’s ability to detect LLM-generated forgeries, this phase explores methods for improving classifier robustness over time. We iteratively retrain the classifier on the most deceptive samples from previous iterations, that is, the mimic texts it was most likely to accept as authentic, to improve its capacity to distinguish real from generated texts. This creates an adversarial loop similar to a Generative Adversarial Network (GAN): the LLM continuously refines its imitation of the target author while the classifier adapts to become better at separating authentic texts from forgeries.
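One plausible way to pick the samples for retraining is to take the mimic texts the classifier scored as most authentic and append them to the training rows. The sketch below assumes that interpretation; the function names, the top_k cutoff, and the scores are all hypothetical.

```python
def select_hard_mimics(mimic_samples, top_k=2):
    """Return the top_k mimic texts the classifier found most authentic."""
    ranked = sorted(mimic_samples, key=lambda s: s["p_author"], reverse=True)
    return [{"text": s["text"], "label": "mimic"} for s in ranked[:top_k]]


def augment_training_set(train_rows, mimic_samples, top_k=2):
    """Append hard mimic samples, skipping texts already in the training set."""
    seen = {row["text"] for row in train_rows}
    new_rows = [
        row for row in select_hard_mimics(mimic_samples, top_k)
        if row["text"] not in seen
    ]
    return train_rows + new_rows


# Invented scores for illustration only.
train_rows = [{"text": "m2", "label": "mimic"}]
mimic_samples = [
    {"text": "m1", "p_author": 0.20},
    {"text": "m2", "p_author": 0.95},
    {"text": "m3", "p_author": 0.70},
]
augmented = augment_training_set(train_rows, mimic_samples)
print(augmented)
```

The retrained classifier then sees the forgeries that fooled it, still labeled "mimic", which is the counterpart of the generator's update in the adversarial loop.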

Results

Pending

Code

Link to GitHub Repository

This project combines linguistic feature engineering with large language models and reinforcement learning to tackle the problem of authorship attribution. Inspired by the real-world challenge of identifying the authors of *"Familiarity is the Kingdom of the Lost"* by Dugmore Boetie and Barney Simon, the authorship attribution pipeline pairs traditional classification techniques with generative modeling and feature interpretation. At the core of the system is a dynamic model loader that integrates with a variety of LLMs hosted on Hugging Face, including LLaMA v2/v3, Google’s Gemma, GPT-2, and DeepSeek. The loader handles model loading and tokenization and exposes a single text generation pipeline for downstream tasks.
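A loader of this kind could be sketched along the following lines. This is not the repository's implementation: the alias table, resolve_model_id, and load_generator are hypothetical names, and the registry lists only a few example Hub ids. Any other model, such as a DeepSeek checkpoint, would be passed by its full Hub id. Some of these models are gated and require access approval before they can be downloaded.

```python
# Hypothetical short aliases mapped to Hugging Face Hub model ids.
MODEL_REGISTRY = {
    "llama2": "meta-llama/Llama-2-7b-hf",
    "llama3": "meta-llama/Meta-Llama-3-8B",
    "gemma": "google/gemma-2b",
    "gpt2": "gpt2",
}


def resolve_model_id(name):
    """Return the Hub id for a known alias, or pass a full id through unchanged."""
    return MODEL_REGISTRY.get(name.lower(), name)


def load_generator(name, max_new_tokens=60):
    """Load tokenizer and model, and return a prompt -> continuation function."""
    # Imported lazily so the registry can be used without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = resolve_model_id(name)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    def generate(prompt):
        inputs = tokenizer(prompt, return_tensors="pt")
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
        # Drop the prompt tokens so only the continuation is returned.
        new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
        return tokenizer.decode(new_tokens, skip_special_tokens=True)

    return generate


print(resolve_model_id("llama3"))
```

Calling load_generator("gpt2") would download the model on first use and return a function that maps a two-sentence prompt to its generated continuation.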

For authorship classification, the project uses a custom PipelineWrapper built on scikit-learn’s Pipeline and LogisticRegression. The wrapper encodes input text samples into linguistic and statistical features using a dedicated FeatureEncoder, which extracts metrics such as punctuation frequency, lexical richness, and syntactic structures. The classifier outputs authorship predictions alongside the top contributing features for each decision, so each classification can be traced back to the linguistic cues that influenced it.
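To make the idea concrete, the sketch below computes a few features of the kinds named above and ranks them by their contribution to a linear decision. It is not the project's FeatureEncoder: the feature names, the use of average sentence length as a crude syntactic proxy, and the weights are all assumptions. For a linear model such as logistic regression, a feature's contribution to the decision score is its value multiplied by its learned coefficient.

```python
import re
import string


def encode_features(text):
    """Compute a small dictionary of stylometric features for one text."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_chars = max(len(text), 1)
    n_words = max(len(words), 1)
    return {
        "punct_per_char": sum(ch in string.punctuation for ch in text) / n_chars,
        "comma_per_word": text.count(",") / n_words,
        "type_token_ratio": len(set(words)) / n_words,  # lexical richness
        "avg_sentence_len": len(words) / max(len(sentences), 1),  # syntactic proxy
    }


def top_contributions(features, weights, k=2):
    """Rank features by |value * weight|, their share of a linear decision score."""
    contributions = {
        name: value * weights.get(name, 0.0) for name, value in features.items()
    }
    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return ranked[:k]


# Invented text and hypothetical coefficients, for illustration only.
features = encode_features("It is a truth, it is a fact. Is it?")
weights = {"type_token_ratio": 2.0, "avg_sentence_len": -0.1, "comma_per_word": 1.0}
top = top_contributions(features, weights)
print(features)
print(top)
```

With these hypothetical weights, lexical richness contributes most to the score, followed by average sentence length with a negative sign.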

In addition to predictive modeling, the project introduces a fine-tuning module that incorporates reinforcement learning with the classifier's feedback to further refine the LLM’s ability to mimic an author’s writing style. Using PPO (Proximal Policy Optimization) through the trl library, the model is trained to produce increasingly authentic writing samples, guided by feedback from the classifier. This creates a feedback loop where the classifier’s predictions directly inform the LLM’s training, enhancing its ability to generate text that mirrors the target author’s voice.
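For PPO, the classifier's feedback has to be reduced to a scalar reward per generated sample. One option, shown here as an assumption rather than the project's actual reward, is the clipped log-odds of the classifier's "authentic" probability. The function name, the eps guard, and the clip value are hypothetical, and the sketch deliberately leaves out the trl training loop.

```python
import math


def classifier_reward(p_author, eps=1e-6, clip=5.0):
    """Map the classifier's 'authentic' probability to a bounded scalar reward.

    Log-odds are zero at p = 0.5, positive when the classifier is fooled,
    and negative when the forgery is detected.
    """
    p = min(max(p_author, eps), 1.0 - eps)  # avoid log(0) and division by zero
    logit = math.log(p / (1.0 - p))
    return max(-clip, min(clip, logit))  # clipping keeps extreme scores from dominating


for p in (0.0, 0.3, 0.5, 0.9, 1.0):
    print(p, round(classifier_reward(p), 3))
```

Using the raw probability as the reward would also work; log-odds simply spread out the scores near 0 and 1, where probabilities saturate.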

The pipeline is built on PyTorch, Hugging Face Transformers, scikit-learn, pandas, and the trl RLHF framework, with GPU acceleration handled through Hugging Face’s accelerate library.

Future Directions

This research aims to show how reinforcement learning can improve the mimicry capabilities of LLMs and, in doing so, expose potential vulnerabilities in current authorship attribution models. Future work will explore more sophisticated transformer-based classifiers, refine feature selection for improved accuracy, and extend the methodology to other domains of text generation, such as academic writing or poetry. The insights gained from this work could be valuable for advancing authorship detection, improving LLM customization, and expanding the applications of reinforcement learning in text generation and analysis.