Small Corpus Sarcasm Detection
Overview
Sarcasm detection poses a unique challenge in sentiment analysis, as even humans sometimes struggle to recognize sarcastic tones. This project aims to address that difficulty by constructing a specialized dataset tailored for sarcasm detection. By compiling data from multiple existing English corpora, including a corpus explicitly labeled for sarcasm, we seek to enhance model performance in recognizing sarcastic expressions.
With the dataset assembled, we compare logistic regression's ability to classify sarcasm both with and without this dataset, evaluating the impact of high-quality training data on model accuracy.
Novelty of the Project
This project introduces several novel contributions:
- A newly created sarcasm detection dataset sourced from various internet platforms, including blogs, Reddit, and other popular websites. This ensures the dataset remains timely and reflective of contemporary sarcastic expressions.
- The dataset includes sarcastic phrases commonly found in social media and online forums, making it a valuable resource for training machine-learning models.
- A comparative analysis of logistic regression performance with and without the new dataset demonstrates how targeted data collection improves sarcasm detection in sentiment analysis.
Motivation
Sarcasm is notoriously difficult to detect, as it often depends on contextual cues that are not explicitly stated in text. Our project seeks to improve sarcasm detection by investigating which types of text and features enhance sentiment analysis models. By leveraging internet-based corpora, we aim to identify patterns that improve sarcasm classification.
Challenges
- Limited Availability of Up-to-Date Corpora: Many large-scale corpora (e.g., Brown Corpus) are outdated and do not reflect modern sarcasm usage. Supplementing our dataset with web-scraped data was necessary to ensure relevance.
- Choosing an Optimal Model for Sarcasm Detection: While some models perform well in sentiment analysis, it is unclear whether they effectively detect sarcasm. We evaluate whether a traditionally weaker model in sentiment analysis, such as logistic regression, improves significantly with better training data.
- Feature Selection: Sarcasm is not always linked to specific syntactic structures, vocabulary, or other explicit linguistic markers. Context plays a crucial role, raising the question of how best to incorporate contextual features into our models.
Results
| Model | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| Logistic Regression | 63.16% | 61.06% | 65.80% | 63.34% |
| Logistic Regression + BERTVectorizer | 62.41% | 61.38% | 60.10% | 60.73% |
| Pretrained BERT Model | 56.64% | 62.72% | 32.87% | 43.13% |
Without the BERTVectorizer, logistic regression achieved an accuracy of 63.16%, a precision of 61.06%, a recall of 65.80%, and an F1 score of 63.34%. Surprisingly, adding BERT embeddings slightly reduced accuracy, recall, and F1 score, while precision rose only marginally (from 61.06% to 61.38%). This suggests redundancy or interference with existing feature representations.
The pretrained BERT model performed significantly worse, with an accuracy of 56.64% and an F1 score of 43.13%. This result suggests that general-purpose pretrained models may struggle with sarcasm detection unless fine-tuned on domain-specific data. The low recall (32.87%) indicates that BERT frequently failed to detect sarcastic expressions, likely due to the model's lack of conversational and contextual understanding.
Our dataset was balanced (50% sarcastic, 50% non-sarcastic), ensuring that our results were not skewed by class imbalances. However, the challenges encountered highlight the need for dataset-specific adaptations in sarcasm detection tasks.
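As a quick consistency check on the table, F1 is the harmonic mean of precision and recall, so each reported F1 score can be recomputed from the other two columns. The sketch below does exactly that; the helper name `f1_from_pr` is our own illustration and is not part of the project code.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in the same units)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported F1 %) copied from the results table
reported = {
    "Logistic Regression": (61.06, 65.80, 63.34),
    "Logistic Regression + BERTVectorizer": (61.38, 60.10, 60.73),
    "Pretrained BERT Model": (62.72, 32.87, 43.13),
}

for model, (p, r, f1) in reported.items():
    print(f"{model}: recomputed F1 = {f1_from_pr(p, r):.2f}% (reported {f1:.2f}%)")
```

All three recomputed values agree with the reported F1 scores to two decimal places.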
Error Analysis
An in-depth error analysis on the test set reveals why certain models struggled. The pretrained BERT model exhibited a high false negative rate, often failing to detect sarcasm when external knowledge or tone was required. It also produced false positives when exaggerated language or hyperbole led to misclassification.
Similarly, the feature-union-based logistic regression model did not significantly reduce these errors. While traditional TF-IDF and POS-based features contributed valuable signals, their integration with BERT embeddings may have diluted their effectiveness, leading to minor performance degradation.
One interesting finding is that certain words or phrases in the dataset, such as "Actually", appeared frequently in both sarcastic and non-sarcastic contexts. While 24 sarcastic sentences began with "Actually," 27 non-sarcastic ones began the same way. This overlap may have confused the models, highlighting the difficulty of sarcasm detection without deeper contextual cues.
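A count like the one above can be reproduced with a few lines of standard-library Python. The following is a minimal sketch, not the project's actual analysis code: the function names and the toy sentences are made up for illustration, and labels are assumed to be 1 for sarcastic and 0 for non-sarcastic.

```python
from collections import Counter


def first_word(sentence: str) -> str:
    """Return the first whitespace-separated token, lowercased, without edge punctuation."""
    tokens = sentence.split()
    return tokens[0].strip(".,!?\"'").lower() if tokens else ""


def opener_counts(sentences, labels):
    """Count (first word, label) pairs; label 1 = sarcastic, 0 = non-sarcastic."""
    return Counter((first_word(s), y) for s, y in zip(sentences, labels))


# Toy, made-up examples purely for illustration (not from the project dataset)
sentences = [
    "Actually, that was the best meeting ever.",
    "Actually, the store closes at nine.",
    "Oh great, another Monday.",
]
labels = [1, 0, 1]

counts = opener_counts(sentences, labels)
print(counts[("actually", 1)], counts[("actually", 0)])
```

Openers whose counts are close to equal across the two labels, as with "Actually" in our data, carry little signal on their own.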
Future Improvements
To enhance sarcasm detection, future research should focus on:
- Incorporating contextual embeddings to better capture the meaning behind sarcastic statements.
- Exploring handcrafted features designed specifically for sarcasm detection.
- Expanding the dataset with more diverse examples and longer conversational contexts to improve model generalization.
Code with Description
Grid Search for Model Optimization
grid_searcher performs hyperparameter tuning for a sarcasm detection model. The function accepts training and testing data, as well as an output CSV filename. It first initializes a SarcasmLogisticRegression model, assuming this class contains a scikit-learn compatible model. The logistic regression model is then fine-tuned using the GridSearchCV method from scikit-learn. The param_grid_logisticRegression defines a range of hyperparameters to explore, such as the regularization strength (C), penalty types (l1, l2, etc.), solvers (liblinear, saga, etc.), maximum iterations, class weights, and the elastic net mixing ratio (l1_ratio). The grid search uses cross-validation (cv=5) and optimizes for the F1 score (scoring='f1'). Once the grid search is completed, it extracts the best parameters and model, saving the grid search results to a CSV file for further analysis. Finally, it evaluates the best model's performance on the test set, printing accuracy and F1 score.
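The flow described above might look roughly like the sketch below. It is not the project's grid_searcher: a plain scikit-learn `Pipeline` with `TfidfVectorizer` and `LogisticRegression` stands in for SarcasmLogisticRegression, the function name `grid_search_sketch` and the hyperparameter values are assumptions, and the grid is written as a list of dictionaries so that only compatible solver and penalty combinations are tried.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


def grid_search_sketch(X_train, y_train, X_test, y_test, output_csv):
    """Tune a TF-IDF + logistic regression pipeline and report test metrics."""
    # Stand-in for SarcasmLogisticRegression (assumption, not the project class).
    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    # Illustrative values only; each dict keeps solver and penalty compatible.
    param_grid = [
        {"clf__solver": ["liblinear"], "clf__penalty": ["l1", "l2"],
         "clf__C": [0.1, 1.0, 10.0], "clf__class_weight": [None, "balanced"]},
        {"clf__solver": ["saga"], "clf__penalty": ["elasticnet"],
         "clf__l1_ratio": [0.2, 0.8], "clf__C": [0.1, 1.0, 10.0]},
    ]
    search = GridSearchCV(pipeline, param_grid, cv=5, scoring="f1")
    search.fit(X_train, y_train)

    # Keep the full cross-validation table for later analysis.
    pd.DataFrame(search.cv_results_).to_csv(output_csv, index=False)

    # The best estimator is refit on the whole training set by default.
    y_pred = search.best_estimator_.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    print(f"accuracy={accuracy:.4f} f1={f1:.4f}")
    return search.best_params_, accuracy, f1
```

Splitting the grid by solver avoids wasted fits on invalid combinations, such as liblinear with an elastic net penalty, which scikit-learn rejects.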
Fetching Data from Reddit
reddit_scrapper.py is responsible for fetching data from Reddit using the PRAW (Python Reddit API Wrapper) library. After setting up Reddit API credentials, the script fetches a batch of posts from Reddit's hot front page, specifying a limit of 50 posts. For each post, it retrieves all comments by replacing the "more comments" placeholders and extracts the comment text. It then creates a dictionary containing the selftext of the post and the list of comments. This data is converted into a pandas DataFrame, which is subsequently saved as a CSV file (reddit_all_subreddits_posts.csv) for later use.
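The steps above can be sketched as follows. This is an illustrative outline rather than the contents of reddit_scrapper.py: the function names `collect_rows` and `fetch_hot_posts` are hypothetical, the credentials are supplied by the caller, and the PRAW and pandas imports are deferred so the row-building logic can be exercised without network access.

```python
def collect_rows(submissions):
    """Turn an iterable of PRAW-like submissions into plain dict rows."""
    rows = []
    for submission in submissions:
        # Expand every "more comments" placeholder; limit=None replaces all of them.
        submission.comments.replace_more(limit=None)
        comments = [comment.body for comment in submission.comments.list()]
        rows.append({"selftext": submission.selftext, "comments": comments})
    return rows


def fetch_hot_posts(client_id, client_secret, user_agent, limit=50,
                    output_csv="reddit_all_subreddits_posts.csv"):
    """Fetch hot front-page posts and save them to CSV (needs network and credentials)."""
    # Imported lazily so collect_rows can be used without these packages installed.
    import pandas as pd
    import praw

    reddit = praw.Reddit(client_id=client_id, client_secret=client_secret,
                         user_agent=user_agent)
    rows = collect_rows(reddit.front.hot(limit=limit))
    pd.DataFrame(rows).to_csv(output_csv, index=False)
    return rows
```

Note that replacing all "more comments" placeholders can trigger many API requests on large threads, so a smaller `limit` for `replace_more` may be preferable when rate limits matter.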
Sarcasm Detection with Transformers
The SarcasmDetector class predicts sarcasm using a pre-trained BERT model, and is initialized with jkhan447/sarcasm-detection-Bert-base-uncased-newdata. Its predict_sarcasm method takes input text and predicts whether the text is sarcastic (label 1) or not (label 0), using the tokenizer and model from the Hugging Face Transformers library. The method also computes the top-k tokens from the BERT model's hidden states that contribute most to the classification decision; these tokens and their associated coefficients are returned along with the predicted label. A second method, process_csv, processes a CSV file containing text, predicts sarcasm for each row, and saves the results (prediction labels, top-k tokens, and coefficients) to a new CSV file. If ground truth labels are available in the input file, it also evaluates the model's performance using accuracy, precision, recall, and F1 score. Lastly, the get_predictions method retrieves only the predictions from the CSV file.
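A rough sketch of the prediction step is shown below. It is not the SarcasmDetector implementation: the function names are hypothetical, and the token scoring (the L2 norm of each token's final hidden state) is an assumption chosen only to illustrate a top-k ranking, since the exact scoring used in the project is not spelled out here. The heavy imports are deferred because they require the torch and transformers packages plus a model download.

```python
def top_k_tokens(tokens, scores, k=5):
    """Return the k (token, score) pairs with the highest scores, best first."""
    ranked = sorted(zip(tokens, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


def predict_sarcasm_sketch(
    text,
    model_name="jkhan447/sarcasm-detection-Bert-base-uncased-newdata",
    k=5,
):
    """Predict a label and rank tokens by an ASSUMED saliency score."""
    # Lazy imports: these packages and the model weights are needed only here.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()

    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)

    label = int(outputs.logits.argmax(dim=-1).item())  # 1 = sarcastic, 0 = not
    # Assumption: score each token by the L2 norm of its final hidden state.
    scores = outputs.hidden_states[-1][0].norm(dim=-1).tolist()
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return label, top_k_tokens(tokens, scores, k)
```

In practice the tokenizer and model would be loaded once, for example in a class constructor, rather than on every call as in this compact sketch.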
Text Preprocessing and Exploratory Analysis
sentiment_analysis.py focuses on text preprocessing and exploratory data analysis (EDA). It uses spaCy and NLTK to clean and tokenize text data. The preprocess_text function takes raw text, converts it to lowercase, removes punctuation, and lemmatizes the tokens while filtering out stopwords. It then loads two datasets, reddit_comments.csv and reddit_posts.csv, and applies the preprocessing function to clean the text in both datasets. The cleaned data is saved back into CSV files (reddit_comments_cleaned.csv and reddit_posts_cleaned.csv). The script then performs some basic EDA, such as identifying the most common words in the combined datasets. It uses the Counter class to count word frequencies and plots a bar chart of the top 20 most frequent words.
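The preprocessing and word-frequency steps can be illustrated with a standard-library stand-in. This sketch deliberately does not use spaCy or NLTK: the stopword set is a tiny made-up list, the default lemmatizer is the identity function (a real lemmatizer can be passed in), and the names `preprocess_text_sketch` and `most_common_words` are ours.

```python
import string
from collections import Counter

# Tiny illustrative stopword list; the project relies on spaCy/NLTK resources instead.
STOPWORDS = {"a", "an", "the", "is", "was", "it", "to", "of", "and", "i", "that"}


def preprocess_text_sketch(text, lemmatize=lambda token: token):
    """Lowercase, strip punctuation, drop stopwords, then lemmatize each token."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return [lemmatize(tok) for tok in no_punct.split() if tok not in STOPWORDS]


def most_common_words(texts, n=20):
    """Count token frequencies across all texts and return the n most frequent."""
    counts = Counter()
    for text in texts:
        counts.update(preprocess_text_sketch(text))
    return counts.most_common(n)


# Made-up sentences for illustration only
docs = ["Oh great, it was the BEST day.", "Great, the best plan of all!"]
print(most_common_words(docs, n=3))
```

The list returned by `most_common_words` is already in the shape a bar chart needs: words for the x-axis and counts for the bar heights.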
Logistic Regression Classification with BERT-Based Feature Vector
sarcasm_logistic_regression.py prepares the text data for training a BERT-based classifier and a logistic regression classifier using the Hugging Face Transformers library. It loads a pre-trained BERT tokenizer and tokenizes the cleaned text from the combined dataset. The dataset is split into training, validation, and test sets using train_test_split. Each set is converted into a format suitable for BERT by extracting the input_ids and attention_mask from the tokenized data and converting them into tensors. A custom SentimentDataset class wraps the data in the format required by the PyTorch model. The model itself is a BERT model for sequence classification (BertForSequenceClassification), initialized with 3 output labels. Training parameters such as the number of epochs, batch size, and learning rate are set using the TrainingArguments class, and a Trainer is instantiated to manage the training process. The model is then trained and evaluated on the validation and test datasets, and evaluation results such as accuracy and loss are printed.
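Since train_test_split produces only two partitions per call, a three-way split is usually obtained by calling it twice. The sketch below shows one way to do this; the function name `three_way_split`, the 80/10/10 proportions, the stratification, and the seed are all assumptions, as the project's actual split sizes are not stated here.

```python
from sklearn.model_selection import train_test_split


def three_way_split(texts, labels, val_size=0.1, test_size=0.1, seed=42):
    """Split into train/validation/test with two stratified train_test_split calls."""
    # Convert fractions to absolute counts so both splits are exact.
    n_test = round(test_size * len(texts))
    n_val = round(val_size * len(texts))

    # First carve off the test set.
    x_rest, x_test, y_rest, y_test = train_test_split(
        texts, labels, test_size=n_test, stratify=labels, random_state=seed
    )
    # Then take the validation set out of what remains.
    x_train, x_val, y_train, y_val = train_test_split(
        x_rest, y_rest, test_size=n_val, stratify=y_rest, random_state=seed
    )
    return (x_train, y_train), (x_val, y_val), (x_test, y_test)
```

Stratifying both calls keeps the class ratio roughly the same in all three partitions, which matters when comparing validation and test metrics.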
Contributors
- Julian Rambob
- Jennifer Haliewicz
This research underscores the difficulty of sarcasm detection and the limitations of standard NLP models in handling nuanced expressions. While our dataset improved logistic regression performance, the results highlight the need for advanced contextual learning methods to fully capture sarcasm's complexities.
