Prompt-Based Data Augmentation with Large Language Models for Indonesian Gender-Based Hate Speech Detection
- 1 Department of Computer Science, School of Computer Science, Bina Nusantara University, Jakarta, Indonesia
- 2 Department of Mathematics, School of Computer Science, Bina Nusantara University, Jakarta, Indonesia
Abstract
The growing volume of content on social media makes automatic moderation crucial for preserving a healthy online community and reducing the spread of offensive and abusive content, such as gender-based hate speech. Building automated moderation with machine learning requires a large and balanced dataset. However, data scarcity and class imbalance have hindered the development of gender-based hate speech detection for Indonesian Twitter communities, and creating and annotating a new dataset would be time-consuming and costly. A practical alternative is to use data augmentation to enlarge the minority class. This study investigates how prompt-based data augmentation with a large language model can generate natural-sounding tweet samples for gender-based hate speech detection, and whether the augmented samples preserve their original labels. Compared with the back-translation baseline, the results show that prompt-based augmentation with a large language model can generate new, natural-sounding tweet samples while preserving labels and avoiding memorization of the source tweets. With conventional machine learning classifiers, prompt-based augmentation achieves accuracy competitive with back translation. These results suggest that prompting a large language model is a viable alternative augmentation strategy: it yields new tweet samples with less memorization, maintains label integrity, and achieves competitive accuracy.
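The abstract does not give the prompts or the memorization measure used in the study, so the following is only a minimal sketch of the general idea, under stated assumptions. The function names (`build_augmentation_prompt`, `ngram_set`, `max_overlap`), the prompt wording, and the trigram-overlap score are all hypothetical illustrations and are not taken from the paper. The sketch assembles a few-shot prompt from labeled seed tweets and scores how much of a generated tweet is copied from any seed.

```python
from typing import List, Set, Tuple


def build_augmentation_prompt(seed_tweets: List[str], label: str, n_new: int = 5) -> str:
    """Assemble a few-shot prompt asking a language model for new tweets
    that carry the same label as the seed examples (hypothetical wording)."""
    examples = "\n".join(f"- {tweet}" for tweet in seed_tweets)
    return (
        f"The following Indonesian tweets are labeled '{label}':\n"
        f"{examples}\n"
        f"Write {n_new} new Indonesian tweets with the same label. "
        "Do not copy the examples."
    )


def ngram_set(text: str, n: int = 3) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase, whitespace-tokenized word n-grams."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def max_overlap(candidate: str, seeds: List[str], n: int = 3) -> float:
    """Largest fraction of the candidate's n-grams found in any single seed.

    A value near 1.0 suggests the candidate is mostly copied (memorized);
    a value near 0.0 suggests it is new text. This is an assumed proxy,
    not the measure used in the study.
    """
    candidate_ngrams = ngram_set(candidate, n)
    if not candidate_ngrams or not seeds:
        return 0.0
    return max(
        len(candidate_ngrams & ngram_set(seed, n)) / len(candidate_ngrams)
        for seed in seeds
    )
```

In use, the prompt string would be sent to whichever language model is available, and generated tweets whose overlap score exceeds a chosen threshold would be discarded before being added to the minority class. Checking that the generated tweets still carry the intended label would remain a separate step, for example manual review.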
DOI: https://doi.org/10.3844/jcssp.2024.819.826
Copyright: © 2024 Muhammad Amien Ibrahim, Faisal, Zefanya Delvin Sulistiya and Tora Sangputra Yopie Winarto. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Keywords
- Hate Speech Detection
- Data Augmentation
- Large Language Models