Abstract:
This thesis investigates gender detection in text analysis: identifying an author's gender from linguistic and stylistic cues. Gender detection can improve the precision and relevance of information processing systems, which supports more personalized content strategies and helps counter gender bias in areas such as social media and AI-driven analytics. The research conducts an extensive evaluation of preprocessing techniques and feature selection strategies, and assesses both traditional models and advanced language models such as BERT, with a particular focus on tweets. The key findings show that splitting social media data by username, rather than randomly, improves model performance and generalization and prevents data leakage. Combining word and character n-grams, and combining linguistic with textual features, proved highly effective. BERT performed best among the large language models evaluated, though it did not outperform the traditional models. This work advances the understanding of gender detection and contributes to the development of more sophisticated and equitable text analysis tools in computational linguistics.
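To make two of the techniques named above concrete, the following is a minimal sketch, not the thesis's actual pipeline. It assumes each record is a hypothetical `(username, text, label)` tuple. The function names `split_by_username`, `char_ngrams`, and `word_ngrams` are illustrative and do not come from the study. The split hashes each username so that all tweets by one author land on the same side, which is one simple way to avoid the same author appearing in both training and test data.

```python
import hashlib


def split_by_username(records, test_fraction=0.2):
    """Split (username, text, label) records so that each user's tweets
    all fall in either the training set or the test set, never both.

    Assumption: records is an iterable of 3-tuples; this layout is
    hypothetical and chosen only for illustration.
    """
    train, test = [], []
    for username, text, label in records:
        # Stable hash of the username, mapped to a value in [0, 1).
        digest = hashlib.sha256(username.encode("utf-8")).hexdigest()
        bucket = int(digest[:8], 16) / 16 ** 8
        target = test if bucket < test_fraction else train
        target.append((username, text, label))
    return train, test


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n=2):
    """Return all overlapping word n-grams of lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def combined_features(text):
    """Concatenate word and character n-grams into one feature list."""
    return word_ngrams(text, 2) + char_ngrams(text, 3)
```

Because the assignment depends on a hash of the username, the realized test share only approximates `test_fraction`, and the approximation is rough when there are few users. A library group splitter would be the more usual choice in practice; the hash-based version is shown here only because it needs nothing beyond the standard library.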