An optimal approach for text feature selection


Publisher

Academic Press

Abstract

Traditionally, feature selection is conducted by first deriving a candidate list of features, then ranking them and selecting the top features based on a predefined threshold. Such methods depend heavily on the choice of threshold and therefore lead to sub-optimal text categorization results. In this paper, we address the selection problem with a one-step method designed to select the optimal subset of features. The selection is formulated mathematically as an optimization problem whose objective is to maximize classification accuracy while simultaneously deriving and choosing the most discriminative features. Our method, MFX, is applicable to many of the conventional methods and has two distinguishing aspects. First, it treats all documents from the same category as one extended document instead of analyzing individual documents. Second, it chooses the most discriminative terms: those that are frequent and common across all documents of the same category and minimally present in other categories. Moreover, MFX is language-independent. It was tested on the well-known Reuters RCV1 benchmark dataset. To showcase its language independence, MFX was also tested on Arabic datasets extracted from Arabic news sources. The results indicated that MFX always performed comparably to or better than other well-known feature selection methods. MFX with a Support Vector Machine (SVM) classifier was also shown to outperform recent text classification algorithms based on neural networks and word embeddings. © 2022 Elsevier Ltd
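
The two distinguishing aspects described in the abstract can be illustrated with a minimal Python sketch. It is not the MFX formulation: the abstract does not give the optimization objective, so the scoring rule below (within-category document coverage multiplied by the difference in relative frequency inside and outside the category) is an assumption made purely for illustration. The function names `category_term_scores` and `top_terms`, the whitespace tokenizer, and the fixed cut-off `k` are likewise hypothetical. Note that a fixed `k` is exactly the kind of threshold the paper aims to avoid; it is used here only to display the highest-scoring terms.

```python
from collections import Counter


def category_term_scores(docs_by_category):
    """Score terms per category with an illustrative heuristic (not the MFX objective)."""
    # Aspect 1: merge all documents of a category into one extended document.
    extended = {
        cat: Counter(tok for doc in docs for tok in doc.lower().split())
        for cat, docs in docs_by_category.items()
    }
    # Number of documents in each category that contain each term.
    coverage = {
        cat: Counter(tok for doc in docs for tok in set(doc.lower().split()))
        for cat, docs in docs_by_category.items()
    }

    scores = {}
    for cat, counts in extended.items():
        total_in = sum(counts.values())
        n_docs = len(docs_by_category[cat])
        others = [c for c in extended if c != cat]
        total_out = sum(sum(extended[o].values()) for o in others)

        cat_scores = {}
        for term, count in counts.items():
            freq_in = count / total_in
            # Aspect 2a: terms should be common across the category's documents.
            share = coverage[cat][term] / n_docs
            # Aspect 2b: terms should be minimally present in other categories.
            count_out = sum(extended[o][term] for o in others)
            freq_out = count_out / total_out if total_out else 0.0
            # Assumed scoring rule, chosen only for illustration.
            cat_scores[term] = share * (freq_in - freq_out)
        scores[cat] = cat_scores
    return scores


def top_terms(scores, k):
    """Return the k highest-scoring terms per category (ties broken alphabetically)."""
    return {
        cat: [t for t, _ in sorted(s.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]
        for cat, s in scores.items()
    }


docs = {
    "sports": ["the team won the match", "the team lost the match"],
    "finance": ["the bank raised the rate", "the bank cut the rate"],
}
print(top_terms(category_term_scores(docs), 2))
```

In this toy corpus, a word such as "the" is equally frequent in both extended documents, so its score under the assumed rule is zero, while terms that appear in every document of one category and nowhere else rank highest.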

Keywords

Arabic text mining, Data mining, Feature selection, Text categorization, Text mining, Classification (of information), Support vector machines, Text processing, Arabic texts, Candidate list, Feature-based, Features selection, Optimal approaches, Selection problems, Text feature selections, Text-mining, Feature extraction
