
Linguistic Driven Feature Selection for Text Classification as Stop Word Replacement

  • The common corpus-optimization method "stop word removal" rests on the assumption that text tokens with high occurrence frequency can be removed without affecting classification performance. It ignores linguistic information about sentence structure as well as the preferences of the classification technology. We propose the Weighted Unimportant Part-of-Speech Model (WUP-Model) for token removal in the pre-processing of text corpora. The weighted relevance of a token is determined from its classification relevance and its impact on classification performance. The WUP-Model uses linguistic information (part of speech) as the grouping criterion. Analogous to a stop word list, we provide a set of irrelevant parts of speech (a WUP-Instance) for word removal. In a proof of concept we created WUP-Instances for several classification algorithms. The evaluation showed significant advantages over classic stop word removal: the tree-based classifier improved runtime by 65% and performance by 25%, while the performance of the other classifiers decreased between 0.2% and 2.4% and their runtime changed between −4.4% and −24.7%. These results demonstrate the beneficial effects of the proposed WUP-Model.
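The core idea — filtering tokens by part-of-speech tag instead of by a stop word list — can be sketched as follows. This is an illustrative example, not the authors' implementation: the tag set used here as a "WUP-Instance" (Penn Treebank tags for determiners, prepositions, and conjunctions) is a hypothetical choice, whereas the paper derives the set per classifier from weighted relevance.

```python
# Hypothetical WUP-Instance: Penn Treebank POS tags assumed irrelevant
# for a given classifier. The paper determines this set empirically
# per classification algorithm; these tags are only an example.
WUP_INSTANCE = {"DT", "IN", "CC", "TO"}

def wup_filter(tagged_tokens, wup_instance=WUP_INSTANCE):
    """Drop every token whose POS tag is in the 'unimportant' set."""
    return [word for word, tag in tagged_tokens if tag not in wup_instance]

# A pre-tagged sentence as (word, Penn Treebank tag) pairs.
sentence = [
    ("The", "DT"), ("model", "NN"), ("removes", "VBZ"),
    ("tokens", "NNS"), ("by", "IN"), ("part", "NN"),
    ("of", "IN"), ("speech", "NN"),
]

print(wup_filter(sentence))
# -> ['model', 'removes', 'tokens', 'part', 'speech']
```

In practice the tags would come from a POS tagger (e.g. NLTK's `pos_tag` or spaCy) during pre-processing, and a different WUP-Instance would be supplied for each classifier.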

Metadata
Author:Daniel Schönle, Christoph Reich, Djaffar Ould-Abdeslam
URN:urn:nbn:de:bsz:fn1-opus4-98259
DOI:https://doi.org/10.12720/jait.14.4.796-802
ISSN:1798-2340
Parent Title (English):Journal of Advances in Information Technology
Document Type:Article (peer-reviewed)
Language:English
Year of Completion:2023
Release Date:2023/08/22
Volume:14.2023
Issue:4
First Page:796
Last Page:802
Open-Access-Status:Open Access Gold
Licence:Creative Commons - CC BY-NC-ND - Attribution - NonCommercial - NoDerivatives 4.0 International