OPUS 4 | Search

2 search hits

Linguistic Driven Feature Selection for Text Classification as Stop Word Replacement (2023)

Schönle, Daniel ; Reich, Christoph ; Ould-Abdeslam, Djaffar

The common corpus optimization method “stop words removal” is based on the assumption that text tokens with high occurrence frequency can be removed without affecting classification performance. Linguistic information regarding sentence structure is ignored, as are the preferences of the classification technology. We propose the Weighted Unimportant Part-of-Speech Model (WUP-Model) for token removal in the pre-processing of text corpora. The weighted relevance of a token is determined using classification relevance and classification performance impact. The WUP-Model uses linguistic information (part of speech) as grouping criteria. Analogous to stop word removal, we provide a set of irrelevant parts of speech (WUP-Instance) for word removal. In a proof of concept, we created WUP-Instances for several classification algorithms. The evaluation showed significant advantages compared to classic stop word removal. The tree-based classifier increased runtime by 65% and 25% in performance. The performance of the other classifiers decreased between 0.2% and 2.4%, their runtime improved between −4.4% and −24.7%. These results prove beneficial effects of the proposed WUP-Model.

Comparable Machine Learning Efficiency : Balanced Metrics for Natural Language Processing (2023)

Schönle, Daniel ; Reich, Christoph ; Ould-Abdeslam, Djaffar

As machine learning becomes increasingly pervasive, its resource demands and financial implications escalate, necessitating energy and cost optimisations to meet stakeholder demands. Quality metrics for predictive machine learning models are abundant, but efficiency metrics remain rare. We propose a framework for efficiency metrics that enables the comparison of distinct efficiency types. A quality-focused efficiency metric is introduced that considers resource consumption, computational effort, and runtime in addition to prediction quality. The metric has been successfully tested for usability, plausibility, and compensation for dataset size and host performance. This framework enables informed decisions to be made about the use and design of machine learning in an environmentally responsible and cost-effective manner.
