The Evaluation of Effects of Oversampling and Word Embedding on Sentiment Analysis


Nur Heri Cahyana
Yuli Fauziah
Wisnalmawati Wisnalmawati
Agus Sasmito Aribowo
Shoffan Saifullah

Abstract

Opinion datasets for sentiment analysis are generally imbalanced, and imbalanced data tends to bias classification toward the majority class. Balancing the data by adding synthetic samples to the minority class requires an oversampling strategy. This research aims to overcome the imbalance by combining oversampling with word embedding (Word2Vec or FastText). We convert each opinion in the dataset into a sentence vector and then apply an oversampling method to these vectors. We use five datasets of comments on YouTube videos that differ in terms, number of records, and degree of imbalance. We observed increased sentiment analysis accuracy when combining Word2Vec or FastText with each of three oversampling methods: SMOTE, Borderline SMOTE, or ADASYN. Random Forest is used as the machine learning algorithm in the classification model, and a confusion matrix is used for validation; model performance is measured by accuracy and F-measure. After testing on the five datasets, the performance of Word2Vec is almost equal to that of FastText, while the best oversampling method is Borderline SMOTE. Combining Word2Vec or FastText with Borderline SMOTE could be the best choice, with accuracy and F-measure reaching 91.0%-91.3%. It is hoped that a sentiment analysis model using Word2Vec or FastText with Borderline SMOTE can serve as a high-performance alternative.
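
The sketch below is a rough, minimal illustration of the pipeline described in the abstract: averaged Word2Vec sentence vectors, Borderline SMOTE oversampling of the minority class, and a Random Forest classifier. The library choices (gensim, imbalanced-learn, scikit-learn), the toy comments, and all parameter values are assumptions for illustration and are not taken from the paper.

# Minimal sketch of the abstract's pipeline, assuming gensim, imbalanced-learn,
# and scikit-learn; the toy data and parameters are illustrative only.
from collections import Counter

import numpy as np
from gensim.models import Word2Vec  # gensim's FastText is a drop-in alternative
from imblearn.over_sampling import BorderlineSMOTE
from sklearn.ensemble import RandomForestClassifier

# Toy tokenized YouTube-style comments with an imbalanced label distribution
# (1 = positive majority, 0 = negative minority).
comments = [
    ["great", "video", "thanks"], ["love", "this", "channel"],
    ["very", "helpful", "tutorial"], ["nice", "clear", "explanation"],
    ["awesome", "content"], ["really", "enjoyed", "this"],
    ["subscribed", "great", "work"], ["good", "editing", "and", "audio"],
    ["bad", "audio", "quality"], ["boring", "and", "too", "long"],
    ["misleading", "title"], ["poor", "explanation"],
]
labels = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0])

# 1. Word embedding: train Word2Vec, then average the word vectors of each
#    comment to obtain one sentence vector per opinion.
w2v = Word2Vec(sentences=comments, vector_size=50, min_count=1, seed=42, workers=1)
X = np.array([np.mean([w2v.wv[t] for t in tokens], axis=0) for tokens in comments])

# 2. Oversampling in vector space: Borderline SMOTE synthesizes minority samples
#    near the class boundary (SMOTE or ADASYN can be swapped in here).
#    In practice this is applied to the training split only; on tiny toy data
#    it may add few or no synthetic points.
sampler = BorderlineSMOTE(k_neighbors=3, m_neighbors=3, random_state=42)
X_res, y_res = sampler.fit_resample(X, labels)
print("before:", Counter(labels), "after:", Counter(y_res))

# 3. Classification: fit a Random Forest on the balanced vectors; in the paper,
#    the model is validated with a confusion matrix, accuracy, and F-measure.
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_res, y_res)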


Article Details

How to Cite
[1] N. Cahyana, Y. Fauziah, W. Wisnalmawati, A. Aribowo, and S. Saifullah, "The Evaluation of Effects of Oversampling and Word Embedding on Sentiment Analysis," INFOTEL, vol. 17, no. 1, pp. 54-67, Apr. 2025.
Section
Informatics