The Evaluation of Effects of Oversampling and Word Embedding on Sentiment Analysis


Nur Heri Cahyana
Yuli Fauziah
Wisnalmawati Wisnalmawati
Agus Sasmito Aribowo
Shoffan Saifullah

Abstract

Opinion datasets for sentiment analysis are generally imbalanced, and imbalanced data tends to bias classification toward the majority class. Balancing the data by adding synthetic samples to the minority class requires an oversampling strategy. This research aims to overcome the imbalance by combining oversampling with word embedding (Word2Vec or FastText). We convert each opinion in the dataset into a sentence vector and then apply an oversampling method to these vectors. We use five datasets of comments on YouTube videos that differ in terms, number of records, and degree of imbalance. We observed increased sentiment analysis accuracy when combining Word2Vec or FastText with each of three oversampling methods: SMOTE, Borderline SMOTE, or ADASYN. Random Forest is used as the machine learning algorithm in the classification model, and a confusion matrix is used for validation; model performance is measured by accuracy and F-measure. After testing on the five datasets, the performance of Word2Vec is almost equal to that of FastText, while the best oversampling method is Borderline SMOTE. Combining Word2Vec or FastText with Borderline SMOTE could be the best choice, with accuracy and F-measure reaching 91.0%-91.3%. It is hoped that a sentiment analysis model using Word2Vec or FastText with Borderline SMOTE can serve as a high-performance alternative.
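
The sketch below is a rough, minimal illustration of the pipeline described in the abstract: averaged Word2Vec sentence vectors, Borderline SMOTE oversampling of the minority class, and a Random Forest classifier. The library choices (gensim, imbalanced-learn, scikit-learn), the toy comments, and all parameter values are assumptions for illustration and are not taken from the paper.

# Minimal sketch of the abstract's pipeline, assuming gensim, imbalanced-learn,
# and scikit-learn; the toy data and parameters are illustrative only.
from collections import Counter

import numpy as np
from gensim.models import Word2Vec  # gensim's FastText is a drop-in alternative
from imblearn.over_sampling import BorderlineSMOTE
from sklearn.ensemble import RandomForestClassifier

# Toy tokenized YouTube-style comments with an imbalanced label distribution
# (1 = positive majority, 0 = negative minority).
comments = [
    ["great", "video", "thanks"], ["love", "this", "channel"],
    ["very", "helpful", "tutorial"], ["nice", "clear", "explanation"],
    ["awesome", "content"], ["really", "enjoyed", "this"],
    ["subscribed", "great", "work"], ["good", "editing", "and", "audio"],
    ["bad", "audio", "quality"], ["boring", "and", "too", "long"],
    ["misleading", "title"], ["poor", "explanation"],
]
labels = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0])

# 1. Word embedding: train Word2Vec, then average the word vectors of each
#    comment to obtain one sentence vector per opinion.
w2v = Word2Vec(sentences=comments, vector_size=50, min_count=1, seed=42, workers=1)
X = np.array([np.mean([w2v.wv[t] for t in tokens], axis=0) for tokens in comments])

# 2. Oversampling in vector space: Borderline SMOTE synthesizes minority samples
#    near the class boundary (SMOTE or ADASYN can be swapped in here).
#    In practice this is applied to the training split only; on tiny toy data
#    it may add few or no synthetic points.
sampler = BorderlineSMOTE(k_neighbors=3, m_neighbors=3, random_state=42)
X_res, y_res = sampler.fit_resample(X, labels)
print("before:", Counter(labels), "after:", Counter(y_res))

# 3. Classification: fit a Random Forest on the balanced vectors; in the paper,
#    the model is validated with a confusion matrix, accuracy, and F-measure.
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_res, y_res)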


Article Details

How to Cite
[1] N. Cahyana, Y. Fauziah, W. Wisnalmawati, A. Aribowo, and S. Saifullah, "The Evaluation of Effects of Oversampling and Word Embedding on Sentiment Analysis," INFOTEL, vol. 17, no. 1, pp. 54-67, Apr. 2025.
Section
Informatics