The Evaluation of Effects of Oversampling and Word Embedding on Sentiment Analysis
Abstract
Opinion datasets for sentiment analysis are generally imbalanced. A classifier trained on imbalanced data tends to be biased toward the majority class. Balancing the data by adding synthetic samples to the minority class requires an oversampling strategy. This research aims to overcome the imbalance by combining oversampling with word embedding (Word2Vec or FastText). We convert each opinion in the dataset into a sentence vector and then apply an oversampling method to these vectors. We use five datasets of comments on YouTube videos that differ in terms, number of records, and degree of imbalance. We observed an increase in sentiment analysis accuracy when combining Word2Vec or FastText with one of three oversampling methods: SMOTE, Borderline SMOTE, or ADASYN. Random Forest is used as the machine learning classifier, and a confusion matrix is used for validation. Model performance is measured with accuracy and F-measure. Across the five datasets, the performance of Word2Vec is almost equal to that of FastText, and the best oversampling method is Borderline SMOTE. Combining Word2Vec or FastText with Borderline SMOTE could be the best choice because its accuracy and F-measure reach 91.0% - 91.3%. We hope that a sentiment analysis model using Word2Vec or FastText with Borderline SMOTE can serve as a high-performance alternative.
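The abstract's core idea, embedding first and oversampling second, can be illustrated with a minimal sketch. It is not the authors' implementation. The toy embedding table, the averaging used to form sentence vectors, and the function names are assumptions made for illustration. The oversampler is plain SMOTE interpolation, so the sample-selection steps that distinguish Borderline SMOTE and ADASYN are omitted. The metric helper assumes a binary confusion matrix and the F1 form of the F-measure.

```python
import random

# Toy 2-dimensional word vectors: hypothetical values standing in for
# embeddings that a trained Word2Vec or FastText model would supply.
TOY_EMBEDDINGS = {
    "good": [1.0, 0.0],
    "great": [0.8, 0.2],
    "bad": [0.0, 1.0],
    "awful": [0.2, 0.8],
    "video": [0.5, 0.5],
}


def sentence_vector(tokens, embeddings):
    """Average the vectors of known tokens (an assumed pooling strategy)."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        raise ValueError("no token has an embedding")
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote(minority, n_new, k=2, seed=0):
    """Plain SMOTE: interpolate between a minority sample and one of its
    k nearest minority neighbours. Borderline SMOTE and ADASYN choose the
    base samples differently; that selection step is not modelled here."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [m for m in minority if m is not base]
        neighbours = sorted(others, key=lambda m: squared_distance(base, m))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def accuracy_and_f_measure(tp, fp, fn, tn):
    """Accuracy and F1 from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    if tp == 0:
        return accuracy, 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, 2 * precision * recall / (precision + recall)


# Hypothetical tokenised comments: four positive (majority), three negative.
majority_docs = [["good", "video"], ["great", "video"], ["good", "great"], ["great"]]
minority_docs = [["bad", "video"], ["awful"], ["bad", "awful"]]

majority_vecs = [sentence_vector(d, TOY_EMBEDDINGS) for d in majority_docs]
minority_vecs = [sentence_vector(d, TOY_EMBEDDINGS) for d in minority_docs]

# Oversample the minority class until both classes have the same size.
synthetic = smote(minority_vecs, n_new=len(majority_vecs) - len(minority_vecs))
balanced_minority = minority_vecs + synthetic
```

Because oversampling operates on the sentence vectors rather than on raw text, each synthetic sample is a point in embedding space and does not correspond to an actual comment. In a fuller pipeline the balanced vectors would be passed to a classifier such as Random Forest, and oversampling would normally be applied to the training split only so that the test data remains untouched.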

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work.