A Random Oversampling and BERT-based Model Approach for Handling Imbalanced Data in Essay Answer Correction
Abstract
The task of automated essay scoring has long been plagued by imbalanced datasets, in which the distribution of scores or labels is skewed towards certain categories. This imbalance can degrade the performance of machine learning models, which tend to become biased towards the majority class. One potential solution is oversampling, which balances the dataset by increasing the representation of the minority class. In this paper, we propose a novel approach that combines random oversampling with a BERT-base uncased model for essay answer correction. The research explores various text pre-processing scenarios to optimize model accuracy. Using a dataset of Indonesian-language essay answers from eighth-grade middle school students, our approach outperforms traditional methods such as a Backpropagation Neural Network, Naïve Bayes, and a Random Forest Classifier with FastText word embeddings (pretrained on Wikipedia, 300-dimensional vectors) in terms of precision, recall, F1-score, and accuracy. The best performance was obtained with the BERT-base uncased model at a learning rate of 2e-5 and a simplified pre-processing approach: by retaining punctuation, numbers, and stop words, the model achieved a precision of 0.9463, a recall of 0.9377, an F1-score of 0.9346, and an accuracy of 94%.
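The random oversampling step described in the abstract can be sketched as follows. This is a minimal, illustrative implementation, not the authors' code: the function name, the toy labels, and the random seed are assumptions, and it simply duplicates minority-class examples at random until every class matches the majority-class count, before the balanced data would be passed to BERT fine-tuning.

```python
import random
from collections import Counter

def random_oversample(texts, labels, seed=42):
    """Randomly duplicate minority-class examples until every
    class matches the majority-class count."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_texts, out_labels = list(texts), list(labels)
    for label, n in counts.items():
        pool = [t for t, lab in zip(texts, labels) if lab == label]
        extra = rng.choices(pool, k=target - n)  # sample with replacement
        out_texts.extend(extra)
        out_labels.extend([label] * len(extra))
    return out_texts, out_labels

# Toy essay-answer labels: the "correct" class dominates "wrong".
texts = ["a", "b", "c", "d", "e"]
labels = ["correct", "correct", "correct", "correct", "wrong"]
bal_texts, bal_labels = random_oversample(texts, labels)
print(Counter(bal_labels))  # both classes now have 4 examples
```

In practice the same effect is obtained with `imbalanced-learn`'s `RandomOverSampler`; the manual version above just makes the mechanism explicit.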
Article Details
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.