Optimization of software defects prediction in imbalanced class using a combination of resampling methods with support vector machine and logistic regression
Abstract
The main obstacle to high-accuracy software defect prediction arises when the data set has imbalanced classes and dichotomous characteristics. The class imbalance problem can be addressed with a data-level approach such as resampling, while the dichotomous nature of defect prediction can be handled with a classification method. This study analyzes the performance of a proposed software defect prediction method in order to identify the combination of resampling method and classification method that gives the highest accuracy. The proposed method has two stages. The first is resampling, using oversampling, under-sampling, or hybrid methods. The second is classification, using either the Support Vector Machine (SVM) algorithm or the Logistic Regression (LR) algorithm. The proposed model was tested on five NASA MDP data sets, each with the same 37 attributes. Based on the t-test, the p-value = 0.0344 < 0.05 and the t statistic = 3.1524 > the critical t value = 2.7765, which indicates that the proposed combination of methods is suitable for classifying imbalanced classes. Resampling also improved the performance of the classification algorithms: compared with no resampling, the average increase in AUC was 17.19% for the SVM algorithm and 7.26% for the LR algorithm. Among the combinations of the three resampling methods with SVM and LR, the best was oversampling with SVM, which predicted software defects in imbalanced classes with an average accuracy of 84.02% and an AUC of 91.65%.
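The two ideas in the abstract that lend themselves to a small illustration are the resampling step and the t-test comparison. The sketch below is a minimal, standard-library-only illustration and not the authors' implementation. The abstract does not name the specific resampling algorithms or the t-test variant, so plain random oversampling, random under-sampling, and a paired t statistic are assumptions made here. All function names, the toy data, and the AUC scores are hypothetical. Only the value 2.7765 comes from the abstract, and it matches the two-tailed critical t value at alpha = 0.05 with 4 degrees of freedom (five data sets).

```python
import math
import random
import statistics
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    X_out, y_out = [], []
    for label in counts:
        rows = [x for x, lab in zip(X, y) if lab == label]
        for x in rng.sample(rows, target):
            X_out.append(x)
            y_out.append(label)
    return X_out, y_out


def paired_t_statistic(before, after):
    """t statistic of the paired differences (after - before), with n - 1 degrees of freedom."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))


# Hypothetical toy data: 8 non-defective modules (label 0) and 2 defective ones (label 1).
X = [[i, i * 2] for i in range(10)]
y = [0] * 8 + [1] * 2

X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)
print(Counter(y_over))   # class counts after oversampling
print(Counter(y_under))  # class counts after under-sampling

# Hypothetical AUC scores on five data sets, without and with resampling.
auc_plain = [0.70, 0.65, 0.72, 0.68, 0.75]
auc_resampled = [0.85, 0.80, 0.78, 0.90, 0.83]
t_stat = paired_t_statistic(auc_plain, auc_resampled)

# Value quoted in the abstract; two-tailed critical t at alpha = 0.05, df = 4.
T_CRITICAL_DF4 = 2.7765
print(t_stat > T_CRITICAL_DF4)
```

In a real experiment, the resampled training set would be passed to an SVM or LR classifier, and resampling would be applied to the training folds only so that the test data keeps its original class distribution.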
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.