Optimization of software defects prediction in imbalanced class using a combination of resampling methods with support vector machine and logistic regression

Windyaning Ustyannie; Emy Setyaningsih; Catur Iswahyudi

doi:10.20895/infotel.v13i4.726

Published Dec 9, 2021

Windyaning Ustyannie

Institut Sains & Teknologi AKPRIND

Emy Setyaningsih

Institut Sains & Teknologi AKPRIND

Catur Iswahyudi

Institut Sains & Teknologi AKPRIND

Abstract

The main obstacle to highly accurate software defect prediction is a data set with imbalanced classes and dichotomous characteristics. The imbalanced class problem can be addressed with a data-level approach such as resampling, while the dichotomous nature of the data can be handled with a classification method. This study analyzes the performance of the proposed software defect prediction method in order to identify the combination of resampling method and classification method that gives the highest accuracy. The proposed method has two stages. The first is resampling using oversampling, under-sampling, or a hybrid method. The second is classification using the Support Vector Machine (SVM) algorithm or the Logistic Regression (LR) algorithm. The proposed model was tested on five NASA MDP data sets, each with 37 attributes. In the t-test, the p-value = 0.0344 < 0.05 and the t statistic = 3.1524 > the critical value 2.7765, which indicates that the proposed combination of methods is suitable for classifying imbalanced classes. Resampling also improved the performance of both classification algorithms: compared with no resampling, the average AUC increased by 17.19% for SVM and by 7.26% for LR. Across the three resampling methods combined with SVM and LR, the best combination for software defect prediction on imbalanced classes is oversampling with SVM, with an average accuracy of 84.02% and an AUC of 91.65%.
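The two ideas in the abstract that are easiest to misread, oversampling and the t-test threshold, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes plain random oversampling (the abstract does not name the specific oversampling algorithm) and assumes the t-test is a paired test on per-data-set scores with and without resampling (the abstract does not say what was paired). The function names and all numbers in the example data are hypothetical. The critical value 2.7765 is the two-tailed threshold at a 0.05 significance level with 4 degrees of freedom, which is consistent with five data sets.

```python
import math
import random
from statistics import mean, stdev


def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until every class matches the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        # Draw extra copies (with replacement) only for classes below the target size.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


def paired_t_statistic(before, after):
    """Paired-sample t statistic: mean difference divided by its standard error."""
    diffs = [a - b for a, b in zip(after, before)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


# Hypothetical toy data: four non-defective modules (0), one defective module (1).
rows = [[0.1], [0.2], [0.3], [0.4], [0.9]]
labels = [0, 0, 0, 0, 1]
balanced_rows, balanced_labels = random_oversample(rows, labels)

# Hypothetical AUC values for five data sets, NOT taken from the paper.
auc_before = [0.70, 0.65, 0.72, 0.68, 0.75]
auc_after = [0.85, 0.80, 0.78, 0.90, 0.88]
t_stat = paired_t_statistic(auc_before, auc_after)

# Two-tailed critical value for alpha = 0.05 and df = 5 - 1 = 4.
T_CRITICAL_DF4 = 2.7765
significant = abs(t_stat) > T_CRITICAL_DF4
```

In a real experiment the balanced rows would be passed to an SVM or LR classifier, and resampling would be applied to the training split only, so that duplicated minority rows do not leak into the evaluation data.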

How to Cite

[1] W. Ustyannie, E. Setyaningsih, and C. Iswahyudi, “Optimization of software defects prediction in imbalanced class using a combination of resampling methods with support vector machine and logistic regression,” INFOTEL, vol. 13, no. 4, pp. 176-184, Dec. 2021.

Issue

Vol 13 No 4 (2021): November 2021

Section

Informatics

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
