Stability of Software Defect Prediction in Relation to Levels of Data Imbalance
Tihana Galinac Grbac and Goran Mausa, University of Rijeka
Bojana Dalbelo–Basic, University of Zagreb
Main part
Introduction
On the complexity of software defect prediction data
Experimental approach
Data imbalance
Dataset issues
Evaluation metrics
Example of applying the strategy
Conclusion
The paper addresses the task of improving the quality of automatic software defect detection based on static analysis metrics. One of the main problems in this task is data imbalance. The authors investigate how the stability of classifiers that predict the defect-proneness of modules depends on the degree of data imbalance.
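To make the imbalance problem concrete, the following minimal sketch scores a trivial baseline that labels every module as defect-free, at several shares of defective modules. It is an illustration only, not the authors' experimental procedure: the helper names, the module count of 1000, the chosen defect shares, and the particular metrics (accuracy, recall, G-mean) are assumptions made for this example.

```python
from math import sqrt


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels, where 1 means defect-prone."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def metrics(y_true, y_pred):
    """Compute accuracy, recall and G-mean from true and predicted labels."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return {
        "accuracy": accuracy,
        "recall": recall,
        "g_mean": sqrt(recall * specificity),
    }


def majority_baseline(n_modules, defect_share):
    """Score a hypothetical classifier that predicts 'no defect' for every module."""
    n_defective = round(n_modules * defect_share)
    y_true = [1] * n_defective + [0] * (n_modules - n_defective)
    y_pred = [0] * n_modules  # always predicts the majority class
    return metrics(y_true, y_pred)


if __name__ == "__main__":
    # Assumed defect shares, from balanced to strongly imbalanced.
    for share in (0.5, 0.2, 0.05):
        print(share, majority_baseline(1000, share))
```

As the share of defective modules shrinks, the accuracy of this baseline rises even though it never detects a single defect, so recall and G-mean stay at zero. This is one reason the choice of evaluation metric matters when comparing classifiers across imbalance levels.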