Evaluation of machine learning methods in medicine: real data application

machine learning methods in medicine

DOI:

https://doi.org/10.37609/srinmed.25

Keywords:

Binary Logistic Regression, Random Forest, Support Vector Machine, Naive Bayes, Decision Tree, Real Data Sets

Abstract

Objective: Health studies often aim either to identify risk factors associated with a disease or to build predictive models that classify subjects, for example as healthy or diseased. When the aim is classification, machine learning methods are widely used. This study evaluated the performance of machine learning methods on real data sets that differ in sample size, prevalence, and coefficient of determination (R²).

Method: Each data set was randomly split into a 70% training set and a 30% test set. Logistic Regression, Decision Tree, Random Forest, Support Vector Machine, and Naive Bayes were fitted to the training set, and the performance measures of each method on the test set (Accuracy, Area Under the Curve, and Adjusted F Measure) were recorded. All analyses were performed in R 3.5.1.

Results: When all variables in the data were categorical and R² was low with a moderate sample size, the Naive Bayes (NB) method achieved higher performance. When all variables were continuous and R² was moderate with a low sample size, the Support Vector Machine (SVM) method demonstrated superior performance. When the data set had a high number of categorical variables and a high R², the NB method outperformed the others. The Random Forest (RF) method showed higher performance when R² was high and the sample size was moderate.

Conclusion: This study provides guidance for researchers dealing with classification problems, helping them choose the most effective machine learning method based on the characteristics of their data sets.
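The evaluation procedure described in the Method can be illustrated with a minimal sketch. The study was carried out in R 3.5.1, and its code is not given here, so the Python below is only an assumed illustration of a 70%/30% random split and of two of the reported measures (Accuracy and Area Under the Curve). The function names, the fixed seed, and the rank-based AUC computation are hypothetical choices, not the authors' implementation; the Adjusted F Measure is omitted.

```python
import random


def split_70_30(rows, seed=0):
    """Shuffle a copy of `rows` and split it into 70% training and 30% test.

    The seed is an assumption added only to make the sketch reproducible.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(0.7 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def accuracy(y_true, y_pred):
    """Proportion of test observations whose predicted class is correct."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auc(y_true, scores):
    """Area under the ROC curve for binary labels coded 0/1.

    Computed as the probability that a randomly chosen positive case
    receives a higher score than a randomly chosen negative case,
    with ties counted as one half.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

In this sketch, any classifier fitted on the training part would supply the predicted classes for `accuracy` and the predicted probabilities for `auc` on the test part, so the same two functions can be reused to compare all five methods on identical test data.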

Published

2025-01-09

How to Cite

Binokay, H. (2025). Evaluation of machine learning methods in medicine: real data application: machine learning methods in medicine. Scientific Reports in Medicine, 1(3). https://doi.org/10.37609/srinmed.25

Section

Research Articles