Evaluation of machine learning methods in medicine: real data application
machine learning methods in medicine
DOI:
https://doi.org/10.37609/srinmed.25
Keywords:
Binary Logistic Regression, Random Forest, Support Vector Machine, Naive Bayes, Decision Tree, Real Data Sets
Abstract
Objective: One aim of a health study is to identify risk factors associated with a disease or to obtain predictive models for classification, such as healthy versus diseased. When the aim is classification, machine learning methods are widely used. The aim of this study was to evaluate the performance of machine learning methods for different sample sizes, prevalences, and coefficients of determination (R2) in real data sets.
Method: Each data set was randomly split into a 70% training set and a 30% test set, and logistic regression, decision tree, random forest, support vector machine, and naive Bayes were applied to the training set. The performance measures (accuracy, area under the curve, and adjusted F-measure) of the methods on the test set were recorded. All analyses were performed in R 3.5.1.
Results: When all variables in the data were categorical and R2 was low with a moderate sample size, the naive Bayes (NB) method showed higher performance. When all variables in the data were continuous and R2 was moderate with a low sample size, the support vector machine (SVM) method showed superior performance. When the data set had a high number of categorical variables and a high R2, the NB method outperformed the others. The random forest (RF) method showed higher performance when R2 was high and the sample size was moderate.
Conclusion: This study provides guidance for researchers dealing with classification problems, helping them choose the most effective machine learning method based on the characteristics of their data sets.
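The evaluation protocol in the Method paragraph (a random 70/30 split, a model fitted on the training set, and accuracy and area under the curve computed on the test set) can be illustrated with a minimal sketch. The study itself used R 3.5.1; the Python below is only an illustration, not the authors' code. The function names, the synthetic one-feature data, and the toy threshold classifier standing in for the five methods are all assumptions made for the example, and the adjusted F-measure is not covered.

```python
import random


def split_70_30(rows, seed=0):
    """Shuffle the rows and return (train, test) in a 70/30 split."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(round(0.7 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def accuracy(y_true, y_pred):
    """Fraction of test cases whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def auc(y_true, scores):
    """Area under the ROC curve, computed as the probability that a random
    positive case scores higher than a random negative case (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_threshold(train):
    """Toy stand-in classifier (an assumption, not one of the five methods):
    the midpoint between the class means of a single feature."""
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def demo():
    # Synthetic data: one continuous feature, 50 healthy (0) and 50 diseased (1).
    rng = random.Random(42)
    data = [(rng.gauss(2.0 * y, 1.0), y) for y in [0, 1] * 50]
    train, test = split_70_30(data, seed=1)
    threshold = fit_threshold(train)          # fit on the training set only
    y_true = [y for _, y in test]
    scores = [x for x, _ in test]             # the feature value acts as the score
    y_pred = [1 if s > threshold else 0 for s in scores]
    print("test accuracy:", accuracy(y_true, y_pred))
    print("test AUC:", auc(y_true, scores))


if __name__ == "__main__":
    demo()
```

In a real comparison, each classifier would replace the toy threshold rule, and the same test-set measures would be stored per method and per data set.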
License
Copyright (c) 2025 Scientific Reports in Medicine
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Copyright Notice
Scientific Reports in Medicine is an open access scientific journal. Open access means that all content is freely available without charge to the user or his/her institution, on the principle that making research freely available to the public supports a greater global exchange of knowledge. The Journal and the content of this website are licensed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) License. This is in accordance with the Budapest Open Access Initiative (BOAI) definition of open access.
The Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) License allows users to copy, distribute, and transmit an article for noncommercial purposes, without distributing adapted versions. The CC BY-NC-ND license permits non-commercial re-use of an open access article, as long as the author is properly attributed.
Scientific Reports in Medicine requires the author, as the rights holder, to sign and submit the journal's agreement form prior to acceptance. The authors transfer all financial rights, especially processing, reproduction, representation, printing, distribution, and online transmittal, to Academician Publishing with no limitation whatsoever, and grant Academician Publishing the right to publish the work. This ensures both that the Journal has the right to publish the article and that the author has confirmed, among other things, that it is their original work and that it is based on valid research.
Authors who publish with this journal agree to the following terms:
*Authors transfer copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
*Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
*Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work.
Self Archiving Policy
*The Journal allows authors to self-archive their articles in an open access repository. The Journal considers publishing material for which a pre-print or working paper has previously been posted online. The Journal does not consider this an exception to its policy regarding the originality of the paper (not to be published elsewhere), since an open access repository does not act as a publisher but serves as an archiving system for the benefit of the public.
The Journal's policy on accepted articles requires authors not to mention, in articles archived in an open access repository, their acceptance for publication in the Journal until the article is final and no further modifications can be made. Authors are not allowed to submit the paper to another publisher while it is still being evaluated by the Journal or is being revised after the peer review decision.
The Journal allows authors to archive the final published article, usually a PDF file, in an open access repository after they inform the editorial office. The final version of the article and its web page contain information about copyright and how to cite the article. Only this final version is published on the Journal's official website, and only this version should be used for self-archiving; it should replace any previous versions uploaded by the authors to the open access repository.