Integrating Data Analysis Methods with Machine Learning Algorithms for Mixed Data Types: Does This Combination Improve Predictive Models' Accuracy?

Authors

  • Nikolaos Papafilippou Laboratory of Agronomy, Aristotle University of Thessaloniki, Greece https://orcid.org/0009-0003-3148-7229
  • Zacharenia Kyrana Laboratory of Agronomy, Aristotle University of Thessaloniki, Greece https://orcid.org/0000-0001-9269-0675
  • Emmanouil Pratsinakis Laboratory of Agronomy, Aristotle University of Thessaloniki, Greece https://orcid.org/0000-0002-3725-3525
  • Efstratios Kiranas Department of Nutritional Sciences and Dietetics, International Hellenic University, Greece https://orcid.org/0009-0002-6358-2071
  • Alexandra-Maria Michailidou Department of Food Science and Technology, Aristotle University of Thessaloniki, Greece
  • Angelos Markos Department of Primary Education, Democritus University of Thrace, Greece
  • George Menexes Laboratory of Agronomy, Aristotle University of Thessaloniki, Greece https://orcid.org/0000-0002-1034-7345

DOI:

https://doi.org/10.47852/bonviewJDSIS52023906

Keywords:

multivariate data, principal component analysis, categorical principal component analysis, machine learning algorithms, SVC, random forest classifier, multinomial logistic regression

Abstract

In this study, we examined the potential of integrating multivariate data analysis methods as a preliminary stage for machine learning techniques to augment their predictive power. These methods encompass principal component analysis, multiple correspondence analysis, and non-linear categorical principal component analysis with optimal scaling. The machine learning approaches evaluated include Support Vector Machines, Stochastic Gradient Descent, Naïve Bayes, K-Nearest Neighbor, Decision Trees, Random Forests, Adaptive Boosting, and Multinomial Logistic Regression. We conducted experiments using data from a nationwide survey, comprising a total sample of 42,593 adolescents who answered more than 155 questions related to their eating habits. The dependent variable, body mass index (BMI), was measured and employed in the analysis as both a quantitative and qualitative variable. The index values were initially classified based on the World Health Organization’s recommendations. The results indicated that predictions are more reliable when utilizing the BMI as a qualitative variable within a four-class structure. Implementing a multivariate data analysis strategy before applying machine learning algorithms not only conserves time but also facilitates the selection of the most effective predictive model. Although dimensionality reduction may not consistently enhance the models’ predictive abilities, it contributes to the "interpretability" of the results.
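The comparison described in the abstract, the same classifier trained with and without a dimensionality-reduction step, can be illustrated with a minimal sketch. This is not the authors' code. It assumes scikit-learn, uses a synthetic four-class numeric dataset as a stand-in for the survey data, and covers only plain PCA with two of the listed classifiers. Multiple correspondence analysis and categorical PCA with optimal scaling are not part of scikit-learn and are not shown. The sample size, feature count, and number of retained components are arbitrary illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the survey data (assumption: NOT the study's data).
# Four classes mirror the four-class BMI structure mentioned in the abstract.
X, y = make_classification(
    n_samples=600, n_features=40, n_informative=8,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

# Two of the classifier families named in the abstract, with default settings.
classifiers = {
    "SVC": SVC(),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
}

def compare(n_components=10, cv=5):
    """Return mean cross-validated accuracy with and without a PCA step."""
    results = {}
    for name, clf in classifiers.items():
        # Baseline: standardize, then classify on all original features.
        baseline = make_pipeline(StandardScaler(), clf)
        # Reduced: standardize, project onto the leading components, classify.
        reduced = make_pipeline(
            StandardScaler(), PCA(n_components=n_components), clf
        )
        results[name] = {
            "raw": cross_val_score(baseline, X, y, cv=cv).mean(),
            "pca": cross_val_score(reduced, X, y, cv=cv).mean(),
        }
    return results

results = compare()
for name, scores in results.items():
    print(f"{name}: raw={scores['raw']:.3f} pca={scores['pca']:.3f}")
```

Placing the scaler and PCA inside the pipeline means both are refitted on each training fold, so the held-out fold does not influence the projection. Whether the reduced model scores higher or lower depends on the data, which is consistent with the abstract's remark that dimensionality reduction does not consistently improve predictive ability.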

Received: 22 July 2024 | Revised: 23 September 2024 | Accepted: 2 January 2025 

Conflicts of Interest

The authors declare that they have no conflicts of interest to this work.

Data Availability Statement

Data are available from the corresponding author upon reasonable request.

Author Contribution Statement

Nikolaos Papafilippou: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Data curation, Writing – original draft, Writing – review & editing, Visualization. Zacharenia Kyrana: Resources. Emmanouil Pratsinakis: Conceptualization. Efstratios Kiranas: Conceptualization, Resources. Alexandra-Maria Michailidou: Resources. Angelos Markos: Conceptualization, Methodology, Software, Formal analysis. George Menexes: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Data curation, Writing – review & editing, Supervision, Project administration.


Published

2025-02-10

Section

Research Articles

How to Cite

Papafilippou, N., Kyrana, Z., Pratsinakis, E., Kiranas, E., Michailidou, A.-M., Markos, A., & Menexes, G. (2025). Integrating Data Analysis Methods with Machine Learning Algorithms for Mixed Data Types: Does This Combination Improve Predictive Models’ Accuracy? Journal of Data Science and Intelligent Systems. https://doi.org/10.47852/bonviewJDSIS52023906