FormulAI: Designing Rule-Based Datasets for Interpretable and Challenging Machine Learning Tasks

Hegler Tissot

doi:10.47852/bonviewAIA42021781

Authors

Hegler Tissot, Department of Information Science, Drexel University, United States

DOI:

https://doi.org/10.47852/bonviewAIA42021781

Keywords:

synthetic datasets, rule-based datasets, pattern recognition, interpretability and explainability, class imbalance

Abstract

In an era marked by the transformative impact of machine learning (ML) algorithms across various disciplines, challenges in achieving model interpretability persist. Existing evaluation datasets often lack transparency, thereby obscuring the decision-making process of ML models, particularly in complex deep learning architectures. This opacity raises concerns across sectors like healthcare, emphasizing the pivotal role of explainability in fostering trust and adhering to regulatory norms. While progress has been made through the development of interpretable models, the absence of formalized, interpretable datasets hampers the validation and comparison of techniques. Rule-based datasets, distinct from general synthetic datasets, provide an avenue to simulate real-world challenges while maintaining interpretability. This paper introduces FormulAI, a framework for generating comprehensive rule-grounded datasets encompassing categorical and continuous features, calibrated noise, and imbalanced class distributions. Emphasizing scalability and reproducibility, these datasets serve as a robust standard, fostering exploration in interpretability and robustness.
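To make the idea of a rule-grounded dataset concrete, the following is a minimal sketch in Python. It is not the FormulAI implementation: the function name, the feature names (`color`, `size`), the labeling rule, and the noise mechanism are all hypothetical choices made for illustration. It shows the ingredients named in the abstract: a categorical and a continuous feature, a known ground-truth rule, label noise applied at a controlled rate, an imbalanced class distribution, and a fixed seed for reproducibility.

```python
import random


def make_rule_based_dataset(n_rows=1000, noise_rate=0.05, seed=42):
    """Generate rows labeled by an explicit rule, then flip some labels.

    Illustrative only: the features and the rule are assumptions,
    not taken from the FormulAI framework.
    """
    rng = random.Random(seed)  # fixed seed so the dataset is reproducible
    rows = []
    for _ in range(n_rows):
        color = rng.choice(["red", "green", "blue"])  # categorical feature
        size = rng.uniform(0.0, 10.0)                 # continuous feature
        # Hypothetical ground-truth rule: positive only for large red items.
        # Roughly 1/3 of rows are red and 1/5 have size > 8, so positives
        # are a small minority (class imbalance).
        clean_label = int(color == "red" and size > 8.0)
        # Controlled noise: flip the label with probability noise_rate.
        flipped = rng.random() < noise_rate
        label = 1 - clean_label if flipped else clean_label
        rows.append({
            "color": color,
            "size": size,
            "clean_label": clean_label,
            "label": label,
            "flipped": flipped,
        })
    return rows


if __name__ == "__main__":
    data = make_rule_based_dataset()
    positives = sum(r["label"] for r in data)
    print(f"{positives} positive labels out of {len(data)} rows")
```

Because the rule that produced each label is known, an explanation extracted from a trained model can be compared directly against it, and the `flipped` field records which rows were perturbed by noise.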

Received: 23 September 2023 | Revised: 4 March 2024 | Accepted: 15 March 2024

Conflicts of Interest

The author declares that he has no conflicts of interest related to this work.

Data Availability Statement

The data that support the findings of this study are openly available at https://github.com/hextrato/FormulAI.

Author Contribution Statement

Hegler Tissot: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Resources, Data curation, Writing - original draft, Writing - review & editing, Visualization, Supervision, Project administration.