Feature balancing of demographic data using SMOTE
Cuncong Zhong
This study investigates the Synthetic Minority Over-sampling Technique (SMOTE) for machine learning models trained on biomedical datasets, with a focus on mitigating disparities in demographic data. It is most relevant to datasets in which some demographic groups are underrepresented. SMOTE was originally designed to address class imbalance; the primary objective here is to extend it to address ethnic imbalance in the feature representation. Conventional approaches merely exclude race as a primary or additive factor and leave the underlying misrepresentation uncorrected. In contrast, this work proposes a modification of the original SMOTE framework that augments the dataset according to participants' demographic backgrounds. The aim is to reshape the training distribution so that model performance improves for underrepresented demographic subgroups. However, the results show no statistically significant improvement in accuracy after feature balancing with the adapted SMOTE method. This suggests that correcting imbalance, while important, may not be sufficient on its own to overcome the challenges posed by heterogeneous demographic representation in machine learning models applied to biomedical databases. Further research is needed to identify methods that improve sampling accuracy and fairness across diverse populations.
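To make the idea of balancing on a demographic feature rather than on the class label concrete, the following is a minimal sketch and not the study's implementation. It applies the standard SMOTE interpolation step (a synthetic point placed on the segment between a sample and one of its k nearest neighbours) within each demographic group until every group matches the largest one. The function name `smote_by_group`, its parameters, the choice to balance up to the largest group, the restriction to numeric features, and the omission of outcome labels are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def smote_by_group(rows, groups, k=5, seed=0):
    """Oversample each smaller demographic group up to the largest group's size.

    rows   -- list of numeric feature vectors (lists of floats)
    groups -- demographic label for each row (hypothetical column)
    k      -- number of nearest neighbours considered for interpolation
    """
    rng = random.Random(seed)
    counts = Counter(groups)
    target = max(counts.values())
    new_rows, new_groups = list(rows), list(groups)

    for group, count in counts.items():
        members = [r for r, g in zip(rows, groups) if g == group]
        if count == target or len(members) < 2:
            continue  # already largest, or no neighbour to interpolate with
        for _ in range(target - count):
            base = rng.choice(members)
            # Nearest neighbours of `base`, restricted to the same group.
            others = [m for m in members if m is not base]
            others.sort(key=lambda m: math.dist(base, m))
            neighbour = rng.choice(others[:k])
            gap = rng.random()  # position along the segment base -> neighbour
            synthetic = [b + gap * (n - b) for b, n in zip(base, neighbour)]
            new_rows.append(synthetic)
            new_groups.append(group)
    return new_rows, new_groups
```

Calling `smote_by_group(rows, groups)` on a dataset with four rows in one group and two in another returns eight rows, four per group, with the added rows lying between existing members of the smaller group. Because interpolation only produces points inside the region already covered by a group, it cannot add information the original samples lack, which is consistent with the abstract's finding that balancing alone did not improve accuracy.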