Optimizing Protein Particle Classification: A Study on Smoothing Techniques and Model Performance
Hossein Saiedian
Prajna Dhar
This thesis investigates whether smoothing techniques improve classification accuracy on protein particle datasets, in both binary and multi-class configurations across three datasets. We applied five methods (Averaging-Based Smoothing, Moving Average, Exponential Smoothing, Savitzky-Golay, and Kalman Smoothing) and measured their effect on Random Forest, Decision Tree, and Neural Network models. Baseline accuracies revealed how difficult the multi-class configurations are to separate, and clustering analyses of class similarities and distinctions guided our interpretation of these classification challenges.
The results indicate that Averaging-Based Smoothing and Moving Average are particularly effective at improving classification accuracy, especially in configurations with marked differences in surfactant conditions. Feature importance analysis identified metrics such as IntMean and IntMax as critical for distinguishing classes. Cross-validation confirmed the robustness of our models: Random Forest and Neural Network consistently outperformed the Decision Tree in binary tasks and showed promising adaptability in multi-class classification. This study highlights the efficacy of smoothing techniques for improving classification in protein particle analysis and offers a foundational approach for future research in biopharmaceutical data processing and analysis.
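As an illustration of two of the smoothing methods named above, the following is a minimal sketch of a centered Moving Average and of simple Exponential Smoothing applied to one feature column. It is not the thesis implementation: the function names, the window size, the smoothing factor `alpha`, the edge handling, and the sample `int_mean` values are all assumptions chosen for illustration, and the abstract does not specify how samples were ordered before smoothing.

```python
def moving_average(values, window=3):
    """Centered moving average; window is assumed odd and shrinks at the edges."""
    if window < 1:
        raise ValueError("window must be >= 1")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def exponential_smoothing(values, alpha=0.5):
    """Simple exponential smoothing: s[0] = x[0], s[t] = alpha*x[t] + (1-alpha)*s[t-1]."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = []
    for x in values:
        if not smoothed:
            smoothed.append(float(x))  # seed with the first observation
        else:
            smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


# Hypothetical intensity readings with one noisy spike (not thesis data).
int_mean = [10.0, 12.0, 30.0, 11.0, 9.0]
ma = moving_average(int_mean, window=3)
es = exponential_smoothing(int_mean, alpha=0.5)
```

Both functions damp the spike at the third reading. The moving average spreads it across neighbouring samples, while exponential smoothing lets it decay over the samples that follow. Savitzky-Golay and Kalman Smoothing are not shown here.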