Toward Trustworthy Biomedical AI: Efficient Protein Language Models and Privacy-Aware Clinical Representations
Tamzidul Hoque
Cuncong Zhong
Bishnu Sarker
Michael Hageman
Accurate biological sequence annotation and privacy-aware clinical modeling are central challenges in modern computational biology and biomedical AI. This dissertation presents scalable and interpretable deep learning frameworks spanning protein family classification, metal-ion binding prediction, and privacy-preserving electrocardiogram (ECG) representation learning. First, we introduce GPCR-SLM, a lightweight transformer-based framework for high-resolution classification of G-protein coupled receptors (GPCRs), one of the largest and most pharmacologically important protein families, targeted by approximately 35% of FDA-approved drugs. Unlike traditional homology-based tools such as BLAST and HMMER, which struggle to distinguish closely related families with low sequence similarity, our knowledge-distilled small language model achieves 99% accuracy across 86 GPCR families. The framework significantly outperforms BLAST (86.4%) and HMMER (91%) while delivering a 33.5× computational speedup compared to large protein language models, enabling scalable functional annotation as protein databases continue to expand.
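As a point of orientation for the knowledge-distillation step mentioned above, the following is a minimal sketch of a standard soft-label distillation objective for a single sequence. It is not the GPCR-SLM training code: the function names, the temperature of 2.0, the mixing weight `alpha`, and the three-class logits are all illustrative assumptions, and the abstract does not state which distillation loss was actually used.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits (numerically stable)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Hypothetical single-example distillation loss.

    Combines KL(teacher || student) on temperature-softened distributions
    with the usual cross-entropy against the hard family label.
    `temperature` and `alpha` are assumed values, not taken from the source.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL divergence between the softened teacher and student distributions.
    kl = sum(t * math.log(t / s)
             for t, s in zip(p_teacher, p_student) if t > 0)
    # Cross-entropy of the unsoftened student prediction against the true label.
    ce = -math.log(softmax(student_logits)[label])
    # The T^2 factor keeps the soft-label term on a comparable gradient scale.
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce


# Made-up logits over three families; the true family has index 0.
teacher = [4.0, 1.0, 0.0]
close_student = [3.5, 1.2, 0.1]
far_student = [0.0, 1.0, 4.0]
print(distillation_loss(close_student, teacher, label=0))
print(distillation_loss(far_student, teacher, label=0))
```

A student whose logits track the teacher's receives a lower loss than one that disagrees with it, which is the signal that lets a small model inherit the behavior of a large protein language model.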
Second, we present an end-to-end deep learning pipeline for protein–metal-ion binding prediction. Binding site annotation is traditionally labor-intensive and limited by handcrafted features or predefined residue sets. We systematically evaluate five state-of-the-art protein language models and incorporate positional encoding to capture long-range residue dependencies. Our approach achieves a Matthews Correlation Coefficient (MCC) of 0.89 with precision, recall, and F1 scores exceeding 95% for six major metal ions under 10-fold cross-validation, demonstrating robust predictive performance and improved biological interpretability. Finally, we address fairness and privacy in clinical AI through a variational autoencoder (VAE) framework for ECG representation learning. Because ECGs inherently encode sensitive soft biometrics such as sex, age, and race, we design a dual-discriminator architecture that suppresses demographic information while preserving clinically relevant signals. The reconstructed ECGs substantially reduce demographic identifiability while maintaining strong predictive performance for reduced left ventricular ejection fraction, left ventricular hypertrophy, and 5-year mortality.
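For readers less familiar with the Matthews correlation coefficient reported for the metal-ion binding results above, the sketch below computes it from the four cells of a binary confusion matrix. The function name and the example counts are illustrative assumptions and are not data from this dissertation; returning 0.0 when the denominator vanishes is one common convention, not something the source specifies.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC from binary confusion-matrix counts.

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    Ranges from -1 (total disagreement) to +1 (perfect prediction).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Degenerate case (an empty row or column); assumed convention.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Made-up counts for a hypothetical binding / non-binding residue classifier.
print(matthews_corrcoef(tp=90, tn=95, fp=5, fn=10))
```

Unlike accuracy, the MCC stays informative when binding residues are far rarer than non-binding ones, because all four confusion-matrix cells enter the score.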
Collectively, this work advances parameter-efficient, scalable, and privacy-conscious deep learning methodologies for both molecular and clinical domains, bridging computational protein science and trustworthy biomedical AI.