Technical Skills and Methods

Machine Learning & Model Development

Gradient Boosting Frameworks: Implemented LightGBM (LGBMRegressor, LGBMClassifier) for regression and classification tasks on high-dimensional datasets with 485K-893K features
Hyperparameter Optimization: Utilized Optuna for automated hyperparameter tuning to maximize model performance
Model Interpretability: Applied SHAP (SHapley Additive exPlanations) values for feature importance analysis and model explanation
Cross-Validation: Performed k-fold cross-validation to estimate how well models generalize to unseen samples and to detect overfitting
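
The k-fold procedure can be sketched in plain Python as follows. This is a minimal illustration, not the project's code: the function name, the seed, and the toy sample count are assumptions, and in practice a library splitter would usually be used instead.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs that partition range(n_samples) into k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once so folds are random but reproducible
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


# Toy usage: 10 samples split into 3 folds.
folds = list(k_fold_indices(n_samples=10, k=3))
```

Each sample lands in exactly one validation fold, so every prediction used for scoring comes from a model that never saw that sample during training.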

Data Processing & Feature Engineering

High-Dimensional Data Management: Engineered efficient chunking strategies to handle datasets with 485K-893K features while managing computational constraints
Missing Data Handling: Developed preprocessing pipelines to address datasets with 368K-510K missing values across multiple features
Feature Selection: Extracted top contributing features (500 CpG sites) from large feature spaces using SHAP-based importance ranking
Data Integration: Merged multi-platform datasets (Illumina 450K and 850K arrays) while managing platform-specific missing data
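
A rough sketch of the chunking and top-feature selection idea follows. The chunk size, the feature names, and the importance scores are hypothetical; the scores simply stand in for per-feature importance values such as mean absolute SHAP values.

```python
import heapq


def chunk_ranges(n_features, chunk_size):
    """Yield (start, stop) column ranges that cover all features in order."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, n_features, chunk_size):
        yield start, min(start + chunk_size, n_features)


def top_k_features(importance_by_feature, k):
    """Return the k feature names with the largest importance scores."""
    return heapq.nlargest(k, importance_by_feature, key=importance_by_feature.get)


# Hypothetical chunk size: 485,000 columns processed 100,000 at a time.
ranges = list(chunk_ranges(n_features=485_000, chunk_size=100_000))

# Hypothetical importance scores pooled from all chunks.
scores = {"cg_a": 0.40, "cg_b": 0.05, "cg_c": 0.90, "cg_d": 0.10}
selected = top_k_features(scores, k=2)
```

Processing one column range at a time keeps memory bounded, and pooling the per-chunk scores before ranking lets the final selection compare features across chunks.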

Statistical Analysis

Correlation Analysis: Conducted Pearson correlation coefficient analysis to assess relationships between methylation patterns and target variables
Linear Regression: Performed regression analysis on individual features to validate biological relevance
P-value Interpretation: Applied statistical significance testing for feature validation
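
For a single CpG site, the correlation and regression steps reduce to a few sums. The sketch below uses made-up ages and β-values purely for illustration; the function names are not from the project.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up example: methylation at one site rising with age.
ages = [20, 30, 40, 50, 60]
beta = [0.2, 0.4, 0.5, 0.4, 0.5]
r = pearson_r(ages, beta)
slope, intercept = fit_line(ages, beta)
```

A p-value for r additionally requires the t distribution; scipy.stats.pearsonr returns both the coefficient and the p-value in one call.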

Data Visualization

Model Performance Metrics: Created confusion matrices, ROC curves (AUC), and evaluation metric tables
Feature Analysis Plots: Developed SHAP summary plots, scatter plots with regression lines, and distribution visualizations
Data Distribution: Generated boxplots and age distribution analyses across multi-study datasets

Bioinformatics & Domain Knowledge

Genomic Data Analysis: Processed DNA methylation β-values from Illumina BeadChip arrays (450K, 850K platforms)
Gene Annotation: Utilized R packages (IlluminaHumanMethylation450kanno, IlluminaHumanMethylationEPICanno) for CpG-to-gene mapping
Public Database Navigation: Retrieved and processed datasets from NCBI Gene Expression Omnibus (GEO) database
Biological Validation: Connected machine learning findings to established biological markers and literature
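
For readers unfamiliar with β-values: a β-value is the methylated signal intensity divided by the total intensity plus a small stabilizing offset, so it ranges from 0 (unmethylated) to 1 (fully methylated). The sketch below uses the commonly cited offset of 100 as an assumption. The M-value conversion is shown only for context and is not something the project description says was used.

```python
import math


def beta_value(methylated, unmethylated, offset=100.0):
    """Beta-value from methylated and unmethylated probe intensities."""
    return methylated / (methylated + unmethylated + offset)


def m_value(beta, eps=1e-6):
    """Logit-style M-value, log2(beta / (1 - beta)), with clipping to avoid log(0)."""
    clipped = min(max(beta, eps), 1.0 - eps)
    return math.log2(clipped / (1.0 - clipped))


# Toy intensities, not real array data.
b = beta_value(methylated=900.0, unmethylated=0.0)
m = m_value(0.5)
```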

Programming & Tools

Python: Primary language for data processing, modeling, and analysis
R: Used for genomic annotation and methylation data processing
Version Control: Maintained project repository on GitHub for reproducibility

Model Evaluation

Regression Metrics: Mean Absolute Error (MAE), Median Absolute Error (MedAE)
Classification Metrics: AUC-ROC, F1 Score, Accuracy, Precision, Recall, False Positive/Negative Rates
Performance Benchmarking: Compared model results against published epigenetic clock studies
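
The two regression metrics differ only in how the absolute errors are summarized. A minimal sketch with made-up ages (the values are illustrative, not project results):

```python
from statistics import mean, median


def mean_absolute_error(y_true, y_pred):
    """Average of |true - predicted|; sensitive to large individual errors."""
    return mean([abs(t - p) for t, p in zip(y_true, y_pred)])


def median_absolute_error(y_true, y_pred):
    """Median of |true - predicted|; robust to a few large errors."""
    return median([abs(t - p) for t, p in zip(y_true, y_pred)])


# Made-up chronological vs. predicted ages, in years.
true_age = [30, 45, 60, 72]
pred_age = [33, 44, 55, 73]
mae = mean_absolute_error(true_age, pred_age)
medae = median_absolute_error(true_age, pred_age)
```

Reporting both is useful because a MedAE well below the MAE indicates that a minority of samples carry most of the error.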

Research Methodology

Dataset Curation: Systematically identified and selected 11 studies meeting specific inclusion criteria from public repositories
Multi-Study Integration: Combined 4,303 samples across 9 studies for age prediction and 823 samples across 2 studies for disease classification
Experimental Design: Developed separate pipelines for age prediction (regression) and Alzheimer's classification (binary classification) tasks

Key Questions and Key Findings

Key Questions

1. Can DNA methylation patterns accurately predict chronological age from blood samples?

Finding: Yes. The age prediction model achieved a Mean Absolute Error (MAE) of 3.01 years and Median Absolute Error (MedAE) of 2.40 years using 500 optimized CpG sites from 3,277 blood samples spanning ages 0-103 years. This performance is competitive with established epigenetic clocks, outperforming Horvath's multi-tissue clock (MedAE: 3.6 years) and Hannum's blood-based clock (MedAE: 3.9 years), despite being trained on fewer samples.

2. Which genetic markers are most strongly associated with aging?

Finding: The model identified ELOVL2, FHL2, KLF14, and TRIM59 as key aging-associated genes, with multiple CpG sites mapping to these genes appearing in the top 20 most important features. Thirteen of the top 20 CpG sites showed strong correlation with chronological age (r ≥ 0.7), with the strongest performer (cg23500537) achieving r = 0.912 (p < 0.001). These genes have been consistently validated in previous epigenetic clock studies, confirming the model's ability to identify biologically relevant biomarkers.

3. Can DNA methylation profiles distinguish Alzheimer's disease patients from healthy controls?

Finding: Yes, with good discriminatory power. The classification model achieved an AUC-ROC of 0.844 and 79% overall accuracy. It correctly identified 161 of 247 Alzheimer's cases (65.2% recall) and 488 of 576 healthy controls (84.7% specificity), giving a false positive rate of 15.3% and a false negative rate of 34.8%. Although performance was lower than that of multi-modal approaches combining brain imaging and cognitive tests, the result supports the viability of blood-based methylation as a non-invasive screening tool.
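
Every rate in this finding is a ratio of confusion-matrix counts. A minimal sketch that derives them from the counts reported above (161 of 247 cases and 488 of 576 controls correctly classified); the function and key names are illustrative.

```python
def classification_rates(tp, fn, tn, fp):
    """Derive common binary-classification rates from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "recall": tp / (tp + fn),                # true positive rate
        "specificity": tn / (tn + fp),           # true negative rate
        "false_positive_rate": fp / (tn + fp),
        "false_negative_rate": fn / (tp + fn),
        "precision": tp / (tp + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


# Counts taken from the finding: misclassified counts are the remainders.
rates = classification_rates(tp=161, fn=247 - 161, tn=488, fp=576 - 488)
```

Recall and the false negative rate always sum to 1, as do specificity and the false positive rate, which gives a quick consistency check on any reported set of metrics.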

4. What genetic biomarkers differentiate Alzheimer's disease from healthy aging?

Finding: The model identified SCGN, FKBP5, GATA4, ORMDL3, IRF3, and SPIDR as genes associated with Alzheimer's-specific methylation changes, all of which have previously been linked to Alzheimer's disease in the literature. Importantly, when these Alzheimer's-associated CpG sites were tested for correlation with chronological age, just over half showed weak or no correlation (8 of 15 sites had r < 0.5), suggesting that they capture disease-specific signals that are at least partly independent of normal aging. This distinction is critical for developing biomarkers that differentiate pathological aging from healthy aging.

5. How does feature engineering impact model performance on high-dimensional genomic data?

Finding: Strategic feature engineering was essential for handling 485K-893K features with significant missing data. Initial chunked models achieved an MAE of 3.67-3.97 years for age prediction and an AUC of 0.65-0.72 for Alzheimer's classification. After SHAP-based feature selection down to the top 500 features and Optuna hyperparameter optimization, performance improved to an MAE of 3.01 years and an AUC of 0.844, a relative improvement of roughly 18-24% in MAE and 17-30% in AUC. This suggests that targeted dimensionality reduction not only reduces computational cost but also improves generalization and limits overfitting on high-dimensional biological data.
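
The size of the improvement can be checked directly from the before and after figures quoted in this finding. A short sketch (the helper name is illustrative):

```python
def relative_change(before, after):
    """Signed relative change from `before` to `after`."""
    return (after - before) / before


# MAE is an error, so a decrease is an improvement: flip the sign.
mae_improvement = [-relative_change(b, 3.01) for b in (3.67, 3.97)]

# AUC is a score, so an increase is an improvement.
auc_improvement = [relative_change(b, 0.844) for b in (0.65, 0.72)]
```

Because the initial results are reported as ranges, the improvement is also a range rather than a single percentage.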

more projects

Spotify Case Study

DATA SCIENCE / MACHINE LEARNING

TRIAGE AI

SOFTWARE ENGINEERING / MACHINE LEARNING

Machine Learning Prediction of Age and Alzheimer's Disease from DNA Methylation Profiles
