I have applied a range of statistical methods and developed software platforms and analytical tools to process and analyze large-scale sequencing datasets. These tools facilitate the discovery of patterns and similarities across diverse omics datasets, enabling the construction of statistical models that support robust hypothesis generation.
In various projects, I have integrated data from multiple sources. Examples of the methods I applied include:
Survival Analysis: I applied standard methods such as the Kaplan-Meier estimator and the Cox proportional hazards model to combined sequencing and clinical data, aiming to identify factors that influence patient survival, disease progression, and treatment response. Key applications included patient prognosis, biomarker discovery, and assessing the impact of specific mutations or genes.
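To make the Kaplan-Meier estimator concrete, the following is a minimal sketch in plain Python. The function name `kaplan_meier` and the toy cohort are hypothetical and serve only as an illustration; they are not taken from the projects described here, where an established survival analysis library would normally be used.

```python
from collections import Counter

def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: follow-up time per patient
    events: 1 if the event (e.g. death) was observed, 0 if censored
    """
    deaths = Counter(t for t, e in zip(durations, events) if e)
    exits = Counter(durations)  # every patient leaves the risk set at their own time
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # Product-limit step: multiply by the fraction surviving time t
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        # Both events and censored patients are removed after time t
        at_risk -= exits[t]
    return curve

# Hypothetical toy cohort: months of follow-up and event indicators
durations = [5, 6, 6, 2, 4, 4]
events = [1, 0, 1, 1, 1, 0]
curve = kaplan_meier(durations, events)
for t, s in curve:
    print(f"t={t}: S(t)={s:.3f}")
```

Censored patients contribute to the risk set until they drop out but never trigger a step in the curve, which is what distinguishes this estimate from a naive fraction of survivors.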
Multi-task Regression Methods: These methods model multiple related outputs simultaneously, sharing information across tasks rather than fitting each output in isolation. In the context of omics data, multi-task regression can capture patterns shared across related datasets, which can improve predictive accuracy and highlight interconnected biological processes.
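One way to illustrate the idea of sharing information across tasks is a multi-task lasso, which uses a group penalty so that a feature is kept or dropped for all outputs at once. The sketch below is a plain-Python proximal gradient solver written for illustration only; the choice of this particular penalty, the function name `multitask_lasso`, and the toy data are assumptions, not a description of the models used in the projects above.

```python
import math

def multitask_lasso(X, Y, alpha, n_iter=200):
    """Proximal gradient for (1/2n)||XW - Y||_F^2 + alpha * sum_j ||W[j]||_2.

    X: n samples x p features, Y: n samples x k tasks (lists of lists).
    Returns W as p rows of k coefficients; an all-zero row means the
    feature was dropped for every task at once.
    """
    n, p, k = len(X), len(X[0]), len(Y[0])
    # Conservative step size: trace(X^T X)/n upper-bounds the Lipschitz constant
    step = n / sum(x * x for row in X for x in row)
    W = [[0.0] * k for _ in range(p)]
    for _ in range(n_iter):
        # Residuals R = XW - Y for the current coefficients
        R = [[sum(X[i][j] * W[j][t] for j in range(p)) - Y[i][t]
              for t in range(k)] for i in range(n)]
        for j in range(p):
            # Gradient step for feature j across all tasks
            v = [W[j][t] - step * sum(X[i][j] * R[i][t] for i in range(n)) / n
                 for t in range(k)]
            # Group soft-thresholding: shrink the whole row toward zero
            norm = math.sqrt(sum(c * c for c in v))
            scale = max(0.0, 1.0 - step * alpha / norm) if norm > 0 else 0.0
            W[j] = [scale * c for c in v]
    return W

# Hypothetical toy data: feature 0 drives both tasks, feature 1 is nearly irrelevant
x0 = [1, -1, 1, -1, 2, -2]
x1 = [1, 1, -1, -1, 0, 0]
X = [[a, b] for a, b in zip(x0, x1)]
Y = [[2 * a + 0.1 * b, -a + 0.1 * b] for a, b in zip(x0, x1)]

W = multitask_lasso(X, Y, alpha=0.2)
print(W)  # row 0 is kept (slightly shrunk), row 1 is zeroed for both tasks
```

The group penalty acts on each feature's coefficients jointly, so weak evidence from several related tasks is pooled when deciding whether a feature matters.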
Classification Methods: I employed classical classification techniques, such as logistic regression, decision trees, and support vector machines (SVMs), to categorize samples based on features derived from omics datasets. These methods distinguish between biological states, such as healthy versus diseased samples or different cell types, and the resulting predictive models help identify potential biomarkers. I also used positive-unlabeled learning for partially labeled data, which allows samples to be classified when only a subset has known annotations.
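Positive-unlabeled learning can be sketched with one widely used recipe, in the style of Elkan and Noto: train an ordinary classifier to separate labeled from unlabeled samples, then rescale its scores by the estimated labeling frequency. The example below is a minimal illustration under the assumption that labeled positives are a random subset of all positives; the one-dimensional feature, the data, and all function names are hypothetical, and this is not necessarily the variant applied in the work described above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, labels, lr=0.1, n_iter=2000):
    """One-feature logistic regression fitted by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_iter):
        errs = [sigmoid(w * x + b) - s for x, s in zip(xs, labels)]
        w -= lr * sum(e * x for e, x in zip(errs, xs)) / n
        b -= lr * sum(errs) / n
    return w, b

# Hypothetical one-dimensional feature (e.g. a gene-level score)
labeled_pos = [2.0, 2.5, 3.0]                       # known positives (s = 1)
unlabeled = [2.2, 2.8, -1.0, -0.5, 0.0, -1.5, 0.5]  # hidden positives and negatives (s = 0)

xs = labeled_pos + unlabeled
s = [1] * len(labeled_pos) + [0] * len(unlabeled)

# Step 1: ordinary classifier for "is this sample labeled?"
w, b = fit_logistic(xs, s)

def g(x):
    return sigmoid(w * x + b)

# Step 2: estimate c = P(labeled | positive) as the mean score of known positives.
# A held-out set of positives would be preferable; they are reused here for brevity.
c = sum(g(x) for x in labeled_pos) / len(labeled_pos)

# Step 3: rescale to approximate P(positive | x), capped at 1
def prob_positive(x):
    return min(1.0, g(x) / c)

for x in unlabeled:
    print(f"x={x:+.1f}  labeled-score={g(x):.2f}  positive-prob={prob_positive(x):.2f}")
```

Because only some positives carry a label, the raw classifier systematically underestimates the probability of being positive; dividing by `c` corrects for that under the stated assumption.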
Unsupervised Methods (e.g., PCA, MOFA, autoencoders): Unsupervised methods uncover hidden structure in high-dimensional data without predefined labels. Autoencoders, a type of deep learning model, can outperform PCA and MOFA in some contexts because they can learn nonlinear relationships in the data. In omics analysis, I used autoencoders to reduce dimensionality and reveal latent structure, such as transcription factor activity. The resulting embeddings are useful for clustering, relationship discovery, and anomaly detection.
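As a small illustration of dimensionality reduction, the sketch below computes the first principal component with power iteration in plain Python. It shows only the linear PCA baseline, not an autoencoder, and the function name `first_principal_component` and the toy matrix are hypothetical; an autoencoder would replace this linear projection with a learned nonlinear encoder.

```python
import math

def first_principal_component(rows, n_iter=100):
    """Return (direction, scores) for the leading principal component."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix (p x p)
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    # Power iteration converges to the eigenvector with the largest eigenvalue
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # One-dimensional embedding: projection of each centered sample onto v
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centered]
    return v, scores

# Hypothetical toy matrix: 5 samples x 3 features; features 0 and 1 co-vary,
# feature 2 is nearly constant
rows = [
    [1.0, 2.0, 0.5],
    [2.0, 4.1, 0.4],
    [3.0, 6.2, 0.6],
    [4.0, 7.9, 0.5],
    [5.0, 10.1, 0.5],
]
direction, scores = first_principal_component(rows)
print([round(x, 3) for x in direction])
print([round(x, 3) for x in scores])
```

The scores form a one-dimensional embedding of the samples that could then be passed to clustering or anomaly detection, in the same way as embeddings from a nonlinear model.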