Researcher profile

Guo-Wei Wei

Guo-Wei Wei contributes to research discovery and scholarly infrastructure.

Researcher · Affiliation not imported · Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging · Verification L1 · Unclaimed author
26 works
0 followers
13 topics
4 close collaborators

Published work

26 published items

preprint · 2026 · arXiv

Persistent Sheaf Laplacian Analysis of Protein Stability and Solubility Changes upon Mutation

Genetic mutations frequently disrupt protein structure, stability, and solubility, acting as primary drivers for a wide spectrum of diseases. Despite the critical importance of these molecular alterations, existing computational models often lack interpretability and fail to integrate essential physicochemical interactions. To overcome these limitations, we propose SheafLapNet, a unified predictive framework grounded in the mathematical theory of Topological Deep Learning (TDL) and Persistent Sheaf Laplacian (PSL). Unlike standard Topological Data Analysis (TDA) tools such as persistent homology, which are often insensitive to heterogeneous information, PSL explicitly encodes specific physical and chemical information such as partial charges directly into the topological analysis. SheafLapNet synergizes these sheaf-theoretic invariants with advanced protein transformer features and auxiliary physical descriptors to capture intrinsic molecular interactions in a multiscale and mechanistic manner. To validate our framework, we employ rigorous benchmarks for both regression and classification tasks. For stability prediction, we utilize the comprehensive S2648 and S350 datasets. For solubility prediction, we employ the PON-Sol2 dataset, which provides annotations for increased, decreased, or neutral solubility changes. By integrating these multi-perspective features, SheafLapNet achieves state-of-the-art performance across these diverse benchmarks, demonstrating that sheaf-theoretic modeling significantly enhances both interpretability and generalizability in predicting mutation-induced structural and functional changes.

preprint · 2023 · arXiv

Integrating Transformer and Autoencoder Techniques with Spectral Graph Algorithms for the Prediction of Scarcely Labeled Molecular Data

In molecular and biological sciences, experiments are expensive, time-consuming, and often subject to ethical constraints. Consequently, one often faces the challenging task of predicting desirable properties from small data sets or scarcely-labeled data sets. Although transfer learning can be advantageous, it requires the existence of a related large data set. This work introduces three graph-based models incorporating Merriman-Bence-Osher (MBO) techniques to tackle this challenge. Specifically, graph-based modifications of the MBO scheme are integrated with state-of-the-art techniques, including a home-made transformer and an autoencoder, in order to deal with scarcely-labeled data sets. In addition, a consensus technique is detailed. The proposed models are validated using five benchmark data sets. We also provide a thorough comparison to other competing methods, such as support vector machines, random forests, and gradient boosting decision trees, which are known for their good performance on small data sets. The performances of various methods are analyzed using residue-similarity (R-S) scores and R-S indices. Extensive computational experiments and theoretical analysis show that the new models perform very well even when as little as 1% of the data set is used as labeled data.

preprint · 2023 · arXiv

Machine-learning Analysis of Opioid Use Disorder Informed by MOR, DOR, KOR, NOR and ZOR-Based Interactome Networks

Opioid use disorder (OUD) continues to pose major public health challenges and social implications worldwide, with a dramatic rise in opioid dependence leading to potential abuse. Although a few pharmacological agents have been approved for OUD treatment, their efficacy requires further improvement in order to provide safer and more effective pharmacological and psychosocial treatments. Preferable therapeutic treatments of OUD rely on advances in understanding the neurobiological mechanism of opioid dependence. Proteins including mu, delta, kappa, nociceptin, and zeta opioid receptors are the direct targets of opioids. Each receptor has a large protein-protein interaction (PPI) network that behaves differently when subjected to various treatments, thus increasing the complexity of the drug development process for an effective opioid addiction treatment. This work presents a PPI-network informed machine-learning study of OUD. We have examined more than 500 proteins in the five opioid receptor networks and subsequently collected 74 inhibitor datasets. Machine learning models were constructed by pairing the gradient boosting decision tree (GBDT) algorithm with two advanced natural language processing (NLP)-based molecular fingerprints. With these models, we systematically carried out evaluations of the screening and repurposing potential of drug candidates for four opioid receptors. In addition, absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties were also considered in the screening of potential drug candidates. Our study can be a valuable and promising tool of pharmacological development for OUD treatments.

preprint · 2023 · arXiv

Topological data analysis hearing the shapes of drums and bells

Mark Kac asked a famous question in 1966 entitled "Can one hear the shape of a drum?", a spectral geometry problem that has intrigued mathematicians for the last six decades and is important to many other fields, such as architectural acoustics, audio forensics, pattern recognition, radiology, and imaging science. A related question is how to hear the shape of a drum. We show that the answer was given in the set of 65 Zenghouyi chime bells dating back to 475-433 B.C. in China. The chime bells in the set gradually vary in size and weight to enable melodies, intervals, and temperaments. The same design principle was used in many other musical instruments, such as xylophones, pan flutes, pianos, etc. We reveal that there is a fascinating connection between the progression pattern of many musical instruments and filtration (or spectral sequence) in topological data analysis (TDA). We argue that filtration-induced evolutionary de Rham-Hodge theory provides a new mathematical foundation for musical instruments. Its discrete counterpart, persistent Laplacians, and many other persistent topological Laplacians, including persistent sheaf Laplacians and persistent path Laplacians, are briefly discussed.

preprint · 2022 · arXiv

CCP: Correlated Clustering and Projection for Dimensionality Reduction

Most dimensionality reduction methods employ frequency domain representations obtained from matrix diagonalization and may not be efficient for large datasets with relatively high intrinsic dimensions. To address this challenge, Correlated Clustering and Projection (CCP) offers a novel data domain strategy that does not need to solve any matrix. CCP partitions high-dimensional features into correlated clusters and then projects correlated features in each cluster into a one-dimensional representation based on sample correlations. Residue-Similarity (R-S) scores and indexes, the shape of data in Riemannian manifolds, and algebraic topology-based persistent Laplacian are introduced for visualization and analysis. Proposed methods are validated with benchmark datasets associated with various machine learning algorithms.

preprint · 2022 · arXiv

Machine learning analysis of cocaine addiction informed by DAT, SERT, and NET-based interactome networks

Cocaine addiction is a psychosocial disorder induced by the chronic use of cocaine and causes a large number of deaths around the world. Despite many decades of effort, no drugs have been approved by the Food and Drug Administration (FDA) for the treatment of cocaine dependence. Cocaine dependence is neurological and involves many interacting proteins in the interactome. Among them, dopamine transporter (DAT), serotonin transporter (SERT), and norepinephrine transporter (NET) are three major targets. Each of these targets has a large protein-protein interaction (PPI) network which must be considered in anti-cocaine addiction drug discovery. This work presents DAT, SERT, and NET interactome network-informed machine learning/deep learning (ML/DL) studies of cocaine addiction. We collect and analyze 61 protein targets out of 460 proteins in the DAT, SERT, and NET PPI networks that have sufficient existing inhibitor datasets. Utilizing autoencoder and other ML algorithms, we build ML/DL models for these targets with 115,407 inhibitors to predict drug repurposing potentials and possible side effects. We further screen their absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties to search for nearly optimal anti-cocaine addiction leads. Our approach sets up a systematic protocol for artificial intelligence (AI)-based anti-cocaine addiction lead discovery.

preprint · 2022 · arXiv

Mathematical artificial intelligence design of mutation-proof COVID-19 monoclonal antibodies

Emerging severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) variants have compromised existing vaccines and posed a grand challenge to coronavirus disease 2019 (COVID-19) prevention, control, and global economic recovery. For COVID-19 patients, monoclonal antibody (mAb) therapies are among the most effective COVID-19 medications. The United States Food and Drug Administration (U.S. FDA) has given emergency use authorization (EUA) to a few mAbs, including those from Regeneron, Eli Lilly, etc. However, they are also undermined by SARS-CoV-2 mutations. It is imperative to develop effective mutation-proof mAbs for treating COVID-19 patients infected by all emerging variants and/or the original SARS-CoV-2. We carry out deep mutational scanning to present the blueprint of such mAbs using algebraic topology and artificial intelligence (AI). To reduce the risk of clinical trial-related failure, we select five mAbs either with FDA EUA or in clinical trials as our starting point. We demonstrate that topological AI-designed mAbs are effective against variants of concern and variants of interest designated by the World Health Organization (WHO), as well as the original SARS-CoV-2. Our topological AI methodologies have been validated by tens of thousands of deep mutational data and their predictions have been confirmed by results from tens of experimental laboratories and population-level statistics of genome isolates from hundreds of thousands of patients.

preprint · 2022 · arXiv

Omicron BA.2 (B.1.1.529.2): high potential to becoming the next dominating variant

The Omicron variant of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has rapidly replaced the Delta variant as a dominating SARS-CoV-2 variant because of natural selection, which favors the variant with higher infectivity and stronger vaccine breakthrough ability. Omicron has three lineages or subvariants, BA.1 (B.1.1.529.1), BA.2 (B.1.1.529.2), and BA.3 (B.1.1.529.3). Among them, BA.1 is the currently prevailing subvariant. BA.2 shares 32 mutations with BA.1 but has 28 distinct ones. BA.3 shares most of its mutations with BA.1 and BA.2 except for one. BA.2 is found to be able to alarmingly reinfect patients originally infected by Omicron BA.1. An important question is whether BA.2 or BA.3 will become a new dominating "variant of concern". Currently, no experimental data has been reported about BA.2 and BA.3. We construct a novel algebraic topology-based deep learning model trained with tens of thousands of mutational and deep mutational data to systematically evaluate BA.2's and BA.3's infectivity, vaccine breakthrough capability, and antibody resistance. Our comparative analysis of all main variants namely, Alpha, Beta, Gamma, Delta, Lambda, Mu, BA.1, BA.2, and BA.3, unveils that BA.2 is about 1.5 and 4.2 times as contagious as BA.1 and Delta, respectively. It is also 30% and 17-fold more capable than BA.1 and Delta, respectively, to escape current vaccines. Therefore, we project that Omicron BA.2 is on its path to becoming the next dominating variant. We forecast that like Omicron BA.1, BA.2 will also seriously compromise most existing mAbs, except for sotrovimab developed by GlaxoSmithKline.

preprint · 2022 · arXiv

Persistent Path Laplacian

Path homology, proposed by S.-T. Yau and his co-workers, provides a new mathematical model for directed graphs and networks. Persistent path homology (PPH) extends path homology with filtration to deal with asymmetric structures. However, PPH is constrained to purely topological persistence and cannot track the homotopic shape evolution of data during filtration. To overcome this limitation of PPH, the persistent path Laplacian (PPL) is introduced to capture the shape evolution of data. PPL's harmonic spectra fully recover PPH's topological persistence, and its non-harmonic spectra reveal the homotopic shape evolution of data during filtration.

preprint · 2022 · arXiv

Topological AI forecasting of future dominating viral variants

The understanding of the mechanisms of SARS-CoV-2 evolution and transmission is one of the greatest challenges of our time. By integrating artificial intelligence (AI), viral genomes isolated from patients, tens of thousands of mutational data, biophysics, bioinformatics, and algebraic topology, the SARS-CoV-2 evolution was revealed to be governed by infectivity-based natural selection. Two key mutation sites, L452 and N501 on the viral spike protein receptor-binding domain (RBD), were predicted in summer 2020, long before they occurred in prevailing variants Alpha, Beta, Gamma, Delta, Kappa, Theta, Lambda, Mu, and Omicron. Recent studies identified a new mechanism of natural selection: antibody resistance. AI-based forecasting of Omicron's infectivity, vaccine breakthrough, and antibody resistance was later nearly perfectly confirmed by experiments. The replacement of dominant BA.1 by BA.2 in late March was predicted in early February. On May 1, 2022, persistent Laplacian-based AI projected Omicron BA.4 and BA.5 to become the new dominating COVID-19 variants. This prediction became reality in late June. Topological AI models offer accurate prediction of mutational impacts on the efficacy of monoclonal antibodies (mAbs).

preprint · 2021 · arXiv

Methodology-centered review of molecular modeling, simulation, and prediction of SARS-CoV-2

The deadly coronavirus disease 2019 (COVID-19) pandemic caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has gone out of control globally. Despite much effort by scientists, medical experts, and society in general, the slow progress on drug discovery and antibody therapeutic development, the unknown possible side effects of the existing vaccines, and the high transmission rate of SARS-CoV-2 remind us of the sad reality that our current understanding of the transmission, infectivity, and evolution of SARS-CoV-2 is unfortunately very limited. The major limitation is the lack of mechanistic understanding of viral-host cell interactions, viral regulation, protein-protein interactions, including antibody-antigen binding, protein-drug binding, host immune response, etc. This limitation will likely haunt the scientific community for a long time and have devastating consequences in combating COVID-19 and other pathogens. Notably, compared to long-cycle, high-cost, and safety-demanding molecular-level experiments, theoretical and computational studies are economical, speedy, and easy to perform. There exists a tsunami of literature on molecular modeling, simulation, and prediction of SARS-CoV-2 that has become impossible to be fully covered in a review. To provide the reader a quick update about the status of molecular modeling, simulation, and prediction of SARS-CoV-2, we present a comprehensive and systematic methodology-centered narrative in the nick of time. Aspects such as molecular modeling, Monte Carlo (MC) methods, structural bioinformatics, machine learning, deep learning, and mathematical approaches are included in this review. This review will be beneficial to researchers who are looking for ways to contribute to SARS-CoV-2 studies and those who are assessing the current status in the field.

preprint · 2021 · arXiv

MLIMC: Machine learning-based implicit-solvent Monte Carlo

Monte Carlo (MC) methods are important computational tools for molecular structure optimizations and predictions. When solvent effects are explicitly considered, MC methods become very expensive due to the large number of degrees of freedom associated with the water molecules and mobile ions. Alternatively, implicit-solvent MC can greatly reduce the computational cost by applying a mean-field approximation to solvent effects while maintaining the atomic detail of the target molecule. The two most popular implicit-solvent models are the Poisson-Boltzmann (PB) model and the Generalized Born (GB) model; the GB model is an approximation to the PB model but is much faster in simulation time. In this work, we develop a machine learning-based implicit-solvent Monte Carlo (MLIMC) method by combining the advantages of both implicit solvent models in accuracy and efficiency. Specifically, the MLIMC method uses a fast and accurate PB-based machine learning (PBML) scheme to compute the electrostatic solvation free energy at each step. We validate our MLIMC method by using a benzene-water system and a protein-water system. We show that the proposed MLIMC method has great advantages in speed and accuracy for molecular structure optimization and prediction.
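
As a rough illustration of where a learned energy model would plug into a Monte Carlo loop, here is a minimal Metropolis sketch. It is not the MLIMC implementation: the names metropolis_step, toy_energy, toy_propose and run are hypothetical, and the quadratic toy_energy merely stands in for the PBML-computed electrostatic solvation free energy mentioned in the abstract.

```python
import math
import random


def metropolis_step(state, energy, propose, kT, rng):
    """One Metropolis move: accept downhill moves, uphill with Boltzmann probability."""
    candidate = propose(state, rng)
    delta = energy(candidate) - energy(state)
    if delta <= 0 or rng.random() < math.exp(-delta / kT):
        return candidate
    return state


def toy_energy(x):
    # Assumption: a stand-in for an energy evaluated by a trained ML model.
    return (x - 1.0) ** 2


def toy_propose(x, rng):
    # Assumption: a small symmetric random displacement of a 1D coordinate.
    return x + rng.uniform(-0.5, 0.5)


def run(steps=2000, kT=0.05, seed=0):
    """Run a short chain from x = 5.0 and return the final coordinate."""
    rng = random.Random(seed)
    x = 5.0
    for _ in range(steps):
        x = metropolis_step(x, toy_energy, toy_propose, kT, rng)
    return x
```

Because the energy function is passed in as an argument, swapping the toy function for a call into a trained predictor would not change the sampling loop itself.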

preprint · 2020 · arXiv

Characterizing SARS-CoV-2 mutations in the United States

The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has been mutating since it was first sequenced in early January 2020. The genetic variants have developed into a few distinct clusters with different properties. Since the United States (US) has the highest number of virus-infected patients globally, it is essential to understand the US SARS-CoV-2. Using genotyping, sequence-alignment, time-evolution, $k$-means clustering, protein-folding stability, algebraic topology, and network theory, we reveal that the US SARS-CoV-2 has four substrains and that the five top US SARS-CoV-2 mutations were first detected in China (2 cases), Singapore (2 cases), and the United Kingdom (1 case). The next three top US SARS-CoV-2 mutations were first detected in the US. These eight top mutations belong to two disconnected groups. The first group, consisting of five concurrent mutations, is prevailing, while the other group, with three concurrent mutations, gradually fades out. Our analysis suggests that female immune systems are more active than those of males in responding to SARS-CoV-2 infections. We identify that one of the top mutations, 27964C$>$T-(S24L) on ORF8, has an unusually strong gender dependence. Based on the analysis of all mutations on the spike protein, we further uncover that three of four US SARS-CoV-2 substrains become more infectious. Our study calls for effective viral control and containment strategies in the US.

preprint · 2020 · arXiv

Decoding asymptomatic COVID-19 infection and transmission

Coronavirus disease 2019 (COVID-19) is continuously devastating public health and the world economy. One of the major challenges in controlling the COVID-19 outbreak is its asymptomatic infection and transmission, which are elusive and hard to defend against in most situations. The pathogenicity and virulence of asymptomatic COVID-19 remain mysterious. Based on the genotyping of 20656 Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) genome isolates, we reveal that asymptomatic infection is linked to the SARS-CoV-2 11083G>T mutation, i.e., leucine (L) to phenylalanine (F) substitution at residue 37 (L37F) of nonstructural protein 6 (NSP6). By analyzing the distribution of 11083G>T in various countries, we unveil that 11083G>T may correlate with the hypotoxicity of SARS-CoV-2. Moreover, we show a global decaying tendency of the 11083G>T mutation ratio, indicating that 11083G>T hinders SARS-CoV-2 transmission capacity. Sequence alignment found that both NSP6 and residue 37 neighborhoods are relatively conservative over a few coronaviral species, indicating their importance in regulating host cell autophagy to undermine innate cellular defense against viral infection. Using machine learning and topological data analysis, we demonstrate that mutation L37F has made NSP6 energetically less stable. The rigidity and flexibility index and several network models suggest that mutation L37F may have compromised the NSP6 function, leading to a relatively weak SARS-CoV-2 subtype. This assessment is in good agreement with our genotyping of SARS-CoV-2 evolution and transmission across various countries and regions over the past few months.

preprint · 2020 · arXiv

Decoding SARS-CoV-2 transmission, evolution and ramification on COVID-19 diagnosis, vaccine, and medicine

Tremendous effort has been given to the development of diagnostic tests, preventive vaccines, and therapeutic medicines for coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Much of this development has been based on the reference genome collected on January 5, 2020. Based on the genotyping of 6156 genome samples collected up to April 24, 2020, we report that SARS-CoV-2 has had an alarming 4459 mutations, which can be clustered into five subtypes. We introduce mutation ratio and mutation $h$-index to characterize the protein conservativeness and unveil that SARS-CoV-2 envelope protein, main protease, and endoribonuclease protein are relatively conservative, while SARS-CoV-2 nucleocapsid protein, spike protein, and papain-like protease are relatively non-conservative. In particular, the nucleocapsid protein has more than half its genes changed in the past few months, signaling devastating impacts on the ongoing development of COVID-19 diagnosis, vaccines, and drugs.
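
The abstract names the mutation ratio and mutation $h$-index without defining them, so the sketch below rests on assumed definitions: the ratio is taken as the number of unique mutations on a gene divided by the gene length, and the $h$-index as the largest h such that at least h distinct mutations have each been observed at least h times (by analogy with the bibliometric h-index). The function names and the toy counts are hypothetical, not data from the paper.

```python
def mutation_ratio(num_unique_mutations: int, gene_length: int) -> float:
    """Assumed definition: unique mutations per nucleotide of the gene."""
    if gene_length <= 0:
        raise ValueError("gene_length must be positive")
    return num_unique_mutations / gene_length


def mutation_h_index(occurrences: list[int]) -> int:
    """Assumed definition: largest h with at least h mutations seen >= h times each."""
    ranked = sorted(occurrences, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


# Toy observation counts for six distinct mutations on one gene (made up).
toy_counts = [10, 8, 5, 4, 3, 1]
toy_h = mutation_h_index(toy_counts)
toy_ratio = mutation_ratio(len(toy_counts), gene_length=300)
```

Under these assumptions, a higher ratio or h-index would mark a gene as less conservative.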

preprint · 2020 · arXiv

Generative network complex for the automated generation of druglike molecules

Current drug discovery is expensive and time-consuming. It remains a challenging task to create a wide variety of novel compounds that have desirable pharmacological properties and are cheaply available to low-income people. In this work, we develop a generative network complex (GNC) to generate new drug-like molecules based on multi-property optimization via gradient descent in the latent space of an autoencoder. In our GNC, both multiple chemical properties and similarity scores are optimized to generate and predict drug-like molecules with desired chemical properties. To further validate the reliability of the predictions, these molecules are reevaluated and screened by independent 2D fingerprint-based predictors to come up with a few hundred new drug candidates. As a demonstration, we apply our GNC to generate a large number of new BACE1 inhibitors, as well as thousands of novel alternative drug candidates for eight existing market drugs, including Ceritinib, Ribociclib, Acalabrutinib, Idelalisib, Dabrafenib, Macimorelin, Enzalutamide, and Panobinostat.

preprint · 2020 · arXiv

Host immune response driving SARS-CoV-2 evolution

The transmission and evolution of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) are of paramount importance to controlling and combating the coronavirus disease 2019 (COVID-19) pandemic. Currently, nearly 15,000 SARS-CoV-2 single mutations have been recorded, with great ramifications for the development of diagnostics, vaccines, antibody therapies, and drugs. However, little is known about SARS-CoV-2 evolutionary characteristics and general trends. In this work, we present a comprehensive genotyping analysis of existing SARS-CoV-2 mutations. We reveal that host immune response via APOBEC and ADAR gene editing gives rise to nearly 65% of recorded mutations. Additionally, we show that children under age five and the elderly may be at high risk from COVID-19 because of their overreacting to the viral infection. Moreover, we uncover that populations of Oceania and Africa react significantly more intensively to SARS-CoV-2 infection than those of Europe and Asia, which may explain why African Americans were shown to be at increased risk of dying from COVID-19, in addition to their high risk of getting sick from COVID-19 caused by systemic health and social inequities. Finally, our study indicates that for two viral genome sequences of the same origin, their evolution order may be determined from the ratio of mutation type C$>$T over T$>$C.
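
The C$>$T over T$>$C ratio mentioned at the end of the abstract is a simple count ratio. The sketch below shows one way it could be computed, assuming each mutation is recorded as a label of the form position plus substitution (for example 101C>T); the function name and the toy labels are hypothetical, and how the ratio is then used to order two sequences is not reproduced here.

```python
from collections import Counter


def substitution_ratio(mutations: list[str]) -> float:
    """Return count(C>T) / count(T>C) for labels such as '101C>T'."""
    # The last three characters of each label hold the substitution type.
    counts = Counter(label[-3:] for label in mutations)
    if counts["T>C"] == 0:
        raise ValueError("no T>C mutations; ratio undefined")
    return counts["C>T"] / counts["T>C"]


# Made-up labels for illustration only, not real genotyping data.
toy_labels = ["101C>T", "102C>T", "103C>T", "104C>T", "201T>C", "202T>C", "301A>G"]
toy_ratio = substitution_ratio(toy_labels)
```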

preprint · 2020 · arXiv

Mutations on COVID-19 diagnostic targets

Effective, sensitive, and reliable diagnostic reagents are of paramount importance for combating the ongoing coronavirus disease 2019 (COVID-19) pandemic at a time when there is neither a preventive vaccine nor a specific drug available for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). It would be an absolute tragedy if currently used diagnostic reagents are undermined in any manner. Based on the genotyping of 7818 SARS-CoV-2 genome samples collected up to May 1, 2020, we reveal that essentially all of the current COVID-19 diagnostic targets have had mutations. We further show that SARS-CoV-2 has the most devastating mutations on the targets of various nucleocapsid (N) gene primers and probes, which have been unfortunately used by countries around the world to diagnose COVID-19. Our findings explain what has seriously gone wrong with a specific diagnostic reagent made in China. To understand whether SARS-CoV-2 genes have mutated unevenly, we have computed the mutation ratio and mutation $h$-index of all SARS-CoV-2 genes, indicating that the N gene is the most non-conservative gene in the SARS-CoV-2 genome. Our findings enable researchers to target the most conservative SARS-CoV-2 genes and proteins for the design and development of COVID-19 diagnostic reagents, preventive vaccines, and therapeutic medicines.

preprint · 2020 · arXiv

Mutations strengthened SARS-CoV-2 infectivity

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infectivity is a major concern in coronavirus disease 2019 (COVID-19) prevention and economic reopening. However, rigorous determination of SARS-CoV-2 infectivity is essentially impossible owing to its continuous evolution with over 13752 single nucleotide polymorphism (SNP) variants in six different subtypes. We develop an advanced machine learning algorithm based on algebraic topology to quantitatively evaluate the binding affinity changes of SARS-CoV-2 spike glycoprotein (S protein) and host angiotensin-converting enzyme 2 (ACE2) receptor following the mutations. Based on mutation-induced binding affinity changes, we reveal that five out of six SARS-CoV-2 subtypes have become either moderately or slightly more infectious, while one subtype has weakened its infectivity. We find that SARS-CoV-2 is slightly more infectious than SARS-CoV according to computed S protein-ACE2 binding affinity changes. Based on a systematic evaluation of all possible 3686 future mutations on the S protein receptor-binding domain (RBD), we show that most likely future mutations will make SARS-CoV-2 more infectious. Combining sequence alignment, probability analysis, and binding affinity calculation, we predict that a few residues on the receptor-binding motif (RBM), i.e., 452, 489, 500, 501, and 505, have very high chances to mutate into significantly more infectious COVID-19 strains.

preprint · 2020 · arXiv

Repositioning of 8565 existing drugs for COVID-19

The coronavirus disease 2019 (COVID-19) pandemic caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has infected nearly 5 million people and led to over 0.3 million deaths. Currently, there is no specific anti-SARS-CoV-2 medication. New drug discovery typically takes more than ten years. Drug repositioning becomes one of the most feasible approaches for combating COVID-19. This work curates the largest available experimental dataset for SARS-CoV-2 or SARS-CoV main protease inhibitors. Based on this dataset, we develop validated machine learning models with relatively low root mean square error to screen 1553 FDA-approved drugs as well as 7012 other investigational or off-market drugs in DrugBank. We found that many existing drugs might be potent against SARS-CoV-2. The druggability of many potent SARS-CoV-2 main protease inhibitors is analyzed. This work offers a foundation for further experimental studies of COVID-19 drug repositioning.

preprint · 2020 · arXiv

Review of COVID-19 Antibody Therapies

Under the global health emergency caused by coronavirus disease 2019 (COVID-19), efficient and specific therapies are urgently needed. Compared with traditional small-molecular drugs, antibody therapies are relatively easy to develop and as specific as vaccines in targeting severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), and have thus attracted much attention in the past few months. This work reviews seven existing antibodies for SARS-CoV-2 spike (S) protein with three-dimensional (3D) structures deposited in the Protein Data Bank. Five antibody structures associated with SARS-CoV are evaluated for their potential in neutralizing SARS-CoV-2. The interactions of these antibodies with the S protein receptor-binding domain (RBD) are compared with those of angiotensin-converting enzyme 2 (ACE2) and RBD complexes. Because experimental binding affinities differ by orders of magnitude, we introduce topological data analysis (TDA), a variety of network models, and deep learning to analyze the binding strength and therapeutic potential of the aforementioned fourteen antibody-antigen complexes. The current COVID-19 antibody clinical trials, which are not limited to the S protein target, are also reviewed.

preprint · 2020 · arXiv

UMAP-assisted $K$-means clustering of large-scale SARS-CoV-2 mutation datasets

Coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has had a devastating worldwide effect. The understanding of evolution and transmission of SARS-CoV-2 is of paramount importance for COVID-19 control, combating, and prevention. Due to the rapid growth of both the number of SARS-CoV-2 genome sequences and the number of unique mutations, the phylogenetic analysis of SARS-CoV-2 genome isolates faces an emergent large-data challenge. We introduce a dimension-reduced $k$-means clustering strategy to tackle this challenge. We examine the performance and effectiveness of three dimension-reduction algorithms: principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE), and uniform manifold approximation and projection (UMAP). By using four benchmark datasets, we found that UMAP is the best-suited technique due to its stable, reliable, and efficient performance, its ability to improve clustering accuracy, especially for large Jaccard distance-based datasets, and its superior clustering visualization. The UMAP-assisted $k$-means clustering enables us to shed light on increasingly large datasets from SARS-CoV-2 genome isolates.
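
For readers unfamiliar with the Jaccard distance-based datasets the abstract refers to, the following sketch builds a pairwise Jaccard distance matrix. It assumes, for illustration, that each genome isolate is represented as a set of mutation labels; the function names and the toy isolates are hypothetical, and the subsequent dimension-reduction and $k$-means steps are not shown.

```python
def jaccard_distance(a: set[str], b: set[str]) -> float:
    """Jaccard distance: 1 - |A intersect B| / |A union B| (0.0 for two empty sets)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def distance_matrix(samples: list[set[str]]) -> list[list[float]]:
    """Symmetric matrix of pairwise Jaccard distances between samples."""
    return [[jaccard_distance(a, b) for b in samples] for a in samples]


# Made-up isolates, each described by the set of mutations it carries.
toy_isolates = [
    {"m1", "m2"},
    {"m2", "m3"},
    {"m4"},
]
toy_matrix = distance_matrix(toy_isolates)
```

Each row of such a matrix could serve as the feature vector of one isolate before dimension reduction, though that pairing is an assumption of this sketch rather than a detail stated in the abstract.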

preprint2020arXiv

Unveiling the molecular mechanism of SARS-CoV-2 main protease inhibition from 92 crystal structures

Currently, there are no effective antiviral drugs or vaccines for coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Due to its high conservation and low similarity with human genes, SARS-CoV-2 main protease (M$^{\text{pro}}$) is one of the most favorable drug targets. However, the current understanding of the molecular mechanism of M$^{\text{pro}}$ inhibition is limited by the lack of reliable binding affinity ranking and prediction for existing structures of M$^{\text{pro}}$-inhibitor complexes. This work integrates mathematics and deep learning (MathDL) to provide a reliable ranking of the binding affinities of 92 SARS-CoV-2 M$^{\text{pro}}$ inhibitor structures. We reveal that the Gly143 residue in M$^{\text{pro}}$ is the most attractive site to form hydrogen bonds, followed by Cys145, Glu166, and His163. We also identify 45 targeted covalent bonding inhibitors. Validation on the PDBbind v2016 core set benchmark shows that MathDL achieves the top performance, with a Pearson's correlation coefficient ($R_p$) of 0.858. Most importantly, MathDL is validated on a carefully curated SARS-CoV-2 inhibitor dataset with an averaged $R_p$ as high as 0.751, which supports the reliability of the present binding affinity prediction. The present binding affinity ranking, interaction analysis, and fragment decomposition offer a foundation for future drug discovery efforts.
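The validation metric quoted in this abstract, Pearson's correlation coefficient $R_p$, measures how linearly predicted binding affinities track experimental ones. A minimal sketch of the computation follows; the function name and the affinity values are illustrative assumptions and are not taken from the paper.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Centered sums of products and squares.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical experimental vs. predicted binding affinities (kcal/mol).
experimental = [-9.1, -7.4, -8.2, -6.5, -10.0]
predicted = [-8.7, -7.9, -8.0, -6.9, -9.4]
r_p = pearson_r(experimental, predicted)
```

A value of `r_p` near 1 means the predictions preserve the linear trend of the experimental affinities. It does not by itself bound the absolute error.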

preprint2019arXiv

A review of mathematical representations of biomolecules

Recently, machine learning (ML) has established itself in various worldwide benchmarking competitions in computational biology, including Critical Assessment of Structure Prediction (CASP) and Drug Design Data Resource (D3R) Grand Challenges. However, the intricate structural complexity and high ML dimensionality of biomolecular datasets obstruct the efficient application of ML algorithms in the field. In addition to data and algorithms, an efficient ML machinery for biomolecular predictions must include structural representation as an indispensable component. Mathematical representations that simplify the biomolecular structural complexity and reduce ML dimensionality have emerged as a prime winner in D3R Grand Challenges. This review is devoted to the recent advances in developing low-dimensional and scalable mathematical representations of biomolecules in our laboratory. We discuss three classes of mathematical approaches: algebraic topology, differential geometry, and graph theory. We elucidate how physical and biological challenges have guided the evolution and development of these mathematical apparatuses for massive and diverse biomolecular data. We focus the performance analysis in this review on protein-ligand binding predictions, although these methods have had tremendous success in many other applications, such as protein classification, virtual screening, and the predictions of solubility, solvation free energy, toxicity, partition coefficient, protein folding stability changes upon mutation, etc.

preprint2019arXiv

Are 2D fingerprints still valuable for drug discovery?

Recently, molecular fingerprints extracted from three-dimensional (3D) structures using advanced mathematics, such as algebraic topology, differential geometry, and graph theory, have been paired with efficient machine learning, especially deep learning algorithms, to outperform other methods in drug discovery applications and competitions. This raises the question of whether classical 2D fingerprints are still valuable in computer-aided drug discovery. This work considers 23 datasets associated with four typical problems, namely protein-ligand binding, toxicity, solubility, and partition coefficient, to assess the performance of eight 2D fingerprints. Advanced machine learning algorithms, including random forest, gradient boosted decision tree, single-task deep neural network, and multitask deep neural network, are employed to construct efficient 2D-fingerprint-based models. Additionally, appropriate consensus models are built to further enhance the performance of 2D-fingerprint-based methods. It is demonstrated that 2D-fingerprint-based models perform as well as the state-of-the-art 3D structure-based models for the predictions of toxicity, solubility, partition coefficient, and protein-ligand binding affinity based on only ligand information. However, 3D structure-based models outperform 2D-fingerprint-based methods in complex-based protein-ligand binding affinity predictions.

preprint2019arXiv

Evolutionary de Rham-Hodge method

The de Rham-Hodge theory is a landmark of 20$^\text{th}$-century mathematics and has had a great impact on mathematics, physics, computer science, and engineering. This work introduces an evolutionary de Rham-Hodge method to provide a unified paradigm for the multiscale geometric and topological analysis of evolving manifolds constructed from a filtration, which induces a family of evolutionary de Rham complexes. While the present method can be easily applied to closed manifolds, emphasis is given to the more challenging compact manifolds with 2-manifold boundaries, which require appropriate analysis and treatment of boundary conditions on differential forms to maintain proper topological properties. Three sets of unique evolutionary Hodge Laplacian operators are proposed to generate three sets of topology-preserving singular spectra, for which the multiplicities of zero eigenvalues correspond exactly to the persistent Betti numbers of dimensions 0, 1, and 2. Additionally, three sets of non-zero eigenvalues further reveal both topological persistence and geometric progression during the manifold evolution. Extensive numerical experiments are carried out via the discrete exterior calculus to demonstrate the utility of the proposed method for data representation and shape analysis.
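The link between zero eigenvalues and Betti numbers has a simple combinatorial analogue that may help fix ideas. For a finite simplicial complex, the kernel of the $k$-th combinatorial Laplacian has dimension $\dim C_k - \operatorname{rank}\partial_k - \operatorname{rank}\partial_{k+1}$, which is the $k$-th Betti number. The sketch below computes that count with exact rational arithmetic on a toy two-step filtration. It is an illustrative assumption-laden example on simplicial complexes, not the paper's discrete exterior calculus method on manifolds with boundary, and all function names are hypothetical.

```python
from fractions import Fraction


def boundary_matrix(k_simplices, faces):
    """Boundary operator: columns are k-simplices, rows are (k-1)-simplices."""
    index = {f: i for i, f in enumerate(faces)}
    mat = [[Fraction(0)] * len(k_simplices) for _ in faces]
    for j, s in enumerate(k_simplices):
        for i in range(len(s)):
            face = s[:i] + s[i + 1:]  # drop the i-th vertex
            mat[index[face]][j] = Fraction((-1) ** i)
    return mat


def rank(mat):
    """Matrix rank by Gaussian elimination over the rationals."""
    rows = [r[:] for r in mat]
    ncols = len(rows[0]) if rows else 0
    rk = 0
    for c in range(ncols):
        pivot = next((r for r in range(rk, len(rows)) if rows[r][c] != 0), None)
        if pivot is None:
            continue
        rows[rk], rows[pivot] = rows[pivot], rows[rk]
        for r in range(rk + 1, len(rows)):
            if rows[r][c] != 0:
                f = rows[r][c] / rows[rk][c]
                rows[r] = [a - f * b for a, b in zip(rows[r], rows[rk])]
        rk += 1
    return rk


def betti_numbers(simplices_by_dim):
    """simplices_by_dim[k] lists the k-simplices as sorted vertex tuples."""
    top = len(simplices_by_dim) - 1
    ranks = [0]  # the boundary of a vertex is zero
    for k in range(1, top + 1):
        ranks.append(rank(boundary_matrix(simplices_by_dim[k],
                                          simplices_by_dim[k - 1])))
    ranks.append(0)  # no simplices above the top dimension
    return [len(simplices_by_dim[k]) - ranks[k] - ranks[k + 1]
            for k in range(top + 1)]


vertices = [(0,), (1,), (2,)]
edges = [(0, 1), (0, 2), (1, 2)]
hollow = betti_numbers([vertices, edges])               # triangle outline
filled = betti_numbers([vertices, edges, [(0, 1, 2)]])  # triangle with face
```

Here `hollow` has one connected component and one loop, and the loop disappears in `filled` once the 2-simplex enters the filtration. This is the kind of birth-and-death information that persistent Betti numbers record across a filtration.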