Discover the detailed methods utilized by TopLapGBT for predicting protein solubility changes, including spectral graph theory, persistent Laplacian, and transformer features. Gain insights into the computational framework driving this groundbreaking model in molecular biology research.
Authors:
(1) JunJie Wee, Department of Mathematics, Michigan State University;
(2) Jiahui Chen, Department of Mathematical Sciences, University of Arkansas;
(3) Kelin Xia, Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University & [email protected];
(4) Guo-Wei Wei, Department of Mathematics, Michigan State University, Department of Biochemistry and Molecular Biology, Michigan State University, Department of Electrical and Computer Engineering, Michigan State University & [email protected].
In this section, we describe the key mathematical and computational foundations of this study. Specifically, we cover spectral graph theory, simplicial complexes, and persistent Laplacian methods, highlighting their role in capturing the topological and spectral properties used to characterize proteins. We also discuss the machine learning and deep learning methods used to process, analyze, and interpret these features, particularly on test datasets and in validation settings.
5.1 Persistent Laplacian characterization of proteins
Simplicial complex. A simplicial complex is a set of simplices and generalises graph networks to higher dimensions [44, 45, 46, 47]. Every simplex is a finite set of vertices, which can be interpreted as the atoms in a protein structure. A simplex can be a point (0-simplex), an edge (1-simplex), a triangle (2-simplex), a tetrahedron (3-simplex), or, in higher dimensions, a k-simplex. In other words, a k-simplex $\sigma^k = \{v_0, v_1, \cdots, v_k\}$ is the convex hull formed by $k + 1$ affinely independent points $v_0, v_1, \cdots, v_k$ as follows,
$$\sigma^k = \left\{ \lambda_0 v_0 + \lambda_1 v_1 + \cdots + \lambda_k v_k \;\middle|\; \sum_{i=0}^{k} \lambda_i = 1;\ 0 \le \lambda_i \le 1 \text{ for all } i \right\}.$$
A geometric simplicial complex K is a finite set of geometric simplices that satisfies two essential conditions. First, any face of a simplex from K is also in K. Second, the intersection of any two simplices in K is either empty or a face shared by both. Commonly used methods to construct simplicial complexes are the Čech complex, Vietoris-Rips complex, Alpha complex, Clique complex, Cubic complex, and Morse complex [44, 45, 46, 47].
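As an illustration of one of these constructions, the following is a minimal sketch of a Vietoris-Rips complex built from a point cloud, assuming the convention that a set of vertices spans a simplex when all pairwise distances are at most a chosen scale. The function names, the `scale` parameter and the toy coordinates are hypothetical and are not taken from the original work.

```python
from itertools import combinations
from math import dist

def rips_complex(points, scale, max_dim=2):
    """Return the simplices of a Vietoris-Rips complex as sorted index tuples.

    Assumed convention: vertices span a simplex when every pairwise
    Euclidean distance is <= scale.
    """
    n = len(points)
    simplices = [(i,) for i in range(n)]
    for k in range(1, max_dim + 1):
        for combo in combinations(range(n), k + 1):
            if all(dist(points[i], points[j]) <= scale
                   for i, j in combinations(combo, 2)):
                simplices.append(combo)
    return simplices

def is_closed_under_faces(simplices):
    """Check the first condition in the text: every face of a simplex is in K."""
    present = set(simplices)
    return all(face in present
               for s in simplices
               for r in range(1, len(s))
               for face in combinations(s, r))

# Hypothetical toy coordinates (not protein data): corners of a unit square.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
K = rips_complex(square, scale=1.0)
```

At scale 1.0 only the four sides of the square become edges, because the diagonals have length √2; raising the scale to 1.5 adds both diagonals and all four triangles.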
In the case $k = 0$, $L_0 = B_1 B_1^\top$, since $\partial_0$ is the zero map.
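The $k = 0$ case can be checked numerically. The sketch below assumes the standard definition $L_k = B_{k+1} B_{k+1}^\top + B_k^\top B_k$, where $B_k$ is the matrix of the boundary operator $\partial_k$, and uses a hypothetical toy graph. The helper name `boundary_matrix_1` is illustrative only.

```python
import numpy as np

def boundary_matrix_1(n_vertices, edges):
    """Matrix B1 of the boundary map from edges to vertices.

    Column j encodes the oriented edge (u, v), u < v, as v - u.
    """
    B1 = np.zeros((n_vertices, len(edges)))
    for j, (u, v) in enumerate(edges):
        B1[u, j] = -1.0
        B1[v, j] = 1.0
    return B1

# Hypothetical toy graph: a path 0-1-2 plus a separate edge 3-4.
edges = [(0, 1), (1, 2), (3, 4)]
B1 = boundary_matrix_1(5, edges)

# With the zero map for the 0-th boundary, L0 reduces to B1 B1^T,
# which is the usual graph Laplacian (degree matrix minus adjacency matrix).
L0 = B1 @ B1.T
eigenvalues = np.linalg.eigvalsh(L0)

# The number of zero eigenvalues equals the number of connected components.
n_harmonic = int(np.sum(np.abs(eigenvalues) < 1e-8))
```

For this graph, `n_harmonic` is 2, matching its two connected components.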
Persistent Laplacian. Persistent Laplacians (PLs) were first introduced by integrating the graph Laplacian with multiscale filtration [25]. Analyzing the spectra of the k-combinatorial Laplacian matrix yields both topological and geometric information (i.e. the connectivity and robustness of simple graphs). However, this method is free of metrics and coordinates, so it carries too little topological and geometric information to describe a single configuration.
Therefore, PL was extended to simplicial complexes: a sequence of simplicial complexes from a filtration process generates persistent Laplacians, an approach largely inspired by persistent homology and by earlier work on multiscale graphs. For the rest of this section, we focus on the construction of PL. First, a k-combinatorial Laplacian matrix is symmetric and positive semi-definite, so its eigenvalues are all real and non-negative. The multiplicity of the zero eigenvalue (the harmonic spectra) reveals the topological information, while the geometric information is preserved in the non-harmonic spectra.
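The split between harmonic and non-harmonic spectra can be illustrated on a very small example. The sketch below again assumes the standard definition $L_k = B_{k+1} B_{k+1}^\top + B_k^\top B_k$ and compares a hollow triangle with a filled one. The helper names and the tolerance are hypothetical choices.

```python
import numpy as np

def combinatorial_laplacian(B_k, B_k_plus_1):
    """Assumed standard form: L_k = B_{k+1} B_{k+1}^T + B_k^T B_k."""
    return B_k_plus_1 @ B_k_plus_1.T + B_k.T @ B_k

def split_spectrum(L, tol=1e-8):
    """Return (harmonic, non_harmonic) eigenvalues of a symmetric matrix."""
    eig = np.linalg.eigvalsh(L)
    return eig[np.abs(eig) < tol], eig[np.abs(eig) >= tol]

# Edges ordered (0,1), (0,2), (1,2); rows of B1 are vertices 0, 1, 2.
B1 = np.array([[-1.0, -1.0,  0.0],
               [ 1.0,  0.0, -1.0],
               [ 0.0,  1.0,  1.0]])
# Boundary of the triangle (0,1,2) is (1,2) - (0,2) + (0,1).
B2_filled = np.array([[1.0], [-1.0], [1.0]])
B2_hollow = np.zeros((3, 0))  # no 2-simplices

h_hollow, nh_hollow = split_spectrum(combinatorial_laplacian(B1, B2_hollow))
h_filled, nh_filled = split_spectrum(combinatorial_laplacian(B1, B2_filled))
```

The hollow triangle has one zero eigenvalue in $L_1$, reflecting its single loop, while filling the triangle removes it. In both cases all eigenvalues are real and non-negative.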
This nested sequence of simplicial complexes induces a family of chain complexes
To illustrate the difference between PL and PH, Figure 5 shows a point cloud, basic simplices, a filtration process, and a comparison between the persistent Laplacian and persistent homology barcodes of 13 points. Figure 5(c) shows the different stages of a Rips filtration process for the 13 points. Figure 5(d) shows the persistent homology barcodes (in blue) and the persistent non-harmonic spectra (in red). It can be seen that the non-harmonic spectra capture the additional homotopic shape evolution that persistent homology misses in the later part of the filtration process.
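The same effect can be reproduced on a toy example. The sketch below is a simplification: it tracks only the ordinary 0-dimensional Laplacian of each complex along a Rips filtration, not the full persistent Laplacian between two filtration steps, and it uses four hypothetical points on a line instead of the 13 points of Figure 5.

```python
import numpy as np
from itertools import combinations

def graph_laplacian(coords, scale):
    """0-dimensional Laplacian of the Rips graph at a given scale (1D points)."""
    n = len(coords)
    L0 = np.zeros((n, n))
    for i, j in combinations(range(n), 2):
        if abs(coords[i] - coords[j]) <= scale:
            L0[i, i] += 1.0
            L0[j, j] += 1.0
            L0[i, j] -= 1.0
            L0[j, i] -= 1.0
    return L0

def spectral_summary(coords, scales, tol=1e-8):
    """For each scale, return (scale, number of zero eigenvalues,
    smallest non-harmonic eigenvalue)."""
    rows = []
    for s in scales:
        eig = np.linalg.eigvalsh(graph_laplacian(coords, s))
        non_harmonic = eig[eig > tol]
        betti0 = int(len(eig) - len(non_harmonic))
        smallest = float(non_harmonic.min()) if len(non_harmonic) else 0.0
        rows.append((s, betti0, smallest))
    return rows

# Hypothetical toy point cloud on a line.
coords = [0.0, 1.0, 2.0, 4.0]
summary = spectral_summary(coords, [1.0, 2.0, 3.0, 4.0])
```

From scale 2.0 onward the number of zero eigenvalues stays at 1, so a 0-dimensional barcode shows no further change, while the smallest non-harmonic eigenvalue keeps increasing as edges are added.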
5.2 Persistent Laplacian descriptors
where $D_E(\cdot, \cdot)$ is the Euclidean distance between the two atoms and $\mathrm{Loc}(\cdot)$ refers to the atom's location, which is either in the mutation site or in the rest of the protein. Here, we construct two types of simplicial complexes in our PL computation, namely the Vietoris-Rips complex (VC) and the Alpha complex (AC), which are used to characterize the first-order interactions and the higher-order patterns, respectively. To capture and characterize different types of atom-atom interactions, we generate the PL on different atom subsets by selecting one type of atom in the mutation site and one type of atom in the rest of the protein. Different types of atom-atom pairs characterize different interactions in proteins. For example, interactions generated from carbon atoms are associated with hydrophobic interactions. Similarly, interactions between nitrogen and oxygen atoms correlate with hydrophilic interactions and/or hydrogen bonds. Both types of interactions are illustrated in Figure 6. Interactive PLs can reveal additional details about bonding interactions and offer a distinct representation of molecular interactions in proteins.
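The equation that the opening "where" refers to is not reproduced in this text. The sketch below therefore assumes one plausible site-specific form: the interactive distance equals the Euclidean distance when two atoms lie in different locations and is infinite when they share a location, so that only mutation-site-to-rest pairs can form edges. The element list (C, N, O), the dictionary layout and all names are hypothetical.

```python
from itertools import product
from math import dist, inf

# Assumed element types; three types would give nine element pairs.
ELEMENTS = ("C", "N", "O")

def interactive_distance(atom_a, atom_b):
    """Assumed form of D_I: Euclidean across locations, infinite within one."""
    if atom_a["loc"] == atom_b["loc"]:
        return inf
    return dist(atom_a["xyz"], atom_b["xyz"])

def select_subset(atoms, site_element, rest_element):
    """Keep one element type in the mutation site and one in the rest."""
    return [a for a in atoms
            if (a["loc"] == "site" and a["element"] == site_element)
            or (a["loc"] == "rest" and a["element"] == rest_element)]

# Hypothetical toy atoms, not taken from any real structure.
atoms = [
    {"element": "C", "loc": "site", "xyz": (0.0, 0.0, 0.0)},
    {"element": "C", "loc": "site", "xyz": (1.5, 0.0, 0.0)},
    {"element": "N", "loc": "rest", "xyz": (0.0, 3.0, 4.0)},
    {"element": "O", "loc": "rest", "xyz": (2.0, 0.0, 0.0)},
]
element_pairs = list(product(ELEMENTS, repeat=2))
cn_subset = select_subset(atoms, "C", "N")
```

Feeding such a distance into a Rips construction would connect only atoms from opposite sides of the site/rest split, which is one way to restrict the complex to interactions between the mutation site and its environment.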
For zero dimensions, we consider both the harmonic and non-harmonic spectra information for each persistent Laplacian. Filtration using the Rips complex with the $D_I$ distance is used. The 0-dimensional PL features are generated from 0 Å to 6 Å with a 0.5 Å grid size. For the non-harmonic spectra information, we count the number of non-harmonic spectra and calculate seven statistical values of the non-harmonic spectra: sum, minimum, maximum, mean, standard deviation, variance and the sum of squared eigenvalues. This generates eight statistical values for each of the nine atomic pairs. Therefore, the dimension of 0-dimensional PL features for a protein is 72. In total, the 0-dimensional PL-based feature size after concatenating features at different dimensions for wild type and mutant is 1872.
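The eight values per spectrum can be sketched as follows. The use of population (rather than sample) standard deviation and variance, the zero tolerance, and the all-zero fallback for an empty spectrum are assumptions, as is the function name. The final arithmetic reflects one reading of the totals: 72 features at each of the 13 grid values from 0 Å to 6 Å, for wild type and mutant.

```python
from statistics import mean, pstdev, pvariance

def nonharmonic_statistics(eigenvalues, tol=1e-8):
    """Count plus seven statistics of the non-harmonic (nonzero) eigenvalues."""
    nonzero = [x for x in eigenvalues if x > tol]
    if not nonzero:
        return [0.0] * 8  # assumed fallback when no non-harmonic spectra exist
    return [
        float(len(nonzero)),          # count
        sum(nonzero),                 # sum
        min(nonzero),                 # minimum
        max(nonzero),                 # maximum
        mean(nonzero),                # mean
        pstdev(nonzero),              # standard deviation (population, assumed)
        pvariance(nonzero),           # variance (population, assumed)
        sum(x * x for x in nonzero),  # sum of squared eigenvalues
    ]

feats = nonharmonic_statistics([0.0, 0.0, 1.0, 2.0, 3.0])

# Dimension bookkeeping under the reading described above.
GRID = [0.5 * i for i in range(13)]   # 0, 0.5, ..., 6 (in Å)
per_grid_value = 8 * 9                # eight values for each of nine atomic pairs
total = per_grid_value * len(GRID) * 2  # wild type and mutant
```

Under these assumptions `per_grid_value` is 72 and `total` is 1872, which agrees with the sizes quoted in the text.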
For one or two dimensions, we perform the filtration using the Alpha complex with the $D_E$ distance. The limited number of atoms in the local protein structure can create only a few high-dimensional simplices, resulting in minimal alterations in shape. As a result, it suffices to consider features from only the harmonic spectra of persistent Laplacians by coding the topological invariants for the high-dimensional interactions. Using GUDHI [51], the persistence of the harmonic spectra can be represented by persistent barcodes. The topological feature vectors are generated by computing the statistics of bar lengths, births and deaths. Bars shorter than 0.1 Å are excluded as they do not exhibit any clear physical meaning. The remaining bars are then used for computing the statistics: (1) sum, maximum and mean of the bar lengths; (2) minimum and maximum of the birth values; (3) minimum and maximum of the death values. Each set of point clouds leads to a seven-dimensional vector. These features are calculated on nine single atomic pairs and one heavy atom pair. The dimension of one- and two-dimensional PL feature vectors for a protein is 140. In total, the higher-dimensional PL-based feature size after concatenating features at different dimensions for wild type, mutant and their difference is 420.
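The seven-dimensional barcode vector can be sketched without any topology library, given a list of finite (birth, death) pairs. The function name, the all-zero fallback for an empty barcode and the toy bars are hypothetical, and the way the barcodes themselves are obtained (for example from GUDHI) is left out.

```python
def barcode_features(bars, min_length=0.1):
    """Seven statistics of a barcode given as finite (birth, death) pairs.

    Bars shorter than min_length (in Å) are dropped first.
    """
    kept = [(b, d) for b, d in bars if d - b >= min_length]
    if not kept:
        return [0.0] * 7  # assumed fallback for an empty barcode
    lengths = [d - b for b, d in kept]
    births = [b for b, _ in kept]
    deaths = [d for _, d in kept]
    return [
        sum(lengths), max(lengths), sum(lengths) / len(lengths),
        min(births), max(births),
        min(deaths), max(deaths),
    ]

# Hypothetical toy barcode; the first bar is shorter than 0.1 Å and is dropped.
bars = [(1.0, 1.05), (1.0, 2.0), (2.0, 4.0)]
vec = barcode_features(bars)

# Dimension bookkeeping from the text: 7 values, 10 atom pairs, 2 dimensions.
per_protein = 7 * 10 * 2
total = per_protein * 3  # wild type, mutant and their difference
```

Here `per_protein` is 140 and `total` is 420, matching the sizes quoted in the text.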
5.3 Persistent Homology
Persistent homology is part of the harmonic spectra of PL. The homology groups in PH illustrate the persistence of topological invariants, hence providing the harmonic spectral information in PL. The site- and element-specific PH features are generated in a similar way to those of PL. As with PL, a filtration construction is employed for PH. For the zero dimension, the filtration parameter is discretized into equally spaced bins, namely [0, 0.5], (0.5, 1], · · · , (5.5, 6] Å. The death values of the bars are summed in each bin, resulting in 12 × 18 features.
For each bin, we count the number of persistent bars, resulting in a nine-dimensional vector for each point cloud. Similarly, this is performed for each of the nine single atomic pairs. Hence, the dimension of PH features for a protein is 216. For one or two dimensions, the identical featurization from the statistics of persistent bars in PH is used. The PH embedding combines the features at different dimensions as described above and concatenates them for wild type, mutant and their difference, resulting in a 648-dimensional vector.
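The binning of 0-dimensional death values can be sketched as below, returning both the per-bin counts and the per-bin sums since the text mentions each. The function name is hypothetical, and the handling of values above 6 Å (they are ignored) is an assumption.

```python
from math import ceil

def bin_death_values(deaths, width=0.5, upper=6.0):
    """Bin death values into [0, w], (w, 2w], ..., (upper - w, upper].

    Returns (counts, sums) with one entry per bin. Values outside
    [0, upper] are ignored (assumed behaviour).
    """
    n_bins = round(upper / width)
    counts = [0] * n_bins
    sums = [0.0] * n_bins
    for d in deaths:
        if d < 0 or d > upper:
            continue
        # Right-closed bins; the first bin also contains 0.
        idx = max(ceil(d / width) - 1, 0)
        counts[idx] += 1
        sums[idx] += d
    return counts, sums

# Hypothetical toy death values in Å.
counts, sums = bin_death_values([0.0, 0.5, 0.75, 1.0, 6.0, 7.0])

# Size check from the text: 216 features per protein, three concatenated parts.
total = 216 * 3
```

With the default settings there are 12 bins, and `total` evaluates to 648, the embedding size quoted in the text.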
5.4 Transformer Features
Recently, there have been significant advances in modelling protein properties using large-scale protein transformer models trained on hundreds of millions of sequences. These models, such as ESM [33] (evolutionary scale modeling) and ProtTrans [34, 35], have demonstrated impressive performance. Moreover, hybrid fine-tuning approaches that leverage both local and global evolutionary data have been shown to enhance these models further. For instance, eUniRep is an improved LSTM-based UniRep model obtained through fine-tuning with knowledge extracted from local multiple sequence alignments (MSAs). Additionally, the ESM model can be fine-tuned using either downstream task data or local MSAs. In our research, we employed the ESM-1b transformer. This variant was trained on a dataset of 250 million sequences using a masked filling procedure and has an architecture comprising 34 layers with 650 million parameters. The ESM transformer's primary role in our work was to generate sequence embeddings. At each layer, the ESM model encodes a sequence of length L into a matrix of size 1,280×L, excluding the start and terminal tokens. For our study, we used the sequence representation from the final (34th) layer and averaged it along the sequence length axis, resulting in a 1,280-component vector.
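The pooling step can be sketched independently of the model. The example below assumes the final-layer output is available as an array of shape (L + 2, 1280), with the start and terminal tokens in the first and last rows. Random numbers stand in for real ESM output, and the function name is hypothetical.

```python
import numpy as np

def mean_pool_embedding(token_embeddings):
    """Average per-residue embeddings into one fixed-length vector.

    token_embeddings: array of shape (L + 2, d) whose first and last rows
    are the start and terminal tokens (assumed layout).
    """
    residues = token_embeddings[1:-1]  # drop start and terminal tokens
    return residues.mean(axis=0)       # average over the sequence axis

# Placeholder data standing in for a final-layer output; not real embeddings.
rng = np.random.default_rng(0)
seq_len, dim = 7, 1280
fake_layer_output = rng.normal(size=(seq_len + 2, dim))
vector = mean_pool_embedding(fake_layer_output)
```

Averaging over the length axis gives a vector whose size does not depend on the sequence length, which allows it to be concatenated with the fixed-size topological features.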
5.5 Performance Metrics
PPV and NPV assess the proportions of true positives and true negatives among the predicted results for each solubility class. They are computed from TP, TN, FP and FN, which denote the numbers of true positives, true negatives, false positives and false negatives for that class. For each solubility class, PPV and NPV can be computed by:
$$\mathrm{PPV} = \frac{TP}{TP + FP}, \qquad \mathrm{NPV} = \frac{TN}{TN + FN}.$$
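A minimal sketch of the per-class computation, using the standard one-vs-rest definitions PPV = TP / (TP + FP) and NPV = TN / (TN + FN). The class labels -1, 0 and 1 and the toy predictions are hypothetical, and returning NaN for an empty denominator is an assumed convention.

```python
def ppv_npv_per_class(y_true, y_pred, classes):
    """Return {class: (PPV, NPV)} treating each class as positive in turn."""
    results = {}
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        tn = sum(t != c and p != c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        ppv = tp / (tp + fp) if tp + fp else float("nan")
        npv = tn / (tn + fn) if tn + fn else float("nan")
        results[c] = (ppv, npv)
    return results

# Hypothetical labels for three solubility classes.
y_true = [-1, -1, 0, 0, 1, 1]
y_pred = [-1, 0, 0, 0, 1, -1]
scores = ppv_npv_per_class(y_true, y_pred, classes=[-1, 0, 1])
```

For this toy input the class labelled -1 has PPV 0.5 and NPV 0.75.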