Simultaneous clustering analysis with molecular docking in network pharmacology for type 2 antidiabetic compounds

Databases of drug compounds and human proteins play a very important role in identifying protein targets and compounds in drug discovery. Recently, a network pharmacology approach was established by updating the research paradigm from the current “one disease-one target-one drug” to a new “drug-target-disease network”. Ligand-protein interactions can be analyzed quantitatively using simultaneous clustering and molecular docking. The docking method offers the ability to quickly and cheaply predict the ligand-protein binding free energy (∆G) in structure-based virtual screening. Meanwhile, simultaneous clustering was used to find subgroups of compounds that exhibit a high correlation with subgroups of target proteins. This study focuses on the interaction between 306 compounds from medicinal plants (brotowali Tinospora crispa, ginger Zingiber officinale, pare Momordica charantia, sembung Blumea balsamifera) and synthetic drugs (FDA-approved), and the 21 significant human proteins associated with type 2 diabetes. We found that brotowali (B018), sembung (S031), pare (P231), and ginger (J036, J033) were close to the synthetic drugs and could possibly be developed as antidiabetic drug candidates. Likewise, the proteins AKT1, WFS1, APOE, EP300, PTH, GCG, and UBC, which cluster together and have a high association with INS, can be seen as target proteins that play a role in type 2 diabetes.

Type 2 diabetes is a chronic disease characterized by the body's inability to use the insulin it produces effectively. The World Health Organization (2016) reported that the number of people with diabetes rose from 108 million in 1980 to 422 million in 2014. Diabetes is one of the leading causes of death in the world, and it has been rising more rapidly in low- and middle-income countries such as Indonesia. Indonesia ranked among the world's top ten countries, with about 10 million adults with diabetes in 2015 (International Diabetes Federation 2015), and this number was estimated to reach 21.3 million by 2030 (World Health Organization 2016). This suggests that genetic risk factors, obesity, unhealthy diet, and physical inactivity have also increased. Diabetes now occurs not only in adults but also in children and adolescents. Thus, diabetes needs to be handled seriously.
From a metabolic perspective, a disease occurs because protein function is impaired. To restore normal protein function, the disease needs to be treated with drugs containing chemical compounds that can influence or inhibit the activity of proteins or target networks. In recent years, a network pharmacology approach was established by updating the research paradigm from the current "one disease-one target-one drug" to a new "drug-target-disease network" (Yang et al. 2013). Therefore, determining the compounds that target the proteins associated with a certain disease is crucial to understanding the molecular mechanisms in drug design.
There are several methods to investigate drug candidates at the molecular level. One of the most commonly used computational (in silico) techniques is molecular docking. Molecular docking offers the ability to predict the ligand-protein interaction quickly and cheaply in structure-based virtual screening, which results in binding free energy (∆G) scores. The ∆G scores show the bond strength between molecules, or the magnitude of the ligand-protein interaction. Lalitha and Sripathi (2011) have shown, using molecular docking, that xylitol can be used as an antidiabetic agent. However, compounds isolated from medicinal plants are safer and more appropriate for metabolic and degenerative diseases (Greer et al. 1994). Thus, in this study we use ligands not only from synthetic drugs but also from medicinal plants associated with type 2 diabetes.
The ligand-protein interaction can be analyzed quantitatively by a clustering method in order to find subgroups of compounds and subgroups of target proteins that exhibit a high correlation in two-way data analysis. Most of the standard clustering literature focuses on one-sided clustering algorithms, but in bioinformatics, clustering in both dimensions makes it possible to deal with sparse and high-dimensional data matrices. Madeira and Oliveira (2004) presented a survey of simultaneous clustering algorithms for biological data. Simultaneous clustering avoids some limitations of the standard clustering approach. It is more robust and more informative than standard clustering, as it involves rows and columns together. Thus, this study focuses on the interaction between the compounds of medicinal plants and synthetic drugs (FDA-approved) and the human proteins associated with type 2 diabetes, using simultaneous clustering with a molecular docking approach.
Rifai et al. Indonesian Journal of Biotechnology 22(1), 2017, 43-48

Molecular docking
Molecular docking is a method that predicts the most favorable orientation of a ligand when interacting with a protein to form a stable complex (Nogara et al. 2015). Ligand-protein docking is an important type of molecular docking in modern structure-based drug design (Huang and Zou 2010). There are two essential components in the ligand-protein docking method, namely the search algorithm and the scoring function. The search algorithm is responsible for searching through different ligand conformations and orientations (poses) within a given target protein. The scoring function is responsible for estimating the binding affinities of the generated poses, ranking them, and identifying the most favorable binding modes of the ligand to the given target.
Before docking, the drug-likeness of each ligand was assessed with the Lipinski's Rule parameters: hydrogen bond donors ≤ 5, hydrogen bond acceptors ≤ 10, molecular weight ≤ 500 g/mol, and partition coefficient logP ≤ 5 (Lipinski et al. 2001). Molecules violating more than one of these rules may be discarded because they are likely to have problems with bioavailability. Ligand geometry was optimized using the wash procedure to improve the structure of the ligand and the positions of hydrogen atoms. Ligand energy was minimized using a modified Merck Molecular Forcefield 94 (MMFF94) with a root mean square (RMS) gradient of 0.001 kcal/(mol·Å). Protein geometry was optimized by adding polar hydrogens, protonation, and partial charges. Protein energy was minimized using the Merck Molecular Forcefield 94x (MMFF94x) with gas-phase solvation and fixed charges, with an RMS gradient of 0.05 kcal/(mol·Å). The ligand-protein docking process used Triangle Matcher placement with retain 5. The scoring function was London dG, and refinement used Forcefield with retain 1.
Molecular docking produces a binding free energy (∆G). Thermodynamically, the ligand-protein interaction occurs when ∆G < 0. The smaller (more negative) the value of ∆G, the more stable the ligand-protein bond and the stronger the ligand-protein interaction.
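The drug-likeness screen described above can be sketched in a few lines. This is a minimal illustration, not the procedure used in the study: the `Descriptors` class, the function names, and the three example ligands with their descriptor values are all hypothetical, and the descriptors are assumed to have been computed beforehand by a cheminformatics tool.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed descriptors for one ligand (example values are hypothetical)."""
    name: str
    h_donors: int
    h_acceptors: int
    mol_weight: float  # g/mol
    logp: float


def lipinski_violations(d: Descriptors) -> int:
    """Count how many of the four Lipinski thresholds are exceeded."""
    checks = [
        d.h_donors > 5,
        d.h_acceptors > 10,
        d.mol_weight > 500.0,
        d.logp > 5.0,
    ]
    return sum(checks)


def passes_filter(d: Descriptors, max_violations: int = 1) -> bool:
    """Keep a ligand unless it violates more than `max_violations` rules."""
    return lipinski_violations(d) <= max_violations


# Hypothetical ligands, for illustration only.
ligands = [
    Descriptors("ligand_A", 2, 5, 320.4, 2.1),   # no violations
    Descriptors("ligand_B", 6, 9, 480.0, 4.0),   # one violation (donors)
    Descriptors("ligand_C", 7, 12, 650.2, 3.3),  # three violations
]

kept = [d.name for d in ligands if passes_filter(d)]
print(kept)
```

With the default `max_violations=1`, a ligand is dropped only when it breaks more than one rule, which matches the criterion stated in the text.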

Simultaneous clustering
Simultaneous clustering performs clustering in the two dimensions simultaneously (Charrad and Ahmed 2011). Simultaneous clustering, also called bi-clustering, co-clustering, two-way clustering, or block clustering, is an important technique for finding sub-matrices. The sub-matrices are subgroups of rows and subgroups of columns that exhibit a high correlation in two-way data analysis. In this study, a singular value decomposition (SVD) approach was applied for simultaneous clustering analysis. If X is an n × p matrix with n observations in rows and p variables in columns, then the SVD is (Jolliffe 2002)
$$X = U L A^{T} \qquad (1)$$
where U and A are n × r and p × r matrices with orthonormal columns ($U^{T}U = A^{T}A = I_r$), L is an r × r diagonal matrix, and r is the rank of X.
The SVD used for dimension reduction is related to PCA in several respects. The columns of the matrices U and A are eigenvectors of the matrices $XX^{T}$ and $X^{T}X$, respectively, and the r non-negative diagonal entries of L, arranged in decreasing order, are the square roots of the non-zero eigenvalues of $XX^{T}$ and also of $X^{T}X$. We denote the ith columns of the matrices U and A by $u_i$ and $a_i$, respectively. The vectors $u_i$ and $a_i$ are called the left and right singular vectors of X, and the values $l_i$ are called the singular values.
In clustering, the SVD provides a transformation that allows us to obtain two matrices from one matrix. Define $L^{\alpha}$, for $0 \le \alpha \le 1$, as the diagonal matrix whose elements are the singular values raised to the power α, $l_1^{\alpha}, l_2^{\alpha}, \ldots, l_r^{\alpha}$ (equivalently, the non-zero eigenvalues $l_i^{2}$ of $X^{T}X$ raised to the power α/2), with a similar definition for $L^{1-\alpha}$, and let
$$G = U L^{\alpha}, \qquad H^{T} = L^{1-\alpha} A^{T} \qquad (2)$$
so that $X = G H^{T}$. We used α = 1/2, so $G = U L^{0.5}$ and $H^{T} = L^{0.5} A^{T}$. The G and H matrices represent the information of compounds and proteins, respectively. These matrices were then analyzed by hierarchical clustering and plotted in a heat map with two-dimensional dendrograms simultaneously. The G matrix produced a column-side dendrogram and the H matrix produced a row-side dendrogram, or vice versa. A heat map is a direct way of visualizing the binding free energy (∆G) scores with colored cells. We used the Euclidean distance and a linkage method for clustering both compounds and proteins to build the two-dimensional dendrograms.
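The factorization into G and H can be checked numerically. The sketch below uses NumPy on a small random matrix that stands in for the compound-by-protein ∆G matrix; the data, the helper name `svd_factors`, and the rank tolerance are assumptions made for illustration, not values from the study.

```python
import numpy as np


def svd_factors(X, alpha=0.5):
    """Return G = U L^alpha, H = A L^(1-alpha) and the singular values of X."""
    U, s, At = np.linalg.svd(X, full_matrices=False)
    # Keep only the numerically non-zero singular values (rank r).
    r = int(np.sum(s > 1e-12 * s.max()))
    U, s, At = U[:, :r], s[:r], At[:r, :]
    G = U * s**alpha            # scales column i of U by l_i^alpha
    H = At.T * s**(1 - alpha)   # scales column i of A by l_i^(1-alpha)
    return G, H, s


# Hypothetical data: 6 compounds (rows) by 4 proteins (columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))

G, H, s = svd_factors(X, alpha=0.5)

# X is recovered as G H^T, and the squared singular values are the
# eigenvalues of X^T X.
print(np.allclose(G @ H.T, X))
eigvals = np.sort(np.linalg.eigvalsh(X.T @ X))[::-1]
print(np.allclose(eigvals[: len(s)], s**2))
```

The rows of G (one per compound) and the rows of H (one per protein) are the objects that would then be passed to hierarchical clustering.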
Euclidean distance: the usual distance between two vectors x and y is given by:
$$d(x, y) = \sqrt{\sum_{k} (x_k - y_k)^{2}} \qquad (3)$$
Single linkage: the distance $D_{ij}$ between two clusters $C_i$ and $C_j$ is the minimum distance between two points x and y, with $x \in C_i$ and $y \in C_j$, and is given by:
$$D_{ij} = \min_{x \in C_i,\; y \in C_j} d(x, y) \qquad (4)$$
Complete linkage: the distance $D_{ij}$ between two clusters $C_i$ and $C_j$ is the maximum distance between two points x and y, with $x \in C_i$ and $y \in C_j$, and is given by:
$$D_{ij} = \max_{x \in C_i,\; y \in C_j} d(x, y) \qquad (5)$$
Average linkage: the distance $D_{ij}$ between two clusters $C_i$ and $C_j$ is the mean of the distances between the pairs of points x and y, where $x \in C_i$ and $y \in C_j$, and is given by:
$$D_{ij} = \frac{1}{|C_i|\,|C_j|} \sum_{x \in C_i} \sum_{y \in C_j} d(x, y) \qquad (6)$$
Ward: the total within-cluster sum of squares (SSE) is computed to determine the next two groups to be merged at each step of the algorithm. For a group of objects, where $x_j$ is the multivariate measurement associated with the jth object and $\bar{x}$ is the mean of the objects in the group, it is given by:
$$SSE = \sum_{j} (x_j - \bar{x})^{T} (x_j - \bar{x}) \qquad (7)$$
Furthermore, we compared the performance of the linkage methods using heat maps against a number of reasonable benchmarks.
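The distance and linkage definitions above can be illustrated with a short, dependency-free sketch. The two small clusters are hypothetical, and `ward_increase` expresses Ward's criterion as the increase in within-cluster SSE caused by merging two clusters, which is one common way to state it.

```python
import math
from itertools import product


def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def single_linkage(ci, cj):
    """Minimum pairwise distance between clusters ci and cj."""
    return min(euclidean(x, y) for x, y in product(ci, cj))


def complete_linkage(ci, cj):
    """Maximum pairwise distance between clusters ci and cj."""
    return max(euclidean(x, y) for x, y in product(ci, cj))


def average_linkage(ci, cj):
    """Mean pairwise distance between clusters ci and cj."""
    total = sum(euclidean(x, y) for x, y in product(ci, cj))
    return total / (len(ci) * len(cj))


def sse(cluster):
    """Sum of squared deviations of the points from their cluster mean."""
    dim = len(cluster[0])
    mean = [sum(p[k] for p in cluster) / len(cluster) for k in range(dim)]
    return sum((p[k] - mean[k]) ** 2 for p in cluster for k in range(dim))


def ward_increase(ci, cj):
    """Increase in total within-cluster SSE if ci and cj are merged."""
    return sse(ci + cj) - sse(ci) - sse(cj)


# Two hypothetical clusters of 2-D points.
ci = [(0.0, 0.0), (0.0, 1.0)]
cj = [(3.0, 0.0), (4.0, 0.0)]

print(single_linkage(ci, cj))    # closest pair: (0, 0) and (3, 0)
print(complete_linkage(ci, cj))  # farthest pair: (0, 1) and (4, 0)
print(average_linkage(ci, cj))
print(ward_increase(ci, cj))
```

At each step, an agglomerative algorithm merges the pair of clusters with the smallest value of the chosen criterion.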

Results and discussion
The results of the drug-likeness test showed that, out of 306 compounds, 199 compounds (185 from medicinal plants and 14 from synthetic drugs) satisfied the Lipinski's Rule properties. The boxplots in Figure 1 show that the ∆G scores of the medicinal plant and synthetic drug compounds are not greatly different. Although the mean ligand-protein interaction of the medicinal plant compounds is lower than that of the synthetic drugs, some medicinal plant compounds produce the lowest ∆G scores (< -15 kJ/mol). The most stable binding complex, or strongest interaction, is -16.97 kJ/mol for the medicinal plants (J156 and INS), whereas it is -14.56 kJ/mol for the synthetic drugs (DB11 and INS). Therefore, some medicinal plant compounds form more stable complexes than the synthetic drugs and could possibly be developed as antidiabetic drug candidates.
The relative position of the compounds based on the target proteins can be described by a plot of the principal components (PCs) (Figure 2). The plot of the first two PCs in Figure 2 shows that the compounds tend to gather in quadrant III, with the remainder outside quadrant III. The synthetic drug compounds (red) are spread over quadrants I, II, and III. In quadrant I, there is one synthetic drug compound that is close to pare (purple) and ginger (black). In quadrant II, there are 4 synthetic drug compounds that tend to gather with ginger and sembung (green). In quadrant III, there are many synthetic drugs that are close to each other and to some medicinal plant compounds of brotowali (blue), ginger, pare, and sembung. In quadrant IV, there are no synthetic drug compounds. The first two PCs explain 87.29% of the total variance. This plot is an exploration of the compound grouping based on the PCs of the target proteins. Furthermore, the cluster analysis was performed using simultaneous clustering.
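The share of variance explained by the leading PCs follows directly from the singular values of the column-centered data matrix. The sketch below shows the calculation on hypothetical random data; the column centering and the helper names are assumptions, since the text does not state how the data were preprocessed, so the printed percentage is not expected to match the value reported above.

```python
import numpy as np


def explained_variance_ratio(X):
    """Fraction of total variance carried by each principal component."""
    Xc = X - X.mean(axis=0)                  # center each column (assumption)
    s = np.linalg.svd(Xc, compute_uv=False)  # singular values, decreasing
    var = s**2                               # proportional to PC variances
    return var / var.sum()


def pc_scores(X, k=2):
    """Coordinates of the rows of X on the first k principal components."""
    Xc = X - X.mean(axis=0)
    U, s, At = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :k] * s[:k]


# Hypothetical data: 10 compounds (rows) by 5 proteins (columns).
rng = np.random.default_rng(1)
X = rng.normal(size=(10, 5))

ratio = explained_variance_ratio(X)
scores = pc_scores(X, k=2)
print(f"first two PCs explain {100 * ratio[:2].sum():.2f}% of the variance")
```

Plotting the two columns of `scores` against each other gives a scatter plot of the same kind as the one described for Figure 2.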
The cluster validity index values show that the best number of clusters is two for the compounds (Table 1) and two for the proteins (Table 2). The hierarchical method that most often yields two clusters as the best number is the complete linkage method for the compounds, and the complete linkage and Ward methods for the proteins. Therefore, the hierarchical method chosen for the simultaneous clustering is the complete linkage method.
The two-dimensional dendrogram generated by the complete linkage method in Figure 3 is cut into two clusters in both rows (proteins) and columns (compounds). The strong ligand-protein interactions (light yellow) are found horizontally on the right side (quadrants I and IV) and vertically on the upper side (quadrants I and II). In addition, 13 synthetic drug compounds are more commonly [...] those compounds, especially pare (P231), that target the protein insulin (INS) could possibly be developed as type 2 antidiabetic drug candidates. In addition, the proteins AKT1, WFS1, APOE, EP300, PTH, GCG, and UBC are found to be target proteins that play a role in type 2 diabetes.

TABLE 1
Cluster validity of ligand.
a Index values that show the best number of clusters.

TABLE 2
Cluster validity of protein.
a Index values that show the best number of clusters.