jClust: A tool for
clustering analysis
|
|
Affinity Propagation algorithms is a machine learning based algorithm which can be used in interdisciplinary fields to cluster data. This algorithm is fast and efficient and takes as input measures of similarity between pairs of data points and simultaneously considers all data points as potential exemplars. Real-valued messages are exchanged between data points until a high-quality set of exemplars and corresponding clusters gradually emerges. The user does not have to define the number of clusters since they are automatically calculated. This algorithm can also be used in biological cases. In the original paper the authors show how Affinity Propagation can efficiently identify segments of DNA that reflect the expression properties of genes.
Complexity: O((N^2)log(N)) where N is the number of nodes
|
|
Spectral clustering is a supervised clustering method where the user needs to define the number of cluster like in k-Means. Spectral methods allow one to study global properties of a dataset by making only pairwise similarity measurements between data points. Spectral clustering considers partitioning a weighted undirected graph G into a set of discrete clusters. Each node in the graph corresponds to a protein sequence and the weight on each edge corresponds to the similarity between the two protein sequences it connects. Ideally, we are looking for areas in the graph, the clusters, in which the nodes are connected with highly-similar edges; and at the same time the connections between such areas should be weak, constituted by edges with low similarity. Spectral algorithms was extensively tested and compared with other local methods on several subsets of the Structural Classification of Proteins database, a gold standard for protein structure classification. We consistently observed that, the number of clusters that we obtain for a given set of proteins is close to the number of superfamilies in that set; there are fewer singletons; and the method correctly groups most remote homologs. Its biological applicability can be explained extensively in the relevant article.
Complexity: The complexity of this algorithm is not defined yet but it is definitely slower that the rest algorithms described here.
PARAMETERS - Spectral
-------------------------------------------------------------------------------------------------------------
User needs to provide a parameter which shows the number of clusters
|
|
MCL or otherwise Markov Clustering is a very fast scalable and reliable unsupervised algorithm that is widely used in biology. It accepts as an input a list of pairwise weighted interactions. It is unsupervised since the number of clusters can be automatically defined. MCL is widely used to cluster protein sequences and predict protein families. Another example is to be used to predict protein complexes from protein-protein interaction data. BioLayout visualization tools has implemented MCL clustering to cluster data visually in 3D. Following the same approach we would like to offer to the user the opportunity to use MCL and combine it with Medusa visualization tool. The MCL algorithm is a fast and scalable unsupervised clustering algorithm based on simulation of stochastic flow in graphs. The MCL algorithm can detect cluster structures in graphs by a mathematical bootstrapping procedure. The process deterministically computes the probabilities of random walks through a graph, and uses two operators transforming one set of probabilities into another. It does so by using the language of stochastic matrices (also called Markov matrices) which capture the mathematical concept of random walks on a graph.
Complexity: O(nk^2) where n is the number of the nodes, k the number of centers
MCL PARAMETER
--------------------------------------------------------------------------------------------------------
MarkovClustering implements the Markov clustering (MCL) algorithm for graphs, using a HashMap-based sparse representation of a Markov matrix, i.e., an adjacency matrix m that is normalised to one. Elements in a column / node can be interpreted as decision probabilities of a random walker being at that node. Note: whereas we explain the algorithms with columns, the actual implementation works with rows because the used sparse matrix has row-major format.
The basic idea underlying the MCL algorithm is that dense regions in sparse graphs correspond with regions in which the number of k-length paths is relatively large, for small k in N, which corresponds to multiplying probabilities in the matrix appropriately. Random walks of length k have higher probability (product) for paths with beginning and ending in the same dense region than for other paths.
The algorithm starts by creating a Markov matrix from the graph, where first, in the adjacency matrix, diagonal elements are added to include self-loops for all nodes, i.e., probabilities that the random walker stays at a particular node. After this initialisation, the algorithm works by alternating two operations, expansion and inflation, iteratively recomputing the set of transition probabilities. The expansion step corresponds to matrix multiplication (on stochastic matrices), the inflation step corresponds with a parametrized inflation operator Gamma_r, which acts column-wise on (column) stochastic matrices (here, we use row-wise operation, which is analogous).
The inflation operator transforms a stochastic matrix into another one by raising each element to a positive power p and re-normalising columns to keep the matrix stochastic. The effect is that larger probabilities in each column are emphasized and smaller ones deemphasized. On the other side, the matrix multiplication in the expansion step creates new non-zero elements, i.e., edges. The algorithm converges very fast, and the result is an idempotent Markov matrix, M = M * M, which represents a hard clustering of the graph into components.
Expansion and inflation have two opposing effects: While expansion flattens the stochastic distributions in the columns and thus causes paths of a random walker to become more evenly spread, inflation contracts them to favoured paths.
Description is based on the introduction of Stijn van Dongen's thesis Graph Clustering by Flow Simulation (2000); for a mathematical treatment of the algorithm and the associated MCL process, see there.
@author gregor :: arbylon . net
|
|
The RNSC algorithm searches for a low cost clustering by composing first an initial random clustering, then iteratively moving one node from one cluster to another in a randomized fashion to improve the clustering cost. In order to avoid local minima, RNSC makes diversification moves and performs multiple experiments. Furthermore, it maintains a tabu list that prevents cycling back to a previously explored partitioning. Due to the randomness of the algorithm, different runs on the same input data produce different outputs. We showed a biological potential of how this algorithm could be used to detect protein complexes from protein-protein interaction data. Publication: Charalampos N. Moschopoulos, Georgios A. Pavlopoulos, Reinhard Schneider, Spiridon D. Likothanassis and Sophia Kossida, GIBA: A clustering tool for detecting protein complexes, BMC Bioinformatics 2009 (in press).
Complexity: O(nmμ + μ^2) where n is the number of the nodes, m the number of edges and μ the number of cliques
PARAMETERS - RNSC
------------------------------------------------------------------------------------------------------------
-Allow no more than N clusters
Allow no more than "num" clusters. "num" must be between 2 and n,
where n is the number of vertices in the graph. If the -c option is
not specified or an invalid value is given, n clusters are used.
-Tabu length
Set the tabu length to "num". Default value is 1. Note that when the
-T option is used, vertices can appear on the tabu list more than once
and moving them is only forbidden when they are on the tabu list more
than TabuTol times, where TabuTol is the tabu list tolerance.
-Tabu list tolerance
Set the tabu list tolerance to "num". Default value is 1. The tabu
list tolerance is the number of times a vertex must appear on the tabu
list before moving it is forbidden.
-n num
Set the naive stopping tolerance to "num". Default value is 5. This
is the number of steps that the naive scheme will continue without
improving the best cost. If you run the scaled scheme, using a higher
naive stopping tolerance, it is not likely to improve your results.
-N num
Set the scaled stopping tolerance to "num". Default value is 5. This
is the number of steps that the scaled scheme will continue without
improving the best cost. Setting the tolerance to 0 will cause the
algorithm to skip the scaled scheme.
-Number_of_experiments
Run "num" experiments. The best final clustering over all experiments
will be written to file. Default is 1.
|
|
k-Means (MacQueen, 1967) is an algorithm to cluster data. It is necessary for the user to define the number of clusters (k) manually. It is not often used in biological cases but it is a very well known and widely used algorithms that can be used in interdisciplinary fields. k-Means accepts as an input a full all-against-all distance matrix and has a running time complexity of O(nlk) were n is the number of nodes, l is the number of loops and k the number of clusters.The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori. The main idea is to define k centroids, one for each cluster. These centroids shoud be placed in a cunning way because of different location causes different result. So, the better choice is to place them as much as possible far away from each other. The next step is to take each point belonging to a given data set and associate it to the nearest centroid. When no point is pending, the first step is completed and an early groupage is done. At this point we need to re-calculate k new centroids as barycenters of the clusters resulting from the previous step. After we have these k new centroids, a new binding has to be done between the same data set points and the nearest new centroid. A loop has been generated. As a result of this loop we may notice that the k centroids change their location step by step until no more changes are done. In other words centroids do not move any more. It is a suitable algorithms when all of the relationships among all of the data points are known.
Complexity: O(nlk) where n is the number of nodes, k is the number of user predefined clusters and l the number of loops where the algorithm diverges.
PARAMETERS - k-Means
-------------------------------------------------------------------------------------------------------------
User needs to provide k parameter which shows the number of clusters
|
|
Bron-Kerbosch algorithm is supported |
Unlike clustering methods, Bron-Kerbosch is an algorithm for finding cliques in a graph. It accepts as an input a list of pairwise connections and it is suitable for undirected unweighted graphs. In graph theory, a clique in an undirected graph G is a set of vertices V such that for every two vertices in V, there exists an edge connecting the two. Alternatively, a clique is a graph in which every vertex is connected to every other vertex in the graph. This is equivalent to saying that the subgraph induced by V is a complete graph. The size of a clique is the number of vertices it contains. In other words this algorithm can detect strongly connected areas of the graphs where each subset called clique consists of elements that they are all-against-all interconnected between each other. The main advantage of this algorithms is that it allows one node to belong in more that one clusters since one node can be a member of many different cliques. Potential use of this algorithm in biology is to try to detect strongly connected components of heterogenous data like i.e protein chemical interaction data coming from Stitch database.
Complexity:
O(3^(n/3)) where n is the number of nodes
|
|
MULIC is a layered clustering algorithm for interaction networks, which groups proteins by the similarity of their direct neighborhoods. MULIC algorithm can identify locally significant proteins, called mediators, which link different clusters. MULIC is available online and it accepts as an input a list of pairwise connections. It does nor require a fully connected graph and its main purpose is to cluster together proteins with ‘similar’ interaction partners. Similarity implies many interactions with the same proteins and few interactions with different proteins. One of the advantages of MULIC is that clusters are layered and objects clustered at top layers have more similar neighborhoods.
Complexity: The worst case complexity of MULIC is O(N*N), where N is the number of objects.