manta package¶
Submodules¶
manta.cluster module¶
The clustering algorithm works in several steps to generate cluster assignments.
- Generate a scoring matrix using a network flow strategy
- Cluster on the scoring matrix.
- In case the network displays memory effects, define weak nodes.
The scoring matrix is first clustered with the AgglomerativeClustering algorithm. Because the network flow strategy can result in central values being separated from the clusters, agglomerative clustering is repeated on score matrices with removed high-scoring nodes until larger clusters are identified. After this step, clustering assignments are set. However, if the network caused the scoring matrix to enter a flip-flop state, nodes can still be defined as ‘weak’. In this case, manta uses a shortest path strategy to assess whether nodes belong to a cluster or are in conflict with the cluster oscillator.
-
manta.cluster.
cluster_graph
(graph, limit, max_clusters, min_clusters, min_cluster_size, iterations, subset, ratio, edgescale, permutations, verbose)[source]¶ Takes a networkx graph and carries out network clustering. The returned graph contains cluster assignments and weak assignments. If weight is available, this is considered during the diffusion process.
Parameters: - graph – Weighted, undirected networkx graph.
- limit – Percentage in error decrease until matrix is considered converged.
- max_clusters – Maximum number of clusters to evaluate in K-means clustering.
- min_clusters – Minimum number of clusters to evaluate in K-means clustering.
- min_cluster_size – Minimum cluster size as fraction of network size
- iterations – If algorithm does not converge, it stops here.
- subset – Fraction of edges used in subsetting procedure
- ratio – Ratio of scores that need to be positive or negative for a stable edge
- edgescale – Mean edge weight for node removal
- permutations – Number of permutations for partial iterations
- verbose – Verbosity level of function
Returns: NetworkX graph, score matrix and diffusion matrix.
-
manta.cluster.
cluster_hard
(graph, adj_index, rev_index, scoremat, max_clusters, min_clusters, min_cluster_size, verbose)[source]¶ Agglomerative clustering is used to separate nodes based on the scoring matrix. Because the scoring matrix generally results in separation of ‘central’ nodes, these nodes are removed and clustering is repeated until larger clusters are identified. Afterwards, the nodes are assigned to clusters based on a shortest path strategy.
Parameters: - graph – NetworkX weighted, undirected graph
- adj_index – Index matching matrix index to node ID
- rev_index – Index matching node ID to matrix index
- scoremat – Converged diffusion matrix
- max_clusters – Maximum cluster number
- min_clusters – Minimum cluster number
- min_cluster_size – Minimum cluster size as fraction of network size
- verbose – Verbosity level of function
Returns: Dictionary of nodes with cluster assignments
-
manta.cluster.
cluster_weak
(graph, diffs, cluster, edgescale, adj_index, rev_index, verbose)[source]¶ Although clusters can be assigned with cluster_hard, cluster_weak tests whether cluster assignments are in conflict with oscillator nodes present in clusters.
Oscillators can only be defined from flip-flopping states; hence, this function should not be called for scoring matrices that converge to -1 and 1.
Parameters: - graph – NetworkX weighted, undirected graph
- diffs – Diffusion matrices from flip-flop states generated by diffusion
- cluster – Cluster assignment
- edgescale – Mean edge weight for node removal
- adj_index – Node index
- rev_index – Reverse node index
- verbose – Verbosity level of function
Returns: Vector with cluster assignments
-
manta.cluster.
sparsity_score
(graph, clusters, rev_index)[source]¶ Given a graph, and a list of cluster identities, this function calculates how many edges need to be cut to assign these cluster identities. Cuts through positively-weighted edges are penalized, whereas cuts through negatively-weighted edges are rewarded. The lowest sparsity score represents the best clustering. Because this sparsity score rewards cluster assignments that separate out negative hubs. This penalty ensures that nodes with some intra-cluster negative edges are still assigned to clusters where they predominantly co-occur with other cluster members.
Parameters: - graph – NetworkX weighted, undirected graph
- clusters – List of cluster identities
- rev_index – Index matching node ID to matrix index
Returns: Sparsity score
manta.cyjson module¶
Although NetworkX already contains read/write utilities for Cyjson networks, these do not accomodate networks without an index (e.g. those generated with CoNet). Moreover, they do not support inclusion of a network layout. The functions below have been adapted to tackle these issues. They are derived from the original NetworkX cytoscape_graph functions.
NetworkX is distributed with the 3-clause BSD license.
Copyright (C) 2004-2018, NetworkX Developers Aric Hagberg <hagberg@lanl.gov> Dan Schult <dschult@colgate.edu> Pieter Swart <swart@lanl.gov> All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. * Neither the name of the NetworkX Developers nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-
manta.cyjson.
read_cyjson
(filename, direction=False)[source]¶ Small utility function for reading Cytoscape json files generated with CoNet. Parameters ———- :param filename: Filepath to location of cyjs file. :param direction: If true, graph is imported as a NetworkX DiGraph :return: NetworkX graph.
-
manta.cyjson.
write_cyjson
(filename, graph, layout=None)[source]¶ Small utility function for writing Cytoscape json files. Also accepts a layout dictionary to add to the file. Parameters ———- :param filename: Filepath to location of cyjs file. :param graph: NetworkX graph to write to disk. :param layout: Dictionary of layout coordinates that is written to cyjs file. :return:
manta.flow module¶
Cluster assignment by manta makes use of a scoring matrix. Functions in this module can generate such a scoring matrix.
First, an empty adjacency matrix is instantiated. Then, the following steps are carried out until an error threshold is reached:
- Raise the matrix to power 2
- Normalize matrix by absolute maximum value
- Add 1 / value for each value in matrix
- Normalize matrix by absolute maximum value
- Calculate error by subtracting previous matrix from current matrix
If a network consists of clusters connected by positive edges, it will frequently fail to reach the error threshold. This can happen because of two reasons: firstly, the scoring matrix enters a flip-flop state. Secondly, the matrix converges to zero. Flip-flop states are detected by comparing the error to the error 2 iterations ago. If this error is 99% identical, manta continues clustering with a partial iteration strategy. Zero convergence is detected when the threshold of the 99th percentile is below 0.00000001. In either of these cases, manta proceeds with a partial iteration strategy. This strategy repeats the same steps as the normal network flow strategy, but does so on a subset of the network. The scoring matrix is then recovered from stable edges; stable edges are defined as edges that have only positive or negative values during a minimum fraction of permutations. Those permutations are summed together to give positions in the scoring matrix.
-
manta.flow.
diffusion
(graph, iterations, limit, verbose, norm=True, inflation=True)[source]¶ Diffusion process for generation of scoring matrix. The implementation of this process is similar to the MCL algorithm. However, the rescaling step in the MCL algorithm has been adjusted to accomodate for negative signs. In the first step, the matrix is scaled; afterwards, every value in the matrix has 1 divided by the value added. Subsequently, the matrix is scaled. The cumulative error, relative to the previous iteration, is calculated by taking the mean of the difference.
If a memory effect is detected, not the outcome matrix but the first iteration is returned. Additionally, 5 iterations after the flip-flop state has been reached are calculated.
Normally, memory effects should not be detected because these only happen if the graph is unbalanced. The Harary algorithm is used to check whether graphs are unbalanced. If they are, this diffusion process should not be run.
Parameters: - graph – NetworkX graph of a microbial assocation network
- iterations – Maximum number of iterations to carry out
- limit – Percentage in error decrease until matrix is considered converged
- verbose – Verbosity level of function
- norm – Normalize values so they converge to -1 or 1
- inflation – Carry out network diffusion with/without inflation
Returns: score matrix, memory effect
-
manta.flow.
harary_balance
(graph)[source]¶ Checks whether a graph is balanced according to Harary’s theorem. Python implementation of the algorithm as described in the article below. Harary’s algorithm quickly finds whether a signed graph is balanced. A signed graph is balanced if the product of edge signs around every circle is positive.
Harary, F., & Kabell, J. A. (1980). A simple algorithm to detect balance in signed graphs. Mathematical Social Sciences, 1(1), 131-136. :param graph: NetworkX graph :return: Returns a dictionary with connected components as keys;
components that are balanced have a True value.
-
manta.flow.
harary_components
(graph, verbose)[source]¶ This wrapper for the balance test can accept graphs that consist of multiple connected components.
Parameters: - graph – NetworkX graph
- verbose – Prints result of test to logger if True
Returns: Returns a dictionary with connected components as keys; components that are balanced have a True value.
-
manta.flow.
partial_diffusion
(graph, iterations, limit, subset, ratio, permutations, verbose)[source]¶ Partial diffusion process for generation of scoring matrix. Some matrices may be unable to reach convergence or enter a flip-flopping state. A partial diffusion process can still discover relationships between unlinked taxa when this is not possible.
The partial diffusion process takes a random subgraph and checks whether this subgraph is balanced. If it is, the full diffusion process is carried out. Otherwise, one iteration is carried out.
Parameters: - graph – NetworkX graph of a microbial assocation network
- iterations – Maximum number of iterations to carry out
- limit – Percentage in error decrease until matrix is considered converged
- subset – Fraction of edges used in subsetting procedure
- ratio – Ratio of positive / negative edges required for edge stability
- permutations – Number of permutations for network subsetting
- verbose – Verbosity level of function
Returns: score matrix, memory effect, initial diffusion matrix
manta.layout module¶
These functions generate a layout for microbial association networks. Instead of using default edge weight, a different edge weight is calculated that can take taxonomic similarity as well as structural similarity into account. Nodes that share taxonomic features will be located more closely together, as well as nodes that share neighbours.
The function uses this alternative edge weight to run the Fruchterman-Reingold force-directed algorithm. This algorithm is run once per cluster. The generated layouts are then rotated depending on the number of clusters and combined.
-
manta.layout.
generate_layout
(graph, tax=None)[source]¶ Generates a dictionary of layout coordinates. The layout is based on the Fruchterman-Reingold force-directed algorithm, where a layout is calculated for each of the clusters specified in the supplied NetworkX graph. These layouts are then shifted and combined into a full layout.
Parameters: - graph – NetworkX graph with cluster IDs
- tax – Filepath to tab-delimited taxonomy table.
Returns: dictionary of layout coordinates
-
manta.layout.
generate_tax_weights
(graph, tax)[source]¶ Returns supplied graph with node similarity as node properties. This node similarity is determined by similarity at different taxonomic levels and by structural similarity. The more similar nodes are, the larger their similarity score is; hence, force-directed layouts will place such nodes closer together. Assumes that species assignments of NA or None are not relevant.
Parameters: - graph – NetworkX graph
- tax – Taxonomy table
Returns:
manta.main module¶
manta: microbial association network clustering toolbox. The script takes a weighted and undirected network as input and uses this to generate network clusters. Moreover, it can generate a Cytoscape-compatible layout (with optional taxonomy input). Detailed explanations are available in the headers of each file.
The arguments for manta can be separated into 4 groups:
- Arguments for importing and exporting data.
- Arguments for network clustering.
- Arguments for network clustering on flip-flopping networks.
- Arguments for network centralities.
The arguments for importing and exporting data include:
- -i Filepath to input network.
- -o Filepath to output network
- -f Filetype for output network
- -tax Filepath to taxonomy table.
- –layout If flagged, a layout is generated
- -dir If a directed network is imported, setting this to True converts the network to undirected.
- -b If flagged, all edge weights are converted to 1 and -1.
manta uses the file extension to import networks. Taxonomy tables should be given as tab-delimited files. These tables can be used to generate a layout for cyjson files. Other file formats do not export layout coordinates.
The arguments for network clustering include:
- -min Minimum cluster number
- -max Maximum cluster number
- -mc Minimum cluster size
- -limit Error limit until convergence is considered reached
- -iter Number of iterations to keep going if convergence is not reached
manta uses agglomerative clustering on a scoring matrix to assign cluster identities. The scoring matrix is generated through a procedure involving network flow. Nodes that cluster separately are removed and combined with identified clusters later on. Hence, manta will not identify clusters of only 1 node. It is highly likely that networks will not converge neatly. In that case, manta will apply the network flow procedure on a subset of the network.
The arguments for network clustering on flip-flopping networks include:
- -perm Number of permutations on network subsets
- -ratio Ratio of edges that need to be positive or negative to consider the edges stable through permutations.
- -scale Threshold for shortest path products to oscillators.
The network flow procedure relies on the following assumption: positions in the scoring matrix that are mostly positive throughout permutations, should have only positive values added. The same is assumed for negative positions. The ratio defines which positions are considered mostly positive or mostly negative.
For demo purposes, we included a network generated from data
manta.reliability module¶
Centrality scores and cluster assignments can be tested for their reliability. The idea behind this test is that random rewiring should, to some extent, preserve the most central structures of the original graph. We cannot know which edges are true positives and which ones are false positives, but we do expect that global network properties are retained despite changes in identified associations.
First, null models are generated from the original graph. These models are rewired copies: edge degree and connectedness are preserved. Weight is assigned randomly by sampling with replacement from the original weight scores. For each of these networks, the diffusion iterations specified in cluster_sparse are repeated as many times as for the original network. The outcome is then a matrix of diffusion scores.
With these matrices, the reliability can be estimated. This error specifies how variable the assignments are expected to be.
-
manta.reliability.
central_edge
(graph, percentile, permutations, error, verbose)[source]¶ The min / max values that are the result of the diffusion process are used as a centrality measure and define positive as well as negative hub associations.
If the permutation number is set to a value above 0, the centrality values are tested against permuted graphs.
The fraction of positive edges and negative edges is based on the ratio between positive and negative weights in the network.
Hence, a network with 90 positive weights and 10 negative weights will have 90% positive hubs returned.
Parameters: - graph – NetworkX graph of a microbial association network.
- percentile – Determines percentile of hub species to return.
- permutations – Number of permutations to carry out. If 0, no permutation test is done.
- error – Fraction of edges to rewire for reliability metric.
- verbose – Verbosity level of function
Returns: Networkx graph with hub ID / p-value as node property.
-
manta.reliability.
central_node
(graph)[source]¶ Given a graph with hub edges assigned (see central_edge), this function checks whether a node is significantly more connected to edges with high scores than expected by chance. The p-value is calculated with a binomial test. Edge sign is ignored; hubs can have both positive and negative edges.
Parameters: graph – NetworkX graph with edge centrality scores assigned Returns: NetworkX graph with hub centrality for nodes
-
manta.reliability.
jaccard_similarity_score
(vector1, vector2)[source]¶ The sklearn implementation of the Jaccard similarity requires vectors to be equal length. This implementation does not. :param vector1: List of strings or numbers. :param vector2: List of strings or numbers. :return:
-
manta.reliability.
perm_clusters
(graph, limit, max_clusters, min_clusters, min_cluster_size, iterations, ratio, partialperms, relperms, error, verbose)[source]¶ Calls the rewire_graph function and robustness function to compute robustness of cluster assignments. Scores close to 1 imply that the scores are robust to perturbation.
Parameters: - graph – NetworkX graph of a microbial association network. Cluster assignment should be a network property.
- limit – Percentage in error decrease until matrix is considered converged.
- max_clusters – Maximum number of clusters to evaluate in K-means clustering.
- min_clusters – Minimum number of clusters to evaluate in K-means clustering.
- min_cluster_size – Minimum cluster size as fraction of network size
- iterations – If algorithm does not converge, it stops here.
- ratio – Ratio of scores that need to be positive or negative for a stable edge
- partialperms – Number of permutations for partial diffusion.
- relperms – Number of permutations for reliability testing.
- error – Fraction of edges to rewire for reliability metric.
- verbose – Verbosity level of function
Returns:
-
manta.reliability.
perm_edges
(graph, permutations, percentile, pos, neg, error)[source]¶ Calls the rewire_graph function; returns reliability scores of edge centrality scores. Scores close to 100 imply that the scores are robust to perturbation. Reliability scores as proposed by: Frantz, T. L., & Carley, K. M. (2017). Reporting a network’s most-central actor with a confidence level. Computational and Mathematical Organization Theory, 23(2), 301-312.
Parameters: - graph – NetworkX graph of a microbial association network.
- permutations – Number of permutations to carry out. If 0, no bootstrapping is done.
- percentile – Determines percentile of hub species to return.
- pos – List of edges in the upper percentile. (e.g. positive hubs)
- neg – List of edges in the lower percentile. (e.g. negative hubs)
- error – Fraction of edges to rewire for reliability metric.
Returns: List of reliability scores.
-
manta.reliability.
rewire_graph
(graph, error)[source]¶ Returns a rewired copy of the original graph. The rewiring procedure preserves degree of each node. The number of rewirings is the square of the node amount. This ensures the network is completely randomized. The weight of the edges also needs to be normalized. Therefore, after the network is randomized, weight is sampled from the original graph and added to the randomized graph. Because this does not take negative / positive hubs into account, the fraction of positive / negative weights per node is not preserved.
Part of the rewire_graph function has been adapted from the original NetworkX double_edge_swap function. The adapted version also swaps edge weights.
NetworkX is distributed with the 3-clause BSD license.
Copyright (C) 2004-2018, NetworkX Developers Aric Hagberg <hagberg@lanl.gov> Dan Schult <dschult@colgate.edu> Pieter Swart <swart@lanl.gov> All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. * Neither the name of the NetworkX Developers nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Parameters: - graph – Original graph to rewire.
- error – Fraction of edges to rewire.
Returns: Rewired NetworkX graph
-
manta.reliability.
robustness
(graphclusters, permutations)[source]¶ Compares vectors of cluster assignments to estimate cluster-wise robustness and node-wise robustness. These are returned as dictionaries.
Inspired by reliablity scores as proposed by: Frantz, T. L., & Carley, K. M. (2017). Reporting a network’s most-central actor with a confidence level. Computational and Mathematical Organization Theory, 23(2), 301-312.
Because calculating the accuracy of a cluster assignment is not trivial, the function does not compare cluster labels directly. Instead, this function calculates the Jaccard similarity between cluster assignments.
Parameters: - graphclusters – Dictionary of original cluster assignments
- permutations – Number of permutations to compute robustness.
Returns: Two dictionaries of reliability scores (cluster-wise and node-wise).