Network analysis - node importance

Previously, we looked at hub nodes; the nodes with the largest number of connections. However, there are many other ways to define node centrality or to look at properties of individual nodes. While there are apps available in Cytoscape (e.g. CytoNCA), we are going to analyse our SPIEC-EASI network with the igraph library.

Step 1 - Data

We are going to work with the arctic soils network generated by SpiecEasi. If you have not downloaded the network yet, get the files here and load it into an igraph object. We are going to need the estimated edge weights for some analysis steps. Don’t set the ‘weight’ property of the graph to the edge weights, as this breaks some igraph functions.

library(SpiecEasi)
library(phyloseq)
library(ggplot2)
library(igraph)
library(seqtime)

### If you can't import the file:
### Download the spiec_otus.txt files from https://github.com/ramellose/networktutorials
### They are in the Workshop folder
### This file is already preprocessed!
otus <- read.table("spiec_otus.txt")
tax <- as.matrix(read.table("spiec_tax.txt"))
phyloseqobj.f <- phyloseq(otu_table(otus, taxa_are_rows = TRUE), tax_table(tax))
# use your SPIEC-EASI graph from the previous exercise
# or load the igraph graphml file supplied in the github repo
spiec.graph <- read_graph("spiec_graph.graphml", format="graphml")

Step 2 - Network centralities

One of the simplest network centralities is degree centrality, where the degree is the number of connections a node has. We can use a hist to visualize node degree. If the degree distribution of a network follows a power law, that network is scale-free. We can fit a power law to the network with the fit_power_law function. Do you think this is a scale-free network based on those outcomes?

spiec.deg <- degree(spiec.graph)
hist(spiec.deg)

fit <- fit_power_law(spiec.deg)
fit
## $continuous
## [1] FALSE
## 
## $alpha
## [1] 4.441027
## 
## $xmin
## [1] 7
## 
## $logLik
## [1] -81.70369
## 
## $KS.stat
## [1] 0.07647556
## 
## $KS.p
## [1] 0.9666786

Microbial networks do not always follow a power law. However, the extent to which they do may have important biological implications. Can you explain why a scale-free network would be less robust to failure than other networks? You can find the solutions here. We can generate networks with the igraph and seqtime libraries that have different degree distributions. The Erdős–Rényi model generates completely random networks, while the Klemm-Eguíluz algorithm generates networks that are scale-free. Try removing random edges from these models to see this affects connectivity.

## [1] "Adjusting connectance to 0.1"
## [1] "Initial edge number 10000"
## [1] "Initial connectance 1"
## [1] "Number of edges removed 8910"
## [1] "Final connectance 0.1"
## [1] "Final connectance: 0.1"

To better visualize our association network, we can plot the degree as node size in the network. Because the object returned by plot_network is a regular ggplot object, we can overlay circles that indicate the degree of each node. Species with the largest degree in a network are often called hub species.

plot_network(spiec.graph, phyloseqobj.f, type='taxa', color="Rank3", label=NULL) + geom_point(aes(size=spiec.deg), colour='deepskyblue4', shape=1)

In addition to degree, betweenness centrality is often used. The betweenness centrality is determined by the number of shortest paths passing through a node, where the shortest paths are from all nodes to all other nodes. Try plotting the betweenness centrality yourself; can you explain why there are no nodes with high betweenness centrality but low degree? Try drawing a network that does have these nodes. For solutions, check here.

Like betweenness centrality, closeness centrality uses shortest paths. However, instead of counting the total number of shortest paths passing through a node, this centrality measure uses the shortest paths from the node in question to all other nodes. Therefore, it measures how distant a node is to other nodes. Closeness centrality only makes sense for graphs with a single connected component, as there are no shortest paths from one part of the network to a disconnected part. This is also visible in the histogram and the graph: igraph estimated very low closeness centrality for these disconnected nodes. Try generating a better visualization of the closeness centrality. Do you have an idea of the nodes in the central component that will have the lowest closeness centrality? Answers are here.

spiec.close <- closeness(spiec.graph)

A different type of centrality is eigenvector centrality. This measure takes the set of neighbours of a node and uses those to calculate the node’s eigenvector.

spiec.eigen <- eigen_centrality(spiec.graph)
hist(spiec.eigen$vector)

plot_network(spiec.graph, phyloseqobj.f, type='taxa', color="Rank3", label=NULL) + geom_point(aes(size=spiec.eigen$vector), colour='deepskyblue4', shape=1)

centralities$eigen <- spiec.eigen$vector
ggplot(data=centralities, aes(x=degree, y=eigen)) + geom_point() + geom_smooth(method='auto')

For this network, a higher degree implies that the node will also have greater values for other centrality measures. This is not always the case; nodes can have high values of a specific centrality without having high degree.

In addition to the previously discussed centralities, there are also centrality scores that serve specific purposes. For example, Kleinberg’s authority centrality (also known as the HITS algorithm) can take directedness of nodes into account. In the microbiome, this could mean that there are some taxa that have a large effect on other taxa, whereas there are also taxa that are affected by many taxa. The HITS algorithm could help separate such taxa better than degree centrality. As another example, the Bonacich alpha centrality score is a type of eigenvector centrality that is appropriate for directed graphs. However, few centrality scores have been assessed for their biological relevance. Therefore, we cannot attribute node importance based on the theory behind a centrality without validating these theories. Freilich and colleagues (2018) studied this in detail by comparing a known trophic network to a co-occurrence network. Their results indicate that co-occurrence networks may not reflect ecological networks, although they may be able to reflect the effect of environmental conditions. The species that had the strongest correlations (e.g. hub species) were generalist species, not validated keystones.

Step 3 - Assortativity

We can also calculate the network’s assortativity. The assortativity of a network quantifies whether nodes are more likely to connect to similar nodes or to dissimilar nodes. This can be similarity in degree, edge weight or any other value of interest. In microbial networks, we can actually also calculate phylogenetic assortativity: are nodes more likely to have associations with taxonomically similar nodes? Try calculating if nodes in this network preferentially connect to nodes with a similar number of positive or negative edges, or with a similar taxonomy. Is this a disassortative or assortative network? For an explanation of the assortativity and assortativity_nominal functions, look here. The solutions can be found here.

Step 4 - Network permutation

While it is straightforward to calculate the previously addressed values, we cannot be sure they are robust to errors. After all, Weiss and colleagues (2016) demonstrated that error rates of microbial association network inference tools can be high. Such error rates can have a large impact on some centrality measures (Borgatti and colleagues, 2006). While network permutation is not that commonly adopted for microbial association networks, it can help to identify nodes not robust to errors. We are going to implement a strategy similar to the one described by Frantz and Carley (2016) to test our centrality measures.

igraph contains a function to rewire networks. We will rewire 20% (38) edges while preserving the original degree distribution. 100 permutations should give an idea of the variation in centralities. For each of the permutations, we can generate a similar dataframe as the centralities dataframe generated for the original graph.

permlist <- list()
for (i in 1:100){
  permlist[[i]] <- rewire(spiec.graph, with=keeping_degseq(niter=38))
}

There are multiple ways to compute p-values or other statistics from these permutations. One approach would be to carry out a t-test and assess whether the centralities from the permutations are consistently worse than the centrality values measured for the original graph. However, this may not yield reliable results for some network centralities (e.g. eigenvector). Can you explain why?

Instead of comparing centrality values for the separate networks, we are going to test if network centrality values for the top 1, top 5, top 10 and top 10% (19) nodes are still in the top values for the permuted networks. Try to figure this out yourself and see what effect the top # has on robustness. Solutions are here.