anuran

Demo

Three demo networks have been included with the anuran distribution. These networks were constructed with CoNet from a study on the spatial distribution of taxon abundances:

Meyer, K. M., Memiaghe, H., Korte, L., Kenfack, D., Alonso, A., & Bohannan, B. J. (2018). Why do microbes exhibit weak biogeographic patterns?. The ISME journal, 12(6), 1404.

In this study, the authors used a spatially explicit nested design to collect soil samples in the Rabi Forest Monitoring Plot in Gabon. This permitted them to construct distance-decay curves and assess whether these were different from such curves found for the trees on the plot. The heatmap below visualizes the Rabi_geodist_matrix file included with this publication, and shows how these samples are ordered. As you can see, the diagonal separates in 3 blocks with lower distances. The samples can be separated into 3 groups, with the in-group distance being much lower than the between-group distance. We used this matrix to separate the samples and construct 3 networks, one for each group. These networks were constructed with CoNet; taxa with a minimum count below 10 or a prevalence below 30% were removed. To handle the large diversity of the sequences, the abundances were merged by taxonomic family.

geodist <- read.table("Rabi_geodist_matrix.txt", sep='\t', stringsAsFactors = FALSE)
geodist <- sapply(geodist[2:40, 2:40], as.numeric)
heatmap(geodist, Rowv = NA, Colv = NA)

Since the samples were collected from a single plot, we do not expect major changes in the community composition, and some associations may be preserved across the three blocks. We can look for such conserved associations with anuran. Although many of anuran's features are more applicable to larger numbers of networks, some of these operations can take a lot of time; with this data, you will quickly be able to work through the main features.

As you can see, even though these networks were generated from identical numbers of samples that were taken from a highly similar environment, there is a lot of variation in the network size. Since many network measures (e.g. betweenness centrality) are closely related to network connectivity, they are not directly comparable. With anuran, these differences are taken into account through the generation of null models that closely resemble the original networks.

To access the demo networks, give demo as the input variable. If you want to upload your own networks, replace demo with the complete filepath to the folder containing your networks. anuran will find all files with .graphml, .gml and .txt (edge list) extensions and try to import these. For exporting all files, anuran will use whatever prefix is specified after the output filepath.

anuran -i demo -o outputpath/demo

For a complete explanation of the parameters and their defaults, use the help command.

anuran -h

The following parameters will show whether there are more associations shared by 2 or 3 networks than we would expect.

-size is the size of the intersection; the associations need to be shared by 60% or 100% of the networks.
-perm is the number of null models generated per original network.
-nperm is the number of times that the generated null models are subsampled to infer the expected set size.
-draw is a flag that generates some quick overview figures of the results.

anuran -i demo -o outputpath/demo -size 0.6 1 -perm 5 -nperm 10 -draw

You should find four new files in the filepath you specified: a .csv file with all the set sizes for the networks and the null models, two .graphml files that contain the 0.6 and 1 intersections, and a figure visualizing the data in this .csv file. The figure shows that the difference for the actual networks (labelled 'Input') is smaller than the difference for the null models. Moreover, the two types of null models (Degree and Random) show the effect of the degree distribution on the sets. The degree-preserving null models have a smaller difference than the fully randomized models.

Additionally, there are more associations shared between two networks than we would expect if the associations were random or only the degree distribution was preserved.

Adding null models with a synthetic core requires two additional parameters. Since we do not know the true core, the synthetic cores are generated per input network and the sets computed from a single source network. This introduces a large amount of variation in the set sizes for the synthetic cores, but we can still get an idea of the upper limit. The -cs parameter defines the sizes of the synthetic core as the fraction of the total edge number in the networks, while the -prev parameter controls the fraction of networks where the core associations occur. In this case, the core associations are present in 2 out of 3 networks.

anuran -i demo -o outputpath/demo -size 0.6 1 -perm 10 -nperm 10 -draw -cs 0.1 0.2 0.3 -prev 0.6

Again, the exported image will show the distribution of set sizes. If more associations belong to the core, the difference decreases slightly. The increase in the intersection is difficult to see because there are so many more associations that are unique to each network; however, we do observe that the actual intersection observed falls within the distribution of intersections estimated for the degree-preserving models with a synthetic core.

If we wanted to see the effect of increasing the number of networks, we could add two parameters: -sample and -n. The -sample parameter starts the resampling procedure and defines the maximum number of networks to be sampled per network number, while the -n flag is used to determine what numbers of networks should be resampled. In this case, we can only increase the number of networks from 2 to 3. When you have more than 10 networks, the -n parameter becomes important since the resampling procedure can be very time-consuming. For example, with 20 networks, you could set -n 2 5 10 15 20 so the networks are only resampled -sample times at each of the numbers specified in -n.

anuran -i demo -o outputpath/demo -size 1 -perm 10 -nperm 10 -sample 5 -n 2 3 -draw

The resampling procedure generates a very similar output. Obviously, we only have 2 data points, but there is still an upwards trend visible; adding an extra network increases the total number of unique associations.

We can also use anuran to find taxa that consistently have high centralities. In this case, we again only have 3 networks, so the estimates for the confidence intervals are not robust. However, we can already find some taxa that consistently have high centralities. To generate the centralities, only the -c flag needs to be included.

anuran -i demo -o outputpath/demo -size 1 -perm 10 -nperm 10 -c -draw

If we look at the lower and upper limits for the 95% confidence interval of the closeness centrality, there are many taxa that have a high upper limit and a low lower limit. This suggests that there was a lot of variation in the closeness centrality ranking for these taxa. However, we can also see that there are only a few taxa in the fully randomized model that have a lower limit above 0.8. That this is larger in the degree-preserving model, suggests that there is a strong correlation between the degree centrality and the closeness centrality.

Similarly, we can compare the graph properties of the original networks to the null models with the -net parameter. This can take a little more time to run.

anuran -i demo -o outputpath/demo -size 1 -perm 10 -nperm 10 -net -draw

The distribution of the graph properties suggests that there are a lot of differences in topology among the networks.

While not directly appliccable to this particular set of networks, we can carry out statistics on the properties. Since only 3 networks are available, the statistics are not appropriate on this demo, so anuran will generate a warning. If you want anuran to calculate the statistics but not compute the multiple-testing corrected p-value, use -stats True. Otherwise, specify a multiple-testing method after -stats, for example fdr_bh.

anuran -i demo -o outputpath/demo -size 1 -perm 10 -nperm 10 -stats True

This generates an additional .csv file that contains comparisons of specific properties to the null models. For example, the set sizes of the original networks are compared to the null models with a Z-score test, while the centrality rankings are compared with Mann-Whitney U tests. Many of these tests (and the centrality confidence interval) require that the data is normally distributed. For the Z-score test, anuran will generate a warning if this is not the case; however, the normality test cannot be carried out on only 3 networks. While the confidence interval for the centralities also assumes that the centrality rankings follow the Student's t distribution, no normality test is used here. Hence, these confidence intervals should be interpreted carefully.

The table below shows the example output for the statistical analysis on the set sizes. While the Z-test is not appropriate for this data, the p-value approaches zero, suggesting that the set sizes for the demo networks are truly different from those found for the null models.

X	Group	Comparison	Measure	P.type
2	demo	Random	Difference	Set sizes
4	demo	Random	Intersection 1	Set sizes
0	demo	Random	Intersection 0.6	Set sizes
5	demo	Degree	Intersection 1	Set sizes
3	demo	Degree	Difference	Set sizes
1	demo	Degree	Intersection 0.6	Set sizes

In this demo, we only have one folder with networks. anuran can import networks from multiple folders; in this case, the set operations and centrality estimates are computed separately per folder. If you have a nested design, e.g. two studies with soil samples from different locations, you can expand the statistical analysis to take this into account. In this case, the -compare flag compares properties to other folders in addition to null model comparisons.

anuran -i folder1 folder2 -o outputpath/demo -size 1 -perm 10 -nperm 10 -stats fdr_bh -compare

The default plots generated by anuran are not optimal since they scale to match the largest sets. To generate higher-quality figures, ggplot2 is recommended. You can import the file ending in '_set_differences.csv' and use this to make a figure with ggplot2. The script below is an example of how you could do this.

library(ggplot2)
stats <- read.csv('test2_set_differences.csv')
data <- plyr::ddply(stats, c("Network", "Interval"), plyr::summarise, mean=mean(Set.size), sd=sd(Set.size), sem=(sd(Set.size)/sqrt(length(Set.size))))

# remove some of the positive control models for clarity
names <- c("Input", "Degree", "Random",
           "Degree size: 0.2 prev:1", "Random size: 0.2 prev:1",
           "Degree size: 0.6 prev:1",  "Random size: 0.6 prev:1")
data <- data[data$Network %in% names,]

# Order networks by input / neg control / positive control
data$Network <- factor(data$Network, levels=names)

# Change interval to factor
data$Interval <- factor(data$Interval)

# omit difference, since this makes the figure unclear
data <- data[data$Interval != 1,]

# plot linetype, need to set scale_x and linetyple manually
ggplot(data, aes(x=Interval, y=mean, colour=Network)) +
  geom_line(aes(x=as.numeric(Interval), y=mean, linetype=Network), size=1) + 
  geom_errorbar(aes(x=as.numeric(Interval), ymin=mean-sem, ymax=mean+sem), width=.1, show.legend=FALSE)+
  theme_minimal() + theme(axis.text.x = element_text(angle=70, hjust=1)) + 
  theme(strip.text.x = element_blank(), panel.grid.minor.x = element_blank()) + 
  scale_x_continuous(name = "Difference of intersections",
                     breaks=as.numeric(data$Interval),
                     labels = data$Interval)+
  scale_color_viridis_d(name='Network') + labs(y='Set size') +
  scale_linetype_manual(name='Network', breaks=c("Input", "Degree", "Random",
                                                 "Degree size: 0.2 prev:1", "Random size: 0.2 prev:1",
                                                 "Degree size: 0.6 prev:1",  "Random size: 0.6 prev:1"),
                        values=c('solid', 'dashed', 'dashed', 'dotted', 'dotted', 'dotted', 'dotted'), drop=FALSE)

The figure shows the actual set size in a solid line, the negative control as a dashed line, and the positive controls as dotted lines. The positive controls have 20% and 60% conserved associations (of the total set of assocations across all networks). The x-axis represents the intersection across a fraction of networks; in this case 0.6*3 is 2 networks, so the numbers for 0.6->1 are the total number of associations that are in 2 but not in 3 networks.

The figure clearly shows that the observed data has more associations present in 2 out of 3 networks than we would expect from the negative controls. However, both of the positive controls have more conserved associations than the observed networks, especially at the 1->1 set. This is because the positive control edges were set to occur in all 3 networks (prev: 1).

anuran

Lisa Rottjers

19 november 2019

Demo