Nodes clustering and embedding

Let’s stard with loading some network packages we’ll need (you might need to install them prior to loading).

library(igraph)
library(igraphdata)
library(sna)           # contains the function gplot to plot graphs
library(RColorBrewer)  # nice colors for our plots

We then load a graph in the form of an igraph object

data("karate")
# ?karate
print(karate)
## This graph was created by an old(er) igraph version.
##   Call upgrade_graph() on it to use with the current igraph version
##   For now we convert it on the fly...
## IGRAPH 4b458a1 UNW- 34 78 -- Zachary's karate club network
## + attr: name (g/c), Citation (g/c), Author (g/c), Faction (v/n), name
## | (v/c), label (v/c), color (v/n), weight (e/n)
## + edges from 4b458a1 (vertex names):
##  [1] Mr Hi  --Actor 2  Mr Hi  --Actor 3  Mr Hi  --Actor 4  Mr Hi  --Actor 5 
##  [5] Mr Hi  --Actor 6  Mr Hi  --Actor 7  Mr Hi  --Actor 8  Mr Hi  --Actor 9 
##  [9] Mr Hi  --Actor 11 Mr Hi  --Actor 12 Mr Hi  --Actor 13 Mr Hi  --Actor 14
## [13] Mr Hi  --Actor 18 Mr Hi  --Actor 20 Mr Hi  --Actor 22 Mr Hi  --Actor 32
## [17] Actor 2--Actor 3  Actor 2--Actor 4  Actor 2--Actor 8  Actor 2--Actor 14
## [21] Actor 2--Actor 18 Actor 2--Actor 20 Actor 2--Actor 22 Actor 2--Actor 31
## [25] Actor 3--Actor 4  Actor 3--Actor 8  Actor 3--Actor 9  Actor 3--Actor 10
## + ... omitted several edges

We can first obtain the adjacency matrix and plot the graph via gplot

A <- as_adjacency_matrix(karate, sparse = FALSE)
lay <- gplot(A, usearrows = FALSE, edge.lwd = .1)

Stochastic block model

One good package allowing one to fit SBM to the adjacency matrix in order to perform nodes clustering and model selection (i.e. the number of clusters \(K\)) is blockmodels (several alternatives exist on the CRAN repository)

library(blockmodels)

Since we deal with a binary adjacency matrix, we rely on the function BM_bernoulli

obj <- BM_bernoulli(membership_type = 'SBM', as.matrix(A), verbosity = 0, plotting = '')
obj$estimate()                           # fits SBM to the data for different values of K and init
best_mod <- which.max(obj$ICL)           # the model (K) maximizing the ICL   
est_Z <- obj$memberships[[best_mod]]$Z   # the cluster memberships in 0-1 encoding
est_Z <- apply(est_Z, 1, which.max)      # the cluster memberships as integers
print(est_Z)
##  [1] 2 2 2 1 3 3 3 1 1 1 3 1 1 1 1 1 3 1 1 1 1 1 1 3 3 3 3 3 3 3 1 3 2 2

The value of \(K\) maximizing the ICL was 3 and we can plot the visualized the estimated clusters as follows

my_pal = brewer.pal(5, 'Accent')
deg <- igraph::degree(karate)
gplot(A, 
      usearrows = FALSE, 
      edge.lwd = .1, 
      coord = lay,
      vertex.col = my_pal[est_Z],    # nodes' colors
      vertex.cex = log(deg + 0.1))   # nodes' sizes

As it can be seen, fitting SBM to the data allows one to capture “heterogeneous” cluster types. For instance, the hubs (nodes with higher degree) are all clustered in the violet group, then two communities (groups of densely connected nodes) are in orange, whereas the green one seems to be a group of peripheral nodes, only or mostly connected with the hubs.

Modularity maximization

Sometimes in network analysis we are particularly interested in detection communities, with no particular emphasis on other structures such as hubs, triangles etc. Although this can in principle by done by SBM too, the main road to community detection is the modularity maximization (not seen in class). This can be done straightforward with igraph.

obj <- igraph::cluster_optimal(karate)
est_Z <- obj$membership
print(est_Z)
##  [1] 1 1 1 1 2 2 2 1 3 3 2 1 1 1 3 3 2 1 3 1 3 1 3 4 4 4 3 4 4 3 3 4 3 3

Here the clustered graph:

Spectral decomposition

As seen in class, we can build the graph Laplacian associated with \(A\) and perform a spectral decomposition in order to use some of the eigenvalues as nodes embeddings (and possibly perform clustering via K-means)

N <- nrow(A)
D <- diag(rowSums(A))
L <- D - A                        # graph Laplacian
out <- eigen(L)           
Lbd <- out$values[N:1]            # eigenvalues are automatically sorted in decreasing order
Q <- out $vectors[,N:1]
par(bty = 'n')
plot(Lbd, col = 'RoyalBlue3', ylab = 'eigenvalues')

Now if we select the first (non-constant) \(K\) columns of \(Q\) we get an \(N \times K\) matrix whose i-th row is the embedding of the i-th node of the graph in \(\mathbb{R}^K\). In order to get some insights, we fix \(K=4\) as the number of communities we found via modularity maximization.

library(ggfortify)
## Loading required package: ggplot2
K <- 4
U <- Q[,2:(K+1)]
pca <- prcomp(U)               # principal component analysis to visualize the embeddings
autoplot(pca, color = my_pal[est_Z])   # colors/labels from modularity maximization!

As it can be seen, the embedding positions are sensitive to the community structure of the graph: nodes in the same community (generally) correspond to nearby points in the latent space. The isolated black point in the figure above corresponds to the only node in a network having a single tie. The spectral decomposition puts this node aside from the others and a K-means algorithm on the rows of \(U\) will necessarily assign this point to a single cluster. However, if we test a K-means with \(K=5\) clusters

out <- kmeans(U, 5, nstart =100, iter.max = 50)
est_Z_sc <- out$cluster

this is what the clustering produces in the latent space

and here what one sees when looking at the graph:

As it can be seen by comparing the different figures, modularity maximization and spectral clustering may lead to very similar cluster assignments. This intuition is confirmed numerically:

library(mclust)
print(adjustedRandIndex(est_Z, est_Z_sc))
## [1] 0.8741661