Gaussian mixture models

When dealing with GMMs, it is impossible not to mention the excellent R package mclust:

# install.packages('mclust')
library(mclust)
## Warning: package 'mclust' was built under R version 4.2.3
## Package 'mclust' version 6.0.1
## Type 'citation("mclust")' for citing this R package in publications.

We test it on the classic iris dataset:

data(iris)
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Although each flower's species, which is in fact our label, is known, here we "forget" it and consider only the four numeric features, which we also standardise:

X <- iris[, 1:4]  # keep the four numeric features, drop Species
Xcr <- scale(X)   # centre each feature and scale it to unit variance
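
Since the four measurements are on different scales, standardising puts them on a common footing. A quick sanity check with base R (each column of Xcr should now have mean 0 and standard deviation 1):

round(colMeans(Xcr), 3)       # column means, all 0 after centring
round(apply(Xcr, 2, sd), 3)   # column standard deviations, all 1 after scaling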

Using mclust is incredibly easy. The simplest (yet exhaustive!) thing we can do is

res <- Mclust(Xcr)
summary(res)
## ---------------------------------------------------- 
## Gaussian finite mixture model fitted by EM algorithm 
## ---------------------------------------------------- 
## 
## Mclust VVV (ellipsoidal, varying volume, shape, and orientation) model with 2
## components: 
## 
##  log-likelihood   n df       BIC       ICL
##       -322.6936 150 29 -790.6956 -790.6969
## 
## Clustering table:
##   1   2 
##  50 100

As can be checked by typing ?Mclust in the R console, the function fits GMMs to the data, testing every number of components (G) between 1 and 9 by default. For each value of G, several configurations of the covariance matrices are tested (spherical components or not, equal shape or not, etc.). The BIC is computed for each model and used to select the optimal one. Above, the winner is \(G^* = 2\) with the VVV structure, corresponding to completely unconstrained \(\Sigma_k\); note that the two clusters of sizes 50 and 100 correspond to setosa on one side and to versicolor and virginica merged together on the other. A more in-depth check of the model selection can be done:

print(res$BIC)
## Bayesian Information Criterion (BIC): 
##         EII       VII        EEI        VEI        EVI        VVI        EEE
## 1 -1723.766 -1723.766 -1738.7979 -1738.7979 -1738.7979 -1738.7979 -1046.6559
## 2 -1344.053 -1325.273 -1259.6457 -1172.9600 -1223.9860 -1074.2293  -904.7750
## 3 -1214.525 -1222.844 -1029.7330  -995.8343 -1014.5146  -961.3205  -849.6448
## 4 -1177.163 -1139.392 -1044.1118  -965.1348 -1054.2229  -967.7021  -862.7323
## 5 -1135.642 -1107.791  -958.6219  -905.0241  -983.5054  -928.1301  -821.5159
## 6 -1131.245 -1102.864  -910.4787  -892.8517  -990.7497  -923.9756  -826.5558
## 7 -1119.876 -1053.274  -929.8709  -897.4180 -1030.2038  -983.3316  -849.1854
## 8 -1124.299 -1059.090  -908.1008  -896.1450  -957.1355  -980.8747  -855.9594
## 9 -1114.899 -1060.444  -912.9422  -918.7055  -984.5159  -972.5093  -869.7734
##          VEE        EVE        VVE        EEV        VEV        EVV        VVV
## 1 -1046.6559 -1046.6559 -1046.6559 -1046.6559 -1046.6559 -1046.6559 -1046.6559
## 2  -873.0048  -906.9282  -880.7933  -873.6590  -802.5253  -875.0084  -790.6956
## 3  -822.0760  -862.9077  -828.8731  -875.3724  -797.5483  -872.7515  -797.5190
## 4  -821.5149  -902.9964  -881.1270  -930.1693  -821.4057  -941.9898  -847.2777
## 5         NA  -901.9539  -853.4148  -877.2201  -837.6111         NA  -893.2973
## 6  -826.2392  -873.0654  -865.1217  -905.4942  -879.2801         NA  -971.4786
## 7         NA  -913.1458  -898.7411  -953.7660  -920.8553 -1026.5234 -1023.6106
## 8  -871.9319  -934.3770  -908.8799  -955.7741  -939.3449 -1048.4311 -1047.3167
## 9         NA  -955.8249  -947.9984  -990.5646  -987.1010 -1099.1226 -1100.3729
## 
## Top 3 models based on the BIC criterion: 
##     VVV,2     VVV,3     VEV,3 
## -790.6956 -797.5190 -797.5483
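
The same comparison can be inspected graphically, since mclustBIC objects come with their own plot method drawing one BIC curve per covariance structure:

plot(res$BIC)  # BIC against G, one curve per model code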

We can do the same thing by means of the ICL criterion:

res <- mclustICL(Xcr)
print(res)
## Integrated Complete-data Likelihood (ICL) criterion: 
##         EII       VII        EEI        VEI       EVI        VVI        EEE
## 1 -1723.766 -1723.766 -1738.7979 -1738.7979 -1738.798 -1738.7979 -1046.6559
## 2 -1344.202 -1325.478 -1260.2203 -1172.9600 -1223.986 -1074.2293  -904.7763
## 3 -1229.544 -1237.574 -1037.4738 -1002.6577 -1019.236  -968.5785  -854.5499
## 4 -1198.217 -1156.164 -1069.5276  -978.3019 -1093.256  -979.9666  -878.7542
## 5 -1159.186 -1130.172  -989.9612  -918.0109 -1022.711  -941.7061  -837.3400
## 6 -1152.597 -1117.160  -933.9067  -906.7470 -1027.557  -936.4603  -854.2329
## 7 -1137.370 -1069.232  -957.7105  -913.1874 -1074.752  -999.3127  -893.2775
## 8 -1152.520 -1074.147  -940.3964  -913.6984  -986.908  -996.5274  -895.6516
## 9 -1141.677 -1078.341  -941.4725  -936.6474 -1019.060  -988.2414  -903.1997
##          VEE        EVE        VVE        EEV        VEV        EVV        VVV
## 1 -1046.6559 -1046.6559 -1046.6559 -1046.6559 -1046.6559 -1046.6559 -1046.6559
## 2  -873.0224  -906.9864  -880.8370  -873.6591  -802.5676  -875.0084  -790.6969
## 3  -828.9675  -872.2785  -835.0213  -884.9653  -804.5623  -882.4057  -800.7391
## 4  -835.5786  -958.7025  -892.4631  -955.0406  -837.1989  -988.0225  -854.6483
## 5         NA  -953.1908  -868.3194  -908.7734  -859.2445         NA  -906.9787
## 6  -852.8446  -889.7246  -881.6860  -937.9754  -899.0766         NA  -989.1171
## 7         NA  -991.1277  -918.6332  -975.2324  -935.6608 -1082.5280 -1042.6775
## 8  -903.5096  -982.7012  -927.1867 -1000.1578  -948.5550 -1076.5547 -1060.5857
## 9         NA  -988.0221  -966.8461 -1015.3776 -1006.5124 -1129.5143 -1116.8211
## 
## Top 3 models based on the ICL criterion: 
##     VVV,2     VVV,3     VEV,2 
## -790.6969 -800.7391 -802.5676
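
Note that mclustICL only scores the candidate models. To actually work with the ICL-best one, a minimal sketch is to refit it explicitly through the usual Mclust() interface:

res_icl <- Mclust(Xcr, G = 2, modelNames = "VVV")  # ICL-best model from the table above
summary(res_icl)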

Here BIC and ICL agree on the VVV model with two components. Some nice visualisations are also available:

res <- Mclust(Xcr)
plot(res)  # in an interactive session, this prompts to choose among the BIC, classification, uncertainty and density plots

We can also impose the number of clusters we want:

res <- Mclust(Xcr, G = 3)
plot(res)
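
In the same way, the covariance structure can be fixed in advance through the modelNames argument; a quick sketch, using the model codes appearing in the BIC table above:

res_eee <- Mclust(Xcr, G = 3, modelNames = "EEE")  # ellipsoidal components with equal volume, shape and orientation
summary(res_eee)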

Although comparing the labels obtained from clustering with the actual ones is not good practice in general, here we do it just to show that the GMM is particularly well suited to the species identification task. The adjusted Rand index equals 1 for a perfect match and has expected value 0 under random labelling:

print(adjustedRandIndex(res$classification, iris$Species))
## [1] 0.9038742
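
For a more direct view of the agreement, we can cross-tabulate the estimated clusters against the true species with base R's table() (keeping in mind that cluster numbering is arbitrary):

table(Cluster = res$classification, Species = iris$Species)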