When dealing with GMMs, it is impossible not to mention the excellent R package mclust:
# install.packages('mclust')
library(mclust)
## Warning: package 'mclust' was built under R version 4.2.3
## Package 'mclust' version 6.0.1
## Type 'citation("mclust")' for citing this R package in publications.
We test it on the iris dataset in R
data(iris)
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Although a flower’s species which actually is our label is known, here we ‘’forget’’ it and just consider the features
X <- iris[,c(1:4)]
Xcr <- scale(X)
The usage of mclust is incredibly easy. The simplest (yet exhaustive!) thing that we can do is
res <- Mclust(Xcr)
summary(res)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VVV (ellipsoidal, varying volume, shape, and orientation) model with 2
## components:
##
## log-likelihood n df BIC ICL
## -322.6936 150 29 -790.6956 -790.6969
##
## Clustering table:
## 1 2
## 50 100
As it can be checked by sending ?Mclust on the R console, the model fits GMMs to the data and tests any number of components (G) between 1 and 9, by default. For each values of G, several configurations of the covariance matrices are tested (spherical residuals or not, equal shape or not etc.). For each model the BIC is computed and used to select the optimal one. Above: \(G^* = 2\) and VVV, corresponding to completely unconstrained \(\Sigma_k\). A more in-depth check of the model selection can be done:
print(res$BIC)
## Bayesian Information Criterion (BIC):
## EII VII EEI VEI EVI VVI EEE
## 1 -1723.766 -1723.766 -1738.7979 -1738.7979 -1738.7979 -1738.7979 -1046.6559
## 2 -1344.053 -1325.273 -1259.6457 -1172.9600 -1223.9860 -1074.2293 -904.7750
## 3 -1214.525 -1222.844 -1029.7330 -995.8343 -1014.5146 -961.3205 -849.6448
## 4 -1177.163 -1139.392 -1044.1118 -965.1348 -1054.2229 -967.7021 -862.7323
## 5 -1135.642 -1107.791 -958.6219 -905.0241 -983.5054 -928.1301 -821.5159
## 6 -1131.245 -1102.864 -910.4787 -892.8517 -990.7497 -923.9756 -826.5558
## 7 -1119.876 -1053.274 -929.8709 -897.4180 -1030.2038 -983.3316 -849.1854
## 8 -1124.299 -1059.090 -908.1008 -896.1450 -957.1355 -980.8747 -855.9594
## 9 -1114.899 -1060.444 -912.9422 -918.7055 -984.5159 -972.5093 -869.7734
## VEE EVE VVE EEV VEV EVV VVV
## 1 -1046.6559 -1046.6559 -1046.6559 -1046.6559 -1046.6559 -1046.6559 -1046.6559
## 2 -873.0048 -906.9282 -880.7933 -873.6590 -802.5253 -875.0084 -790.6956
## 3 -822.0760 -862.9077 -828.8731 -875.3724 -797.5483 -872.7515 -797.5190
## 4 -821.5149 -902.9964 -881.1270 -930.1693 -821.4057 -941.9898 -847.2777
## 5 NA -901.9539 -853.4148 -877.2201 -837.6111 NA -893.2973
## 6 -826.2392 -873.0654 -865.1217 -905.4942 -879.2801 NA -971.4786
## 7 NA -913.1458 -898.7411 -953.7660 -920.8553 -1026.5234 -1023.6106
## 8 -871.9319 -934.3770 -908.8799 -955.7741 -939.3449 -1048.4311 -1047.3167
## 9 NA -955.8249 -947.9984 -990.5646 -987.1010 -1099.1226 -1100.3729
##
## Top 3 models based on the BIC criterion:
## VVV,2 VVV,3 VEV,3
## -790.6956 -797.5190 -797.5483
We can do the same thing by means of ICL
res <- mclustICL(Xcr)
print(res)
## Integrated Complete-data Likelihood (ICL) criterion:
## EII VII EEI VEI EVI VVI EEE
## 1 -1723.766 -1723.766 -1738.7979 -1738.7979 -1738.798 -1738.7979 -1046.6559
## 2 -1344.202 -1325.478 -1260.2203 -1172.9600 -1223.986 -1074.2293 -904.7763
## 3 -1229.544 -1237.574 -1037.4738 -1002.6577 -1019.236 -968.5785 -854.5499
## 4 -1198.217 -1156.164 -1069.5276 -978.3019 -1093.256 -979.9666 -878.7542
## 5 -1159.186 -1130.172 -989.9612 -918.0109 -1022.711 -941.7061 -837.3400
## 6 -1152.597 -1117.160 -933.9067 -906.7470 -1027.557 -936.4603 -854.2329
## 7 -1137.370 -1069.232 -957.7105 -913.1874 -1074.752 -999.3127 -893.2775
## 8 -1152.520 -1074.147 -940.3964 -913.6984 -986.908 -996.5274 -895.6516
## 9 -1141.677 -1078.341 -941.4725 -936.6474 -1019.060 -988.2414 -903.1997
## VEE EVE VVE EEV VEV EVV VVV
## 1 -1046.6559 -1046.6559 -1046.6559 -1046.6559 -1046.6559 -1046.6559 -1046.6559
## 2 -873.0224 -906.9864 -880.8370 -873.6591 -802.5676 -875.0084 -790.6969
## 3 -828.9675 -872.2785 -835.0213 -884.9653 -804.5623 -882.4057 -800.7391
## 4 -835.5786 -958.7025 -892.4631 -955.0406 -837.1989 -988.0225 -854.6483
## 5 NA -953.1908 -868.3194 -908.7734 -859.2445 NA -906.9787
## 6 -852.8446 -889.7246 -881.6860 -937.9754 -899.0766 NA -989.1171
## 7 NA -991.1277 -918.6332 -975.2324 -935.6608 -1082.5280 -1042.6775
## 8 -903.5096 -982.7012 -927.1867 -1000.1578 -948.5550 -1076.5547 -1060.5857
## 9 NA -988.0221 -966.8461 -1015.3776 -1006.5124 -1129.5143 -1116.8211
##
## Top 3 models based on the ICL criterion:
## VVV,2 VVV,3 VEV,2
## -790.6969 -800.7391 -802.5676
Some nice visualisations are available
res <- Mclust(Xcr)
plot(res)
And also, we can impose the number of clusters we want
res <- Mclust(Xcr, G = 3)
plot(res)
Although it is not a good practice to compare labels from clustering with the actual ones, here we do it just to show that GMM is particularly fit with respect to the species identification task
print(adjustedRandIndex(res$classification, iris$Species))
## [1] 0.9038742