RSS sections and groups meeting reports

Northern Ireland group meeting: Variable selection for model-based clustering of categorical data

Written by Gilbert MacKenzie. Posted in Sections and local group meeting reports

The Northern Ireland group held a meeting on Wednesday, January 20th, 2016 at 4pm in the David Bates Building at Queen's University Belfast. The speaker was Professor Brendan Murphy, University College Dublin, Ireland.

Professor Murphy motivated his topic using two data sets: Early Onset Alzheimer Patient Symptoms (240 patients with 6 symptoms) and Musculoskeletal (back) Pain in clinical practice (464 patients and 36 binary clinical indicators). These examples highlighted the need to (a) establish the existence of subgroups (clustering) and (b) identify the key variables (variable selection).

The basic methodological approach was model-based clustering, in which the data are assumed to arise from a finite mixture model. Professor Murphy then described a Latent Class Analysis model which assumed local independence among the M = 6 symptoms in the Alzheimer's study. Under this assumption he was able to write a product likelihood (over variables and categories) given membership of group g, where g = 1,...,G (this is similar to naive Bayes). The next step was to form the complete mixture likelihood over groups and perform a classical EM analysis for a range of values of G. From the analysis he produced a figure that showed which of the six variables were good predictors of the 2 (or possibly 3) clusters in the data.
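The latent class model and EM fit described above can be illustrated with a minimal sketch. It is not Professor Murphy's code: it assumes binary items, and the function names `fit_lca` and `bic`, the random initialisation, the small smoothing constants and the use of BIC to compare values of G are all assumptions made for illustration.

```python
import math
import random


def fit_lca(data, n_groups, n_iter=200, seed=0):
    """EM for a latent class model with binary items and local independence.

    data is a list of rows, each a list of 0/1 values.
    Returns (tau, theta, loglik): mixing proportions, P(item = 1 | group),
    and the log-likelihood evaluated in the final E-step.
    """
    rng = random.Random(seed)
    n, m = len(data), len(data[0])
    tau = [1.0 / n_groups] * n_groups
    # Random start (an assumption) so that the groups are not identical.
    theta = [[rng.uniform(0.25, 0.75) for _ in range(m)] for _ in range(n_groups)]
    loglik = -math.inf
    for _ in range(n_iter):
        # E-step: posterior probability that each row belongs to each group.
        resp, loglik = [], 0.0
        for x in data:
            logp = []
            for g in range(n_groups):
                lp = math.log(tau[g])
                for j in range(m):
                    p = theta[g][j]
                    # Local independence: the items multiply within a group.
                    lp += math.log(p if x[j] else 1.0 - p)
                logp.append(lp)
            top = max(logp)
            weights = [math.exp(v - top) for v in logp]
            total = sum(weights)
            resp.append([w / total for w in weights])
            loglik += top + math.log(total)
        # M-step: weighted proportions, lightly smoothed to avoid log(0).
        for g in range(n_groups):
            ng = sum(r[g] for r in resp)
            tau[g] = (ng + 1e-6) / (n + n_groups * 1e-6)
            for j in range(m):
                ones = sum(r[g] * x[j] for r, x in zip(resp, data))
                theta[g][j] = (ones + 1e-6) / (ng + 2e-6)
    return tau, theta, loglik


def bic(loglik, n, m, n_groups):
    """BIC (lower is better), one possible way to compare values of G."""
    n_params = (n_groups - 1) + n_groups * m
    return -2.0 * loglik + n_params * math.log(n)
```

Fitting the model for G = 1, 2, 3, ... and comparing the resulting criteria is one way to carry out the "range of values of G" step; the fitted `theta` rows then show which items differ most between groups.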

The next analysis was more taxing technically. He proposed a Bayesian clustering approach. From the outset the variables were allocated to two sets, which we may call active (clustering) and inactive. For variables in the active set the likelihood was as in the first example above, while for variables in the inactive set it was homogeneous over groups (since these variables did not discriminate). Separate Dirichlet priors were set on the item probabilities in the likelihoods for the discriminating variables and for the non-discriminating variables. The prior for G was taken as 1/G!, and a Bernoulli prior was placed on each variable being in the active set. With these arrangements, the posterior distribution turned out to be obtainable in closed form and the discrete parameter space was searched using an MCMC algorithm. The algorithm identified 7 clusters and these matched the 3 major components of the clinical taxonomy of back pain well. However, there was surprisingly little data reduction. It was thought that this was due to the local independence assumption among variables. The data were re-analysed using a greedy variable selection algorithm and the 36 variables were reduced to 11 predictors defining 3 clusters which matched the clinical criteria well.
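To see why conjugate priors give a closed form, the sketch below integrates out the item probabilities for binary variables, given a cluster allocation and an active/inactive split. It is an illustration under stated assumptions rather than the model presented in the talk: the Beta(1, 1) hyperparameters (the two-category case of a Dirichlet) and the function names are hypothetical, and the prior terms for G, for variable inclusion and for the mixing proportions are left out.

```python
import math


def log_beta_bernoulli(n1, n0, a=1.0, b=1.0):
    """Log marginal likelihood of n1 ones and n0 zeros under a Beta(a, b) prior."""
    return (math.lgamma(a + n1) + math.lgamma(b + n0)
            - math.lgamma(a + b + n1 + n0)
            - (math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)))


def log_marginal_items(data, labels, active, n_groups, a=1.0, b=1.0):
    """Item part of the collapsed log posterior for a given allocation.

    data:   list of rows of 0/1 values
    labels: group index (0 .. n_groups - 1) for each row
    active: one boolean per variable; True means the variable clusters
    """
    total = 0.0
    for j in range(len(data[0])):
        if active[j]:
            # Active variable: a separate item probability in every group.
            for g in range(n_groups):
                col = [x[j] for x, z in zip(data, labels) if z == g]
                n1 = sum(col)
                total += log_beta_bernoulli(n1, len(col) - n1, a, b)
        else:
            # Inactive variable: one item probability shared by all groups.
            col = [x[j] for x in data]
            n1 = sum(col)
            total += log_beta_bernoulli(n1, len(col) - n1, a, b)
    return total
```

A sampler or greedy search over allocations and active sets could compare candidate moves using differences in this quantity plus the omitted prior terms. A variable that separates the groups scores higher as active, while a variable unrelated to the groups scores higher as inactive.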

The talk was received with acclaim and led to an animated discussion. The closed form of the posterior was commented upon and the possibility of using other methods such as regularisation in the contingency table paradigm was discussed. The audience thanked Professor Murphy for a stimulating afternoon.

Copyright 2019 Royal Statistical Society. All Rights Reserved.
12 Errol Street, London, EC1Y 8LX. UK registered charity in England and Wales. No.306096
