Module 4: Cluster Analysis II

CMPS 163: Business Analytics

Introduction

Now that we have a basic understanding of clustering and have even run our own cluster analysis, this module looks at several important aspects of cluster analysis that we have mostly glossed over so far. Although we look at these issues in the context of clustering, they matter in other approaches as well, such as classification and regression, though the details differ. The first issue is assessing the performance of our cluster analysis. We found our clusters, and at least some of them made sense, but how well was k-means able to find them? Was the result good or bad? Are there better clusters to be found? We will use silhouettes to answer these questions. Silhouettes are specific to cluster analysis; classification and regression have other ways to assess performance.
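To make the idea of a silhouette concrete, here is a minimal sketch in Python (the module itself works in Excel, so the language, the function names, and the four toy points are all assumptions made for illustration). It uses the standard definition: for each point, a is the mean distance to the other points in its own cluster, b is the smallest mean distance to the points of any other cluster, and the silhouette is (b - a) / max(a, b). Values near 1 suggest the point sits well inside its cluster; values near or below 0 suggest it may belong elsewhere.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def silhouette_values(points, labels, dist=euclidean):
    """Return one silhouette value per point (assumes at least two clusters)."""
    clusters = set(labels)
    values = []
    for i, p in enumerate(points):
        # Mean distance from p to the members of each cluster, excluding p itself.
        mean_dist = {}
        for c in clusters:
            members = [q for j, q in enumerate(points) if labels[j] == c and j != i]
            if members:
                mean_dist[c] = sum(dist(p, q) for q in members) / len(members)
        own = labels[i]
        if own not in mean_dist:
            values.append(0.0)  # a cluster of one point gets silhouette 0 by convention
            continue
        a = mean_dist[own]                                       # cohesion
        b = min(d for c, d in mean_dist.items() if c != own)     # separation
        values.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return values

# Hypothetical toy data: two tight pairs of points that are far apart.
points = [(1, 1), (1, 2), (5, 5), (5, 6)]
good_labels = [0, 0, 1, 1]   # each pair is its own cluster
bad_labels = [0, 1, 0, 1]    # each cluster mixes the two pairs

good = silhouette_values(points, good_labels)
bad = silhouette_values(points, bad_labels)
print("sensible clustering:", [round(v, 2) for v in good])
print("mixed-up clustering:", [round(v, 2) for v in bad])
```

Averaging the per-point values gives a single score for the whole clustering, which is what makes it possible to compare one clustering against another.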

The second issue is related to performance assessment and concerns the k in k-means clustering. So far we have fixed k at 4, but what about a different number of clusters, maybe 3 or 5? Of course, when we change k we need a way to assess whether the resulting clusters are getting better or worse, which is why these two issues are related. In general, most algorithms have one or more parameters that have to be tuned to reach optimal performance. Clustering is also affected by the choice of distance function, so we will look at some alternatives for this as well.
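The following sketch shows one way these two ideas could fit together: run a clustering for several values of k, score each result with the mean silhouette, and repeat with a different distance function. Everything here is an assumption made for illustration rather than the course's own procedure: the toy data, the function names, the deterministic "farthest-first" choice of starting centers, and the choice of Manhattan distance as the alternative metric. Note also that using Manhattan distance for assignment while still averaging to update the centers is a simplification and not textbook k-means.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def kmeans(points, k, dist=euclidean, iterations=50):
    """Simplified k-means; returns one cluster label per point."""
    # Assumption: farthest-first starting centers, so results are repeatable.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest center under the chosen distance.
        labels = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        # Move each center to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

def mean_silhouette(points, labels, dist=euclidean):
    """Average of (b - a) / max(a, b) over all points."""
    total = 0.0
    for i, p in enumerate(points):
        sums, counts = {}, {}
        for j, q in enumerate(points):
            if j != i:
                sums[labels[j]] = sums.get(labels[j], 0.0) + dist(p, q)
                counts[labels[j]] = counts.get(labels[j], 0) + 1
        means = {c: sums[c] / counts[c] for c in sums}
        own = labels[i]
        others = [d for c, d in means.items() if c != own]
        if own in means and others and max(means[own], min(others)) > 0:
            a, b = means[own], min(others)
            total += (b - a) / max(a, b)
        # Points in single-member clusters contribute 0.
    return total / len(points)

# Hypothetical toy data: three small, well-separated groups of four points.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 0), (10, 1), (11, 0), (11, 1),
          (5, 9), (5, 10), (6, 9), (6, 10)]

scores_by_metric = {}
for name, dist in [("euclidean", euclidean), ("manhattan", manhattan)]:
    scores = {k: mean_silhouette(points, kmeans(points, k, dist), dist)
              for k in range(2, 6)}
    scores_by_metric[name] = scores
    best_k = max(scores, key=scores.get)
    print(name, {k: round(s, 2) for k, s in scores.items()}, "best k:", best_k)
```

On real data the picture is rarely this clean: the best k may differ between distance functions, and several values of k may score almost equally well, so the silhouette is best treated as one piece of evidence alongside whether the clusters make sense.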

Module Objectives

  • Assess the performance of clustering with silhouettes
  • Find silhouettes for a clustering in Excel
  • Investigate the effect of changing the number of clusters
  • Investigate the effect of changing the distance metric

Learning Resources

  • Module 4 Readings: Chapter 2
  • Module 4 Slides: Chapter 2

Learning Activities

  • Module 4 Assignment
  • Module 4 Slides: Chapter 2

For Further Study
