Multiple correspondence analysis (MCA) is a statistical technique for nominal data (categorical data whose categories do not overlap) that represents the data as points in a low-dimensional space. It is the counterpart of principal component analysis for categorical data and an extension of correspondence analysis.

Correspondence Analysis

Correspondence analysis (CA) provides a means of displaying or summarizing a set of categorical data in two-dimensional graphical form. It is typically applied to a contingency table of a pair of nominal variables, where each cell contains a count (possibly zero). An example of a contingency table is given below:

          Right-handed  Left-handed  Total
  Male    43            9            52
  Female  44            4            48
  Total   87            13           100

The first step before any computation is to transform the raw matrix $N$ into the correspondence matrix $P = \frac{1}{n} N$, where $n$ is the grand total of $N$. We start by computing a set of weights (called masses) for rows and columns:

$$r = P \mathbf{1}, \qquad c = P^{\mathsf{T}} \mathbf{1}$$

where $\mathbf{1}$ is the all-ones column vector.

We then construct diagonal matrices:

$$D_r = \operatorname{diag}(r), \qquad D_c = \operatorname{diag}(c)$$

We then compute the matrix $S$, called the matrix of standardized residuals, by

$$S = D_r^{-1/2} \left( P - r c^{\mathsf{T}} \right) D_c^{-1/2}$$

where $r c^{\mathsf{T}}$ is the outer product of the mass vectors.

We then perform singular value decomposition on $S$ to get

$$S = U \Sigma V^{\mathsf{T}}$$

where $U$ and $V$ contain the left and right unitary singular vectors of $S$ and $\Sigma$ is the diagonal matrix of singular values $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_k$. $U$ is of dimension $I \times k$, $\Sigma$ is $k \times k$ and $V$ is of dimension $J \times k$, where $k = \min(I, J)$.

We define the total inertia of the data table by

$$\mathcal{I} = \sum_{i} \sigma_i^2 = \operatorname{tr}(\Sigma^2)$$

which equals the chi-square statistic of the contingency table divided by $n$.
We want to transform the singular vectors into coordinates, forming the principal coordinates, while preserving the $\chi^2$-distances between rows or columns. For rows we use

$$F = D_r^{-1/2} U \Sigma$$

and for columns we use

$$G = D_c^{-1/2} V \Sigma$$

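The steps above can be sketched in NumPy using the handedness table from the beginning of the section (a minimal illustration, not a full-featured implementation):

```python
import numpy as np

# Contingency table from the text: rows Male/Female, columns Right-/Left-handed
N = np.array([[43.0, 9.0],
              [44.0, 4.0]])
n = N.sum()                         # grand total (100)
P = N / n                           # correspondence matrix

r = P.sum(axis=1)                   # row masses,    r = P @ 1
c = P.sum(axis=0)                   # column masses, c = P.T @ 1
Dr_isq = np.diag(1.0 / np.sqrt(r))  # D_r^{-1/2}
Dc_isq = np.diag(1.0 / np.sqrt(c))  # D_c^{-1/2}

# Standardized residuals: S = D_r^{-1/2} (P - r c^T) D_c^{-1/2}
S = Dr_isq @ (P - np.outer(r, c)) @ Dc_isq

U, sigma, Vt = np.linalg.svd(S, full_matrices=False)

inertia = (sigma ** 2).sum()        # total inertia = sum of squared singular values

F = Dr_isq @ U @ np.diag(sigma)     # row principal coordinates
G = Dc_isq @ Vt.T @ np.diag(sigma)  # column principal coordinates
```

For a 2x2 table there is only one non-trivial dimension, and the total inertia coincides with the table's chi-square statistic divided by $n$.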
Example

Consider the following contingency table, $N$:

              Tasty  Aesthetic  Economic
  Butterbeer  5      7          2
  Squishee    18     46         20
  Slurm       19     29         39
  Fizzy       12     40         49
  Brawndo     3      7          16

Adding up the cells gives us that $n = 312$, resulting in an observed proportions table, $P = \frac{1}{n} N$, of

              Tasty  Aesthetic  Economic
  Butterbeer  0.016  0.022      0.006
  Squishee    0.058  0.147      0.064
  Slurm       0.061  0.093      0.125
  Fizzy       0.038  0.128      0.157
  Brawndo     0.010  0.022      0.051
We get row masses of

$$r = (0.045,\ 0.269,\ 0.279,\ 0.324,\ 0.083)^{\mathsf{T}}$$

and column masses of

$$c = (0.183,\ 0.413,\ 0.404)^{\mathsf{T}}$$
Computing the SVD of $S$ results in the following data:

  • Singular values ($\Sigma$):

    1st dim   2nd dim   3rd dim
    2.65e-01  1.14e-01  4.21e-17

  • Left singular vectors ($U$):

                1st dim  2nd dim  3rd dim
    Butterbeer  -0.439   -0.424   -0.326
    Squishee    -0.651   0.355    0.029
    Slurm       0.16     -0.672   -0.362
    Fizzy       0.371    0.488    -0.747
    Brawndo     0.466    -0.059   0.451

  • Right singular vectors ($V$):

               1st dim  2nd dim  3rd dim
    Tasty      -0.41    -0.806   0.427
    Aesthetic  -0.489   0.59     0.643
    Economic   0.77     -0.055   0.635
We can then compute the row principal coordinates ($F$):

                1st dim     2nd dim     3rd dim
    Butterbeer  -5.49e-01   -2.271e-01  -1.009e-16
    Squishee    -3.331e-01  7.768e-02   3.673e-18
    Slurm       8.05e-02    -1.446e-01  -4.494e-17
    Fizzy       1.73e-01    9.748e-02   -8.609e-17
    Brawndo     4.305e-01   -2.352e-02  1.024e-16
and the column principal coordinates ($G$):

               1st dim     2nd dim     3rd dim
    Tasty      -2.543e-01  -2.14e-01   6.555e-17
    Aesthetic  -2.016e-01  1.041e-01   6.555e-17
    Economic   3.215e-01   -9.751e-03  6.555e-17

Finally, to visualize the results, we plot both sets of coordinates with the first dimension on the $x$-axis and the second dimension on the $y$-axis:

[Figure: scatter plot of the row and column principal coordinates, first dimension vs. second dimension]
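The beverage example can also be reproduced numerically; in the sketch below (assuming NumPy), the computed singular values match those tabulated above up to rounding. Note that the signs of singular vectors, and hence of the coordinates, are arbitrary and may come out flipped.

```python
import numpy as np

# Beverage contingency table from the text:
# rows: Butterbeer, Squishee, Slurm, Fizzy, Brawndo
# cols: Tasty, Aesthetic, Economic
N = np.array([[ 5,  7,  2],
              [18, 46, 20],
              [19, 29, 39],
              [12, 40, 49],
              [ 3,  7, 16]], dtype=float)

n = N.sum()                                     # 312
P = N / n
r, c = P.sum(axis=1), P.sum(axis=0)             # row and column masses
S = np.diag(r ** -0.5) @ (P - np.outer(r, c)) @ np.diag(c ** -0.5)
U, sigma, Vt = np.linalg.svd(S, full_matrices=False)

F = np.diag(r ** -0.5) @ U @ np.diag(sigma)     # row principal coordinates
G = np.diag(c ** -0.5) @ Vt.T @ np.diag(sigma)  # column principal coordinates
print(sigma)                                    # ~ [0.265, 0.114, ~0]
```

The third singular value is zero up to floating-point noise: a 5x3 table has at most min(5 - 1, 3 - 1) = 2 non-trivial dimensions.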

Multiple Correspondence Analysis

MCA is performed by applying the CA algorithm to an indicator matrix (also known as a complete disjunctive table; see One-Hot Encoding). MCA only works on categorical data; any quantitative data (e.g. age, weight, height) must first be turned into categories, for example by binning into statistical quantiles.

Once the data set is entirely categorical, we construct the indicator matrix/table $Z$. If there are $n$ observations and $K$ categorical variables, where $J_k$ is the number of categories of the $k$-th variable, then $Z$ is an $n \times J$ matrix with all coefficients being 0 or 1, where $J = \sum_{k=1}^{K} J_k$.
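As a small illustration of building such an indicator table, the sketch below one-hot encodes two hypothetical categorical variables (the colour/size data and the encoding scheme are made up for illustration):

```python
import numpy as np

# n = 4 observations of K = 2 categorical variables (hypothetical data)
data = [("red",   "small"),
        ("blue",  "large"),
        ("red",   "large"),
        ("green", "small")]
K = 2

# collect the categories of each variable, in a fixed order
levels = [sorted({row[k] for row in data}) for k in range(K)]
J = sum(len(lv) for lv in levels)  # total number of categories

# indicator matrix Z: one block of columns per variable, one 1 per block per row
Z = np.zeros((len(data), J), dtype=int)
offset = 0
for k, lv in enumerate(levels):
    for i, row in enumerate(data):
        Z[i, offset + lv.index(row[k])] = 1
    offset += len(lv)

print(Z.sum(axis=1))  # every row sums to K = 2: one category per variable
```

Because each observation selects exactly one category per variable, every row of $Z$ sums to $K$, and the grand total of $Z$ is $nK$.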

We define $P = \frac{1}{N} Z$, where $N$ is the sum of all entries of $Z$ (equal to $nK$). We construct two vectors:

$$r = P \mathbf{1}, \qquad c = P^{\mathsf{T}} \mathbf{1}$$

i.e., $r$ contains the sums along the rows of $P$ and $c$ the sums along the columns of $P$. We then find the singular value decomposition of

$$S = D_r^{-1/2} \left( P - r c^{\mathsf{T}} \right) D_c^{-1/2}$$

where $D_r = \operatorname{diag}(r)$ and $D_c = \operatorname{diag}(c)$. The SVD of $S$ gives us unitary matrices $U$ and $V$ and diagonal matrix $\Sigma$ such that $S = U \Sigma V^{\mathsf{T}}$. Then, like before, we construct the row principal coordinates by

$$F = D_r^{-1/2} U \Sigma$$

and the column principal coordinates by

$$G = D_c^{-1/2} V \Sigma$$
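Putting the pieces together, MCA is the CA computation applied to the indicator matrix; the sketch below runs it on a tiny hand-made indicator matrix (4 observations, 2 variables, assuming NumPy):

```python
import numpy as np

# Hand-made indicator matrix Z: columns 0-2 encode variable 1 (3 categories),
# columns 3-4 encode variable 2 (2 categories); one 1 per block per row.
Z = np.array([[1, 0, 0, 1, 0],
              [0, 1, 0, 0, 1],
              [1, 0, 0, 0, 1],
              [0, 0, 1, 1, 0]], dtype=float)

N = Z.sum()                                     # grand total = n * K = 8
P = Z / N
r, c = P.sum(axis=1), P.sum(axis=0)             # row and column masses
S = np.diag(r ** -0.5) @ (P - np.outer(r, c)) @ np.diag(c ** -0.5)
U, sigma, Vt = np.linalg.svd(S, full_matrices=False)

F = np.diag(r ** -0.5) @ U @ np.diag(sigma)     # observation (row) coordinates
G = np.diag(c ** -0.5) @ Vt.T @ np.diag(sigma)  # category (column) coordinates
```

Since every row of $Z$ sums to $K$, all row masses are equal ($1/n$ each), so in MCA the observations are uniformly weighted.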

Interpreting Correspondence Plots

Correspondence analysis captures relative, not absolute, relationships. A correspondence plot will not show which rows or columns have the largest counts; instead, it shows which categories are most strongly associated with one another.

The distance from the origin is a measure of how discriminating a row or column category is. Categories close to the origin have profiles close to the average and are therefore less distinctive.

Columns that are close together (ensure the plot's aspect ratio is set to 1) indicate similar column profiles. To judge the relationship between a row and a column, look instead at the angle between the lines connecting each point to the origin: a small angle suggests a strong association.
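The angle heuristic can be made concrete: treating a row point and a column point as vectors from the origin, the cosine of the angle between them indicates the strength of the association. The coordinates below are the Butterbeer and Tasty points from the example above, rounded to three decimals (first two dimensions only):

```python
import numpy as np

def association(f_i, g_j):
    """Cosine of the angle at the origin between a row point and a column point.
    Values near 1 suggest attraction, near -1 repulsion, near 0 little relation."""
    return float(f_i @ g_j / (np.linalg.norm(f_i) * np.linalg.norm(g_j)))

f = np.array([-0.549, -0.227])  # Butterbeer row principal coordinates
g = np.array([-0.254, -0.214])  # Tasty column principal coordinates
print(round(association(f, g), 3))  # small angle -> cosine close to 1
```

Here Butterbeer and Tasty point in nearly the same direction from the origin, consistent with Butterbeer's profile leaning toward the Tasty column.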