Multiple correspondence analysis with Stata

Stata has commands for both simple (CA) and multiple correspondence analysis (MCA), which I believe are based on Michael Greenacre's code for the R package. At first, coming from specialized programs like SPAD, the Stata commands for MCA appear very rudimentary, but because of the versatility of Stata it is not very difficult to carry out the most important procedures in MCA. While SPAD no doubt offers a superior package for all kinds of factor analysis (including useful procedures not easily available elsewhere, like hybrid clustering after MCA), the strength of Stata is, in my view, its superior speed and flexibility for all the gritty data cleaning, recoding, renaming and variable generation involved in “real life” data analysis, in particular if you are doing your own surveys. No doubt SPSS, R and others are also fine programs for such work, but I have personally grown very attached to the Stata way of doing things and like to do most of my work here, including testing out different MCAs, before moving the data on to SPAD if necessary. For an excellent introduction to the Stata workflow, see Kohler and Kreuter (2009), Data Analysis Using Stata. A nice, if very basic, introduction to the most common forms of multivariate statistical analysis using Stata is Rabe-Hesketh and Everitt (2007), A Handbook of Statistical Analyses Using Stata.

My main purpose here is simply to demonstrate some of the more interesting ways to extend the basic functionality of the CA and MCA commands, both of which are explained much better in the Stata manual and in the help files. I am at present using version 12 of Stata, but most of the procedures I suggest can be used on much older versions of Stata.

For an introduction to the methodology and mathematics of MCA, I highly recommend first reading Multiple Correspondence Analysis (2010) by Brigitte Le Roux and Henry Rouanet and, when comfortable with this text, moving on to Geometric Data Analysis (2004) by the same authors.

This page is very much under construction, and I will be very happy for any further suggestions of tips and tricks for using Stata for MCA.

CORRESPONDENCE ANALYSIS (CA)

We will concentrate on the MCA command, but to quickly demonstrate CA we will use the famous “Smoker” dataset. Simply type (without the period):

. webuse ca_smoking

To inspect the table of the two variables (rank and smoking intensity), type

. tab rank smoking

            |              smoking intensity
       rank |      none      light     medium      heavy |     Total
------------+--------------------------------------------+----------
senior_mngr |         4          2          3          2 |        11
junior_mngr |         4          3          7          4 |        18
senior_empl |        25         10         12          4 |        51
junior_empl |        18         24         33         13 |        88
  secretary |        10          6          7          2 |        25
------------+--------------------------------------------+----------
      Total |        61         45         62         25 |       193

For the CA, simply type

. ca rank smoking

Correspondence analysis                          Number of obs     =      193
                                                 Pearson chi2(12)  =    16.44
                                                 Prob > chi2       =   0.1718
                                                 Total inertia     =   0.0852
    5 active rows                                Number of dim.    =        2
    4 active columns                             Expl. inertia (%) =    99.51

                |   singular    principal                             cumul
      Dimension |    value       inertia           chi2    percent   percent
    ------------+------------------------------------------------------------
          dim 1 |   .2734211    .0747591          14.43      87.76     87.76
          dim 2 |   .1000859    .0100172           1.93      11.76     99.51
          dim 3 |   .0203365    .0004136           0.08       0.49    100.00
    ------------+------------------------------------------------------------
          total |               .0851899          16.44     100

Statistics for row and column categories in symmetric normalization

                 |          overall          |        dimension_1        |        dimension_2
      Categories |    mass  quality   %inert |   coord   sqcorr  contrib |   coord   sqcorr  contrib
    -------------+---------------------------+---------------------------+---------------------------
    rank         |                           |                           |
     senior mngr |   0.057    0.893    0.031 |   0.126    0.092    0.003 |   0.612    0.800    0.214
     junior mngr |   0.093    0.991    0.139 |  -0.495    0.526    0.084 |   0.769    0.465    0.551
     senior empl |   0.264    1.000    0.450 |   0.728    0.999    0.512 |   0.034    0.001    0.003
     junior empl |   0.456    1.000    0.308 |  -0.446    0.942    0.331 |  -0.183    0.058    0.152
       secretary |   0.130    0.999    0.071 |   0.385    0.865    0.070 |  -0.249    0.133    0.081
    -------------+---------------------------+---------------------------+---------------------------
    smoking      |                           |                           |
            none |   0.316    1.000    0.577 |   0.752    0.994    0.654 |   0.096    0.006    0.029
           light |   0.233    0.984    0.083 |  -0.190    0.327    0.031 |  -0.446    0.657    0.463
          medium |   0.321    0.983    0.148 |  -0.375    0.982    0.166 |  -0.023    0.001    0.002
           heavy |   0.130    0.995    0.192 |  -0.562    0.684    0.150 |   0.625    0.310    0.506
    -------------------------------------------------------------------------------------------------

As can be seen, almost all of the inertia (87.76%) lies on the first dimension. To make a plot, type

. cabiplot, origin

For a full list of all the options, type

. help ca

Many of the MCA options shown in the examples below, such as supplementary variables, are also available for CA.
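A quick check of where the total inertia in the ca output comes from: it is the Pearson chi-square divided by the number of observations (16.44 / 193, or about 0.0852), and each principal inertia is the square of its singular value. A minimal sketch, assuming the ca_smoking data are still in memory:

. * tabulate leaves the Pearson chi2 and N behind in r()
. quietly tabulate rank smoking, chi2
. display "chi2/N = " r(chi2)/r(N)
. * squared singular value of dim 1, copied by hand from the ca table
. display .2734211^2

The first display should reproduce the total inertia and the second the principal inertia of the first dimension, up to rounding.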

MULTIPLE CORRESPONDENCE ANALYSIS (MCA)

For demonstration, we will simply use the ISSP-93 dataset.

. webuse issp93

We will start with looking at the answers to the four attitude questions.

A — too much science, not enough feelings&faith
B — science does more harm than good
C — any change makes nature worse
D — science will solve environmental problems

To look at the frequency distribution, type either . tab1 A B C D or, as I prefer, . fre A B C D. Note that fre is a user-written program that needs to be installed by typing . ssc install fre.

The results (and the soundness) of this analysis are not so important here; the main point is to demonstrate some of the basic functionality of Stata. First, let's do an MCA of these four questions, with gender and education as supplementary variables. Type

. mca A B C D, sup(sex edu)

Multiple/Joint correspondence analysis         Number of obs      =       871
                                               Total inertia      =  .1702455
    Method: Burt/adjusted inertias             Number of axes     =         2

                |   principal               cumul
      Dimension |    inertia     percent   percent
    ------------+----------------------------------
          dim 1 |    .0764553     44.91      44.91
          dim 2 |    .0582198     34.20      79.11
          dim 3 |     .009197      5.40      84.51
          dim 4 |    .0056697      3.33      87.84
          dim 5 |    .0011719      0.69      88.53
          dim 6 |    6.61e-06      0.00      88.53
    ------------+----------------------------------
          Total |    .1702455    100.00

Statistics for column categories in standard normalization

                 |          overall          |        dimension_1        |        dimension_2
      Categories |    mass  quality   %inert |   coord   sqcorr  contrib |   coord   sqcorr  contrib
    -------------+---------------------------+---------------------------+---------------------------
    A            |                           |                           |
    agree stro~y |   0.034    0.963    0.060 |   1.837    0.860    0.115 |   0.727    0.103    0.018
           agree |   0.092    0.659    0.023 |   0.546    0.546    0.028 |  -0.284    0.113    0.007
    neither ag~e |   0.059    0.929    0.037 |  -0.447    0.143    0.012 |  -1.199    0.786    0.084
        disagree |   0.051    0.798    0.051 |  -1.166    0.612    0.069 |   0.737    0.186    0.028
    disagree s~y |   0.014    0.799    0.067 |  -1.995    0.369    0.055 |   2.470    0.430    0.084
    -------------+---------------------------+---------------------------+---------------------------
    B            |                           |                           |
    agree stro~y |   0.020    0.911    0.100 |   2.924    0.781    0.174 |   1.370    0.131    0.038
           agree |   0.050    0.631    0.027 |   0.642    0.346    0.021 |  -0.667    0.285    0.022
    neither ag~e |   0.059    0.806    0.027 |   0.346    0.117    0.007 |  -0.964    0.690    0.055
        disagree |   0.081    0.620    0.033 |  -0.714    0.555    0.041 |  -0.280    0.065    0.006
    disagree s~y |   0.040    0.810    0.116 |  -1.354    0.285    0.074 |   2.108    0.526    0.179
    -------------+---------------------------+---------------------------+---------------------------
    C            |                           |                           |
    agree stro~y |   0.044    0.847    0.122 |   2.158    0.746    0.203 |   0.909    0.101    0.036
           agree |   0.091    0.545    0.024 |   0.247    0.101    0.006 |  -0.592    0.444    0.032
    neither ag~e |   0.057    0.691    0.045 |  -0.619    0.218    0.022 |  -1.044    0.473    0.062
        disagree |   0.044    0.788    0.054 |  -1.349    0.674    0.080 |   0.635    0.114    0.018
    disagree s~y |   0.015    0.852    0.071 |  -1.468    0.202    0.032 |   3.017    0.650    0.136
    -------------+---------------------------+---------------------------+---------------------------
    D            |                           |                           |
    agree stro~y |   0.017    0.782    0.039 |   1.204    0.285    0.025 |   1.822    0.497    0.057
           agree |   0.067    0.126    0.012 |  -0.221    0.126    0.003 |  -0.007    0.000    0.000
    neither ag~e |   0.058    0.688    0.044 |  -0.385    0.087    0.009 |  -1.159    0.601    0.078
        disagree |   0.065    0.174    0.014 |  -0.222    0.103    0.003 |  -0.211    0.071    0.003
    disagree s~y |   0.043    0.869    0.034 |   0.708    0.288    0.022 |   1.152    0.581    0.057
    -------------+---------------------------+---------------------------+---------------------------
    sex          |                           |                           |
            male |   0.490    0.077    0.363 |  -0.349    0.074          |  -0.080    0.003
          female |   0.510    0.076    0.353 |   0.336    0.073          |   0.077    0.003
    -------------+---------------------------+---------------------------+---------------------------
    edu          |                           |                           |
    primary in~e |   0.044    0.012    0.361 |   0.440    0.011          |  -0.163    0.001
    primary co~d |   0.434    0.107    0.372 |   0.394    0.081          |  -0.254    0.026
    secondary ~e |   0.278    0.025    0.365 |  -0.166    0.009          |  -0.246    0.016
    secondary ~d |   0.108    0.103    0.350 |  -0.556    0.043          |   0.758    0.061
    tertiary i~e |   0.056    0.040    0.354 |  -0.420    0.013          |   0.714    0.028
    tertiary c~d |   0.080    0.106    0.355 |  -0.754    0.058          |   0.792    0.049
    -------------------------------------------------------------------------------------------------
    supplementary variables: sex edu

For the traditional correspondence analysis plot, type

. mcaplot, overlay legend(off) xline(0) yline(0) scale(.8)

Typing only mcaplot gives separate plots for each variable, which I do not want, so I add overlay. To turn off the legend, which is not very informative and eats space (A=blue, B=brown, C=green, D=yellow), I add legend(off). Because Stata for some reason does not draw reference lines at the zero coordinates, I add xline(0) yline(0). Finally, because the resulting graph was a bit too messy, with many categories and relatively large text, I scale everything down to 80% with scale(.8).

It is also possible to just show a selection of variables in the plot by specifying this at the start of the command, like this

. mcaplot sex edu, overlay legend(off) xline(0) yline(0) 

Now, to demonstrate some of the other possibilities using the basic MCA and MCAPLOT commands, type

. mca A B C D, meth(ind) normal(princ) dim(3) sup(sex age (agesex:age sex)) comp
. mcaplot, overlay xline(0) yline(0) legend(off) mlabpos(12) msymbol(circle) mlabgap(2) title("A nicer plot", size(medium)) note("", size(zero)) scheme(sj) yneg xneg dim(1 3) scale(.7)

Comments, MCA: by specifying dim(3) we tell mca to retain three dimensions (by default, Stata retains only two). We can also use the method() and normalize() options to change the procedure: here I use the indicator matrix approach instead of the default Burt approach (joint MCA is also available), and display principal coordinates instead of the default standard coordinates. The (agesex:age sex) sub-parenthesis is a quick way to make an interaction variable based on all possible combinations of two or more variables, which I often find useful. comp (short for compact) is simply an option for compressing the statistical output.

Comments, MCAPLOT: In addition to the mcaplot options covered above, I have here used mlabpos(), msymbol() and mlabgap() to control the clock position of the label relative to the symbol, the type of symbol used, and the gap between symbol and label. I could also have used msize() and mlabsize() to control the size of the symbols and labels, but for simplicity I have used the familiar scale() option. The title and note can be modified (or removed) via the title() and note() options. To use a different graph scheme, I have added scheme(sj), which is the one used in the Stata Journal. With dim(1 3) I specify that I want the first and third axes in the plot (as far as I can tell from the help file, the first number goes on the vertical axis and the second on the horizontal), and yneg and xneg tell Stata to reverse the direction of both axes. Note, however, that most of these options can also be added and changed easily through the menus of the Graph Editor afterwards.

THE “NOT BLEEDINGLY OBVIOUS” BIT

= SPACE OF INDIVIDUALS =

Now for some fun that is not at all obvious if you are a new Stata user. First, mca has no option for looking at the individuals. No problem! We make two new variables with the factor coordinates for axes 1 and 2 (a1 and a2) and simply plot them with the scatter command (again scaling things down with the scale() option):

. predict a1 a2
. scatter a2 a1, scale(.6)

Voilà, the space of individuals. Not unexpectedly, the data show the “horseshoe” shape (or Guttman effect) which is commonly found when analyzing preference data. If we would like to see the ID of each individual and add some reference lines, type

. scatter a2 a1,  xline(0) yline(0) scale(.3) mlabel(id)
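
Since a1 and a2 are ordinary variables, the cloud of individuals can be explored further with any Stata command. A minimal sketch, assuming sex is coded 1 = male and 2 = female (the coding used in the separate analyses for men and women further down); the marker symbols are arbitrary choices:

. * individuals marked by sex
. twoway (scatter a2 a1 if sex==1, msymbol(oh)) (scatter a2 a1 if sex==2, msymbol(th)), xline(0) yline(0) legend(order(1 "Male" 2 "Female")) scale(.6)
. * mean point of each education category on the two axes
. tabstat a1 a2, by(edu) stat(mean n)

The tabstat table gives the average position of each education group in the cloud of individuals, which is one simple way of looking at a structuring factor.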

= SHOWING QUALITY AND CONTRIBUTIONS IN THE PLOT =

Even old STATLAB, which I started using for MCA in 1995, had the option to scale the category symbols to reflect various statistics, like the absolute contribution or the quality of representation. This is very useful, but it is not implemented in the Stata command. We need to do some matrix work here and make two overlaid scatterplots (you can, of course, simply copy the commands). But first of all, we need to install a useful user-written command by Nicholas J. Cox, svmat2. Type

. findit svmat2

then click “dm79 from http://www.stata.com/stb/stb56” and then “click here to install”.
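
If you prefer to avoid the clicking, the package can probably also be installed directly from the command line. This is a sketch that assumes dm79 is still hosted at the STB address shown by findit:

. * install the dm79 package (which contains svmat2) from the STB-56 archive
. net install dm79, from(http://www.stata.com/stb/stb56)
. * confirm that Stata can now find the command
. which svmat2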

Now for the fun stuff:

. mca A B C D, sup(sex edu)
. mat mcamat=e(cGS)
. mat colnames mcamat = mass qual inert co1 rel1 abs1 co2 rel2 abs2
. svmat2 mcamat, rname(varname) name(col)

What we have done is simply to (1) make a matrix using some of the information of the matrices saved after an MCA, (2) give some nice, short column names and (3) save the information in this matrix to a list of variables. I use svmat2 instead of svmat because it adds the useful option of also saving the row names. Now, we have many nice possibilities.

== ADDITIONAL STATISTICAL INFORMATION ==

For example, it is now very easy to compute statistics like the average absolute and relative contributions:

. tabstat mass rel1 abs1 rel2 abs2, stat(mean sum)

   stats |      mass      rel1      abs1      rel2      abs2
---------+--------------------------------------------------
    mean |  .1071429  .2788044       .05  .2510534       .05
     sum |         3  7.806525         1  7.029495         1
------------------------------------------------------------
Other applications would be, for example, to compute Benzécri's modified rates of inertia (cf. Le Roux and Rouanet 2010: 39).
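
A rough sketch of the modified rates, under one assumption that should be checked against the manual: that the principal inertias printed by the default Burt/adjusted method are already the adjusted (pseudo) eigenvalues. If so, Benzécri's modified rate for an axis is its adjusted inertia divided by the sum of the adjusted inertias, whereas Stata's percent column divides by the larger total inertia (which is why the cumulative percent stops at 88.53). The values below are copied by hand from the mca output above, and lambda is just an illustrative matrix name:

. * adjusted principal inertias of dim 1 to dim 6, typed in from the mca table
. matrix lambda = (.0764553, .0582198, .009197, .0056697, .0011719, 6.61e-06)
. * modified rate (%) = 100 * adjusted inertia / sum of adjusted inertias
. mata: 100 * st_matrix("lambda") / sum(st_matrix("lambda"))

With these numbers the first two axes come out at roughly 51% and 39%, compared with 44.91% and 34.20% in the Stata table.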

== REGULATE SIZE OF SYMBOLS ACCORDING TO MASS, CONTRIBUTION ETC. ==

To display the MCA plot with symbols scaled according to mass, type:

. twoway (scatter co2 co1 [aweight=mass], xline(0) yline(0)  mlabsize(vsmall) msymbol(oh) msize(small) legend(off)) (scatter co2 co1, mlabsize(vsmall) msymbol(i) mlabel(varname) legend(off))

According to contributions on the 2nd axis:

. twoway (scatter co2 co1 [aweight=abs2], xline(0) yline(0)  mlabsize(vsmall) msymbol(oh) msize(small) legend(off)) (scatter co2 co1, mlabsize(vsmall) msymbol(i) mlabel(varname) legend(off))

To show the contributions on both 1st and 2nd axis (1=circles, 2=triangles):

. twoway (scatter co2 co1 [aweight=abs1], xline(0) yline(0)  mlabsize(vsmall) msymbol(oh) msize(small) legend(off)) (scatter co2 co1 [aweight=abs2], xline(0) yline(0)  mlabsize(vsmall) msymbol(Th) msize(small) legend(off)) (scatter co2 co1, mlabsize(vsmall) msymbol(i) mlabel(varname) legend(off))

... you get the idea. For an overview of some of the possibilities in using scatterplots with Stata, I highly recommend the pragmatic text by Mitchell (2004), A Visual Guide to Stata Graphics.
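
A related trick is to label only the categories that contribute more than the average to an axis, a common rule of thumb when interpreting MCA. A minimal sketch for axis 1, using the variables created by svmat2 above; avgctr is just an illustrative scalar name:

. * average contribution to axis 1 (1 / number of active categories, here .05)
. quietly summarize abs1
. scalar avgctr = r(mean)
. * all categories as circles scaled by contribution, labels only for those above the average
. twoway (scatter co2 co1 [aweight=abs1], xline(0) yline(0) msymbol(oh) msize(small) legend(off)) (scatter co2 co1 if abs1 > scalar(avgctr) & !missing(abs1), msymbol(i) mlabel(varname) mlabsize(vsmall) legend(off))

The !missing(abs1) condition matters because Stata treats missing values as larger than any number, so the supplementary categories (which have no contribution) would otherwise be labelled too.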

= BITS AND PIECES =

->Crossing histograms, showing the distribution along the two axes:

. twoway (histogram a2, vertical) (histogram a1, horizontal)

->Lines between the variables:

. mcaplot, overlay xline(0) yline(0) legend(off) mlabpos(12)  connect(l)

->A very stripped down plot, nice for showing a lot of categories:

. mcaplot, overlay xline(0) yline(0) legend(off) mlabpos(12) msymbol(point) mlabgap(0) title(" ", size(zero)) note("", size(zero)) scheme(s1mono) scale(.5)

->Run the analysis separately for men and women, and then combine the plots in a single graph:

. mca A B C D if sex==1
. mcaplot, xline(0) yline(0) overlay title("Male") name(male) scheme(sj)
. mca A B C D if sex==2
. mcaplot, xline(0) yline(0) overlay title("Female") name(female) scheme(sj)
. graph combine male female
. graph drop male female 
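
A possible variant of the graph combine line, to be run in place of it (that is, before graph drop removes the two graphs from memory): the xcommon and ycommon options force the same axis ranges in both panels, which makes distances easier to compare by eye.

. * same combined graph, but with identical x and y ranges in both panels
. graph combine male female, xcommon ycommon

Keep in mind that these are two separate analyses, so the axes are not strictly the same in the two panels, and the direction of an axis is arbitrary in each.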

->Use only categories containing a certain number of individuals (e.g. useful for supplementary points):

I here assume a list of musical genres, v10* (dummies, 1=like 0=do not like), and a variable recording the organization to which the respondent belongs, which has more than 100 categories, of which you only wish to plot those with at least 5 respondents.

. * first, we find out how many people belong to each organization, and make a new variable for this information
. bys organization : g catsize = _N
. * all categories containing fewer than 5 respondents are recoded to zero
. recode organization 1/124=0 if catsize<5
. mca v10*, sup(organization)
. mcaplot organization, overlay legend(off) scale(.5) xline(0) yline(0)
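
If you do not know the range of codes in advance, the recode line can be replaced by a plain replace, which does not need the 1/124 range. This is a sketch on the same hypothetical variables, and it assumes that 0 is not already in use as an organization code:

. * alternative to the recode line: lump all small categories into code 0
. replace organization = 0 if catsize < 5
. * check the result before running mca
. tabulate organization

Note that the lumped category 0 is itself a category of the supplementary variable, so it will still appear as a point in the plot.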

 
