Stata has commands for both simple correspondence analysis (CA) and multiple correspondence analysis (MCA), which I believe are based on Michael Greenacre's code for the R package. At first, if you come from specialized programs like SPAD, Stata's MCA commands appear very rudimentary, but because of the versatility of Stata it is not very difficult to carry out the most important procedures in MCA. While SPAD no doubt offers a superior package for all kinds of factor analysis (including useful procedures not easily available elsewhere, like hybrid clustering after MCA), the strength of Stata is, I believe, its superior speed and flexibility for all the gritty data cleaning, recoding, renaming and variable generation involved in “real life” data analysis, in particular if you are doing your own surveys. No doubt SPSS, R and others are also fine programs for such work, but I have personally grown very attached to the Stata way of doing things and like to do most of my work here, including testing out different MCAs, before moving the data on to SPAD if necessary. For an excellent introduction to the Stata workflow, see Kohler and Kreuter (2009), Data Analysis Using Stata. A nice, if very basic, introduction to the most common forms of multivariate statistical analysis using Stata is Rabe-Hesketh and Everitt (2007), A Handbook of Statistical Analyses Using Stata.
My main purpose here is simply to demonstrate some of the more interesting ways to extend the basic functionality of the CA and MCA commands, both of which are explained much better in the Stata manual and in the help files. I am at present using version 12 of Stata, but most of the procedures I suggest can be used on much older versions.
For an introduction to the methodology and mathematics of MCA, I highly recommend first reading Multiple Correspondence Analysis (2010) by Brigitte Le Roux and Henry Rouanet and, when comfortable with this text, moving on to Geometric Data Analysis (2004) by the same authors.
This page is very much under construction, and I would be very grateful for suggestions of further tips and tricks for using Stata for MCA.
CORRESPONDENCE ANALYSIS (CA)
We will concentrate on the MCA command, but to quickly demonstrate CA we will use the famous “Smoker” dataset. Simply type (without the period):
. webuse ca_smoking
To inspect the table of the two variables (rank and smoking intensity), type
. tab rank smoking
            |              smoking intensity
       rank |      none      light     medium      heavy |     Total
------------+--------------------------------------------+----------
senior_mngr |         4          2          3          2 |        11
junior_mngr |         4          3          7          4 |        18
senior_empl |        25         10         12          4 |        51
junior_empl |        18         24         33         13 |        88
  secretary |        10          6          7          2 |        25
------------+--------------------------------------------+----------
      Total |        61         45         62         25 |       193
For the CA, simply type
. ca rank smoking
Correspondence analysis                      Number of obs     =       193
                                             Pearson chi2(12)  =     16.44
                                             Prob > chi2       =    0.1718
                                             Total inertia     =    0.0852
    5 active rows                            Number of dim.    =         2
    4 active columns                         Expl. inertia (%) =     99.51

             |   singular   principal                 cumul
   Dimension |    value      inertia      chi2     percent   percent
-------------+--------------------------------------------------------
       dim 1 |  .2734211    .0747591     14.43      87.76     87.76
       dim 2 |  .1000859    .0100172      1.93      11.76     99.51
       dim 3 |  .0203365    .0004136      0.08       0.49    100.00
-------------+--------------------------------------------------------
       total |              .0851899     16.44        100

Statistics for row and column categories in symmetric normalization

             |          overall          |        dimension_1        |        dimension_2
  Categories |   mass  quality   %inert  |  coord   sqcorr  contrib  |  coord   sqcorr  contrib
-------------+---------------------------+---------------------------+---------------------------
rank         |                           |                           |
 senior mngr |  0.057    0.893    0.031  |  0.126    0.092    0.003  |  0.612    0.800    0.214
 junior mngr |  0.093    0.991    0.139  | -0.495    0.526    0.084  |  0.769    0.465    0.551
 senior empl |  0.264    1.000    0.450  |  0.728    0.999    0.512  |  0.034    0.001    0.003
 junior empl |  0.456    1.000    0.308  | -0.446    0.942    0.331  | -0.183    0.058    0.152
   secretary |  0.130    0.999    0.071  |  0.385    0.865    0.070  | -0.249    0.133    0.081
-------------+---------------------------+---------------------------+---------------------------
smoking      |                           |                           |
        none |  0.316    1.000    0.577  |  0.752    0.994    0.654  |  0.096    0.006    0.029
       light |  0.233    0.984    0.083  | -0.190    0.327    0.031  | -0.446    0.657    0.463
      medium |  0.321    0.983    0.148  | -0.375    0.982    0.166  | -0.023    0.001    0.002
       heavy |  0.130    0.995    0.192  | -0.562    0.684    0.150  |  0.625    0.310    0.506
-------------------------------------------------------------------------------------------------
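The columns of the first table hang together through simple arithmetic, which can be checked by hand from the figures printed above. These are the usual CA identities as I understand them, with N = 193 the number of observations:

```latex
\begin{align*}
\text{principal inertia}_1 &= (\text{singular value}_1)^2 = 0.2734211^2 \approx 0.0747591 \\
\text{percent}_1 &= \frac{0.0747591}{0.0851899} \times 100 \approx 87.76 \\
\chi^2 &= N \times \text{total inertia} = 193 \times 0.0851899 \approx 16.44
\end{align*}
```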
As can be seen, almost all of the inertia (87.76%) is found in the first dimension. To make a plot, type
. cabiplot, origin
For a full list of the available options, type
. help ca
Many of the options we will see in the MCA examples, like supplementary variables, are also available for CA.
MULTIPLE CORRESPONDENCE ANALYSIS (MCA)
For demonstration, we will simply use the ISSP-93 dataset.
. webuse issp93
We will start by looking at the answers to the four attitude questions.
A — too much science, not enough feelings&faith
B — science does more harm than good
C — any change makes nature worse
D — science will solve environmental problems
To look at the frequency distributions, type either . tab1 A B C D or, as I prefer, . fre A B C D. Note that fre is a user-written program that needs to be installed by typing . ssc install fre.
The results (and the soundness) of this analysis are not so important here; the main point is to demonstrate some of the basic functionality of Stata. First, let's do an MCA of these four questions, with gender and education as supplementary variables. Type
. mca A B C D, sup(sex edu)
Multiple/Joint correspondence analysis       Number of obs     =       871
                                             Total inertia     =  .1702455
    Method: Burt/adjusted inertias           Number of axes    =         2

             |   principal               cumul
   Dimension |    inertia     percent   percent
-------------+----------------------------------
       dim 1 |   .0764553      44.91     44.91
       dim 2 |   .0582198      34.20     79.11
       dim 3 |    .009197       5.40     84.51
       dim 4 |   .0056697       3.33     87.84
       dim 5 |   .0011719       0.69     88.53
       dim 6 |   6.61e-06       0.00     88.53
-------------+----------------------------------
       Total |   .1702455     100.00

Statistics for column categories in standard normalization

             |          overall          |        dimension_1        |        dimension_2
  Categories |   mass  quality   %inert  |  coord   sqcorr  contrib  |  coord   sqcorr  contrib
-------------+---------------------------+---------------------------+---------------------------
A            |                           |                           |
agree stro~y |  0.034    0.963    0.060  |  1.837    0.860    0.115  |  0.727    0.103    0.018
       agree |  0.092    0.659    0.023  |  0.546    0.546    0.028  | -0.284    0.113    0.007
neither ag~e |  0.059    0.929    0.037  | -0.447    0.143    0.012  | -1.199    0.786    0.084
    disagree |  0.051    0.798    0.051  | -1.166    0.612    0.069  |  0.737    0.186    0.028
disagree s~y |  0.014    0.799    0.067  | -1.995    0.369    0.055  |  2.470    0.430    0.084
-------------+---------------------------+---------------------------+---------------------------
B            |                           |                           |
agree stro~y |  0.020    0.911    0.100  |  2.924    0.781    0.174  |  1.370    0.131    0.038
       agree |  0.050    0.631    0.027  |  0.642    0.346    0.021  | -0.667    0.285    0.022
neither ag~e |  0.059    0.806    0.027  |  0.346    0.117    0.007  | -0.964    0.690    0.055
    disagree |  0.081    0.620    0.033  | -0.714    0.555    0.041  | -0.280    0.065    0.006
disagree s~y |  0.040    0.810    0.116  | -1.354    0.285    0.074  |  2.108    0.526    0.179
-------------+---------------------------+---------------------------+---------------------------
C            |                           |                           |
agree stro~y |  0.044    0.847    0.122  |  2.158    0.746    0.203  |  0.909    0.101    0.036
       agree |  0.091    0.545    0.024  |  0.247    0.101    0.006  | -0.592    0.444    0.032
neither ag~e |  0.057    0.691    0.045  | -0.619    0.218    0.022  | -1.044    0.473    0.062
    disagree |  0.044    0.788    0.054  | -1.349    0.674    0.080  |  0.635    0.114    0.018
disagree s~y |  0.015    0.852    0.071  | -1.468    0.202    0.032  |  3.017    0.650    0.136
-------------+---------------------------+---------------------------+---------------------------
D            |                           |                           |
agree stro~y |  0.017    0.782    0.039  |  1.204    0.285    0.025  |  1.822    0.497    0.057
       agree |  0.067    0.126    0.012  | -0.221    0.126    0.003  | -0.007    0.000    0.000
neither ag~e |  0.058    0.688    0.044  | -0.385    0.087    0.009  | -1.159    0.601    0.078
    disagree |  0.065    0.174    0.014  | -0.222    0.103    0.003  | -0.211    0.071    0.003
disagree s~y |  0.043    0.869    0.034  |  0.708    0.288    0.022  |  1.152    0.581    0.057
-------------+---------------------------+---------------------------+---------------------------
sex          |                           |                           |
        male |  0.490    0.077    0.363  | -0.349    0.074           | -0.080    0.003
      female |  0.510    0.076    0.353  |  0.336    0.073           |  0.077    0.003
-------------+---------------------------+---------------------------+---------------------------
edu          |                           |                           |
primary in~e |  0.044    0.012    0.361  |  0.440    0.011           | -0.163    0.001
primary co~d |  0.434    0.107    0.372  |  0.394    0.081           | -0.254    0.026
secondary ~e |  0.278    0.025    0.365  | -0.166    0.009           | -0.246    0.016
secondary ~d |  0.108    0.103    0.350  | -0.556    0.043           |  0.758    0.061
tertiary i~e |  0.056    0.040    0.354  | -0.420    0.013           |  0.714    0.028
tertiary c~d |  0.080    0.106    0.355  | -0.754    0.058           |  0.792    0.049
-------------------------------------------------------------------------------------------------
    supplementary variables: sex edu
For the traditional correspondence analysis plot, type
. mcaplot, overlay legend(off) xline(0) yline(0) scale(.8)
Typing only mcaplot gives separate plots for each variable, which I do not want, so I add overlay. To turn off the legend, which is not very informative and eats space (A=blue, B=brown, C=green, D=yellow), I add legend(off). Because Stata for some reason does not draw reference lines through the zero coordinates, I add xline(0) yline(0). Finally, because the resulting graph was a bit too messy, with many categories and relatively large text, I scale everything down to 80% with scale(.8).
It is also possible to just show a selection of variables in the plot by specifying this at the start of the command, like this
. mcaplot sex edu, overlay legend(off) xline(0) yline(0)
Now, to demonstrate some of the other possibilities using the basic MCA and MCAPLOT commands, type
. mca A B C D, meth(ind) normal(princ) dim(3) sup(sex age (agesex:age sex)) comp
. mcaplot, overlay xline(0) yline(0) legend(off) mlabpos(12) msymbol(circle) mlabgap(2) title("A nicer plot", size(medium)) note("", size(zero)) scheme(sj) yneg xneg dim(1 3) scale(.7)
Comments, MCA: by specifying dim(3) we ask Stata to retain three dimensions (by default, Stata retains only two). We can also use the method() and normalize() options to change the procedure: here I use the indicator matrix approach instead of the default Burt approach (joint MCA is also available), and display principal coordinates instead of the default standard coordinates. The (agesex:age sex) sub-parenthesis is a quick way to make an interaction variable based on all possible combinations of two or more variables, which I often find useful. Finally, comp (short for compact) is simply an option that compresses the output of the statistical information.
Comments, MCAPLOT: In addition to the options covered above, I have here used mlabpos(), msymbol() and mlabgap() to set the clock position of the text in relation to the symbol, the type of symbol used and the gap between the symbol and the text. I could also have used msize() and mlabsize() to control the size of the symbols and text, but for simplicity I have used the familiar scale() option. The title and note can be modified (or removed) via the title() and note() options. To use a different colour scheme, I have added scheme(sj), which is the one used in the Stata Journal. With dim(1 3) I specify that I want the first and third axes in the plot, and yneg and xneg tell Stata to reverse the direction of both axes. Note, however, that most of these options can easily be added and changed through the menus of the Graph Editor afterwards.
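Since the abbreviated option names can be hard to decode, here is a sketch of the same analysis with the options written out in full. It assumes the issp93 data used above and is meant to be run from a do-file, because the /// line continuation does not work at the interactive prompt. The plot options are a reduced selection, not an exact copy of the command above.

```stata
* Reload the example data (this discards whatever is in memory)
webuse issp93, clear

* Indicator-matrix MCA, principal coordinates, three retained dimensions;
* agesex is the crossed (interaction) variable built from age and sex
mca A B C D, method(indicator) normalize(principal) dimensions(3) ///
    supplementary(sex age (agesex: age sex)) compact

* Overlaid plot of dimensions 1 and 3 with both axes reversed
mcaplot, overlay dimensions(1 3) xnegate ynegate ///
    xline(0) yline(0) legend(off) scale(.7)
```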
THE “NOT BLEEDINGLY OBVIOUS” BIT
= SPACE OF INDIVIDUALS =
Now for some fun that is not at all obvious if you are a new Stata user. First, the MCA commands have no built-in plot of the individuals. No problem! We simply make two new variables holding the individuals' coordinates on axes 1 and 2 (a1 and a2) and plot them with the scatter command (again scaling things down with the scale() option):
. predict a1 a2
. scatter a2 a1, scale(.6)
Voilà: the space of individuals. Not unexpectedly, the data show the “horseshoe” shape (or Guttman effect) which is commonly found when analyzing preference data. If we would like to see the ID of each individual and add some reference lines, type
. scatter a2 a1, xline(0) yline(0) scale(.3) mlabel(id)
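If you prefer to be explicit about what predict returns, the row scores and the axes can be requested by name. This is a sketch based on my reading of the help file for mca postestimation; r1 and r2 are hypothetical variable names chosen so that they do not clash with a1 and a2 above, and an MCA with at least two retained dimensions is assumed to be in memory.

```stata
* Row scores (coordinates of the individuals) for axes 1 and 2 only
predict double r1 r2, rowscores dimensions(1 2)

* The same cloud of individuals, with reference lines through the origin
scatter r2 r1, xline(0) yline(0) scale(.6)
```

Naming the dimensions may be the safer form when more than two dimensions were retained in the preceding mca call.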
= SHOWING QUALITY AND CONTRIBUTIONS IN THE PLOT =
Even old STATLAB, which I started using for MCA in 1995, had the option to scale the symbols of the categories to reflect various statistics, like the absolute contribution or the quality of representation. This is very useful, but it is not implemented in the Stata command. We need to do some matrix work here and make two overlaid scatterplots (you can, of course, simply copy the commands). But first of all, we need to install a useful user-written command by Nicholas J. Cox, svmat2. Type
. findit svmat2
then click “dm79 from http://www.stata.com/stb/stb56” and then “click here to install”.
Now for the fun stuff:
. mca A B C D, sup(sex edu)
. mat mcamat=e(cGS)
. mat colnames mcamat = mass qual inert co1 rel1 abs1 co2 rel2 abs2
. svmat2 mcamat, rname(varname) name(col)
What we have done is simply to (1) make a matrix using some of the information of the matrices saved after an MCA, (2) give some nice, short column names and (3) save the information in this matrix to a list of variables. I use svmat2 instead of svmat because it adds the useful option of also saving the row names. Now, we have many nice possibilities.
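A quick way to check what ended up in the data set is to list the new variables. This sketch assumes the column names given above and the row-name variable varname created by svmat2.

```stata
* The matrix rows land in the first observations of the data set;
* every other observation is missing on the new variables
list varname mass qual co1 co2 if !missing(mass), noobs clean

* To repeat the svmat2 step later, first drop the variables it created:
* drop varname mass qual inert co1 rel1 abs1 co2 rel2 abs2
```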
== ADDITIONAL STATISTICAL INFORMATION ==
For example, it is now very easy to compute statistics like the average absolute and relative contributions:
. tabstat mass rel1 abs1 rel2 abs2, stat(mean sum)
   stats |      mass      rel1      abs1      rel2      abs2
---------+--------------------------------------------------
    mean |  .1071429  .2788044       .05  .2510534       .05
     sum |         3  7.806525         1  7.029495         1
------------------------------------------------------------
Another application would be, for example, to compute Benzécri's modified rates of inertia (cf. Le Roux and Rouanet 2010:39).
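For reference, the modified rates can be written as follows. This is my summary of the formula in Le Roux and Rouanet, where Q is the number of active variables (Q = 4 in this example) and the eigenvalues are those of the indicator-matrix analysis, i.e. method(indicator). Only eigenvalues above the mean eigenvalue 1/Q are kept.

```latex
\[
\lambda'_\ell = \left(\frac{Q}{Q-1}\right)^{2}\left(\lambda_\ell - \frac{1}{Q}\right)^{2}
\quad \text{for } \lambda_\ell > \frac{1}{Q},
\qquad
\tau'_\ell = \frac{\lambda'_\ell}{\sum_{k:\,\lambda_k > 1/Q} \lambda'_k}
\]
```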
== REGULATE SIZE OF SYMBOLS ACCORDING TO MASS, CONTRIBUTION ETC. ==
To display the MCA plot with symbols scaled according to mass, type:
. twoway (scatter co2 co1 [aweight=mass], xline(0) yline(0) mlabsize(vsmall) msymbol(oh) msize(small) legend(off)) (scatter co2 co1, mlabsize(vsmall) msymbol(i) mlabel(varname) legend(off))
According to contributions on the 2nd axis:
. twoway (scatter co2 co1 [aweight=abs2], xline(0) yline(0) mlabsize(vsmall) msymbol(oh) msize(small) legend(off)) (scatter co2 co1, mlabsize(vsmall) msymbol(i) mlabel(varname) legend(off))
To show the contributions on both 1st and 2nd axis (1=circles, 2=triangles):
. twoway (scatter co2 co1 [aweight=abs1], xline(0) yline(0) mlabsize(vsmall) msymbol(oh) msize(small) legend(off)) (scatter co2 co1 [aweight=abs2], xline(0) yline(0) mlabsize(vsmall) msymbol(Th) msize(small) legend(off)) (scatter co2 co1, mlabsize(vsmall) msymbol(i) mlabel(varname) legend(off))
... you get the idea. For an overview of some of the possibilities of scatterplots in Stata, I highly recommend the pragmatic text by Mitchell (2004), A Visual Guide to Stata Graphics.
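The same pattern works for any of the saved statistics. As one more sketch, here the symbols are sized by the overall quality of representation (the variable qual created above), with the command split over several lines for readability. Run it from a do-file, since /// does not work at the interactive prompt.

```stata
* First layer: hollow circles weighted by quality of representation
* Second layer: invisible markers that only carry the category labels
twoway (scatter co2 co1 [aweight=qual], msymbol(oh) msize(small)) ///
       (scatter co2 co1, msymbol(i) mlabel(varname) mlabsize(vsmall)), ///
       xline(0) yline(0) legend(off)
```

Putting xline(), yline() and legend() once after the last layer, rather than repeating them inside each layer, keeps the command shorter.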
= BITS AND PIECES =
->Crossing histograms, showing the distribution along the two axes:
. twoway (histogram a2, vertical) (histogram a1, horizontal)
->Lines connecting the categories of each variable:
. mcaplot, overlay xline(0) yline(0) legend(off) mlabpos(12) connect(l)
->A very stripped down plot, nice for showing a lot of categories:
. mcaplot, overlay xline(0) yline(0) legend(off) mlabpos(12) msymbol(point) mlabgap(0) title(" ", size(zero)) note("", size(zero)) scheme(s1mono) scale(.5)
->Run the analysis separately for men and women, and then combine the plots in a single graph:
. mca A B C D if sex==1
. mcaplot, xline(0) yline(0) overlay title("Male") name(male) scheme(sj)
. mca A B C D if sex==2
. mcaplot, xline(0) yline(0) overlay title("Female") name(female) scheme(sj)
. graph combine male female
. graph drop male female
->Use only categories containing a certain number of individuals (e.g. useful for supplementary points):
I here assume a list of musical genres, v10* (dummies, 1=like, 0=do not like), and a variable recording the organization to which the respondent belongs, which has more than 100 categories, of which you only wish to plot those with at least 5 respondents.
. * first, we find out how many people belong to each organization, and make a new variable for this information
. bys organization : g catsize = _N
. * all categories containing fewer than 5 respondents are recoded to zero
. recode organization 1/124=0 if catsize<5
. mca v10*, sup(organization)
. mcaplot organization, overlay legend(off) scale(.5) xline(0) yline(0)
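A variant that leaves the original variable untouched and avoids hard-coding the range 1/124 could look like this. It is only a sketch: v10* and organization are the assumed variables described above, while org_small and n_org are hypothetical names introduced here.

```stata
* Work on a copy so that organization itself is not overwritten
clonevar org_small = organization

* Number of respondents in each organization
bysort organization: generate n_org = _N

* Lump all organizations with fewer than 5 respondents into category 0
replace org_small = 0 if n_org < 5

mca v10*, supplementary(org_small)
mcaplot org_small, overlay legend(off) scale(.5) xline(0) yline(0)
```

Note that the lumped category 0 will itself appear as a point in the plot.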