
R Tutorial: Using R with NORC GSS Data Part Two, Generating Simple Tables and Using Subsets


A tutorial by Douglas M. Wiig

Part one of the tutorial centered on importing NORC GSS data in Stata or SPSS format into an R data frame. For illustration I used the GSS2014 survey data set, which consists of 2538 cases and 866 variables. If a researcher wishes to generate some simple cross tabulations, the CrossTable function is very useful.

The CrossTable function is part of the gmodels package, so before running the scripts in this tutorial make sure you have installed gmodels from your favorite CRAN mirror site and loaded it. As discussed in part one of the tutorial, load the GSS2014 dataset into the global environment using:

> require(foreign)
> Dataset <- read.spss("E:/research/Documents/GSS2014.sav",
+     use.value.labels=TRUE, max.value.labels=Inf, to.data.frame=TRUE)
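Before building tables it can help to confirm that the variables of interest came through as factors, which is what read.spss is expected to produce for labelled variables when use.value.labels=TRUE. The sketch below shows the kind of check I have in mind. It uses a small made-up data frame named toy as a stand-in for Dataset, so the values are illustrative and are not GSS data.

```r
# Made-up stand-in for the imported data frame; these are NOT GSS values.
toy <- data.frame(
  degree  = factor(c("high school", "bachelor", NA, "graduate", "high school")),
  incom16 = factor(c("average", "below average", "average", NA, "average"))
)

# Number of cases (rows) and variables (columns).
dim(toy)

# Confirm that both variables are stored as factors.
is_factor <- sapply(toy[c("degree", "incom16")], is.factor)
print(is_factor)

# Frequency of each category, counting missing values separately.
degree_counts <- table(toy$degree, useNA = "ifany")
print(degree_counts)
```

The same three checks can be run on the real data by replacing toy with Dataset.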

The CrossTable function allows a basic cross tabulation to be performed and includes a large number of options that can be incorporated into the table. The basic structure is as follows:

Usage

CrossTable(x, y, digits=3, max.width = 5, expected=FALSE, prop.r=TRUE, prop.c=TRUE,
    prop.t=TRUE, prop.chisq=TRUE, chisq = FALSE, fisher=FALSE, mcnemar=FALSE,
    resid=FALSE, sresid=FALSE, asresid=FALSE,
    missing.include=FALSE,
    format=c("SAS","SPSS"), dnn = NULL, ...)

Arguments

x A vector or a matrix. If y is specified, x must be a vector

y A vector in a matrix or a dataframe

digits Number of digits after the decimal point for cell proportions

max.width In the case of a 1 x n table, the default will be to print the output horizontally. If the number of columns exceeds max.width, the table will be wrapped for each successive increment of max.width columns. If you want a single column vertical table, set max.width to 1

expected If TRUE, chisq will be set to TRUE and expected cell counts from the chi-square test will be included

prop.r If TRUE, row proportions will be included

prop.c If TRUE, column proportions will be included

prop.t If TRUE, table proportions will be included

prop.chisq If TRUE, chi-square contribution of each cell will be included

chisq If TRUE, the results of a chi-square test will be included

fisher If TRUE, the results of a Fisher Exact test will be included

mcnemar If TRUE, the results of a McNemar test will be included

resid If TRUE, residual (Pearson) will be included

sresid If TRUE, standardized residual will be included

asresid If TRUE, adjusted standardized residual will be included

missing.include If TRUE, then remove any unused factor levels

format Either SAS (default) or SPSS, depending on the type of output desired.

dnn the names to be given to the dimensions in the result (the dimnames names).

... optional arguments

(Gregory Warnes, maintainer, Package ‘Gmodels’ February, 2015. http://cran.r-project.org/src/contrib/PACKAGES.html)
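To make the prop.r, prop.c and prop.t options more concrete, here is a minimal sketch of the three kinds of proportions they refer to, computed with base R's table and prop.table rather than with CrossTable itself. The data frame toy and its values are made up for illustration.

```r
# Made-up data for illustration only; not taken from the GSS.
toy <- data.frame(
  income = factor(c("low", "low", "low", "high", "high", "high", "high", "low")),
  degree = factor(c("hs", "hs", "college", "college", "college", "hs", "college", "hs"))
)

# Raw cell counts: rows are income, columns are degree.
counts <- table(toy$income, toy$degree)
print(counts)

# Row proportions: each row sums to 1 (what prop.r refers to).
row_prop <- prop.table(counts, margin = 1)

# Column proportions: each column sums to 1 (what prop.c refers to).
col_prop <- prop.table(counts, margin = 2)

# Table proportions: all cells together sum to 1 (what prop.t refers to).
tab_prop <- prop.table(counts)

print(round(row_prop, 3))
print(round(col_prop, 3))
print(round(tab_prop, 3))
```

With all of these switched on, every cell of the printed table carries several numbers, which is why the example later in this tutorial turns them off.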

In this tutorial I will create a table to examine the relationship between income and education using the variables 'degree' and 'incom16' from the GSS dataset. Both are categorical factors. To simplify the resulting table, only actual frequencies will be reported and the 'chisq' option will be used to generate a chi-squared test. The output format will be set to SPSS. Use the following statement:

> # Generate a cross table of frequencies with chisq reported
> CrossTable(Dataset$incom16, Dataset$degree, chisq=TRUE, format="SPSS", prop.r=FALSE, prop.c=FALSE, prop.t=FALSE, prop.chisq=FALSE)

In the above code the row variable is income, and the appropriate column of the dataset is selected with Dataset$incom16. The column variable for the table is education, and the appropriate column is selected with Dataset$degree. The cell proportion options (prop.r, prop.c, prop.t and prop.chisq) must be set to FALSE explicitly because they default to TRUE.
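For readers who do not have the GSS file at hand, the same kind of call can be tried on a small made-up data set. The sketch below builds a hypothetical data frame named toy, runs base R's chisq.test on the frequency table, and then calls CrossTable with the options described above only if gmodels is installed. The category labels and counts are invented for illustration.

```r
# Made-up data: 50 "below" and 50 "above" respondents; not GSS values.
toy <- data.frame(
  incom16 = factor(rep(c("below", "above"), times = c(50, 50))),
  degree  = factor(c(rep(c("high school", "bachelor"), times = c(35, 15)),
                     rep(c("high school", "bachelor"), times = c(20, 30))))
)

# Frequency table: rows are incom16, columns are degree.
counts <- table(toy$incom16, toy$degree)
print(counts)

# Pearson chi-squared test without the continuity correction.
res <- chisq.test(counts, correct = FALSE)
print(res)

# The CrossTable version of the same table, run only if gmodels is available.
if (requireNamespace("gmodels", quietly = TRUE)) {
  gmodels::CrossTable(toy$incom16, toy$degree, chisq = TRUE, format = "SPSS",
                      prop.r = FALSE, prop.c = FALSE, prop.t = FALSE,
                      prop.chisq = FALSE)
}
```

Using chisq.test alongside CrossTable is only a convenience here: it gives a test object whose statistic and p-value can be extracted in later code, while CrossTable prints a formatted table to the screen.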

When you run the above script the table will be generated in SPSS format on the screen. I will not reproduce the table here because it does not fit well into the blog format.

In part three of this tutorial I will discuss generating subsets of the GSS data file and using subsets for statistical analyses such as t tests and ANOVA.