R Tutorial: Using R with NORC GSS Data Part Two, Generating Simple Tables and Using Subsets


A tutorial by Douglas M. Wiig

Part one of the tutorial centered on importing NORC GSS data in STATA or SPSS format into an R data frame. For illustration I used the GSS2014 survey data set, which consists of 2538 cases and 866 variables. If a researcher wishes to generate some simple cross tabulations, the R CrossTable function is very useful.

The CrossTable function is part of the gmodels package, so before running the scripts in this tutorial make sure you have installed gmodels from your favorite CRAN mirror site and loaded it. As discussed in part one of the tutorial, load the GSS2014 dataset into the global environment using:

>require(foreign)

>Dataset <- read.spss("E:/research/Documents/GSS2014.sav",

use.value.labels=TRUE, max.value.labels=Inf, to.data.frame=TRUE)

The CrossTable function allows a basic cross tabulation to be performed and includes a large number of options that can be incorporated into the table. The basic structure is as follows:

Usage

CrossTable(x, y, digits=3, max.width = 5, expected=FALSE, prop.r=TRUE, prop.c=TRUE,

prop.t=TRUE, prop.chisq=TRUE, chisq = FALSE, fisher=FALSE, mcnemar=FALSE,

resid=FALSE, sresid=FALSE, asresid=FALSE,

missing.include=FALSE,

format=c("SAS","SPSS"), dnn = NULL, ...)

Arguments

x A vector or a matrix. If y is specified, x must be a vector

y A vector in a matrix or a dataframe

digits Number of digits after the decimal point for cell proportions

max.width In the case of a 1 x n table, the default will be to print the output horizontally. If the number of columns exceeds max.width, the table will be wrapped for each successive increment of max.width columns. If you want a single column vertical table, set max.width to 1

expected If TRUE, chisq will be set to TRUE and expected cell counts from the chi-square test will be included

prop.r If TRUE, row proportions will be included

prop.c If TRUE, column proportions will be included

prop.t If TRUE, table proportions will be included

prop.chisq If TRUE, chi-square contribution of each cell will be included

chisq If TRUE, the results of a chi-square test will be included

fisher If TRUE, the results of a Fisher Exact test will be included

mcnemar If TRUE, the results of a McNemar test will be included

resid If TRUE, residual (Pearson) will be included

sresid If TRUE, standardized residual will be included

asresid If TRUE, adjusted standardized residual will be included

missing.include If TRUE, then remove any unused factor levels

format Either SAS (default) or SPSS, depending on the type of output desired.

dnn the names to be given to the dimensions in the result (the dimnames names).

... optional arguments

(Gregory Warnes, maintainer, Package ‘gmodels’, February 2015. http://cran.r-project.org/src/contrib/PACKAGES.html)
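To make the prop.r, prop.c, and prop.t options concrete, the following sketch computes the same three kinds of proportions with base R's prop.table() on a small made-up table. The data and the object names (toy, tab, row_prop, col_prop, table_prop) are illustrative assumptions, not GSS values.

```r
# Hypothetical toy data; not taken from the GSS
toy <- data.frame(
  income = factor(c("low", "low", "high", "high", "high", "low")),
  degree = factor(c("hs", "ba", "ba", "ba", "hs", "hs"))
)

# Cell counts: rows are income, columns are degree
tab <- table(toy$income, toy$degree)

row_prop   <- prop.table(tab, 1)  # each row sums to 1 (what prop.r reports)
col_prop   <- prop.table(tab, 2)  # each column sums to 1 (what prop.c reports)
table_prop <- prop.table(tab)     # all cells sum to 1 (what prop.t reports)

print(tab)
print(round(row_prop, 3))
print(round(col_prop, 3))
print(round(table_prop, 3))
```

Setting prop.r, prop.c, and prop.t to FALSE in CrossTable simply leaves these three sets of figures out of each cell.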

In this tutorial I will create a table to examine the relationship between income and education using the variables ‘degree’ and ‘incom16’ from the GSS dataset. Both are categorical factors. To simplify the resulting table only actual frequencies will be reported, and the ‘chisq’ option will be used to generate a chi-squared test. The output format will be set to SPSS. Use the following statement:

>#Generate a cross table of frequencies with chisq reported

>CrossTable(Dataset$incom16, Dataset$degree, chisq=TRUE, format="SPSS", prop.r=FALSE, prop.c=FALSE, prop.t=FALSE, prop.chisq=FALSE)

>

In the above code, the row variable is income, and the appropriate column of the dataset is selected with the ‘Dataset$incom16’ expression. The column variable for the table is education, and the appropriate column of the dataset is selected with the ‘Dataset$degree’ expression. The various cell proportions must be set to FALSE because they default to TRUE.

When you run the above script the table will be generated in SPSS format on the screen. I will not reproduce the table here because it does not fit well within the blog format.
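For readers who want to see the shape of the call without the GSS file, here is a small self-contained sketch. It uses made-up data (a hypothetical data frame toy with columns income and degree) in place of Dataset$incom16 and Dataset$degree, computes the counts and the chi-squared test with base R, and makes the CrossTable call with the same options only if gmodels is installed.

```r
# Hypothetical toy data standing in for the two GSS factors
toy <- data.frame(
  income = factor(rep(c("below", "average", "above"), times = c(20, 25, 15))),
  degree = factor(c(rep(c("hs", "ba"), times = c(15, 5)),
                    rep(c("hs", "ba"), times = c(13, 12)),
                    rep(c("hs", "ba"), times = c(4, 11))))
)

# Base R version of the frequency table and chi-squared test
tab  <- table(toy$income, toy$degree)
test <- chisq.test(tab)
print(tab)
print(test)

# Same request through CrossTable, run only if gmodels is available
if (requireNamespace("gmodels", quietly = TRUE)) {
  gmodels::CrossTable(toy$income, toy$degree, chisq = TRUE, format = "SPSS",
                      prop.r = FALSE, prop.c = FALSE, prop.t = FALSE,
                      prop.chisq = FALSE)
}
```

With a 3 x 2 table the chi-squared test has (3 - 1) x (2 - 1) = 2 degrees of freedom, which is what both the base R output and the CrossTable output should report.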

In part three of this tutorial I will discuss generating subsets of the GSS data file and using subsets for statistical analyses such as t tests and ANOVA.


			

Tutorial: Using R to Analyze GSS2014 Social Science Data, Part One: Importing the Database in SPSS or STATA Format


For anyone interested in researching social science questions, there is a wealth of survey data available through the National Opinion Research Center (NORC) and its associated research universities. The Center has been conducting a national survey each year since 1972 and has compiled a massive database of data from these surveys. Most if not all of these data files can be accessed and downloaded without charge. I have been working with the 2014 edition of the data and for this tutorial will use the GSS2014 data file that is available for download on the Center’s web site. (See the NORC main website at http://www.norc.org/Research/Projects/Pages/general-social-survey.aspx and at http://www3.norc.org/GSS+Website.)

As noted above, the datasets are available for download in both SPSS and STATA formats. To work with either format in R, it is necessary to read the file into a data frame using one of a couple of different packages. The first option I will discuss uses the Hmisc package; the second uses the foreign package. Install both packages from your favorite CRAN mirror site before running the code in this tutorial.

For this tutorial I am using the one-year release file GSS2014. This file contains 2538 cases and 866 variables. Download the file from the web site listed above in both SPSS and STATA formats. Use the following code to load the Hmisc package into your R global environment:

                >require(Hmisc)

Now load the GSS2014.sav SPSS version from your storage device using the following line of code. I am using the filename GSS2014 for my data file and loading the file into the data frame ‘gss14’:

>#load the GSS data file in SPSS format

>#put data into data frame ‘gss14’

>gss14 <- spss.get("F:/research/Documents/GSS2014.sav", use.value.labels=TRUE)

                >

To view the data that was loaded use the command:

>View(gss14)

This will produce a spreadsheet-like matrix of rows and columns containing the data. To load the data file in STATA format, download the STATA version of the file from the NORC web site as discussed above. My STATA file is also named GSS2014, but with the STATA .dta extension. Load the file into a data frame using:

>#load STATA format file into data frame ‘Dataset2’

>#read.dta is provided by the foreign package

>require(foreign)

>Dataset2 <- read.dta("F:/research/Documents/GSS2014.dta")

               >

Once again, you can view the data frame loaded using the command:

>View(Dataset2)

Both the STATA and SPSS formats of the data set can also be loaded into R using the foreign package. The procedure is the same for both SPSS and STATA:

>#load SPSS version

>require(foreign)

>Dataset <- read.spss("F:/research/Documents/GSS2014.sav", use.value.labels=TRUE, to.data.frame=TRUE)

>#load STATA version into data frame ‘Dataset3’

>Dataset3 <- read.dta("E:/research/Documents/GSS2014.dta")

Use the ‘View()’ command to view the data frame.
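If the GSS files are not yet on hand, the read step can be tried on a small made-up data frame. The sketch below writes a toy data frame to a temporary STATA file with write.dta and reads it back with read.dta, both from the foreign package. The object names toy and toy_back and the values in them are illustrative only.

```r
library(foreign)

# Hypothetical toy data standing in for the GSS file
toy <- data.frame(
  degree = factor(c("highschool", "bachelor", "graduate")),
  age    = c(34, 51, 42)
)

# Write to a temporary STATA file, then read it back into a data frame
tmp <- tempfile(fileext = ".dta")
write.dta(toy, tmp)
toy_back <- read.dta(tmp)

# Check the structure and dimensions of what was read
str(toy_back)
dim(toy_back)
```

The same str() and dim() checks are a quick way to confirm that a real GSS data frame has the expected number of cases and variables after loading.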

In part two I will discuss some techniques using R to create and analyze subsets of the GSS2014 data file.

 

Using R for Nonparametric Analysis, The Kruskal-Wallis Test Part Four: R Script and Some Notes on IDE’s


 


 

A tutorial by Douglas M. Wiig

 

In the previous three parts of this tutorial I discussed using R to enter a data set and perform a nonparametric Kruskal-Wallis test for ranked means. In this final part the commented script that was used in the first three parts is listed.

 

If you are going to use R for the majority of your statistical analysis, it is highly advisable to investigate some of the IDEs (Integrated Development Environments) that are available to assist in coding and debugging R scripts and creating R packages for personal use or distribution. I think one of the easiest to use is RStudio. RStudio is available in both free open source and commercial versions and can be downloaded at http://www.rstudio.com. There are versions available for Windows, various Linux distributions, and Mac OS.

The RStudio interface provides a number of useful tools that facilitate coding. The screen is divided into four sections. One section provides a code editor that features syntax highlighting, code completion, and many other features such as line or block code execution. Another window contains the R console and displays all output, error messages, and warnings when code is executed from the editor. A third window displays all of the variables that are active in the current environment and can also show all currently loaded R packages. A fourth window can show graphic output from executed code, can be used to manage, download, and install R packages, and can be used to access the CRAN database of online help. There are other useful tools that are too numerous to discuss here.

 

Another program that is worth looking at is RKWard, which combines an IDE with a graphical GUI for R statistical analysis. Information and downloads for RKWard can be found at https://rkward.kde.org. This program is also free and open source and can be run on Windows, Mac OS, or various distributions of Linux. The program has been optimized for Linux. A fuller discussion of these IDEs is beyond the scope of this posting.

 

Shown below is the commented R script for all three parts of the Kruskal-Wallis tutorial. For ease of reading code portions are shown in bold print.

 

#packages that must be present in the global environment before running these scripts

 

#stats; graphics; grDevices; utils; datasets; methods; base

 

#

 

#code to enter data using the data editor

 

#KW data entry, define file kruskal as a data frame

 

kruskal <-data.frame()

 

#invoke the data editor

 

kruskal <-edit(kruskal)

 

#define group as a factor with 3 levels; tell R which data column goes with which factor

 

group <- factor(c(1,2,3))

 

#alternative data entry method

 

#Define factor Group as containing three categories

 

Group <- factor(c(1,1,1,1,1,2,2,2,2,2,3,3,3,3))

 

#create a vector defined as authscore and enter values

 

authscore <- c(96,128,83,61,101,82,121,132,135,109,115,149,166,147)

 

#create data frame kruskal matching each group factor to individual scores

 

kruskal <- data.frame(Group, authscore)

 

#use the following line to look at the structure of the data frame created

 

str(kruskal)

 

#run the basic Kruskal-Wallis test

 

#

 

kruskal.test(authscore ~ Group, data=kruskal)

 

#

 

#the following code is used to conduct a post-hoc comparison of the ranked means

 

#it is useful to first do a simple boxplot for a visual comparison

 

#Use this script to save the boxplot graphic to a .png file

 

#open the .png graphics device; plot output now goes to the file authplot.png

 

png("authplot.png")

 

#draw the boxplot; the summary statistics are also stored in authplot

 

authplot <- boxplot(authscore ~ Group, data=kruskal, main="Group Comparison", ylab="authscore")

 

#close the .png file and return graphic output to the screen

 

dev.off()

 

#

 

#make sure that the package PMCMR is loaded before running the following script

 

library(PMCMR)

 

#use the ‘with’ function to pass the data from the kruskal data frame to the post hoc

 

#test script; specify the Tukey HSD method for determining significance of each

 

# pair of comparisons

 

#

 

with(kruskal, {

 

 

posthoc.kruskal.nemenyi.test(authscore, Group, "Tukey")

 

})

 

#

 

#NOTE: if using a version of R < 3.xx then use the package pgirmess instead of PMCMR

 

#

 

#the following lines show the post hoc analysis using the pgirmess package

 

#note the function kruskalmc is used for the comparisons

 

library(pgirmess)

 

authscore <- c(96,128,83,61,101,82,121,132,135,109,115,149,166,147)

 

Group <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3)

 

kruskalmc(authscore ~ Group, probs=.05, cont=NULL)

 

#
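As a supplement to the script above, the following minimal sketch shows how the pieces of the Kruskal-Wallis result can be pulled out of the object that kruskal.test returns, using the same authscore and Group data. The object names kw and mean_ranks are illustrative assumptions and do not appear in the original script.

```r
# Same data as in the script above
Group <- factor(c(1,1,1,1,1,2,2,2,2,2,3,3,3,3))
authscore <- c(96,128,83,61,101,82,121,132,135,109,115,149,166,147)
kruskal <- data.frame(Group, authscore)

# Store the test result instead of only printing it
kw <- kruskal.test(authscore ~ Group, data = kruskal)

kw$statistic  # Kruskal-Wallis chi-squared statistic
kw$parameter  # degrees of freedom (number of groups minus 1)
kw$p.value    # p-value of the test

# Mean rank of authscore within each group, for a quick look at
# which groups sit higher or lower before running a post-hoc test
mean_ranks <- tapply(rank(kruskal$authscore), kruskal$Group, mean)
print(mean_ranks)
```

Storing the result in an object makes it easy to report the statistic and p-value in later output or to check them against the post-hoc comparisons.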