R Tutorial: Using the NORC GSS2014 Data File, Creating and Using Subsets


By Douglas M. Wiig

As discussed in the first part of this tutorial, for anyone interested in researching social science questions there is a wealth of survey data available through the National Opinion Research Center (NORC) and its associated research universities. The Center has been conducting its national General Social Survey (GSS) since 1972 and has compiled a massive database from these surveys. Most if not all of these data files can be accessed and downloaded without charge. I have been working with the 2014 edition of the data, and for all parts of this tutorial I will use the GSS2014 data file that is available for download on the Center’s web site. (See the NORC main website at http://www.norc.org/Research/Projects/Pages/general-social-survey.aspx and at http://www3.norc.org/GSS+Website.)

Accessing and loading the NORC GSS2014 data set was discussed in part one of this tutorial. Refer to it if you need specific information on downloading the data set in STATA or SPSS format. In this segment we will use the subset function to select, from all of the cases in the data file, the set of cases that meet certain criteria. As indicated in my previous tutorial, the GSS2014 data set contains a total of 2588 cases and 866 variables.

One of the areas surveyed by NORC each year deals with attitudes toward abortion. One of the questions simply asks respondents if they '...approve of abortion under any circumstances.'  The response is either YES or NO to this question.  Let's assume a researcher is interested in investigating whether or not education has an effect on how the respondent answers the question.

To look at this hypothesis we can use the abortion attitude variable mentioned above, 'abany', and an education variable 'educ' which measures education as the actual number of years of education.  Twelve years of education would be a high school graduate for example, and 16 years would be a college graduate.  We can select a subset of all respondents who indicated 'YES' on the survey question and then generate a mean years of education for this subset.  We can then select a subset of all respondents who indicated 'NO' on the question and calculate a mean years of education for the second subset.

Before starting this code segment be sure that the foreign package is installed and loaded into your R session.  Import the GSS 2014 data file and load it into the data frame ‘Dataset’ using:

########################################################
#import GSS2014 file in SPSS .sav format
#uses foreign package
########################################################
library(foreign)  #stops with an error if the package is not installed
Dataset <- read.spss("/path to your location/GSS2014.sav", 
                     use.value.labels=TRUE, max.value.labels=Inf, to.data.frame=TRUE)
########################################################

Once the GSS2014 file is loaded, use the subset function to select your first subset of respondents who answered the 'abany' question with a 'YES' response. Use the following code to select the subset and store it in a data frame 'SS1':

####################################################
#select subset from Dataset and write to data frame SS1
####################################################
SS1 <- subset(Dataset, abany == "YES", select=educ)
View(SS1)
####################################################

Now select a second subset of respondents who answered the 'abany' question with a 'NO' response. Use the following code to select the subset and store in a data frame 'SS2':

######################################################
#select subset from Dataset and write to data frame SS2
######################################################
SS2 <- subset(Dataset, abany == "NO", select=educ)
View(SS2)
######################################################

In using the subset function as seen above, the name of the data set is specified first, then the criterion for selecting rows is given, and finally the variables to keep from each selected row are specified. If no 'select' option is given, all variables will be returned for the selected rows.
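
The three arguments described above can be tried on a small stand-in before running them against the full GSS file. The sketch below is only an illustration: the data frame 'toy' and its values are made up, and only the column names 'abany' and 'educ' are borrowed from the GSS.

```r
# Toy stand-in for the GSS data frame (made-up values, for illustration only)
toy <- data.frame(
  abany = factor(c("YES", "NO", "YES", "NO", "YES")),
  educ  = c(16, 12, 14, 10, 18),
  age   = c(34, 51, 27, 63, 45)
)

# Rows where abany is "YES", keeping only the educ column
yes_educ <- subset(toy, abany == "YES", select = educ)

# Same row criterion without 'select': every column is kept
yes_all <- subset(toy, abany == "YES")

print(yes_educ)
print(names(yes_all))
```

Note that subset also drops rows where the criterion evaluates to NA, so respondents with a missing 'abany' value would appear in neither the 'YES' nor the 'NO' subset.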

Using the View command to examine each subset shows the years of education for each of the 746 respondents who answered ‘YES’ and each of the 907 respondents who answered ‘NO.’ Since the variable ‘educ’ is measured as ratio level numeric data we can calculate a mean and standard deviation for each subset and perform both graphical and statistical analysis of any observed difference between the two means. This will be the subject of the next installment of the tutorial.
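
As a preview of that step, here is a minimal sketch of the mean and standard deviation calculation on made-up data. The data frame 'toy' and the helper 'to_years' are hypothetical. The conversion through as.character is an assumption, added in case 'educ' is imported as a factor rather than as a number; it is harmless if the column is already numeric.

```r
# Made-up stand-in for the GSS data frame (for illustration only)
toy <- data.frame(
  abany = factor(c("YES", "NO", "YES", "NO", "YES", "NO")),
  educ  = c(16, 12, 14, 10, 18, 11)
)

SS1 <- subset(toy, abany == "YES", select = educ)
SS2 <- subset(toy, abany == "NO",  select = educ)

# Hypothetical helper: coerce to numeric years even if educ arrived as a factor
to_years <- function(x) as.numeric(as.character(x))

yes_mean <- mean(to_years(SS1$educ), na.rm = TRUE)
yes_sd   <- sd(to_years(SS1$educ), na.rm = TRUE)
no_mean  <- mean(to_years(SS2$educ), na.rm = TRUE)
no_sd    <- sd(to_years(SS2$educ), na.rm = TRUE)

print(c(yes_mean = yes_mean, yes_sd = yes_sd, no_mean = no_mean, no_sd = no_sd))
```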

Using R in Nonparametric Statistical Analysis, The Kruskal-Wallis Test Part Three: Post Hoc Pairwise Multiple Comparison Analysis of Ranked Means


A tutorial by Douglas M. Wiig

In previous tutorials I discussed an example of entering data into a data frame and performing a nonparametric Kruskal-Wallis test to determine if there were differences in the authoritarianism scores of three different groups of educators. The test statistic indicated that at least one of the groups (group 1) was significantly different from the other two.

In order to explore the difference further, it is common practice to do a post hoc analysis of the differences. There are a number of methods that have been devised to do these comparisons, but one of the most straightforward and easiest to understand is pairwise comparison of ranked means (or of means if using standard ANOVA).

Prior to entering the code for this section be sure that the following packages are installed and loaded:

       PMCMR

       pgirmess

In part one data was entered into the R editor to create a data frame. Data frames can also be created directly using R script. The script to create the data frame for this example uses the following code:

#create data frame from script input
Group <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3)
authscore <- c(96,128,83,61,101,82,121,132,135,109,115,149,166,147)
kruskal <- data.frame(Group, authscore)

The group identifiers are entered and assigned to the variable Group, and the authoritarianism scores are assigned to the variable authscore. Notice that each identifier is matched by position with the appropriate authscore, just as they were when entered in columns using the data editor. The two vectors are then combined into a data frame that is assigned to the variable kruskal. Once again the structure of the data frame can be checked using the command:

str(kruskal)

resulting in:

'data.frame':   14 obs. of  2 variables:
 $ Group    : num  1 1 1 1 1 2 2 2 2 2 ...
 $ authscore: num  96 128 83 61 101 82 121 132 135 109 ...
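
With the data frame in place, the omnibus Kruskal-Wallis test from the earlier parts can be re-run as a quick check before moving on to the post hoc step. This is a minimal sketch using kruskal.test from base R's stats package; the name 'kw' is just a label chosen here for the result.

```r
# Rebuild the example data so that this block runs on its own
Group <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3)
authscore <- c(96,128,83,61,101,82,121,132,135,109,115,149,166,147)
kruskal <- data.frame(Group, authscore)

# Omnibus Kruskal-Wallis rank sum test; Group is treated as a grouping factor
kw <- kruskal.test(authscore ~ Group, data = kruskal)
print(kw)
```

Because the scores contain no ties, the statistic can be checked by hand from the rank sums (22, 37 and 46): it works out to roughly 6.41 on 2 degrees of freedom, which is significant at the .05 level.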


It is often useful to do a visual examination of the ranked means prior to post hoc analysis. This can be easily accomplished using a boxplot to display the 3 groups that are presented in the example. If the data frame created in tutorial one is still in the global environment the boxplot can be generated with the following script:

#boxplot using authscore and Group variables from the data frame created in part one
boxplot(authscore ~ Group, data=kruskal, main="Group Comparison", ylab="authscore")

The resulting boxplot is seen below:

[Figure: boxplot of authscore by Group, titled "Group Comparison"]

As can be seen in the plot, differences in authoritarianism scores are greatest between groups 1 and 3, with group 2 in between. Use the following code to run the post hoc test and examine which of the mean ranks are significantly different:

#load the PMCMR package and run the Nemenyi post hoc test
library(PMCMR)

with(kruskal, {
  posthoc.kruskal.nemenyi.test(authscore, Group, "Tukey")
})

The post hoc test used in this example is from the recently released PMCMR R package. For details of this and other post hoc tests contained in the package, see Thorsten Pohlert, Calculate Pairwise Multiple Comparisons of Mean Rank Sums, 2015 (http://cran.r-project.org/web/packages/PMCMR/PMCMR.pdf). The test employed here used the Tukey method to make pairwise comparisons of the mean rank authoritarianism scores of the three groups. The output from the script above is:

Pairwise comparisons using Tukey and Kramer (Nemenyi) test
with Tukey-Dist approximation for independent samples

data: authscore and Group

  1     2
2 0.493 -
3 0.031 0.310

P value adjustment method: none

The output above confirms what would be expected from observing the boxplot. The only mean ranks that differ significantly are those of groups 1 and 3 (p = .031).

The PMCMR package requires R version 3.0.x or later. If using an earlier version of R, another package can be used to accomplish the post hoc comparisons: the pgirmess package (see http://cran.r-project.org/web/packages/pgirmess/pgirmess.pdf for complete details). Using the vectors authscore and Group that were created earlier, the script for multiple comparisons using the pgirmess package is:

library(pgirmess)

authscore <- c(96,128,83,61,101,82,121,132,135,109,115,149,166,147)

Group <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3)

kruskalmc(authscore ~ Group, probs=.05, cont=NULL)

and the output from this script using a significance level of p = .05 is:

Multiple comparison test after Kruskal-Wallis
p.value: 0.05
Comparisons
    obs.dif critical.dif difference
1-2     3.0     6.333875      FALSE
1-3     7.1     6.718089       TRUE
2-3     4.1     6.718089      FALSE

As noted earlier the comparison between groups one and three is shown to be the only significant difference at the p=.05 level.
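
The observed and critical differences reported above can be reproduced by hand in base R, which helps show what the comparison is actually doing. The sketch below assumes the usual large-sample formula for multiple comparisons after a Kruskal-Wallis test (a normal quantile at alpha/(k(k-1)) times the standard error of a mean rank difference). It is a check on the numbers, not a reproduction of the package's source code, and all of the variable names are chosen here for illustration.

```r
authscore <- c(96,128,83,61,101,82,121,132,135,109,115,149,166,147)
Group <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3)

r <- rank(authscore)                    # ranks over all 14 scores
mean_rank <- tapply(r, Group, mean)     # mean rank per group
n <- tapply(r, Group, length)           # group sizes
N <- length(authscore)
k <- length(n)

# All pairs of groups: 1-2, 1-3, 2-3
pairs <- combn(names(mean_rank), 2)
g1 <- pairs[1, ]
g2 <- pairs[2, ]

obs_dif <- as.numeric(abs(mean_rank[g1] - mean_rank[g2]))

# Assumed formula: z at alpha/(k(k-1)) times the standard error of the difference
alpha <- 0.05
z <- qnorm(alpha / (k * (k - 1)), lower.tail = FALSE)
critical_dif <- as.numeric(z * sqrt(N * (N + 1) / 12 * (1 / n[g1] + 1 / n[g2])))

result <- data.frame(obs_dif, critical_dif,
                     difference = obs_dif > critical_dif,
                     row.names = paste(g1, g2, sep = "-"))
print(result)
```

The mean ranks come out as 4.4, 7.4 and 11.5, so only the gap between groups 1 and 3 (7.1) exceeds its critical difference.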

Both the PMCMR and the pgirmess packages are useful in producing post hoc comparisons with the Kruskal-Wallis test. It is hoped that this series of tutorials discussing nonparametric alternatives to common parametric statistical tests has helped demonstrate the utility of these approaches in statistical analysis.

In part four I will post the complete script used in all three tutorials.

Book Review: R High Performance Programming


A book review by Douglas M. Wiig

Aloysius Lim and William Tjhi. R High Performance Programming. Birmingham, UK: Packt Publishing Ltd., 2015. bit.ly/14Rhpp

R High Performance Programming is a well written, informative book most suited for the experienced R programmer. This book offers a handy guide for R users who need speed and efficiency for the tasks that they perform.

The authors begin with an informative chapter discussing some of the inherent constraints on R’s computing performance, such as CPU and RAM usage, and the fact that R code is interpreted on the fly rather than compiled. The next chapter presents a guide to several methods of profiling R code for execution time, memory allocation and CPU usage. Sample code included in the chapter allows the reader to experiment with various benchmarking techniques to measure processing time and memory usage. This chapter provides the reader with some good tools for benchmarking R projects and identifying areas where improvements in processing can be made.

As is always the case with technical books from Packt Publishing, ample code examples are used in the chapter and the complete code used in each chapter is available for download with the book. This is a very handy feature and allows readers to do some live programming with R as the book is read.

The authors discuss a number of simple tweaks that can be easily performed to increase processing speed, such as using built-in functions and using hash tables. The hash table technique is useful for applications that perform frequent lookups and can dramatically reduce processing time when compared to the use of lists. Running the example code shows a large decrease in processing time for the hash table approach as compared to straight list lookups.
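
To give a feel for the hash table idea, here is a minimal sketch of my own (it is not code from the book) that compares lookups by name in a plain named list with lookups in an environment, which R can back with a hash table. The sizes and key names are arbitrary, and the timings will vary by machine and R version, so none are quoted here.

```r
set.seed(1)
n <- 5000
keys <- paste0("key", seq_len(n))
values <- seq_len(n)

# Plain named list: finding an element by name means searching the names
lookup_list <- as.list(values)
names(lookup_list) <- keys

# Environment created with hash = TRUE: elements are found through a hash table
lookup_env <- new.env(hash = TRUE, size = n)
for (i in seq_len(n)) assign(keys[i], values[i], envir = lookup_env)

# The same random sample of keys is looked up in both structures
probe <- sample(keys, 1000, replace = TRUE)

time_list <- system.time(
  from_list <- vapply(probe, function(k) lookup_list[[k]], integer(1))
)
time_env <- system.time(
  from_env <- vapply(probe, function(k) lookup_env[[k]], integer(1))
)

print(time_list)
print(time_env)
```

Both approaches return the same values; only the time taken differs, and the gap tends to grow as the number of stored keys increases.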

In chapter 4 the authors discuss the use of compiled R code and integrating compiled languages into R code. They show several examples of using the R package inline that allows users to embed C, C++, Objective-C, Objective-C++ and Fortran code within R. Once again there are ample code examples to illustrate the use of this technique. For more advanced uses of compiled code the authors discuss how to create entire modules coded in C++ using the Rcpp package. Several completed code examples are included to illustrate the technique.
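
The first of these ideas, compiling R code, can be tried without any extra installation, assuming that 'compiled R code' here refers to byte-compilation with the compiler package that ships with R. The sketch below is my own illustration, not an example from the book, and the function 'sum_of_squares' is hypothetical.

```r
library(compiler)

# A deliberately loop-heavy function (hypothetical example)
sum_of_squares <- function(n) {
  total <- 0
  for (i in seq_len(n)) total <- total + i^2
  total
}

# Byte-compile the function explicitly
sum_of_squares_cmp <- cmpfun(sum_of_squares)

print(system.time(res_plain <- sum_of_squares(1e6)))
print(system.time(res_cmp <- sum_of_squares_cmp(1e6)))
```

On the R 3.1.x versions the book was tested against, the compiled version of a loop like this would be expected to run noticeably faster. Since R 3.4.0 the just-in-time compiler is enabled by default, so on a current installation the two timings may be nearly identical.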

Another interesting approach to speeding up R is discussed in a chapter that explores several R packages designed to exploit the capability of the GPUs (graphics processing units) that are used in many computers. These techniques can facilitate creating very fast and efficient statistical modeling code using R and the GPU.

As indicated above, readers can download the code package included with the book and find a well-organized set of ten folders (one for each chapter) containing 51 files. These files contain the sample code from the book as well as other code segments and benchmark code discussed in the book. The authors indicate that the code has been tested on R 3.1.1, Ubuntu 14.04 Trusty Tahr, Mac OS X 10.9 Mavericks, and Windows 8.1. This allows integration of these code segments into the reader’s own projects with minimal changes.

Other chapters in R High Performance Programming discuss simple tweaks to use less memory, techniques to speed processing of large datasets, and the use of parallel processing and clustering techniques. The last chapter contains a discussion of using R and Hadoop to process Big Data (massive datasets with sizes measured in petabytes; one petabyte is 1,024 × 1,024 = 1,048,576 gigabytes). Processing data of this magnitude presents many challenges and is an area that is currently the subject of much program development.

I found R High Performance Programming to be a useful and informative book for the advanced user of R. A working knowledge of statistics, R and other programming languages such as C++ or Java is necessary to realize the full benefit of the techniques presented in the book. The book also serves as a good learning tool for less knowledgeable R users who are seeking to advance their programming skills.

Readers who are interested in the use of Hadoop and cluster computer processing might find the book Raspberry Pi Super Cluster by Andrew K. Dennis of interest (Packt Publishing, 2013; PAC-14-1987838-1387169). A review of this book can be found on my web site at http://dmwiig.net.

Reviewer Information:

Douglas M. Wiig, Professor of Political Science

Grand View University

Teaching areas include social science statistics and research methods, comparative politics, international politics.

Long time user and developer of computer and statistical applications

Host of Open Source Technology in Higher Education web site at http://dmwiig.net

Creator and moderator of LinkedIn discussion forum “Open Source Technology in Higher Education”

Regular contributor to several LinkedIn discussion forums

Author of numerous tutorials on using the R statistical programming language and Raspberry Pi computer