R Tutorial: Using the NORC GSS2014 Data File, Creating and Using Subsets


By Douglas M. Wiig

As discussed in the first part of this tutorial, anyone interested in researching social science questions has a wealth of survey data available through the National Opinion Research Center (NORC) and its associated research universities. The Center has conducted its national survey regularly since 1972 and has compiled a massive database from these surveys. Most if not all of these data files can be accessed and downloaded without charge. I have been working with the 2014 edition of the data, and for all parts of this tutorial I will use the GSS2014 data file that is available for download on the Center’s web site. (See the NORC main website at http://www.norc.org/Research/Projects/Pages/general-social-survey.aspx and at http://www3.norc.org/GSS+Website ).

Accessing and loading the NORC GSS2014 data set was discussed in part one of this tutorial. Refer to it if you need specific information on downloading the data set in Stata or SPSS format. In this segment we will use the subset function to select, from all of the cases in the data file, the set of cases that meet certain criteria. As indicated in my previous tutorial, the GSS2014 data set contains a total of 2588 cases and 866 variables.

One of the areas that NORC surveys regularly deals with attitudes toward abortion. One of the questions simply asks respondents if they '...approve of abortion under any circumstances.' The response to this question is either YES or NO. Let's assume a researcher is interested in investigating whether or not education has an effect on how the respondent answers the question.

To examine this question we can use the abortion attitude variable mentioned above, 'abany', and an education variable, 'educ', which measures education as the actual number of years of schooling completed. For example, twelve years of education corresponds to a high school graduate and 16 years to a college graduate. We can select a subset of all respondents who indicated 'YES' on the survey question and then generate the mean years of education for this subset. We can then select a subset of all respondents who indicated 'NO' on the question and calculate the mean years of education for the second subset.

Before starting this code segment be sure that the foreign package is installed and loaded into your R session.  Import the GSS 2014 data file and load it into the data frame ‘Dataset’ using:

########################################################
#import GSS2014 file in SPSS .sav format
#uses foreign package
########################################################
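library(foreign)  # provides read.spss()
#replace the placeholder path below with the location of your copy of the file
Dataset <- read.spss("/path to your location/GSS2014.sav",
                     use.value.labels=TRUE, max.value.labels=Inf, to.data.frame=TRUE)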
########################################################
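
Before subsetting, it can help to confirm that the two variables came in the way the rest of this tutorial expects: 'abany' as a factor with the labels 'YES' and 'NO', and 'educ' as a number. The sketch below is only an illustration. It uses a small made-up data frame named 'toy' as a stand-in for 'Dataset' (the values are invented and are not GSS data), so it runs without the data file. The same four calls can be applied to 'Dataset' after the import.

```r
# Toy stand-in for the imported data frame. These values are invented
# for illustration only and are NOT taken from the GSS2014 file.
toy <- data.frame(
  abany = factor(c("YES", "NO", NA, "YES", "NO")),
  educ  = c(16, 12, 14, NA, 10)
)

dim(toy)                           # number of cases (rows) and variables (columns)
levels(toy$abany)                  # exact spelling of the response labels
table(toy$abany, useNA = "ifany")  # count of each response, including missing
is.numeric(toy$educ)               # TRUE when years of education are stored as numbers
```

If the labels in your copy of the file are spelled differently (for example 'Yes' rather than 'YES'), use that spelling in the subset conditions that follow.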

Once the GSS2014 file is loaded, use the subset function to select your first subset: respondents who answered the 'abany' question with a 'YES' response. Use the following code to select the subset and store it in a data frame 'SS1':

####################################################
#select subset from Dataset and write to data frame SS1
####################################################
SS1 <- subset(Dataset, abany == "YES", select=educ)
View(SS1)
####################################################

Now select a second subset of respondents who answered the 'abany' question with a 'NO' response. Use the following code to select the subset and store it in a data frame 'SS2':

######################################################
#select subset from Dataset and write to data frame SS2
######################################################
SS2 <- subset(Dataset, abany == "NO", select=educ)
View(SS2)
######################################################

In the subset calls above, the first argument names the data set, the second gives the criteria for selecting rows, and the 'select' argument names the variables to keep from each selected row. If no 'select' option is given, all variables are kept for the selected rows.
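
The three parts of a subset call can be seen on a small scale in the sketch below. The data frame 'toy' and its 'id' column are hypothetical and exist only for this illustration; the values are not GSS data. The last line shows bracket indexing that selects the same rows, included only for comparison.

```r
# Toy data frame with invented values (NOT GSS data); 'id' is a
# hypothetical column added so there is more than one variable to select.
toy <- data.frame(
  id    = 1:6,
  abany = factor(c("YES", "NO", "YES", NA, "NO", "YES")),
  educ  = c(16, 12, 14, 18, NA, 12)
)

# Rows where abany is "YES", keeping only the educ column
yes_educ <- subset(toy, abany == "YES", select = educ)

# Same row condition without 'select': every column is kept
yes_all <- subset(toy, abany == "YES")

# 'select' also accepts several columns at once
yes_two <- subset(toy, abany == "YES", select = c(id, educ))

# Bracket indexing that picks the same rows; which() skips rows where
# the comparison is NA, which is also what subset() does
yes_br <- toy[which(toy$abany == "YES"), "educ", drop = FALSE]
```

Note that the respondent with a missing 'abany' value lands in neither the 'YES' nor the 'NO' subset, because subset drops rows where the condition evaluates to NA.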

Using the View command to examine each subset shows the years of education for each of the 746 respondents who answered ‘YES’ and each of the 907 respondents who answered ‘NO.’ Since the variable ‘educ’ is measured as ratio-level numeric data, we can calculate a mean and standard deviation for each subset and perform both graphical and statistical analysis of any observed difference between the two means. This will be the subject of the next installment of the tutorial.
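
For readers who want a quick look at the calculation before then, the following is a minimal sketch of the mean and standard deviation step. It again uses a made-up data frame 'toy' with invented values rather than the GSS2014 file, and the names 'yes' and 'no' stand in for 'SS1' and 'SS2'. The 'na.rm = TRUE' argument is an assumption about how missing 'educ' values should be handled: it leaves them out of the calculation.

```r
# Toy data with invented values (NOT GSS data)
toy <- data.frame(
  abany = factor(c("YES", "NO", "YES", "NO", "YES", "NO")),
  educ  = c(16, 12, 14, 10, NA, 14)
)

# Same pattern as SS1 and SS2 in the tutorial
yes <- subset(toy, abany == "YES", select = educ)
no  <- subset(toy, abany == "NO",  select = educ)

# Mean and standard deviation of years of education in each subset;
# na.rm = TRUE leaves missing values out of the calculation
mean_yes <- mean(yes$educ, na.rm = TRUE)
sd_yes   <- sd(yes$educ, na.rm = TRUE)
mean_no  <- mean(no$educ, na.rm = TRUE)
sd_no    <- sd(no$educ, na.rm = TRUE)

# Observed difference between the two subset means
mean_yes - mean_no
```

Without 'na.rm = TRUE', a single missing value in a subset makes both mean and sd return NA.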