Using R to Work with GSS Survey Data: Viewing Datasets and Performing Cross Tabulations
A tutorial by D. M. Wiig
In a previous tutorial I discussed how to import datasets from the NORC General Social Survey (GSS), using R to read the SPSS-formatted data into an R data frame. Once the data have been imported into the R working environment they can be viewed and analyzed. A wealth of survey research data is available at the NORC web site, www.norc.org. This tutorial uses the dataset gss2010.sav, which is available from www3.norc.org/GSS+Website.
From that page click on the “Quick Downloads” link on the right-hand side of the page to access the list of available datasets. On the next page choose SPSS to access ‘.sav’ format files, and finally choose “2010” under the heading “GSS 1972-2012 Release 6.” Please note that this is a rather large data file, with 2044 observations of 794 variables. Download the file to a directory that you can access from your R console.
As discussed in a previous tutorial, the SPSS format file can be loaded into an R data frame. Make sure that the R packages Hmisc and foreign have been installed and loaded before attempting to import the SPSS file. The following code will load the ‘.sav’ file:
>install.packages("Hmisc") #needed for file import
>install.packages("foreign") #needed for file import
>#get the SPSS GSS file and put it into a data frame
>library(Hmisc)
>gssdataframe <- spss.get("/path-to-your-file/GSS2010.sav", use.value.labels=TRUE)
Once the file has been read into an R data frame, it can be viewed in a spreadsheet-like interface by using the command:
>View(gssdataframe)
The arrow keys, together with the Home, End, Page Up, and Page Down keys, can be used to navigate and browse the file.
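When a spreadsheet-style viewer is not convenient (for example in a plain terminal session), a few base R functions give a quick non-interactive overview of a data frame. The following is only a minimal sketch under an assumption: toy is a small, made-up stand-in for gssdataframe, not actual GSS data. The same calls should work on the imported data frame.

```r
# Hypothetical stand-in for gssdataframe (made-up rows, NOT GSS data)
toy <- data.frame(
  partyid  = factor(c("STRONG DEMOCRAT", "INDEPENDENT", "STRONG REPUBLICAN")),
  polviews = factor(c("LIBERAL", "MODERATE", "CONSERVATIVE")),
  age      = c(34, 51, 67)
)

dim(toy)             # number of rows (observations) and columns (variables)
names(toy)           # variable names
str(toy)             # type of each variable and its first few values
head(toy, n = 2)     # first two rows
summary(toy$partyid) # frequency count for each category of a factor
```

With the full GSS file, dim(gssdataframe) would be expected to report the 2044 observations and 794 variables mentioned earlier.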
Survey data such as that found in the GSS file is usually a mixture of data types ranging from ratio-level numbers to categorical data. Cross tabulations are often used to explore relationships among variables that are ordinal or categorical in nature. R has a number of functions available for cross tabulations. The table function (note the lowercase name, since R is case sensitive) is a quick way to generate a cross tabulation, with a number of options available. The following results in a frequency table of the variables “partyid” and “polviews”, both of which are measured in categories:
>#use the gssdataframe
>#the variables partyid and polviews are used
>attach(gssdataframe)
>#create a table named ‘gsstable’
>gsstable <- table(partyid, polviews)
>gsstable #print table frequencies
The following output results:
                   polviews
partyid            EXTREMELY LIBERAL LIBERAL SLIGHTLY LIBERAL MODERATE
STRONG DEMOCRAT                   41     105               42       94
NOT STR DEMOCRAT                  14      62               57      154
IND,NEAR DEM                      11      47               57      103
INDEPENDENT                        5      20               33      189
IND,NEAR REP                       1       4               16       74
NOT STR REPUBLICAN                 2      10               16       88
STRONG REPUBLICAN                  0       5                5       22
OTHER PARTY                        1       5                6       16
                   polviews
partyid            SLGHTLY CONSERVATIVE CONSERVATIVE EXTRMLY CONSERVATIVE
STRONG DEMOCRAT                      22           25                    6
NOT STR DEMOCRAT                     28           16                    7
IND,NEAR DEM                         25           11                    5
INDEPENDENT                          43           32                    9
IND,NEAR REP                         49           43                    8
NOT STR REPUBLICAN                   72           72                   13
STRONG REPUBLICAN                    23          101                   27
OTHER PARTY                           3           12                    4
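Marginal totals and cell percentages for a table like the one above can be obtained with the base R functions addmargins and prop.table. The sketch below re-enters four of the counts shown above by hand (STRONG DEMOCRAT and STRONG REPUBLICAN by LIBERAL and CONSERVATIVE) so that it runs without the GSS file. The object names are only illustrative, and with the full data the same calls could presumably be applied to gsstable directly.

```r
# Four counts copied by hand from the partyid by polviews output above
subtable <- as.table(matrix(
  c(105,  25,
      5, 101),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    partyid  = c("STRONG DEMOCRAT", "STRONG REPUBLICAN"),
    polviews = c("LIBERAL", "CONSERVATIVE")
  )
))

# Append row and column totals (labelled "Sum" by default)
with_totals <- addmargins(subtable)
print(with_totals)

# Proportions of the grand total, of each row, and of each column
cell_prop <- prop.table(subtable)
row_prop  <- prop.table(subtable, margin = 1)
col_prop  <- prop.table(subtable, margin = 2)

# Row percentages rounded to one decimal place
print(round(100 * row_prop, 1))
```

Row percentages (margin = 1) are usually the natural choice when the row variable is treated as the grouping variable, as partyid is here.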
There are options available with the table function that include calculating row and column marginal totals as well as cell percentages. Another quick method to generate tables is with the CrossTable function. The function is contained in the gmodels package and can be used on the table generated with the table function above. Use the following lines of code to generate a cross table between ‘polviews’ and ‘partyid’ using the gsstable created above:
>library(gmodels)
>#produce basic crosstabs
>CrossTable(gsstable, prop.t=FALSE, prop.r=FALSE, prop.c=FALSE, chisq=TRUE, format="SPSS")
>
Cell Contents
|-------------------------|
| Count |
| Chi-square contribution |
|-------------------------|
Total Observations in Table: 1961
| polviews
partyid | EXTREMELY LIBERAL | LIBERAL | SLIGHTLY LIBERAL | MODERATE | SLGHTLY CONSERVATIVE | CONSERVATIVE | EXTRMLY CONSERVATIVE | Row Total |
-------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|
STRONG DEMOCRAT | 41 | 105 | 42 | 94 | 22 | 25 | 6 | 335 |
| 62.014 | 84.219 | 0.141 | 8.312 | 11.962 | 15.026 | 4.163 | |
-------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|
NOT STR DEMOCRAT | 14 | 62 | 57 | 154 | 28 | 16 | 7 | 338 |
| 0.089 | 6.911 | 7.238 | 5.486 | 6.840 | 26.537 | 3.215 | |
-------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|
IND,NEAR DEM | 11 | 47 | 57 | 103 | 25 | 11 | 5 | 259 |
| 0.121 | 4.902 | 22.674 | 0.284 | 2.857 | 22.144 | 2.830 | |
-------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|
INDEPENDENT | 5 | 20 | 33 | 189 | 43 | 32 | 9 | 331 |
| 4.634 | 12.733 | 0.969 | 32.889 | 0.067 | 8.107 | 1.409 | |
-------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|
IND,NEAR REP | 1 | 4 | 16 | 74 | 49 | 43 | 8 | 195 |
| 5.592 | 18.279 | 2.167 | 0.002 | 19.466 | 4.622 | 0.003 | |
-------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|
NOT STR REPUBLICAN | 2 | 10 | 16 | 88 | 72 | 72 | 13 | 273 |
| 6.824 | 18.702 | 8.224 | 2.190 | 33.411 | 18.786 | 0.364 | |
-------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|
STRONG REPUBLICAN | 0 | 5 | 5 | 22 | 23 | 101 | 27 | 183 |
| 6.999 | 15.115 | 12.805 | 32.065 | 0.121 | 177.476 | 52.256 | |
-------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|
OTHER PARTY | 1 | 5 | 6 | 16 | 3 | 12 | 4 | 47 |
| 0.354 | 0.227 | 0.035 | 0.170 | 1.768 | 2.735 | 2.344 | |
-------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|
Column Total | 75 | 258 | 232 | 740 | 265 | 312 | 79 | 1961 |
-------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|
Statistics for All Table Factors
Pearson's Chi-squared test
------------------------------------------------------------
Chi^2 = 801.8746 d.f. = 42 p = 3.738705e-141
Minimum expected frequency: 1.797552
Cells with Expected Frequency < 5: 2 of 56 (3.571429%)
Warning message:
In chisq.test(t, correct = FALSE, ...) :
Chi-squared approximation may be incorrect
This code produces a table of frequencies along with a basic Chi-squared test. Other options include generating cell percentages and using either SPSS or SAS table format. This is accomplished by changing the appropriate argument (prop.r, prop.c, or prop.t) from FALSE to TRUE and specifying either "SPSS" or "SAS" for the format argument. The table formatting is compressed in this example due to the narrow margin requirements of the web page. Use the scroll bar at the bottom of the page to view the entire table.
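The “Chi-square contribution” printed in each cell is (observed - expected)^2 / expected, and the warning at the end of the output is raised by chisq.test when some expected counts fall below 5. As a rough check, the sketch below re-enters the counts from the table above by hand, so that it runs without the GSS file or the gmodels package, and recomputes these quantities with base R. The object names are illustrative only.

```r
# Counts copied by hand from the partyid by polviews table above
counts <- matrix(
  c(41, 105, 42,  94, 22,  25,  6,
    14,  62, 57, 154, 28,  16,  7,
    11,  47, 57, 103, 25,  11,  5,
     5,  20, 33, 189, 43,  32,  9,
     1,   4, 16,  74, 49,  43,  8,
     2,  10, 16,  88, 72,  72, 13,
     0,   5,  5,  22, 23, 101, 27,
     1,   5,  6,  16,  3,  12,  4),
  nrow = 8, byrow = TRUE,
  dimnames = list(
    partyid = c("STRONG DEMOCRAT", "NOT STR DEMOCRAT", "IND,NEAR DEM",
                "INDEPENDENT", "IND,NEAR REP", "NOT STR REPUBLICAN",
                "STRONG REPUBLICAN", "OTHER PARTY"),
    polviews = c("EXTREMELY LIBERAL", "LIBERAL", "SLIGHTLY LIBERAL",
                 "MODERATE", "SLGHTLY CONSERVATIVE", "CONSERVATIVE",
                 "EXTRMLY CONSERVATIVE")
  )
)

# suppressWarnings() hides the small-expected-count warning for this demo
chi <- suppressWarnings(chisq.test(counts, correct = FALSE))

# Expected counts under independence: row total * column total / n
expected <- chi$expected

# Per-cell contribution to the statistic: (observed - expected)^2 / expected
contribution <- (counts - expected)^2 / expected

print(chi$statistic)        # overall chi-squared statistic
print(chi$parameter)        # degrees of freedom: (8 - 1) * (7 - 1)
print(min(expected))        # minimum expected frequency
print(sum(expected < 5))    # number of cells with expected count below 5
print(round(contribution, 3))
```

If the counts were copied correctly, these values should agree with the statistic, degrees of freedom, and minimum expected frequency reported in the CrossTable output above.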
There are many functions available in R to analyze data in tabular format. In my next tutorial I will examine using the xtabs function to produce basic cross tabulation with control variables.