Category Archives: R Tutorials

R Tutorial: Using R with NORC GSS Data Part Two, Generating Simple Tables and Using Subsets

A tutorial by Douglas M. Wiig

Part one of this tutorial centered on importing NORC GSS data in STATA or SPSS format into an R data frame. For illustration I used the GSS2014 survey data set, which consists of 2538 cases and 866 variables. If a researcher wishes to generate some simple cross tabulations, the R CrossTable function is very useful.

The CrossTable function is part of the gmodels package, so before running the scripts in this tutorial make sure you have installed gmodels from your favorite CRAN mirror site and loaded it. As discussed in part one of the tutorial, load the GSS2014 dataset into the global environment using:


>library(foreign) #provides read.spss
>Dataset <- read.spss("E:/research/Documents/GSS2014.sav",
use.value.labels=TRUE, max.value.labels=Inf, to.data.frame=TRUE)

The CrossTable function allows a basic cross tabulation to be performed and includes a large number of options that can be incorporated into the table. The basic structure is as follows:


CrossTable(x, y, digits=3, max.width=5, expected=FALSE, prop.r=TRUE, prop.c=TRUE,
prop.t=TRUE, prop.chisq=TRUE, chisq=FALSE, fisher=FALSE, mcnemar=FALSE,
resid=FALSE, sresid=FALSE, asresid=FALSE, missing.include=FALSE,
format=c("SAS","SPSS"), dnn=NULL, ...)


x: A vector or a matrix. If y is specified, x must be a vector.

y: A vector in a matrix or a data frame.

digits: Number of digits after the decimal point for cell proportions.

max.width: In the case of a 1 x n table, the default will be to print the output horizontally. If the number of columns exceeds max.width, the table will be wrapped for each successive increment of max.width columns. If you want a single column vertical table, set max.width to 1.

expected: If TRUE, chisq will be set to TRUE and expected cell counts from the chi-square will be included.

prop.r: If TRUE, row proportions will be included.

prop.c: If TRUE, column proportions will be included.

prop.t: If TRUE, table proportions will be included.

prop.chisq: If TRUE, chi-square contribution of each cell will be included.

chisq: If TRUE, the results of a chi-square test will be included.

fisher: If TRUE, the results of a Fisher Exact test will be included.

mcnemar: If TRUE, the results of a McNemar test will be included.

resid: If TRUE, residual (Pearson) will be included.

sresid: If TRUE, standardized residual will be included.

asresid: If TRUE, adjusted standardized residual will be included.

missing.include: If TRUE, then remove any unused factor levels.

format: Either SAS (default) or SPSS, depending on the type of output desired.

dnn: The names to be given to the dimensions in the result (the dimnames names).

...: Optional arguments.

(Gregory Warnes, maintainer, Package ‘gmodels’, February 2015.)

In this tutorial I will create a table to examine the relationship between income and education using the variables ‘degree’ and ‘incom16’ from the GSS dataset. Both are categorical factors. To simplify the resulting table, only actual frequencies will be reported and the ‘chisq’ option will be used to generate a chi-squared test. The format will be set to SPSS. Use the following statement:

>#Generate a cross table of frequencies with chisq reported

>CrossTable(Dataset$incom16, Dataset$degree, chisq=TRUE, format="SPSS", prop.r=FALSE, prop.c=FALSE, prop.t=FALSE, prop.chisq=FALSE)


In the above code, the row variable is income, and the appropriate column of the dataset is selected with ‘Dataset$incom16’. The column variable for the table is education, and the appropriate column of the dataset is selected with ‘Dataset$degree’. The various cell proportions must be set to FALSE because they default to TRUE.

When you run the above script the table will be generated in SPSS format on the screen. I will not reproduce the table here because it does not fit well into the blog format.
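Since the GSS table is not reproduced here, the following minimal sketch shows the same kind of call on a small made-up data set. The factors income and degree and all of their counts are invented for illustration and are not GSS values. Base R's table() and chisq.test() are used so the sketch runs without gmodels; the CrossTable call is attempted only if gmodels happens to be installed.

```r
# Made-up data for illustration only (not GSS values)
income <- factor(rep(c("low", "middle", "high"), times = c(40, 35, 25)),
                 levels = c("low", "middle", "high"))
degree <- factor(c(rep(c("HS", "BA"), times = c(30, 10)),
                   rep(c("HS", "BA"), times = c(20, 15)),
                   rep(c("HS", "BA"), times = c(8, 17))),
                 levels = c("HS", "BA"))

# Base R: frequency table with income as rows and degree as columns
tab <- table(income, degree)
print(tab)

# Chi-squared test of independence on the same table
chi <- chisq.test(tab)
print(chi)

# Same table through CrossTable, frequencies only, if gmodels is available
if (requireNamespace("gmodels", quietly = TRUE)) {
  gmodels::CrossTable(income, degree, chisq = TRUE, format = "SPSS",
                      prop.r = FALSE, prop.c = FALSE, prop.t = FALSE,
                      prop.chisq = FALSE)
}
```

With real GSS data the two factors would simply be replaced by the corresponding columns of the data frame.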

In part three of this tutorial I will discuss generating subsets of the GSS data file and using subsets for statistical analyses such as t tests and ANOVA.


Tutorial: Using R to Analyze GSS2014 Social Science Data, Part One: Importing the Database in SPSS or STATA Format

For anyone interested in researching social science questions there is a wealth of survey data available through the National Opinion Research Center (NORC) and its associated research universities. The Center has been conducting a national survey each year since 1972 and has compiled a massive database from these surveys. Most if not all of these data files can be accessed and downloaded without charge. I have been working with the 2014 edition of the data and for this tutorial will use the GSS2014 data file that is available for download on the Center’s web site (see the NORC main website).

As noted above the datasets that are available for download are available in both SPSS format and STATA format. To work with either of these formats using R it is necessary to read the file into a data frame using one of a couple of different packages. The first option I will discuss uses the Hmisc package. The second option I will discuss uses the foreign package. Install both of these packages from your favorite CRAN mirror site before starting the code in this tutorial.

For this tutorial I am using the one year release file GSS2014. This file contains 2538 cases and 866 variables. Download the file from the web site listed above in both SPSS and STATA formats. Use the following code to load the Hmisc package into your R global environment:

>library(Hmisc)


Now load the GSS2014.sav SPSS version from your storage device using the following line of code. I am using the filename GSS2014 for my data file and loading the file into the data frame ‘gss14’:

>#load the GSS data file in SPSS format
>#put data into data frame 'gss14'
>gss14 <- spss.get("F:/research/Documents/GSS2014.sav", use.value.labels=TRUE)


To view the data that was loaded use the command:

>View(gss14)


This will produce a spreadsheet-like matrix of rows and columns containing the data. To load the data file in STATA format, download the STATA version of the file from the NORC web site as discussed above. My STATA file is also named GSS2014, but with the STATA .dta extension. Load the file into a data frame using:

>#load STATA format file into data frame 'Dataset2'
>library(foreign) #read.dta is provided by the foreign package
>Dataset2 <- read.dta("F:/research/Documents/GSS2014.dta")


Once again, you can view the data frame loaded using the command:

>View(Dataset2)


Both the STATA and SPSS formats of the data set can also be loaded into R using the foreign package directly. The procedure is the same for both SPSS and STATA:

>library(foreign)
>#load SPSS version into data frame 'Dataset'
>Dataset <- read.spss("F:/research/Documents/GSS2014.sav", use.value.labels=TRUE, to.data.frame=TRUE)
>#load STATA version into data frame 'Dataset3'
>Dataset3 <- read.dta("E:/research/Documents/GSS2014.dta")

Use the ‘View()’ command to view the data frame.
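For readers who want to try the import step before downloading the GSS file, here is a small sketch of a round trip with the foreign package. The data frame toy and its columns are made up for illustration; only write.dta and read.dta from foreign are used, and the file is written to a temporary location.

```r
library(foreign)

# A tiny made-up data frame standing in for a survey file
toy <- data.frame(id     = 1:3,
                  degree = factor(c("HS", "BA", "HS")),
                  age    = c(34, 51, 28))

# Write it out in STATA format, then read it back into a data frame
tmp <- tempfile(fileext = ".dta")
write.dta(toy, tmp)
back <- read.dta(tmp)

# Inspect the structure of the imported data frame
str(back)
```

The same str() check is a quick way to confirm that a real GSS import produced a data frame with the expected number of cases and variables.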

In part two I will discuss some techniques using R to create and analyze subsets of the GSS2014 data file.


Using R for Nonparametric Analysis, The Kruskal-Wallis Test Part Four: R Script and Some Notes on IDE’s


Using R for Nonparametric Analysis, The Kruskal-Wallis Test: R Script and Some Notes on IDE’s


A tutorial by Douglas M. Wiig


In the previous three parts of this tutorial I discussed using R to enter a data set and perform a nonparametric Kruskal-Wallis test for ranked means. In this final part the commented script that was used in the first three parts is listed.


If you are going to use R for the majority of your statistical analysis it is highly advisable that you investigate some of the IDEs (Integrated Development Environments) that are available to assist in coding and debugging R script and creating R packages for personal use or distribution. I think one of the easiest to use is R Studio. R Studio is available in both free open source and commercial versions, with downloads for Windows, various Linux distributions, and Mac OS.

The R Studio console provides a number of useful tools that facilitate coding. The screen is divided into four sections. One section provides a code editor that features syntax highlighting, code completion and many other features such as line or block code execution. Another window contains the R console and displays all output, error messages and warnings when code is executed from the editor. A third window displays all of the current environment variables that are active and can also show all currently loaded R packages. A fourth window can show graphic output from executed code, can be used to manage, download and install R packages, and can be used to access the CRAN database of online help. There are other useful tools that are too numerous to discuss here.


Another program that is worth looking at is RKWard, which combines an IDE with a graphical GUI for R statistical analysis. This program is also free and open source and can be run on a Windows platform, Mac OS, or various distributions of Linux. The program has been optimized for Linux. A detailed discussion of these IDEs is beyond the scope of this posting.


Shown below is the commented R script for all three parts of the Kruskal-Wallis tutorial.


#packages that must be present in the global environment before running these scripts
#stats; graphics; grDevices; utils; datasets; methods; base

#code to enter data using the data editor
#KW data entry, define file kruskal as a data frame
kruskal <- data.frame()
#invoke the data editor
kruskal <- edit(kruskal)
#define Group as a factor with 3 levels so R knows which score goes with which group
kruskal$Group <- factor(kruskal$Group)

#alternative data entry method
#define Group as containing three categories
Group <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3)
#create a vector defined as authscore and enter values
authscore <- c(96,128,83,61,101,82,121,132,135,109,115,149,166,147)
#create data frame kruskal matching each group to individual scores
kruskal <- data.frame(Group, authscore)
#use the following line to look at the structure of the data frame created
str(kruskal)

#run the basic Kruskal-Wallis test
kruskal.test(authscore ~ Group, data=kruskal)




#the following code is used to conduct a post-hoc comparison of the ranked means
#it is useful to first do a simple boxplot for a visual comparison
#Use this script to save the boxplot graphic to a .png data file
#save output in pdf file authplot
#send output to screen and file
authplot <- boxplot(authscore ~ Group, data=kruskal, main="Group Comparison", ylab="authscore")
#save .png file
#now return all output to console

#make sure that the package PMCMR is loaded before running the following script
library(PMCMR)
#use the 'with' function to pass the data from the kruskal data frame to the post hoc
#test script; specify the Tukey method for determining significance of each
#pair of comparisons
with(kruskal, {
posthoc.kruskal.nemenyi.test(authscore, Group, "Tukey")
})






#NOTE: if using a version of R < 3.xx then use the package pgirmess instead of PMCMR
#the following lines show the post hoc analysis using the pgirmess package
#note the function kruskalmc is used for the comparisons
library(pgirmess)
authscore <- c(96,128,83,61,101,82,121,132,135,109,115,149,166,147)
Group <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3)
kruskalmc(authscore ~ Group, probs=.05, cont=NULL)
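The script comments mention saving the boxplot to a file. One common way to do that in base R is to open a file device, draw the plot, and close the device again. The sketch below is an assumption about how that step might look, not the original code: the file name authplot.pdf is hypothetical, and pdf() is used because it is available in every R installation (png() can be substituted in the same position).

```r
# Data as entered in the script above
Group <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3)
authscore <- c(96,128,83,61,101,82,121,132,135,109,115,149,166,147)
kruskal <- data.frame(Group, authscore)

# Hypothetical output file in the session's temporary directory
out_file <- file.path(tempdir(), "authplot.pdf")

# Open the file device; plots are now written to the file, not the screen
pdf(out_file)
authplot <- boxplot(authscore ~ Group, data = kruskal,
                    main = "Group Comparison", ylab = "authscore")
# Close the file device so later plots go back to the screen
invisible(dev.off())

# boxplot() also returns the plotted statistics, e.g. the group sizes
print(authplot$n)
```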




Using R in Nonparametric Statistical Analysis, The Kruskal-Wallis Test Part Three: Post Hoc Pairwise Multiple Comparison Analysis of Ranked Means

Using the Kruskal-Wallis Test, Part Three:  Post Hoc Pairwise Multiple Comparison Analysis of Ranked Means

A tutorial by Douglas M. Wiig

In previous tutorials I discussed an example of entering data into a data frame and performing a nonparametric Kruskal-Wallis test to determine if there were differences in the authoritarianism scores of three different groups of educators. The test statistic indicated that at least one of the groups was significantly different from the others.

In order to explore the difference further it is common practice to do a post hoc analysis of the differences. There are a number of methods that have been devised to do these comparisons, but one of the most straightforward and easiest to understand is pairwise comparison of ranked means (or means if using standard ANOVA).

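Prior to entering the code for this section be sure that the PMCMR package is installed and loaded (for versions of R earlier than 3.0, the pgirmess package discussed later in this tutorial is used instead):

>library(PMCMR)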



In part one data was entered into the R editor to create a data frame. Data frames can also be created directly using R script. The script to create the data frame for this example uses the following code:

#create data frame from script input

>Group <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3)

>authscore <-c(96,128,83,61,101,82,121,132,135,109,115,149,166,147)

>kruskal <- data.frame(Group, authscore)

The group identifiers are entered and assigned to the variable Group, and the authoritarianism scores are assigned to the variable authscore. Notice that each identifier is matched with the appropriate authscore just as they were when entered in columns using the data editor. The vectors are then combined into the data frame kruskal. Once again the structure of the data frame can be checked using the command:

>str(kruskal)


resulting in:

'data.frame':   14 obs. of  2 variables:
 $ Group    : num  1 1 1 1 1 2 2 2 2 2 ...
 $ authscore: num  96 128 83 61 101 82 121 132 135 109 ...


It is often useful to do a visual examination of the ranked means prior to post hoc analysis. This can be easily accomplished using a boxplot to display the 3 groups that are presented in the example. If the data frame created in tutorial one is still in the global environment the boxplot can be generated with the following script:

>#boxplot using authscore and Group variables from the data frame created in part one

>boxplot(authscore ~ Group, data=kruskal, main="Group Comparison", ylab="authscore")


The resulting boxplot is seen below:


As can be seen in the plot, authoritarianism score differences are the greatest between groups 1 and 3, with group 2 in between. Use the following code to run the post hoc test and examine which of the mean ranks are significantly different:


with(kruskal, {

posthoc.kruskal.nemenyi.test(authscore, Group, "Tukey")

})


The post hoc test used in this example is from the recently released PMCMR R package. For details of this and other post hoc tests contained in the package see Thorsten Pohlert, Calculate Pairwise Multiple Comparisons of Mean Rank Sums, 2015. The test employed here uses the Tukey method to make pairwise comparisons of the mean rank authoritarianism scores of the three groups. The output from the script above is:

Pairwise comparisons using Tukey and Kramer (Nemenyi) test
with Tukey-Dist approximation for independent samples

data: authscore and Group

  1     2
2 0.493 -
3 0.031 0.310

P value adjustment method: none

The output above confirms what would be expected from observing the boxplot. The only mean ranks that differ significantly are those of groups 1 and 3, with p = .031.

The PMCMR package requires R version 3.0.x or later. If using an earlier version of R, another package can be used to accomplish the post hoc comparisons. This package is the pgirmess package. Using the vectors authscore and Group that were created earlier, the script for multiple comparison using the pgirmess package is:


library(pgirmess)

authscore <- c(96,128,83,61,101,82,121,132,135,109,115,149,166,147)

Group <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3)

kruskalmc(authscore ~ Group, probs=.05, cont=NULL)

and the output from this script using a significance level of p = .05 is:

Multiple comparison test after Kruskal-Wallis
p.value: 0.05

     obs.dif  critical.dif  difference
1-2  3.0      6.333875      FALSE
1-3  7.1      6.718089      TRUE
2-3  4.1      6.718089      FALSE


As noted earlier the comparison between groups one and three is shown to be the only significant difference at the p=.05 level.
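The observed and critical differences in the table above can be checked by hand in base R. The sketch below assumes the usual Siegel and Castellan style procedure: each pairwise difference in mean ranks is compared against z * sqrt(N(N+1)/12 * (1/n_i + 1/n_j)), with z taken at alpha/(k(k-1)). This assumption reproduces the values shown above, but it is a sketch of the calculation and not the pgirmess source code. Names such as mean_rank and crit_dif are illustrative.

```r
authscore <- c(96,128,83,61,101,82,121,132,135,109,115,149,166,147)
Group <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3)

# Rank all 14 scores together, then average the ranks within each group
ranks <- rank(authscore)
mean_rank <- tapply(ranks, Group, mean)
n <- tapply(ranks, Group, length)
N <- length(authscore)
k <- length(mean_rank)

# All pairs of groups: 1-2, 1-3, 2-3
pairs <- combn(names(mean_rank), 2)
obs_dif <- abs(mean_rank[pairs[1, ]] - mean_rank[pairs[2, ]])

# Critical difference at an overall alpha of .05
z <- qnorm(1 - 0.05 / (k * (k - 1)))
crit_dif <- z * sqrt(N * (N + 1) / 12 * (1 / n[pairs[1, ]] + 1 / n[pairs[2, ]]))

result <- data.frame(pair = paste(pairs[1, ], pairs[2, ], sep = "-"),
                     obs.dif = as.numeric(obs_dif),
                     critical.dif = as.numeric(crit_dif),
                     difference = as.numeric(obs_dif) > as.numeric(crit_dif))
print(result)
```

Only the 1-3 comparison exceeds its critical difference, which agrees with both packages.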

Both the PMCMR and the pgirmess packages are useful in producing post hoc comparisons with the Kruskal-Wallis test. It is hoped that this series of tutorials discussing nonparametric alternatives to common parametric statistical tests has helped demonstrate the utility of these approaches in statistical analysis.

In part four I will post the complete script used in all three tutorials.

Using R for Nonparametric Statistics: The Kruskal-Wallis Test, Part Two


A Tutorial by Douglas M. Wiig

Before we can run the Kruskal-Wallis test we need to tell R which column identifies the groups (the independent variable) and which contains the authoritarianism scores (the dependent variable). Once the grouping column is defined as a factor, R will match the correct score to each of the 14 observations.
As set up in the study, ‘Group’ is the factor (independent variable), and ‘authscore’ is the dependent variable. Use the command:

> kruskal$Group <- factor(kruskal$Group)

This converts the Group column to a factor with three levels, designating which observation belongs to each group. To make sure the data structure has been set up correctly use the command:

> str(kruskal)
'data.frame': 14 obs. of 2 variables:
$ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 2 2 2 2 2 ...
$ authscore: num 96 128 83 61 101 82 124 132 135 109 ...

The output of this command shows a summary of the structure of the data frame created. We can now run the Kruskal Wallis test with the command:

> kruskal.test(authscore ~ Group, data=kruskal)

The output will be:

Kruskal-Wallis rank sum test

data: authscore by Group
Kruskal-Wallis chi-squared = 6.4057, df = 2, p-value = 0.04065


As seen in the above output, the analysis of authoritarianism score by group indicates that the probability of obtaining differences this large among the three groups by chance alone is less than the .05 alpha level that was set for the study (p < .05). Further post hoc analysis is necessary to determine the exact nature of the differences among the scores of the three groups. This will be the topic of a future tutorial.
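To see where the test statistic comes from, the Kruskal-Wallis H can also be computed by hand from the rank sums. The sketch below uses the standard formula without a tie correction (there are no tied scores here), H = 12/(N(N+1)) * sum(R_j^2/n_j) - 3(N+1), and compares it with kruskal.test. The scores are those used in the scripts of this series, and the variable names are illustrative.

```r
authscore <- c(96,128,83,61,101,82,121,132,135,109,115,149,166,147)
Group <- factor(c(1,1,1,1,1,2,2,2,2,2,3,3,3,3))

# Rank all scores together, then sum the ranks within each group
r <- rank(authscore)
N <- length(authscore)
R <- tapply(r, Group, sum)      # rank sums per group
n <- tapply(r, Group, length)   # group sizes

# Kruskal-Wallis H (no tie correction needed) and its chi-squared p-value
H <- 12 / (N * (N + 1)) * sum(R^2 / n) - 3 * (N + 1)
p <- pchisq(H, df = nlevels(Group) - 1, lower.tail = FALSE)
print(c(H = H, p = p))

# The built-in test should give the same statistic
kt <- kruskal.test(authscore ~ Group)
print(kt$statistic)
```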

More to come:  Part Three will explore the use of multiple comparison techniques to analyze ranked means

Using R for Nonparametric Analysis: The Kruskal-Wallis Test, Part One


Using R for Nonparametric Data Analysis: The Kruskal-Wallis Test

A tutorial by Douglas M. Wiig

Analysis of variance (ANOVA) is a commonly used technique for examining the effect of an independent variable with three or more levels (groups) on a dependent variable. There are several types of ANOVA, ranging from simple one-way ANOVA to the more complex multivariate analysis of variance (MANOVA). ANOVA makes several assumptions about the sample data being used, such as the assumption of normal distribution of the variables in the parent population, underlying continuous distribution of the variables, and interval or ratio level measurement of all variables. If any of these assumptions cannot be met a researcher can turn to a nonparametric counterpart to ANOVA for the analysis. This tutorial will discuss the use of the Kruskal-Wallis test, the nonparametric counterpart to analysis of variance.

In this tutorial I will explore a simple example and discuss entering the sample data into a data file using the R data editor. I will then discuss setting up the data for analysis and using the Kruskal-Wallis test.

I am going to assume that the reader has a working knowledge of ANOVA with parametric data. Since ANOVA uses sample means and variances as the basis of the statistical test, interval or ratio level measurement is necessary to ensure valid results in addition to the assumptions indicated above. With the nonparametric Kruskal-Wallis test the only assumptions to be met are ordinal or better measurement and the assumption of an underlying continuous measurement. The example to be used here is taken from a book on nonparametric statistics by Sidney Siegel (Sidney Siegel, Nonparametric Statistics for the Behavioral Sciences, New York: McGraw-Hill, 1956, pp. 184-196).

A researcher wishes to test the hypothesis that school administrators are typically more authoritarian than classroom teachers. He also believes that many classroom teachers are administration-oriented in their professional aspirations which may, in turn, have an effect on their authoritarianism. Fourteen subjects are selected and divided into three groups: teaching-oriented teachers (classroom teachers who wish to remain in a teaching position), administration-oriented teachers (classroom teachers who aspire to become administrators), and practicing administrators (Siegel, p. 186). The level of authoritarianism of each subject is measured through a survey that assigns an authoritarianism score that is considered to be at least ordinal in nature. Higher scores indicate higher levels of authoritarianism (Siegel, p. 186). The null hypothesis is that there is no difference in mean authoritarianism scores among the three groups. The alternative hypothesis is that the mean authoritarianism scores among the three groups are different. The alpha level for rejecting the null hypothesis is p = .05 (Siegel, p. 186).

Since we make no assumption about a normal distribution of scores, have a small sample size of n = 14, and have an ordinal measure, we will use the nonparametric test, which is based on ranks rather than the means and variances used in parametric ANOVA. The mathematical details of how this is done are beyond the scope of this tutorial. See Siegel, pp. 187-189 for details. The authoritarianism scores for the three groups are shown below:

Authoritarianism Scores of Three Groups of Educators

Teaching-oriented    Admin-oriented     Administrators
teachers (n=5)       teachers (n=5)     (n=4)

96                   82                 115
128                  124                149
83                   132                166
61                   135                147
101                  109

(Siegel, p. 187)

The first task is to create an R data frame with the scores from the table. We will enter the scores using the R data editor. We will name the data frame ‘kruskal.’   Invoke the editor using the following commands:

  > kruskal <- data.frame()

   > kruskal <- edit(kruskal)

You should see the data entry editor open in a separate window. In order to process the data properly it needs to be entered into two columns. The first column will be the factors (which group the scores belong to), and the second column will contain the actual scores. Label column 1 ‘Group’ and column 2 ‘authscore.’ When the data are entered your editor should look like this:


     Group  authscore
1    1      96
2    1      128
3    1      83
4    1      61
5    1      101
6    2      82
7    2      121
8    2      132
9    2      135
10   2      109
11   3      115
12   3      149
13   3      166
14   3      147


Make sure that each column of numbers is of the data type “Real.” Close the data editor by clicking ‘Quit’ and the data will be saved in the workspace for access. To see what has been entered in the data editor use the command:

> kruskal

   Group authscore
1      1        96
2      1       128
3      1        83
4      1        61
5      1       101
6      2        82
7      2       121
8      2       132
9      2       135
10     2       109
11     3       115
12     3       149
13     3       166
14     3       147


You should see the output as above. If you need to make changes simply invoke the editor with:

> kruskal <-edit(kruskal)

The editor will open and you can make any changes you need to. Be sure to click on ‘Quit’ to save the changes to the workspace.
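When the interactive editor is not convenient (for example in a batch script), the same two-column layout can be typed as text and read with read.table. This is a minimal sketch of that alternative and not part of the original procedure; the values are the ones shown in the editor listing above.

```r
# Enter the two columns as plain text; the first row holds the column labels
kruskal <- read.table(header = TRUE, text = "
Group authscore
1  96
1 128
1  83
1  61
1 101
2  82
2 121
2 132
2 135
2 109
3 115
3 149
3 166
3 147
")

# Check the result: 14 rows, and 5, 5 and 4 scores in the three groups
print(kruskal)
print(table(kruskal$Group))
```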

Part Two will continue the analysis.

R Tutorial: A Simple Script to Create and Analyze a Data File, Part Two

A simple R script to create and analyze a data file:part two:    A tutorial by D.M. Wiig

In part one I discussed creating a simple data file containing the height and weight of 10 subjects. In part two I will discuss the script needed to create a simple scatter diagram of the data and perform a basic Pearson correlation. Before attempting to continue the script in this tutorial make sure that you have created and saved the data file as discussed in part one.

To conduct a correlation/regression analysis of the data we want to first view a simple scatter plot. Load a library named ‘car’ into R memory. Use the command:

> library(car)

Then issue the following command to plot the graph:

> plot(Height~Weight, log="xy", data=Sampledatafile)

The output is seen below:


We can calculate a Pearson’s Product Moment correlation coefficient by using the command:

> # Pearson product-moment correlation between height and weight

> cor(Sampledatafile[,c("Height","Weight")], use="complete.obs", method="pearson")

Which results in:

          Height    Weight
Height 1.0000000 0.8813799
Weight 0.8813799 1.0000000

To run a simple linear regression for Height and Weight use the following code. Note that the dependent variable (Weight) is listed first:

> model <-lm(Weight~Height, data=Sampledatafile)

> summary(model)


Call:
lm(formula = Weight ~ Height, data = Sampledatafile)

Residuals:
     Min       1Q   Median       3Q      Max
-30.6800 -16.9749  -0.8774  19.9982  25.3200

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -337.986     98.403  -3.435 0.008893 **
Height         7.518      1.425   5.277 0.000749 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 21.93 on 8 degrees of freedom
Multiple R-squared: 0.7768, Adjusted R-squared: 0.7489
F-statistic: 27.85 on 1 and 8 DF, p-value: 0.0007489
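The correlation and regression results can be reproduced without the data editor. The sketch below rebuilds the ten height and weight pairs from part one directly in a data frame (the name hw is illustrative) and shows that, for a simple regression, the squared correlation equals the Multiple R-squared reported by summary().

```r
# Height (inches) and weight (lbs) of the 10 subjects from part one
hw <- data.frame(Height = c(72, 60, 65, 75, 66, 65, 70, 71, 68, 77),
                 Weight = c(225, 128, 176, 215, 145, 120, 210, 176, 155, 250))

# Pearson correlation between the two variables
r <- cor(hw$Height, hw$Weight, method = "pearson")
print(r)

# Simple linear regression with Weight as the dependent variable
model <- lm(Weight ~ Height, data = hw)
print(coef(model))

# With a single predictor, r squared equals the model's R-squared
print(c(r_squared = r^2, model_r_squared = summary(model)$r.squared))
```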


To plot a regression line on the scatter diagram use the following command line. Note that we enter the y (dependent) variable first and then the x (independent) variable:

> scatterplot(Weight~Height, log="xy", reg.line=lm, smooth=FALSE, spread=FALSE,

+ data=Sampledatafile)


This will produce a graph as seen below. Note that box plots have also been included in the output:


This tutorial has hopefully demonstrated that complex tasks can be accomplished with relatively simple command line script. I will explore more of these simple scripts in future tutorials.

More to Come:


R Tutorial: A Script to Create and Analyze a Simple Data File, Part One

R Tutorial: A Simple Script to Create and Analyze a Data File, Part One

By D.M. Wiig

In this tutorial I will walk you through a simple script that will show you how to create a data file and perform some simple statistical procedures on the file. I will break the code into segments and discuss what each segment does. Before starting this tutorial make sure you have a terminal window open and open R from the command line.

The first task is to create a simple data file. Let’s assume that we have some data from 10 individuals measuring each person’s height and weight. The data is shown below:

Height(inches) Weight(lbs)

72               225

60               128

65               176

75               215

66               145

65               120

70               210

71               176

68               155

77               250

We can enter the data into a data matrix by invoking the data editor and entering the values. Please note that the lines of code preceded by a # are comments and are ignored by R:

#Create a new file and invoke the data editor to enter data

#Create the file Sampledatafile, height and weight of 10 subjects

Sampledatafile <-data.frame()

Sampledatafile <-edit(Sampledatafile)

You will see a window open that is the R Data Editor. Click on the column heading ‘var1’ and you will see several different data types in the drop down menu. Choose the ‘real’ data type. Follow the same procedure to set the data type for the second column. Enter the data pairs in the columns, with height in the first column and weight in the second column. When the data have been entered click on the var1 heading for column 1 and click ‘Change Name.’ Enter ‘Height’ to label the first column. Follow the same steps to rename the second column ‘Weight.’

Once both columns of data have been entered you can click ‘Quit.’ The datafile ‘Sampledatafile’ is now loaded into memory.
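If you prefer to skip the interactive editor, the same data file can be created entirely from script. This is a minimal sketch of that alternative; it uses the values from the table above and the same object and column names as the tutorial.

```r
# Build the data frame directly from two vectors instead of using edit()
Sampledatafile <- data.frame(
  Height = c(72, 60, 65, 75, 66, 65, 70, 71, 68, 77),          # inches
  Weight = c(225, 128, 176, 215, 145, 120, 210, 176, 155, 250) # lbs
)

# Confirm that 10 rows and 2 numeric columns were created
str(Sampledatafile)
```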

To run some basic descriptive statistics use the following code:

> #Run descriptives on the data

> summary(Sampledatafile)

The output from this code will be:

     Height          Weight
 Min.   :60.00   Min.   :120.0
 1st Qu.:65.25   1st Qu.:147.5
 Median :69.00   Median :176.0
 Mean   :68.90   Mean   :180.0
 3rd Qu.:71.75   3rd Qu.:213.8
 Max.   :77.00   Max.   :250.0


To view the data file use the following lines of code:

>#print the datafile ‘Sampledatafile’ on the screen

> print(Sampledatafile)

You will see the output:

   Height Weight
1      72    225
2      60    128
3      65    176
4      75    215
5      66    145
6      65    120
7      70    210
8      71    176
9      68    155
10     77    250

In Part Two I will discuss an R script to do a simple correlation and scatter diagram.  Check back later!

Nonparametric Statistical Analysis Using R: The Sign Test

Using R in Nonparametric Statistical Analysis: The Binomial Sign Test

A tutorial by D.M. Wiig

One of the core competencies that students master in introductory social science statistics is to create a null and alternative hypothesis pair relative to a research question and to use a statistical test to evaluate and make a decision about rejecting or retaining the null hypothesis.  I have found that one of the easiest statistical tests to use when teaching these concepts is the sign test.  This is a very easy test to use and students seem to intuitively grasp the concepts of trials and binomial outcomes as these are easily related to the common and familiar event of ‘flipping a coin.’


While it is possible to use the sign test by looking up probabilities of outcomes in a table of the binomial distribution, I have found that using R to perform the analysis is a good way to get students involved in using statistics software to solve the problem. R has an easy to use sign test routine that is called with the binom.test command. To illustrate the use of the test consider an experiment where the researcher has randomly assigned 10 individuals to a group and observes them in both a control and experimental condition. The researcher measures the criterion variable of interest in each condition for each subject and measures the effect on each subject’s behavior using a relative scale of effect.


The researcher at this point is only interested in whether or not the criterion variable has an effect on behavior, so a non-directional hypothesis is used.  The data collected is shown in the following table:


Subject   1    2    3    4    5    6    7    8    9    10

Pre       50   49   37   16   80   42   40   58   31   21
Post      56   50   30   25   90   44   60   71   32   22

Sign      +    +    -    +    +    +    +    +    +    -

The general format for the sign test is as follows:


binom.test(x, n, p = 0.5, alternative = c("two.sided", "less", "greater"), conf.level = 0.95)


where: x = number of successes

n = number of trials

p = hypothesized probability of success

alternative = indicates the alternative hypothesis as nondirectional ("two.sided") or directional ("less" or "greater")

conf.level = the confidence level for the returned confidence interval.


In the example as described above we have 8 pluses and 2 minuses. We will use the “two.sided” option for the alternative hypothesis, a probability of success of .50, and a conf.level of .95. The following is entered into R:

> binom.test(8, 10, p=.5, alternative="two.sided", conf.level=.95)

Exact binomial test

data:  8 and 10
number of successes = 8, number of trials = 10,
p-value = 0.1094
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
0.4439045 0.9747893
sample estimates:
probability of success
0.8

Under a nondirectional alternative hypothesis we are testing the probability of obtaining 0, 1, 2, 8, 9, 10 pluses or:


p(0, 1, 2, 8, 9, or 10 pluses) = .1094


If we had set an alpha level of α = .05 then we would retain the null hypothesis, as p(obt) > .05. We could not conclude that the experimental criterion has an effect on behavior. R has many other nonparametric statistical tests that are easy to use from the command line. These are topics for future tutorials.
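The two-sided p-value can be verified directly from the binomial distribution. The sketch below sums the probabilities of the outcomes that are at least as extreme as 8 pluses in 10 trials (0, 1, 2, 8, 9 and 10) and compares the total with the value returned by binom.test. The variable names are illustrative.

```r
n <- 10
extreme <- c(0, 1, 2, 8, 9, 10)   # outcomes at least as extreme as 8 pluses

# Sum of binomial probabilities for the extreme outcomes under p = .5
p_manual <- sum(dbinom(extreme, size = n, prob = 0.5))
print(p_manual)

# p-value reported by the exact binomial test
p_test <- binom.test(8, n, p = 0.5, alternative = "two.sided")$p.value
print(p_test)
```

Both values equal 112/1024, which rounds to the .1094 shown in the output above.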


More to Follow:

Using R for Basic Cross Tabulation Analysis: Part Three, Using the xtabs Function

Using R to Work with GSS Survey Data Part Three: Using xtabs to Create and Analyze Tables

A tutorial by D. M. Wiig
In Part Two of this series of tutorials I discussed how to find and import a data set from the NORC GSS survey. The focus of that tutorial was on the GSS2010 data set that was imported into the R workspace in SPSS format and then loaded into an R data frame for analysis.

Use the following code to load the data set into an R workspace:

>install.packages("Hmisc") #needed for file import
>install.packages("foreign") #needed for file import
>library(Hmisc) #spss.get is provided by Hmisc
>#get spss gss file and put into data frame
>gssdataframe <- spss.get("/path-to-your-file/GSS2010.sav", use.value.labels=TRUE)

The xtabs function provides a quick way to generate and view a cross tabulation of two variables and allows the user to specify one or more control variables in the cross tabulation. Using the variables “partyid” and “polviews” the cross tabulation is generated with:

>#use xtabs to produce a table
>gsstab <- xtabs(~ partyid + polviews, data=gssdataframe)

To view the resulting table use:

>gsstab #show table

To view summary statistics generated use:

>summary(gsstab) #show summary statistics


This summary shows the number of cases in the table, the number of factors and the Chi-square value for the table.

Variables used in social science research are often interrelated so it is desirable to control for one or more variables in order to further examine the variables of interest. The table stored in gsstab shows the relationship between political ideology and political party affiliation. To look at the relationship by gender use the following:

>#use xtabs to produce a table with a control variable
>gsstab2 <- xtabs(~ partyid + polviews+ sex, data=gssdataframe)

To view the new table use:

>gsstab2 #show table


To view summary statistics for the table enter:

>summary(gsstab2)


As noted above xtabs is a quick and powerful function to create N x N tables with or without control variables. In the next tutorial I will explore the use of the ca function to produce a basic correspondence analysis of underlying dimensions in an N x N table.
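To experiment with xtabs without the GSS file, the sketch below builds a small random data set with the same three variable names. The data frame toy, the category labels, the sample size and the seed are all made up for illustration and say nothing about the real GSS distributions.

```r
set.seed(1)   # arbitrary seed so the made-up sample is reproducible
n <- 200

party_levels <- c("Democrat", "Independent", "Republican")
view_levels  <- c("Liberal", "Moderate", "Conservative")
sex_levels   <- c("Male", "Female")

# Made-up respondents; factor levels are fixed so table dimensions are stable
toy <- data.frame(
  partyid  = factor(sample(party_levels, n, replace = TRUE), levels = party_levels),
  polviews = factor(sample(view_levels, n, replace = TRUE), levels = view_levels),
  sex      = factor(sample(sex_levels, n, replace = TRUE), levels = sex_levels)
)

# Two-way table, then the same table controlling for sex
tab2 <- xtabs(~ partyid + polviews, data = toy)
tab3 <- xtabs(~ partyid + polviews + sex, data = toy)

print(tab2)
print(summary(tab2))   # number of cases, number of factors, chi-square test
print(ftable(tab3))    # flattened view of the three-way table
```

Summing the three-way table over sex gives back the two-way table, which is a useful check that the control variable has only split the original cells.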