A tutorial by Douglas M. Wiig
Please note that this post is an embedded Word document. To read the document full screen click on the icon in the lower right portion of the document window.
A tutorial by D.M. Wiig
This tutorial is posted as an embedded Word document. To view the document full screen click on the button in the lower right corner of the window. Please note that you must be online for the full page Word document display to work.
An R tutorial by D. M. Wiig
This tutorial is posted as an embedded Word document. To view the document full screen click on the icon in the lower-right corner of the document window.
My next post, covering installing and using the R Commander GUI, will be out in a day or two.
R-Fiddle is a great tool to develop and test code segments or complete R programs. By accessing the R-Fiddle web site users have a fully functioning R console, code editor and discussion board all in one place. If a user has code uploaded that has been designated to share, other users can access the code and make suggestions or additions. Code can be run with full R support from your web browser.
Try the link below to test out R-Fiddle. I have uploaded a small program as a demo. Feel free to share your own projects, help others or try out code segments.
http://www.r-fiddle.org/#/embed?id=rtOt8yR3
Click on the link above to activate the R editor and R console.
Tutorial: Using R to Analyze NORC GSS Social Science Data, Part Six, R and ANOVA
A tutorial by Douglas M. Wiig
As discussed in previous segments of this tutorial, for anyone interested in researching social science questions there is a wealth of survey data available through the National Opinion Research Center (NORC) and its associated research universities. The Center has been conducting a national survey each year since 1972 and has compiled a massive database of data from these surveys. Most if not all of these data files can be accessed and downloaded without charge. I have been working with the 2014 edition of the data and for all parts of this tutorial will use the GSS2014 data file that is available for download on the Center’s web site. (See the NORC main website at http://www.norc.org/Research/Projects/Pages/general-social-survey.aspx and at http://www3.norc.org/GSS+Website ).
Accessing and loading the NORC GSS2014 data set was discussed in part one of this tutorial. Refer to it if you need specific information on downloading the data set in STATA or SPSS format. In this segment we will use the subset function to select a desired set of cases from all of the cases in the data file that meet certain criteria. As indicated in my previous tutorial the GSS2014 data set contains a total of 2538 cases and 866 variables. Before starting this segment of the tutorial be sure that the foreign package is installed and loaded into your R session. As I have indicated in previous tutorials, use of an IDE such as RStudio greatly facilitates entering and debugging R code when doing research such as is discussed in my tutorials. Import the GSS 2014 data file in SPSS format and load it into the data frame ‘gss14’ using:
########################################################
#import GSS2014 file in SPSS .sav format
#uses foreign package
########################################################
require(foreign)
gss14 <- read.spss("/path to your location/GSS2014.sav", use.value.labels=TRUE, max.value.labels=Inf, to.data.frame=TRUE)
########################################################
In this tutorial the analysis of this sample will focus on examining the hypothesis “An individual’s outlook on life is influenced by the amount of education the person has attained.” The GSS variables used are ‘educ’, education in number of years, and ‘life’, whether the respondent rated life DULL, ROUTINE, or EXCITING. A simple approach to testing this hypothesis is to compare mean levels of education for each of the three categories of response. I will do this analysis in two stages. In the first stage I will use techniques discussed in a previous tutorial to select a subset of each response category and display the mean education level for each of the three categories.
The three subsets life1, life2, and life3 are generated using the following code:
###################################################
# create 3 subsets from gss14 view on life by years of education
##################################################
life1 <- subset(gss14, life == "DULL", select=educ)
life2 <- subset(gss14, life == "ROUTINE", select=educ)
life3 <- subset(gss14, life == "EXCITING", select=educ)
###################################################
#The three means of the subsets are displayed using the code:
# run summary statistics for each subgroup
##################################################
summary(life1)
summary(life2)
summary(life3)
###################################################
resulting in the following output:
summary(life1)
educ
Min. : 0.00
1st Qu.:10.00
Median :12.00
Mean :11.78
3rd Qu.:13.00
Max. :20.00
summary(life2)
educ
Min. : 0.00
1st Qu.:12.00
Median :13.00
Mean :13.22
3rd Qu.:16.00
Max. :20.00
NA’s :1
summary(life3)
educ
Min. : 2.00
1st Qu.:12.00
Median :14.00
Mean :14.31
3rd Qu.:16.00
Max. :20.00
As is seen above there does appear to be a difference among mean years of education and the corresponding outlook on life. In order to examine whether or not these observed differences are due to chance a simple one-way Analysis of Variance can be generated.
In the next tutorial I will discuss performing the ANOVA and using pair-wise comparisons to determine which if any means are different.
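As a rough preview of that analysis, here is a minimal sketch using base R only. The data frame ‘toy’ and its values are hypothetical stand-ins made up for illustration, not GSS data, so the numbers it prints say nothing about the real survey. It computes the three group means in one step and then runs a one-way ANOVA with Tukey pairwise comparisons.

```r
# Hypothetical toy data standing in for the GSS 'life' and 'educ' variables.
toy <- data.frame(
  life = factor(rep(c("DULL", "ROUTINE", "EXCITING"), each = 5),
                levels = c("DULL", "ROUTINE", "EXCITING")),
  educ = c(10, 12, 11,  9, 12,
           12, 13, 14, 12, 13,
           14, 16, 15, 13, 16)
)

# Mean years of education within each response category.
group_means <- tapply(toy$educ, toy$life, mean)
print(group_means)

# One-way ANOVA: does mean education differ across the three categories?
fit <- aov(educ ~ life, data = toy)
anova_table <- summary(fit)[[1]]
print(anova_table)

# Tukey HSD pairwise comparisons of the category means.
print(TukeyHSD(fit))
```

With the real data the same calls would presumably take the form aov(educ ~ life, data = gss14), but that is an assumption about how the next tutorial proceeds.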
R Tutorial Part Three: Selecting Subsets and Comparing Means Using an Independent Sample t Test
A tutorial by Douglas M. Wiig
As discussed in previous segments of this tutorial, for anyone interested in researching social science questions there is a wealth of survey data available through the National Opinion Research Center (NORC) and its associated research universities. The Center has been conducting a national survey each year since 1972 and has compiled a massive database of data from these surveys. Most if not all of these data files can be accessed and downloaded without charge. I have been working with the 2014 edition of the data and for all parts of this tutorial will use the GSS2014 data file that is available for download on the Center’s web site. (See the NORC main website at http://www.norc.org/Research/Projects/Pages/general-social-survey.aspx and at http://www3.norc.org/GSS+Website ).
Accessing and loading the NORC GSS2014 data set was discussed in part one of this tutorial. Refer to it if you need specific information on downloading the data set in STATA or SPSS format. In this segment we will use the subset function to select a desired set of cases from all of the cases in the data file that meet certain criteria. As indicated in my previous tutorial the GSS2014 data set contains a total of 2538 cases and 866 variables.
Before starting this segment of the tutorial be sure that the foreign package is installed and loaded into your R session. Import the GSS 2014 data file and load it into the data frame ‘Dataset’ using:
########################################################
#import GSS2014 file in SPSS .sav format
#uses foreign package
########################################################
require(foreign)
Dataset <- read.spss("/path to your location/GSS2014.sav", use.value.labels=TRUE, max.value.labels=Inf, to.data.frame=TRUE)
########################################################
In the previous segment of this tutorial we started to investigate whether or not an individual’s education had an effect on their response to a NORC survey item dealing with abortion. The item asked respondents to answer ‘YES’ or ‘NO’ to the statement ‘A woman should be allowed to obtain an abortion under any circumstances.’ We selected a subset of all of the respondents who answered ‘YES’ and a second subset of all the respondents who answered ‘NO’ using the following code:
######################################################
#select subset from Dataset and write to data frame SS1
######################################################
SS1 <- subset(Dataset, abany == "YES", select=educ)
View(SS1)
######################################################
#select subset from Dataset and write to data frame SS2
######################################################
SS2 <- subset(Dataset, abany == "NO", select=educ)
View(SS2)
######################################################
A mean number of years of education can be calculated for each of the subsets using the following:
#calculate descriptive statistics for SS1 and SS2
####################################################
summary(SS1)
summary(SS2)
####################################################
Output from the above for SS1 is:
In this tutorial I will use the Student’s t test function t.test that is found in the stats package. The function is used in the following form:
t.test(x, y, alternative = c("two.sided", "less", "greater"), mu = 0, paired = FALSE, var.equal = FALSE, conf.level = .95)
where x and y = numeric vectors of data values
alternative = specification of a one-tailed or two-tailed test
mu = 0 specification that true difference between means is zero
paired = FALSE specification of a two independent sample test; if TRUE a paired samples test will be used
var.equal = specification of equal variances of the two samples; if TRUE the pooled variance is used, otherwise the Welch approximation of degrees of freedom is used
conf.level = confidence level of the interval
For further information see the documentation in CRAN help files for the function t.test().
Using the subsets SS1 and SS2 selected from the dataset, the t test is performed using:
###########################################################
#perform a t test to compare sample means
#########################################################
t.test(SS1, SS2, alternative = c("two.sided"), mu=0, paired=FALSE, var.equal = TRUE, conf.level = .95)
###########################################################
Resulting in output of:
Two Sample t-test
data: SS1 and SS2
t = 11.1356, df = 1650, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
1.369673 1.955333
sample estimates:
mean of x mean of y
14.59517 12.93267
We can see that the difference between the mean years of education for the ‘YES’ and the ‘NO’ samples is significant at the .05 alpha level. Subsets can also be used to compare means involving more than two samples using simple one-way Analysis of Variance. This will be covered in the next part of the tutorial.
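For readers who want to try the t.test arguments without the GSS file, the following is a minimal sketch on two made-up vectors. The names educ_yes and educ_no and all of their values are hypothetical. It runs the pooled-variance test used above alongside the Welch test, which is what t.test does when var.equal is left at its default of FALSE.

```r
# Hypothetical years-of-education vectors for two response groups.
educ_yes <- c(16, 14, 18, 12, 16, 15, 17, 13)
educ_no  <- c(12, 11, 14, 12, 10, 13, 12, 16)

# Pooled-variance (Student) test, matching var.equal = TRUE in the tutorial.
pooled <- t.test(educ_yes, educ_no, alternative = "two.sided",
                 mu = 0, paired = FALSE, var.equal = TRUE, conf.level = 0.95)

# Welch test: var.equal = FALSE, so equal variances are not assumed.
welch <- t.test(educ_yes, educ_no, var.equal = FALSE)

print(pooled)
print(welch)

# The pooled test uses n1 + n2 - 2 degrees of freedom;
# the Welch approximation never exceeds that value.
pooled_df <- unname(pooled$parameter)
welch_df  <- unname(welch$parameter)
```

The two tests report the same sample means and differ only in the standard error and degrees of freedom, so comparing the two printouts is a quick way to see how much the equal-variance assumption matters for a given sample.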
A tutorial by Douglas M. Wiig
Part one of the tutorial centered on importing NORC GSS data in STATA or SPSS formats in an R data frame. For illustration I used the GSS2014 survey data set that consists of 2538 cases and 866 variables. If a researcher wishes to generate some simple cross tabulations the R CrossTable function is very useful.
The CrossTable function is part of the gmodels package, so before running scripts in this tutorial make sure you have installed and loaded gmodels from your favorite CRAN mirror site. As discussed in part one of the tutorial load the GSS2014 dataset into the global environment using:
>require(foreign)
>Dataset <- read.spss("E:/research/Documents/GSS2014.sav",
use.value.labels=TRUE, max.value.labels=Inf, to.data.frame=TRUE)
The CrossTable function allows a basic cross tabulation to be performed and includes a large number of options that can be incorporated into the table. The basic structure is as follows:
Usage
CrossTable(x, y, digits=3, max.width = 5, expected=FALSE, prop.r=TRUE, prop.c=TRUE,
prop.t=TRUE, prop.chisq=TRUE, chisq = FALSE, fisher=FALSE, mcnemar=FALSE,
resid=FALSE, sresid=FALSE, asresid=FALSE,
missing.include=FALSE,
format=c("SAS","SPSS"), dnn = NULL, ...)
Arguments
x A vector or a matrix. If y is specified, x must be a vector
y A vector in a matrix or a dataframe
digits Number of digits after the decimal point for cell proportions
max.width In the case of a 1 x n table, the default will be to print the output horizontally.
If the number of columns exceeds max.width, the table will be wrapped for
each successive increment of max.width columns. If you want a single column
vertical table, set max.width to 1
expected If TRUE, chisq will be set to TRUE and expected cell counts from the chi-square test will be included
prop.r If TRUE, row proportions will be included
prop.c If TRUE, column proportions will be included
prop.t If TRUE, table proportions will be included
prop.chisq If TRUE, chi-square contribution of each cell will be included
chisq If TRUE, the results of a chi-square test will be included
fisher If TRUE, the results of a Fisher Exact test will be included
mcnemar If TRUE, the results of a McNemar test will be included
resid If TRUE, residual (Pearson) will be included
sresid If TRUE, standardized residual will be included
asresid If TRUE, adjusted standardized residual will be included
missing.include If TRUE, then remove any unused factor levels
format Either SAS (default) or SPSS, depending on the type of output desired.
dnn the names to be given to the dimensions in the result (the dimnames names).
… optional arguments
(Gregory Warnes, maintainer, Package ‘Gmodels’ February, 2015. http://cran.r-project.org/src/contrib/PACKAGES.html)
In this tutorial I will create a table to examine the relationship between income and education using the variables ‘degree’ and ‘incom16’ from the GSS dataset. Both are categorical factors. To simplify the resulting table only actual frequencies will be reported and the ‘chisq’ option will be used to generate a chi-squared test. The format used will be set to SPSS. Use the following statement:
>#Generate a cross table of frequencies with chisq reported
>CrossTable(Dataset$incom16, Dataset$degree, chisq=TRUE, format="SPSS", prop.r=FALSE, prop.c=FALSE, prop.t=FALSE, prop.chisq=FALSE)
In the above code, the row variable is income, and the appropriate column of the dataset is selected with the ‘Dataset$incom16’ expression. The column variable for the table is education, and the appropriate column of the dataset is selected with the ‘Dataset$degree’ expression. The various cell proportions must be set to FALSE because they default to TRUE.
When you run the above script the table will be generated in SPSS format on the screen. I will not reproduce the table here because of formatting problems of fitting the table into the blog format.
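To get a feel for what a frequency table with a chi-squared test reports, here is a small sketch that uses only base R (table and chisq.test) rather than CrossTable itself, so it runs without gmodels installed. The 70 respondents and their ‘income’ and ‘degree’ categories are invented for illustration and are not GSS values.

```r
# Hypothetical data: 70 made-up respondents, not GSS values.
income <- factor(rep(c("LOW", "HIGH"), times = c(35, 35)),
                 levels = c("LOW", "HIGH"))
degree <- factor(c(rep(c("HS", "BA", "GRAD"), times = c(20, 10, 5)),
                   rep(c("HS", "BA", "GRAD"), times = c(10, 15, 10))),
                 levels = c("HS", "BA", "GRAD"))

# Cross tabulation of observed frequencies: rows are income, columns are degree.
counts <- table(income, degree)
print(counts)

# Pearson chi-squared test of independence on the table.
chi <- chisq.test(counts)
print(chi)

# Expected cell counts under independence.
print(chi$expected)
```

CrossTable with chisq=TRUE reports the same kind of test beneath its formatted table, so this base R version is mainly useful as a quick cross-check.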
In part three of this tutorial I will discuss generating subsets of the GSS data file and using subsets for statistical analyses such as t tests and ANOVA.
Using the Kruskal-Wallis Test, Part Three: Post Hoc Pairwise Multiple Comparison Analysis of Ranked Means
A tutorial by Douglas M. Wiig
In previous tutorials I discussed an example of entering data into a data frame and performing a nonparametric Kruskal-Wallis test to determine if there were differences in the authoritarianism scores of three different groups of educators. The test statistic indicated that at least one of the groups differed significantly from the others.
In order to explore the difference further it is common practice to do post hoc analysis of the differences. There are a number of methods that have been devised to do these comparisons, but one of the most straightforward and easiest to understand is pairwise comparison of ranked means (or means if using standard ANOVA).
Prior to entering the code for this section be sure that the following packages are installed and loaded:
PMCMR
pgirmess
In part one data was entered into the R editor to create a data frame. Data frames can also be created directly using R script. The script to create the data frame for this example uses the following code:
#create data frame from script input
>Group <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3)
>authscore <- c(96,128,83,61,101,82,124,132,135,109,115,149,166,147)
>kruskal <- data.frame(Group, authscore)
The group identifiers are entered and assigned to the variable Group, and the authoritarianism scores are assigned to the variable authscore. Notice that each identifier is matched with an appropriate authscore just as they were when entered in columns using the data editor. The two vectors are then combined with data.frame and assigned to the variable kruskal. Once again the structure of the data frame can be checked using the command:
>str(kruskal)
resulting in:
'data.frame': 14 obs. of 2 variables:
$ Group : num 1 1 1 1 1 2 2 2 2 2 ...
$ authscore: num 96 128 83 61 101 82 124 132 135 109 ...
It is often useful to do a visual examination of the ranked means prior to post hoc analysis. This can be easily accomplished using a boxplot to display the 3 groups that are presented in the example. If the data frame created in tutorial one is still in the global environment the boxplot can be generated with the following script:
>#boxplot using authscore and Group variables from the data frame created in part one
>boxplot(authscore ~ Group, data=kruskal, main="Group Comparison", ylab="authscore")
The resulting boxplot is seen below:
As can be seen in the plot, authoritarianism score differences are greatest between groups 1 and 3, with group 2 in between. Use the following code to run the post hoc test and examine which of the mean ranks are significantly different:
library(PMCMR)
with(kruskal, {
posthoc.kruskal.nemenyi.test(authscore, Group, "Tukey")
})
The post hoc test used in this example is from the recently released PMCMR R package. For details of this and other post hoc tests contained in the package see Thorsten Pohlert, Calculate Pairwise Multiple Comparisons of Mean Rank Sums, 2015 (http://cran.r-project.org/web/packages/PMCMR/PMCMR.pdf). The test employed here used the Tukey method to make pairwise comparisons of the mean rank authoritarianism scores of the three groups. The output from the script above is:
Pairwise comparisons using Tukey and Kramer (Nemenyi) test
with Tukey-Dist approximation for independent samples
data: authscore and Group
  1     2
2 0.493 -
3 0.031 0.310
P value adjustment method: none
The output above confirms what would be expected from observing the boxplot. The only groups whose mean ranks differ significantly are groups 1 and 3, with p = .031.
The PMCMR package requires R version 3.0.x or later. If using an earlier version of R another package can be used to accomplish the post hoc comparisons. This package is the pgirmess package (see http://cran.r-project.org/web/packages/pgirmess/pgirmess.pdf for complete details). Using the vectors authscore and Group that were created earlier the script for multiple comparison using the pgirmess package is:
library(pgirmess)
authscore <- c(96,128,83,61,101,82,124,132,135,109,115,149,166,147)
Group <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3)
kruskalmc(authscore ~ Group, probs=.05, cont=NULL)
and the output from this script using a significance level of p = .05 is:
Multiple comparison test after Kruskal-Wallis
p.value: 0.05
Comparisons
    obs.dif critical.dif difference
1-2     3.0     6.333875      FALSE
1-3     7.1     6.718089       TRUE
2-3     4.1     6.718089      FALSE
As noted earlier the comparison between groups one and three is shown to be the only significant difference at the p=.05 level.
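The columns in the pgirmess output can be checked by hand with base R. The sketch below is my own reconstruction, not code taken from the package: it assumes the critical difference is the normal quantile for alpha/(k(k-1)) multiplied by the standard error of a difference in mean ranks, which is the procedure usually attributed to Siegel and Castellan. Under that assumption it matches the obs.dif and critical.dif values printed above.

```r
authscore <- c(96, 128, 83, 61, 101, 82, 124, 132, 135, 109, 115, 149, 166, 147)
Group     <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3)

# Rank all 14 scores together, then average the ranks within each group.
mean_rank <- tapply(rank(authscore), Group, mean)
n <- tapply(authscore, Group, length)   # group sizes
N <- length(authscore)                  # total sample size
k <- length(n)                          # number of groups

# All pairs of groups: 1-2, 1-3, 2-3.
pairs <- combn(names(mean_rank), 2)
g1 <- pairs[1, ]
g2 <- pairs[2, ]

# Observed absolute differences between mean ranks.
obs_dif <- abs(as.numeric(mean_rank[g1]) - as.numeric(mean_rank[g2]))

# Assumed critical difference: z * sqrt(N(N+1)/12 * (1/n_i + 1/n_j)).
z <- qnorm(1 - 0.05 / (k * (k - 1)))
critical_dif <- z * sqrt(N * (N + 1) / 12 *
                         (1 / as.numeric(n[g1]) + 1 / as.numeric(n[g2])))

result <- data.frame(comparison = paste(g1, g2, sep = "-"),
                     obs_dif = obs_dif,
                     critical_dif = critical_dif,
                     significant = obs_dif > critical_dif)
print(result)
```

Working through the table this way makes it clear that the comparison is made on mean ranks, not on the raw authoritarianism scores.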
Both the PMCMR and the pgirmess packages are useful in producing post hoc comparisons with the Kruskal-Wallis test. It is hoped that the series of tutorials discussing nonparametric alternatives to common parametric statistical tests has helped demonstrate the utility of these approaches in statistical analysis.
In part four I will post the complete script used in all three tutorials.
Using R for Nonparametric Statistics: The Kruskal-Wallis Test, Part Two
A Tutorial by Douglas M. Wiig
Before we can run the Kruskal-Wallis test we need to define which column contains the factors (independent variables) and which contains the authoritarianism scores (dependent variable). Once we define the factor column R will match the correct score to each of the 14 observations.
As set up in the study, ‘Group’ is the factor (independent variable), and ‘authscore’ is the dependent variable. Use the command:
> kruskal$Group <- factor(kruskal$Group)
This converts the Group column of the data frame into a factor that designates which group each observation belongs to. To make sure the data structure has been set up correctly use the command:
> str(kruskal)
'data.frame': 14 obs. of 2 variables:
$ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 2 2 2 2 2 ...
$ authscore: num 96 128 83 61 101 82 124 132 135 109 ...
>
The output of this command shows a summary of the structure of the data frame created. We can now run the Kruskal Wallis test with the command:
> kruskal.test(authscore ~ Group, data=kruskal)
The output will be:
Kruskal-Wallis rank sum test
data: authscore by Group
Kruskal-Wallis chi-squared = 6.4057, df = 2, p-value = 0.04065
>
As seen in the above output the analysis of authoritarianism score by group indicates that the probability of differences in scores among the three groups being due to chance alone is less than the .05 alpha level that was set for the study (p < .05). Further post hoc analysis would be necessary to determine the exact nature of the differences among the scores of the three groups. This will be the topic of a future tutorial.
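For convenience, the whole test can also be run as one self-contained script. This is a sketch that builds the data frame directly from vectors instead of the data editor; the names kw, alpha and reject_null are my own additions for illustration.

```r
# Scores for the three groups of educators, entered as vectors.
authscore <- c(96, 128, 83, 61, 101, 82, 124, 132, 135, 109, 115, 149, 166, 147)
Group     <- factor(rep(1:3, times = c(5, 5, 4)))
kruskal   <- data.frame(Group, authscore)

# Kruskal-Wallis rank sum test of authscore across the three groups.
kw <- kruskal.test(authscore ~ Group, data = kruskal)
print(kw)

# Compare the obtained p-value with the alpha level set for the study.
alpha <- 0.05
reject_null <- kw$p.value < alpha
print(reject_null)
```

Storing the result in an object, rather than only printing it, lets the statistic and p-value be reused later, for example when reporting results or deciding whether post hoc comparisons are warranted.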
More to come: Part Three will explore the use of multiple comparison techniques to analyze ranked means
Using R for Nonparametric Data Analysis: The Kruskal-Wallis Test
A tutorial by Douglas M. Wiig
Analysis of variance (ANOVA) is a commonly used technique for examining the effect of an independent variable with three or more levels on a dependent variable. There are several types of ANOVA ranging from simple one-way ANOVA to the more complex multivariate analysis of variance, MANOVA. ANOVA makes several assumptions about the sample data being used such as the assumption of normal distribution of the variables in the parent population, underlying continuous distribution of the variables, and interval or ratio level measurement of all variables. If any of these assumptions cannot be met a researcher can turn to a nonparametric counterpart to ANOVA for the analysis. This tutorial will discuss the use of the Kruskal-Wallis test, the nonparametric counterpart to analysis of variance.
In this tutorial I will explore a simple example and discuss entering the sample data into a data file using the R data editor. I will then discuss setting up the data for analysis and using the Kruskal-Wallis test.
I am going to assume that the reader has a working knowledge of ANOVA with parametric data. Since ANOVA uses sample means and variances as the basis of the statistical test, interval or ratio level measurement is necessary to ensure valid results in addition to the assumptions indicated above. With the nonparametric Kruskal-Wallis test the only assumptions to be met are ordinal or better measurement and the assumption of an underlying continuous measurement. The example to be used here is taken from a book on nonparametric statistics by Sidney Siegel (Sidney Siegel, Nonparametric Statistics for the Behavioral Sciences, New York: McGraw-Hill, 1956, pp. 184-196).
A researcher wishes to test the hypothesis that school administrators are typically more authoritarian than classroom teachers. He also believes that many classroom teachers are administration-oriented in their professional aspirations which may, in turn, have an effect on their authoritarianism. 14 subjects are selected and divided into three groups: teaching-oriented teachers (classroom teachers who wish to remain in a teaching position), administration-oriented teachers (classroom teachers who aspire to become administrators), and practicing administrators (Siegel, p. 186). The level of authoritarianism of each subject is measured through a survey that assigns an authoritarianism score that is considered to be at least ordinal in nature. Higher scores indicate higher levels of authoritarianism (Siegel, p. 186). The null hypothesis is that there is no difference in mean authoritarianism scores among the three groups. The alternative hypothesis is that the mean authoritarianism scores among the three groups are different. The alpha level for rejecting the null hypothesis is p = .05 (Siegel, p. 186).
Since we make no assumption about a normal distribution of scores, have a small sample size of n = 14, and have an ordinal measure, we will use the nonparametric test, which is based on ranks rather than on means and variances as used in parametric ANOVA. The mathematical details of how this is done are beyond the scope of this tutorial. See Siegel, pp. 187-189 for details. The authoritarianism scores for the three groups are shown below:
Authoritarianism Scores of Three Groups of Educators
Teaching-oriented    Administration-oriented    Administrators
teachers (n=5)       teachers (n=5)             (n=4)
------------------------------------------------------------
96                   82                         115
128                  124                        149
83                   132                        166
61                   135                        147
101                  109
------------------------------------------------------------
(Siegel, p. 187)
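Although the mathematical details are left to Siegel, the core calculation is short enough to sketch. Assuming the standard Kruskal-Wallis formula for data without ties, H = 12 / (N(N+1)) * sum(R_j^2 / n_j) - 3(N+1), where R_j is the sum of ranks in group j, the statistic can be computed by hand as follows. The list and variable names are mine, chosen for illustration.

```r
# Scores for the three groups of educators.
scores <- list(
  teaching       = c(96, 128, 83, 61, 101),
  admin_oriented = c(82, 124, 132, 135, 109),
  administrators = c(115, 149, 166, 147)
)

all_scores <- unlist(scores)
group <- rep(names(scores), times = lengths(scores))

# Rank all 14 scores together (1 = lowest), then sum the ranks per group.
r <- rank(all_scores)
rank_sums <- as.numeric(tapply(r, group, sum)[names(scores)])
n <- as.numeric(lengths(scores))
N <- length(all_scores)

# Kruskal-Wallis H statistic (no tie correction is needed for these data).
H <- 12 / (N * (N + 1)) * sum(rank_sums^2 / n) - 3 * (N + 1)

# H is referred to a chi-squared distribution with k - 1 degrees of freedom.
p_value <- pchisq(H, df = length(scores) - 1, lower.tail = FALSE)

print(rank_sums)
print(H)
print(p_value)
```

Because only the ranks enter the formula, any change to a score that leaves the ordering of the 14 values intact produces the same H.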
The first task is to create an R data frame with the scores from the table. We will enter the scores using the R data editor. We will name the data frame ‘kruskal.’ Invoke the editor using the following commands:
> kruskal <- data.frame()
> kruskal <- edit(kruskal)
You should see the data entry editor open in a separate window. In order to process the data properly it needs to be entered into two columns. The first column will be the factors (which group the scores belong to), and the second column will contain the actual scores. Label column 1 ‘Group’ and column 2 ‘authscore.’ When the data are entered your editor should look like this:
———————-
Group authscore
1 1 96
2 1 128
3 1 83
4 1 61
5 1 101
6 2 82
7 2 124
8 2 132
9 2 135
10 2 109
11 3 115
12 3 149
13 3 166
14 3 147
———————-
Make sure that each column of numbers is of the data type “Real.” Close the data editor by clicking ‘Quit’ and the data will be saved in the working directory for access. To see what has been entered in the data editor use the command:
> kruskal
Group authscore
1 1 96
2 1 128
3 1 83
4 1 61
5 1 101
6 2 82
7 2 124
8 2 132
9 2 135
10 2 109
11 3 115
12 3 149
13 3 166
14 3 147
>
You should see the output as above. If you need to make changes simply invoke the editor with:
> kruskal <-edit(kruskal)
The editor will open and you can make any changes you need to. Be sure to click on ‘Quit’ to save the changes to the working directory.
Part Two will continue the analysis