Tag Archives: r code

Using R in Nonparametric Statistical Analysis, The Kruskal-Wallis Test Part Three: Post Hoc Pairwise Multiple Comparison Analysis of Ranked Means


Using the Kruskal-Wallis Test, Part Three:  Post Hoc Pairwise Multiple Comparison Analysis of Ranked Means

A tutorial by Douglas M. Wiig

In previous tutorials I discussed an example of entering data into a data frame and performing a nonparametric Kruskal-Wallis test to determine if there were differences in the authoritarian scores of three different groups of educators. The test statistic indicated that at least one of the groups(group 1) was significantly different from the other two.

In order to explore the difference further it common practice to do post hoc analysis of the differences. There are a number of methods that have been devised to do these comparisons, but one of the most straightforward and easiest to understand is pairwise comparison of ranked means(or means if using standard ANOVA.)

Prior to entering the code for this section be sure that the following packages are installed and loaded:

       PMCMR

   prirmess

In part one data was entered into the R editor to create a data frame. Data frames can also be created directly using R script. The script to create the data frame for this example uses the following code:

#create data frame from script input

>Group <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3)

>authscore <-c(96,128,83,61,101,82,121,132,135,109,115,149,166,147)

>kruskal <- data.frame(Group, authscore)

The group identifiers are entered and assigned to the variable Group, and the authority scores are assigned to the variable authscore. Notice that each identifier is matched with an appropriate authscore just as they were when entered in columns using the data editor. The vectors are then assigned to the variable kruskal to create a data.frame. Once again the structure of the data frame can be checked using the command:

>str(kruskal)

resulting in:

'data.frame':   14 obs. of  2 variables:
 $ Group    : num  1 1 1 1 1 2 2 2 2 2 ...
 $ authscore: num  96 128 83 61 101 82 121 132 135 109 ...

>

It is often useful to do a visual examination of the ranked means prior to post hoc analysis. This can be easily accomplished using a boxplot to display the 3 groups that are presented in the example. If the data frame created in tutorial one is still in the global environment the boxplot can be generated with the following script:

>#boxplot using authscore and group variables from the data frame created in part one

>boxplot(authscore ~ group, data=kruskal, main=”Group Comparison”, ylab=”authscore”)

>

The resulting boxplot is seen below:

Rplot5

As can be seen in the plot, authority score differences are the greatest between group 1 and 3 with group 2 In between. Use the following code to run the Kruskal-Wallis test and examine if any of the means are significantly different:

#library(PMCMR)

with(kruskal, {

posthoc.kruskal.nemenyi.test(authscore, Group, “Tukey”)

}

The post hoc test used in this example is from the recently released PMCMR R package. For details of this and other post hoc tests contained in the package( see Thorsten Polert, Calculate Pairwise Multiple Comparisons of Mean Rank Sums, 2015. http://cran.r-project.org/web/packages/PMCMR/PMCMR.pdf.) The test employed here used the Tukey method to make pairwise comparisons of the mean rank authoritarianism scores of the three groups. The output from the script above is:

Pairwise comparisons using Tukey and Kramer (Nemenyi) test

with Tukey-Dist approximation for independent samples

data: authscore and Group

      1                    2

2   0.493             –

3    0.031        0.310

P value adjustment method: none

The output above confirms what would be expected from observing the boxplot. The only means that differ significantly are means 1 and 3 with a p = .031.

The PMCMR package will only work with R versions 3.0.x. If using an earlier version of R another package can be used to accomplish the post hoc comparisons. This package is the pgirmess package (see http://cran.r-project.org/web/packages/pgirmess/pgirmess.pdf for complete details). Using the vectors authscore and Group that were created earlier the script for multiple comparison using the pgirmess package is:

library(pgirmess)

authscore <- c(96,128,83,61,101,82,121,132,135,109,115,149,166,147)

Group <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3)

kruskalmc(authscore ~ Group, probs=.05, cont=NULL)

and the output from this script using a significance level of p = .05 is:

Multiple comparison test after Kruskal-Wallis

p.value: 0.05

Comparisons

      obs.dif    critical.dif     difference

1-2    3.0        6.333875         FALSE

1-3    7.1        6.718089         TRUE

2-3    4.1        6.718089        FALSE

>

As noted earlier the comparison between groups one and three is shown to be the only significant difference at the p=.05 level.

Both the PMCMR and the pgirmess packages are useful in producing post hoc comparisons with the Kruskal-Wallis test. It hoped that the series of tutorials discussing nonparametric alternatives common parametric statistical tests has helped demonstrate the utility of these approaches in statistical analysis.

In part four I will post the complete script used in all three tutorials.

Book Review: R High Performance Programming


A book review by Douglas M. Wiig

Aloysius Lim and William Tjhi. R High Performance Programming. Birmingham, UK: Packt Publishing Ltd., 2015. bit.ly/14Rhpp

R High Performance Programming is a well written, informative book most suited for the experienced R programmer. This book offers a handy guide for R users who need speed and efficiency for the tasks that they perform.

The authors begin with an informative chapter discussing some of the inherent constraints on R’s computing performance such as CPU and RAM usage, and how R code is interpreted on the fly rather than compiled. A guide to several methods of profiling R’s code execution time, memory allocation and CPU usage is discussed in the next chapter. Sample code included in the chapter allows the reader to experiment with various benchmarking techniques to measure processing time and memory usage. This chapter provides the reader with some good tools for benchmarking R projects and identifying areas where improvements in processing can be made.

As is always the case with technical books from Packt Publishing, ample code examples are used in the chapter and the complete code used in each chapter is available for download with the book. This is a very handy feature and allows readers to do some live programming with R as the book is read.

The authors discuss a number of simple tweaks that can be easily performed to increase processing speed such as using built in functions and using hash tables. The hash table technique is useful for applications that use frequent lookups and can dramatically reduce processing time when compared to the use of lists. Running example code using this technique shows a large decrease in processing time when using the hash table approach as compared to straight list processing lookups.

In chapter 4 the authors discuss the use of compiled R code and integrating compiled languages into R code. They show several examples of using the R package inline that allows users to embed C, C++, Objective-C, Objective-C++ and Fortran code within R. Once again there are ample code examples to illustrate the use of this technique. For more advanced uses of compiled code the authors discuss how to create entire modules coded in C++ using the Rcpp package. Several completed code examples are included to illustrate the technique.

Another interesting approach to speeding up R is discussed in a chapter that explores several R packages designed to exploit the capability of GPU’s (Graphic Processing Cards) that are a used in many computers. These techniques can facilitate creating very fast and efficient statistical modeling code using R and the GPU.

As indicated above, readers can download the code package included with the book and find a well-organized set of ten folders (one for each chapter) containing 51 files. These files contain the sample code from the book as well as other code segments and benchmark code discussed in the book. The authors indicate that the code has been tested on R 3.1.1, Ubuntu 14.04 Trusty Tahr, Mac OS X 10.9 Mavericks, and Windows 8.1. This allows integration of these code segments into the reader’s own projects with minimal changes.

Other chapters in R High Performance Programming discuss simple tweaks to use less memory, techniques to speed processing of large datasets and using parallel processing and clustering techniques. The last chapter contains a discussion of using R and Hadoop to process Big Data (massive datasets with sizes measured in petabytes -one petabyes is 1,048,576 gigabytes). Processing data of this magnitude presents many challenges and is an area that is currently the subject of much program development.

I found R High Performance Programming to be a useful and informative book for the advanced user of R. A working knowledge of statistics, R and other programming languages such as C++ or Java is necessary to realize the full benefit of the techniques presented in the book. The book also serves as a good learning tool for less knowledgeable R users who are seeking to advance their programming skills.

Readers who are interested in the use of Hadoop and cluster computer processing might find the book Raspberry Pi Super Cluster by Andrew K. Dennis of interest. (Packt Publishing, 2013

PAC-14-1987838-1387169). A review of this book can be found on my web site at http://dmwiig.net.

Reviewer Information:

Douglas M. Wiig, Professor of Political Science

Grand View University

Teaching areas include social science statistics and research methods, comparative politics, international politics.

Long time user and developer of computer and statistical applications

Host of Open Source Technology in Higher Education web site at http://dmwiig.net

Creator and moderator of LinkedIn discussion forum “Open Source Technology in Higher Education”

Regular contributor to several LinkedIn discussion forums

Author of numerous tutorials on using the R statistical programming language and Raspberry Pi computer