Tag Archives: nonparametric statistics

R For Beginners: Basic R Code for Common Statistical Procedures Part I


An R tutorial by D. M. Wiig

This section gives examples of code to perform some of the most common elementary statistical procedures. All code segments assume that the package 'car' has been loaded and that the data set 'Freedman' has been loaded as the active dataset. Use the menu from the R console to install the 'car' package, or use the following command to open the CRAN mirror list and package list:


install.packages()

Once the 'car' package has been downloaded and installed, use the following command to load it into the current session:

require(car)

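Load the 'Freedman' data set from the 'car' package. (In car version 3.0 and later the data sets are supplied by the companion package 'carData', so package="carData" may be needed instead.)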

data(Freedman, package="car")

List basic descriptives of the variables:

summary(Freedman)

Perform a correlation between two variables using Pearson, Kendall or Spearman’s correlation:

cor(filename[,c("var1","var2")], use="complete.obs", method="pearson")

cor(filename[,c("var1","var2")], use="complete.obs", method="spearman")

cor(filename[,c("var1","var2")], use="complete.obs", method="kendall")

Example:

cor(Freedman[,c("crime","density")], use="complete.obs", method="pearson")

cor(Freedman[,c("crime","density")], use="complete.obs", method="kendall")

cor(Freedman[,c("crime","density")], use="complete.obs", method="spearman")
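The use="complete.obs" argument in the calls above tells cor() to drop any row that has a missing value before computing the coefficient. The following is a minimal sketch of that behavior. The data frame 'scores' and its columns are made up for illustration and are not part of the 'Freedman' data.

```r
# Hypothetical data: 'scores', 'x' and 'y' are invented for this sketch
scores <- data.frame(
  x = c(2, 4, 6, 8, NA, 12),
  y = c(1, 3, 2, 5, 4, NA)
)

# use = "complete.obs" drops every row that contains a missing value
r_complete <- cor(scores[, c("x", "y")], use = "complete.obs", method = "pearson")

# Equivalent approach: remove the incomplete rows first, then correlate
r_manual <- cor(na.omit(scores)[, c("x", "y")], method = "pearson")

print(r_complete)
print(all.equal(r_complete, r_manual))

# cor.test() adds a significance test for a single pair of variables;
# it also uses only the complete pairs
ct <- cor.test(scores$x, scores$y, method = "pearson")
print(ct$estimate)
print(ct$p.value)
```

The same pattern works with method="kendall" or method="spearman".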

In the next post I will discuss basic code to produce multiple correlations and linear regression analysis.  See other tutorials on this blog for more R code examples for basic statistical analysis.

 

Using R for Nonparametric Analysis, The Kruskal-Wallis Test Part Four: R Script and Some Notes on IDE’s


 


 

A tutorial by Douglas M. Wiig

 

In the previous three parts of this tutorial I discussed using R to enter a data set and perform a nonparametric Kruskal-Wallis test for ranked means. In this final part the commented script that was used in the first three parts is listed.

 

If you are going to use R for the majority of your statistical analysis, it is well worth investigating some of the IDEs (Integrated Development Environments) that are available to assist in coding and debugging R scripts and in creating R packages for personal use or distribution. I think one of the easiest to use is RStudio. RStudio is available in both free open source and commercial versions and can be downloaded at http://www.rstudio.com. There are versions available for Windows, various Linux distributions, and Mac OS.

The RStudio interface provides a number of useful tools that facilitate coding. The screen is divided into four sections. One section provides a code editor that features syntax highlighting, code completion, and many other features such as line or block code execution. Another window contains the R console and displays all output, error messages, and warnings when code is executed from the editor. A third window displays all of the objects in the current environment and can also show all currently loaded R packages. A fourth window can show graphic output from executed code, can be used to manage, download, and install R packages, and can be used to access the online help. There are other useful tools that are too numerous to discuss here.

 

Another program that is worth looking at is RKWard, which combines an IDE with a graphical user interface for R statistical analysis. Information and downloads for RKWard can be found at https://rkward.kde.org. This program is also free and open source and can be run on Windows, Mac OS, or various distributions of Linux. The program has been optimized for Linux. A detailed discussion of these IDEs is beyond the scope of this posting.

 

Shown below is the commented R script for all three parts of the Kruskal-Wallis tutorial. For ease of reading code portions are shown in bold print.

 

#packages that must be attached before running these scripts
#stats; graphics; grDevices; utils; datasets; methods; base
#
#code to enter data using the data editor
#KW data entry, define kruskal as an empty data frame
kruskal <- data.frame()
#invoke the data editor
kruskal <- edit(kruskal)
#define group as a factor with 3 levels; this tells R which group each score belongs to
group <- factor(c(1, 2, 3))

 

#alternative data entry method
#define Group as a factor containing three categories
Group <- factor(c(1,1,1,1,1,2,2,2,2,2,3,3,3,3))
#create a vector named authscore and enter values
authscore <- c(96,128,83,61,101,82,121,132,135,109,115,149,166,147)
#create data frame kruskal matching each group label to its individual score
kruskal <- data.frame(Group, authscore)
#use the following line to look at the structure of the data frame created
str(kruskal)

 

#run the basic Kruskal-Wallis test
#
kruskal.test(authscore ~ Group, data=kruskal)

 

#

 

#the following code is used to conduct a post-hoc comparison of the ranked means
#it is useful to first do a simple boxplot for a visual comparison
#draw the boxplot on the screen
boxplot(authscore ~ Group, data=kruskal, main="Group Comparison", ylab="authscore")
#use this script to save the boxplot graphic to a .png file
#open the file authplot.png as the graphics device
png("authplot.png")
#redraw the boxplot so that it is written to the file; authplot holds the boxplot statistics
authplot <- boxplot(authscore ~ Group, data=kruskal, main="Group Comparison", ylab="authscore")
#close the .png file and return graphics output to the screen
dev.off()

 

#

 

#make sure that the package PMCMR is installed before running the following script
library(PMCMR)
#use the 'with' function to pass the data from the kruskal data frame to the post hoc
#test script; specify the Tukey HSD method for determining significance of each
#pair of comparisons
#
with(kruskal, {
posthoc.kruskal.nemenyi.test(authscore, Group, "Tukey")
})

 

#

 

#NOTE: if using a version of R < 3.xx then use the package pgirmess instead of PMCMR
#
#the following lines show the post hoc analysis using the pgirmess package
#note the function kruskalmc is used for the comparisons
library(pgirmess)
authscore <- c(96,128,83,61,101,82,121,132,135,109,115,149,166,147)
Group <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3)
kruskalmc(authscore ~ Group, probs=.05, cont=NULL)

 

#
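The H statistic that kruskal.test() reports for the script above can be checked by hand from the pooled ranks. The following is a minimal sketch that uses only base R and the same authscore and Group values. It assumes there are no tied scores (true for this data set), so the tie correction is left out. The names r, rank_sums, n_i, H_manual and p_manual are introduced here for illustration.

```r
# Same data as in the script above
authscore <- c(96,128,83,61,101,82,121,132,135,109,115,149,166,147)
Group <- factor(c(1,1,1,1,1,2,2,2,2,2,3,3,3,3))
kruskal <- data.frame(Group, authscore)

# Rank all scores as one pooled sample (no ties in this data set)
r <- rank(kruskal$authscore)
N <- length(r)

# Sum of ranks and number of cases in each group
rank_sums <- tapply(r, kruskal$Group, sum)
n_i <- tapply(r, kruskal$Group, length)

# H = 12 / (N(N+1)) * sum(R_i^2 / n_i) - 3(N+1), without tie correction
H_manual <- 12 / (N * (N + 1)) * sum(rank_sums^2 / n_i) - 3 * (N + 1)

# H is referred to a chi-square distribution with (number of groups - 1) df
p_manual <- pchisq(H_manual, df = nlevels(kruskal$Group) - 1, lower.tail = FALSE)

# Compare with the built-in test
kw <- kruskal.test(authscore ~ Group, data = kruskal)
print(c(H_manual = H_manual, H_builtin = unname(kw$statistic)))
print(c(p_manual = p_manual, p_builtin = kw$p.value))
```

If the data contained tied scores, kruskal.test() would apply a correction and the two H values would differ slightly.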

 

Nonparametric Statistical Analysis Using R: The Sign Test


Using R in Nonparametric Statistical Analysis:  The Binomial Sign Test

A tutorial by D.M. Wiig

One of the core competencies that students master in introductory social science statistics is to create a null and alternative hypothesis pair relative to a research question and to use a statistical test to evaluate and make a decision about rejecting or retaining the null hypothesis.  I have found that one of the easiest statistical tests to use when teaching these concepts is the sign test.  This is a very easy test to use and students seem to intuitively grasp the concepts of trials and binomial outcomes as these are easily related to the common and familiar event of ‘flipping a coin.’

 

While it is possible to use the sign test by looking up probabilities of outcomes in a table of the binomial distribution, I have found that using R to perform the analysis is a good way to get students involved in using statistics software to solve the problem.  R has an easy to use sign test routine that is called with the binom.test command.  To illustrate the use of the test, consider an experiment where the researcher has randomly assigned 10 individuals to a group and observes them in both a control and an experimental condition.  The researcher measures the criterion variable of interest in each condition for each subject and measures the effect on each subject's behavior using a relative scale of effect.

 

The researcher at this point is only interested in whether or not the criterion variable has an effect on behavior, so a non-directional hypothesis is used.  The data collected is shown in the following table:

 

Subject    1    2    3    4    5    6    7    8    9   10
---------------------------------------------------------
Pre       50   49   37   16   80   42   40   58   31   21
Post      56   50   30   25   90   44   60   71   32   22
---------------------------------------------------------
Sign       +    +    -    +    +    +    +    +    +    -

The general format for the sign test is as follows:

 

binom.test(x, n, p = .5, alternative = c("two.sided", "less", "greater"), conf.level = .95)

 

where: x = number of successes

n = number of trials

p = the hypothesized probability of success

alternative = indicates the alternative hypothesis as directional ("less" or "greater") or nondirectional ("two.sided")

conf.level = the confidence level for the returned confidence interval.

 

In the example described above we have 8 pluses and 2 minuses.  We will use the "two.sided" option for the alternative hypothesis, a probability of success of .50, and a conf.level of .95. The following is entered into R:

> binom.test(8, 10, p=.5, alternative="two.sided", conf.level=.95)

Exact binomial test

data:  8 and 10
number of successes = 8, number of trials = 10,
p-value = 0.1094
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
0.4439045 0.9747893
sample estimates:
probability of success
                  0.8
>

Under a nondirectional alternative hypothesis we are testing the probability of obtaining 0, 1, 2, 8, 9, 10 pluses or:

 

p(0, 1, 2, 8, 9, 10 pluses)  = .1094

 

If we had set an alpha of α = .05 then we would retain the null hypothesis, as p(obt) > .05.  We could not conclude that the experimental criterion has an effect on behavior.  R has many other nonparametric statistical tests that are easy to use from the command line.  These are topics for future tutorials.
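The two-sided p-value reported by binom.test can also be obtained by adding up the binomial probabilities of the outcomes listed above. The following is a minimal sketch using the base R function dbinom. The variable names n, pluses, extreme, p_manual and bt are introduced here for illustration.

```r
# Values from the example: 10 trials, 8 pluses, null probability .5
n <- 10
pluses <- 8

# Outcomes at least as extreme as 8 pluses in either direction
extreme <- c(0, 1, 2, 8, 9, 10)

# Sum the binomial probabilities of those outcomes
p_manual <- sum(dbinom(extreme, size = n, prob = 0.5))

# Compare with the p-value returned by binom.test
bt <- binom.test(pluses, n, p = 0.5, alternative = "two.sided", conf.level = 0.95)
print(p_manual)
print(bt$p.value)
```

Both values should agree with the .1094 shown in the output above, since (1 + 10 + 45 + 45 + 10 + 1) / 1024 = 112 / 1024.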

 

More to Follow:

Using R for Nonparametric Statistical Analysis: Nonparametric Correlation



A Tutorial by D.M. Wiig

In previous tutorials I discussed how to download and install R on a Debian Linux operating system and how to use R to perform Kendall's Concordance analysis. This tutorial explores some basic R commands to open a built-in dataset, produce a simple scatter plot of the data, and perform a nonparametric correlation using Kendall's and Spearman's rank order correlations. Before beginning this tutorial open a terminal window and start R.

 

One of the packages that is installed with the R distribution is called "datasets." One of the data sets in this package, USJudgeRatings, is a data frame containing lawyers' ratings of 43 state judges on 12 numeric variables. Since the scale used in these ratings is ordinal it is appropriate to use rank order correlation to analyze the data. To examine the data in USJudgeRatings use the command sequence:

> data(USJudgeRatings, package="datasets")

> print(USJudgeRatings)

                CONT INTG DMNR DILG CFMG DECI PREP FAMI ORAL WRIT PHYS RTEN
AARONSON,L.H.    5.7  7.9  7.7  7.3  7.1  7.4  7.1  7.1  7.1  7.0  8.3  7.8
ALEXANDER,J.M.   6.8  8.9  8.8  8.5  7.8  8.1  8.0  8.0  7.8  7.9  8.5  8.7
ARMENTANO,A.J.   7.2  8.1  7.8  7.8  7.5  7.6  7.5  7.5  7.3  7.4  7.9  7.8
BERDON,R.I.      6.8  8.8  8.5  8.8  8.3  8.5  8.7  8.7  8.4  8.5  8.8  8.7
BRACKEN,J.J.     7.3  6.4  4.3  6.5  6.0  6.2  5.7  5.7  5.1  5.3  5.5  4.8
BURNS,E.B.       6.2  8.8  8.7  8.5  7.9  8.0  8.1  8.0  8.0  8.0  8.6  8.6
CALLAHAN,R.J.   10.6  9.0  8.9  8.7  8.5  8.5  8.5  8.5  8.6  8.4  9.1  9.0

……………

 

You will see all 43 cases in the output. To save space I have shown only a portion of the output here. Please note that object names in R are case sensitive, so be sure to use capital letters where shown.

The basic R distribution has fairly extensive graphing capabilities. To produce a simple scatter diagram of the variables PHYS and RTEN that graphs RTEN on the X axis and PHYS on the Y axis use the following line of code:

 

	> plot(PHYS~RTEN, log="xy", data=USJudgeRatings)

 

You should see a scatter plot of PHYS (Y axis) against RTEN (X axis).

[Scatter plot not reproduced here.]

We can perform a correlation analysis on the data using either Kendall's rank order correlation or Spearman's Rho. For a Kendall correlation make sure the data set USJudgeRatings is loaded into memory by using the command:

> data(USJudgeRatings, package="datasets")

 

Now perform the analysis with the command:

> cor(USJudgeRatings[,c("PHYS","RTEN")], use="complete.obs", method="kendall")

          PHYS      RTEN
PHYS 1.0000000 0.7659126
RTEN 0.7659126 1.0000000

 

As seen above we specify the two variables we want to correlate and indicate that only complete observations (rows with no missing values) are to be used. Running a Spearman's correlation on the same variables is a matter of changing the "method=" designator:

> cor(USJudgeRatings[,c("PHYS","RTEN")], use="complete.obs", method="spearman")

          PHYS      RTEN
PHYS 1.0000000 0.9031373
RTEN 0.9031373 1.0000000

 

To produce a Kendall's correlation matrix for 10 of the 12 variables (PREP and FAMI are left out of the list below) use:

 

> cor(USJudgeRatings[,c("CONT","INTG","DMNR","DILG","CFMG", "DECI",
+                       "ORAL","WRIT","PHYS","RTEN")], use="complete.obs", method="kendall")
             CONT       INTG       DMNR         DILG       CFMG       DECI
CONT  1.000000000 -0.1203440 -0.1162402 -0.001142206 0.09409104 0.05498285
INTG -0.120344017  1.0000000  0.8607446  0.689935415 0.60919580 0.64371783
DMNR -0.116240241  0.8607446  1.0000000  0.662117755 0.60801429 0.63320857
DILG -0.001142206  0.6899354  0.6621178  1.000000000 0.86484298 0.89194190
CFMG  0.094091035  0.6091958  0.6080143  0.864842984 1.00000000 0.91212083
DECI  0.054982854  0.6437178  0.6332086  0.891941895 0.91212083 1.00000000
ORAL -0.027381743  0.7451506  0.7272732  0.859909442 0.82495629 0.83952698
WRIT -0.028474100  0.7187820  0.6942712  0.877775007 0.83497447 0.85064096
PHYS -0.066667371  0.6309756  0.6296740  0.752740177 0.72853135 0.77215650
RTEN -0.021652594  0.8013829  0.7979569  0.822527726 0.76344652 0.80206419
            ORAL       WRIT        PHYS        RTEN
CONT -0.02738174 -0.0284741 -0.06666737 -0.02165259
INTG  0.74515064  0.7187820  0.63097556  0.80138292
DMNR  0.72727320  0.6942712  0.62967404  0.79795687
DILG  0.85990944  0.8777750  0.75274018  0.82252773
CFMG  0.82495629  0.8349745  0.72853135  0.76344652
DECI  0.83952698  0.8506410  0.77215650  0.80206419
ORAL  1.00000000  0.9596834  0.79429138  0.90227331
WRIT  0.95968339  1.0000000  0.77463199  0.85309146
PHYS  0.79429138  0.7746320  1.00000000  0.76591261
RTEN  0.90227331  0.8530915  0.76591261  1.00000000

>

 

If the data you are using is measured at the interval or ratio level just change the "method=" designator to "pearson" (all lowercase) to produce a product-moment correlation.
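The cor() calls above return only the coefficients. As a minimal sketch, the base R function cor.test() can be used to attach a significance test to a single pair of variables, and passing the whole data frame to cor() gives the matrix for all 12 variables. The names tau_all and kt are introduced here for illustration. The argument exact=FALSE is an assumption made so that the normal approximation is used, since tied ratings are likely in this data and an exact p-value cannot be computed with ties.

```r
# Load the built-in data set
data(USJudgeRatings, package = "datasets")

# Kendall correlation matrix for all 12 variables at once
tau_all <- cor(USJudgeRatings, use = "complete.obs", method = "kendall")
print(dim(tau_all))

# Significance test for one pair; exact = FALSE requests the normal approximation
kt <- cor.test(USJudgeRatings$PHYS, USJudgeRatings$RTEN,
               method = "kendall", exact = FALSE)
print(kt$estimate)
print(kt$p.value)
```

The estimate returned by cor.test() should match the PHYS and RTEN entry of the matrix produced by cor().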

 

 

More to Come:

 

 

 

 

Using R in Nonparametric Statistics: Basic Table Analysis, Part Two



A Tutorial by D.M. Wiig

As discussed in a previous tutorial, one of the most common methods of displaying and analyzing data is through the use of tables. In this tutorial I will discuss setting up a basic table using R and exploring the use of the CrossTable function that is available in the R 'gmodels' package. I will use the same hypothetical data table that I created in Part One of this tutorial, data that examines the relationship between income and political party identification among a group of registered voters. The variable "income" will be considered ordinal in nature and consists of categories of income in thousands as follows:

“< 25”; “25-50”; “51-100” and “>100”

Political party identification is nominal in nature with the following categories:

“Dem”, “Rep”, “Indep”

Frequency counts of individuals that fall into each category are numeric. In this example we will create the table by entering the data as a matrix and displaying the results. When using this method it is a good idea to set up the table on paper before entering the data into R. This will help to make sure that all cases and factors are entered correctly. The table I want to generate will look like this:

              party
income     Dem   Rep   Indep
<25         15     5      10
25-50       20    15      15
51-100      10    20      10
>100         5    30      10

When using the CrossTable() function the data should be entered in matrix format. Enter the data from the table above as follows:

> #enter data as table matrix creating the variable 'Partyid'
> #enter the frequencies
> Partyid <- matrix(c(15,20,10,5, 5,15,20,30, 10,15,10,10),4,3)
> #enter the row and column dimension names and category labels
> dimnames(Partyid) = list(income=c("<25","25-50","51-100",">100"), party=c("Dem","Rep","Indep"))

To view the structure of the created data matrix use the command:

> str(Partyid)
 num [1:4, 1:3] 15 20 10 5 5 15 20 30 10 15 ...
 - attr(*, "dimnames")=List of 2
  ..$ income: chr [1:4] "<25" "25-50" "51-100" ">100"
  ..$ party : chr [1:3] "Dem" "Rep" "Indep"
>

To view the table use the command:

> Partyid
        party
income   Dem Rep Indep
  <25     15   5    10
  25-50   20  15    15
  51-100  10  20    10
  >100     5  30    10
>

Remember that R is case sensitive so make sure you use upper case if you named your variable ‘Partyid.’

Once the table has been entered as a matrix it can be displayed with a number of available options using the CrossTable() function. In this example I will produce a table in SAS format (the default format), display both observed and expected cell frequencies, the contribution of each cell to the Chi-square total, and the results of the chi-square analysis. The script is:
> #make sure gmodels package is loaded
> require(gmodels)
> #CrossTable analysis
> CrossTable(Partyid,prop.t=FALSE,prop.r=FALSE,prop.c=FALSE,expected=TRUE,chisq=TRUE,prop.chisq=TRUE)

   Cell Contents
|-------------------------|
|                       N |
|              Expected N |
| Chi-square contribution |
|-------------------------|

Total Observations in Table:  165

             | party
      income |       Dem |       Rep |     Indep | Row Total |
-------------|-----------|-----------|-----------|-----------|
         <25 |        15 |         5 |        10 |        30 |
             |     9.091 |    12.727 |     8.182 |           |
             |     3.841 |     4.692 |     0.404 |           |
-------------|-----------|-----------|-----------|-----------|
       25-50 |        20 |        15 |        15 |        50 |
             |    15.152 |    21.212 |    13.636 |           |
             |     1.552 |     1.819 |     0.136 |           |
-------------|-----------|-----------|-----------|-----------|
      51-100 |        10 |        20 |        10 |        40 |
             |    12.121 |    16.970 |    10.909 |           |
             |     0.371 |     0.541 |     0.076 |           |
-------------|-----------|-----------|-----------|-----------|
        >100 |         5 |        30 |        10 |        45 |
             |    13.636 |    19.091 |    12.273 |           |
             |     5.470 |     6.234 |     0.421 |           |
-------------|-----------|-----------|-----------|-----------|
Column Total |        50 |        70 |        45 |       165 |
-------------|-----------|-----------|-----------|-----------|

Statistics for All Table Factors

Pearson's Chi-squared test
------------------------------------------------------------
Chi^2 =  25.55608     d.f. =  6     p =  0.0002692734

>

As seen above row marginal totals and column marginal totals are displayed by default with the SAS format. There are other options available for the CrossTable() function. See the CRAN documentation for a detailed description of all of the options available. In the next installment of this tutorial I will examine some of the measures of association that are available in R for nominal and ordinal data displayed in a table format.
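The expected frequencies and chi-square contributions shown in the CrossTable() output can also be reproduced with base R alone, which is a useful check when the gmodels package is not installed. The following is a minimal sketch using chisq.test() on the same Partyid matrix. The names ct and contrib are introduced here for illustration.

```r
# Same matrix as entered above
Partyid <- matrix(c(15,20,10,5, 5,15,20,30, 10,15,10,10), nrow = 4, ncol = 3,
                  dimnames = list(income = c("<25", "25-50", "51-100", ">100"),
                                  party = c("Dem", "Rep", "Indep")))

# Pearson chi-square test of independence
ct <- chisq.test(Partyid)

# Expected cell frequencies: (row total * column total) / grand total
print(round(ct$expected, 3))

# Contribution of each cell: (observed - expected)^2 / expected
contrib <- (ct$observed - ct$expected)^2 / ct$expected
print(round(contrib, 3))

# The contributions add up to the chi-square statistic
print(sum(contrib))
print(ct$statistic)
print(ct$parameter)
print(ct$p.value)
```

Each cell contribution is the squared difference between the observed and expected count divided by the expected count, so large values point to the cells that depart most from independence.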

 

Using R in Nonparametric Statistics: Basic Table Analysis, Part One



A Tutorial by D.M. Wiig
One of the most common methods of displaying and analyzing data is through the use of tables. In this tutorial I will discuss setting up a basic table using R and performing an initial Chi-Square test on the table. R has an extensive set of tools for manipulating data in the form of a matrix, table, or data frame. The package 'vcd' is specifically designed to provide tools for table analysis. Before beginning this tutorial open an R session in your terminal window. You can install the vcd package using the following command:

>install.packages()

Depending on your R installation you may be asked to designate a CRAN mirror to download from, or you may see a list of available packages in your default CRAN mirror. Select the package 'vcd' and download it. I might add at this point that if you are running the newest release of R, R-3.0.x, you will have to reinstall a number of dependencies that will not work under the latest version of R. Any time you are installing a package and see the 'non-zero exit status' error message, look the dialog over to see which packages have to be reinstalled to work with the newest version of R. If you are using R-2.xx.x the vcd package will install without any other re-installations.

In social science research we often use data that is nominal or ordinal in nature. Data is displayed in categories with associated frequency counts. In this tutorial I will use a set of hypothetical data that examines the relationship between income and political party identification among a group of registered voters. The variable “income” will be considered ordinal in nature and consists of categories of income in thousands as follows:

“< 25”; “25-50”; “51-100” and “>100”

Political party identification is nominal in nature with the following categories:

“Dem”, “Rep”, “Indep”

Frequency counts of individuals that fall into each category are numeric. In the first example we will create a table by entering the data as a data frame and displaying the results. When using this method it is a good idea to set up the table on paper before entering the data into R. This will help to make sure that all cases and factors are entered correctly. The table I want to generate will look like this:

              party
income     Dem   Rep   Indep
<25         15     5      10
25-50       20    15      15
51-100      10    20      10
>100         5    30      10

To enter the above into a data frame use the following on the command line:

> partydata <- data.frame(expand.grid(income=c("<25","25-50","51-100",">100"), party=c("Dem","Rep","Indep")), count=c(15,20,10,5,5,15,20,30,10,15,10,10))
>

Make sure the syntax is exactly as shown and make sure the entire script is on the same line or has done an automatic return to the next line in your R console. When the command runs without error you can view the data by entering:

> partydata

The following output is produced:

   income party count
1     <25   Dem    15
2   25-50   Dem    20
3  51-100   Dem    10
4    >100   Dem     5
5     <25   Rep     5
6   25-50   Rep    15
7  51-100   Rep    20
8    >100   Rep    30
9     <25 Indep    10
10  25-50 Indep    15
11 51-100 Indep    10
12   >100 Indep    10
>

At this point the data is in frequency form rather than table or matrix form. To view a summary of information about the data use the command:

>str(partydata)

You will see:

> str(partydata)
'data.frame': 12 obs. of 3 variables:
 $ income: Factor w/ 4 levels "<25","25-50",..: 1 2 3 4 1 2 3 4 1 2 ...
 $ party : Factor w/ 3 levels "Dem","Rep","Indep": 1 1 1 1 2 2 2 2 3 3 ...
 $ count : num 15 20 10 5 5 15 20 30 10 15 ...

To convert the data into tabular format use the command xtabs to perform a cross tabulation. I have named the resulting table "tabs":

> tabs <- xtabs(count ~ income + party, data=partydata)

To view the resulting table use:

> tabs
        party
income   Dem Rep Indep
  <25     15   5    10
  25-50   20  15    15
  51-100  10  20    10
  >100     5  30    10
>

This produces a table in the desired format. To do a quick analysis of the table that produces a Chi-square statistic use the command:

> summary(tabs)

The output is

> summary(tabs)
Call: xtabs(formula = count ~ income + party, data = partydata)
Number of cases in table: 165
Number of factors: 2
Test for independence of all factors:
Chisq = 25.556, df = 6, p-value = 0.0002693
>

In future tutorials I will discuss many of the other resources that are available with the vcd package for manipulating and analyzing data in a tabular format.
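Before moving on to the vcd package, a few base R functions are handy for inspecting a table built with xtabs. The following is a minimal sketch that rebuilds the partydata data frame from above, adds marginal totals with addmargins(), computes row proportions with prop.table(), and pulls the chi-square results out of the summary() object. The names with_totals, row_props and s are introduced here for illustration.

```r
# Rebuild the frequency-form data frame and the cross tabulation
partydata <- data.frame(
  expand.grid(income = c("<25", "25-50", "51-100", ">100"),
              party = c("Dem", "Rep", "Indep")),
  count = c(15, 20, 10, 5, 5, 15, 20, 30, 10, 15, 10, 10)
)
tabs <- xtabs(count ~ income + party, data = partydata)

# Add row and column totals (labelled "Sum")
with_totals <- addmargins(tabs)
print(with_totals)

# Proportion of each income group identifying with each party (rows sum to 1)
row_props <- prop.table(tabs, margin = 1)
print(round(row_props, 3))

# The summary object stores the test results as list components
s <- summary(tabs)
print(s$n.cases)
print(s$statistic)
print(s$parameter)
print(s$p.value)
```

Row proportions make the pattern behind the significant chi-square easier to see than raw counts, because the income groups differ in size.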