# R Tutorial: A Script to Create and Analyze a Simple Data File, Part One


By D.M. Wiig

In this tutorial I will walk you through a simple script that shows how to create a data file and perform some simple statistical procedures on it. I will break the code into segments and discuss what each segment does. Before starting this tutorial, open a terminal window and start R from the command line.

The first task is to create a simple data file. Let’s assume that we have some data from 10 individuals measuring each person’s height and weight. The data is shown below:

```
Height (inches)   Weight (lbs)
72                225
60                128
65                176
75                215
66                145
65                120
70                210
71                176
68                155
77                250
```

We can enter the data into a data matrix by invoking the data editor and entering the values. Please note that the lines of code preceded by a # are comments and are ignored by R:

#Create a new file and invoke the data editor to enter data

#Create the file Sampledatafile, height and weight of 10 subjects

Sampledatafile <- data.frame()

Sampledatafile <- edit(Sampledatafile)

You will see a window open that is the R Data Editor. Click on the column heading ‘var1’ and you will see several different data types in the drop down menu. Choose the ‘real’ data type. Follow the same procedure to set the data type for the second column. Enter the data pairs in the columns, with height in the first column and weight in the second column. When the data have been entered click on the var1 heading for column 1 and click ‘Change Name.’ Enter ‘Height’ to label the first column. Follow the same steps to rename the second column ‘Weight.’

Once both columns of data have been entered you can click ‘Quit.’ The datafile ‘Sampledatafile’ is now loaded into memory.
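If you prefer to skip the interactive editor, the same data frame can be built directly from the command line. A minimal sketch using the values from the table above:

```r
# Build the Sampledatafile data frame without the interactive editor
Height <- c(72, 60, 65, 75, 66, 65, 70, 71, 68, 77)
Weight <- c(225, 128, 176, 215, 145, 120, 210, 176, 155, 250)
Sampledatafile <- data.frame(Height, Weight)
```

Either route leaves the same object in memory, so the commands that follow work unchanged.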

To run some basic descriptive statistics use the following code:

> #Run descriptives on the data

> summary(Sampledatafile)

The output from this code will be:

```
     Height          Weight
 Min.   :60.00   Min.   :120.0
 1st Qu.:65.25   1st Qu.:147.5
 Median :69.00   Median :176.0
 Mean   :68.90   Mean   :180.0
 3rd Qu.:71.75   3rd Qu.:213.8
 Max.   :77.00   Max.   :250.0
```

>

To view the data file use the following lines of code:

>#print the datafile ‘Sampledatafile’ on the screen

> print(Sampledatafile)

You will see the output:

```
   Height Weight
1      72    225
2      60    128
3      65    176
4      75    215
5      66    145
6      65    120
7      70    210
8      71    176
9      68    155
10     77    250
```

In Part Two I will discuss an R script to do a simple correlation and scatter diagram.  Check back later!

# Nonparametric Statistical Analysis Using R: The Sign Test


A tutorial by D.M. Wiig

One of the core competencies that students master in introductory social science statistics is to create a null and alternative hypothesis pair relative to a research question and to use a statistical test to evaluate and make a decision about rejecting or retaining the null hypothesis.  I have found that one of the easiest statistical tests to use when teaching these concepts is the sign test.  This is a very easy test to use and students seem to intuitively grasp the concepts of trials and binomial outcomes as these are easily related to the common and familiar event of ‘flipping a coin.’

While it is possible to use the sign test by looking up probabilities of outcomes in a table of the binomial distribution I have found that using R to perform the analysis is a good way to get them involved in using statistics software to solve the problem.  R has an easy to use sign test routine that is called with the binom.test command.  To illustrate the use of the test consider an experiment where the researcher has randomly assigned 10 individuals to a group and observes them in both a control and experimental condition.  The researcher measures the criterion variable of interest in each condition for each subject and measures the effect on each subject’s behavior using a relative scale of effect.

The researcher at this point is only interested in whether or not the criterion variable has an effect on behavior, so a non-directional hypothesis is used.  The data collected is shown in the following table:

```
Subject   1    2    3    4    5    6    7    8    9   10
--------------------------------------------------------
Pre      50   49   37   16   80   42   40   58   31   21
Post     56   50   30   25   90   44   60   71   32   22
--------------------------------------------------------
Sign      +    +    -    +    +    +    +    +    +    -
```

The general format for the sign test is as follows:

binom.test(x, n, p = 0.5, alternative = c("two.sided", "less", "greater"), conf.level = 0.95)

where:

x = number of successes

n = number of trials

p = the hypothesized probability of success

alternative = indicates whether the alternative hypothesis is directional ("less" or "greater") or nondirectional ("two.sided")

conf.level = the confidence level for the returned confidence interval

In the example as described above we have 8 pluses and 2 minuses.  We will use the "two.sided" option for the alternative hypothesis, a probability of success of .50, and a conf.level of .95. The following is entered into R:


> binom.test(8, 10, p=.5, alternative="two.sided", conf.level=.95)

```
	Exact binomial test

data:  8 and 10
number of successes = 8, number of trials = 10, p-value = 0.1094
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.4439045 0.9747893
sample estimates:
probability of success
                   0.8
```

Under a nondirectional alternative hypothesis we are testing the probability of obtaining 0, 1, 2, 8, 9, 10 pluses or:

p(0, 1, 2, 8, 9, 10 pluses) = .1094

If we had set an alpha of α = .05 then we would retain the null hypothesis, since p(obt) > .05.  We could not conclude that the experimental criterion has an effect on behavior.  R has many other nonparametric statistical tests that are easy to use from the command line.  These are topics for future tutorials.
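The reported p-value can also be checked by hand from the binomial distribution: summing the point probabilities of 0, 1, 2, 8, 9, and 10 pluses with dbinom reproduces the figure returned by binom.test.

```r
# Two-tailed sign test probability: 0, 1, 2, 8, 9, or 10 pluses in 10 trials
p.obt <- sum(dbinom(c(0, 1, 2, 8, 9, 10), size = 10, prob = 0.5))
round(p.obt, 4)   # 0.1094
```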

More to Follow:

# Using R for Basic Cross Tabulation Analysis: Part Three, Using the xtabs Function


A tutorial by D. M. Wiig
In Part Two of this series of tutorials I discussed how to find and import a data set from the NORC GSS survey. The focus of that tutorial was on the GSS2010 data set that was imported into the R workspace in SPSS format and then loaded into an R data frame for analysis.

Use the following code to load the data set into an R workspace:

>install.packages("Hmisc") #needed for file import
>install.packages("foreign") #needed for file import
>#get spss gss file and put into data frame
>library(Hmisc)
>gssdataframe <- spss.get("/path-to-your-file/GSS2010.sav", use.value.labels=TRUE)

The xtabs function provides a quick way to generate and view a cross tabulation of two variables and allows the user to specify one or more control variables in the cross tabulation. Using the variables "partyid" and "polviews" the cross tabulation is generated with:

>#use xtabs to produce a table
>gsstab <- xtabs(~ partyid + polviews, data=gssdataframe)

To view the resulting table use:

>gsstab #show table

To view summary statistics generated use:

>summary(gsstab)

This summary shows the number of cases in the table, the number of factors and the Chi-square value for the table.
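Beyond summary, the marginal totals and cell proportions of an xtabs result can be inspected with margin.table and prop.table. Since the gsstab object requires the GSS file, this sketch uses a small made-up data frame in its place:

```r
# Sketch: marginals and proportions of an xtabs table (made-up data)
df <- data.frame(partyid  = c("Dem", "Dem", "Rep", "Rep", "Ind", "Dem"),
                 polviews = c("Lib", "Mod", "Con", "Mod", "Mod", "Lib"))
tab <- xtabs(~ partyid + polviews, data = df)
margin.table(tab, 1)   # row totals (partyid)
margin.table(tab, 2)   # column totals (polviews)
prop.table(tab)        # each cell as a proportion of all cases
```

The same three calls work unchanged on a table built from the GSS variables.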

Variables used in social science research are often interrelated so it is desirable to control for one or more variables in order to further examine the variables of interest. The table created in the gsstab data frame shows the relationship between political ideology and political party affiliation. To look at the relationship by gender use the following:

>#use xtabs to produce a table with a control variable
>gsstab2 <- xtabs(~ partyid + polviews + sex, data=gssdataframe)

To view the new table use:

>gsstab2

To view summary statistics for the table enter:

>summary(gsstab2)
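Printing a three-way table such as gsstab2 produces one layer per category of the control variable. The ftable function flattens those layers into a single compact display; a minimal sketch with made-up data:

```r
# Sketch: flattening a three-way xtabs table with ftable (made-up data)
df <- data.frame(partyid  = c("Dem", "Rep", "Dem", "Rep"),
                 polviews = c("Lib", "Con", "Mod", "Mod"),
                 sex      = c("MALE", "FEMALE", "FEMALE", "MALE"))
tab3 <- xtabs(~ partyid + polviews + sex, data = df)
ftable(tab3)   # last variable (sex) spread across the columns
```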

As noted above xtabs is a quick and powerful function for creating N x N tables with or without control variables. In the next tutorial I explore the use of the ca function to produce a basic correspondence analysis of underlying dimensions in an N x N table.

# Using R to Work with GSS Survey Data: Cross Tabulation Tables


A tutorial by D. M. Wiig

In a previous tutorial I discussed how to import datasets from the NORC General Social Science Survey using R to write the SPSS formatted data to an R data frame. Once the data has been imported into the R working environment it can be viewed and analyzed. There is a wealth of survey research data available at the NORC web site located at www.norc.org. In this tutorial the dataset gss2010.sav will be used. The dataset is available from www3.norc.org/GSS+Website.

From that page click on the “Quick Downloads” link on the right hand side of the page to access the list of available datasets. From the next page choose SPSS to access ‘.sav’ format files and finally “2010” under the heading “GSS 1972-2012 Release 6.” Please note that this is a rather large data file with 2044 observations of 794 variables. Download the file to a directory that you can access from your R console.

As discussed in a previous tutorial the SPSS format file can be loaded into an R data frame. Make sure that the R packages Hmisc and foreign have been installed and loaded before attempting to import the SPSS file. The following code will load the ‘.sav’ file:

>install.packages("Hmisc") #needed for file import

>install.packages("foreign") #needed for file import

>#get spss gss file and put into data frame

>library(Hmisc)

>gssdataframe <- spss.get("/path-to-your-file/GSS2010.sav", use.value.labels=TRUE)

Once the file is read into an R data frame it can be viewed in a spreadsheet like interface by using the command:

>View(gssdataframe)

Using the arrow keys, the home key, end key, and the page up and page down keys allows navigating and browsing the file.

Survey data such as that found in the GSS file is usually a mixture of data types ranging from ratio level numbers to categorical data. Cross tabulations are often used to explore relationships among variables that are ordinal or categorical in nature. R has a number of functions available for cross tabulations. The table function is a quick way to generate a cross tabulation table with a number of options available. The following results in a frequency table of the variables "partyid" and "polviews", both of which are measured in categories:

>#use the gssdataframe

>#the variables partyid and polviews are used

>attach(gssdataframe)

>#create a table named ‘gsstable’

>gsstable <- table(partyid, polviews)

>gsstable #print table frequencies

The following output results:

```                   polviews
partyid              EXTREMELY LIBERAL LIBERAL SLIGHTLY LIBERAL MODERATE
STRONG DEMOCRAT                   41     105               42       94
NOT STR DEMOCRAT                  14      62               57      154
IND,NEAR DEM                      11      47               57      103
INDEPENDENT                        5      20               33      189
IND,NEAR REP                       1       4               16       74
NOT STR REPUBLICAN                 2      10               16       88
STRONG REPUBLICAN                  0       5                5       22
OTHER PARTY                        1       5                6       16
polviews
partyid              SLGHTLY CONSERVATIVE CONSERVATIVE EXTRMLY CONSERVATIVE
STRONG DEMOCRAT                      22           25                    6
NOT STR DEMOCRAT                     28           16                    7
IND,NEAR DEM                         25           11                    5
INDEPENDENT                          43           32                    9
IND,NEAR REP                         49           43                    8
NOT STR REPUBLICAN                   72           72                   13
STRONG REPUBLICAN                    23          101                   27
OTHER PARTY                           3           12                    4```
 >

There are options available with the table function that include calculating row and column marginal totals as well as cell percentages. Another quick method to generate tables is the CrossTable function, contained in the gmodels package, which can be used on the table generated with the table function above. Use the following lines of code to generate a cross table between 'polviews' and 'partyid' using the gsstable created above:

>library(gmodels)

>#produce basic crosstabs

>CrossTable(gsstable, prop.t=FALSE, prop.r=FALSE, prop.c=FALSE, chisq=TRUE, format=c("SPSS"))


```Cell Contents
|-------------------------|
|                   Count |
| Chi-square contribution |
|-------------------------|

Total Observations in Table:  1961

| polviews
partyid |    EXTREMELY LIBERAL  |              LIBERAL  |     SLIGHTLY LIBERAL  |             MODERATE  | SLGHTLY CONSERVATIVE  |         CONSERVATIVE  | EXTRMLY CONSERVATIVE  |            Row Total |
-------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|
STRONG DEMOCRAT |                  41  |                 105  |                  42  |                  94  |                  22  |                  25  |                   6  |                 335  |
|              62.014  |              84.219  |               0.141  |               8.312  |              11.962  |              15.026  |               4.163  |                      |
-------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|
NOT STR DEMOCRAT |                  14  |                  62  |                  57  |                 154  |                  28  |                  16  |                   7  |                 338  |
|               0.089  |               6.911  |               7.238  |               5.486  |               6.840  |              26.537  |               3.215  |                      |
-------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|
IND,NEAR DEM |                  11  |                  47  |                  57  |                 103  |                  25  |                  11  |                   5  |                 259  |
|               0.121  |               4.902  |              22.674  |               0.284  |               2.857  |              22.144  |               2.830  |                      |
-------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|
INDEPENDENT |                   5  |                  20  |                  33  |                 189  |                  43  |                  32  |                   9  |                 331  |
|               4.634  |              12.733  |               0.969  |              32.889  |               0.067  |               8.107  |               1.409  |                      |
-------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|
IND,NEAR REP |                   1  |                   4  |                  16  |                  74  |                  49  |                  43  |                   8  |                 195  |
|               5.592  |              18.279  |               2.167  |               0.002  |              19.466  |               4.622  |               0.003  |                      |
-------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|
NOT STR REPUBLICAN |                   2  |                  10  |                  16  |                  88  |                  72  |                  72  |                  13  |                 273  |
|               6.824  |              18.702  |               8.224  |               2.190  |              33.411  |              18.786  |               0.364  |                      |
-------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|
STRONG REPUBLICAN |                   0  |                   5  |                   5  |                  22  |                  23  |                 101  |                  27  |                 183  |
|               6.999  |              15.115  |              12.805  |              32.065  |               0.121  |             177.476  |              52.256  |                      |
-------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|
OTHER PARTY |                   1  |                   5  |                   6  |                  16  |                   3  |                  12  |                   4  |                  47  |
|               0.354  |               0.227  |               0.035  |               0.170  |               1.768  |               2.735  |               2.344  |                      |
-------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|
Column Total |                  75  |                 258  |                 232  |                 740  |                 265  |                 312  |                  79  |                1961  |
-------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|

Statistics for All Table Factors

Pearson's Chi-squared test
------------------------------------------------------------
Chi^2 =  801.8746     d.f. =  42     p =  3.738705e-141

Minimum expected frequency: 1.797552
Cells with Expected Frequency < 5: 2 of 56 (3.571429%)

Warning message:
In chisq.test(t, correct = FALSE, ...) :
Chi-squared approximation may be incorrect```
 >

This code produces a table of frequencies along with a basic Chi-squared test. Other options include generating cell percentages and using either SPSS or SAS table format. This is accomplished by changing the appropriate flag from FALSE to TRUE and specifying either SPSS or SAS for the format flag. The table formatting is compressed in this example due to the narrow margin requirements of the web page.  Use the scroll bar at the bottom of the page to view the entire table.

There are many functions available in R to analyze data in tabular format. In my next tutorial I will examine using the xtabs function to produce basic cross tabulation with control variables.

# R Tutorial: Using R to Work With Datasets From the NORC General Social Science Survey


A tutorial by D. M. Wiig

Part One:

When I teach classes in social science statistics and social science research methods I like to use "live" data as much as possible, both in classroom lectures and in homework assignments. For the social sciences one excellent and readily available source of live data is the ongoing General Social Science Survey project, The National Data Program for the Sciences. This is a project of NORC, a national science research center at the University of Chicago (see www.norc.org for the project's main web site).

There are a number of datasets available in different formats. The quick download datasets that I like to use are primarily SPSS data files. Many institutions have SPSS available for students and faculty, but the use of SPSS is by no means universal. I have found that it is easy to use R to read the .sav format files into an R data frame and then write the file out to a comma separated value (.csv) format that can be read by almost any statistics software package. As I will discuss in this and future tutorials it is also quite effective to use R to analyze the GSS files.

To create R datasets using the GSS files we can use some of the file import/export features available in R. To begin, make sure that the R packages “Hmisc” and “foreign” are installed and loaded in your R session environment. This can be accomplished using:

> install.packages("Hmisc") #needed for file import

> install.packages("foreign") #needed for file import

As an example, the following code will load the GSS data file “gss2010x.sav” into an R data frame using the spss.get function:

>library(Hmisc)

>gssdataframe <- spss.get("/path-to-your-file/gss2010x.sav", use.value.labels=TRUE)

The file "gss2010x.sav" contains 500 observations of 47 variables. Codebooks and other information about the data in these datasets are readily available for download from the NORC web site. After the data is loaded into the data frame it can be viewed using:

>gssdataframe

To convert and save the file to a comma separated value (.csv) format use the write.table function:

>#write dataframe to .csv file

>write.table(gssdataframe, "/path-to-your-file/gss2010x.csv", sep=",")

The file, now in .csv format, can be accessed with virtually any statistics package or other software. In my next tutorial I will discuss working with GSS data using the various table and cross table functions available in R.
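One quick way to confirm the export is to read the .csv file back in with read.csv. This sketch uses a small made-up data frame and a temporary file in place of the GSS path:

```r
# Sketch: write a data frame to .csv and read it back to verify the round trip
df <- data.frame(id = 1:3, age = c(34.5, 51.25, 27.75))
csvfile <- tempfile(fileext = ".csv")
write.table(df, csvfile, sep = ",")
df2 <- read.csv(csvfile)   # row names come back from the first column
head(df2)
```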

# Using R for Nonparametric Statistical Analysis: Nonparametric Correlation


A Tutorial by D.M. Wiig

In previous tutorials I discussed how to download and install R on the Debian Linux operating system and how to use R to perform Kendall's concordance analysis. This tutorial explores some basic R commands to open a built-in dataset, produce a simple scatter plot of the data, and perform a nonparametric correlation using Kendall's and Spearman's rank order correlations. Before beginning this tutorial open a terminal window and start R.

One of the packages that is downloaded with the R distribution is called "datasets." One of the files in the package, USJudgeRatings, contains a data frame of lawyers' ratings of 43 state judges on 12 numeric variables. Since the scale used in these ratings is ordinal it is appropriate to use rank order correlation to analyze the data. To examine the data in the USJudgeRatings file use the command sequence:

> data(USJudgeRatings, package="datasets")

```	> print(USJudgeRatings)

CONT INTG DMNR DILG CFMG DECI PREP FAMI ORAL WRIT PHYS RTEN
AARONSON,L.H.    5.7  7.9  7.7  7.3  7.1  7.4  7.1  7.1  7.1  7.0  8.3  7.8
ALEXANDER,J.M.   6.8  8.9  8.8  8.5  7.8  8.1  8.0  8.0  7.8  7.9  8.5  8.7
ARMENTANO,A.J.   7.2  8.1  7.8  7.8  7.5  7.6  7.5  7.5  7.3  7.4  7.9  7.8
BERDON,R.I.      6.8  8.8  8.5  8.8  8.3  8.5  8.7  8.7  8.4  8.5  8.8  8.7
BRACKEN,J.J.     7.3  6.4  4.3  6.5  6.0  6.2  5.7  5.7  5.1  5.3  5.5  4.8
BURNS,E.B.       6.2  8.8  8.7  8.5  7.9  8.0  8.1  8.0  8.0  8.0  8.6  8.6
CALLAHAN,R.J.   10.6  9.0  8.9  8.7  8.5  8.5  8.5  8.5  8.6  8.4  9.1  9.0```

……………

You will see all 43 cases in the output. To save space here I have just shown a portion of the output. Please note that file names in R are case sensitive so be sure to use capital letters where shown.

The basic R distribution has fairly extensive graphing capabilities. To produce a simple scatter diagram of the variables PHYS and RTEN that graphs RTEN on the X axis and PHYS on the Y axis use the following line of code:

`	> plot(PHYS~RTEN, log="xy", data=USJudgeRatings)`

You should see a scatter plot similar to the one below: (yours will be larger, I reduced this to save space)

[Scatter plot of PHYS against RTEN not shown in this HTML markup.]

We can perform a correlation analysis on the data using either Kendall’s rank order correlation or Spearman’s Rho. For a Kendall correlation make sure the file USJudgeRatings is loaded into memory by using the command:

>data(USJudgeRatings, package="datasets")

Now perform the analysis with the command:

> cor(USJudgeRatings[,c("PHYS","RTEN")], use="complete.obs", method="kendall")

```   	       PHYS      RTEN
PHYS 1.0000000 0.7659126
RTEN 0.7659126 1.0000000```

As seen above we specify the two variables we want to correlate and indicate that all complete observations are to be used. Running a Spearman's on the same variables is a matter of changing the "method =" designator:

> cor(USJudgeRatings[,c("PHYS","RTEN")], use="complete.obs", method="spearman")

```             PHYS      RTEN
PHYS 1.0000000 0.9031373
RTEN 0.9031373 1.0000000```

To produce a Kendall correlation matrix of 10 of the 12 variables use:

```> cor(USJudgeRatings[,c("CONT","INTG","DMNR","DILG","CFMG", "DECI",
+                       "ORAL","WRIT","PHYS","RTEN")], use="complete.obs", method="kendall")
CONT       INTG       DMNR         DILG       CFMG       DECI
CONT  1.000000000 -0.1203440 -0.1162402 -0.001142206 0.09409104 0.05498285
INTG -0.120344017  1.0000000  0.8607446  0.689935415 0.60919580 0.64371783
DMNR -0.116240241  0.8607446  1.0000000  0.662117755 0.60801429 0.63320857
DILG -0.001142206  0.6899354  0.6621178  1.000000000 0.86484298 0.89194190
CFMG  0.094091035  0.6091958  0.6080143  0.864842984 1.00000000 0.91212083
DECI  0.054982854  0.6437178  0.6332086  0.891941895 0.91212083 1.00000000
ORAL -0.027381743  0.7451506  0.7272732  0.859909442 0.82495629 0.83952698
WRIT -0.028474100  0.7187820  0.6942712  0.877775007 0.83497447 0.85064096
PHYS -0.066667371  0.6309756  0.6296740  0.752740177 0.72853135 0.77215650
RTEN -0.021652594  0.8013829  0.7979569  0.822527726 0.76344652 0.80206419
ORAL       WRIT        PHYS        RTEN
CONT -0.02738174 -0.0284741 -0.06666737 -0.02165259
INTG  0.74515064  0.7187820  0.63097556  0.80138292
DMNR  0.72727320  0.6942712  0.62967404  0.79795687
DILG  0.85990944  0.8777750  0.75274018  0.82252773
CFMG  0.82495629  0.8349745  0.72853135  0.76344652
DECI  0.83952698  0.8506410  0.77215650  0.80206419
ORAL  1.00000000  0.9596834  0.79429138  0.90227331
WRIT  0.95968339  1.0000000  0.77463199  0.85309146
PHYS  0.79429138  0.7746320  1.00000000  0.76591261
RTEN  0.90227331  0.8530915  0.76591261  1.00000000```
 >

If the data you are using is measured at the interval or ratio level just change the "method=" designator to "pearson" to produce a product-moment correlation.
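The cor function returns only the coefficient. To obtain a significance test of the rank-order correlation, the cor.test function can be run on the same pair of variables (with tied ranks R will warn that the p-value is approximate rather than exact):

```r
# Kendall correlation with a significance test on the built-in dataset
data(USJudgeRatings, package = "datasets")
cor.test(USJudgeRatings$PHYS, USJudgeRatings$RTEN, method = "kendall")
# tau is about 0.77, in line with the cor output above, with a very small p-value
```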

More to Come:

# Installing R Version 3.0.3 on Ubuntu Linux

A Tutorial by D.M. Wiig

I have several computers that use Linux operating systems and I have installed R on all of them. I use Debian on some of the machines and Ubuntu on others. When downloading R using the distribution's package manager or from the command line I have noticed that I will get versions of R ranging from 2.13.x to 2.15.x depending on the Linux distribution. That was not a problem until the release of the current version of R, version 3.0.3. Since this version is not backwards compatible with earlier releases it is necessary to upgrade to the new version to take advantage of new packages that are rapidly being developed, as well as modifications to existing packages that accommodate R 3.0.3. This tutorial will cover the installation of R 3.0.3 on the Ubuntu distribution of Linux.

When installing R 3.0.3 it is necessary to make sure that the current binaries are installed for your version of the Linux OS. If you are running an Ubuntu distribution you can edit the sources.list file on your computer to access the most up-to-date CRAN mirrors. Open a terminal program and enter the following from the command line:

$ cd /etc/apt/

$ dir

Make sure the file sources.list is in the directory and then open the file in the nano editor:

$ sudo nano sources.list

You should see a file in the editor that is similar to the file shown below:

—————————————————————————————————-

deb cdrom:[Kubuntu 11.10 _Oneiric Ocelot_ – Release i386 (20111012)]/ oneiric main restricted

# newer versions of the distribution.

deb http://us.archive.ubuntu.com/ubuntu/ precise main restricted

deb-src http://us.archive.ubuntu.com/ubuntu/ precise main restricted

## Major bug fix updates produced after the final release of the

## distribution.

## N.B. software from this repository is ENTIRELY UNSUPPORTED by the Ubuntu

## team. Also, please note that software in universe WILL NOT receive any

## review or updates from the Ubuntu security team.

deb http://us.archive.ubuntu.com/ubuntu/ precise universe

deb-src http://us.archive.ubuntu.com/ubuntu/ precise universe

## N.B. software from this repository is ENTIRELY UNSUPPORTED by the Ubuntu

^G Get Help ^O WriteOut ^R Read File ^Y Prev Page ^K Cut Text ^C Cur Pos

^X Exit ^J Justify ^W Where Is ^V Next Page ^U UnCut Text ^T To Spell

—————————————————————————

If you have an earlier version of R installed, Linux will not normally search the CRAN repository for updates, so a line pointing at a CRAN mirror must be added to this file to force access to the latest version of R. For an Ubuntu distribution add one of the following lines, depending on the release that you have installed:

deb http://<myfavorite-cran-mirror>/bin/linux/ubuntu saucy/

deb http://<myfavorite-cran-mirror>/bin/linux/ubuntu quantal/

deb http://<myfavorite-cran-mirror>/bin/linux/ubuntu precise/

deb http://<myfavorite-cran-mirror>/bin/linux/ubuntu lucid/

Replace <myfavorite-cran-mirror> with the CRAN repository of your choice, found at http://cran.r-project.org/mirrors.html. In my case I used a CRAN repository here in Iowa at Iowa State University. Once the line has been entered in your sources.list file press ctrl-o to save the file, and press ctrl-x to exit the editor. Be sure when you invoke nano that you have root privileges (by using sudo nano) or you will not be able to write out the modified file.

Once you have successfully modified the sources.list file proceed with the R 3.0.3 installation by issuing the command:

$ sudo apt-get update

to make sure all supporting package lists are current, and then:

$ sudo apt-get install r-base

When the update runs you should see that R 3.0.x is downloaded and is being installed. After the installation is complete test it by issuing the command:

$ R

You will see the output as shown below:

———————————————

R version 3.0.3 (2014-03-06) -- "Warm Puppy"
Copyright (C) 2014 The R Foundation for Statistical Computing
Platform: i686-pc-linux-gnu (32-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

>

You are now up and running with the latest version of R. The process for installation of R 3.0.x is similar for Debian and Fedora distributions. Each of these will be covered in a future tutorial.
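You can also confirm which version is running from inside R itself; a minimal sketch using only base R (the exact string printed will depend on your installation):

```r
# Report the running R version; R.version.string is part of base R
ver <- R.version.string
print(ver)
# the major and minor components are also available individually:
cat(R.version$major, R.version$minor, "\n")
```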

# Using R in Nonparametric Statistics: Basic Table Analysis, Part Two

Using R in Nonparametric Statistics: Basic Table Analysis, Part Two

A Tutorial by D.M. Wiig

As discussed in a previous tutorial, one of the most common methods of displaying and analyzing data is through the use of tables. In this tutorial I will discuss setting up a basic table using R and exploring the use of the CrossTable function that is available in the R ‘gmodels’ package. I will use the same hypothetical data table that I created in Part One of this tutorial, data that examines the relationship between income and political party identification among a group of registered voters. The variable “income” will be considered ordinal in nature and consists of categories of income in thousands as follows:

“< 25”; “25-50”; “51-100” and “>100”

Political party identification is nominal in nature with the following categories:

“Dem”, “Rep”, “Indep”

Frequency counts of individuals that fall into each category are numeric. In the first example we will create a table by entering the data as a data frame and displaying the results. When using this method it is a good idea to set up the table on paper before entering the data into R. This will help to make sure that all cases and factors are entered correctly. The table I want to generate will look like this:

         party
income    Dem  Rep  Indep
<25        15    5     10
25-50      20   15     15
51-100     10   20     10
>100        5   30     10

When using the CrossTable() function the data should be entered in matrix format. Enter the data from the table above as follows:

>#enter data as a table matrix creating the variable ‘Partyid’
>#enter the frequencies
>Partyid <-matrix(c(15,20,10,5, 5,15,20,30, 10,15,10,10),4,3)
>#enter the row and column dimension names and categories
>dimnames(Partyid) = list(income=c("<25","25-50","51-100",">100"), party=c("Dem","Rep","Indep"))
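Before analyzing the table it is worth sanity-checking the marginal totals. The sketch below rebuilds the same matrix (repeated here so the snippet can be pasted on its own) and checks the row, column, and grand totals:

```r
# Rebuild the Partyid matrix and verify its marginal totals
Partyid <- matrix(c(15,20,10,5, 5,15,20,30, 10,15,10,10), 4, 3)
dimnames(Partyid) <- list(income=c("<25","25-50","51-100",">100"),
                          party=c("Dem","Rep","Indep"))
rowSums(Partyid)   # income marginals: 30 50 40 45
colSums(Partyid)   # party marginals:  50 70 45
sum(Partyid)       # grand total: 165
```

If any marginal disagrees with your source table, a frequency was likely entered in the wrong position; recall that matrix() fills by columns.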

To view the structure of the created data matrix use the command:

> str(Partyid)
num [1:4, 1:3] 15 20 10 5 5 15 20 30 10 15 …
– attr(*, “dimnames”)=List of 2
..$ income: chr [1:4] “<25” “25-50” “51-100” “>100”
..$ party : chr [1:3] “Dem” “Rep” “Indep”
>

To view the table use the command:

> Partyid
         party
income    Dem  Rep  Indep
<25        15    5     10
25-50      20   15     15
51-100     10   20     10
>100        5   30     10
>

Remember that R is case sensitive so make sure you use upper case if you named your variable ‘Partyid.’

Once the table has been entered as a matrix it can be displayed with a number of available options using the CrossTable() function. In this example I will produce a table in SAS format (the default), display both observed and expected cell frequencies, show the proportion of the chi-square total contributed by each cell, and report the results of the chi-square analysis. The script is:
> #make sure gmodels package is loaded
> require(gmodels)
> #CrossTable analysis
> CrossTable(Partyid,prop.t=FALSE,prop.r=FALSE,prop.c=FALSE,expected=TRUE,chisq=TRUE,prop.chisq=TRUE)

Cell Contents
|-------------------------|
|                       N |
|              Expected N |
| Chi-square contribution |
|-------------------------|

Total Observations in Table:  165

             | party
      income |       Dem |       Rep |     Indep | Row Total |
-------------|-----------|-----------|-----------|-----------|
         <25 |        15 |         5 |        10 |        30 |
             |     9.091 |    12.727 |     8.182 |           |
             |     3.841 |     4.692 |     0.404 |           |
-------------|-----------|-----------|-----------|-----------|
       25-50 |        20 |        15 |        15 |        50 |
             |    15.152 |    21.212 |    13.636 |           |
             |     1.552 |     1.819 |     0.136 |           |
-------------|-----------|-----------|-----------|-----------|
      51-100 |        10 |        20 |        10 |        40 |
             |    12.121 |    16.970 |    10.909 |           |
             |     0.371 |     0.541 |     0.076 |           |
-------------|-----------|-----------|-----------|-----------|
        >100 |         5 |        30 |        10 |        45 |
             |    13.636 |    19.091 |    12.273 |           |
             |     5.470 |     6.234 |     0.421 |           |
-------------|-----------|-----------|-----------|-----------|
Column Total |        50 |        70 |        45 |       165 |
-------------|-----------|-----------|-----------|-----------|

Statistics for All Table Factors

Pearson’s Chi-squared test
------------------------------------------------------------
Chi^2 =  25.55608     d.f. =  6     p =  0.0002692734

>
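The chi-square figure can be cross-checked against base R’s chisq.test(), which needs no extra packages. A quick sketch (the matrix is rebuilt here so the snippet stands alone):

```r
# Cross-check the CrossTable chi-square with base R's chisq.test
Partyid <- matrix(c(15,20,10,5, 5,15,20,30, 10,15,10,10), 4, 3,
                  dimnames=list(income=c("<25","25-50","51-100",">100"),
                                party=c("Dem","Rep","Indep")))
result <- chisq.test(Partyid)
result$statistic   # X-squared = 25.55608, as reported by CrossTable
result$parameter   # df = 6
result$p.value     # 0.0002692734
```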

As seen above row marginal totals and column marginal totals are displayed by default with the SAS format. There are other options available for the CrossTable() function. See the CRAN documentation for a detailed description of all of the options available. In the next installment of this tutorial I will examine some of the measures of association that are available in R for nominal and ordinal data displayed in a table format.

# Using R in Nonparametric Statistics: Basic Table Analysis, Part One

Using R in Nonparametric Statistics: Basic Table Analysis, Part One

A Tutorial by D.M. Wiig
One of the most common methods displaying and analyzing data is through the use of tables. In this tutorial I will discuss setting up a basic table using R and performing an initial Chi-Square test on the table. R has an extensive set of tools for manipulating data in the form of a matrix, table, or data frame. The package ‘vcd’ is specifically designed to provide tools for table analysis. Before beginning this tutorial open an R session in your terminal window. You can install the vcd package using the following command:

>install.packages("vcd")

In social science research we often use data that is nominal or ordinal in nature. Data is displayed in categories with associated frequency counts. In this tutorial I will use a set of hypothetical data that examines the relationship between income and political party identification among a group of registered voters. The variable “income” will be considered ordinal in nature and consists of categories of income in thousands as follows:

“< 25”; “25-50”; “51-100” and “>100”

Political party identification is nominal in nature with the following categories:

“Dem”, “Rep”, “Indep”

Frequency counts of individuals that fall into each category are numeric. In the first example we will create a table by entering the data as a data frame and displaying the results. When using this method it is a good idea to set up the table on paper before entering the data into R. This will help to make sure that all cases and factors are entered correctly. The table I want to generate will look like this:

         party
income    Dem  Rep  Indep
<25        15    5     10
25-50      20   15     15
51-100     10   20     10
>100        5   30     10

To enter the above into a data frame use the following on the command line:

> partydata <- data.frame(expand.grid(income=c("<25","25-50","51-100",">100"), party=c("Dem","Rep","Indep")), count=c(15,20,10,5, 5,15,20,30, 10,15,10,10))
>

Make sure the syntax is exactly as shown and that the entire script is on one line or wraps automatically to the next line in your R console. When the command runs without error you can view the data by entering:

> partydata

The following output is produced:

> partydata
   income party count
1     <25   Dem    15
2   25-50   Dem    20
3  51-100   Dem    10
4    >100   Dem     5
5     <25   Rep     5
6   25-50   Rep    15
7  51-100   Rep    20
8    >100   Rep    30
9     <25 Indep    10
10  25-50 Indep    15
11 51-100 Indep    10
12   >100 Indep    10
>

At this point the data is in frequency rather than table or matrix form. To view a summary of information about the data use the command:

>str(partydata)

You will see:

> str(partydata)
‘data.frame’: 12 obs. of 3 variables:
$ income: Factor w/ 4 levels “<25”,“25-50”,..: 1 2 3 4 1 2 3 4 1 2 …
$ party : Factor w/ 3 levels “Dem”,“Rep”,“Indep”: 1 1 1 1 2 2 2 2 3 3 …
$ count : num 15 20 10 5 5 15 20 30 10 15 …

To convert the data into tabular format use the xtabs function to perform a cross tabulation. I have named the resulting table “tabs”:

>tabs <- xtabs(count ~income + party, data=partydata)

To view the resulting table use:

> tabs
         party
income    Dem  Rep  Indep
<25        15    5     10
25-50      20   15     15
51-100     10   20     10
>100        5   30     10
>
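With the counts in table form, base R’s prop.table() converts them to proportions. The sketch below rebuilds the data frame and table so it runs on its own, then shows row proportions, i.e. the party split within each income category:

```r
# Build the table and show row proportions (margin=1 normalizes each row)
partydata <- data.frame(expand.grid(income=c("<25","25-50","51-100",">100"),
                                    party=c("Dem","Rep","Indep")),
                        count=c(15,20,10,5, 5,15,20,30, 10,15,10,10))
tabs <- xtabs(count ~ income + party, data=partydata)
round(prop.table(tabs, margin=1), 3)   # each row sums to 1
```

Using margin=2 instead would give column proportions, and omitting the margin argument gives cell proportions of the grand total.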

This produces a table in the desired format. To do a quick analysis of the table that produces a Chi-square statistic use the command:

> summary(tabs)

The output is

> summary(tabs)
Call: xtabs(formula = count ~ income + party, data = partydata)
Number of cases in table: 165
Number of factors: 2
Test for independence of all factors:
Chisq = 25.556, df = 6, p-value = 0.0002693
>

In future tutorials I will discuss many of the other resources that are available with the vcd package for manipulating and analyzing data in a tabular format.

# Using R in Nonparametric Statistics: Basic Table Analysis, Part Three, Using assocstats and collapse.table

A tutorial by D.M. Wiig

As discussed in a previous tutorial, one of the most common methods of displaying and analyzing data is through the use of tables. In this tutorial I will discuss setting up a basic table using R and exploring the use of the assocstats function to generate several commonly used nonparametric measures of association. The assocstats function will generate the association measures of the Phi-coefficient, the Contingency Coefficient and Cramer’s V, in addition to the Likelihood Ratio and Pearson’s Chi-Squared for independence. Cramer’s V and the Contingency Coefficient are commonly applied to r x c tables while the Phi-coefficient is used in the case of dichotomous variables in a 2 x 2 table.

To illustrate the use of assocstats I will use hypothetical data exploring the relationship between level of education and average annual income. Education will be measured using the nominal categories “High School”, “College”, and “Graduate”. Average annual income will be measured using ordinal categories and expressed in thousands:

“< 25”; “25-50”; “51-100” and “>100”

Frequency counts of individuals that fall into each category are numeric.

In the first example a 4 x 3 table is created with hypothetical frequencies as shown below:

                Education
Income     HS   College   Graduate
<25        15         8          5
25-50      12        12          8
51-100     10        22         25
>100        5        10         32

The first table, table1, is entered into R as a data frame using the following commands:

> #create 4 x 3 data frame
> #enter table1 in frequency form
> table1 <- data.frame(expand.grid(income=c("<25","25-50","51-100",">100"), education=c("HS","College","Graduate")), count=c(15,12,10,5, 8,12,22,10, 5,8,25,32))

Check to make sure the data are in the right row and column categories. Notice that the data are entered in the ‘count’ list by columns.

> table1
    income education count
1      <25        HS    15
2    25-50        HS    12
3   51-100        HS    10
4     >100        HS     5
5      <25   College     8
6    25-50   College    12
7   51-100   College    22
8     >100   College    10
9      <25  Graduate     5
10   25-50  Graduate     8
11  51-100  Graduate    25
12    >100  Graduate    32
>

If the table structure looks correct generate the table, tab1, using the xtabs function:

> #create table tab1 from data.frame
> tab1 <- xtabs(count ~income + education, data=table1)
Show the table using the command:

>tab1
        education
income     HS  College  Graduate
<25        15        8         5
25-50      12       12         8
51-100     10       22        25
>100        5       10        32
>
Use the assocstats function to generate measures of association for the table. Make sure that you have loaded the vcd and vcdExtra packages. Run assocstats with the following commands:

> assocstats(tab1)
                    X^2 df   P(> X^2)
Likelihood Ratio 31.949  6 1.6689e-05
Pearson          32.279  6 1.4426e-05

Phi-Coefficient   : 0.444
Contingency Coeff.: 0.406
Cramer’s V        : 0.314
>

The measures show an association between the two variables. My intent is not to provide an analysis of how to evaluate each of the measures; there are excellent sources of documentation on each measure of association in the CRAN literature. Since the Phi-coefficient is designed primarily to measure association between dichotomous variables in a 2 x 2 table, collapse the 4 x 3 table using the collapse.table function to get a more accurate Phi-coefficient. Since we want to go from a 4 x 3 to a 2 x 2 table we essentially collapse the table in two stages. The first stage collapses the table to a 2 x 3 table by combining the “<25” with the “25-50” and the “51-100” with the “>100” categories of income.

The resulting 2 x 3 table is seen below:

           Education
Income     HS  College  Graduate
<50        27       20        13
>50        15       32        57

To collapse the table use the R function collapse.table to combine the “<25” and “25-50” categories and the “51-100” and “>100” categories as discussed above:

> #collapse table tab1 to a 2 x 3 table, table2
> table2 <-collapse.table(tab1, income=c("<50","<50",">50",">50"))

View the resulting table, table2, with:

> table2
       education
income  HS  College  Graduate
<50     27       20        13
>50     15       32        57
>
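Collapsing a table this way simply sums the rows being merged, which can be verified with base R alone. A sketch (tab1 is rebuilt here so the snippet is self-contained; table2_manual is my own name for the check):

```r
# Manual equivalent of collapse.table: sum the income rows being merged
tab1 <- matrix(c(15,12,10,5, 8,12,22,10, 5,8,25,32), 4, 3,
               dimnames=list(income=c("<25","25-50","51-100",">100"),
                             education=c("HS","College","Graduate")))
table2_manual <- rbind("<50"=tab1["<25",] + tab1["25-50",],
                       ">50"=tab1["51-100",] + tab1[">100",])
table2_manual   # same cell counts as the collapsed table above
```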

Now collapse the table to a 2 x 2 table by combining the “College” and “Graduate” columns:
> #collapse 2 x 3 table2 to a 2 x 2 table, table3
> table3 <-collapse.table(table2, education=c("HS","College","College"))

View the resulting table, table3, with:

> table3
       education
income  HS  College
<50     27       33
>50     15       89
>

Use the assocstats function to evaluate the 2 x 2 table:

> #use assocstats on the 2 x 2 table, table3
> assocstats(table3)
                    X^2 df   P(> X^2)
Likelihood Ratio 18.220  1 1.9684e-05
Pearson          18.673  1 1.5519e-05

Phi-Coefficient   : 0.337
Contingency Coeff.: 0.32
Cramer’s V        : 0.337
>
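For a 2 x 2 table the Phi-coefficient is simply sqrt(Chi^2 / N), so the value reported above can be reproduced by hand. A sketch using base R’s chisq.test with the Yates correction turned off (assocstats reports the uncorrected Pearson statistic):

```r
# Reproduce the Phi-coefficient from the collapsed 2 x 2 table
table3 <- matrix(c(27,15, 33,89), 2, 2,
                 dimnames=list(income=c("<50",">50"),
                               education=c("HS","College")))
chi2 <- chisq.test(table3, correct=FALSE)$statistic   # Pearson X^2 = 18.673
phi  <- sqrt(chi2 / sum(table3))                      # N = 164
round(unname(phi), 3)   # 0.337, matching the assocstats output
```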

There are many other table manipulation functions available in the R vcd and vcdExtra packages, as well as other packages that provide analysis of nonparametric data. This series of tutorials hopefully serves to illustrate some of the more basic and common table functions using these packages. The next tutorial looks at the use of the ca function to perform and graph the results of a basic Correspondence Analysis.