Tag Archives: r data analysis

R For Beginners: Installing the latest version of R on a Linux platform

November 11, 2016 dmwiig

R for Beginners: Installing the latest version of R on a Linux platform

A tutorial by D. M. Wiig

One of the nice characteristics of open source software such as R is the rapid development of new releases and updates. While the base distribution remains stable for a period of time, the component packages are constantly being updated, added, and removed. At the time of this writing the latest iteration is R version 3.3.1, “Bug in Your Hair.” If you are using a Windows platform you will likely go directly to the archive web site and download the latest distribution as a Windows executable installation package.

If you are using a Linux distribution such as Ubuntu or Debian, the process of adding software is usually accomplished via the menu-based installer. These software installers allow R and its dependencies to be downloaded from the community archive.

One of the disadvantages of this approach is that the archives do not always carry the latest version of a piece of software. This is often the case with R.

To ensure that you are downloading the latest R version you need to use the platform’s command line to install what is needed. Regardless of which Linux distribution you are using, first open a command console from the desktop menu. Refresh the package lists by using the command:

pi@raspberrypi:~ $ sudo apt-get update

This refreshes the package lists so that the installer knows about the newest versions available in the configured repositories; it does not by itself upgrade the packages already installed. If you are running a Debian distribution such as jessie you will need to edit the /etc/apt/sources.list file to add a backport to the latest version of R. Open the file in the nano editor by using the command:

sudo nano /etc/apt/sources.list

This should produce the output as seen below:

pi@raspberrypi:~ $ sudo nano /etc/apt/sources.list

------------------------------------------------
GNU nano 2.2.6 File: /etc/apt/sources.list

deb http://mirrordirector.raspbian.org/raspbian/ jessie main contrib non-free r$
# Uncomment line below then 'apt-get update' to enable 'apt-get source'
deb-src http://archive.raspbian.org/raspbian/ jessie main contrib non-free rpi
deb http://archive.raspbian.org/raspbian/ stretch main
deb http://mirror.las.iastate.edu/CRAN/bin/linux/debian/ jessie main
deb http://mirror.las.iastate.edu/CRAN/bin/linux/ubuntu xenial/

[ Read 8 lines ]
^G Get Help ^O WriteOut ^R Read File ^Y Prev Page ^K Cut Text ^C Cur Pos
^X Exit ^J Justify ^W Where Is ^V Next Page ^U UnCut Text^T To Spell

If you are using a Debian distribution you would add the following line to the file:

deb http://mirror.las.iastate.edu/CRAN/bin/linux/debian/ jessie main

Replace the mirror portion with <URL of your favorite CRAN mirror>.  Replace the 'jessie' portion with the name of the specific Debian distribution you are using.

If you are using an Ubuntu distribution add a line with the appropriate changes for the specific Ubuntu distribution that you are using.

Once these changes are made, use the ^O key command to write the file and then the ^X key command to exit nano and return to the command line. Run sudo apt-get update once more so that the newly added repository is read. You should then be able to issue the command:

pi@raspberrypi:~ $ sudo apt-get install r-base r-base-core r-base-dev

Once the download and install processes have completed you should now be able to invoke R from the command line or menu and see the latest version:

pi@raspberrypi:~ $ R

R version 3.3.2 RC (2016-10-23 r71578) -- "Sincere Pumpkin Patch"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: arm-unknown-linux-gnueabihf (32-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

 Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> 


For other Linux distributions that use apt you would add a line similar to the above examples to /etc/apt/sources.list. Check the documentation for your specific Linux platform for further information.
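Once R is running, the installed version can also be confirmed from inside the session instead of reading the startup banner. The lines below are a minimal sketch using base R only; the "3.3.1" threshold is simply the release named at the start of this tutorial and should be replaced with whichever version you expect.

```r
# Print the version string of the running R session
print(R.version.string)

# getRversion() returns a version object that can be compared directly
current_version <- getRversion()

# TRUE if the session is at least the release named in this tutorial
is_recent <- current_version >= "3.3.1"
print(is_recent)
```

If is_recent comes back FALSE, the distribution archive is probably still supplying R, and the sources.list entry is worth rechecking.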

R Code Development, R Tutorials

R for Beginners: Using R Commander in an Introductory Statistics Course

September 28, 2016 dmwiig

R for beginners: Using R Commander in introductory statistics courses

A tutorial by D. M. Wiig

As with previous tutorials in this series this document is an embedded Word document. To view the document full screen click on the icon in the lower right corner of the window.

R Tutorials

R Tutorial: Using R to Analyze the NORC GSS2014 Database, Selecting Subsets and Comparing Means Using Student’s t Test

June 8, 2015 dmwiig

R Tutorial Part Three: Selecting Subsets and Comparing Means Using an Independent Sample t Test

A tutorial by Douglas M. Wiig

As discussed in previous segments of this tutorial, for anyone interested in researching social science questions there is a wealth of survey data available through the National Opinion Research Center (NORC) and its associated research universities. The Center has been conducting a national survey each year since 1972 and has compiled a massive database of data from these surveys. Most if not all of these data files can be accessed and downloaded without charge. I have been working with the 2014 edition of the data and for all parts of this tutorial will use the GSS2014 data file that is available for download on the Center’s web site. (See the NORC main website at http://www.norc.org/Research/Projects/Pages/general-social-survey.aspx and at http://www3.norc.org/GSS+Website ).

Accessing and loading the NORC GSS2014 data set was discussed in part one of this tutorial. Refer to it if you need specific information on downloading the data set in STATA or SPSS format.  In this segment we will use the subset function to select a desired set of cases from all of the cases in the data file that meet certain criteria.  As indicated in my previous tutorial the GSS2014 data set contains a total of 2588 cases and 866 variables.

Before starting this segment of the tutorial be sure that the foreign package is installed and loaded into your R session.  Import the GSS 2014 data file and load it into the data frame ‘Dataset’ using:

########################################################
#import GSS2014 file in SPSS .sav format
#uses foreign package
########################################################
require(foreign)
Dataset <- read.spss("/path to your location/GSS2014.sav", 
                     use.value.labels=TRUE, max.value.labels=Inf, to.data.frame=TRUE)

###########################################################

In the previous segment of this tutorial we started to investigate whether or not an individual’s education had an effect on their response to a NORC survey item dealing with abortion. The item asked respondents to either ‘AGREE’ or ‘DISAGREE’ with the statement ‘A woman should be allowed to obtain an abortion under any circumstances.’ In the data file the responses are stored in the variable abany and coded ‘YES’ and ‘NO’. We selected a subset of all of the respondents who answered ‘YES’ and a second subset of all the respondents who answered ‘NO’ using the following code:

##############################################

#select subset from Dataset and write to data frame SS1

###################################################
SS1 <- subset(Dataset, abany == "YES", select=educ)

View(SS1)

#######################################################

######################################################
#select subset from Dataset and write to data frame SS2
######################################################
SS2 <- subset(Dataset, abany == "NO", select=educ)
View(SS2)
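To see what subset() is doing without loading the full GSS file, the same pattern can be tried on a tiny data frame. This is only a sketch: the data frame toy and its six rows are made up for illustration and are not taken from GSS2014; only the column names abany and educ mirror the tutorial.

```r
# Hypothetical mini data set standing in for the GSS2014 data frame
toy <- data.frame(
  abany = factor(c("YES", "NO", "YES", "NO", NA, "YES")),
  educ  = c(16, 12, 14, 10, 12, NA)
)

# Keep rows where abany is "YES" and return only the educ column
yes_educ <- subset(toy, abany == "YES", select = educ)

# Same selection for the "NO" respondents
no_educ <- subset(toy, abany == "NO", select = educ)

# subset() silently drops rows where the condition evaluates to NA
print(nrow(yes_educ))
print(nrow(no_educ))
```

Note that the result is still a data frame with one column, not a plain numeric vector, and that a missing educ value is carried along into the subset.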

A mean number of years of education can be calculated for each of the subsets using the following:

#calculate descriptive statistics for SS1 and SS2

####################################################

summary(SS1)

summary(SS2)

####################################################

Output from the above for SS1 is:

> summary(SS1)
      educ
 Min.   : 0.0
 1st Qu.:12.0
 Median :15.0
 Mean   :14.6
 3rd Qu.:16.0
 Max.   :20.0

Output for SS2 is:

> summary(SS2)
      educ
 Min.   : 0.00
 1st Qu.:12.00
 Median :12.00
 Mean   :12.93
 3rd Qu.:15.00
 Max.   :20.00
 NA’s   :1
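Because one of the subsets contains a missing value, it helps to know how mean() treats NA when the means are computed directly rather than read off summary(). The sketch below uses made-up vectors (yes_educ and no_educ are hypothetical, not the GSS values) to show the effect of the na.rm argument.

```r
# Hypothetical years-of-education values; NA marks a missing response
yes_educ <- c(16, 14, NA, 18)
no_educ  <- c(12, 10, 14)

# Without na.rm = TRUE a single NA makes the result NA
mean_with_na <- mean(yes_educ)

# na.rm = TRUE drops missing values before averaging
mean_yes <- mean(yes_educ, na.rm = TRUE)
mean_no  <- mean(no_educ, na.rm = TRUE)

# Difference in mean years of education between the two groups
mean_diff <- mean_yes - mean_no
print(c(yes = mean_yes, no = mean_no, difference = mean_diff))
```

With the tutorial's subsets the corresponding calls would be mean(SS1$educ, na.rm = TRUE) and mean(SS2$educ, na.rm = TRUE).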

As seen above there is a difference in mean years of education for the two subsets. We can use an independent two-sample t test to determine whether the difference is larger than would be expected by chance.

In this tutorial I will use the Student’s t test function t.test that is found in the stats package. The function is used in the following form:

t.test(x, y, alternative = c("two.sided", "less", "greater"), mu = 0, paired = FALSE, var.equal = FALSE, conf.level = .95)

where x and y = numeric vectors of data values

alternative = specification of a one-tailed or two-tailed test

mu = 0 specification that true difference between means is zero

paired = FALSE specification of a two independent sample test; if TRUE a paired samples test will be used

var.equal = specification of equal variances of the two samples; if TRUE the pooled variance is used, otherwise the Welch approximation to the degrees of freedom is used

conf.level = confidence level of the interval

For further information see the documentation in CRAN help files for the function t.test().
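The effect of the var.equal argument is easiest to see on a small example. The following sketch uses two made-up samples (group_a and group_b are hypothetical) and runs the test both ways; the t statistic is the same here because the groups are the same size, but the degrees of freedom differ.

```r
# Two small hypothetical samples with unequal spread
group_a <- c(12, 14, 16, 18, 20)
group_b <- c(10, 11, 12, 13, 14)

# Pooled-variance (classical Student) test: df = n1 + n2 - 2
pooled <- t.test(group_a, group_b, var.equal = TRUE)

# Welch test: df is approximated and is usually not a whole number
welch <- t.test(group_a, group_b, var.equal = FALSE)

# Compare the degrees of freedom used by each version
print(pooled$parameter)
print(welch$parameter)
```

When the two sample variances are clearly different, the Welch version is generally the safer choice, which is why it is the default in t.test().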

Using the subsets SS1 and SS2 selected from the dataset, the t test is performed using:

###########################################################

#perform a t test to compare sample means

#########################################################

t.test(SS1, SS2, alternative = c("two.sided"), mu = 0, paired = FALSE, var.equal = TRUE, conf.level = .95)

###########################################################

Resulting in output of:

        Two Sample t-test

data:  SS1 and SS2 
t = 11.1356, df = 1650, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
 1.369673 1.955333 
sample estimates:
mean of x mean of y 
 14.59517  12.93267

We can see that the difference between the mean years of education for the ‘YES’ and the ‘NO’ samples is significant at the alpha = .05 level, since the p-value is far below .05. Subsets can also be used to compare means involving more than two samples using simple one-way Analysis of Variance. This will be covered in the next part of the tutorial.
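The printed report is convenient to read, but the individual numbers can also be pulled out of the object that t.test() returns. The sketch below is an alternative to the approach used above, not a replacement for it: it uses the formula interface on a made-up data frame (toy is hypothetical) so that no separate subsets are needed, and then extracts the p-value, confidence interval, and group means.

```r
# Hypothetical data: ten respondents, five per answer category
toy <- data.frame(
  abany = factor(rep(c("YES", "NO"), each = 5)),
  educ  = c(12, 14, 16, 18, 20, 10, 11, 12, 13, 14)
)

# Formula interface: compare educ across the two levels of abany
res <- t.test(educ ~ abany, data = toy, var.equal = TRUE)

# Individual pieces of the result are stored as list elements
print(res$p.value)    # two-sided p-value
print(res$conf.int)   # confidence interval for the difference in means
print(res$estimate)   # the two group means

# Decision at the .05 level
significant <- res$p.value < 0.05
print(significant)
```

The groups are ordered by factor level, so the first mean reported is for ‘NO’ and the second for ‘YES’.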

R Tutorials

Using R for Nonparametric Analysis: The Kruskal-Wallis Test, Part One

March 11, 2015 dmwiig

Using R for Nonparametric Data Analysis: The Kruskal-Wallis Test

A tutorial by Douglas M. Wiig

Analysis of variance (ANOVA) is a commonly used technique for examining the effect of an independent variable with three or more levels on a dependent variable, that is, for comparing the means of three or more groups. There are several types of ANOVA ranging from simple one-way ANOVA to the more complex multivariate analysis of variance, MANOVA. ANOVA makes several assumptions about the sample data being used such as the assumption of normal distribution of the variables in the parent population, underlying continuous distribution of the variables, and interval or ratio level measurement of all variables. If any of these assumptions cannot be met a researcher can turn to a nonparametric counterpart to ANOVA for the analysis. This tutorial will discuss the use of the Kruskal-Wallis test, the nonparametric counterpart to analysis of variance.

In this tutorial I will explore a simple example and discuss entering the sample data into a data file using the R data editor. I will then discuss setting up the data for analysis and using the Kruskal-Wallis test.

I am going to assume that the reader has a working knowledge of ANOVA with parametric data. Since ANOVA uses sample means and variances as the basis of the statistical test, interval or ratio level measurement is necessary to ensure valid results, in addition to the assumptions indicated above. With the nonparametric Kruskal-Wallis test the only assumptions to be met are ordinal or better measurement and the assumption of an underlying continuous measurement. The example to be used here is taken from a book on nonparametric statistics by Sidney Siegel. (Sidney Siegel, Nonparametric Statistics for the Behavioral Sciences, New York: McGraw-Hill, 1956, pp. 184-196).

A researcher wishes to test the hypothesis that school administrators are typically more authoritarian than classroom teachers. He also believes that many classroom teachers are administration-oriented in their professional aspirations which may, in turn, have an effect on their authoritarianism. Fourteen subjects are selected and divided into three groups: teaching-oriented teachers (classroom teachers who wish to remain in a teaching position), administration-oriented teachers (classroom teachers who aspire to become administrators), and practicing administrators. (Siegel, p. 186). The level of authoritarianism of each subject is measured through a survey that assigns an authoritarianism score that is considered to be at least ordinal in nature. Higher scores indicate higher levels of authoritarianism. (Siegel, p. 186). The null hypothesis is that there is no difference in mean authoritarianism scores among the three groups. The alternative hypothesis is that the mean authoritarianism scores among the three groups are different. The alpha level for rejecting the null hypothesis is p = .05. (Siegel, p. 186).

Since we make no assumption about a normal distribution of scores, have a small sample size of n = 14, and have ordinal measurement, we will use the nonparametric test, which is based on median scores and ranks rather than means and variances as used in parametric ANOVA. The mathematical details of how this is done are beyond the scope of this tutorial. See Siegel, pp. 187-189 for details. The authoritarian scores for the three groups are shown below:

Authoritarianism Scores of Three Groups of Educators

Teaching-oriented   Administration-oriented   Administrators
teachers (n=5)      teachers (n=5)            (n=4)
--------------------------------------------------------------
 96                  82                       115
128                 124                       149
 83                 132                       166
 61                 123                       147
101                 109
--------------------------------------------------------------

(Siegel, p. 187)
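The idea of working with ranks rather than raw scores can be sketched in a few lines. The numbers below are made up purely for illustration (they are not Siegel's data): all scores are ranked together regardless of group, and the ranks are then summed within each group. The Kruskal-Wallis statistic is built from rank sums of this kind.

```r
# Made-up scores for three groups (not Siegel's data)
scores <- c(10, 30, 20, 50, 40, 60)
group  <- factor(c("a", "a", "b", "b", "c", "c"))

# Rank all scores together, ignoring group membership
pooled_ranks <- rank(scores)
print(pooled_ranks)

# Sum the pooled ranks within each group
rank_sums <- tapply(pooled_ranks, group, sum)
print(rank_sums)
```

If the groups did not differ, their rank sums would be expected to be roughly proportional to their sizes; large departures from that are what the test looks for.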

The first task is to create an R data frame with the scores from the table. We will enter the scores using the R data editor. We will name the data frame ‘kruskal.’ Invoke the editor using the following commands:

> kruskal <- data.frame()

> kruskal <- edit(kruskal)

You should see the data entry editor open in a separate window. In order to process the data properly it needs to be entered into two columns. The first column will be the factors (which group the scores belong to), and the second column will contain the actual scores. Label column 1 ‘Group’ and column 2 ‘authscore.’ When the data are entered your editor should look like this:

----------------------
   Group authscore
1      1        96
2      1       128
3      1        83
4      1        61
5      1       101
6      2        82
7      2       121
8      2       132
9      2       135
10     2       109
11     3       115
12     3       149
13     3       166
14     3       147
----------------------

Make sure that each column of numbers is of the data type “Real.” Close the data editor by clicking ‘Quit’ and the data will be saved to the data frame kruskal in your R workspace. To see what has been entered in the data editor use the command:

> kruskal
   Group authscore
1      1        96
2      1       128
3      1        83
4      1        61
5      1       101
6      2        82
7      2       121
8      2       132
9      2       135
10     2       109
11     3       115
12     3       149
13     3       166
14     3       147

You should see the output as above. If you need to make changes simply invoke the editor with:

> kruskal <- edit(kruskal)

The editor will open and you can make any changes you need to. Be sure to click on ‘Quit’ so that the changes are saved back to the data frame kruskal.
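If the interactive editor is not available (for example, when R is run over a remote console), the same data frame can be built directly in code. This is a sketch of one way to do it, not the method used above; the values are copied from the data editor listing shown earlier, and the column names Group and authscore match that listing.

```r
# Build the data frame directly instead of typing into the editor.
# Values follow the data editor listing above.
kruskal <- data.frame(
  Group     = rep(c(1, 2, 3), times = c(5, 5, 4)),
  authscore = c(96, 128, 83, 61, 101,     # group 1: teaching-oriented teachers
                82, 121, 132, 135, 109,   # group 2: administration-oriented teachers
                115, 149, 166, 147)       # group 3: administrators
)

# Check the structure and the number of cases in each group
str(kruskal)
group_sizes <- table(kruskal$Group)
print(group_sizes)
```

Both columns are created as numeric, which corresponds to the “Real” type in the data editor.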

Part Two will continue the analysis.


R Statistics and Programming


Resources and Information About R Statistics and Programming
