R Tutorial: A Simple Script to Create and Analyze a Data File, Part Two


A tutorial by D.M. Wiig

In part one I discussed creating a simple data file containing the height and weight of 10 subjects. In part two I will discuss the script needed to create a simple scatter diagram of the data and perform a basic Pearson correlation. Before continuing with the script in this tutorial, make sure that you have created and saved the data file as discussed in part one.

To conduct a correlation/regression analysis of the data we want to first view a simple scatter plot. Load a library named ‘car’ into R memory. Use the command:

> library(car)

Then issue the following command to plot the graph:

> plot(Height~Weight, log="xy", data=Sampledatafile)

The output is seen below:

[Figure: scatter plot of Height (y-axis) against Weight (x-axis), both on log scales]

We can calculate a Pearson’s Product Moment correlation coefficient by using the command:

> # Pearson product-moment correlation between height and weight

> cor(Sampledatafile[,c("Height","Weight")], use="complete.obs", method="pearson")

Which results in:

          Height    Weight
Height 1.0000000 0.8813799
Weight 0.8813799 1.0000000
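As a quick check on the matrix above, the following sketch rebuilds the ten height and weight pairs from part one directly in code (an assumption made here so the example is self-contained, rather than using the data editor) and computes the same coefficient. The call to cor.test() is an addition not used in the original script; it reports a significance test alongside r.

```r
# Rebuild the part-one data so this block runs on its own
Sampledatafile <- data.frame(
  Height = c(72, 60, 65, 75, 66, 65, 70, 71, 68, 77),
  Weight = c(225, 128, 176, 215, 145, 120, 210, 176, 155, 250)
)

# Pearson product-moment correlation between the two columns
r <- cor(Sampledatafile$Height, Sampledatafile$Weight, method = "pearson")
print(r)

# cor.test() adds a t test of the null hypothesis that the correlation is zero
result <- cor.test(Sampledatafile$Height, Sampledatafile$Weight,
                   method = "pearson")
print(result$p.value)
```

The single value returned here should match the off-diagonal entries of the correlation matrix.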

To run a simple linear regression for Height and Weight use the following code. Note that the dependent variable (Weight) is listed first:

> model <-lm(Weight~Height, data=Sampledatafile)

> summary(model)

Call:
lm(formula = Weight ~ Height, data = Sampledatafile)

Residuals:
     Min       1Q   Median       3Q      Max
-30.6800 -16.9749  -0.8774  19.9982  25.3200

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -337.986     98.403  -3.435 0.008893 **
Height         7.518      1.425   5.277 0.000749 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 21.93 on 8 degrees of freedom
Multiple R-squared: 0.7768, Adjusted R-squared: 0.7489
F-statistic: 27.85 on 1 and 8 DF, p-value: 0.0007489
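The fitted coefficients can also be pulled out of the model object and used for prediction. The sketch below assumes the same ten data pairs from part one, entered directly so the block is self-contained; the 70-inch height is a hypothetical input chosen only for illustration.

```r
# Rebuild the part-one data so this block runs on its own
Sampledatafile <- data.frame(
  Height = c(72, 60, 65, 75, 66, 65, 70, 71, 68, 77),
  Weight = c(225, 128, 176, 215, 145, 120, 210, 176, 155, 250)
)

# Fit the same model: Weight is the dependent variable
model <- lm(Weight ~ Height, data = Sampledatafile)

# Intercept and slope as a named numeric vector
b <- coef(model)
print(b)

# Predicted weight for a hypothetical subject who is 70 inches tall
new_subject <- data.frame(Height = 70)
predicted <- predict(model, newdata = new_subject)
print(predicted)

# R-squared is also stored in the summary object
r_squared <- summary(model)$r.squared
print(r_squared)
```

With one predictor, R-squared is simply the square of the Pearson correlation computed earlier.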

To plot a regression line on the scatter diagram use the following command. Note that we enter the y (dependent) variable first and then the x (independent) variable:

> scatterplot(Weight~Height, log="xy", reg.line=lm, smooth=FALSE, spread=FALSE,
+ data=Sampledatafile)

This will produce a graph as seen below. Note that box plots have also been included in the output:

[Figure: scatter plot of Weight against Height with fitted regression line and marginal box plots]
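If the car package is not installed, or if your installed version uses different argument names for scatterplot() than the ones shown above (worth checking with ?scatterplot), a similar picture can be sketched with base R graphics alone. This version is an alternative, not the tutorial's own command: it uses linear rather than log axes and omits the marginal box plots. The data pairs from part one are entered directly so the block is self-contained.

```r
# Rebuild the part-one data so this block runs on its own
Sampledatafile <- data.frame(
  Height = c(72, 60, 65, 75, 66, 65, 70, 71, 68, 77),
  Weight = c(225, 128, 176, 215, 145, 120, 210, 176, 155, 250)
)

# Fit the regression of Weight on Height
model <- lm(Weight ~ Height, data = Sampledatafile)

# Scatter plot with the dependent variable on the y-axis
plot(Weight ~ Height, data = Sampledatafile,
     xlab = "Height (inches)", ylab = "Weight (lbs)",
     main = "Weight by Height")

# Overlay the fitted least-squares line
abline(model)
```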

This tutorial has hopefully demonstrated that complex tasks can be accomplished with relatively simple command line scripts. I will explore more of these simple scripts in future tutorials.

More to Come:

 

Book Review: R High Performance Programming


A book review by Douglas M. Wiig

Aloysius Lim and William Tjhi. R High Performance Programming. Birmingham, UK: Packt Publishing Ltd., 2015. bit.ly/14Rhpp

R High Performance Programming is a well written, informative book most suited for the experienced R programmer. This book offers a handy guide for R users who need speed and efficiency for the tasks that they perform.

The authors begin with an informative chapter discussing some of the inherent constraints on R’s computing performance such as CPU and RAM usage, and how R code is interpreted on the fly rather than compiled. A guide to several methods of profiling R’s code execution time, memory allocation and CPU usage is discussed in the next chapter. Sample code included in the chapter allows the reader to experiment with various benchmarking techniques to measure processing time and memory usage. This chapter provides the reader with some good tools for benchmarking R projects and identifying areas where improvements in processing can be made.
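As a rough illustration of the kind of measurement described here (not code taken from the book), base R's system.time() and object.size() give a first look at run time and memory use. The function names slow_sum and fast_sum and the vector length are hypothetical choices for this sketch, and the timings will differ from machine to machine.

```r
# Two ways to total the squares of a vector:
# an explicit loop and a vectorized call
slow_sum <- function(x) {
  total <- 0
  for (value in x) {
    total <- total + value^2
  }
  total
}
fast_sum <- function(x) sum(x^2)

x <- as.numeric(seq_len(100000))

# system.time() reports user, system and elapsed seconds for an expression
slow_time <- system.time(slow_result <- slow_sum(x))
fast_time <- system.time(fast_result <- fast_sum(x))
print(slow_time["elapsed"])
print(fast_time["elapsed"])

# object.size() estimates the memory used by an object
print(object.size(x), units = "Kb")
```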

As is always the case with technical books from Packt Publishing, ample code examples are used in the chapter and the complete code used in each chapter is available for download with the book. This is a very handy feature and allows readers to do some live programming with R as the book is read.

The authors discuss a number of simple tweaks that can be easily performed to increase processing speed, such as using built-in functions and using hash tables. The hash table technique is useful for applications that perform frequent lookups and can dramatically reduce processing time when compared to the use of lists. Running the example code shows a large decrease in processing time for the hash table approach compared with straight list lookups.
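The lookup comparison can be sketched in base R, where an environment can serve as a hash table. This is a minimal illustration under my own assumptions (the key names, table size, and timing loops are hypothetical and are not the book's benchmark), and the measured times will vary by machine.

```r
n <- 5000
keys <- paste0("key", seq_len(n))

# A named list: looking up an element by name searches through the names
lookup_list <- setNames(as.list(seq_len(n)), keys)

# An environment created with hash = TRUE acts as a hash table
lookup_env <- new.env(hash = TRUE, size = n)
for (i in seq_len(n)) {
  assign(keys[i], i, envir = lookup_env)
}

# Time one lookup of every key with each structure
list_time <- system.time(for (k in keys) lookup_list[[k]])
env_time  <- system.time(for (k in keys) get(k, envir = lookup_env))

print(list_time["elapsed"])
print(env_time["elapsed"])
```

Both structures return the same values; only the cost of finding a key differs.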

In chapter 4 the authors discuss the use of compiled R code and integrating compiled languages into R code. They show several examples of using the R package inline that allows users to embed C, C++, Objective-C, Objective-C++ and Fortran code within R. Once again there are ample code examples to illustrate the use of this technique. For more advanced uses of compiled code the authors discuss how to create entire modules coded in C++ using the Rcpp package. Several completed code examples are included to illustrate the technique.

Another interesting approach to speeding up R is discussed in a chapter that explores several R packages designed to exploit the capability of GPUs (graphics processing units), which are found in many computers. These techniques can facilitate creating very fast and efficient statistical modeling code using R and the GPU.

As indicated above, readers can download the code package included with the book and find a well-organized set of ten folders (one for each chapter) containing 51 files. These files contain the sample code from the book as well as other code segments and benchmark code discussed in the book. The authors indicate that the code has been tested on R 3.1.1, Ubuntu 14.04 Trusty Tahr, Mac OS X 10.9 Mavericks, and Windows 8.1. This allows integration of these code segments into the reader’s own projects with minimal changes.

Other chapters in R High Performance Programming discuss simple tweaks to use less memory, techniques to speed processing of large datasets, and the use of parallel processing and clustering techniques. The last chapter contains a discussion of using R and Hadoop to process Big Data (massive datasets with sizes measured in petabytes; one petabyte is 1,048,576 gigabytes). Processing data of this magnitude presents many challenges and is an area that is currently the subject of much program development.

I found R High Performance Programming to be a useful and informative book for the advanced user of R. A working knowledge of statistics, R and other programming languages such as C++ or Java is necessary to realize the full benefit of the techniques presented in the book. The book also serves as a good learning tool for less knowledgeable R users who are seeking to advance their programming skills.

Readers who are interested in the use of Hadoop and cluster computer processing might find the book Raspberry Pi Super Cluster by Andrew K. Dennis of interest (Packt Publishing, 2013, PAC-14-1987838-1387169). A review of this book can be found on my web site at http://dmwiig.net.

Reviewer Information:

Douglas M. Wiig, Professor of Political Science

Grand View University

Teaching areas include social science statistics and research methods, comparative politics, international politics.

Long time user and developer of computer and statistical applications

Host of Open Source Technology in Higher Education web site at http://dmwiig.net

Creator and moderator of LinkedIn discussion forum “Open Source Technology in Higher Education”

Regular contributor to several LinkedIn discussion forums

Author of numerous tutorials on using the R statistical programming language and Raspberry Pi computer

R Tutorial: A Script to Create and Analyze a Simple Data File, Part One



By D.M. Wiig

In this tutorial I will walk you through a simple script that will show you how to create a data file and perform some simple statistical procedures on the file. I will break the code into segments and discuss what each segment does. Before starting this tutorial, open a terminal window and start R from the command line.

The first task is to create a simple data file. Let’s assume that we have some data from 10 individuals measuring each person’s height and weight. The data is shown below:

Height(inches) Weight(lbs)

72               225

60               128

65               176

75               215

66               145

65               120

70               210

71               176

68               155

77               250

We can enter the data into a data frame by invoking the data editor and entering the values. Please note that lines of code preceded by a # are comments and are ignored by R:

# Create a new file and invoke the data editor to enter data

# Create the file Sampledatafile, height and weight of 10 subjects

Sampledatafile <- data.frame()

Sampledatafile <- edit(Sampledatafile)

You will see a window open that is the R Data Editor. Click on the column heading ‘var1’ and you will see several different data types in the drop down menu. Choose the ‘real’ data type. Follow the same procedure to set the data type for the second column. Enter the data pairs in the columns, with height in the first column and weight in the second column. When the data have been entered click on the var1 heading for column 1 and click ‘Change Name.’ Enter ‘Height’ to label the first column. Follow the same steps to rename the second column ‘Weight.’

Once both columns of data have been entered you can click ‘Quit.’ The datafile ‘Sampledatafile’ is now loaded into memory.
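For readers who prefer not to use the interactive editor, the same data frame can be built directly in code. This is an alternative sketch rather than the method used in this tutorial. The file name Sampledatafile.csv is a hypothetical choice, and tempdir() is used so the example does not write into your working directory.

```r
# Build the data frame directly from the ten height and weight pairs
Sampledatafile <- data.frame(
  Height = c(72, 60, 65, 75, 66, 65, 70, 71, 68, 77),
  Weight = c(225, 128, 176, 215, 145, 120, 210, 176, 155, 250)
)

# Save the data to a CSV file (hypothetical file name) in a temporary folder
csv_path <- file.path(tempdir(), "Sampledatafile.csv")
write.csv(Sampledatafile, csv_path, row.names = FALSE)

# Read the file back in to confirm that it was saved correctly
reloaded <- read.csv(csv_path)
str(reloaded)
```

Saving to a file in this way lets you reload the data in a later R session with read.csv().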

To run some basic descriptive statistics use the following code:

> #Run descriptives on the data

> summary(Sampledatafile)

The output from this code will be:

  Height                Weight

Min. :60.00          Min. :120.0

1st Qu.:65.25        1st Qu.:147.5

Median :69.00        Median :176.0

Mean :68.90          Mean :180.0

3rd Qu.:71.75        3rd Qu.:213.8

Max. :77.00          Max. :250.0

>

To view the data file use the following lines of code:

>#print the datafile ‘Sampledatafile’ on the screen

> print(Sampledatafile)

You will see the output:

Height          Weight

1 72             225

2 60             128

3 65             176

4 75             215

5 66             145

6 65             120

7 70             210

8 71             176

9 68             155

10 77            250

In Part Two I will discuss an R script to do a simple correlation and scatter diagram.  Check back later!

Nonparametric Statistical Analysis Using R: The Sign Test


Using R in Nonparametric Statistical Analysis: The Binomial Sign Test

A tutorial by D.M. Wiig

One of the core competencies that students master in introductory social science statistics is to create a null and alternative hypothesis pair relative to a research question and to use a statistical test to evaluate and make a decision about rejecting or retaining the null hypothesis.  I have found that one of the easiest statistical tests to use when teaching these concepts is the sign test.  This is a very easy test to use and students seem to intuitively grasp the concepts of trials and binomial outcomes as these are easily related to the common and familiar event of ‘flipping a coin.’

 

While it is possible to use the sign test by looking up probabilities of outcomes in a table of the binomial distribution, I have found that using R to perform the analysis is a good way to get students involved in using statistics software to solve the problem. R has an easy to use sign test routine that is called with the binom.test command. To illustrate the use of the test consider an experiment where the researcher has randomly assigned 10 individuals to a group and observes them in both a control and experimental condition. The researcher measures the criterion variable of interest in each condition for each subject and measures the effect on each subject’s behavior using a relative scale of effect.

 

The researcher at this point is only interested in whether or not the criterion variable has an effect on behavior, so a non-directional hypothesis is used.  The data collected is shown in the following table:

 

Subject    1    2    3    4    5    6    7    8    9   10
---------------------------------------------------------
Pre       50   49   37   16   80   42   40   58   31   21
Post      56   50   30   25   90   44   60   71   32   22
---------------------------------------------------------
Sign       +    +    -    +    +    +    +    +    +    -

The general format for the sign test is as follows:

 

binom.test(x, n, p = 0.5, alternative = c("two.sided", "less", "greater"), conf.level = 0.95)

 

where: x = number of successes

n = number of trials

p = hypothesized probability of success

alternative = indicates whether the alternative hypothesis is nondirectional ("two.sided") or directional ("less" or "greater")

conf.level = the confidence level for the returned confidence interval.
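To see how the alternative argument changes the result, the sketch below runs the test three ways on the 8 pluses out of 10 trials shown in the sign row of the table above. Only the p-values are compared, and the variable names are hypothetical.

```r
pluses <- 8   # number of successes (plus signs)
trials <- 10  # number of trials (subjects)

# Nondirectional alternative: the probability of a plus differs from 0.5
p_two_sided <- binom.test(pluses, trials, p = 0.5,
                          alternative = "two.sided")$p.value

# Directional alternatives: the probability is greater than, or less than, 0.5
p_greater <- binom.test(pluses, trials, p = 0.5,
                        alternative = "greater")$p.value
p_less <- binom.test(pluses, trials, p = 0.5,
                     alternative = "less")$p.value

print(c(two.sided = p_two_sided, greater = p_greater, less = p_less))
```

The choice among the three should follow from the research hypothesis, stated before looking at the data.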

 

In the example described above we have 8 pluses and 2 minuses. We will use the “two.sided” option for the alternative hypothesis, a probability of success of .50, and a conf.level of .95. The following is entered into R:


> binom.test(8, 10, p=.5, alternative="two.sided", conf.level=.95)

Exact binomial test

data:  8 and 10
number of successes = 8, number of trials = 10,
p-value = 0.1094
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
0.4439045 0.9747893
sample estimates:
probability of success
                  0.8
>

Under a nondirectional alternative hypothesis we are testing the probability of obtaining 0, 1, 2, 8, 9, 10 pluses or:

 

p(0, 1, 2, 8, 9, 10 pluses)  = .1094
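This probability can be reproduced by summing the individual binomial probabilities for each of those outcomes with dbinom(). The sketch below is an added check, not part of the original script, and compares the sum with the p-value that binom.test() reports.

```r
# Outcomes at least as extreme as 8 pluses, in either direction
extreme_outcomes <- c(0, 1, 2, 8, 9, 10)

# Probability of each outcome in 10 trials with a success probability of 0.5
p_manual <- sum(dbinom(extreme_outcomes, size = 10, prob = 0.5))

# The same quantity as reported by the exact binomial test
p_reported <- binom.test(8, 10, p = 0.5, alternative = "two.sided")$p.value

print(p_manual)
print(p_reported)
```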

 

If we had set α = .05, we would retain the null hypothesis because p(obt) = .1094 > .05. We could not conclude that the experimental criterion has an effect on behavior. R has many other nonparametric statistical tests that are easy to use from the command line. These are topics for future tutorials.

 

More to Follow:

Book Review: Mastering Beaglebone Robotics


Richard Grimmett. Mastering Beaglebone Robotics. Birmingham, UK: Packt Publishing Ltd., 2014. ISBN #978-1-78398-890-7 http://bit.ly/MBbR8907

Book Review by Douglas M. Wiig

With the release of the Raspberry Pi single board computer a new generation of single board multi-platform and multi-use computers has rapidly developed. One of the newer boards to be developed is the Beaglebone Black which is a low cost, multi-functional package that has a number of core functionalities that facilitate building robotic projects. Grimmett’s Mastering Beaglebone Robotics is a very informative and readable guide to the development and implementation of several such projects. The finished projects are sophisticated, functional and educational. They also lend themselves to expansion into even more complex applications if the reader is so inclined.

This book is not intended for beginners with single board computing platforms or robotics but the author does go through the basics of setting up the Beaglebone and installing the necessary software to accommodate the projects in the book. If you are not yet comfortable with installing and configuring hardware and software or working with the Linux command line you should have a basic reference handy as you work through the initial hardware and software setup in chapter one of the book. The author does provide numerous photos and screen shots to help with the process. The author also uses very clear indications of how and what command line actions are used in installing and configuring various programs needed to set up the Beaglebone for the projects in the book.

Once the basic hardware and software are installed and running the author begins a discussion of robotics by taking the reader through a step by step process to create a movable project based on two tank tracks. The chapter covers the basics of using a motor and controller to power the project, the development and use of programs to control the vehicle and the use of voice commands to control the vehicle.

The author provides a detailed description along with numerous photos showing the build as it progresses. In the sections of the chapter where Beaglebone programming is covered the author uses very clear descriptions of the code that make the process easy to follow. Another nice feature of this book as well as other technical books in the Packt library is the availability for download of all of the code used in the book. This is a very handy feature and helps to prevent the frustration of coding errors that are inherent in entering the code from scratch on a keyboard. It also facilitates the debugging phase of the projects.

Once the basic mobile project platform is functional the author devotes two additional chapters to adding sensors of various kinds such as distance object detection, and adding vision and vision processing capabilities. Once again, the author uses numerous detailed photos, screen shots and programming detail in discussing these phases of the project. By the time the reader finishes chapter four of the book a fully functional, programmable movable platform has been developed.

Subsequent chapters of the book are devoted to additional projects that incorporate the basic principles of robotics learned in the initial project. The author discusses building robots that can walk, sail, and use GPS for navigation. There is also a discussion of a project robot that can be submerged and controlled remotely while under water.

The final two chapters of the book detail a quadcopter that is remotely controlled and an autonomous quadcopter that features programmed flight controlled by GPS. I found these chapters particularly interesting as one of my hobbies is flying radio controlled aircraft of various types. These two projects are rather advanced in nature and are more for readers interested in contributing to the development of such projects. In both projects the Beaglebone is used for higher level function such as GPS navigation, path planning and communications. Most of the low level functioning such as controlling the servo motors and other mechanical functions is accomplished by programming and incorporating a separate flight controller board.

As I mentioned earlier, one of the handy features of this book as well as others offered by Packt Publishing is the availability of the computer code used in each chapter of the book. The code used in Mastering Beaglebone Robotics is written in Python and there are files for each of the chapters (with the exception of chapters one and six). This is a useful feature not only for debugging purposes but for those readers who wish to develop other projects or add to the projects detailed in the book.

I found Mastering Beaglebone Robotics to be a good read and a readily usable guide to some of the more complex robotics concepts and construction practices. As indicated earlier this would not be a first book for one starting in either robotics or single-board computing platforms. For the reader with some experience in programming and construction practices the book is an interesting and informative source of information about a rapidly growing field in computer science technology and robotics.

——————————————————————