Tag Archives: open source

R for Beginners: Some Simple Code to Produce Informative Graphs, Part One


A Tutorial by D. M. Wiig

The R programming language has a multitude of packages that can be used to display various types of graph. For a new user looking to display data in a meaningful way graphing functions can look very intimidating. When using a statistics package such as SPSS, Stata, Minitab or even some of the R Gui’s such R Commander sophisticated graphs can be produced but with a limited range of options. When using the R command line to produce graphics output the user has virtually 100 percent control over every aspect of the graphics output.

For new R users there are some basic commands that can be used that are easy to understand and offer a large degree of control over customisation of the graphical output. In part one of this tutorial I will discuss some R scripts that can be used to show typical output from a basic correlation and regression analysis.

For the first example I will use one of the datasets from the R MASS dataset package. The dataset is ‘UScrime´ which contains data on certain factors and their relationship to violent crime. In the first example I will produce a simple scatter plot using the variables ‘GDP’ as the independent variable and ´crimerate´ the dependent variable which is represented by the letter ‘y’ in the dataset.

Before starting on this project install and load the R package ‘MASS.’ Other needed packages are loaded when R is started. The scatter plot is produced using the following code:

####################################################
### make sure that the MASS package is installed
###################################################
library(MASS)   ## load MASS
attach(UScrime)   ## use the UScrime dataset
## plot the two dimensional scatterplot and add appropriate #labels
#
plot(GDP, y,
main=”Basic Scatterplot of Crime Rate vs. GDP”,
xlab=”GDP”,
ylab=”Crime Rate”)
#
####################################################

The above code produces a two-dimensional plot of GDP vs. Crimerate. A regression line can be added to the graph produced by including the following code:

####################################################
## add a regression line to the scatter plot by using simple bivariate #linear model
## lm generates the coefficients for the regression model.extract
## col sets color; lwd sets line width; lty sets line type
#
abline(lm(y ~ GDP), col=”red”, lwd=2, lty=1)
#
####################################################

As is often the case in behavioral research we want to evaluate models that involve more than two variables. For multivariate models scatter plots can be generated using a 3 dimensional version of the R plot() function. For the above model we can add a third variable ‘Ineq’ from the dataset which is a measure the distribution of wealth in the population. Since we are now working with a multivariate linear model of the form ‘y = b1(x1) + b2(x2) + a’ we can use the R function scatterplot3d() to generate a 3 dimensional representation of the variables.

Once again we use the MASS package and the dataset  ‘UScrime’ for the graph data. The code is seen below:

####################################################
## create a 3d graph using the variables y, GDP, and Ineq
####################################################
#
library(scatterplot3d)   ##load scatterplot3d function
require(MASS)
attach(UScrime)   ## use data from UScrime dataset
scatterplot3d(y,GDP, Ineq,
main=”Basic 3D Scatterplot”) ## graph 3 variables, y
#
###################################################

The following graph is produced:

screenshot-graphics-device-number-2-active-%27rkward%27

The above code will generate a basic 3d plot using default values. We can add straight lines from the plane of the graph to each of the data points by setting the graph type option as ‘type=”h”, as seen in the code below:

##############################################

require(MASS)
library(scatterplot3d)
attach(UScrime)
model <- scatterplot3d(GDP, Ineq, y,
type=”h”, ## add vertical lines from plane with this option
main=”3D Scatterplot with Vertical Lines”)
####################################################

This results in the graph:

screenshot-graphics-device-number-2-active-%27rkward%27-1

There are numerous options that can be used to go beyond the basic 3d plot. Refer to CRAN documentation to see these. A final addition to the 3d plot as discussed here is the code needed to generate the regression plane of our linear regression model using the y (crimerate), GDP, and Ineq variables. This is accomplished using the plane3d() option that will draw a plane through the data points of the existing plot. The code to do this is shown below:

##############################################
require(MASS)
library(scatterplot3d)
attach(UScrime)
model <- scatterplot3d(GDP, Ineq, y,
type=”h”,   ## add vertical line from plane to data points with this #option
main=”3D Scatterplot with Vertical Lines”)
## now calculate and add the linear regression data
model1 <- lm(y ~ GDP + Ineq)   #
model$plane3d(model1)   ## link the 3d scatterplot in ‘model’ to the ‘plane3d’ option with ‘model1’ regression information
#
####################################################

The resulting graph is:

screenshot-graphics-device-number-2-active-%27rkward%27-2

To draw a regression plane through the data points only change the ‘type’ option to ‘type=”p” to show the data points without vertical lines to the plane. There are also many other options that can be used. See the CRAN documentation to review them.

I have hopefully shown that relatively simple R code can be used to generate some informative and useful graphs. Once you start to become aware of how to use the multitude of options for these functions you can have virtually total control of the visual presentation of data. I will discuss some additional simple graphs in the next tutorial that I post.

Advertisements

R For Beginners: Installing the JGR GUI On a Linux Platform


A Tutorial by D. M. Wiig

This is an embedded Word document.  To view it full screen click on the icon in the lower right cornet of the document.

Watch for more tutorials discussing  R statistics on a Linux platform.

R For Beginners: Installing and Using the R Console in a Windows Environment


An R tutorial by D. M. Wiig

This tutorial is posted as an embedded Word document. To view the document full screen click on the icon in the lower-right corner of the document window.

My next post covering installing and using the Rcommander GUI will be out in a day or two.

Peppermint 6 OS is one Sweet Platform


I have spent the past week or two working with Peppermint operating system  v. 6.  This is an operating system that is a hybrid with Google’s Chrome operating system integrated with an Ubuntu platform.  Peppermint OS provides a fast easy to use GUI and is optimized for using Google based cloud tools.

The idea is to keep the clutter to a minimum on the home computer and utilize the cloud to advantage with the numerous Google tools and applications that are available.  Of course, Peppermint OS is free and open source.  The degree of customization possible is only limited by the users imagination and programming abilities.

I will write more on this later as I  utilize more of the programs features.  Check out the Peppermint OS web site at:

http://www.peppermintos.com

 

Book Review: R High Performance Programming


A book review by Douglas M. Wiig

Aloysius Lim and William Tjhi. R High Performance Programming. Birmingham, UK: Packt Publishing Ltd., 2015. bit.ly/14Rhpp

R High Performance Programming is a well written, informative book most suited for the experienced R programmer. This book offers a handy guide for R users who need speed and efficiency for the tasks that they perform.

The authors begin with an informative chapter discussing some of the inherent constraints on R’s computing performance such as CPU and RAM usage, and how R code is interpreted on the fly rather than compiled. A guide to several methods of profiling R’s code execution time, memory allocation and CPU usage is discussed in the next chapter. Sample code included in the chapter allows the reader to experiment with various benchmarking techniques to measure processing time and memory usage. This chapter provides the reader with some good tools for benchmarking R projects and identifying areas where improvements in processing can be made.

As is always the case with technical books from Packt Publishing, ample code examples are used in the chapter and the complete code used in each chapter is available for download with the book. This is a very handy feature and allows readers to do some live programming with R as the book is read.

The authors discuss a number of simple tweaks that can be easily performed to increase processing speed such as using built in functions and using hash tables. The hash table technique is useful for applications that use frequent lookups and can dramatically reduce processing time when compared to the use of lists. Running example code using this technique shows a large decrease in processing time when using the hash table approach as compared to straight list processing lookups.

In chapter 4 the authors discuss the use of compiled R code and integrating compiled languages into R code. They show several examples of using the R package inline that allows users to embed C, C++, Objective-C, Objective-C++ and Fortran code within R. Once again there are ample code examples to illustrate the use of this technique. For more advanced uses of compiled code the authors discuss how to create entire modules coded in C++ using the Rcpp package. Several completed code examples are included to illustrate the technique.

Another interesting approach to speeding up R is discussed in a chapter that explores several R packages designed to exploit the capability of GPU’s (Graphic Processing Cards) that are a used in many computers. These techniques can facilitate creating very fast and efficient statistical modeling code using R and the GPU.

As indicated above, readers can download the code package included with the book and find a well-organized set of ten folders (one for each chapter) containing 51 files. These files contain the sample code from the book as well as other code segments and benchmark code discussed in the book. The authors indicate that the code has been tested on R 3.1.1, Ubuntu 14.04 Trusty Tahr, Mac OS X 10.9 Mavericks, and Windows 8.1. This allows integration of these code segments into the reader’s own projects with minimal changes.

Other chapters in R High Performance Programming discuss simple tweaks to use less memory, techniques to speed processing of large datasets and using parallel processing and clustering techniques. The last chapter contains a discussion of using R and Hadoop to process Big Data (massive datasets with sizes measured in petabytes -one petabyes is 1,048,576 gigabytes). Processing data of this magnitude presents many challenges and is an area that is currently the subject of much program development.

I found R High Performance Programming to be a useful and informative book for the advanced user of R. A working knowledge of statistics, R and other programming languages such as C++ or Java is necessary to realize the full benefit of the techniques presented in the book. The book also serves as a good learning tool for less knowledgeable R users who are seeking to advance their programming skills.

Readers who are interested in the use of Hadoop and cluster computer processing might find the book Raspberry Pi Super Cluster by Andrew K. Dennis of interest. (Packt Publishing, 2013

PAC-14-1987838-1387169). A review of this book can be found on my web site at http://dmwiig.net.

Reviewer Information:

Douglas M. Wiig, Professor of Political Science

Grand View University

Teaching areas include social science statistics and research methods, comparative politics, international politics.

Long time user and developer of computer and statistical applications

Host of Open Source Technology in Higher Education web site at http://dmwiig.net

Creator and moderator of LinkedIn discussion forum “Open Source Technology in Higher Education”

Regular contributor to several LinkedIn discussion forums

Author of numerous tutorials on using the R statistical programming language and Raspberry Pi computer