# R For Beginners: Basic Graphics Code to Produce Informative Graphs, Part Two, Working With Big Data


A tutorial by D. M. Wiig

In part one of this tutorial I discussed the use of R code to produce 3d scatterplots. This is a useful way to produce visual results of multivariate linear regression models. While visual displays using scatterplots are a useful tool with most datasets, they become much more of a challenge when analyzing big data. These types of databases can contain tens of thousands or even millions of cases and hundreds of variables.

Working with these types of data sets involves a number of challenges. If a researcher is interested in using visual presentations such as scatterplots this can be a daunting task. I will start by discussing how scatterplots can be used to provide meaningful visual representation of the relationship between two variables in a simple bivariate model.

To start I will construct a theoretical data set that consists of fifty thousand x and y pairs of observations. One method that can be used to accomplish this is to use the R rnorm() function to generate a set of random numbers drawn from a normal distribution with a specified mean and standard deviation. I will use this function to generate both the x and y variables.

Before starting this tutorial make sure that R is running and that the LSD package has been installed; the datasets and stats packages are included with a standard R installation. Use the following code to generate the x and y values such that the mean of x = 10 with a standard deviation of 15, and the mean of y = 7 with a standard deviation of 3:

####################################################
## make sure package LSD is loaded
##
library(LSD)
x <- rnorm(50000, mean=10, sd=15)   ## generates x values; stores results in variable x
y <- rnorm(50000, mean=7, sd=3)     ## generates y values; stores results in variable y
####################################################
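Because rnorm() draws new random values on every call, the plots will differ slightly from run to run. A minimal sketch, assuming a seed of 42 (an arbitrary choice that is not part of the code above), makes the data reproducible and checks that the sample statistics are close to the requested values:

```r
## fix the random seed so the same 50,000 pairs are generated each time
## (the seed value 42 is an arbitrary, illustrative choice)
set.seed(42)
n <- 50000
x <- rnorm(n, mean = 10, sd = 15)
y <- rnorm(n, mean = 7, sd = 3)

## compare the sample means and standard deviations with the targets
stats_xy <- round(c(mean_x = mean(x), sd_x = sd(x),
                    mean_y = mean(y), sd_y = sd(y)), 2)
print(stats_xy)
```

The printed values should be close to, but not exactly, 10, 15, 7 and 3, since they are estimates from a random sample.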

Now the scatterplot can be created using the code:

####################################################
## plot randomly generated x and y values
##
plot(x,y, main="Scatterplot of 50,000 points")
####################################################

As can be seen, the resulting plot is mostly a mass of black with relatively few individual x and y points shown other than the outliers. We can do a quick histogram on the x values and the y values to check the normality of the resulting distributions. This is shown in the code below:

####################################################
## show histogram of x and y distribution
####################################################
hist(x)   ## histogram for x mean=10; sd=15; n=50,000
##
hist(y)   ## histogram for y mean=7; sd=3; n=50,000
####################################################
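
As an additional visual check, a theoretical normal curve can be drawn over the histogram. The sketch below is one possible way to do this for x; the seed, the number of breaks, and the colors are illustrative assumptions rather than part of the original script:

```r
## regenerate x so the example is self-contained (seed is arbitrary)
set.seed(42)
x <- rnorm(50000, mean = 10, sd = 15)

## freq = FALSE puts the histogram on the density scale so that
## it can be compared directly with dnorm()
h <- hist(x, breaks = 50, freq = FALSE,
          main = "Histogram of x with normal curve")

## inside curve() the name x is the plotting variable, not the data vector
curve(dnorm(x, mean = 10, sd = 15), add = TRUE, col = "red", lwd = 2)
```

If the bars follow the red curve closely, the generated values are consistent with the requested normal distribution.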

The histograms show a normal distribution for both variables. As is expected, in the x vs. y scatterplot the center mass of points is located at the x = 10; y = 7 coordinate of the graph, as this coordinate contains the mean of each distribution. A more meaningful scatterplot of the dataset can be generated using the R functions smoothScatter() and heatscatter(). The smoothScatter() function is located in the graphics package and the heatscatter() function is located in the LSD package.

The smoothScatter() function creates a smoothed color density representation of a scatterplot. This allows for a better visual representation of the density of individual values for the x and y pairs. To use the smoothScatter() function with the large dataset created above use the following code:

####################################################
## use smoothScatter function to visualize the scatterplot of 50,000 x
## and y values
## the x and y values should still be in the workspace as created above
## with the rnorm() function
##
smoothScatter(x, y, main = "Smoothed Color Density Representation of 50,000 (x,y) Coordinates")
##
####################################################

The resulting plot shows several bands of density surrounding the coordinates x=10, y=7 which are the means of the two distributions rather than an indistinguishable mass of dark points.
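
To give a rough idea of what a smoothed density plot represents, the sketch below estimates the two-dimensional density on a grid with kde2d() from the MASS package and draws it with image() and contour(). This only illustrates the idea; it is not claimed to be how smoothScatter() works internally, and the seed and the 50 x 50 grid size are arbitrary assumptions:

```r
library(MASS)   ## provides kde2d()

## regenerate the data so the example is self-contained (seed is arbitrary)
set.seed(42)
x <- rnorm(50000, mean = 10, sd = 15)
y <- rnorm(50000, mean = 7, sd = 3)

## estimate the joint density of (x, y) on a 50 x 50 grid
dens <- kde2d(x, y, n = 50)

## draw the estimated density as a colored image with contour lines
image(dens, xlab = "x", ylab = "y",
      main = "Kernel density estimate of (x, y)")
contour(dens, add = TRUE)

## locate the grid cell with the highest estimated density
peak <- which(dens$z == max(dens$z), arr.ind = TRUE)[1, ]
peak_xy <- c(x = dens$x[peak[["row"]]], y = dens$y[peak[["col"]]])
print(peak_xy)
```

The highest-density cell should fall in the neighborhood of the two means, which is the same region that appears darkest in the smoothScatter() plot.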

Similar results can be obtained using the heatscatter() function. This function produces a similar visual based on densities that are represented as color bands. As indicated above, the LSD package should be installed and loaded to access the heatscatter() function. The resulting code is:

####################################################
## produce a heatscatter plot of x and y
##
library(LSD)
heatscatter(x,y, main="Heat Color Density Representation of 50,000 (x, y) Coordinates")   ## function heatscatter() with n=50,000
####################################################

In comparing this plot with the smoothScatter() plot one can more clearly see the distinctive density bands surrounding the coordinates x=10, y=7. You may also notice depending on the computer you are using that there is a noticeably longer processing time required to produce the heatscatter() plot.
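
The difference in processing time can be measured with system.time(). The sketch below is illustrative only: it regenerates the data with an arbitrary seed, times smoothScatter(), and times heatscatter() only if the LSD package is available. Actual timings depend on the machine being used:

```r
## regenerate the data so the example is self-contained (seed is arbitrary)
set.seed(42)
x <- rnorm(50000, mean = 10, sd = 15)
y <- rnorm(50000, mean = 7, sd = 3)

## time the smoothScatter() plot
t_smooth <- system.time(smoothScatter(x, y))
print(t_smooth[["elapsed"]])

## time the heatscatter() plot, but only when LSD is installed
if (requireNamespace("LSD", quietly = TRUE)) {
  t_heat <- system.time(LSD::heatscatter(x, y))
  print(t_heat[["elapsed"]])
}
```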

This tutorial has hopefully provided some useful information relative to visual displays of large data sets. In the next segment I will discuss how these techniques can be used on a live database containing millions of cases.

# R for Beginners: Some Simple Code to Produce Informative Graphs, Part One

A Tutorial by D. M. Wiig

The R programming language has a multitude of packages that can be used to display various types of graphs. For a new user looking to display data in a meaningful way, graphing functions can look very intimidating. When using a statistics package such as SPSS, Stata, Minitab or even some of the R GUIs such as R Commander, sophisticated graphs can be produced but with a limited range of options. When using the R command line to produce graphics output the user has control over virtually every aspect of the graphics output.

For new R users there are some basic commands that can be used that are easy to understand and offer a large degree of control over customisation of the graphical output. In part one of this tutorial I will discuss some R scripts that can be used to show typical output from a basic correlation and regression analysis.

For the first example I will use one of the datasets from the R MASS package. The dataset is ‘UScrime’, which contains data on certain factors and their relationship to violent crime. In the first example I will produce a simple scatter plot using the variable ‘GDP’ as the independent variable and the crime rate as the dependent variable, which is represented by the letter ‘y’ in the dataset.

Before starting on this project install and load the R package ‘MASS.’ Other needed packages are loaded when R is started. The scatter plot is produced using the following code:

####################################################
### make sure that the MASS package is installed
###################################################
library(MASS)     ## load MASS, which contains the UScrime dataset
attach(UScrime)   ## use the UScrime dataset
## plot the two dimensional scatterplot and add appropriate labels
#
plot(GDP, y,
main="Basic Scatterplot of Crime Rate vs. GDP",
xlab="GDP",
ylab="Crime Rate")
#
####################################################

The above code produces a two-dimensional plot of GDP vs. Crimerate. A regression line can be added to the graph produced by including the following code:

####################################################
## add a regression line to the scatter plot by using a simple bivariate
## linear model
## lm generates the coefficients for the regression model
## col sets color; lwd sets line width; lty sets line type
#
abline(lm(y ~ GDP), col="red", lwd=2, lty=1)
#
####################################################
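
An alternative that avoids attach() is to pass the data frame through the data argument of plot() and lm(). The following is a minimal sketch of the same scatter plot and regression line written that way; the object names fit and r2 are illustrative choices, not part of the original script:

```r
library(MASS)   ## contains the UScrime dataset

## fit the bivariate model without attaching the data frame
fit <- lm(y ~ GDP, data = UScrime)

## scatter plot using the formula interface, then add the fitted line
plot(y ~ GDP, data = UScrime,
     main = "Basic Scatterplot of Crime Rate vs. GDP",
     xlab = "GDP", ylab = "Crime Rate")
abline(fit, col = "red", lwd = 2, lty = 1)

## intercept and slope of the line, and the proportion of variance explained
print(coef(fit))
r2 <- summary(fit)$r.squared
print(r2)
```

Keeping the fitted model in an object also makes it easy to inspect the coefficients that define the line drawn by abline().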

As is often the case in behavioral research we want to evaluate models that involve more than two variables. For multivariate models scatter plots can be generated using a three-dimensional counterpart of the R plot() function. For the above model we can add a third variable ‘Ineq’ from the dataset, which is a measure of the distribution of wealth in the population. Since we are now working with a multivariate linear model of the form ‘y = b1(x1) + b2(x2) + a’ we can use the R function scatterplot3d(), from the package of the same name, to generate a three-dimensional representation of the variables.

Once again we use the MASS package and the dataset  ‘UScrime’ for the graph data. The code is seen below:

####################################################
## create a 3d graph using the variables y, GDP, and Ineq
####################################################
#
require(MASS)
library(scatterplot3d)   ## provides the scatterplot3d() function
attach(UScrime)   ## use data from UScrime dataset
scatterplot3d(y,GDP, Ineq,
main="Basic 3D Scatterplot") ## graph 3 variables: y, GDP, Ineq
#
###################################################

The following graph is produced:

The above code will generate a basic 3d plot using default values. We can add straight lines from the plane of the graph to each of the data points by setting the graph type option to ‘type="h"’, as seen in the code below:

####################################################
require(MASS)
library(scatterplot3d)
attach(UScrime)
model <- scatterplot3d(GDP, Ineq, y,
type="h", ## add vertical lines from plane with this option
main="3D Scatterplot with Vertical Lines")
####################################################

This results in the graph:

There are numerous options that can be used to go beyond the basic 3d plot. Refer to CRAN documentation to see these. A final addition to the 3d plot as discussed here is the code needed to generate the regression plane of our linear regression model using the y (crimerate), GDP, and Ineq variables. This is accomplished using the plane3d() option that will draw a plane through the data points of the existing plot. The code to do this is shown below:

####################################################
require(MASS)
library(scatterplot3d)
attach(UScrime)
model <- scatterplot3d(GDP, Ineq, y,
type="h",   ## add vertical line from plane to data points with this option
main="3D Scatterplot with Vertical Lines")
## now calculate and add the linear regression data
model1 <- lm(y ~ GDP + Ineq)
model$plane3d(model1)   ## link the 3d scatterplot in 'model' to the 'plane3d' option with 'model1' regression information
####################################################

The resulting graph is:

To draw a regression plane through the data points only, change the ‘type’ option to ‘type="p"’ to show the data points without vertical lines to the plane. There are also many other options that can be used. See the CRAN documentation to review them.
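
The regression plane is defined by the three coefficients of the fitted model y = b1(GDP) + b2(Ineq) + a. As a small sketch (the object names fit2, b, at_means, and pred are illustrative), the coefficients can be printed and used for a prediction without attaching the data:

```r
library(MASS)   ## contains the UScrime dataset

## fit the two-predictor model that defines the regression plane
fit2 <- lm(y ~ GDP + Ineq, data = UScrime)

## intercept (a) and the two slopes (b1 for GDP, b2 for Ineq)
b <- coef(fit2)
print(b)

## height of the plane at the mean of both predictors
at_means <- data.frame(GDP = mean(UScrime$GDP), Ineq = mean(UScrime$Ineq))
pred <- unname(predict(fit2, newdata = at_means))
print(pred)
```

For a least-squares model with an intercept, the plane passes through the point of means, so the prediction at the mean GDP and mean Ineq equals the mean of y.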

I have hopefully shown that relatively simple R code can be used to generate some informative and useful graphs. Once you start to become aware of how to use the multitude of options for these functions you can have virtually total control of the visual presentation of data. I will discuss some additional simple graphs in the next tutorial that I post.

# R For Beginners: Some Simple R Code to do Common Statistical Procedures, Part Two

An R tutorial by D. M. Wiig

This posting contains an embedded Word document. To view the document full screen click on the icon in the lower right hand corner of the embedded document.

# R For Beginners: A Video Tutorial on Installing and Using the Deducer Statistics Package


In previous tutorials I have discussed the use of R Commander and Deducer statistical packages that provide a menu based GUI for R.  In this video tutorial I will discuss downloading and installing the Deducer statistics package.  This video is designed to support my previous tutorial on the same subject.

I have embedded the video below. I hope you find this tutorial a useful adjunct to installing and using the menu based Deducer package.


# R For Beginners: Installing the JGR GUI On a Linux Platform

A Tutorial by D. M. Wiig

This is an embedded Word document. To view it full screen click on the icon in the lower right corner of the document.

Watch for more tutorials discussing  R statistics on a Linux platform.

A tutorial by D. M. Wiig

One of the nice characteristics of open source software such as R is the rapid development of new releases and updates.  While the base core remains stable for a period of time there is a considerable amount of updating,  adding, and removing the component packages.  At the time of this writing the latest iteration is R version 3.3.1, “Bug in Your Hair.” If you are using a Windows platform you will likely go directly to the archive web site and download the latest distribution as a Windows executable installation package.

If you are using a Linux distribution  such as Ubuntu or Debian, the process of adding software is usually accomplished via the menu based installer.  These software installers allow  R and its dependencies to be downloaded from the community archive.

One of the disadvantages of using this approach is that the versions of some software in the community archives may not be updated to the latest version.  This is often the case with R as well as with many other software packages.

To ensure that you are downloading the latest R version you need to use the platform’s command line to install what is needed. You can add the URLs of some backport archives that are more likely to be kept up to date with current releases. As an example, in this tutorial I will use the R statistical software that I am running on my Raspberry Pi 3 board with a Raspbian OS and the new PIXEL desktop.

Regardless of which Linux distribution you are using first open a command console from the desktop menu. Make sure all is up to date by using the command:

pi@raspberrypi:~ \$ sudo apt-get update

This will ensure all appropriate packages currently installed are running the latest updates. If you are running a Raspbian distribution such as jessie you will need to edit the /etc/apt/sources.list file to add a backport to the latest version of R. Start the nano editor by using the command:

sudo nano /etc/apt/sources.list

This should produce the output as seen below:

```
pi@raspberrypi:~ $ sudo nano /etc/apt/sources.list

------------------------------------------------
GNU nano 2.2.6 File: /etc/apt/sources.list

deb http://mirrordirector.raspbian.org/raspbian/ jessie main contrib non-free r$
# Uncomment line below then 'apt-get update' to enable 'apt-get source'
deb-src http://archive.raspbian.org/raspbian/ jessie main contrib non-free rpi
deb http://archive.raspbian.org/raspbian/ stretch main
deb http://mirror.las.iastate.edu/CRAN/bin/linux/debian/ jessie main
deb http://mirror.las.iastate.edu/CRAN/bin/linux/ubuntu xenial/

^G Get Help ^O WriteOut ^R Read File ^Y Prev Page ^K Cut Text ^C Cur Pos
^X Exit ^J Justify ^W Where Is ^V Next Page ^U UnCut Text ^T To Spell
```

As is seen above there are several lines containing the standard  Raspbian archives to search.

If you are using a Debian distribution you would add the following line to the file:

```
deb http://mirror.las.iastate.edu/CRAN/bin/linux/debian/ jessie main
```

Replace the 'jessie' portion with the name of the specific Debian distribution you are using, and replace the 'mirror' portion with the R CRAN mirror that you use. You also need to add the line that provides the URL of a Raspbian 'stretch' archive that contains the most recent updates of many different software packages. In my case I was looking for the latest R release, but you should search this archive for the latest version of any software package you are installing.

If you are using an Ubuntu distribution add a line with the appropriate changes for the specific Ubuntu distribution that you are using. Check the documentation provided with your specific Linux distribution to see if there is also a 'stretch' archive maintained for new versions.

Once these changes are made, write the file using the ^O key command and then exit the nano editor using the ^X key command to return to the command line. Run sudo apt-get update once more so that the newly added archives are read. You should now be able to issue the install command and then start R:

```
pi@raspberrypi:~ $ sudo apt-get install r-base r-base-core r-base-dev

pi@raspberrypi:~ $ R

R version 3.3.2 RC (2016-10-23 r71578) -- "Sincere Pumpkin Patch"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: arm-unknown-linux-gnueabihf (32-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

>
```

For other Linux distributions you would add a line similar to the above examples in the /etc/apt/sources.list. Check the documentation for your specific Linux platform for further information about backport archives.
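
Once R has started, the installed version can also be checked from inside the R session. This is a brief sketch using base R functions; the comparison target "3.3.2" is taken from the startup banner shown above and should be changed to whichever release you expect, and the object names are illustrative:

```r
## full version string, as printed in the startup banner
print(R.version.string)

## version number in a form that can be compared
installed <- getRversion()

## TRUE when the running R is at least the release named here
is_current <- installed >= "3.3.2"
print(is_current)
```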

# R Video Tutorial For Beginners: Installing And Using the Rcommander GUI


A tutorial video by D. M. Wiig

In my recent series of tutorials for those interested in the R statistical programming language I have discussed both the installation and use of the R console and the R Commander statistics GUI. Before viewing the tutorial make sure the R Commander package has been downloaded into your R library via the Install Packages menu option. This procedure was discussed in the previously posted R Commander tutorial.

To accompany this first tutorial I have created a video that covers the initial installation of R Commander. The video is seen below:

Click the icon in the lower right side of the screen to view the tutorial in full screen mode.

I hope that you find this useful in your pursuit of learning about  R statistics.

# R for Beginners: Using R Commander in an Introductory Statistics Course


A tutorial by D. M. Wiig

As with previous tutorials in this series this document is an embedded Word document. To view the document full screen click on the icon in the lower right corner of the window.

# R Tutorial: A Simple Script to Create and Analyze a Data File, Part Two

A simple R script to create and analyze a data file, part two: A tutorial by D.M. Wiig

In part one I discussed creating a simple data file containing the height and weight of 10 subjects. In part two I will discuss the script needed to create a simple scatter diagram of the data and perform a basic Pearson correlation. Before attempting to continue the script in this tutorial make sure that you have created and saved the data file as discussed in part one.

To conduct a correlation/regression analysis of the data we want to first view a simple scatter plot. Load a library named ‘car’ into R memory. Use the command:

> library(car)

Then issue the following command to plot the graph:

> plot(Height~Weight, log="xy", data=Sampledatafile)

The output is seen below:

We can calculate a Pearson’s Product Moment correlation coefficient by using the command:

> # Pearson product-moment correlation between height and weight

> cor(Sampledatafile[,c("Height","Weight")], use="complete.obs", method="pearson")

Which results in:

Height Weight

Height 1.0000000 0.8813799

Weight 0.8813799 1.0000000
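
The cor() function reports only the coefficient. To also obtain a significance test and a confidence interval, cor.test() from the base stats package can be used. The sketch below rebuilds the data frame from the values listed in part one so that it runs on its own; the object name ct is an illustrative choice:

```r
## rebuild the height and weight data from part one
Sampledatafile <- data.frame(
  Height = c(72, 60, 65, 75, 66, 65, 70, 71, 68, 77),
  Weight = c(225, 128, 176, 215, 145, 120, 210, 176, 155, 250)
)

## Pearson correlation with a t test and 95 percent confidence interval
ct <- cor.test(Sampledatafile$Height, Sampledatafile$Weight,
               method = "pearson")
print(ct$estimate)   ## the correlation coefficient
print(ct$p.value)    ## p-value for the test that the correlation is zero
print(ct$conf.int)   ## confidence interval for the correlation
```

The estimate should match the off-diagonal value in the correlation matrix shown above.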

To run a simple linear regression for Height and Weight use the following code. Note that the dependent variable (Weight) is listed first:

> model <-lm(Weight~Height, data=Sampledatafile)

> summary(model)

Call:

lm(formula = Weight ~ Height, data = Sampledatafile)

Residuals:

Min 1Q Median 3Q Max

-30.6800 -16.9749 -0.8774 19.9982 25.3200

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -337.986 98.403 -3.435 0.008893 **

Height 7.518 1.425 5.277 0.000749 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 21.93 on 8 degrees of freedom

Multiple R-squared: 0.7768, Adjusted R-squared: 0.7489

F-statistic: 27.85 on 1 and 8 DF, p-value: 0.0007489

>
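
The fitted model can also be used to estimate weight for a given height with predict(). The following is a minimal sketch; the data frame is rebuilt from part one so the example is self-contained, and the height of 70 inches is a hypothetical value chosen for illustration:

```r
## rebuild the height and weight data from part one
Sampledatafile <- data.frame(
  Height = c(72, 60, 65, 75, 66, 65, 70, 71, 68, 77),
  Weight = c(225, 128, 176, 215, 145, 120, 210, 176, 155, 250)
)

## same model as above: Weight regressed on Height
model <- lm(Weight ~ Height, data = Sampledatafile)

## predicted weight, with a 95 percent prediction interval,
## for a hypothetical subject who is 70 inches tall
new_subject <- data.frame(Height = 70)
pred <- predict(model, newdata = new_subject, interval = "prediction")
print(pred)
```

The column named fit holds the point estimate, and lwr and upr give the bounds of the prediction interval.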

To plot a regression line on the scatter diagram use the following command line. Note that we enter the y (dependent) variable first and then the x (independent) variable:

> scatterplot(Weight~Height, log="xy", reg.line=lm, smooth=FALSE, spread=FALSE,

+ data=Sampledatafile)

>

This will produce a graph as seen below. Note that box plots have also been included in the output:

This tutorial has hopefully demonstrated that complex tasks can be accomplished with relatively simple command line script. I will explore more of these simple scripts in future tutorials.

More to Come:

# R Tutorial: A Script to Create and Analyze a Simple Data File, Part One


By D.M. Wiig

In this tutorial I will walk you through a simple script that will show you how to create a data file and perform some simple statistical procedures on the file. I will break the code into segments and discuss what each segment does. Before starting this tutorial make sure you have a terminal window open and open R from the command line.

The first task is to create a simple data file. Let’s assume that we have some data from 10 individuals measuring each person’s height and weight. The data is shown below:

Height(inches) Weight(lbs)

72               225

60               128

65               176

75               215

66               145

65               120

70               210

71               176

68               155

77               250

We can enter the data into a data matrix by invoking the data editor and entering the values. Please note that the lines of code preceded by a # are comments and are ignored by R:

#Create a new file and invoke the data editor to enter data

#Create the file Sampledatafile, height and weight of 10 subjects

Sampledatafile <-data.frame()

Sampledatafile <-edit(Sampledatafile)

You will see a window open that is the R Data Editor. Click on the column heading ‘var1’ and you will see several different data types in the drop down menu. Choose the ‘real’ data type. Follow the same procedure to set the data type for the second column. Enter the data pairs in the columns, with height in the first column and weight in the second column. When the data have been entered click on the var1 heading for column 1 and click ‘Change Name.’ Enter ‘Height’ to label the first column. Follow the same steps to rename the second column ‘Weight.’

Once both columns of data have been entered you can click ‘Quit.’ The datafile ‘Sampledatafile’ is now loaded into memory.
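
For readers who prefer not to use the interactive Data Editor, or who are running R without a graphical display, the same data frame can be built directly in code. This is an alternative sketch rather than the method described above; the values are those listed in the table at the start of this tutorial:

```r
## build the data frame directly instead of typing into the Data Editor
Sampledatafile <- data.frame(
  Height = c(72, 60, 65, 75, 66, 65, 70, 71, 68, 77),          ## inches
  Weight = c(225, 128, 176, 215, 145, 120, 210, 176, 155, 250)  ## lbs
)

## confirm the structure: 10 observations of 2 numeric variables
str(Sampledatafile)
```

Creating the data in code also means the whole analysis can be rerun later from a single script file.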

To run some basic descriptive statistics use the following code:

> #Run descriptives on the data

> summary(Sampledatafile)

The output from this code will be:

Height                Weight

Min. :60.00          Min. :120.0

1st Qu.:65.25        1st Qu.:147.5

Median :69.00        Median :176.0

Mean :68.90          Mean :180.0

3rd Qu.:71.75        3rd Qu.:213.8

Max. :77.00          Max. :250.0

>

To view the data file use the following lines of code:

>#print the datafile ‘Sampledatafile’ on the screen

> print(Sampledatafile)

You will see the output:

Height          Weight

1 72             225

2 60             128

3 65             176

4 75             215

5 66             145

6 65             120

7 70             210

8 71             176

9 68             155

10 77            250
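
The data frame exists only in the current R session until it is saved. One possible way to keep it for later use, shown here as a sketch, is saveRDS() together with readRDS(). The file location below uses tempdir() purely for illustration; in practice you would choose a path in your own working directory, and the object names data_path and reloaded are hypothetical:

```r
## rebuild the data so the example is self-contained
Sampledatafile <- data.frame(
  Height = c(72, 60, 65, 75, 66, 65, 70, 71, 68, 77),
  Weight = c(225, 128, 176, 215, 145, 120, 210, 176, 155, 250)
)

## write the data frame to a file (illustrative temporary location)
data_path <- file.path(tempdir(), "Sampledatafile.rds")
saveRDS(Sampledatafile, file = data_path)

## in a later session, read it back into memory
reloaded <- readRDS(data_path)
print(identical(reloaded, Sampledatafile))
```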

In Part Two I will discuss an R script to do a simple correlation and scatter diagram.  Check back later!