
The R qgraph Package: Using R to Visualize Complex Relationships Among Variables in a Large Dataset, Part One

A Tutorial by D. M. Wiig, Professor of Political Science, Grand View University

In my most recent tutorials I have discussed the use of the tabplot package to visualize multivariate data of mixed types in large datasets. This type of table display is a handy way to identify possible relationships among variables, but it is limited in terms of interpretation and the number of variables that can be meaningfully displayed.

Social science research projects often start out with many potential independent predictor variables for a given dependent variable. If these variables are all measured at the interval or ratio level, a correlation matrix often serves as a starting point for analyzing relationships among the variables.

In this tutorial I will use the R packages SemiPar, qgraph and Hmisc in addition to the basic packages loaded when R is started. The code is as follows:

###################################################
#data from package SemiPar; dataset milan.mort
#dataset has 3652 cases and 9 vars
##################################################
install.packages("SemiPar")
install.packages("Hmisc")
install.packages("qgraph")
library(SemiPar)
####################################################

One of the datasets contained in the SemiPar package is milan.mort. This dataset contains nine variables and data from 3652 consecutive days for the city of Milan, Italy. The nine variables in the dataset are as follows:

rel.humid (relative humidity)
tot.mort (total number of deaths)
resp.mort (total number of respiratory deaths)
SO2 (measure of sulphur dioxide level in ambient air)
TSP (total suspended particles in ambient air)
day.num (number of days since 31st December, 1979)
day.of.week (1=Monday; 2=Tuesday; 3=Wednesday; 4=Thursday; 5=Friday; 6=Saturday; 7=Sunday)
holiday (indicator of public holiday: 1=public holiday, 0=otherwise)
mean.temp (mean daily temperature in degrees Celsius)

To look at the structure of the dataset use the following code:

#########################################
library(SemiPar)
data(milan.mort)
str(milan.mort)
###############################################

Resulting in the output:

> str(milan.mort)
'data.frame': 3652 obs. of 9 variables:
$ day.num : int 1 2 3 4 5 6 7 8 9 10 ...
$ day.of.week: int 2 3 4 5 6 7 1 2 3 4 ...
$ holiday : int 1 0 0 0 0 0 0 0 0 0 ...
$ mean.temp : num 5.6 4.1 4.6 2.9 2.2 0.7 -0.6 -0.5 0.2 1.7 ...
$ rel.humid : num 30 26 29.7 32.7 71.3 80.7 82 82.7 79.3 69.3 ...
$ tot.mort : num 45 32 37 33 36 45 46 38 29 39 ...
$ resp.mort : int 2 5 0 1 1 6 2 4 1 4 ...
$ SO2 : num 267 375 276 440 354 ...
$ TSP : num 110 153 162 198 235 ...

As seen above, the dataset contains 3652 cases and nine variables, all recorded as numeric values (most of them measured at the interval or ratio level).

In doing exploratory research a correlation matrix is often generated as a first attempt to look at inter-relationships among the variables in the dataset. In this particular case a researcher might be interested in looking at factors that are related to total mortality as well as respiratory mortality rates.

A correlation matrix can be generated using the cor() function, which is contained in the stats package. There are a variety of functions for various types of correlation analysis; cor() provides a fast way to calculate Pearson's r for a large dataset such as the one used in this example.

To generate a zero-order Pearson's correlation matrix use the following:

###############################################
#generate the correlation matrix and put it in the variable cormatround
#coerce the data to a matrix
#round the output to 2 decimal places
#########################################
library(Hmisc)
cormatround <- cor(as.matrix(milan.mort))   #generate the correlation matrix
round(cormatround, 2)                       #round the output to 2 decimal places
#################################################

The output is:

> round(cormatround, 2)
            day.num day.of.week holiday mean.temp rel.humid tot.mort resp.mort   SO2   TSP
day.num        1.00        0.00    0.01      0.02      0.12    -0.28     -0.22 -0.34  0.07
day.of.week    0.00        1.00    0.00      0.00      0.00    -0.05     -0.03 -0.05 -0.05
holiday        0.01        0.00    1.00     -0.07      0.01     0.00     -0.01  0.00 -0.01
mean.temp      0.02        0.00   -0.07      1.00     -0.25    -0.43     -0.26 -0.66 -0.44
rel.humid      0.12        0.00    0.01     -0.25      1.00     0.01     -0.03  0.15  0.17
tot.mort      -0.28       -0.05    0.00     -0.43      0.01     1.00      0.47  0.44  0.25
resp.mort     -0.22       -0.03   -0.01     -0.26     -0.03     0.47      1.00  0.32  0.15
SO2           -0.34       -0.05    0.00     -0.66      0.15     0.44      0.32  1.00  0.63
TSP            0.07       -0.05   -0.01     -0.44      0.17     0.25      0.15  0.63  1.00
>

The matrix can be examined to look at intercorrelations among the nine variables, but it is very difficult to detect patterns of correlations within the matrix.  Also, when using the cor() function raw Pearson’s coefficients are reported, but significance levels are not.

A correlation matrix with significance can be generated by using the rcorr() function, also found in the Hmisc package. The code is:

#############################################
library(Hmisc)
rcorr(as.matrix(milan.mort), type="pearson")
###################################################

The output is:

> rcorr(as.matrix(milan.mort), type="pearson")
            day.num day.of.week holiday mean.temp rel.humid tot.mort resp.mort   SO2   TSP
day.num        1.00        0.00    0.01      0.02      0.12    -0.28     -0.22 -0.34  0.07
day.of.week    0.00        1.00    0.00      0.00      0.00    -0.05     -0.03 -0.05 -0.05
holiday        0.01        0.00    1.00     -0.07      0.01     0.00     -0.01  0.00 -0.01
mean.temp      0.02        0.00   -0.07      1.00     -0.25    -0.43     -0.26 -0.66 -0.44
rel.humid      0.12        0.00    0.01     -0.25      1.00     0.01     -0.03  0.15  0.17
tot.mort      -0.28       -0.05    0.00     -0.43      0.01     1.00      0.47  0.44  0.25
resp.mort     -0.22       -0.03   -0.01     -0.26     -0.03     0.47      1.00  0.32  0.15
SO2           -0.34       -0.05    0.00     -0.66      0.15     0.44      0.32  1.00  0.63
TSP            0.07       -0.05   -0.01     -0.44      0.17     0.25      0.15  0.63  1.00

n= 3652 


P
            day.num day.of.week holiday mean.temp rel.humid tot.mort resp.mort SO2    TSP
day.num             0.9771      0.5349  0.2220    0.0000    0.0000   0.0000    0.0000 0.0000
day.of.week 0.9771              0.7632  0.8727    0.8670    0.0045   0.1175    0.0024 0.0061
holiday     0.5349  0.7632              0.0000    0.4648    0.8506   0.6115    0.7793 0.4108
mean.temp   0.2220  0.8727      0.0000            0.0000    0.0000   0.0000    0.0000 0.0000
rel.humid   0.0000  0.8670      0.4648  0.0000              0.3661   0.1096    0.0000 0.0000
tot.mort    0.0000  0.0045      0.8506  0.0000    0.3661             0.0000    0.0000 0.0000
resp.mort   0.0000  0.1175      0.6115  0.0000    0.1096    0.0000             0.0000 0.0000
SO2         0.0000  0.0024      0.7793  0.0000    0.0000    0.0000   0.0000           0.0000
TSP         0.0000  0.0061      0.4108  0.0000    0.0000    0.0000   0.0000    0.0000
>

In a future tutorial I will discuss using significance levels and correlation strengths as methods of reducing complexity in very large correlation network structures.

The recently released qgraph package provides a number of interesting functions that are useful in visualizing complex inter-relationships among a large number of variables. To quote from the CRAN documentation, qgraph "Can be used to visualize data networks as well as provides an interface for visualizing weighted graphical models." (See the CRAN documentation for qgraph version 1.4.2 and http://sachaepskamp.com/qgraph.)

The qgraph() function has a variety of options that can be used to produce specific types of graphical representations. In this first tutorial segment I will use the milan.mort dataset and the most basic qgraph functions to produce a visual graphic network of intercorrelations among the 9 variables in the dataset.

The code is as follows:

###################################################
library(qgraph)
#use cor function to create a correlation matrix with milan.mort dataset
#and put into cormat variable
###################################################
cormat=cor(milan.mort)  #correlation matrix generated
###################################################
###################################################
#now plot a graph of the correlation matrix
###################################################
qgraph(cormat, shape="circle", posCol="darkgreen", negCol="darkred", layout="groups", vsize=10)
###################################################

This code produces the following correlation network:

The correlation network provides a very useful visual picture of the intercorrelations as well as the positive and negative correlations. The relative thickness and color density of each band indicate the strength of Pearson's r, and the color of each band indicates the direction of the correlation: red for negative and green for positive.
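With nine variables the full network is still readable, but as the number of variables grows the weakest edges quickly add clutter. One simple way to thin the display is qgraph's minimum argument, which suppresses edges whose absolute weight falls below a stated cutoff. The sketch below is a minimal illustration of the idea; the 0.30 cutoff is an arbitrary value chosen for the example, not a recommendation.

###################################################
#hedged sketch: hide edges with |r| below 0.30
#assumes the milan.mort correlation matrix is in cormat, as above
###################################################
qgraph(cormat, minimum=0.30, posCol="darkgreen", negCol="darkred", vsize=10)
###################################################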

By changing the “layout=” option from “groups” to “spring” a slightly different perspective can be seen. The code is:

########################################################
#Code to produce alternative correlation network:
#######################################################
library(qgraph)
#use cor function to create a correlation matrix with milan.mort dataset
#and put into cormat variable
##############################################################
cormat=cor(milan.mort) #correlation matrix generated
##############################################################
###############################################################
#now plot a circle graph of the correlation matrix
##########################################################
qgraph(cormat, shape="circle", posCol="darkgreen", negCol="darkred", layout="spring", vsize=10)
###############################################################

The graph produced is below:

Once again the intercorrelations, the strength of r, and the positive and negative correlations can be easily identified. There are many more options, graph types, and analysis procedures available in the qgraph package; I will discuss some of these in future tutorials.
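To give a sense of what else is available, the short sketch below tries two options I find useful: layout="circle" arranges the nodes in a circle, and edge.labels=TRUE prints the correlation value on each edge. Treat this as an illustrative sketch of the option names rather than a definitive recipe, and consult the qgraph documentation for the full set of arguments.

###################################################
#hedged sketch: circular layout with edge labels
#assumes the milan.mort correlation matrix is in cormat
###################################################
library(qgraph)
cormat <- cor(milan.mort)
qgraph(cormat, layout="circle", edge.labels=TRUE, posCol="darkgreen", negCol="darkred", vsize=10)
###################################################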

R Tutorial: Visualizing Multivariate Relationships in Large Datasets

A tutorial by D.M. Wiig

In two previous blog posts I discussed some techniques for visualizing relationships involving two or three variables and a large number of cases. In this tutorial I will extend that discussion to show some techniques that can be used on large datasets and complex multivariate relationships involving three or more variables.

In this tutorial I will use the R package nlme, which contains the dataset MathAchieve. Use the code below to install the package and load it into the R environment:

####################################################
#code to visualize the large MathAchieve dataset
#first show 3d scatterplot; then show tableplot variations
####################################################
install.packages("nlme") #install the nlme package
library(nlme) #load the package into the R environment
####################################################

Once the package is installed take a look at the structure of the data set by using:

####################################################
attach(MathAchieve) #attach the dataset so its variables can be referenced by name
str(MathAchieve)
####################################################

Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame': 7185 obs. of 6 variables:
$ School : Ord.factor w/ 160 levels "8367"<"8854"<..: 59 59 59 59 59 59 59 59 59 59 ...
$ Minority: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
$ Sex : Factor w/ 2 levels "Male","Female": 2 2 1 1 1 1 2 1 2 1 ...
$ SES : num -1.528 -0.588 -0.528 -0.668 -0.158 ...
$ MathAch : num 5.88 19.71 20.35 8.78 17.9 ...
$ MEANSES : num -0.428 -0.428 -0.428 -0.428 -0.428 -0.428 -0.428 -0.428 -0.428 -0.428 ...
- attr(*, "formula")=Class 'formula' language MathAch ~ SES | School
.. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
- attr(*, "labels")=List of 2
..$ y: chr "Mathematics Achievement score"
..$ x: chr "Socio-economic score"
- attr(*, "FUN")=function (x)
..- attr(*, "source")= chr "function (x) max(x, na.rm = TRUE)"
- attr(*, "order.groups")= logi TRUE
>

As can be seen from the output above, the MathAchieve dataset consists of 7185 observations and six variables. Three of these variables are numeric and three are factors, which presents some difficulties when visualizing the data. With over 7000 cases, a two-dimensional scatterplot showing bivariate correlations among the three numeric variables is of limited utility.

We can use a 3D scatterplot and a linear regression model to more clearly visualize and examine relationships among the three numeric variables. The variable SES is a vector measuring socio-economic status, MathAch is a numeric vector measuring mathematics achievement scores, and MEANSES is a vector measuring the mean SES for the school attended by each student in the sample.

We can look at the correlation matrix of these 3 variables to get a sense of the relationships among the variables:

> ####################################################
> #do a correlation matrix with the 3 numeric vars;
> ###################################################
> data(“MathAchieve”)
> cor(as.matrix(MathAchieve[c(4,5,6)]), method="pearson")

              SES   MathAch   MEANSES
SES     1.0000000 0.3607556 0.5306221
MathAch 0.3607556 1.0000000 0.3437221
MEANSES 0.5306221 0.3437221 1.0000000

In the cor() call above the variables are selected by column position, as shown in the output of the str() function: the three numeric variables are in columns 4, 5, and 6 of the data frame.
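The same matrix can also be produced by selecting the columns by name rather than by position, which is less error-prone if the column order ever changes. A minimal sketch:

####################################################
#select the numeric columns by name instead of by position
####################################################
cor(as.matrix(MathAchieve[c("SES", "MathAch", "MEANSES")]), method="pearson")
####################################################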

As discussed in previous tutorials, we can visualize the relationship among these three variables by using a 3D scatterplot. Use the code as seen below:

####################################################
#install.packages("nlme")
install.packages("scatterplot3d")
library(scatterplot3d)
library(nlme) #load the nlme package
attach(MathAchieve) #the MathAchieve dataset is now in the environment
scatterplot3d(SES, MEANSES, MathAch, main="Basic 3D Scatterplot") #do the plot with default options
####################################################

The resulting plot is:

[Figure: basic 3D scatterplot of SES, MEANSES, and MathAch]

Even though the scatterplot lacks detail due to the large sample size, it is still possible to see the moderate correlations shown in the correlation matrix by noting the shape and direction of the cloud of data points. A regression plane can be calculated and added to the plot using the following code:

scatterplot3d(SES, MEANSES, MathAch, main="Basic 3D Scatterplot") #do the plot with default options
####################################################
##use a linear regression model to plot a regression plane
#y=MathAch; SES and MEANSES are the predictor variables
####################################################
model1 <- lm(MathAch ~ SES + MEANSES)    ## generate a regression model
#take a look at the regression output
summary(model1)
#run the scatterplot again, putting the results in model
model <- scatterplot3d(SES, MEANSES, MathAch, main="Basic 3D Scatterplot")     #do the plot with default options
#link the scatterplot and linear model using the plane3d function
model$plane3d(model1)        ## add the regression plane from 'model1' to the 3D scatterplot stored in 'model'
####################################################

The resulting output is seen below:

Call:
lm(formula = MathAch ~ SES + MEANSES)

Residuals:
     Min       1Q   Median       3Q      Max
-20.4242  -4.6365   0.1403   4.8534  17.0496

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.72590    0.07429  171.31   <2e-16 ***
SES          2.19115    0.11244   19.49   <2e-16 ***
MEANSES      3.52571    0.21190   16.64   <2e-16 ***

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.296 on 7182 degrees of freedom
Multiple R-squared: 0.1624,  Adjusted R-squared: 0.1622
F-statistic: 696.4 on 2 and 7182 DF,  p-value: < 2.2e-16

and the plot with the plane is:

[Figure: 3D scatterplot of SES, MEANSES, and MathAch with the fitted regression plane]

While the above analysis gives us useful information, it is limited by the mixture of numeric variables and factors. A more detailed visual analysis that allows the display and comparison of all six variables is possible using the functions available in the R package tabplot. This package was created to aid in the visualization and inspection of large datasets with multiple variables.

The MathAchieve dataset contains a total of six variables and 7185 cases. The tabplot package can be used with datasets larger than 10,000 observations and up to 12 or so variables. It can be used to visualize relationships among variables measured on the same scale or on mixed measurement types.

To look at a comparison of each data type and then view all six together, begin with the following:

####################################################
attach(MathAchieve) #attach the dataset
#set up 3 data frames with numeric, factors, and mixed
####################################################
mathmix <- data.frame(SES,MathAch,MEANSES,School=factor(School),Minority=factor(Minority),Sex=factor(Sex)) #all 6 vars
mathfact <- data.frame(School=factor(School),Minority=factor(Minority),Sex=factor(Sex)) #3 factor vars
mathnum <- data.frame(SES,MathAch,MEANSES) #3 numeric vars
####################################################

To view a comparison of the 3 numeric variables use:

####################################################
require(tabplot) #load tabplot package
tableplot(mathnum) #generate a table plot with numeric vars only
####################################################

resulting in the following output:

[Figure: table plot of the three numeric variables]

To view only the 3 factor variables use:

####################################################
require(tabplot)   #load tabplot package
tableplot(mathfact)    #generate a table plot with factors only
####################################################

Resulting in:

[Figure: table plot of the three factor variables]

To view and compare table plots of all six variables use:

####################################################
require(tabplot)    #load tabplot package
tableplot(mathmix)    #generate a table plot with all six variables
####################################################

Resulting in:

[Figure: table plot of all six variables]

Table plots are useful in visualizing relationships among a set of variables. The fact that comparisons can be made across mixed levels of measurement and very large sample sizes makes this a handy tool for initial exploratory data analysis.

The above visual table comparisons agree with the moderate correlation among the three numeric variables found in the correlation and regression models discussed above.  It is also possible to add some additional interpretation by viewing and comparing the mix of both factor and numeric variables.
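One way to sharpen that comparison is to sort the table plot on one of the numeric variables so that the factor columns can be read against it. The sketch below sorts on MathAch using tableplot's sortCol argument; it assumes the mathmix data frame created earlier is still in the workspace.

####################################################
#hedged sketch: sort the table plot on MathAch to compare
#the factor variables against achievement scores
####################################################
require(tabplot)
tableplot(mathmix, sortCol = MathAch)   #rows binned after sorting on MathAch
####################################################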

In this tutorial I have provided a very basic introduction to the use of table plots in visualizing data. Interested readers can find an abundance of information about tabplot options and interpretation in the CRAN documentation.

In my next tutorial I will continue a discussion of methods to visualize large and complex datasets by looking at some techniques that allow exploration of very large datasets and up to 12 variables or more.

R For Beginners: Basic Graphics Code to Produce Informative Graphs, Part Two, Working With Big Data

A tutorial by D. M. Wiig

In part one of this tutorial I discussed the use of R code to produce 3D scatterplots, a useful way to visualize the results of multivariate linear regression models. While scatterplots are a useful tool with most datasets, they become much more of a challenge when analyzing big data. Such databases can contain tens of thousands or even millions of cases and hundreds of variables.

Working with these types of data sets involves a number of challenges. If a researcher is interested in using visual presentations such as scatterplots this can be a daunting task. I will start by discussing how scatterplots can be used to provide meaningful visual representation of the relationship between two variables in a simple bivariate model.

To start I will construct a theoretical dataset that consists of 50,000 x and y pairs of observations. One method of accomplishing this is to use the R rnorm() function to generate random values from a normal distribution with a specified mean and standard deviation. I will use this function to generate both the x and y variables.

Before starting this tutorial make sure that R is running and that the datasets, LSD, and stats packages have been installed and loaded. Use the following code to generate the x and y values such that the mean of x = 10 with a standard deviation of 15, and the mean of y = 7 with a standard deviation of 3:

##############################################
## make sure package LSD is loaded
##
library(LSD)
x <- rnorm(50000, mean=10, sd=15)   ## generates 50,000 x values and stores them in variable x
y <- rnorm(50000, mean=7, sd=3)     ## generates 50,000 y values and stores them in variable y
####################################################

Now the scatterplot can be created using the code:

##############################################
## plot randomly generated x and y values
##
plot(x, y, main="Scatterplot of 50,000 points")
####################################################

[Figure: scatterplot of the 50,000 (x, y) points]

As can be seen, the resulting plot is mostly a mass of black, with relatively few individual x and y points visible other than the outliers. We can run a quick histogram on the x values and the y values to check the normality of the resulting distributions. This is shown in the code below:
####################################################
## show histogram of x and y distribution
####################################################
hist(x)   ## histogram for x mean=10; sd=15; n=50,000
##
hist(y)   ## histogram for y mean=7; sd=3; n=50,000
####################################################

[Figure: histogram of x]

[Figure: histogram of y]

The histograms show a normal distribution for both variables. As expected, in the x vs. y scatterplot the center mass of points is located at the x = 10, y = 7 coordinate of the graph, since this coordinate contains the means of the two distributions. A more meaningful scatterplot of the dataset can be generated using the R functions smoothScatter() and heatscatter(). The smoothScatter() function is located in the graphics package and the heatscatter() function is located in the LSD package.

The smoothScatter() function creates a smoothed color density representation of a scatterplot. This allows for a better visual representation of the density of individual values for the x and y pairs. To use the smoothScatter() function with the large dataset created above use the following code:

##############################################
## use the smoothScatter() function to visualize the scatterplot
## of the 50,000 x and y values
## the x and y values should still be in the workspace as created
## above with the rnorm() function
##
smoothScatter(x, y, main = "Smoothed Color Density Representation of 50,000 (x,y) Coordinates")
##
####################################################

[Figure: smoothScatter plot of the 50,000 (x, y) coordinates]

The resulting plot shows several bands of density surrounding the coordinates x = 10, y = 7, which are the means of the two distributions, rather than an indistinguishable mass of dark points.

Similar results can be obtained using the heatscatter() function. This function produces a similar visual based on densities that are represented as color bands. As indicated above, the LSD package should be installed and loaded to access the heatscatter() function. The resulting code is:

##############################################
## produce a heatscatter plot of x and y
##
library(LSD)
heatscatter(x, y, main="Heat Color Density Representation of 50,000 (x, y) Coordinates") ## heatscatter() with n=50,000
####################################################

[Figure: heatscatter plot of the 50,000 (x, y) coordinates]

Comparing this plot with the smoothScatter() plot, the distinctive density bands surrounding the coordinates x = 10, y = 7 can be seen even more clearly. You may also notice, depending on the computer you are using, that the heatscatter() plot takes noticeably longer to produce.
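If you want to quantify that difference on your own machine rather than rely on impressions, base R's system.time() function gives a quick, if rough, timing of each call. A minimal sketch, assuming the x and y vectors generated above are still in the workspace:

##############################################
## rough timing comparison of the two plotting functions
##
library(LSD)
system.time(smoothScatter(x, y))   ## elapsed time for the smoothed density plot
system.time(heatscatter(x, y))     ## elapsed time for the heatscatter plot
##############################################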

This tutorial has hopefully provided some useful information relative to visual displays of large data sets. In the next segment I will discuss how these techniques can be used on a live database containing millions of cases.

Book Review: R High Performance Programming


A book review by Douglas M. Wiig

Aloysius Lim and William Tjhi. R High Performance Programming. Birmingham, UK: Packt Publishing Ltd., 2015. bit.ly/14Rhpp

R High Performance Programming is a well-written, informative book best suited for the experienced R programmer. It offers a handy guide for R users who need speed and efficiency in the tasks that they perform.

The authors begin with an informative chapter discussing some of the inherent constraints on R’s computing performance such as CPU and RAM usage, and how R code is interpreted on the fly rather than compiled. A guide to several methods of profiling R’s code execution time, memory allocation and CPU usage is discussed in the next chapter. Sample code included in the chapter allows the reader to experiment with various benchmarking techniques to measure processing time and memory usage. This chapter provides the reader with some good tools for benchmarking R projects and identifying areas where improvements in processing can be made.
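For readers who have not used these tools, base R already ships with the simplest of them: system.time() for wall-clock timing and Rprof()/summaryRprof() for sampling-based profiling. The fragment below is my own minimal sketch of that workflow, not an example taken from the book.

##############################################
## time an expression, then profile a longer computation
##
system.time(sort(rnorm(1e6)))          ## elapsed time for a single expression
Rprof("profile.out")                   ## start the sampling profiler
invisible(lapply(1:50, function(i) sort(rnorm(1e5))))
Rprof(NULL)                            ## stop profiling
summaryRprof("profile.out")$by.self    ## time spent in each function
##############################################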

As is always the case with technical books from Packt Publishing, ample code examples are used in the chapter and the complete code used in each chapter is available for download with the book. This is a very handy feature and allows readers to do some live programming with R as the book is read.

The authors discuss a number of simple tweaks that can be easily performed to increase processing speed such as using built in functions and using hash tables. The hash table technique is useful for applications that use frequent lookups and can dramatically reduce processing time when compared to the use of lists. Running example code using this technique shows a large decrease in processing time when using the hash table approach as compared to straight list processing lookups.
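In base R the usual way to get hash-table behavior is a hashed environment. The sketch below is my own illustration of the general idea rather than code reproduced from the book: it contrasts lookups in a named list with lookups in an environment created with new.env(hash = TRUE). For a large number of keys the environment lookups are typically much faster.

##############################################
## hashed environment vs. named list for repeated key lookups
##
keys   <- paste0("id", 1:50000)
values <- rnorm(50000)

lookup_list <- as.list(values)
names(lookup_list) <- keys

lookup_env <- new.env(hash = TRUE)
for (i in seq_along(keys)) assign(keys[i], values[i], envir = lookup_env)

probe <- sample(keys, 1000)
system.time(for (k in probe) lookup_list[[k]])            ## named list lookups
system.time(for (k in probe) get(k, envir = lookup_env))  ## hashed environment lookups
##############################################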

In chapter 4 the authors discuss the use of compiled R code and integrating compiled languages into R code. They show several examples of using the R package inline that allows users to embed C, C++, Objective-C, Objective-C++ and Fortran code within R. Once again there are ample code examples to illustrate the use of this technique. For more advanced uses of compiled code the authors discuss how to create entire modules coded in C++ using the Rcpp package. Several completed code examples are included to illustrate the technique.
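To give a flavor of what embedding compiled code looks like in practice, the fragment below uses Rcpp::cppFunction() to compile a small C++ function and call it from R. This is my own minimal illustration of the general technique the chapter covers, not code taken from the book.

##############################################
## compile a small C++ function with Rcpp and call it from R
##
library(Rcpp)
cppFunction('
  double sumC(NumericVector x) {
    double total = 0;
    for (int i = 0; i < x.size(); ++i) total += x[i];
    return total;
  }
')
sumC(rnorm(1e6))   ## behaves like sum() but runs as compiled C++
##############################################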

Another interesting approach to speeding up R is discussed in a chapter that explores several R packages designed to exploit the capability of the GPUs (graphics processing units) found in many computers. These techniques can facilitate creating very fast and efficient statistical modeling code using R and the GPU.

As indicated above, readers can download the code package included with the book and find a well-organized set of ten folders (one for each chapter) containing 51 files. These files contain the sample code from the book as well as other code segments and benchmark code discussed in the book. The authors indicate that the code has been tested on R 3.1.1, Ubuntu 14.04 Trusty Tahr, Mac OS X 10.9 Mavericks, and Windows 8.1. This allows integration of these code segments into the reader’s own projects with minimal changes.

Other chapters in R High Performance Programming discuss simple tweaks to use less memory, techniques to speed the processing of large datasets, and the use of parallel processing and clustering techniques. The last chapter contains a discussion of using R and Hadoop to process Big Data (massive datasets with sizes measured in petabytes; one petabyte is 1,048,576 gigabytes). Processing data of this magnitude presents many challenges and is an area that is currently the subject of much program development.

I found R High Performance Programming to be a useful and informative book for the advanced user of R. A working knowledge of statistics, R and other programming languages such as C++ or Java is necessary to realize the full benefit of the techniques presented in the book. The book also serves as a good learning tool for less knowledgeable R users who are seeking to advance their programming skills.

Readers who are interested in the use of Hadoop and cluster computer processing might find the book Raspberry Pi Super Cluster by Andrew K. Dennis (Packt Publishing, 2013) of interest. A review of this book can be found on my web site at http://dmwiig.net.

Reviewer Information:

Douglas M. Wiig, Professor of Political Science

Grand View University

Teaching areas include social science statistics and research methods, comparative politics, international politics.

Long time user and developer of computer and statistical applications

Host of Open Source Technology in Higher Education web site at http://dmwiig.net

Creator and moderator of LinkedIn discussion forum “Open Source Technology in Higher Education”

Regular contributor to several LinkedIn discussion forums

Author of numerous tutorials on using the R statistical programming language and Raspberry Pi computer