Tag Archives: tutorial

The R qgraph Package: Using R to Visualize Complex Relationships Among Variables in a Large Dataset, Part One


The R qgraph Package: Using R to Visualize Complex Relationships Among Variables in a Large Dataset, Part One

A Tutorial by D. M. Wiig, Professor of Political Science, Grand View University

In my most recent tutorials I have discussed the use of the tabplot() package to visualize multivariate mixed data types in large datasets. This type of table display is a handy way to identify possible relationships among variables, but is limited in terms of interpretation and the number of variables that can be meaningfully displayed.

Social science research projects often start out with many potential independent predictor variables for a given dependant variable. If these variables are all measured at the interval or ratio level a correlation matrix often serves as a starting point to begin analyzing relationships among variables.

In this tutorial I will use the R packages SemiPar, qgraph and Hmisc in addition to the basic packages loaded when R is started. The code is as follows:

###################################################
#data from package SemiPar; dataset milan.mort
#dataset has 3652 cases and 9 vars
##################################################
install.packages(“SemiPar”)
install.packages(“Hmisc”)
install.packages(“qgraph”)
library(SemiPar)
####################################################

One of the datasets contained in the SemiPar packages is milan.mort. This dataset contains nine variables and data from 3652 consecutive days for the city of Milan, Italy. The nine variables in the dataset are as follows:

rel.humid (relative humidity)
tot.mort (total number of deaths)
resp.mort (total number of respiratory deaths)
SO2 (measure of sulphur dioxide level in ambient air)
TSP (total suspended particles in ambient air)
day.num (number of days since 31st December, 1979)
day.of.week (1=Monday; 2=Tuesday; 3=Wednesday; 4=Thursday; 5=Friday; 6=Saturday; 7=Sunday
holiday (indicator of public holiday: 1=public holiday, 0=otherwise
mean.temp (mean daily temperature in degrees celsius)

To look at the structure of the dataset use the following

#########################################
library(SemiPar)
data(milan.mort)
str(milan.mort)
###############################################

Resulting in the output:

> str(milan.mort)
‘data.frame’: 3652 obs. of 9 variables:
$ day.num : int 1 2 3 4 5 6 7 8 9 10 …
$ day.of.week: int 2 3 4 5 6 7 1 2 3 4 …
$ holiday : int 1 0 0 0 0 0 0 0 0 0 …
$ mean.temp : num 5.6 4.1 4.6 2.9 2.2 0.7 -0.6 -0.5 0.2 1.7 …
$ rel.humid : num 30 26 29.7 32.7 71.3 80.7 82 82.7 79.3 69.3 …
$ tot.mort : num 45 32 37 33 36 45 46 38 29 39 …
$ resp.mort : int 2 5 0 1 1 6 2 4 1 4 …
$ SO2 : num 267 375 276 440 354 …
$ TSP : num 110 153 162 198 235 …

As is seen above, the dataset contains 9 variables all measured at the ratio level and 3652 cases.

In doing exploratory research a correlation matrix is often generated as a first attempt to look at inter-relationships among the variables in the dataset. In this particular case a researcher might be interested in looking at factors that are related to total mortality as well as respiratory mortality rates.

A correlation matrix can be generated using the cor function which is contained in the stats package. There are a variety of functions for various types of correlation analysis. The cor function provides a fast method to calculate Pearson’s r with a large dataset such as the one used in this example.

To generate a zero order Pearson’s correlation  matrix use the following:

###############################################
#round the corr output to 2 decimal places
#put output into variable cormatround
#coerce data to matrix

#########################################
library(Hmisc)
cormatround round(cormatround, 2)
#################################################

The output is:

> cormatround > round(cormatround, 2)
Day.num day.of.week holiday mean.temp rel.humid tot.mort resp.mort  SO2   TSP
day.num     1.00       0.00    0.01      0.02      0.12    -0.28  0.22 -0.34  0.07
day.of.week    0.00       1.00    0.00      0.00      0.00    -0.05  0.03 -0.05 -0.05
holiday        0.01       0.00    1.00     -0.07      0.01     0.00  0.01  0.00 -0.01
mean.temp      0.02       0.00   -0.07      1.00     -0.25    -0.43 -0.26 -0.66 -0.44
rel.humid      0.12       0.00    0.01     -0.25      1.00     0.01 -0.03  0.15  0.17
tot.mort      -0.28      -0.05    0.00     -0.43      0.01     1.00  0.47  0.44  0.25
resp.mort     -0.22      -0.03   -0.01     -0.26     -0.03     0.47  1.00  0.32  0.15
SO2           -0.34      -0.05    0.00     -0.66      0.15     0.44  0.32  1.00  0.63
TSP            0.07      -0.05   -0.01     -0.44      0.17     0.25  0.15  0.63  1.00
>

The matrix can be examined to look at intercorrelations among the nine variables, but it is very difficult to detect patterns of correlations within the matrix.  Also, when using the cor() function raw Pearson’s coefficients are reported, but significance levels are not.

A correlation matrix with significance can be generated by using the rcorr() function, also found in the Hmisc package. The code is:

#############################################
library(Hmisc)
rcorr(as.matrix(milan.mort, type=”pearson”))
###################################################

The output is:

> rcorr(as.matrix(milan.mort, type="pearson"))
           day.num day.of.week holiday mean.temp rel.humid tot.mort resp.mort   SO2   TSP
day.num       1.00       0.00    0.01      0.02      0.12    -0.28  -0.22 -0.34  0.07
day.of.week   0.00        1.00    0.00      0.00      0.00    -0.05 -0.03 -0.05 -0.05
holiday       0.01        0.00    1.00     -0.07      0.01     0.00 -0.01  0.00 -0.01
mean.temp     0.02        0.00   -0.07      1.00     -0.25    -0.43 -0.26 -0.66 -0.44
rel.humid     0.12        0.00    0.01     -0.25      1.00     0.01 -0.03  0.15  0.17
tot.mort     -0.28       -0.05    0.00     -0.43      0.01     1.00  0.47  0.44  0.25
resp.mort    -0.22       -0.03   -0.01     -0.26     -0.03     0.47  1.00  0.32  0.15
SO2          -0.34       -0.05    0.00     -0.66      0.15     0.44  0.32  1.00  0.63
TSP           0.07       -0.05   -0.01     -0.44      0.17     0.25  0.15  0.63  1.00

n= 3652 


P
            day.num day.of.week holiday mean.temp rel.humid tot.mort resp.mort SO2    TSP   
day.num             0.9771     0.5349   0.2220    0.0000    0.0000  0.0000  0.0000
day.of.week 0.9771              0.7632  0.8727    0.8670    0.0045  0.1175   0.0061
holiday     0.5349  0.7632              0.0000    0.4648    0.8506  0.6115    0.7793 0.4108
mean.temp   0.2220  0.8727      0.0000            0.0000    0.0000  0.0000    0.0000 0.0000
rel.humid   0.0000  0.8670      0.4648  0.0000              0.3661  0.1096    0.0000 0.0000
tot.mort    0.0000  0.0045      0.8506  0.0000    0.3661            0.0000    0.0000 0.0000
resp.mort   0.0000  0.1175      0.6115  0.0000    0.1096    0.0000            0.0000 0.0000
SO2         0.0000  0.0024      0.7793  0.0000    0.0000    0.0000  0.0000           0.0000
TSP         0.0000  0.0061      0.4108  0.0000    0.0000    0.0000  0.0000    0.0000       
>

In a future tutorial I will discuss using significance levels and correlation strengths as methods of reducing complexity in very large correlation network structures.

The recently released package qgraph () provides a number of interesting functions that are useful in visualizing complex inter-relationships among a large number of variables. To quote from the CRAN documentation file qraph() “Can be used to visualize data networks as well as provides an interface for visualizing weighted graphical models.” (see CRAN documentation for ‘qgraph” version 1.4.2. See also http://sachaepskamp.com/qgraph).

The qgraph() function has a variety of options that can be used to produce specific types of graphical representations. In this first tutorial segment I will use the milan.mort dataset and the most basic qgraph functions to produce a visual graphic network of intercorrelations among the 9 variables in the dataset.

The code is as follows:

###################################################
library(qgraph)
#use cor function to create a correlation matrix with milan.mort dataset
#and put into cormat variable
###################################################
cormat=cor(milan.mort)  #correlation matrix generated
###################################################
###################################################
#now plot a graph of the correlation matrix
###################################################
qgraph(cormat, shape=”circle”, posCol=”darkgreen”, negCol=”darkred”, layout=”groups”, vsize=10)
###################################################

This code produces the following correlation network:

The correlation network provides a very useful visual picture of the intercorrelations as well as positive and negative correlations. The relative thickness and color density of the bands indicates strength of Pearson’s r and the color of each band indicates a positive or negative correlation – red for negative and green for positive.

By changing the “layout=” option from “groups” to “spring” a slightly different perspective can be seen. The code is:

########################################################
#Code to produce alternative correlation network:
#######################################################
library(qgraph)
#use cor function to create a correlation matrix with milan.mort dataset
#and put into cormat variable
##############################################################
cormat=cor(milan.mort) #correlation matrix generated
##############################################################
###############################################################
#now plot a circle graph of the correlation matrix
##########################################################
qgraph(cormat, shape=”circle”, posCol=”darkgreen”, negCol=”darkred”, layout=”spring”, vsize=10)
###############################################################

The graph produced is below:

Once again the intercorrelations, strength of r and positive and negative correlations can be easily identified. There are many more options, types of graph and procedures for analysis that can be accomplished with the qgraph() package. In future tutorials I will discuss some of these.

Advertisements

Using R: Random Sample Selection and One-Way ANOVA


A tutorial by Douglas M. Wiig

In the previous tutorial we looked at the hypothesis that one’s outlook on life is influenced by the amount of education attained. Using the GSS 2014 data file we looked at the education variable ‘educ’, and the outlook on life variable ‘life’, a measure of outlook on life as ‘DULL’, ‘ROUTINE’, or ‘EXCITING.’ We selected a subset for each response category and found that there appeared to be
differences among the mean level of education measured in years for each of the categories of outlook on life. To further examine this we will first randomly select a sample from the data file, look at the
mean education for each category of outlook on life, and evaluate the means using simple one-way ANOVA.

To randomly select a sample from a population of values we can use the sample() function. There are a number of options and variations of the function that are beyond the scope of this tutorial. Since the
variable ‘educ’ is measured in years we can use the sample.int() function which is designed for use with integer values. The general format of the function is:

sample.int(n, size = n, replace = FALSE)

where: n = the size of population the sample is from
size = the size of the sample
replace = FALSE if sampling without replacement; TRUE if sampling with replacement

For this example I will select a sample of n=500 without replacement from the data file containing a total of 2538 cases. The sample data is loaded into a data matrix as it is selected. This will be
accomplished in two steps. In the first step we will load the sample.int() function with the values to use for selecting the sample and put the vector in ‘randsamp2. The code is:

randsamp2 <- sample.int(2538, size=500, replace=FALSE)

To select the sample make sure that make sure that the GSS2014 data file is loaded into the R environment. I previously loaded the data file into a data frame ‘gss14.’  To select the sample and load
it into a data frame ‘randgss2’ the code is:

randgss2 <- gss14[randsamp2,]

Once the sample has been generated we can look at the mean years of education for each of the three responses for outlook on life. We do this by selecting a subset for each response. Use the following
code:

###################################################
#look at educ means by life by selecting 3 subsets from randgss2
###################################################
life12 <- subset(randgss2, life == “DULL”, select=educ)
life22 <- subset(randgss2, life == “ROUTINE”, select=educ)
life32 <- subset(randgss2, life == “EXCITING”, select=educ)

Now run summary statistics for each subset to look at the means:

summary(life12)
summary(life22)
summary(life32)

We can now see that the means are as follows:

life12 = 13.0
life22 = 13.29
life 32 = 14.51

and we can generate a summary visual of the differences among the three subsets by doing a simple boxplot using:

###################################################
# do boxplots of the subsets to visualize differences
#boxplot using educ and life variables from the ‘randgss2’ data #frame
###################################################
boxplot(randgss2$educ ~ randgss2$life, main=”Education and View on Life n = 500″, xlab=”View of Life”,ylab=”Years of Education”)

The following graph will result:

Rplotrandgss2

As can be seen above there does appear to be a difference among these means, particularly for those who see life as ‘DULL.’ To see if these differences are significant an ANOVA will be run using the
simple one way ANOVA function aov(). The basic function is:
aov(formula, data = NULL)For our example we use:

model2 <- aov(educ ~ life, data=randgss2)

which analyzes the mean education by category of outlook on life using the randgss2 sample of n=500. The results are stored in ‘model2.’ The output from this operation is shown using the summary() function. This produces the following output:

summary(model2)

Df Sum Sq Mean Sq F value Pr(>F) life
2  171.5    85.77     10.42 4.08e-05 ***
Residuals  332     2732.2  8.23

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
165 observations deleted due to missingness

This output shows that at least one of the means differs significantly from the others. To test this difference further we can use a pair-wise comparison of means to see which means differ significantly
from each other. There are several options available. We will use a basic Tukey HSD comparison. This is accomplished using:

##################################################
#run HSD on sample
TukeyHSD(model2)
##################################################

producing the following output:

Tukey multiple comparisons of means
95% family-wise confidence level

Fit: aov(formula = educ ~ life, data = randgss2, projections = TRUE)

$life                               diff                     lwr                upr                    p adj
ROUTINE-EXCITING -1.074404 -1.833686 -0.31512199  .0027590
DULL-EXCITING          -2.729412 -4.447386 -1.01143754  .0006325
DULL-ROUTINE           -1.655008 -3.384551 0.07453525 00640910

By looking at the p value for each comparison it can be seen that both the ROUTINE-EXCITING and DULL-EXCITING means differ significantly at p ≤ .05

I might point out that if a researcher was using the GSS 2014 data file as we used here there would need to be more data preparation prior to running any analysis. For example, there is a fair amount of
missing data as indicated by NA in the raw data file. The missing data would need to be handled in some way. R has numerous functions and packages that can assist in resolving missing data issues of
various types, but a discussion of these is a subject for a future tutorial.
8/7/15                   Douglas M. Wiig                    http://dmwiig.net

Tutorial: Using R to Analyze NORC GSS Social Science Data, Part Six, R and ANOVA


Tutorial: Using R to Analyze NORC GSS Social Science Data, Part Six, R and ANOVA

A tutorial by Douglas M. Wiig

As discussed in previous segments of this tutorial, for anyone interested in researching social science questions there is a wealth of survey data available through the National Opinion Research Center (NORC) and its associated research universities. The Center has been conducting a national survey each year since 1972 and has compiled a massive database of data from these surveys. Most if not all of these data files can be accessed and downloaded without charge. I have been working with the 2014 edition of the data and for all part of this tutorial will use the GSS2014 data file that is available for download on the Center’s web site. (See the NORC main website at http://www.norc.org/Research/Projects/Pages/general-social-survey.aspx and at http://www3.norc.org/GSS+Website ).

Accessing and loading the NORC GSS2014 data set was discussed in part one of this tutorial. Refer to it if you need specific information on downloading the data set in STATA or SPSS format.  In this segment we will use the subset function to select a desired set of cases from all of the cases in the data file that meet certain criteria.  As indicated in my previous tutorial the GSS2014 data set contains a
total of 2588 cases and 866 variables.

Before starting this segment of the tutorial be sure that the foreign
package is installed and loaded into your R session. As I have indicated in previous tutorials, use of an IDE such as R Studio greatly facilitates entering and debugging R code when doing research such as is discussed in my tutorials.  

Import the GSS 2014 data file in SPSS format and load it into the data frame ‘gss14’ using:
#########################################################import  GSS2014 file in SPSS .sav format
#uses foreign package
########################################################require(foreign)
gss14<- read.spss("/path to your location/GSS2014.sav",                      use.value.labels=TRUE,max.value.labels=Inf, to.data.frame=TRUE)
#################################################

In this tutorial the analysis of this sample will focus on examining the hypothesis “An individual’s outlook on life is influenced by the amount of education the person has attained.” The GSS variables ‘educ’, education in number of years and ‘life’, whether the respondent rated life DULL, ROUTINE, or EXCITING. A simple approach to testing this hypothesis is to compare mean levels of education for each of the three categories of response. I will do this analysis in two stages. In the first stage I will use techniques discussed in a previous tutorial to select a subset of each response category and display the mean education level for each of the three categories.

The three subsets life1, life2, and life3 are generated using the following code:

###################################################
# create 3 subsets from gss14 view on life by years of education
##################################################

life1 <- subset(gss14, life == “DULL”, select=educ)

life2 <- subset(gss14, life ==”ROUTINE”, select=educ)

life3 <- subset(gss14, life == “EXCITING”, select=educ)

###################################################

#The three means of the subsets are displayed using the code:

# run summary statistics for each subgroup

##################################################

summary(life1)

summary(life2)

summary(life3)

###################################################

resulting the following output:
educ
Min. : 0.00
1st Qu.:10.00
Median :12.00
Mean :11.78
3rd Qu.:13.00
Max. :20.00

summary(life2)
educ
Min. : 0.00
1st Qu.:12.00
Median :13.00
Mean :13.22
3rd Qu.:16.00
Max. :20.00
NA’s :1

summary(life3)
educ
Min. : 2.00
1st Qu.:12.00
Median :14.00
Mean :14.31
3rd Qu.:16.00
Max. :20.00

As is seen above there does appear to be a difference among mean years of education and the corresponding outlook on life. In order to examine whether or not these observed differences are not due to chance a simple one-way Analysis of Variance can be generated.

In the next tutorial I will discuss performing the ANOVA and using pair-wise comparisons to determine which if any means are different.

R Tutorial: Using R to Analyze the NORC GSS2014 Database, Selecting Subsets and Comparing Means Using Student’s t Test


R Tutorial Part Three: Selecting Subsets and Comparing Means Using an Independent Sample t Test

A tutorial by Douglas M. Wiig

As discussed in previous segments of this tutorial, for anyone interested in researching social science questions there is a wealth of survey data available through the National Opinion Research Center (NORC) and its associated research universities. The Center has been conducting a national survey each year since 1972 and has compiled a massive database of data from these surveys. Most if not all of these data files can be accessed and downloaded without charge. I have been working with the 2014 edition of the data and for all part of this tutorial will use the GSS2014 data file that is available for download on the Center’s web site. (See the NORC main website at http://www.norc.org/Research/Projects/Pages/general-social-survey.aspx and at http://www3.norc.org/GSS+Website ).

Accessing and loading the NORC GSS2014 data set was discussed in part one of this tutorial. Refer to it if you need specific information on downloading the data set in STATA or SPSS format.  In this segment we will use the subset function to select a desired set of cases from all of the cases in the data file that meet certain criteria.  As indicated in my previous tutorial the GSS2014 data set contains a total of 2588 cases and 866 variables.
Before starting this segment of the tutorial be sure that the foreign package is installed and loaded into your R session.  Import the GSS 2014 data file and load it into the data frame ‘Dataset’ using:

########################################################
#import GSS2014 file in SPSS .sav format
#uses foreign package
########################################################
require(foreign)
Dataset <- read.spss("/path to your location/GSS2014.sav", 
                     use.value.labels=TRUE, max.value.labels=Inf, to.data.frame=TRUE)

###########################################################

In the previous segment of this tutorial we started to investigate whether or not an individual’s education had an effect on their response to a NORC survey item dealing with abortion. The item asked respondents to either ‘AGREE’ or ‘DISAGREE’ with the statement ‘A women should be allowed to obtain an abortion under any circumstances.’ We selected a subset of all of the respondents who answered ‘AGREE’ and a second subset of all the respondents who answered ‘DISAGREE’ using the following code:

##############################################

#select subset from Dataset and write to data frame SS1

###################################################
SS1 <- subset(Dataset, abany == "YES", select=educ)

View(SS1)

#######################################################

######################################################
#select subset from Dataset and write to data frame SS2
######################################################
SS2 <- subset(Dataset, abany == "NO", select=educ)
View(SS2)

A mean number of years of education can be calculated for each of the subsets using the following:

#calculate descriptive statistics for SS1 and SS2

####################################################

summary(SS1)

summary(SS2)

####################################################

Output from the above for SS1 is:

> summary(SS1)

educ

Min. : 0.0

1st Qu.:12.0

Median :15.0

Mean :14.6

3rd Qu.:16.0

Max. :20.0

Output for SS2 is:

> summary(SS2)

educ

Min. : 0.00

1st Qu.:12.00

Median :12.00

Mean :12.93

3rd Qu.:15.00

Max. :20.00

NA’s :1

As seen above there is a difference in mean years of education for the two subsets. We can use a two independent sample t test to determine whether or not the difference is large enough to not be due to chance.

In this tutorial I will use the Student’s t test function t.test that is found in the stats package. The function is used in the following form:

t.test =(x,y, alternative = c(“two.sided”, “less”, “greater”), mu=0, paired = FALSE, var.equal = FALSE, conf.level = .95)

where x and y = numeric vectors of data values

alternative = specification of a one-tailed or two-tailed test

mu = 0 specification that true difference between means is zero

paired = FALSE specification of a two independent sample test; if TRUE a paired samples test will be used

var.equal = specification of equal variances of the two samples; if TRUE the pooled variance is used otherwise a Welsh approximation of degrees of freedom is used

conf.level = confidence level of the interval

For further information see the documentation in CRAN help files for the function t.test().

Using the vectors selected from the dataset SS1, and SS2 the t test is performed using:

###########################################################

#perform a t test to compare sample means

#########################################################

t.test(SS1,SS2, alternative = c(“two.sided”), mu=0, paired=FALSE, var.equal = TRUE, conf.level = .95)

###########################################################

Resulting in output of:

        Two Sample t-test

data:  SS1 and SS2 
t = 11.1356, df = 1650, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
 1.369673 1.955333 
sample estimates:
mean of x mean of y 
 14.59517  12.93267 

We can see that the difference between the mean years of education for the ‘YES’ and the ‘NO’ samples is significant at an alpha level of p=.05. Subsets can also be used to compare means involving more than two samples and using simple one-way Analysis of Variance. This will be covered in the next part of the tutorial.