Getting started with R

R is easy to use. To compute 2+2, type 2+2 at the command prompt and press Enter.

2+2

## [1] 4

To find the mean of a set of numbers, form a vector with the c() command and pass it to the mean function, as below.

The mean is simply the arithmetic mean: \[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \] where $x_1, \ldots, x_n$ are the observations and $n$ is the sample size.

mean(c(38, 100, 64, 43, 63, 59, 107, 52, 86, 77))

## [1] 68.9
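
As a quick check of the formula, the same value can be obtained by dividing the sum of the observations by their count. This is a minimal sketch; the names x, n and manual_mean are chosen here purely for illustration.

```r
# The numbers from the example above (x, n and manual_mean are illustrative names)
x <- c(38, 100, 64, 43, 63, 59, 107, 52, 86, 77)
n <- length(x)             # sample size
manual_mean <- sum(x) / n  # (1/n) times the sum of the x_i
manual_mean                # agrees with mean(x)
```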

Data can be read in various ways, for example with read.csv for a comma-separated values file or read.table for a tab-delimited data file. For this to work, we first need to tell R which directory is our current working directory. We can use the menus or set the working directory as:

setwd("C:/Users/sks/Dropbox/sks/math1024/rmarkdown") # Change this
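
Before reading any file it can help to confirm where R is looking. The sketch below is one way to do that. It switches to R's temporary directory only so that it runs on any machine, and the name old_dir is an illustrative choice.

```r
old_dir <- getwd()  # remember the current working directory
setwd(tempdir())    # switch to a directory that is guaranteed to exist
getwd()             # prints the directory R now reads from and writes to
list.files()        # lists the files R can see there
setwd(old_dir)      # restore the original working directory
```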

Now we can read data files as follows. We first read the fast food data set. Typing ffood on the last line prints the contents of the object ffood.

ffood <- read.csv("servicetime.csv", header = TRUE) # csv stands for comma-separated values.
# header = TRUE says that the first row of the file contains the column names.
ffood

##     AM PM
## 1   38 45
## 2  100 62
## 3   64 52
## 4   43 72
## 5   63 81
## 6   59 88
## 7  107 64
## 8   52 75
## 9   86 59
## 10  77 70
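
If servicetime.csv is not at hand, the same call can be tried on a few lines of text, since read.csv also accepts its input through the text argument. This is a minimal sketch using the first three rows printed above; the names csv_text and demo are illustrative.

```r
# First three rows of the data above, written as CSV text (illustrative names)
csv_text <- "AM,PM\n38,45\n100,62\n64,52"
demo <- read.csv(text = csv_text, header = TRUE)  # first line supplies the column names
demo       # prints the small data frame
str(demo)  # shows the column types
dim(demo)  # number of rows and columns
```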

For large data sets, we can use the head or tail command to see the top or bottom part of the data set (the first or last six rows by default).

head(ffood) # Gets the head of the data set

##    AM PM
## 1  38 45
## 2 100 62
## 3  64 52
## 4  43 72
## 5  63 81
## 6  59 88

tail(ffood) # Gets the tail

##     AM PM
## 5   63 81
## 6   59 88
## 7  107 64
## 8   52 75
## 9   86 59
## 10  77 70
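
Both commands take an optional n argument when a different number of rows is wanted. In the sketch below the data are re-entered from the printed output under the illustrative name ffood_demo, so that it runs without the file.

```r
# Rebuild the service time data from the printed output (ffood_demo is an illustrative name)
ffood_demo <- data.frame(
  AM = c(38, 100, 64, 43, 63, 59, 107, 52, 86, 77),
  PM = c(45, 62, 52, 72, 81, 88, 64, 75, 59, 70)
)
head(ffood_demo, n = 3)  # first three rows only
tail(ffood_demo, n = 2)  # last two rows only
nrow(ffood_demo)         # total number of rows, for comparison
```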

The summary command can be used to produce columnwise summaries.

summary(ffood)

##        AM               PM       
##  Min.   : 38.00   Min.   :45.00  
##  1st Qu.: 53.75   1st Qu.:59.75  
##  Median : 63.50   Median :67.00  
##  Mean   : 68.90   Mean   :66.80  
##  3rd Qu.: 83.75   3rd Qu.:74.25  
##  Max.   :107.00   Max.   :88.00
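
Each figure in the summary can also be requested on its own, which is handy when only one statistic is needed. This is a sketch, with the data re-entered from the printed output under the illustrative name ffood_demo.

```r
ffood_demo <- data.frame(
  AM = c(38, 100, 64, 43, 63, 59, 107, 52, 86, 77),
  PM = c(45, 62, 52, 72, 81, 88, 64, 75, 59, 70)
)
colMeans(ffood_demo)                            # column means, as in the Mean row
sapply(ffood_demo, median)                      # column medians
quantile(ffood_demo$AM, probs = c(0.25, 0.75))  # first and third quartiles of AM
range(ffood_demo$PM)                            # minimum and maximum of PM
```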

To see which columns are present, we use the names command. We access a column with the $ symbol: ffood$AM gives only the AM column of ffood. We can also use square brackets [], as shown.

names(ffood)  ## Prints the column names of the argument data frame.

## [1] "AM" "PM"

ffood$AM # Prints the AM values

##  [1]  38 100  64  43  63  59 107  52  86  77

ffood[,1] # Gets the first column and all rows.

##  [1]  38 100  64  43  63  59 107  52  86  77

ffood[1:2, ] ## Gets the first two rows and all columns.

##    AM PM
## 1  38 45
## 2 100 62

ffood[1, 2] ## Gets the first row second column entry

## [1] 45

Can you see what values the commands are getting? x[,1] gets the first column and all rows, x[1:2, ] gets the first two rows but all the columns, and x[1,2] gets the entry in the first row and second column of x.
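
Square brackets also accept column names and logical conditions, not only positions. The sketch below shows a few such variations on data re-entered from the printed output; ffood_demo is an illustrative name.

```r
ffood_demo <- data.frame(
  AM = c(38, 100, 64, 43, 63, 59, 107, 52, 86, 77),
  PM = c(45, 62, 52, 72, 81, 88, 64, 75, 59, 70)
)
ffood_demo[, "PM"]                # column selected by name
ffood_demo[["PM"]]                # the same column, extracted list-style
ffood_demo[ffood_demo$AM > 80, ]  # rows where the AM value exceeds 80
ffood_demo[c(1, 10), "AM"]        # AM values in the first and last rows
```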

Now we read the computer failure data and calculate the data summaries.

cfail <- scan("compfail.txt")
summary(cfail)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    1.00    3.00    3.75    5.00   17.00

var(cfail)

## [1] 11.43204
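
The var function returns the sample variance, which uses the divisor $n-1$: \[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2. \] Since compfail.txt is not reproduced here, the sketch below checks the formula on the AM service times from earlier as stand-in data. The names x, n and manual_var are illustrative.

```r
x <- c(38, 100, 64, 43, 63, 59, 107, 52, 86, 77)  # AM values, used as stand-in data
n <- length(x)
manual_var <- sum((x - mean(x))^2) / (n - 1)      # squared deviations divided by n - 1
manual_var
all.equal(manual_var, var(x))                     # TRUE when the two agree
```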

We can tabulate the data with the table command and then draw a frequency histogram to investigate the shape of the data.

table(cfail)

## cfail
##  0  1  2  3  4  5  6  7  8  9 10 11 12 13 17 
## 12 16 21 12 11  8  7  2  4  2  3  2  2  1  1

hist(cfail)
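
The counts printed by table are enough to rebuild the data values (though not their original order), which gives a way to read off the most frequent value in code. This is a sketch, assuming the counts shown above are complete; the names values, counts, cfail_demo and tab are illustrative.

```r
values <- c(0:13, 17)                                          # distinct values in the table
counts <- c(12, 16, 21, 12, 11, 8, 7, 2, 4, 2, 3, 2, 2, 1, 1)  # their frequencies
cfail_demo <- rep(values, times = counts)  # one entry per recorded observation
tab <- table(cfail_demo)
names(tab)[which.max(tab)]                 # most frequent value (the mode)
length(cfail_demo)                         # number of observations
mean(cfail_demo)                           # can be compared with the summary above
```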

The table command shows that the mode is at 2 failures per week. The histogram shows a strongly right-skewed distribution. We can draw the boxplot of the data by issuing the boxplot command.

boxplot(cfail)
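
The numbers behind a boxplot can be inspected with boxplot.stats, which returns the five values that are drawn (lower whisker, lower hinge, median, upper hinge, upper whisker) together with the points flagged as outliers. The sketch below uses data rebuilt from the frequency table above under the illustrative name cfail_demo, and assumes the default multiplier of 1.5.

```r
cfail_demo <- rep(c(0:13, 17),
                  times = c(12, 16, 21, 12, 11, 8, 7, 2, 4, 2, 3, 2, 2, 1, 1))
bs <- boxplot.stats(cfail_demo, coef = 1.5)
bs$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out    # observations that would be drawn as circles
box_len <- bs$stats[4] - bs$stats[2]        # length of the box
upper_limit <- bs$stats[4] + 1.5 * box_len  # the upper whisker cannot extend past this
all(bs$out > upper_limit)                   # flagged points lie beyond the limit
```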

The median of the data is the thick black line in the middle, and the first and third quartiles are the edges of the box. The two short horizontal lines at the top and bottom, joined to the box by dashed lines (the whiskers), mark the most extreme observations that lie within 1.5 times the inter-quartile range of the box. Suspected outliers are shown as circles beyond the whiskers. The boxplot also shows positive (right) skewness. Next we read and explore the other data sets.

wgain <- read.table("wtgain.txt", header = TRUE)
head(wgain)

##   student  initial    final
## 1       1 77.56423 76.20346
## 2       2 49.89512 50.34871
## 3       3 60.78133 61.68851
## 4       4 52.16308 53.97745
## 5       5 68.03880 70.30676
## 6       6 47.17357 48.08075

summary(wgain)

##     student         initial          final       
##  Min.   : 1.00   Min.   :42.64   Min.   : 43.54  
##  1st Qu.:17.75   1st Qu.:53.86   1st Qu.: 54.32  
##  Median :34.50   Median :60.78   Median : 60.78  
##  Mean   :34.50   Mean   :61.72   Mean   : 62.59  
##  3rd Qu.:51.25   3rd Qu.:68.04   3rd Qu.: 68.49  
##  Max.   :68.00   Max.   :99.79   Max.   :101.60

summary(wgain$final-wgain$initial)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -2.2680  0.4536  0.9072  0.8672  1.3608  3.6287

We can also store the difference in a variable and analyse it.

gain <- wgain$final - wgain$initial
summary(gain)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -2.2680  0.4536  0.9072  0.8672  1.3608  3.6287

hist(gain)

boxplot(gain)

We see a small positive average weight gain (mean about 0.87 kilograms), and the histogram indicates a fairly symmetric distribution of the differences. The differences range from -2.27 to 3.63 kilograms.
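
Because the two weights are paired by student, the mean of the differences equals the difference of the two means, which is why 62.59 - 61.72 from the earlier summary matches the mean gain of about 0.87. The sketch below illustrates this using only the six rows printed by head(wgain); the names initial, final and gain6 are illustrative.

```r
# The six rows shown by head(wgain), re-entered by hand
initial <- c(77.56423, 49.89512, 60.78133, 52.16308, 68.03880, 47.17357)
final   <- c(76.20346, 50.34871, 61.68851, 53.97745, 70.30676, 48.08075)
gain6 <- final - initial     # per-student change for these six students
mean(gain6)                  # mean of the differences
mean(final) - mean(initial)  # difference of the means
all.equal(mean(gain6), mean(final) - mean(initial))
```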

We now read the age guessing data set.

guess <- read.table("guess.txt", header = TRUE, sep = ",")
guess

##    group P1 P2 P3 P4 P5  P6  P7 P8  P9 P10 mae gsize  sex
## 1      1 14 -6  5 19  0  -5  -9 -1  -7  -7 7.3     3   F 
## 2      2  8  0  5  0  5  -1  -8 -1  -8   0 3.6     4   F 
## 3      3  6 -5  6  3  1  -8 -18  1  -9  -6 6.3     4   M 
## 4      4 10 -7  3  3  2  -2 -13  6  -7  -7 6.0     2  X  
## 5      5 11 -3  4  2 -1   0 -17  0 -14   3 5.5     2   F 
## 6      6 13 -3  3  5 -2  -8  -9 -1  -7   0 5.1     3   F 
## 7      7  9 -4  3  0  4 -13 -15  6  -7   5 6.6     4   M 
## 8      8 11  0  2  8  3   3 -15  1  -7   0 5.0     4   M 
## 9      9  6 -2  2  8  3  -8  -7 -1   1  -2 4.0     4   F 
## 10    10 11  2  3 11  1  -8 -14 -2  -1   0 5.3     4  F

summary(guess)

##      group             P1              P2              P3      
##  Min.   : 1.00   Min.   : 6.00   Min.   :-7.00   Min.   :2.00  
##  1st Qu.: 3.25   1st Qu.: 8.25   1st Qu.:-4.75   1st Qu.:3.00  
##  Median : 5.50   Median :10.50   Median :-3.00   Median :3.00  
##  Mean   : 5.50   Mean   : 9.90   Mean   :-2.80   Mean   :3.60  
##  3rd Qu.: 7.75   3rd Qu.:11.00   3rd Qu.:-0.50   3rd Qu.:4.75  
##  Max.   :10.00   Max.   :14.00   Max.   : 2.00   Max.   :6.00  
##        P4              P5              P6               P7       
##  Min.   : 0.00   Min.   :-2.00   Min.   :-13.00   Min.   :-18.0  
##  1st Qu.: 2.25   1st Qu.: 0.25   1st Qu.: -8.00   1st Qu.:-15.0  
##  Median : 4.00   Median : 1.50   Median : -6.50   Median :-13.5  
##  Mean   : 5.90   Mean   : 1.60   Mean   : -5.00   Mean   :-12.5  
##  3rd Qu.: 8.00   3rd Qu.: 3.00   3rd Qu.: -1.25   3rd Qu.: -9.0  
##  Max.   :19.00   Max.   : 5.00   Max.   :  3.00   Max.   : -7.0  
##        P8             P9              P10            mae       
##  Min.   :-2.0   Min.   :-14.00   Min.   :-7.0   Min.   :3.600  
##  1st Qu.:-1.0   1st Qu.: -7.75   1st Qu.:-5.0   1st Qu.:5.025  
##  Median :-0.5   Median : -7.00   Median : 0.0   Median :5.400  
##  Mean   : 0.8   Mean   : -6.60   Mean   :-1.4   Mean   :5.470  
##  3rd Qu.: 1.0   3rd Qu.: -7.00   3rd Qu.: 0.0   3rd Qu.:6.225  
##  Max.   : 6.0   Max.   :  1.00   Max.   : 5.0   Max.   :7.300  
##      gsize       sex   
##  Min.   :2.0    F  :5  
##  1st Qu.:3.0    F  :1  
##  Median :4.0    M  :3  
##  Mean   :3.4    X  :1  
##  3rd Qu.:4.0           
##  Max.   :4.0
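
The sex column in the summary lists F twice, which suggests that some entries in the file carry stray spaces and are therefore treated as different values. One possible remedy, sketched below on made-up text rather than on guess.txt itself, is the strip.white argument of read.table. The names raw, messy and clean are illustrative.

```r
# Made-up input with inconsistent spacing around the sex codes
raw <- "group,sex\n1, F \n2, F\n3, M \n4,X \n"
messy <- read.table(text = raw, header = TRUE, sep = ",")
clean <- read.table(text = raw, header = TRUE, sep = ",", strip.white = TRUE)
table(messy$sex)  # the two differently spaced F entries are counted separately
table(clean$sex)  # after stripping, both are counted as "F"
```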

Here the mae column is the mean absolute error. How can we recalculate this column to check that our data have been read correctly? There are many ways to do it; the following chunk of code is one of them.

A <- guess[, 2:11] ## Stores only the score columns, 2 to 11.
A <- abs(A)  ## Calculates the absolute value of each entry.
newmae <- apply(A, 1, FUN=mean) ## Gets the row means, as instructed by the value 1.
# Use ?apply to see what it does
A  ## prints A on screen

##    P1 P2 P3 P4 P5 P6 P7 P8 P9 P10
## 1  14  6  5 19  0  5  9  1  7   7
## 2   8  0  5  0  5  1  8  1  8   0
## 3   6  5  6  3  1  8 18  1  9   6
## 4  10  7  3  3  2  2 13  6  7   7
## 5  11  3  4  2  1  0 17  0 14   3
## 6  13  3  3  5  2  8  9  1  7   0
## 7   9  4  3  0  4 13 15  6  7   5
## 8  11  0  2  8  3  3 15  1  7   0
## 9   6  2  2  8  3  8  7  1  1   2
## 10 11  2  3 11  1  8 14  2  1   0

newmae ## prints newmae on screen

##  [1] 7.3 3.6 6.3 6.0 5.5 5.1 6.6 5.0 4.0 5.3

newmae - guess$mae # check that all values are zero

##  [1] 0 0 0 0 0 0 0 0 0 0
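
An equivalent route is rowMeans. Because exact equality of decimal numbers can fail through rounding on some inputs, all.equal is a more tolerant check than looking for exact zeros. The sketch below uses the first two rows of the table above; the names scores, recorded_mae and row_mae are illustrative.

```r
scores <- rbind(
  c(14, -6, 5, 19, 0, -5, -9, -1, -7, -7),  # group 1, P1 to P10
  c( 8,  0, 5,  0, 5, -1, -8, -1, -8,  0)   # group 2, P1 to P10
)
recorded_mae <- c(7.3, 3.6)       # mae values printed for groups 1 and 2
row_mae <- rowMeans(abs(scores))  # mean absolute error per row
row_mae
all.equal(row_mae, recorded_mae)  # TRUE if equal up to a small tolerance
```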