Abstract

In this tutorial, you will perform a logistic regression with `R`. This is the first exercise, and we will do it together in class. At the end, you will find an exercise on simple linear regression.

*Credits for this lab*: the book **An Introduction to Statistical Learning: With Applications in R** by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani (in particular for the exercise on logistic regression and the stock market). The exercise on the linear model comes from Imke Mayer’s labs. Thanks to them.

In this part we use the `Smarket` data, which is part of the `ISLR` library. This data set consists of percentage returns for the S&P 500 stock index over 1250 days, from the beginning of 2001 until the end of 2005.

The S&P 500, or simply the S&P, is a stock market index that measures the stock performance of 500 large companies listed on stock exchanges in the United States. It is one of the most commonly followed equity indices. (It is roughly comparable to the French CAC 40.)

The data set therefore contains 1250 observations on the following 9 variables.

- `Year`: the year in which the observation was recorded
- `Lag1`: percentage return for the previous day
- `Lag2`: percentage return for 2 days previous
- `Lag3`: percentage return for 3 days previous
- `Lag4`: percentage return for 4 days previous
- `Lag5`: percentage return for 5 days previous
- `Volume`: the number of shares traded daily (in billions)
- `Today`: the percentage return on the date in question
- `Direction`: a factor with levels `Down` and `Up` indicating whether the market had a negative or positive return on a given day

Load the library `ISLR` and inspect the data set. Do you see a link between the returns? For example, you can look at the correlations. What can you say about the volume of shares traded over the years?

**Solution**

```
library(ISLR)
data(Smarket)
names(Smarket)
```

```
## [1] "Year" "Lag1" "Lag2" "Lag3" "Lag4" "Lag5"
## [7] "Volume" "Today" "Direction"
```
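Beyond the variable names, it can help to check the size and the type of each variable before computing summaries. The following is a minimal sketch, assuming the `ISLR` package is installed; the name `direction_check` is only illustrative and not part of the original lab.

```r
library(ISLR)   # assumption: the ISLR package is installed

# Number of observations (rows) and variables (columns)
dim(Smarket)

# Type of each variable: Direction is a factor, the others are numeric
str(Smarket)

# Cross-tabulate Direction against the sign of today's return
# (direction_check is an illustrative name)
direction_check <- table(Direction = Smarket$Direction,
                         PositiveReturn = Smarket$Today > 0)
direction_check
```

The cross-tabulation is a simple way to see how `Direction` relates to the sign of `Today`.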

`summary(Smarket)`

```
##       Year           Lag1                Lag2                Lag3
##  Min.   :2001   Min.   :-4.922000   Min.   :-4.922000   Min.   :-4.922000
##  1st Qu.:2002   1st Qu.:-0.639500   1st Qu.:-0.639500   1st Qu.:-0.640000
##  Median :2003   Median : 0.039000   Median : 0.039000   Median : 0.038500
##  Mean   :2003   Mean   : 0.003834   Mean   : 0.003919   Mean   : 0.001716
##  3rd Qu.:2004   3rd Qu.: 0.596750   3rd Qu.: 0.596750   3rd Qu.: 0.596750
##  Max.   :2005   Max.   : 5.733000   Max.   : 5.733000   Max.   : 5.733000
##       Lag4                Lag5               Volume           Today
##  Min.   :-4.922000   Min.   :-4.92200   Min.   :0.3561   Min.   :-4.922000
##  1st Qu.:-0.640000   1st Qu.:-0.64000   1st Qu.:1.2574   1st Qu.:-0.639500
##  Median : 0.038500   Median : 0.03850   Median :1.4229   Median : 0.038500
##  Mean   : 0.001636   Mean   : 0.00561   Mean   :1.4783   Mean   : 0.003138
##  3rd Qu.: 0.596750   3rd Qu.: 0.59700   3rd Qu.:1.6417   3rd Qu.: 0.596750
##  Max.   : 5.733000   Max.   : 5.73300   Max.   :3.1525   Max.   : 5.733000
##  Direction
##  Down:602
##  Up  :648
```

`cor(Smarket[,-9])`

```
##              Year         Lag1         Lag2         Lag3         Lag4
## Year   1.00000000  0.029699649  0.030596422  0.033194581  0.035688718
## Lag1   0.02969965  1.000000000 -0.026294328 -0.010803402 -0.002985911
## Lag2   0.03059642 -0.026294328  1.000000000 -0.025896670 -0.010853533
## Lag3   0.03319458 -0.010803402 -0.025896670  1.000000000 -0.024051036
## Lag4   0.03568872 -0.002985911 -0.010853533 -0.024051036  1.000000000
## Lag5   0.02978799 -0.005674606 -0.003557949 -0.018808338 -0.027083641
## Volume 0.53900647  0.040909908 -0.043383215 -0.041823686 -0.048414246
## Today  0.03009523 -0.026155045 -0.010250033 -0.002447647 -0.006899527
##                Lag5      Volume        Today
## Year    0.029787995  0.53900647  0.030095229
## Lag1   -0.005674606  0.04090991 -0.026155045
## Lag2   -0.003557949 -0.04338321 -0.010250033
## Lag3   -0.018808338 -0.04182369 -0.002447647
## Lag4   -0.027083641 -0.04841425 -0.006899527
## Lag5    1.000000000 -0.02200231 -0.034860083
## Volume -0.022002315  1.00000000  0.014591823
## Today  -0.034860083  0.01459182  1.000000000
```

```
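# Why does cor(Smarket) not work with Direction?
# cor() requires numeric columns, and Direction is a factor.
# Recoding Direction as 0/1 makes the full correlation matrix computable:
# tmp <- Smarket
# tmp$Direction <- ifelse(tmp$Direction == "Up", 1, 0)
# tmp$Direction
# cor(tmp)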
```
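The question of why `cor()` does not accept `Direction` can be explored by checking which columns are numeric. The sketch below assumes the `ISLR` package is installed; `numeric_cols` and `cor_numeric` are illustrative names. It selects the numeric columns by type instead of by position, which gives the same columns as `Smarket[,-9]` here.

```r
library(ISLR)   # assumption: the ISLR package is installed

# cor() needs numeric input, so flag the numeric columns first
numeric_cols <- sapply(Smarket, is.numeric)
numeric_cols

# Correlation matrix on the numeric columns only (the factor Direction is dropped)
cor_numeric <- cor(Smarket[, numeric_cols])
round(cor_numeric, 3)
```

Selecting by type avoids hard-coding the column index, which is convenient if the column order ever changes.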

```
# Names of the five lag variables: "Lag1", ..., "Lag5"
lag.vector <- paste0("Lag", 1:5)
# Scatterplot matrix of the lagged returns
pairs(Smarket[, lag.vector])
```
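The scatterplot matrix can be complemented by the correlation between each lag variable and `Today`. This is a small sketch, assuming the `ISLR` package is installed; `lag_names` and `lag_today_cor` are illustrative names.

```r
library(ISLR)   # assumption: the ISLR package is installed

# Names of the five lag variables
lag_names <- paste0("Lag", 1:5)

# Correlation of each lag variable with today's return (named vector of length 5)
lag_today_cor <- sapply(Smarket[, lag_names], cor, y = Smarket$Today)
round(lag_today_cor, 3)
```

These values are the same as the `Today` row of the correlation matrix printed earlier, restricted to the lag columns.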

As one would expect, the correlations between the lag variables and today’s returns are close to zero. In other words, there appears to be little correlation between today’s returns and previous days’ returns. The only substantial correlation is between `Year` and `Volume` (about 0.54). By plotting the data we see that `Volume` is increasing over time. In other words, the average number of shares traded daily increased from 2001 to 2005.

```
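library(ggplot2)
# Plot Volume against the observation index (row names converted to numbers), colored by Direction
ggplot(Smarket, aes(x = as.numeric(row.names(Smarket)), y = Volume, color = Direction)) +
  geom_point() +
  theme_classic() +
  xlab("Index")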
```
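The trend in `Volume` can also be checked numerically by averaging it within each year. This is a minimal sketch, assuming the `ISLR` package is installed; `volume_by_year` is an illustrative name.

```r
library(ISLR)   # assumption: the ISLR package is installed

# Average daily Volume for each year, one row per year
volume_by_year <- aggregate(Volume ~ Year, data = Smarket, FUN = mean)
volume_by_year
```

Comparing the first and last rows of `volume_by_year` gives the change in average volume between 2001 and 2005.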