Abstract
This Lab will allow you to discover R
and
Rstudio
through several exercices. The first one will be to
continue the running example of the class, plotting more complex data
from the house market Paris. We really encourage you to focus on
exercice 1. Mastering these two questions is already quite complex when
starting R
, it is normal to struggle if you start. Please,
don’t abandon and ask question. If you want to learn more, you can pick
one of the optional exercices below that will help you learn about doing
dashboard or dealing with API and JSON data. Both skills are very useful
when willing to communicate about your data or when you want to build
your own data base.
data.gouv.fr
This plot is a bit complicated, in particular the confidence intervals.
The different steps are:
stack the dataframes with rbind
filter a subtable without extreme sales and without the row
having missing values in valeur_fonciere
create a colunm year, for this you can use the package
lubridate
, but you can also do it alone with a package
extracting the first four characters of a string.
then use the function group_by
, to group by year the
observations and compute the mean and the standard deviation
then plot the average over years
the 95% confidence intervals are obtained from the standard deviation
Again, don’t worry if you don’t succeed to finish everything or understand all the lines. Please note your questions for the next sessions or send an email.
knitr::opts_chunk$set(echo = TRUE)
# We could have done a loop to avoid copy paste, but let's keep it like this for the first class
# Be careful to put the right path (here, I put all the raw data in a folder named 'data')
df_2016 <- read.csv("./data/2016.csv")
df_2017 <- read.csv("./data/2017.csv")
df_2018 <- read.csv("./data/2018.csv")
df_2019 <- read.csv("./data/2019.csv")
df_2020 <- read.csv("./data/2020.csv")
df_2021 <- read.csv("./data/2021.csv")
df_2022 <- read.csv("./data/2022.csv")
# handle something strange in 2021
df_2021 <- df_2021[, 2:41]
immo <- rbind(df_2016, df_2017, df_2018, df_2019, df_2020, df_2021, df_2022)
# at the beginning of your notebook
VALEUR_FONCIERE_MAX = 2000000
VALEUR_FONCIERE_MIN = 50000
# Libraries used
library(ggplot2)
library(dplyr) # group_by
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lubridate) # function to treat automatically year
## Loading required package: timechange
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
# Data treatment
## Only keep costal code from Paris
df <- immo[immo$code_postal %in% c( paste0("7500",1:9),paste0("750",10:20)) ,]
## Define which columns we want to keep
useful_columns <- c("date_mutation",
"valeur_fonciere",
"nature_mutation",
"code_postal",
"nombre_pieces_principales",
#"surface_reelle_bati",
"lot1_surface_carrez",
"type_local",
"nom_commune",
"lot2_surface_carrez",
"adresse_numero",
"adresse_nom_voie")
## take subtable
immo <- immo[immo$valeur_fonciere < VALEUR_FONCIERE_MAX,]
immo <- immo[immo$valeur_fonciere > VALEUR_FONCIERE_MIN,]
## create a column year
immo$annee <- year(immo$date_mutation)
## remove duplicated lines
nrow(df)
## [1] 728863
df <- unique(df)
nrow(df)
## [1] 619443
## function from package lubridate to get the year and month
immo$month <- month(immo$date_mutation, label=TRUE, abbr=FALSE)
immo$annee <- as.factor(immo$annee)
# An alternative could be to use `gsub`
# Only keep housing
immo <- immo[immo$type_local %in% c("Appartement", "Maison"),]
# Only keep transactions
immo <- immo[immo$nature_mutation == "Vente",]
# Keep only observations for which valeur fonciere is observed
immo <- immo[!is.na(immo$valeur_fonciere),] # ! indicates "opposite to" so that we keep only rows for which "valeur_fonciere" is observed
Check how many transaction by year. Does it seem reasonnable?
table(immo$annee)
##
## 2016 2017 2018 2019 2020 2021 2022
## 64927 73959 70247 71531 58540 33657 17979
summary <- immo %>%
group_by(annee) %>%
summarise(mean = mean(valeur_fonciere),
sd = sd(valeur_fonciere))
summary$se = summary$sd / sqrt(nrow(immo))
ggplot(summary, aes(x = annee, y = mean)) +
geom_point()+
geom_errorbar(aes(ymin=mean-1.96*se, ymax=mean+1.96*se), width=.2,
position=position_dodge(0.05)) +
theme_bw() +
ylab("Valeurs moyennes des ventes")
💡 In reality, those confidence intervals do not really have an interest as we have access to the whole data set. A confidence intervals reflects the uncertainty, and more precisely the fact that you have sampled a part of an infinite (or way bigger) population. This is clearly not the case here as we have access to the whole transactions. But this is for you to practice.
First, we need to do other assumptions
## only keep element with surface
immo <- immo[!is.na(immo$lot1_surface_carrez),]
## Create price per squarre meters: assumptions here on the fact that the lot 1 is the principal one.
immo$prix.m2 <- immo$valeur_fonciere / immo$lot1_surface_carrez
ggplot(immo[immo$prix.m2 < 30000 & immo$lot1_surface_carrez < 130,], aes(x = lot1_surface_carrez, y = prix.m2, group = annee, color = annee)) +
geom_smooth() +
theme_classic() +
ylab("Prix au m2") +
xlab("Surface du premier lot (considéré comme le lot principal)")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Then, we can observe the number of transactions per month over year to understand what happened in 2020.
# 2021 is only first semester so we remove it from the plot
immo[immo$annee != 2022,] %>%
group_by(annee, month) %>%
summarise(count = n()) %>%
ggplot(aes(x = month, y = count, color = as.factor(annee), group = as.factor(annee))) +
geom_line(size = 1) +
theme_classic() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
ylab("Nombre de transactions par mois") +
xlab("")
## `summarise()` has grouped output by 'annee'. You can override using the
## `.groups` argument.
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.