Lab 1: Visualization

This plot is a bit complicated, in particular the confidence intervals. A good first step is to plot the mean per year. The steps are:

- stack the data frames with `rbind`
- filter to a subtable without extreme sales and without the rows having missing values in `valeur_fonciere`
- create a column for the year; for this you can use the package `lubridate`, but you can also do it by hand with a function extracting the first four characters of a string
- then use the function `group_by` to group the observations by year and compute the mean and the standard deviation
- then plot the average over the years
- the 95% confidence intervals are obtained from the standard deviation: for each year, the mean plus or minus 1.96 times the standard error, where the standard error is the standard deviation divided by the square root of the number of sales in that year

Again, don't worry if you don't manage to finish everything or understand every line. Please note your questions for the next sessions.

```{r setup}
knitr::opts_chunk$set(echo = TRUE)

# We could have used a loop to avoid copy-paste, but let's keep it like this for the first class
# Be careful to have all the data in a folder named data
df_2016 <- read.csv("./data/2016.csv")
df_2017 <- read.csv("./data/2017.csv")
df_2018 <- read.csv("./data/2018.csv")
df_2019 <- read.csv("./data/2019.csv")
df_2020 <- read.csv("./data/2020.csv")
df_2021 <- read.csv("./data/2021.csv")

# stack the yearly tables into a single data frame
immo <- rbind(df_2016, df_2017, df_2018, df_2019, df_2020, df_2021)

# constants, ideally defined at the beginning of your notebook
VALEUR_FONCIERE_MAX <- 2000000
VALEUR_FONCIERE_MIN <- 50000

# Libraries used
library(ggplot2)
library(dplyr)     # group_by, summarise
library(lubridate) # year() and month() extract date components automatically

# create a column year
immo$annee <- year(immo$date_mutation)

# keep only observations for which valeur_fonciere is observed
# `!` means "not", so we keep only the rows where valeur_fonciere is not missing
immo <- immo[!is.na(immo$valeur_fonciere), ]

# take the subtable without extreme sales
immo <- immo[immo$valeur_fonciere < VALEUR_FONCIERE_MAX, ]
immo <- immo[immo$valeur_fonciere > VALEUR_FONCIERE_MIN, ]

# mean, standard deviation and number of sales per year
summary <- immo %>%
  group_by(annee) %>%
  summarise(mean = mean(valeur_fonciere),
            sd = sd(valeur_fonciere),
            n = n())

# the standard error of each yearly mean uses that year's number of sales,
# not the total number of rows in immo
summary$se <- summary$sd / sqrt(summary$n)

ggplot(summary, aes(x = annee, y = mean)) +
  geom_line() +
  geom_point() +
  geom_errorbar(aes(ymin = mean - 1.96 * se, ymax = mean + 1.96 * se),
                width = .2,
                position = position_dodge(0.05)) +
  theme_bw() +
  ylab("Valeurs moyennes des ventes")
```

Then, we can look at the number of transactions per month for each year to understand what happened in 2020.

```{r}
# function from package lubridate
immo$month <- month(immo$date_mutation, label = TRUE, abbr = FALSE)
```

```{r}
# 2021 only covers the first half of the year, so we remove it from the plot
immo[immo$annee != 2021, ] %>%
  group_by(annee, month) %>%
  summarise(count = n()) %>%
  ggplot(aes(x = month, y = count,
             color = as.factor(annee), group = as.factor(annee))) +
  geom_line(size = 1) + # with ggplot2 >= 3.4.0, use linewidth = 1 instead
  theme_classic() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  ylab("Nombre de transactions par mois") +
  xlab("")
```

You can further improve the plots by customizing colors, titles, font sizes, and so on.
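
Coming back to two of the steps listed at the start (creating the year column without `lubridate`, and getting the 95% confidence intervals from the standard deviation), here is a minimal base-R sketch on a tiny made-up table. It assumes `date_mutation` is stored as text in `YYYY-MM-DD` form, so that the first four characters are the year. The data frame `toy`, its values, and the names `mean_by_year`, `sd_by_year`, `n_by_year` and `ci` are invented for illustration and are not part of the lab data.

```{r}
# Toy data, invented for illustration (not taken from the real files)
toy <- data.frame(
  date_mutation   = c("2019-01-15", "2019-06-30", "2020-03-02", "2020-11-20"),
  valeur_fonciere = c(100000, 200000, 150000, 250000)
)

# Year without lubridate: first four characters of the date string
# (assumes dates are text in YYYY-MM-DD form)
toy$annee <- as.integer(substr(toy$date_mutation, 1, 4))

# Per-year mean, standard deviation and number of sales
mean_by_year <- tapply(toy$valeur_fonciere, toy$annee, mean)
sd_by_year   <- tapply(toy$valeur_fonciere, toy$annee, sd)
n_by_year    <- tapply(toy$valeur_fonciere, toy$annee, length)

# Standard error of each yearly mean: sd / sqrt(n), with n counted per year
se <- sd_by_year / sqrt(n_by_year)

# 95% confidence interval (normal approximation): mean +/- 1.96 * se
ci <- data.frame(
  annee = as.integer(names(mean_by_year)),
  mean  = as.numeric(mean_by_year),
  lower = as.numeric(mean_by_year - 1.96 * se),
  upper = as.numeric(mean_by_year + 1.96 * se)
)
print(ci)
```

With only two sales per year the intervals here are very wide, and the 1.96 factor is only a rough approximation for such small groups. When each year contains many sales, `se` shrinks and the intervals narrow.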