Data science

2/3 SSB API and Shiny - data prep

Cleaning data from SSB API

Introduction

This is the second document in our series on how to make interactive figures and tables using R, Shiny apps and the SSB API. The entire project is available at: https://github.com/andreasolden/ssb-api-og-shiny
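
The code below assumes that the packages used throughout the project are already loaded, for example in a setup chunk. As a rough sketch, these are the packages it relies on:

library(here)      #project-relative file paths
library(tidyverse) #dplyr, tidyr, ggplot2, forcats and readr
library(lubridate) #parsing and manipulating dates
library(skimr)     #skim() summary statistics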

Loading the data

df_10594 <- readRDS(here("data/raw", "df_10594.rds")) #registered unemployment
df_07459 <- readRDS(here("data/raw", "df_07459.rds")) #population by region, age and gender
df_06726 <- readRDS(here("data/raw", "df_06726.rds")) #housing transfers: number and total value
df_europe_brent_spot <- readRDS(here("data/raw", "df_europe_brent_spot.rds")) #Europe Brent spot price (oil)

Looking at, diagnosing and cleaning the data

10594 Unemployment

Let us start by taking a look at the data frame, the missing values and the regions:

skim(df_10594) #Show nice summary stats
df_10594[is.na(df_10594$value),] #To locate and show the missing values 
unique(df_10594$region) #To show all the regions
## [1] Oslo municipality Stavanger         Bergen            Trondheim (-2017)
## Levels: Oslo municipality Stavanger Bergen Trondheim (-2017)

The data consists of five variables: region, sex, contents, month and value. All are unordered factors, except value, which is an integer. There are four regions: Oslo municipality, Stavanger, Bergen and Trondheim (-2017). There are also three missing values, all from Trondheim in 2018, even though the region label says -2017. This is due to the previously mentioned issue of Trondheim being classified as both municipality 1601 and 5001, and it is presumably a mistake by SSB. By looking at the relevant table in the Statbank, https://www.ssb.no/statbank/table/10594, we could replace the missing values or merge the variables. However, since the housing price data only runs from 1992 to 2017, we will restrict everything to that period and do not have to deal with them.
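
If we did want to keep 2018, the simplest option would be to drop (or later impute) the affected rows. A minimal sketch of dropping them, which we do not actually need here:

df_10594 %>% filter(!is.na(value)) #drop the three missing Trondheim observations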

We also want to clean the data a bit. The goal is to merge all data frames into a single data frame, so we aim for long data sets with consistent names and values.

The contents variable provides a description of the value variable. Here it has a single factor level, “Unemployed persons registered at the Employment Offices”. This is too long to work with, so we recode it as “reg_unemployed”. We also convert the factor variable month to a date. We then recode the factor levels in region so that “Oslo municipality” becomes simply Oslo and “Trondheim (-2017)” becomes Trondheim, and rename the sex level “Both sexes” to “Population”. Further, we recode the value variable as numeric rather than integer and filter the data set so that only dates from 1992 through 2017 are included. Finally, we save it in processed data. We go through most of these steps for all the SSB data.

df_unemployment <- df_10594 %>% 
        mutate( contents = fct_recode(contents, 
                                      reg_unemployed = "Unemployed persons registered at the Employment Offices"),
                date = lubridate::parse_date_time(month, "ym"),
                region = fct_recode(region, 
                                    Oslo = "Oslo municipality",
                                    Trondheim = "Trondheim (-2017)"),
                sex = fct_recode(sex,
                                 Population = "Both sexes"),
                value = as.numeric(value)
                ) %>%
        select(-month) %>%
        filter(date >= as.POSIXct("1992-01-01", tz = "UTC") & date < as.POSIXct("2018-01-01", tz = "UTC")) #compare like with like: date is POSIXct
        
saveRDS(df_unemployment, file = here("data/processed", "df_unemployment.rds"))

Let’s take a new look at the data

skim(df_unemployment)
ggplot(subset(df_unemployment, sex == "Population"), aes(x = date, y = value, colour = region)) + geom_point(aes(shape = region)) + geom_line()

07459 Population by region, age and gender

Let’s take a look

skim(df_07459)
unique(df_07459$region)
## [1] Oslo municipality Stavanger         Bergen            Trondheim (-2017)
## Levels: Oslo municipality Stavanger Bergen Trondheim (-2017)

This time we have six variables: age, contents, region, sex, year and value. There are no missing values, but there are a lot of zero values for the population. A zero can be correct (no 110-year-olds in a given year), but it can also stand in for missing data, which is in fact the case for Trondheim in 2018. Luckily, we are not going to use 2018, so we do not have to worry about it.
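
To see which zeros are plausible and which look like stand-ins for missing data, one can, for example, tabulate the zero values by region and year. A quick diagnostic, not part of the cleaning pipeline:

df_07459 %>% filter(value == 0) %>% count(region, year) #zero population values per region and year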

We also see that we only have two sexes, males and females. We will spread the data to wide format to easily sum the two and get total population numbers, and then gather it back into the usual long format. We then recode the factor level in contents from “Persons” to “inhabitants”, recode the region factor levels as before, and recode the value variable as numeric. This time the observational unit is the year. We recode it as a date, but the default in R is then to use the 1st of January. As it is a yearly average, both the start and the end of the year seem like strange choices, so we add 183 days to the date to land roughly in the middle of the year. We also save the age variable as numeric for easy filtering later, ignoring the fact that “105 or older” covers more than just 105-year-olds, since that group is not central to our purposes. Finally, we filter the data and save it in processed data.

#Create the total population: spread to wide, sum males and females, gather back to long

df_population <- df_07459 %>% 
        spread( key = sex, value = value) %>%
        mutate( Population = Males + Females) %>%
        gather( key = "sex", value = "value", Males, Females, Population) %>%
        mutate( contents = fct_recode(contents,
                                      inhabitants = "Persons"),
                region = fct_recode(region, 
                                    Oslo = "Oslo municipality",
                                    Trondheim = "Trondheim (-2017)"),
                value = as.numeric(value),
                sex = as.factor(sex),
                date = (lubridate::parse_date_time(year, "y") + lubridate::days(183)),
                age = parse_number(as.character(age))
                ) %>%
        select(-year) %>%
        filter(date >= as.POSIXct("1992-01-01", tz = "UTC") & date < as.POSIXct("2018-01-01", tz = "UTC"))
        
saveRDS(df_population, file = here("data/processed", "df_population.rds"))

Let’s take a look:

skim(df_population)
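
As with the unemployment data, a quick plot is a useful sanity check. For example, we can sum over ages to get the total population per region (a throwaway check, not saved anywhere):

df_population %>%
        filter(sex == "Population") %>%
        group_by(region, date) %>%
        summarise(value = sum(value)) %>%
        ggplot(aes(x = date, y = value, colour = region)) + geom_point(aes(shape = region)) + geom_line()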

Age cohorts

We might also like a version that does not include every single age, but rather age intervals. We will start from the raw data so that we can reuse the code from the previous processing of the population numbers.

df_population_cohorts <- df_07459 %>%
        mutate(
                age = as.numeric(age) - 1, #factor default level is 1 not zero, but age can be zero
                age = ifelse(age %in%  0:19, "under_20",
                            ifelse(age %in% 20:29, "20s",
                            ifelse(age %in% 30:39, "30s",
                            ifelse(age %in% 40:49, "40s",
                            ifelse(age %in% 50:59, "50s",
                            ifelse(age %in% 60:69, "60s", 
                            ifelse(age  >=  70   , "over_69", "Missing"   ))))))),
                age = factor(age, levels = c("under_20", "20s", "30s", "40s", "50s", "60s", "over_69" ) )
                ) %>%
        group_by(region, sex, age, contents, year) %>%
        summarise(value = sum(value)) %>%
        ungroup()
        
str(df_population_cohorts)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1904 obs. of  6 variables:
##  $ region  : Factor w/ 4 levels "Oslo municipality",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex     : Factor w/ 2 levels "Males","Females": 1 1 1 1 1 1 1 1 1 1 ...
##  $ age     : Factor w/ 7 levels "under_20","20s",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ contents: Factor w/ 1 level "Persons": 1 1 1 1 1 1 1 1 1 1 ...
##  $ year    : Factor w/ 34 levels "1986","1987",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ value   : int  44559 44563 44621 44928 45618 46513 47560 48749 50065 51171 ...
unique(df_population_cohorts$age)
## [1] under_20 20s      30s      40s      50s      60s      over_69 
## Levels: under_20 20s 30s 40s 50s 60s over_69
#It is also smart to check the aggregation by comparing with the Statbank
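
As an aside, the nested ifelse() calls above can be written more readably with dplyr::case_when(). An equivalent sketch of the binning step only (the grouping and summing would follow as above):

df_07459 %>%
        mutate( age = as.numeric(age) - 1, #factor levels start at 1, ages at 0
                age = case_when(
                        age < 20 ~ "under_20",
                        age < 30 ~ "20s",
                        age < 40 ~ "30s",
                        age < 50 ~ "40s",
                        age < 60 ~ "50s",
                        age < 70 ~ "60s",
                        TRUE     ~ "over_69"),
                age = factor(age, levels = c("under_20", "20s", "30s", "40s", "50s", "60s", "over_69")) )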

Now we can reuse the processing code for df_population almost verbatim, changing only the object names

df_population_cohorts <- df_population_cohorts %>% #Here
        spread( key = sex, value = value) %>%
        mutate( Population = Males + Females) %>%
        gather( key = "sex", value = "value", Males, Females, Population) %>%
        mutate( contents = fct_recode(contents,
                                      inhabitants = "Persons"),
                region = fct_recode(region, 
                                    Oslo = "Oslo municipality",
                                    Trondheim = "Trondheim (-2017)"),
                value = as.numeric(value),
                sex = as.factor(sex),
                date = (lubridate::parse_date_time(year, "y") + lubridate::days(183))
                ) %>%
        select(-year) %>%
        filter(date >= as.POSIXct("1992-01-01", tz = "UTC") & date < as.POSIXct("2018-01-01", tz = "UTC"))
        
saveRDS(df_population_cohorts, file = here("data/processed", "df_population_cohorts.rds")) # And here
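
A quick check that the cohorts look sensible, using Bergen as an arbitrary example (not part of the saved output):

df_population_cohorts %>%
        filter(sex == "Population", region == "Bergen") %>%
        ggplot(aes(x = date, y = value, colour = age)) + geom_line()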

06726 Housing prices by number and total value by region

skim(df_06726) #Show nice summary stats
unique(df_06726$region) #To show all the regions
## [1] The whole country Oslo municipality Stavanger         Bergen           
## [5] Trondheim (-2017)
## 5 Levels: The whole country Oslo municipality Stavanger ... Trondheim (-2017)
unique(df_06726$contents) #To show all info provided
## [1] Transfers (number)               Total purchase prise (1 000 NOK)
## Levels: Transfers (number) Total purchase prise (1 000 NOK)

This also looks a lot like the previous data sets. There are four variables: contents, region, year and value. There are no missing values.

The contents variable has two levels: the number of transfers and the total purchase price. We also want the average price per transfer, so we spread the data set to wide format, calculate the average, and gather it back into a long data set. While the data is wide we rename the variables so that the names contain no spaces. Then we recode the factor levels for region, convert year to a date as before, and keep only the four cities for 1992 to 2017. Finally, let’s save the result and have a look.

df_home_values <- df_06726 %>%
        spread( key = contents, value = value) %>%
        rename( home_transfers = "Transfers (number)", trans_val_in_1000 = "Total purchase prise (1 000 NOK)") %>% 
        mutate( ave_home_price = (trans_val_in_1000 / home_transfers)*1000 ) %>%
        gather( key = "contents", value = "value", home_transfers, trans_val_in_1000, ave_home_price, factor_key = TRUE ) %>%
        mutate(
                region = fct_recode(region, 
                                    Oslo = "Oslo municipality",
                                    Trondheim = "Trondheim (-2017)"),
                date = (lubridate::parse_date_time(year, "y") + lubridate::days(183))
        ) %>%
        select(-year) %>%
        filter(date >= as.POSIXct("1992-01-01", tz = "UTC") & date < as.POSIXct("2018-01-01", tz = "UTC") & region != "The whole country")
        
saveRDS(df_home_values, file = here("data/processed", "df_home_values.rds"))

Let’s take a look:

skim(df_home_values)
#The total transfer value for the four cities, on a log scale
ggplot(subset(df_home_values, contents == "trans_val_in_1000"), aes(x = date, y = log(value), colour = region)) + geom_line() + geom_point(aes(shape = region))
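
In the same way, one could plot the average price per transfer, the series we computed above (a quick sketch, not part of the saved output):

#Average price per transfer for the four cities
ggplot(subset(df_home_values, contents == "ave_home_price"), aes(x = date, y = value, colour = region)) + geom_line() + geom_point(aes(shape = region))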