Exploratory Analysis of the Rossmann store sales data from Kaggle

14 minutes read
A Scatterplot with smoothing

This is an exploratory analysis in R of the Rossmann Store Sales data which can be found here. The data isn’t huge but the speedup using data.table is noticeable. It is nice to have unmasked data which allows for some interpretations.

Read in the data:

library(data.table)
library(zoo)
library(forecast)
library(ggplot2)
test <- fread("test.csv")
train <- fread("train.csv")
store <- fread("store.csv")

Let’s have a first look at the train set:

str(train)
## Classes 'data.table' and 'data.frame':   1017209 obs. of  9 variables:
##  $ Store        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ DayOfWeek    : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ Date         : chr  "2015-07-31" "2015-07-31" "2015-07-31" "2015-07-31" ...
##  $ Sales        : int  5263 6064 8314 13995 4822 5651 15344 8492 8565 7185 ...
##  $ Customers    : int  555 625 821 1498 559 589 1414 833 687 681 ...
##  $ Open         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Promo        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ StateHoliday : chr  "0" "0" "0" "0" ...
##  $ SchoolHoliday: int  1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, ".internal.selfref")=<externalptr>

…and at the test set:

str(test)
## Classes 'data.table' and 'data.frame':   41088 obs. of  8 variables:
##  $ Id           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Store        : int  1 3 7 8 9 10 11 12 13 14 ...
##  $ DayOfWeek    : int  4 4 4 4 4 4 4 4 4 4 ...
##  $ Date         : chr  "2015-09-17" "2015-09-17" "2015-09-17" "2015-09-17" ...
##  $ Open         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Promo        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ StateHoliday : chr  "0" "0" "0" "0" ...
##  $ SchoolHoliday: int  0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, ".internal.selfref")=<externalptr>

Read more

EBRD: A summary of spatial distribution of projects and project size

9 minutes read
A map depicting project sizes per country

This post summarizes the energy related projects that were financed by the European Bank for Reconstruction and Development. The necessary data was downloaded from the publicly available project database using the following web scraper written in R that is also available on Github. It is not necessary to execute that code if you would like to reproduce the document. The resulting data is available as projects_costscurr.csv. The code that produces the graphs can be found in the Rmd-file of this project.

library(XML)
library(Quandl)
Sys.setlocale("LC_TIME", "C")

#----------------------------------------------------------------------------
# Function to translate German month abbreviations into English.
# Possibly not necessary, but does not do any harm if the dates
# are already in English.
# FILE ON GITHUB: translate_date.R
translate_date <- function(date){
    Sys.setlocale("LC_TIME", "C")
    foreign_dates <- c("Mrz", "Mai", "Okt", "Dez")
    eng_dates <- c("Mar", "May", "Oct", "Dec")
    split <- unlist(strsplit(date, split = " "))
    # split[2] is the month name (abbreviated)
    if (split[2] %in% foreign_dates){
        split[2] <- eng_dates[which(foreign_dates == split[2])]
    }
    date <- paste(split, collapse = " ")
    return(date)
}

#-----------------------------------------------------------------------------
# Function to load the manually saved webpages into R
# The extracted tables with columns date, project ID, country, 
# project title, sector, public/private and status are then 
# merged and returned
# FILE ON GITHUB: get_projects.R
get_projects <- function(pathtofiles){
    library(XML)
    files <- list.files(pathtofiles, pattern = "htm")
    files <- paste0(pathtofiles, files) # Full path 
    projects <- data.frame(matrix(NA, ncol = 7, nrow = length(files) * 25))
    for (file in seq_along(files)){
        doc <- htmlTreeParse(files[file], isURL = F, useInternalNodes = T)
        root <- xmlRoot(doc)
        table <- readHTMLTable(root)
        ifelse(file == 1, 
               projects <- table[2:26, ], 
               projects <- rbind(projects, table[2:26, ]))
    }

    colnames(projects) <- c("Date", "ProjectID", "Country", "Project Title", 
                            "Sector", "Public/Private", "Status")
    dates <- as.character(projects$Date)
    for (i in seq_along(dates)){
        dates[i] <- translate_date(dates[i])
    }
    projects$Date <- as.Date(dates, format = "%d %b %Y")
    return(projects)
}

# Store dataframe with project data in "projects" ----------------------------
projects <- get_projects("HTML/")
delete <- is.na(projects$ProjectID) | projects$ProjectID == ""
projects <- projects[-which(delete), ]; rm(delete)
length(unique(projects$ProjectID))
# 176 projects left, one project in Turkey had all missing values except for 
# the country name. One duplicate, will be removed later.

# ------------------------------------------------------------------------------
# Extract project cost from seperate project page
# Generate the URL, download the webpage and extract the paragraph about
# project costs.
# Function argument should be one row of the projects table
# TAKES SOME TIME, because all project pages have to be downloaded
# (alternatively just skip this and load costsCleaner.csv in the next step)
# FILE ON GITHUB: load_project_cost.R
load_project_cost <- function(project){
    library(XML)
    # Random sleep interval to stress the site less
    Sys.sleep(sample(seq(1, 3, by=0.01), 1))
    message("Loading project")
    date <- as.Date(project$Date, format = "%d %b %Y")
    year <- format(date, "%Y")
    id <- as.character(project$ProjectID)
    # URL scheme: http://www.ebrd.com/english/pages/project/psd/YEAR/PROJECTID.shtml
    url <- paste0("http://www.ebrd.com/english/pages/project/psd/", 
                  year, "/", id, ".shtml")
    # Check if the page can be loaded, if not return cost = NA
    doc <- try(htmlTreeParse(url, useInternalNodes = T))
    if (class(doc)[1] == "try-error") {
        return(c(ProjectID = project$ProjectID, cost = NA, url = url))
    } else {
        root <- xmlRoot(doc) 
        # This xpath statement looks for an h2 (heading) with the text 
        # "Project Cost" and if found returns the following sibling's text
        # by which is the paragraph under the heading by applying xmlValue()
        # (Not sure if normalize space is necessary)
        xpath <- paste0("//h2[text()[normalize-space(.)='Project Cost']]",
                        "/following-sibling::*[position()=1 and self::p]")
        cost <- xpathSApply(doc, 
                            path = xpath,
                            fun = xmlValue) 
        if (is.null(cost)) cost <- "Project cost not found"
        print(paste(id, cost, url))     
        return(c(ProjectID = id, cost = cost, url = url))
    }
}

#------------------------------------------------------------------------------
# Use load_project_cost() to loop over all projects in the projects 
# data frame and return all costs 
costmat <- matrix(NA, nrow = nrow(projects), ncol = 3)
colnames(costmat) <- c("ProjectID", "Cost", "URL")
for (i in 1:nrow(projects)){
    cost_and_url <- load_project_cost(projects[i, ])
    costmat[i, ] <- cost_and_url
}
# Backup in workspace folder if desired
# write.csv(costmat, "costs.csv", row.names = F)

# Convert costs paragraph to numeric -------------------------------------------
# The costs paragraph is very messy.
# I edited the file MANUALLY (sorry) and saved it as costsCleaner.csv.
# There are 5 new columns: Cost in millions of Rubles, USD, Euro, Romanian Leu,
# Turkish Lira and Polish zloty (all numeric).

costsCleaner <- read.csv("~/Dropbox/ProjectScrapers/EBRD/costsCleaner.csv",
                         dec=",", na.strings="", stringsAsFactors=FALSE)

# Merge dataframes
projects_costs <- merge(projects, costsCleaner, by = "ProjectID")
# Remove duplicates
projects_costs <- projects_costs[!duplicated(projects_costs$ProjectID), ]
dim(projects_costs)
# 176 unique projects

#-------------------------------------------------------------------------
# Exchange rate data is needed, download it from Quandl.
# 500 API calls per day if registered (otherwise 50).
# Convert USD and RUB to EUR. If necessary (prior to 2000) convert to DM first
# and then to EUR.
# The function argument should a data frame with at least the columns
# CostEURm, CostUSDm, CostRUBm and date
# Replaces CostEURm if that value is NA
# FILE ON GITHUB: costs_to_eur.R
costs_to_eur <- function(projects_costs = projects_costs){
    library(Quandl)
    token <- "yourtoken" # hidden
    # First year 1971
    DMUSD <- Quandl("FRED/EXGEUS", authcode = Quandl.auth(token)) 
    currencies <- c("EURRUB", "EURUSD", "EURPLN", "EURRON", "EURTRY")
    for (i in seq_along(currencies)){
        if (!exists(currencies[i])){
            fx <- Quandl(paste0("QUANDL/", currencies[i]),
                         authcode = Quandl.auth(token))
            fx <- fx[, c("Date", "Rate")]
            assign(x = currencies[i], value = fx)
        }
    }

    # All the time series start in 1999
    EURRUB <- Quandl("QUANDL/EURRUB", authcode = Quandl.auth(token))
    EURRUB <- EURRUB[, c("Date", "Rate")] # First Date 1999-09-06
    EURUSD <- Quandl("QUANDL/EURUSD", authcode = Quandl.auth(token))
    EURUSD <- EURUSD[, c("Date", "Rate")] # First Date 1999-09-06
    EURPLN <- Quandl("QUANDL/EURPLN", authcode = Quandl.auth(token))
    EURPLN <- EURPLN[, c("Date", "Rate")] # First Date 1999-09-06
    EURRON <- Quandl("QUANDL/EURRON", authcode = Quandl.auth(token))
    EURRON <- EURRON[, c("Date", "Rate")] # First Date 1999-09-06
    EURTRY <- Quandl("QUANDL/EURTRY", authcode = Quandl.auth(token))
    EURTRY <- EURTRY[, c("Date", "Rate")] # First Date 1999-09-06

    for (i in 1:nrow(projects_costs)){
        date <- projects_costs$Date[i]
        year <- as.numeric(format(date, format = "%Y"))
        # Convert TRY to EUR
        if (!is.na(projects_costs$CostTLm[i])){ # TL, not TRY in EBRD data
            # Use oldest available value if project date out of range
            ifelse(test = is.na(EURTRY[EURTRY$Date == date, ]$Rate), 
                   yes = rate <- tail(EURTRY$Rate, 1),
                   no = rate <- EURTRY[EURTRY$Date == date, ]$Rate)
            cost <- projects_costs$CostTLm[i] / rate
            projects_costs$CostEURm[i] <- cost
        }
        # Convert RON to EUR
        if (!is.na(projects_costs$CostRONm[i])){
            # Use oldest available value if project date out of range
            ifelse(test = is.na(EURRON[EURRON$Date == date, ]$Rate), 
                   yes = rate <- tail(EURRON$Rate, 1),
                   no = rate <- EURRON[EURRON$Date == date, ]$Rate)
            cost <- projects_costs$CostRONm[i] / rate
            projects_costs$CostEURm[i] <- cost
        }
        # Convert PLN to EUR
        if (!is.na(projects_costs$CostPLNm[i])){
            # Use oldest available value if project date out of range
            ifelse(test = is.na(EURPLN[EURPLN$Date == date, ]$Rate), 
                   yes = rate <- tail(EURPLN$Rate, 1),
                   no = rate <- EURPLN[EURPLN$Date == date, ]$Rate)
            cost <- projects_costs$CostPLNm[i] / rate
            projects_costs$CostEURm[i] <- cost
        }
        # Convert RUB to EUR
        if (!is.na(projects_costs$CostRUBm[i])){
            # Use oldest available value if project date out of range
            ifelse(test = is.na(EURRUB[EURRUB$Date == date, ]$Rate), 
                   yes = rate <- tail(EURRUB$Rate, 1),
                   no = rate <- EURRUB[EURRUB$Date == date, ]$Rate)
            cost <- projects_costs$CostRUBm[i] / rate
            projects_costs$CostEURm[i] <- cost
        }
        # Convert USD to EUR or DM
        if (!is.na(projects_costs$CostUSDm[i])){
            if (year < 2000){ # Convert USD to DM to EUR
                # Only monthly data
                date <- format(date, format = "%Y-%m")
                date <- paste0(date, "-01")
                rate <- DMUSD[DMUSD$Date == date, ]$Value
                cost <- projects_costs$CostUSDm[i] / rate
                # cost are DM now, convert to EUR
                cost <- cost / 1.9558
                projects_costs$CostEURm[i] <- cost
            } else {
                rate <- EURUSD[EURUSD$Date == date, ]$Rate
                cost <- projects_costs$CostUSDm[i] / rate
                projects_costs$CostEURm[i] <- cost
            }
        }
    }
    return(projects_costs)
}

# Apply currency conversion function ------------------------------------------
projects_costs <- costs_to_eur(projects_costs)

#-----------------------------------------------------------------------------
# Add column with prices in current values (adjusted for inflation) 
# Convert EUR to current EUR, taking inflation into account
# CPI Euro Area will be used for the adjustement.
# Convert all project costs in EUR to current EUR. "Current" refers to June 2014 here.
# "costs" in the following function should be a data frame with date in the
# first column ("YYYY-MM-DD" or "YYYY-MM") and the costs in EUR in the second column
# FILE ON GITHUB: eur_to_current.R
Sys.setlocale("LC_TIME", "C")
eur_to_current <- function(cost){
    library(Quandl)
    library(zoo)
    dates <- cost[,1]
    eur <- cost[,2]
    if(!exists("CPI")){
        token <- "yourtoken" # hidden
        # 1990 - 2014. 2005 = Index 100. Monthly data.
        CPI <- Quandl("RATEINF/CPI_EUR", collapse="monthly", 
                      authcode = Quandl.auth(token)) 
    }
    CPI$Date <- as.yearmon(CPI$Date)
    dates <- as.yearmon(dates) # drop day
    cpiCurr <- CPI[CPI$Date == "Jun 2014",]$CPI
    toCurr <- function(x){
        as.numeric(x[2]) * (cpiCurr / CPI[CPI$Date == x[1], ]$CPI)
    }
    eurCurr <- apply(X = data.frame(dates, eur), 
                     MARGIN = 1, 
                     FUN = toCurr)
    return(eurCurr)
}

# Apply inflation adjustment -------------------------------------------------
CostCurrEURm <- eur_to_current(data.frame(projects_costs$Date, 
                                          projects_costs$CostEURm))
projects_costs <- cbind(projects_costs, CostCurrEURm)
# Save resulting data if desired
# write.csv(projects_costs, "projects_costscurr.csv", row.names = F)

#----------------------------------------------------------------------------
# This is the resulting data frame
str(projects_costs)
# 'data.frame':      176 obs. of  16 variables:
# $ ProjectID     : Factor w/ 178 levels "41553","42363",..: 126 151 176 152 153 154 155 156 127 157 ...
# $ Date          : Date, format: "2005-05-12" "2000-07-24" "1996-10-15" ...
# $ Country       : Factor w/ 33 levels "Country","Egypt",..: 24 14 21 16 21 31 20 21 24 22 ...
# $ Project Title : Factor w/ 177 levels "ALPASLAN II DAM HYDRO PROJECT",..: 137 156 176 169 174 172 160 164 135 151 ...
# $ Sector        : Factor w/ 2 levels "Power and energy",..: 1 1 1 1 1 1 1 1 1 1 ...
# $ Public/Private: Factor w/ 3 levels "Private","Public",..: 1 2 2 1 2 2 2 2 2 2 ...
# $ Status        : Factor w/ 10 levels "","BA, PS","Cn",..: 7 7 9 9 9 9 7 3 3 9 ...
# $ CostRONm      : int  NA NA NA NA NA NA NA NA NA NA ...
# $ CostTLm       : int  NA NA NA NA NA NA NA NA NA NA ...
# $ CostPLNm      : num  NA NA NA NA NA NA NA NA NA NA ...
# $ CostRUBm      : int  NA NA NA NA NA NA NA NA NA NA ...
# $ CostUSDm      : num  NA NA 163 NA NA ...
# $ CostEURm      : num  1050 272 54.6 94.2 111 ...
# $ Cost          : chr  "EUR 1,050 million." "\nApproximately US$ 231 million (€272 million). \n" "US$ 163.20 million." "\nUS$ 78.0 million (€94.2 million).\n" ...
# $ URL           : chr  "http://www.ebrd.com/english/pages/project/psd/2005/11865.shtml" "http://www.ebrd.com/english/pages/project/psd/2000/12413.shtml" "http://www.ebrd.com/english/pages/project/psd/1996/1314.shtml" "http://www.ebrd.com/english/pages/project/psd/2000/13220.shtml" ...
# $ CostCurrEURm  : num  1241 358 76 123 145 ...

Since 1996 the EBRD has financed 176 energy related projects in 32 different countries with a median project cost of 115 million euros and a mean project cost of 234 million euros. Most project costs are below 200 million euros. The EBRD focuses on Eastern Europe and Russia as can be seen below.

Read more