Take home Exercise 3

Describe the health of the various employers within the city limits. What employment patterns do you obesrve? Do you notice any areas of particularly high or low turnover?

Bomin Kim
2022-04-25

Overview

In HW3 we will mainly focus on answering the 3rd part of challenge 3 in VAST 2022, ie.

  1. The employment pattern in city OHIO USA
  2. Turnover rate

Part 1. Data Preparation

Loading packages

packages = c('tidyverse', 'plotly', 'readxl', 'knitr', 'dplyr', 'ggplot2', 
             'grid','plotly','ggstatsplot','ggiraph','DT','patchwork',
             'gganimate','gifski','gapminder','treemap','treemapify','rPackedBar')
for(p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p, character.only = T)
}

Reading files

Buildings <- read_csv("dt/Buildings.csv")
Jobs <- read_csv("dt/Jobs.csv")
Employers <- read_csv("dt/Employers.csv")
Participants <- read_rds("dt/hw2_participants.rds")

Below code chuck works on TravelJournal.csv and CheckinJournal.csv to calculate the average traveling time and average checkin/working time.

avgcheckintime<-read_rds("dt/avgcheckintimehw3.rds")

avgtraveltime<-read_rds("dt/avgtraveltimehw3.rds")

avgcheckintime2<-spread(avgcheckintime, venueType.end, time)
colnames(avgcheckintime2)[2] <- "Apartmentavgduration"
colnames(avgcheckintime2)[3] <- "Pubavgduration"
colnames(avgcheckintime2)[4] <- "Restaurantavgduration"
colnames(avgcheckintime2)[5] <- "Workplaceavgduration"

workplaceaprtmenttraveltime<-subset(avgtraveltime, venueType.start=="Apartment")
workplaceaprtmenttraveltime<-subset(workplaceaprtmenttraveltime, venueType.end=="Workplace")

knitr::kable(head(avgcheckintime2), format = 'html')
participantId Apartmentavgduration Pubavgduration Restaurantavgduration Workplaceavgduration
0 7.511465 1.2423246 0.3333333 3.993798
1 5.733684 1.1098485 0.3333333 7.901840
2 5.721255 0.7616697 0.3333333 6.046948
3 7.470541 1.4908293 0.3333333 3.993798
4 6.703136 1.4824811 0.3333333 4.041796
5 5.838855 1.1943285 0.3333333 7.975232
knitr::kable(head(workplaceaprtmenttraveltime), format = 'html')
venueType.start venueType.end participantId time
9 Apartment Workplace 0 2.0000000
19 Apartment Workplace 1 0.5833333
29 Apartment Workplace 2 1.1666667
39 Apartment Workplace 3 1.0000000
47 Apartment Workplace 4 0.2489648
57 Apartment Workplace 5 0.5833333

Below code chuck works on Jobs.csv to calculate total working hours and total weekly pay.

Jobs1 <- Jobs %>% separate(daysToWork, c('day0','day1','day2','day3','day4','day5','day6'))
#sum(c$day0!="")
#sum(c$day6!="")
Jobs = subset(Jobs1, select = -c(day0,day6) )
Jobs<-Jobs %>%
  mutate(totalworkhour=difftime(Jobs$endTime,Jobs$startTime,units='hours')*5)
Jobs<-Jobs %>%
  mutate(totalweeklypay=hourlyRate*totalworkhour)
Jobs$totalweeklypay=as.numeric(Jobs$totalweeklypay)
Jobs <-Jobs %>%
    mutate(educationRequirement = fct_relevel(educationRequirement,'Low', 
                                              'HighSchoolOrCollege', 'Bachelors', 'Graduate'))

Below code chuck works on Employers.csv to identify building types and location.

Employers<-Employers %>%
    select("employerId","location","buildingId") %>%
    distinct() %>%
    left_join(Buildings, by = 'buildingId')
colnames(Employers)[2] <- "employerlocation"
colnames(Employers)[4] <- "buildinglocation"

Below code chuck revealed that buildingID in Attributes is different from VenueID in Journals and it’s hard to link Employers to Participants.

#If Employers compared to TravelJournal, we can see not all employerId showed up in TravelJournal,
#confirming some employers do not have employees and can not be linked to participantID.

#Employers <- Employers[order(Employers$buildingId),]

#due to the size of travelJournal, we will not read the file here. However, below code shows 
#VenueId is different from buildingId.

#TravelJournal3<-TravelJournal2 %>%
#  mutate(buildingId=travelEndLocationId)

#Employers2<-Employers %>%
#    select("employerId","employerlocation","buildingId", "buildinglocation" ,
#           "buildingType", "maxOccupancy", "units" ) %>%
#    distinct() %>%
#    right_join(TravelJournal2, by = 'buildingId')

Below code chuck works on TravelJournal.csv to reveal the turnover rate by each participantID.

Workplace1<-read_rds("dt/Workplace1hw3.rds")

Participants<-Workplace1 %>%right_join(Participants, by = 'participantId')

Part 2. Employment Pattern

2.1 Employer provided information

The data has revealed that employers all reported a 8 hrs a day, 5 days a week, total of 40hrs/week employment pattern. While it has also revealed that employees worked various amount of time per week (assuming work hours = workplace checkin hours). *In this attempt I tried use mean() to calculate the checkin duration at workplace, however, in reality, individual had multiple checkin durations in a day, thus below “employee reported hrs is not accurate”, but needs further wrangling with condition “under the same day”.

dailyhour_employer <- Jobs %>%
  group_by(educationRequirement) %>%
  summarise(
    n=n(),
    dailyhrs=mean(totalworkhour)/5)

Participants<-avgcheckintime2 %>%
    select("participantId", "Apartmentavgduration", "Pubavgduration", 
           "Restaurantavgduration", 'Workplaceavgduration') %>%
    distinct() %>%
    right_join(Participants, by = "participantId")

dailyhour_employee <- Participants%>%
  group_by(educationLevel) %>%
  summarise(
    n=n(),
    dailyhrs=mean(Workplaceavgduration))

df1 <- data.frame(dailyhour_employer$educationRequirement, 
                  dailyhour_employer$dailyhrs, dailyhour_employee$dailyhrs)

fig <- plot_ly(df1, x = ~df1$dailyhour_employer.educationRequirement, 
               y = ~df1$dailyhour_employer.dailyhrs, type = 'bar', 
               name='employer reported hrs')
fig <- fig %>% add_trace(y = ~df1$dailyhour_employee.dailyhrs, 
                         name = 'employee reported hrs')

fig <- fig %>% layout(title = "Fg.1-1 Daily workhours vs educational requirement",
         xaxis = list(title = ""),
         yaxis = list(title = ""),barmode = 'group')
fig

The total weekly pay’s distribution is as per below. As expected, employer gives higher salary to better educated candidates.

weeklypay_education <- Jobs %>%
  group_by(educationRequirement) %>%
  summarise(
    n=n(),
    mean=mean(totalweeklypay),
    sd=sd(totalweeklypay))%>%
mutate(se=sd/sqrt(n-1))

knitr::kable(head(weeklypay_education), format = 'html')
educationRequirement n mean sd se
Low 119 490.8497 160.4249 14.768306
HighSchoolOrCollege 705 586.9923 247.5990 9.331737
Bachelors 330 934.3181 521.0650 28.727241
Graduate 174 1355.1308 676.4057 51.426170
ggplot(weeklypay_education) +
  geom_errorbar(
    aes(x=educationRequirement, 
        ymin=mean-se, 
        ymax=mean+se), 
    width=0.2, 
    colour="black", 
    alpha=0.9, 
    size=0.5) +
  geom_point(aes
           (x=educationRequirement, 
            y=mean), 
           stat="identity", 
           color="red",
           size = 1.5,
           alpha=1) +
  ggtitle("Fg.1-2 Weekly pay vs educational requirement")+
  theme(plot.title = element_text(hjust = 0.5))

Company 848, among with other 9 companies, are the top paying employers in the city, paying a weekly salary of 1.3k to 1.6k on average to each employee respectively.

weeklypay_employer <- Jobs %>%
  group_by(employerId) %>%
  summarise(
    n=n(),
    mean=mean(totalweeklypay))
colnames(weeklypay_employer)[2] <- "headcount"
colnames(weeklypay_employer)[3] <- "averagepay"


weeklypay_employer_amount<-weeklypay_employer %>% slice_max(averagepay, n = 10)
weeklypay_employer_amount<-weeklypay_employer_amount[order(weeklypay_employer_amount$averagepay),
                                                     decreasing =FALSE]
weeklypay_employer_headcount<-weeklypay_employer %>% slice_max(headcount, n = 50)
weeklypay_employer_headcount<-weeklypay_employer_headcount[order(weeklypay_employer_headcount$averagepay),
                                                           decreasing =FALSE]
weeklypay_employer_amount$employerId <- sapply(weeklypay_employer_amount$employerId, function(x)
  paste('Company',x ))
weeklypay_employer_headcount$employerId <- sapply(weeklypay_employer_headcount$employerId, function(x) paste('Company',x ))

fig <- plot_ly(weeklypay_employer_amount,y=~weeklypay_employer_amount$employerId,
               x=~weeklypay_employer_amount$averagepay,type ='bar',orientation ='h',
        marker = list(color = 'rgb(158,202,225)',
                      line = list(color = 'rgb(8,48,107)',
                                  width = 1.5)))
fig <- fig %>% layout(title = "Fg.1-3 Top 10 highest paying employers",
         xaxis = list(title = ""),
         yaxis = list(title = "",
                      categoryorder = 'array',
                      categoryarray = ~weeklypay_employer_amount$employerId))
fig

Company 383, among with other 50 companies, are the employers who hire the most employees in the city. Ranging 8-9 pax per company.

fig <- plot_ly(weeklypay_employer_headcount, y = ~weeklypay_employer_headcount$employerId,
               x = ~weeklypay_employer_headcount$headcount, type = 'bar',orientation = 'h',
        marker = list(color = 'rgb(158,202,225)',
                      line = list(color = 'rgb(8,48,107)',
                                  width = 1.5)))
fig <- fig %>% layout(barmode = 'stack',
                             title = "Fg.1-4 Employers with most employees",
                             xaxis = list(title = ""),
                             yaxis = list(title = "",
                                          categoryorder = 'array',
                                          categoryarray = 
                                            ~weeklypay_employer_headcount$employerId))
fig

Participants of OHIO city spend 1 to 2+ hrs in workplace/apartment commute, which is higher than the national average of 28min according to U.S. Census Bureau.

workplaceaprtmenttraveltime$tooltip <- c(paste0("ID=", workplaceaprtmenttraveltime$participantId,
                                                "time=", workplaceaprtmenttraveltime$time))
p <- ggplot(workplaceaprtmenttraveltime,aes(x = time))+
  geom_dotplot_interactive(aes(tooltip = workplaceaprtmenttraveltime$tooltip),stackgroups = TRUE,
                           binwidth = 0.009,method = "histodot")+
  scale_y_continuous(NULL,breaks = NULL)+
  ggtitle("Fg.1-5 Workplace/Apartment commute time")+
  theme(plot.title = element_text(hjust = 0.5))

girafe(ggobj = p,width_svg = 20,height_svg =  20*0.618)

Part 3. Turnover Rate

With 1011 participants, majority of them had worked a single place during the time surveyed. The mean of places worked is 1.20. With a null hypothesis that one works for more than one place (ie. model/hypothesis of place worked =1), our log Bayes factor has shown the null hypothesis cannot be rejected, meaning indeed in city of OHIO, people have high likelyhood of having worked for more than one employer during time surveyed.

set.seed(123)

gghistostats(
  data = Participants, ## data from which variable is to be taken
  x = numberofplacesworked, ## numeric variable
  xlab = "numbers of places worked", ## x-axis label
  title = "Distribution of turnover rate", ## title for the plot
  test.value = 1, ## test value
  caption = "VAST Challenge 2022"
)