Describe the health of the various employers within the city limits. What employment patterns do you obesrve? Do you notice any areas of particularly high or low turnover?
In HW3 we will mainly focus on answering the 3rd part of challenge 3 in VAST 2022, ie.
Loading packages
packages = c('tidyverse', 'plotly', 'readxl', 'knitr', 'dplyr', 'ggplot2',
'grid','plotly','ggstatsplot','ggiraph','DT','patchwork',
'gganimate','gifski','gapminder','treemap','treemapify','rPackedBar')
for(p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p, character.only = T)
}
Reading files
Buildings <- read_csv("dt/Buildings.csv")
Jobs <- read_csv("dt/Jobs.csv")
Employers <- read_csv("dt/Employers.csv")
Participants <- read_rds("dt/hw2_participants.rds")
Below code chuck works on TravelJournal.csv and CheckinJournal.csv to calculate the average traveling time and average checkin/working time.
avgcheckintime<-read_rds("dt/avgcheckintimehw3.rds")
avgtraveltime<-read_rds("dt/avgtraveltimehw3.rds")
avgcheckintime2<-spread(avgcheckintime, venueType.end, time)
colnames(avgcheckintime2)[2] <- "Apartmentavgduration"
colnames(avgcheckintime2)[3] <- "Pubavgduration"
colnames(avgcheckintime2)[4] <- "Restaurantavgduration"
colnames(avgcheckintime2)[5] <- "Workplaceavgduration"
workplaceaprtmenttraveltime<-subset(avgtraveltime, venueType.start=="Apartment")
workplaceaprtmenttraveltime<-subset(workplaceaprtmenttraveltime, venueType.end=="Workplace")
knitr::kable(head(avgcheckintime2), format = 'html')
participantId | Apartmentavgduration | Pubavgduration | Restaurantavgduration | Workplaceavgduration |
---|---|---|---|---|
0 | 7.511465 | 1.2423246 | 0.3333333 | 3.993798 |
1 | 5.733684 | 1.1098485 | 0.3333333 | 7.901840 |
2 | 5.721255 | 0.7616697 | 0.3333333 | 6.046948 |
3 | 7.470541 | 1.4908293 | 0.3333333 | 3.993798 |
4 | 6.703136 | 1.4824811 | 0.3333333 | 4.041796 |
5 | 5.838855 | 1.1943285 | 0.3333333 | 7.975232 |
venueType.start | venueType.end | participantId | time | |
---|---|---|---|---|
9 | Apartment | Workplace | 0 | 2.0000000 |
19 | Apartment | Workplace | 1 | 0.5833333 |
29 | Apartment | Workplace | 2 | 1.1666667 |
39 | Apartment | Workplace | 3 | 1.0000000 |
47 | Apartment | Workplace | 4 | 0.2489648 |
57 | Apartment | Workplace | 5 | 0.5833333 |
Below code chuck works on Jobs.csv to calculate total working hours and total weekly pay.
Jobs1 <- Jobs %>% separate(daysToWork, c('day0','day1','day2','day3','day4','day5','day6'))
#sum(c$day0!="")
#sum(c$day6!="")
Jobs = subset(Jobs1, select = -c(day0,day6) )
Jobs<-Jobs %>%
mutate(totalworkhour=difftime(Jobs$endTime,Jobs$startTime,units='hours')*5)
Jobs<-Jobs %>%
mutate(totalweeklypay=hourlyRate*totalworkhour)
Jobs$totalweeklypay=as.numeric(Jobs$totalweeklypay)
Jobs <-Jobs %>%
mutate(educationRequirement = fct_relevel(educationRequirement,'Low',
'HighSchoolOrCollege', 'Bachelors', 'Graduate'))
Below code chuck works on Employers.csv to identify building types and location.
Below code chuck revealed that buildingID in Attributes is different from VenueID in Journals and it’s hard to link Employers to Participants.
#If Employers compared to TravelJournal, we can see not all employerId showed up in TravelJournal,
#confirming some employers do not have employees and can not be linked to participantID.
#Employers <- Employers[order(Employers$buildingId),]
#due to the size of travelJournal, we will not read the file here. However, below code shows
#VenueId is different from buildingId.
#TravelJournal3<-TravelJournal2 %>%
# mutate(buildingId=travelEndLocationId)
#Employers2<-Employers %>%
# select("employerId","employerlocation","buildingId", "buildinglocation" ,
# "buildingType", "maxOccupancy", "units" ) %>%
# distinct() %>%
# right_join(TravelJournal2, by = 'buildingId')
Below code chuck works on TravelJournal.csv to reveal the turnover rate by each participantID.
Workplace1<-read_rds("dt/Workplace1hw3.rds")
Participants<-Workplace1 %>%right_join(Participants, by = 'participantId')
2.1 Employer provided information
The data has revealed that employers all reported a 8 hrs a day, 5 days a week, total of 40hrs/week employment pattern. While it has also revealed that employees worked various amount of time per week (assuming work hours = workplace checkin hours). *In this attempt I tried use mean() to calculate the checkin duration at workplace, however, in reality, individual had multiple checkin durations in a day, thus below “employee reported hrs is not accurate”, but needs further wrangling with condition “under the same day”.
dailyhour_employer <- Jobs %>%
group_by(educationRequirement) %>%
summarise(
n=n(),
dailyhrs=mean(totalworkhour)/5)
Participants<-avgcheckintime2 %>%
select("participantId", "Apartmentavgduration", "Pubavgduration",
"Restaurantavgduration", 'Workplaceavgduration') %>%
distinct() %>%
right_join(Participants, by = "participantId")
dailyhour_employee <- Participants%>%
group_by(educationLevel) %>%
summarise(
n=n(),
dailyhrs=mean(Workplaceavgduration))
df1 <- data.frame(dailyhour_employer$educationRequirement,
dailyhour_employer$dailyhrs, dailyhour_employee$dailyhrs)
fig <- plot_ly(df1, x = ~df1$dailyhour_employer.educationRequirement,
y = ~df1$dailyhour_employer.dailyhrs, type = 'bar',
name='employer reported hrs')
fig <- fig %>% add_trace(y = ~df1$dailyhour_employee.dailyhrs,
name = 'employee reported hrs')
fig <- fig %>% layout(title = "Fg.1-1 Daily workhours vs educational requirement",
xaxis = list(title = ""),
yaxis = list(title = ""),barmode = 'group')
fig
The total weekly pay’s distribution is as per below. As expected, employer gives higher salary to better educated candidates.
weeklypay_education <- Jobs %>%
group_by(educationRequirement) %>%
summarise(
n=n(),
mean=mean(totalweeklypay),
sd=sd(totalweeklypay))%>%
mutate(se=sd/sqrt(n-1))
knitr::kable(head(weeklypay_education), format = 'html')
educationRequirement | n | mean | sd | se |
---|---|---|---|---|
Low | 119 | 490.8497 | 160.4249 | 14.768306 |
HighSchoolOrCollege | 705 | 586.9923 | 247.5990 | 9.331737 |
Bachelors | 330 | 934.3181 | 521.0650 | 28.727241 |
Graduate | 174 | 1355.1308 | 676.4057 | 51.426170 |
ggplot(weeklypay_education) +
geom_errorbar(
aes(x=educationRequirement,
ymin=mean-se,
ymax=mean+se),
width=0.2,
colour="black",
alpha=0.9,
size=0.5) +
geom_point(aes
(x=educationRequirement,
y=mean),
stat="identity",
color="red",
size = 1.5,
alpha=1) +
ggtitle("Fg.1-2 Weekly pay vs educational requirement")+
theme(plot.title = element_text(hjust = 0.5))
Company 848, among with other 9 companies, are the top paying employers in the city, paying a weekly salary of 1.3k to 1.6k on average to each employee respectively.
weeklypay_employer <- Jobs %>%
group_by(employerId) %>%
summarise(
n=n(),
mean=mean(totalweeklypay))
colnames(weeklypay_employer)[2] <- "headcount"
colnames(weeklypay_employer)[3] <- "averagepay"
weeklypay_employer_amount<-weeklypay_employer %>% slice_max(averagepay, n = 10)
weeklypay_employer_amount<-weeklypay_employer_amount[order(weeklypay_employer_amount$averagepay),
decreasing =FALSE]
weeklypay_employer_headcount<-weeklypay_employer %>% slice_max(headcount, n = 50)
weeklypay_employer_headcount<-weeklypay_employer_headcount[order(weeklypay_employer_headcount$averagepay),
decreasing =FALSE]
weeklypay_employer_amount$employerId <- sapply(weeklypay_employer_amount$employerId, function(x)
paste('Company',x ))
weeklypay_employer_headcount$employerId <- sapply(weeklypay_employer_headcount$employerId, function(x) paste('Company',x ))
fig <- plot_ly(weeklypay_employer_amount,y=~weeklypay_employer_amount$employerId,
x=~weeklypay_employer_amount$averagepay,type ='bar',orientation ='h',
marker = list(color = 'rgb(158,202,225)',
line = list(color = 'rgb(8,48,107)',
width = 1.5)))
fig <- fig %>% layout(title = "Fg.1-3 Top 10 highest paying employers",
xaxis = list(title = ""),
yaxis = list(title = "",
categoryorder = 'array',
categoryarray = ~weeklypay_employer_amount$employerId))
fig
Company 383, among with other 50 companies, are the employers who hire the most employees in the city. Ranging 8-9 pax per company.
fig <- plot_ly(weeklypay_employer_headcount, y = ~weeklypay_employer_headcount$employerId,
x = ~weeklypay_employer_headcount$headcount, type = 'bar',orientation = 'h',
marker = list(color = 'rgb(158,202,225)',
line = list(color = 'rgb(8,48,107)',
width = 1.5)))
fig <- fig %>% layout(barmode = 'stack',
title = "Fg.1-4 Employers with most employees",
xaxis = list(title = ""),
yaxis = list(title = "",
categoryorder = 'array',
categoryarray =
~weeklypay_employer_headcount$employerId))
fig
Participants of OHIO city spend 1 to 2+ hrs in workplace/apartment commute, which is higher than the national average of 28min according to U.S. Census Bureau.
workplaceaprtmenttraveltime$tooltip <- c(paste0("ID=", workplaceaprtmenttraveltime$participantId,
"time=", workplaceaprtmenttraveltime$time))
p <- ggplot(workplaceaprtmenttraveltime,aes(x = time))+
geom_dotplot_interactive(aes(tooltip = workplaceaprtmenttraveltime$tooltip),stackgroups = TRUE,
binwidth = 0.009,method = "histodot")+
scale_y_continuous(NULL,breaks = NULL)+
ggtitle("Fg.1-5 Workplace/Apartment commute time")+
theme(plot.title = element_text(hjust = 0.5))
girafe(ggobj = p,width_svg = 20,height_svg = 20*0.618)
With 1011 participants, majority of them had worked a single place during the time surveyed. The mean of places worked is 1.20. With a null hypothesis that one works for more than one place (ie. model/hypothesis of place worked =1), our log Bayes factor has shown the null hypothesis cannot be rejected, meaning indeed in city of OHIO, people have high likelyhood of having worked for more than one employer during time surveyed.
set.seed(123)
gghistostats(
data = Participants, ## data from which variable is to be taken
x = numberofplacesworked, ## numeric variable
xlab = "numbers of places worked", ## x-axis label
title = "Distribution of turnover rate", ## title for the plot
test.value = 1, ## test value
caption = "VAST Challenge 2022"
)