💪🧑 A Picture is worth a thousand words

Problem Statement

The human capital department of a large corporation wants to know why employee turnover is high. They also want to understand which employees are more likely to leave, and why.

Aim and Objectives

  1. Which department has the highest employee turnover? Which one has the lowest?
  2. Investigate which variables seem to be better predictors of employee departure.
  3. Recommend ways to reduce employee turnover.

Loading the Data

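library(tidyverse)   # dplyr, ggplot2 and friends
library(tidymodels)  # rsample, recipes, parsnip, workflows, yardstick
library(vip)         # variable importance plots

head(df)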
## # A tibble: 6 x 10
##   department promoted review projects salary tenure satisfaction bonus
##   <chr>         <dbl>  <dbl>    <dbl> <chr>   <dbl>        <dbl> <dbl>
## 1 operations        0  0.578        3 low         5        0.627     0
## 2 operations        0  0.752        3 medium      6        0.444     0
## 3 support           0  0.723        3 medium      6        0.447     0
## 4 logistics         0  0.675        4 high        8        0.440     0
## 5 sales             0  0.676        3 high        5        0.578     1
## 6 IT                0  0.683        2 medium      5        0.565     1
## # ... with 2 more variables: avg_hrs_month <dbl>, left <chr>

Exploratory Data Analysis

glimpse(df)
## Rows: 9,540
## Columns: 10
## $ department    <chr> "operations", "operations", "support", "logistics", "sal~
## $ promoted      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ review        <dbl> 0.5775687, 0.7518997, 0.7225484, 0.6751583, 0.6762032, 0~
## $ projects      <dbl> 3, 3, 3, 4, 3, 2, 4, 4, 4, 3, 4, 3, 3, 3, 3, 3, 4, 4, 3,~
## $ salary        <chr> "low", "medium", "medium", "high", "high", "medium", "hi~
## $ tenure        <dbl> 5, 6, 6, 8, 5, 5, 5, 7, 6, 6, 5, 5, 6, 5, 6, 6, 6, 5, 6,~
## $ satisfaction  <dbl> 0.6267590, 0.4436790, 0.4468232, 0.4401387, 0.5776074, 0~
## $ bonus         <dbl> 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,~
## $ avg_hrs_month <dbl> 180.8661, 182.7081, 184.4161, 188.7075, 179.8211, 178.84~
## $ left          <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "n~
skimr::skim(df)
Table 1: Data summary
Name df
Number of rows 9540
Number of columns 10
_______________________
Column type frequency:
character 3
numeric 7
________________________
Group variables None

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
department             0              1    2   11      0        10           0
salary                 0              1    3    6      0         3           0
left                   0              1    2    3      0         2           0

Variable type: numeric

skim_variable  n_missing  complete_rate    mean    sd      p0     p25     p50     p75    p100  hist
promoted               0              1    0.03  0.17    0.00    0.00    0.00    0.00    1.00  ▇▁▁▁▁
review                 0              1    0.65  0.09    0.31    0.59    0.65    0.71    1.00  ▁▃▇▃▁
projects               0              1    3.27  0.58    2.00    3.00    3.00    4.00    5.00  ▁▇▁▅▁
tenure                 0              1    6.56  1.42    2.00    5.00    7.00    8.00   12.00  ▁▇▇▂▁
satisfaction           0              1    0.50  0.16    0.00    0.39    0.50    0.62    1.00  ▁▅▇▅▁
bonus                  0              1    0.21  0.41    0.00    0.00    0.00    0.00    1.00  ▇▁▁▁▂
avg_hrs_month          0              1  184.66  4.14  171.37  181.47  184.63  187.73  200.86  ▁▆▇▂▁

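The summary table above shows no missing values, and the p0 and p100 columns show that every variable stays within a plausible range. The continuous variables (review, satisfaction and avg_hrs_month) have roughly symmetric, bell-shaped histograms, while promoted and bonus are binary indicators and projects and tenure take only a few integer values.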
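
The skim summary does not by itself test for outliers or normality. A minimal sketch of how one might check both for a single continuous column, using simulated values in place of avg_hrs_month (the helper `count_iqr_outliers` and the simulated data are assumptions for illustration, not part of the original analysis):

```r
# Hypothetical helper: count values outside the 1.5 * IQR fences (Tukey's rule)
count_iqr_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  fence <- 1.5 * (q[2] - q[1])
  sum(x < q[1] - fence | x > q[2] + fence)
}

set.seed(1)
# Simulated stand-in for a continuous column such as avg_hrs_month
hours <- rnorm(500, mean = 184.66, sd = 4.14)

n_outliers <- count_iqr_outliers(hours)

# shapiro.test() accepts between 3 and 5000 values, so a 9,540-row
# column would need to be subsampled before testing
sw <- shapiro.test(hours)

n_outliers
sw$p.value
```

These checks only make sense for the continuous columns; they are not meaningful for 0/1 indicators such as promoted and bonus.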

Employee Turnover Rate

Let’s calculate the employee turnover rate in each department. The turnover rate is the number of employees who left the company divided by the average number of employees, that is (employees at the beginning + employees at the end) / 2, multiplied by 100 to give a percentage. Here the head count at the beginning is every employee in the data, and the head count at the end is the number who stayed.
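
As a quick check of the formula, here is a minimal sketch with made-up head counts. The function name `turnover_rate` and the numbers are assumptions for illustration only:

```r
# Turnover rate (%) = leavers / average head count * 100,
# where average head count = (start + end) / 2
turnover_rate <- function(leavers, start_count, end_count) {
  avg_headcount <- (start_count + end_count) / 2
  leavers / avg_headcount * 100
}

# Made-up example: 100 employees at the start, 20 leave, 80 remain
example_rate <- turnover_rate(leavers = 20, start_count = 100, end_count = 80)
round(example_rate, 2)
# 22.22
```

The average is computed in its own parenthesised step because `a / (b + c) / 2` in R divides the ratio by 2 again instead of dividing by the average.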

# Cross-tabulate department against left, giving one row per department
# with a "no" and a "yes" column
status_count <- df %>%
  select(department, left) %>%
  table() %>%
  as.data.frame.matrix()

status_count <- status_count %>% 
  mutate(total = no + yes,
         # average head count = (start + end) / 2, with start = total and end = no
         turnover_rate = yes / ((total + no) / 2) * 100)

status_count %>% 
  as.data.frame() %>% 
  select(turnover_rate) %>% 
  arrange(desc(turnover_rate))
##             turnover_rate
## IT               36.54485
## logistics        36.45320
## retail           36.07813
## marketing        35.70904
## support          33.70429
## engineering      33.68015
## operations       33.43558
## sales            33.26107
## admin            32.73728
## finance          31.03448
mean(status_count$turnover_rate)
## [1] 34.26381
range(status_count$turnover_rate)
## [1] 31.03448 36.54485

The average employee turnover rate is 34.26%, with the IT department having the highest rate at 36.54% and finance the lowest at 31.03%.

Relationship between employer review, job satisfaction and average monthly hours

df %>% 
  select(review, satisfaction, avg_hrs_month) %>% 
  cor() %>% 
  corrplot::corrplot(method = "number")

Job satisfaction, review and average monthly hours are only weakly negatively correlated with one another. We therefore treat them as having little linear association, although a weak correlation does not rule out a non-linear relationship.
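
A correlation near zero only speaks to linear association. A small sketch with made-up values (not the employee data) shows a fully determined but non-linear relationship whose Pearson correlation is zero:

```r
# Made-up, symmetric x values and a purely quadratic response
x <- seq(-1, 1, length.out = 101)
y <- x^2

# Pearson correlation measures linear association only
pearson_r <- cor(x, y)

# The dependence shows up once the non-linear term is used
r_squared_term <- cor(x^2, y)

round(pearson_r, 3)
round(r_squared_term, 3)
```

For this reason it can be worth looking at scatter plots of the variable pairs as well as the correlation matrix.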

Do working hours affect employee departure?

Long working hours can push an employee to leave a company because they leave little time for personal life and family. They can also lead some employees to ask for a pay raise to reflect the time they spend at work. Let’s see how this factor relates to employee departure and the salary scale.

df %>% 
  ggplot(aes(x = left, y = avg_hrs_month, colour = left)) +
  geom_boxplot(outlier.colour = NA) + 
  geom_jitter(alpha = 0.05, width = 0.1) +
  facet_wrap(vars(salary), 
             scales = "free", 
             ncol = 3) +
  xlab("Employee Departure") +
  ylab("Average working hours in a month")

Employees who left the company averaged more than 185 hours per month, which is higher than the hours worked by those still with the organization. As expected, employees who left tended to spend more time at work. Employees on the medium salary scale have the highest average monthly hours, and most departures come from this group.
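
The box plots can be backed by a numeric summary. A minimal sketch, using a small made-up data frame with the same column names as df (the values are invented for illustration):

```r
# Toy data with the same column names as df; values are invented
toy <- data.frame(
  salary        = c("low", "low", "medium", "medium", "high", "high"),
  left          = c("no", "yes", "no", "yes", "no", "yes"),
  avg_hrs_month = c(182, 186, 183, 189, 181, 187)
)

# Median monthly hours for each salary band and departure status
hours_summary <- aggregate(avg_hrs_month ~ salary + left,
                           data = toy, FUN = median)
hours_summary
```

Passing df instead of toy as the data argument would give the medians behind each box in the plot.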

Are employees leaving as a result of bad reviews from their employer?

df %>% 
  ggplot(aes(x = left, y = review, colour = left)) +
  geom_boxplot(outlier.colour = NA) + 
  geom_jitter(alpha = 0.05, width = 0.1) +
  facet_wrap(vars(salary), 
             scales = "free", 
             ncol = 3) +
  xlab("Employee Departure") +
  ylab("Employer Review") 

Employees who left the organization had good reviews: their average review falls between 0.68 and 0.73. These were hardworking employees, which is why their departure is a serious concern for the organization. Organizational performance may well have declined since they left.

Are employees dissatisfied with their jobs?

df %>% 
  ggplot(aes(x = left, y = satisfaction, colour = left)) +
  geom_boxplot(outlier.colour = NA) + 
  geom_jitter(alpha = 0.05, width = 0.1) +
  facet_grid(cols = vars(promoted))+
  xlab("Employee Departure") +
  ylab("Job Satisfaction") 

Few people were promoted during the past 24 months, and average job satisfaction in the organization falls between 0.45 and 0.55. This is a low figure, and it suggests that employees are likely to leave. Promoted employees recorded a very low average job satisfaction, though there are few of them.
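
Because so few employees were promoted, it helps to look at counts and proportions alongside the plot. A minimal sketch on made-up data with the same column names as df (the values are invented for illustration):

```r
# Toy data with the same column names as df; values are invented
toy <- data.frame(
  promoted = c(0, 0, 0, 0, 0, 0, 1, 1),
  left     = c("no", "no", "no", "no", "yes", "yes", "no", "yes")
)

# Counts of stayers and leavers by promotion status
counts <- table(promoted = toy$promoted, left = toy$left)

# Row-wise proportions: share of each promotion group that left
leave_share <- prop.table(counts, margin = 1)

counts
round(leave_share, 2)
```

With a very small promoted group, the proportions in that row rest on few observations and should be read with caution.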

Model Building

After performing an exploratory data analysis and understanding our data, we will use an XGBoost model to predict employee departure in the given dataset.

XGBoost Model

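#Data Split
set.seed(2022)
df_split <- initial_split(
  df, 
  # prop is the share of rows placed in the training set:
  # 0.2 trains on 20% of the data and tests on the remaining 80%
  prop = 0.2, 
  # stratify so both sets keep a similar share of leavers
  strata = left
)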

#Data Preprocessing
xgboost_recipe <- 
  recipe(formula = left ~ ., data = training(df_split)) %>% 
  step_dummy(all_nominal_predictors()) %>% 
  # step_zv removes variables that contain only a single value
  step_zv(all_predictors())

#model specification
xgboost_spec <- 
  boost_tree(trees = 100) %>% 
  set_mode("classification") %>% 
  set_engine("xgboost") 

#model workflow
xgboost_workflow <- 
  workflow() %>% 
  add_recipe(xgboost_recipe) %>% 
  add_model(xgboost_spec)

#fit model
xgb_fit <- xgboost_workflow %>% 
  fit(training(df_split))
## [09:08:44] WARNING: amalgamation/../src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
#Predicted results on test data
pred_class <- predict(xgb_fit, 
                     testing(df_split),
                     type = "class")

pred_results <- testing(df_split) %>% 
  select(left) %>% 
  bind_cols(pred_class) %>% 
  # yardstick metrics need the truth column to be a factor
  mutate(left = as.factor(left))
  
#model accuracy
accuracy(pred_results,
          truth = left,
          estimate = .pred_class)
## # A tibble: 1 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.853
#variable importance
xgb_fit %>% 
  # extract_fit_parsnip() replaces the deprecated pull_workflow_fit()
  extract_fit_parsnip() %>%
  vip(geom = "col")

The model's accuracy on the test set is about 85.3%, which is good. Average hours per month, job satisfaction and review were the most important variables in the model.
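
Accuracy on its own can hide how well leavers are identified when most employees stay. A minimal sketch of comparing accuracy with a majority-class baseline and with sensitivity, using invented truth and prediction vectors in place of pred_results (all values below are assumptions for illustration):

```r
# Invented truth/prediction vectors standing in for pred_results
truth <- factor(c("no", "no", "no", "no", "no", "no", "no", "yes", "yes", "yes"),
                levels = c("no", "yes"))
pred  <- factor(c("no", "no", "no", "no", "no", "no", "yes", "yes", "no", "no"),
                levels = c("no", "yes"))

# Confusion matrix: rows are predictions, columns are true classes
cm <- table(predicted = pred, actual = truth)

accuracy_val <- sum(diag(cm)) / sum(cm)

# Baseline: accuracy from always predicting the most common class
baseline <- max(table(truth)) / length(truth)

# Sensitivity: share of actual leavers that the model catches
sensitivity_val <- cm["yes", "yes"] / sum(cm[, "yes"])

c(accuracy = accuracy_val, baseline = baseline, sensitivity = sensitivity_val)
```

In this toy case the accuracy is no better than always predicting "no", and only one leaver in three is caught. On the real predictions, the same comparison could be made with yardstick's conf_mat() and sens() applied to pred_results.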

Recommendations

Working hours appear to be the most important factor in employee departure. The organization should try to reduce the number of hours employees work, especially in departments with a high turnover rate such as IT. Most of the staff who are leaving have a good record with the organization. To encourage staff to stay, management needs to increase their pay and promote them when due as a reward for their hard work. If these conditions are met, employees' job satisfaction is likely to increase, which will help curb the high rate of employee turnover.