import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
df = pd.read_csv("../input/hr-analytics/HR_comma_sep.csv")
df.head()
df.info()
As you can see above, none of the columns contain NaN values, and the data types are float, int, and object.
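To double-check rather than just eyeballing the info() output, you can count the missing values per column directly; a minimal check:
df.isnull().sum()  # should show 0 for every column
df.dtypes          # confirms the float/int/object mix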
for col in df.columns:
    print(f"{col} - ", df[col].unique())
    print()
There seem to be no unexpected values in the data, which is very good. Let’s look at the summary statistics.
df.describe()
sns.set()  # sets the seaborn style for all plots
fig = plt.figure(figsize=(12,6))  # creates a 12x6-inch figure
sns.barplot(x='Department', y='satisfaction_level', hue='salary', data=df, ci=None)
plt.title("Satisfaction_level Vs Department", size=15)
plt.show()
Employees with higher salaries are generally more satisfied. But product_mng said “hold my beer”: there, the ones with low salaries are more satisfied than the ones with high salaries. A similar inversion shows up in one other department as well.
fig = plt.figure(figsize=(12,6))
g = sns.barplot(x='Department', y='last_evaluation', data=df, ci=None)
g.bar_label(g.containers[0])
plt.title("Last Evaluation Vs Department", size=15)
plt.show()
Evaluation scores are similar across all departments.
fig = plt.figure(figsize=(12,6))
sns.barplot(x='Department', y='number_project', data=df, ci=None)
plt.title("Number Project Vs Department", size=15)
plt.show()
fig = plt.figure(figsize=(12,6))
g = sns.barplot(x='Department', y='Work_accident', data=df, ci=None)
g.bar_label(g.containers[0])
plt.title("Work Accident Vs Department", size=15)
plt.show()
fig = plt.figure(figsize=(12,6))
g = sns.barplot(x='Department', y='satisfaction_level', hue='left', data=df, ci=None)
g.bar_label(g.containers[0])
g.bar_label(g.containers[1], rotation=90)
plt.title("Left Vs Department with satisfaction level", size=15)
plt.show()
fig = plt.figure(figsize=(12,6))
sns.lineplot(x='Department', y='average_montly_hours', data=df, ci=None, color='g', marker='o')
plt.title("Avg Hours spent in the company per Department", size=15)
plt.show()
fig = plt.figure(figsize=(12,6))
sns.lineplot(x='Department', y='promotion_last_5years', data=df, ci=None, color='black', marker='o')
plt.title("Promotions of Last 5 years in the company per Department", size=15)
plt.show()
The management and marketing departments have more promotions than the others.
fig = plt.figure(figsize=(12,6))
salary = df['salary'].value_counts()
sns.lineplot(x=salary.values, y=salary.index, ci=None, color='orange', marker='o')
plt.title("Salary (Counts) in the company", size=15)
plt.show()
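As an aside, counts of a categorical variable are usually easier to read as bars than as a line; a minimal sketch using countplot (assuming the three salary levels in this dataset are low/medium/high):
fig = plt.figure(figsize=(12,6))
sns.countplot(x='salary', data=df, order=['low', 'medium', 'high'])  # assumed level names
plt.title("Salary (Counts) in the company", size=15)
plt.show()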
sns.heatmap(df.corr(numeric_only=True), center=0, linewidth=1, annot=True, fmt='.2f')  # numeric_only=True avoids an error from the object columns on newer pandas
plt.show()
We need to split our data into a training set and a test set so that the model doesn’t simply memorize the data. Before doing that, let’s drop the categorical columns ‘Department’ and ‘salary’.
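If you would rather keep that information instead of discarding it, one common alternative is one-hot encoding; a minimal sketch with pd.get_dummies, where df_encoded is a hypothetical name:
df_encoded = pd.get_dummies(df, columns=['Department', 'salary'], drop_first=True)  # hypothetical alternative: encode the categoricals instead of dropping them
Here, though, we simply drop them: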
df = df.drop(['Department', 'salary'], axis=1)
Here we pass a list of the column names we want to drop. Specifying axis=1 tells pandas to remove columns rather than rows.
You can do it like this as well:
df.drop(['Department', 'salary'], axis=1, inplace=True)
With inplace=True you don’t need to reassign the result to df, since the DataFrame is modified in place.
Now that we got rid of them, let’s split our data. First, separate the target variable (the one we want to predict) from the rest of the data.
X = df.drop('left', axis=1)  # we will predict who left
y = df['left']
Then we pass X and y into the following function:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
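Note that the split is random, so numbers will differ between runs. For a reproducible split that also keeps the share of leavers equal in both sets, a sketch using the standard random_state and stratify parameters:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)  # fixed seed; preserves the 'left' ratio in both sets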
RandomForest can do both classification and regression; here we use the classifier. We are also going to use GridSearchCV to find the best hyperparameters for our model.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
param = [
    {'n_estimators': [100, 200, 300, 400, 450, 500],
     'max_depth': [3, 4, 6, 8, 10, 12],
     'max_leaf_nodes': [15, 20, 25]},
]
rf = RandomForestClassifier()
gs_rf = GridSearchCV(rf, param, cv=5, n_jobs=-1, verbose=1)  # 5-fold CV over every combination, using all cores
gs_rf.fit(X_train, y_train)
rf_best = gs_rf.best_estimator_  # the model refit on the full training set with the best parameters
pred = gs_rf.predict(X_test)  # predicts with the best estimator under the hood
gs_rf.best_estimator_
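GridSearchCV refits the best model on the whole training set, so gs_rf.predict above already used it. To see which parameters won and how well the model does on unseen data, a minimal sketch:
from sklearn.metrics import accuracy_score, classification_report
print(gs_rf.best_params_)                    # the winning hyperparameter combination
print(accuracy_score(y_test, pred))          # overall accuracy on the test set
print(classification_report(y_test, pred))   # per-class precision, recall and F1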