Salary Prediction Analysis

Salary Prediction Analysis

·

8 min read

Have you ever wondered about the various contributing factors that decide the salary offered to an employee? 💸

Ever since I started my journey in the corporate world, I have wondered about this as well.

Here is my analysis of salary prediction. Hope you find your answers too!

Dataset link - bit.ly/2R6ZXqB

The dataset consists of information on job postings in various fields and their corresponding salaries. It comprises various columns like Job ID, Company ID, Job Type, Degree, Major, Industry, Years of Experience, Miles From Metropolis, and Salary.

In this, I have tried to visualize information like different types of job roles in the dataframe and their corresponding counts, the requirement of specific qualifications by the companies, different majors specified by the companies and their counts, which different industries do the openings belong to, mean salaries based on job type, major, degree, years of experience, and so on.

Libraries :

  • Pandas - provides several methods for reading data in different formats.

  • Matplotlib and seaborn - used for plotting basic and advanced visualizations respectively.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Reading and displaying the data :

We have two datasets in this project:

  1. features_df - consists of all the features taken into consideration while calculating salary.
  2. salaries_df - consist of the salary values based on the features.
features_df = pd.read_csv("features.csv")
salaries_df = pd.read_csv("salaries.csv")
features_df.head()

45.PNG

salaries_df.head()

46.PNG

Rename the columns of dataframe :

Renaming the existing column names.

#Rename the columns of dataset.
features_df.rename(columns = {"jobId":"Job ID", "companyId":"Company ID", "jobType":"Job Type", "degree":"Degree", "major":"Major", "industry":"Industry", "yearsExperience":"Years of Experience", "milesFromMetropolis":"Miles From Metropolis"}, inplace = True)
salaries_df.rename(columns = {"jobId":"Job ID", "salary":"Salary"}, inplace = True)

image.png

Information about dataframe :

  • We can see that there are 1000000 entries in the dataframe.
  • We can find the information about the non-null values and data type of all the in a dataframe.
# To get the information about the dataframe.
features_df.info()

48.PNG

salaries_df.info()

49.PNG

Missing Values :

To check if there are any missing values in the dataframe.

  • We can see that there are no missing values in the dataset.
# To check for missing values
features_df.isnull().sum()

50.PNG

salaries_df.isnull().sum()

51.PNG

Data Type :

To check the data type of each column in the dataframe.

# To check the data type of the data frame.
features_df.dtypes

52.PNG

salaries_df.dtypes

53.PNG

Merging the two dataframes :

As we have two datasets, we will combine both into a single dataframe.

#To merge the two data frames
merged = pd.merge(features_df, salaries_df, on ="Job ID", how = "inner")
merged

54.PNG

Duplicate values :

To check if there are any duplicate values in the dataframe.

merged.duplicated().any()

image.png

Unique salary values :

  • We can observe that there is a "0" value in the salary column.
#To check the unique values in the Salary column.
merged["Salary"].unique()

56.PNG

To check the number of rows having "0" as value and displaying the dataframe :

# To check the number of "0" value in the salary column.
len(merged[merged["Salary"] == 0])

image.png

# To display the rows having "0" values.
merged[merged["Salary"] == 0]

58.PNG

Filtering the dataframe based on salary column :

# To filter out the rows having "0" value in Salary column.
merged = merged[merged.Salary != 0]
merged

59.PNG

Statistical Analysis :

# Statistical analysis for numerical data.
merged.describe()

60.PNG

To check several unique values in dataframe :

# To check several unique values in the dataframe.
merged.nunique()

61.PNG

Data Analysis :

Job Type Distribution :

  • To find the unique values in the Job Type column.

  • To find the corresponding counts of all the unique values.

# To check the Unique values in the "Job Type" column.
merged["Job Type"].unique()

image.png

merged["Job Type"].value_counts()

63.PNG

plt.figure(figsize=(10, 8))
sns.countplot(x="Job Type", data=merged)
plt.xticks(rotation = 90)
plt.yticks(np.arange(0, 130000, 10000))
plt.show()

64.PNG We can observe that the number of postings for all the job roles in the dataframe is almost equal.

Degree distribution :

  • To find the unique values in the Degree column.

  • To find the corresponding counts of all the unique values.

merged["Degree"].unique()

image.png

Degree = merged["Degree"].value_counts()
Degree

66.PNG

plt.figure(figsize=(10, 8))
labels = 'MASTERS', 'HIGH_SCHOOL', 'DOCTORAL', 'BACHELORS', 'NONE'
plt.pie(Degree.values, labels = labels, shadow = True, autopct='%1.1f%%', startangle=90)
plt.show()

67.PNG 23.7% of the job postings require either a master's or a high school degree. Whereas, 17.5% of the job postings do not require any degree.

Major Distribution :

merged["Major"].unique()
Major = merged["Major"].value_counts()
Major

image.png

plt.bar(Major.index, Major.values, color ='maroon',width = 0.8)
plt.xticks(rotation = 90)
plt.show()

69.PNG Most of the jobs posted do not require any specific majors.

Industry distribution :

merged["Industry"].unique()
merged["Industry"].value_counts()

image.png

plt.figure(figsize=(10, 8))
sns.countplot(x="Industry", data=merged)
plt.xticks(rotation = 90)
plt.yticks(np.arange(0, 150000 , 10000))
plt.show()

71.PNG There is an equal number of job postings for all the Industries.

Years Of Experience :

Experience = merged["Years of Experience"].value_counts()
Experience
plt.figure(figsize=(10, 8))
plt.bar(Experience.index, Experience.values, color ='blue',width = 0.8)
plt.xticks(rotation = 90)
plt.show()

72.PNG

Salary :

  • To check if the salary distribution is symmetric or skewed.
# To check if the Salary distribution is symmetric or skewed.
print('Salary Skewness:', merged['Salary'].skew())
if -0.5 <= merged['Salary'].skew() <= 0.5:
    print("Salary distribution is symmetric")
else:
    print("Salary distribution is skewed")

73.PNG

# Distribution plot
plt.figure(figsize=(10, 8))
sns.distplot(merged['Salary'], bins=40, kde=True, norm_hist=True)
plt.axvline(merged['Salary'].mean(), color='black')
plt.axvline(merged['Salary'].median(), color='darkblue', linestyle='--')
plt.show()

74.PNG

print("The black line in distribution plot is the mean salary", merged['Salary'].mean())
print("The dark blue line in distribution plot is the median salary", merged['Salary'].median())

image.png

#Box plot
plt.figure(figsize=(18, 6))
sns.boxplot(merged['Salary'], color='gold')
plt.show()

76.PNG

stat = merged.Salary.describe()
stat

77.PNG

IQR = stat['75%'] - stat['25%']
upper = stat['75%'] + 1.5 * IQR
lower = stat['25%'] - 1.5 * IQR
print('The Upper and Lower Bounds for suspected outliers in the Boxplot are {} and {}.'.format(upper, lower))

78.PNG

Displaying correlation between features and target variable:

columns = ["Job Type", "Degree", "Major", "Industry"]
for col in columns:
    plt.figure(figsize = (14, 6))
    sns.boxplot(x=col, y='Salary', data = merged.sort_values('Salary'))
    plt.xticks(rotation = 45)
    plt.ylabel('Salaries')
    plt.title('Relationship of Salary with {}'.format(col))
    plt.show()

79.PNG We can observe that the salary increases as the post of the employee increases that is from janitor to the CEO of the company.

We can similarly observe the correlation of other features with the salary.

Correlation between Experience and Salary using Regression Plot :

plt.figure(figsize=(8, 6))
sns.regplot(x="Years of Experience", y="Salary", data = merged)
plt.show()

83.PNG We can conclude that as the number of years of experience increases, the salary also increases.

Correlation between Salary and Miles from Metro using Regression Plot :

plt.figure(figsize=(8, 6))
sns.regplot(x="Miles From Metropolis", y="Salary", data = merged)
plt.show()

84.PNG We can definitely see a negative correlation between Miles from Metropolis and Salary. This is somewhat understandable as most of the jobs are normally offered closer to the city center.

Mean Salary in Job Type, Degree, Major, Industry, Years of Experience :

  • We need to group the values based on a particular column, then calculate the mean of the Salary column and arrange it in ascending order.

Job Type -

mean_salary_job_type = merged.groupby("Job Type")["Salary"].mean().sort_values()
mean_salary_job_type

85.PNG

mean_salary_job_type.plot()
plt.xticks(rotation =45)
plt.ylabel("Mean Salary")
plt.show()

86.PNG We can see that mean salary increases from janitor to CEO. The mean salary for CFO and CTO is approximately the same.

Degree -

mean_salary_Degree = merged.groupby("Degree")["Salary"].mean().sort_values()
mean_salary_Degree

87.PNG

mean_salary_Degree.plot()
plt.xticks(rotation =45)
plt.ylabel("Mean Salary")
plt.show()

88.PNG The employees having a doctoral degree are paid the highest whereas the mean salary of employees with no relevant degree is very low.

Major -

mean_salary_Major = merged.groupby("Major")["Salary"].mean().sort_values()
mean_salary_Major

89.PNG

mean_salary_Major.plot()
plt.xticks(rotation =45)
plt.ylabel("Mean Salary")
plt.show()

90.PNG The employee with an engineering background is paid the highest.

Industry -

mean_salary_Industry = merged.groupby("Industry")["Salary"].mean().sort_values()
mean_salary_Industry

91.PNG

mean_salary_Industry.plot()
plt.xticks(rotation =45)
plt.ylabel("Mean Salary")
plt.show()

92.PNG The education sector has the lowest mean salary, whereas the oil industry pays the highest mean salary.

Experience -

mean_salary_Experience = merged.groupby("Years of Experience")["Salary"].mean().sort_values()
mean_salary_Experience

93.PNG

mean_salary_Experience.plot()
plt.xticks(rotation =45)
plt.ylabel("Mean Salary")
plt.show()

94.PNG As the experience increases, the mean salary also increases.

Miles From Metropolis -

mean_salary_miles = merged.groupby("Miles From Metropolis")["Salary"].mean().sort_values()
mean_salary_miles

95.PNG

mean_salary_miles.plot()
plt.xticks(rotation =45)
plt.ylabel("Mean Salary")
plt.show()

96.PNG Nearer to the city center, more is the salary. The mean salary decreases as you go away from the city center.

Multiple plots:

Plotting multiple features in a single plot Vs salary.

mean_salary_job_type1 = merged.groupby("Job Type")["Salary"].mean().sort_values()[1:6]
mean_salary_Degree1 = merged.groupby("Degree")["Salary"].mean().sort_values()
mean_salary_Major1 = merged.groupby("Major")["Salary"].mean().sort_values()[4:]
mean_salary_Industry1 = merged.groupby("Industry")["Salary"].mean().sort_values()[2:]
mean_salary_Experience1 = merged.groupby("Years of Experience")["Salary"].mean().sort_values()[:5]
fig,axis = plt.subplots(nrows=1,ncols=1,figsize=(13,6))
mean_salary_job_type1.plot()
mean_salary_Degree1.plot()
mean_salary_Major1.plot()
mean_salary_Industry1.plot()
mean_salary_Experience1.plot()
axis.set_xticklabels([" ", "a", " ", "b", " ", "c", " ", "d", " ", "e", " " ])
plt.xlabel("Features")
plt.ylabel("Mean Salary")
plt.show()

image.png Here, the values "a", "b", "c", "d", "e" on the x-axis represents -

  • "a" - represents Junior(Job Type), None(Degree), Physics(Major), Auto(Industry) and 0(Experience).

  • "b" - represents Senior(Job Type), High School(Degree), CompSci(Major), Health(Industry) and 5(Experience).

  • "c" - represents Manager(Job Type), Bachelors(Degree), Math(Major), Web(Industry) and 10(Experience).

  • "d" - represents Vice President(Job Type), Masters(Degree), Business(Major), Finance(Industry) and 15(Experience).

  • "e" - represents CFO(Job Type), Doctoral(Degree), Engineering(Major), Oil(Industry) and 20(Experience).

Correlation plot :

  • Make a copy of the dataframe.

  • Convert the "object" datatype into "category".

  • Drop "Job ID" and "Company ID" columns.

  • Encode the categorical data type.

corr_plot = merged.copy()
cols = ["Job Type", "Degree", "Major", "Industry"]
for i in cols:
    corr_plot[i] = corr_plot[i].astype("category")
corr_plot.info()

97.PNG

corr_plot = corr_plot.drop("Job ID", axis=1)
corr_plot = corr_plot.drop("Company ID", axis=1)
corr_plot

98.PNG

#One hot encode categorical data in the dataset

corr_plot = pd.get_dummies(corr_plot)
corr_plot.head()

99.PNG

plt.figure(figsize = (10, 8))
sns.heatmap(corr_plot.corr(), cmap = 'BuGn', linewidth = 0.005)

100.PNG This plot shows the correlation of every column with other columns present in the dataframe.

Yay! 🥳 You made it till the end. Hope you found it interesting.

Â