After reading the article by MichaelChiaYin (medium.com/jovianml/whatsapp-message-explor..), I was fascinated by the idea of analysing WhatsApp chats and decided to try it out on my own dataset.
In this project, I explore how my siblings and I use our group chat: which user sends the most messages, which hour of the day is the most active, which month has the highest number of messages, and which words and emojis are used the most.
The Dataset in use:
In this analysis, we use a personal dataset exported from a WhatsApp group chat. Anyone can export a dataset of their own from a WhatsApp group.
This dataset comes from a real-life chat and covers the period from 03/08/2020 to 13/03/2021.
Import Basic Library:
We start by importing the libraries used in the analysis.
import plotly.express as px
import os
import pandas as pd
import re
import datetime as time
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
import emoji
from collections import Counter
from wordcloud import WordCloud, STOPWORDS
Read the dataset:
We can read the exported chat file using the pandas library.
whatsapp_df = pd.read_csv("chat_1.txt", header = None)
whatsapp_df
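Because pd.read_csv splits every line at commas, a message that itself contains commas ends up spread over several columns. As an alternative, the raw lines can be parsed with a regular expression. The sketch below is illustrative only: it assumes an export layout of "dd/mm/yyyy, hh:mm - Sender: message", and the pattern, the helper name parse_chat_lines and the sample lines are made up for this example.

```python
import re

# Assumed layout: "03/08/2020, 20:15 - Sender: message" (hypothetical pattern)
STAMP = re.compile(r"^(\d{1,2}/\d{1,2}/\d{2,4}),? (\d{1,2}:\d{2}) - (.*)$")

def parse_chat_lines(lines):
    """Return a list of dicts with datetime, user and message keys."""
    records = []
    last_was_message = False
    for raw in lines:
        line = raw.rstrip("\n")
        stamped = STAMP.match(line)
        if stamped:
            date, clock, rest = stamped.groups()
            user, sep, message = rest.partition(": ")
            if sep:
                records.append(
                    {"datetime": f"{date} {clock}", "user": user, "message": message}
                )
                last_was_message = True
            else:
                # Timestamped line without "Sender: " is treated as a system event
                last_was_message = False
        elif records and last_was_message:
            # A line without a timestamp continues the previous message
            records[-1]["message"] += "\n" + line
    return records

# Made-up sample lines, not taken from the real chat
sample_lines = [
    '03/08/2020, 20:14 - Shakshii created group "Siblings"\n',
    "03/08/2020, 20:15 - Shakshii: Hello, everyone\n",
    "how are you?\n",
    "03/08/2020, 20:16 - User B: <Media omitted>\n",
]
records = parse_chat_lines(sample_lines)
for record in records:
    print(record)
```

The resulting list of dicts can be passed straight to pd.DataFrame(records) to get the datetime, user and message columns in one step.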
Data Preparation and Cleaning:
Data Understanding –
● The dataset has 4 columns.
● The first column contains the date on which the message was sent to the group. The second column contains the time, the name of the sender and the text of the message.
● The third and fourth columns consist of NaN values, which are insignificant from the analysis point of view.
● Using info(), we can check whether any column in the dataset contains null values.
whatsapp_df.info()
Preparation and Cleaning -
● We create a new column “Mixed” that joins columns 0 and 1 into a single string, so that the date sits together with the time, sender and message. This makes the data easier to split up in the next step.
whatsapp_df["Mixed"] = (whatsapp_df[0] + whatsapp_df[1]).astype(str)
whatsapp_df
● The column names also need to change. Instead of 0, 1, 2 we will use datetime, user and message.
● The function below converts the combined text into a data frame with these columns.
def txtTodf(txt_file):
    # Note: txt_file is not used; the rows come from the "Mixed" column built above
    user = []
    message = []
    datetime = []
    for row in whatsapp_df["Mixed"]:
        # timestamp is before the first dash
        datetime.append(row.split(' - ')[0])
        # sender is between dash and colon
        try:
            s = re.search(' - (.*?):', row).group(1)
            user.append(s)
        except AttributeError:
            # no "sender:" part was found in this row
            user.append('')
        # message content is after the first colon
        try:
            message.append(row.split(': ', 1)[1])
        except IndexError:
            message.append('')
    df = pd.DataFrame(zip(datetime, user, message), columns=['datetime', 'user', 'message'])
    # remove events not associated with a sender
    df = df[df.user != ''].reset_index(drop=True)
    return df
whatsapp_df = txtTodf('chat_1.txt')
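● Convert the datetime column from “string” to the “datetime” datatype. The dates in this export are written day first (03/08/2020 is 3 August 2020), so we pass dayfirst=True.
whatsapp_df['datetime'] = pd.to_datetime(whatsapp_df.datetime, dayfirst=True)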
whatsapp_df
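The order of day and month matters for the month analysis later on, because a string such as 03/08/2020 can be read either as 3 August or as 8 March. A minimal sketch of parsing with an explicit format is shown below. It assumes the timestamps look like "dd/mm/yyyy hh:mm" in 24-hour time; the two sample values are made up, and an export that uses AM/PM would need a different format string.

```python
import pandas as pd

# Made-up timestamps in the assumed day/month/year layout
stamps = pd.Series(["03/08/2020 20:15", "13/03/2021 09:05"])

# An explicit format removes any ambiguity between day and month
parsed = pd.to_datetime(stamps, format="%d/%m/%Y %H:%M")

print(parsed.dt.day.tolist())
print(parsed.dt.month.tolist())
```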
Cleaning the Image Data:
As we only need the text data for our analysis, we drop all media messages (such as images) present in the chat. In the export they appear as the placeholder “<Media omitted>”.
# Select the media placeholder rows to see how many there are
img = whatsapp_df[whatsapp_df["message"] == "<Media omitted>"]
# Drop those rows using drop()
whatsapp_df.drop(img.index, inplace=True)
whatsapp_df.reset_index(inplace=True, drop=True)
whatsapp_df.shape
Exploratory Data Analysis:
1. Which user has the most chats/messages in the group?
In this part, we determine which user sends the most messages in the group, and how many messages each user has sent.
The code below computes the total number of messages sent in the group and lists the unique users in the chat.
totalNumberofMessages = whatsapp_df["message"].count()
username = whatsapp_df["user"].unique()
print("Total number of messages:", totalNumberofMessages)
print("Users involved in the chat:", username)
Using Pandas:
I made a copy of the original data frame and dropped the datetime column to avoid confusion. I then grouped the data by user and counted the number of messages each user sent. The results show that “Shakshii” is the most active user in the chat, with a total of 39 messages.
# Create a new dataframe by copying the old one
whatsapp_df1 = whatsapp_df.copy()
whatsapp_df1['Number_of_messages'] = [1]* whatsapp_df1.shape[0]
whatsapp_df1.drop(columns = 'datetime', inplace = True)
# Group by user, then use count() to count the messages of each user
whatsapp_df1 = whatsapp_df1.groupby('user')['Number_of_messages'].count().sort_values(ascending = False).reset_index()
whatsapp_df1
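The same per-user count can also be obtained without the helper column of ones. The sketch below uses value_counts() on a small made-up frame; the user names "A", "B" and "C" and the variable names are placeholders, not data from the real chat.

```python
import pandas as pd

# Made-up stand-in for the cleaned chat data
demo = pd.DataFrame(
    {
        "user": ["A", "B", "A", "C", "A", "B"],
        "message": ["hi", "hello", "ok", "yes", "bye", "no"],
    }
)

# value_counts() counts rows per user and sorts in descending order
counts = (
    demo["user"]
    .value_counts()
    .rename_axis("user")
    .reset_index(name="Number_of_messages")
)
print(counts)
```

The result has the same two columns, user and Number_of_messages, as the groupby version, so the plotting code that follows would work with either.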
Data Visualization:
We use a line plot and a bar chart for the visualization. The results show that “Shakshii” sent the greatest number of messages, around 40, which makes “Shakshii” a very active member of the group.
Line plot -
# Use seaborn for styling
sns.set_style("darkgrid")
# Resize the figure
plt.figure(figsize=(12, 9))
# Plot the line chart using plt.plot
plt.plot(whatsapp_df1.user, whatsapp_df1.Number_of_messages, 'o--c')
# Add the labels and title to the chart
plt.xlabel('Users')
plt.ylabel('Total number of messages')
plt.title("Number of messages sent by each user")
plt.legend(['Messages sent']);
Bar plot -
sns.set_style("darkgrid")
# Create the bar chart (x and y are passed as keywords, which recent seaborn versions require)
sns.barplot(x="user", y="Number_of_messages", hue="user", data=whatsapp_df1, palette="CMRmap")
# Title of the chart
plt.title("The highest number of messages")
plt.show()
2. The Most Active Hour in WhatsApp:
This analysis shows at what hour the members are most active in the group. It depends on two variables: the number of messages and the hour of the day at which they were sent.
Using Pandas:
Here we extract the hour value from the datetime column and count the messages per hour. The most active hour in the data frame is 20:00.
# Copy the dataframe
whatsapp_df3 = whatsapp_df.copy()
whatsapp_df3['number_of_message'] = [1] * whatsapp_df3.shape[0]
whatsapp_df3['hours'] = whatsapp_df3['datetime'].apply(lambda x: x.hour)
time_df = whatsapp_df3.groupby('hours').count().reset_index().sort_values(by = 'hours')
time_df
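Hours in which nobody wrote anything do not appear in a grouped result at all, which can make a bar chart look denser than the day really was. One possible way to keep all 24 hours is sketched below on made-up timestamps; the variable names are illustrative.

```python
import pandas as pd

# Made-up timestamps standing in for the datetime column
demo = pd.DataFrame(
    {
        "datetime": pd.to_datetime(
            ["2020-08-03 20:15", "2020-08-03 20:40", "2020-08-04 09:05"]
        )
    }
)

# Count messages per hour, then fill the hours without messages with 0
hour_counts = (
    demo["datetime"].dt.hour.value_counts().reindex(range(24), fill_value=0)
)
busiest_hour = hour_counts.idxmax()
print(hour_counts.loc[20], busiest_hour)
```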
Data Visualization:
This analysis shows that the most active hour on WhatsApp is 20:00 (8:00 PM). At that time we are usually having dinner, and that is when we normally chat.
# Set the formatting of the graph
matplotlib.rcParams['font.size'] = 20
matplotlib.rcParams['figure.figsize'] = (20, 8)
# Use the seaborn style
sns.set_style("darkgrid")
plt.title('Most active hour on WhatsApp');
sns.barplot(x="hours", y="number_of_message", data=time_df, dodge=False)
3. Which month has the most messages and is therefore the busiest?
The chat covers the period 03/08/2020 to 13/03/2021. Here we want to find the month in which we were busiest (most chatty) by looking at the number of messages sent in each month.
Using Pandas:
Here we extract the month value from the datetime column, group the data by month and sort it in descending order of the number of messages.
The output lists each month with the number of messages sent in it. The most messages were sent in “February”.
whatsapp_df4 = whatsapp_df.copy()
whatsapp_df4['Number_of_messages'] = [1] * whatsapp_df4.shape[0]
whatsapp_df4['month'] = whatsapp_df4['datetime'].apply(lambda x: x.month)
df_month = whatsapp_df4.groupby('month')['Number_of_messages'].count().sort_values(ascending = False).reset_index()
df_month.head()
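Since the chat runs from one year into the next, plain month numbers sort January to March 2021 ahead of August 2020. A possible variation, sketched here on made-up dates, groups by year and month together so the months stay in chronological order; the variable names are illustrative.

```python
import pandas as pd

# Made-up dates spanning the turn of the year
demo = pd.DataFrame(
    {
        "datetime": pd.to_datetime(
            ["2020-08-03", "2021-02-10", "2021-02-11", "2021-03-13"]
        )
    }
)

# to_period("M") keeps the year with the month, e.g. 2021-02
per_month = demo["datetime"].dt.to_period("M").value_counts().sort_index()
busiest_month = per_month.idxmax()
print(per_month)
print(busiest_month)
```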
Data Visualization:
This analysis shows that the busiest month was “February”, with a total of around 32 messages.
# Formatting
sns.set_style("darkgrid")
# Font size, figure size and background of the chart
matplotlib.rcParams['font.size'] = 12
matplotlib.rcParams['figure.figsize'] = (12, 9)
matplotlib.rcParams['figure.facecolor'] = '#00000000'
fig, ax = plt.subplots()
# Create the bar chart
sns.barplot(x=df_month.month,y=df_month.Number_of_messages ,hue='month',data=df_month,dodge=False,palette="pastel")
plt.title("Month with the highest number of messages")
4. Which words did the users use the most?
Here we use a word cloud as a visual representation of the words in the chat, to determine which words are most widely used.
We join all the text data in the dataset and then set the stop words, so that very common words like a, an, the etc. are left out.
After that, we generate the word cloud to see the words used in the chats. The size of a word relates directly to how often it occurs in the chat: the larger the font size of a word in the word cloud, the higher its frequency.
whatsapp_df5 = whatsapp_df.copy()
# Join all the messages into one string
word = " ".join(review for review in whatsapp_df5.message)
stopwords = set(STOPWORDS)
# Add further common words to exclude (e.g. the, is, ok, bye, no)
stopwords.update(["the","is","yea","ok","okay","or","bye","no","will","yeah","I","almost","if","me","you","done","want","Ya"])
# Create the word cloud
wordcloud = WordCloud(width = 500, height =500 ,stopwords=stopwords, background_color="black",min_font_size = 10).generate(word)
plt.figure( figsize=(10,5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
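A word cloud shows relative frequency but not the actual numbers. If exact counts are wanted as well, a plain frequency table can be built with collections.Counter. The sketch below is only an illustration: the helper top_words, the simple tokenising pattern and the sample messages are all made up, and the tokenising will not match the word cloud's own rules exactly.

```python
import re
from collections import Counter

def top_words(messages, stopwords, n=10):
    """Return the n most common words that are not stop words."""
    words = []
    for message in messages:
        # Lower-case the text and keep runs of letters and apostrophes
        words.extend(re.findall(r"[a-z']+", message.lower()))
    counts = Counter(w for w in words if w not in stopwords)
    return counts.most_common(n)

# Made-up messages and stop words
sample_messages = ["Dinner is ready", "ok coming for dinner", "What is for dinner tomorrow"]
sample_stopwords = {"is", "ok", "for", "what"}

top = top_words(sample_messages, sample_stopwords, n=3)
print(top)
```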
5. Which emoji is used the most by which user?
Now we want to know which emojis are most widely used, so that we can tell which emoji each user is most likely to use in the chat.
This was the most difficult part of the whole project. I referred to the code by the author of the original article, but I was not able to run it successfully on my dataset. After a lot of research and trying out multiple solutions, I finally found a way to extract emojis from the chat text.
I first created a list of all the emojis, then converted it to a dictionary to get a count of how many times each emoji was used in the chat.
Then I converted the dictionary into a data frame, so that I could analyse it easily.
Using Pandas:
As a result, we can see that the most used emoji in the WhatsApp chat is “Face with Tears of Joy”.
# Copy the dataset
whatsapp_df2 = whatsapp_df.copy()
text1 = " ".join(whatsapp_df2["message"])
# Turn every emoji into its text alias, e.g. :face_with_tears_of_joy:
text_de= emoji.demojize(text1)
# Collect the aliases, then convert them back into emoji characters
emojis_list_de= re.findall(r'(:[!_\-\w]+:)', text_de)
list_emoji= [emoji.emojize(x) for x in emojis_list_de]
# Count how many times each emoji occurs
dict1 = {}
for item in list_emoji:
    if item not in dict1:
        dict1[item] = 1
    else:
        dict1[item] += 1
emoji_df = pd.DataFrame(dict1.items(), columns=['emoji1', 'number_of_Emojis'])
emoji_df
Data Visualization:
Before looking at each user individually, we look at the overall emoji usage across the three users. As the results show, the most widely used emoji among the three users is Face with Tears of Joy, which accounts for around 36.4% of all emojis used. So most of the time, users in this group chat reach for the Face with Tears of Joy emoji.
import plotly.express as px
fig=px.pie(emoji_df,values='number_of_Emojis',names='emoji1')
fig.update_traces(textposition='inside',textinfo='percent+label')
fig.show()
The code above covers all the emojis used in the chat, irrespective of the user who sent them.
To find the emojis used by a particular user, I filter the data frame by user and then perform the same analysis again.
For a particular user – “Shakshii”:
# Keep only the messages sent by this user
filter_name = whatsapp_df2[whatsapp_df2["user"] == "Shakshii"]
text2 = " ".join(filter_name["message"])
text_de2= emoji.demojize(text2)
emojis_list_de2= re.findall(r'(:[!_\-\w]+:)', text_de2)
list_emoji2= [emoji.emojize(x) for x in emojis_list_de2]
# Count how many times each emoji occurs
dict2 = {}
for item1 in list_emoji2:
    if item1 not in dict2:
        dict2[item1] = 1
    else:
        dict2[item1] += 1
emoji_df1 = pd.DataFrame(dict2.items(), columns=['emoji2', 'number_of_Emojis2'])
fig=px.pie(emoji_df1,values='number_of_Emojis2',names='emoji2')
fig.update_traces(textposition='inside',textinfo='percent+label')
fig.show()
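Repeating the filter-and-count steps for every user gets tedious, so the counts for all users can also be collected in a single pass. The sketch below is not the method used above: to stay self-contained it swaps the emoji library for a rough standard-library heuristic that treats every character in the Unicode category "So" (Symbol, other) as an emoji, which also catches symbols such as © and misses modifiers like skin tones. The helper names and sample rows are made up.

```python
import unicodedata
from collections import Counter, defaultdict

def rough_emojis(text):
    """Rough heuristic: keep characters in Unicode category 'So'."""
    return [ch for ch in text if unicodedata.category(ch) == "So"]

def emoji_counts_by_user(rows):
    """Map each user to a Counter of the emojis found in their messages."""
    counts = defaultdict(Counter)
    for user, message in rows:
        counts[user].update(rough_emojis(message))
    return counts

# Made-up (user, message) pairs, not taken from the real chat
rows = [
    ("Shakshii", "so funny 😂😂"),
    ("Shakshii", "ok 👍"),
    ("User B", "😂"),
]
per_user = emoji_counts_by_user(rows)
for user, counter in per_user.items():
    print(user, counter.most_common(1))
```

With the real data, the (user, message) pairs could come from zip(whatsapp_df2["user"], whatsapp_df2["message"]), and rough_emojis could be replaced by the demojize and emojize steps shown earlier.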
Hope you found it helpful. Thank you! 😊