5 Year WhatsApp Group Chat Analysis

Val
5 min read · Mar 13, 2021

I’ve been in an active WhatsApp group chat with a bunch of friends for nearly 5 years, and I was curious to find out about the participants’ habits.

WhatsApp chat data is available to download with or without media. I downloaded an 18MB file of over 250,000 messages sent over the span of 1,737 days. The export is a text file in which each message is timestamped and labelled with its sender.

Text file from WhatsApp

Cleaning and Pre-processing the data

Packages used:

import pandas as pd
import re
import numpy as np
from datetime import date
import matplotlib.pyplot as plt
import seaborn as sns

Because all of the information we need is packed into a single line per message, we use regular expressions to build a new DataFrame with three columns: the sender, the timestamp and the message sent. We can use the following custom function. We use np.nan to leave null the records that were WhatsApp system updates rather than group chat messages.

def clean(x):
    who = []
    when = []
    message = []
    for i in x.iloc[:,0]:
        #message sender
        try:
            r = re.findall(r"\]\s(.+?):", i)
            who.append(r[0])
        except:
            who.append(np.nan)
        #date and time
        try:
            r = re.findall(r"\[([0-9]{2}\/[0-9]{2}\/[0-9]{4}, [0-9]{2}:[0-9]{2}:[0-9]{2})\]", i)
            when.append(r[0].replace(',',''))
        except:
            when.append(np.nan)
        #message content
        try:
            r = re.findall(r"\[[0-9]{2}\/[0-9]{2}\/[0-9]{4}, [0-9]{2}:[0-9]{2}:[0-9]{2}\]\s.+?: (.+)", i)
            message.append(r[0])
        except:
            message.append(np.nan)
    dic = {'who':who, 'when':when, 'message':message}
    df = pd.DataFrame(dic).dropna().reset_index(drop=True)
    df['when'] = pd.to_datetime(df['when'], format='%d/%m/%Y %H:%M:%S')
    return df
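
The function expects a one-column DataFrame of raw chat lines. A minimal sketch of reading the export and running the cleaning step, using the pandas import above, might look like this (the filename ‘_chat.txt’ is an assumption, yours may differ):

# Read the exported chat into a one-column DataFrame, one raw line per row.
# '_chat.txt' is an assumed filename for the WhatsApp export.
with open('_chat.txt', encoding='utf-8') as f:
    raw = pd.DataFrame([line.rstrip('\n') for line in f])

df = clean(raw)

Any line that doesn’t match the timestamped sender pattern, such as the continuation lines of multi-line messages or WhatsApp’s own system notices, ends up as np.nan in all three columns and is dropped by the dropna() call.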

To protect my friends’ anonymity I’ll use only their initials, generated with the following function and concatenated onto the original DataFrame.

def inits(x):
    first = []
    second = []
    for i in x.iloc[:,0]:
        #first initial
        try:
            r = re.findall(r"([A-Z])[a-z]* [A-Z][a-z]*", i)
            first.append(r[0])
        except:
            first.append(i)
        #second initial
        try:
            r = re.findall(r"[A-Z][a-z]* ([A-Z][a-z])[a-z]*", i)
            second.append(r[0])
        except:
            second.append(i)
    dic = {'first':first, 'second':second}
    df = pd.DataFrame(dic).reset_index(drop=True)
    df['initials']=df['first']+'.'+df['second']+'.'
    df = df.drop(columns=['first','second'])
    return df

df = pd.concat([df, inits(df)], axis=1)

Looking at the unique initials, we can see a couple of phone numbers with no saved name. Context tells me these are phone numbers belonging to my friend ‘K.Kh.’

df['initials'].unique()
Output:
array(['\u202a+44\xa07522\xa0630893\u202c.\u202a+44\xa07522\xa0630893\u202c.', 'A.Ma.', 'H.Kh.', 'D.Mi.', 'T.Be.', 'V.Aj.', 'C.Ma.', 'C.Sa.','S.Ga.', 'R.Av.','\u202a+44\xa07713\xa01?????\u202c.\u202a+44\xa0??13\xa0122729\u202c.','A.Mh.','\u202a+44\xa07448\xa01???041\u202c.\u202a+44\xa07??8\xa0196041\u202c.','\u202a+374\xa098\xa032??44\u202c.\u202a+374\xa098\xa03262?4\u202c.', '\u202a+374\xa098\xa032??88\u202c.\u202a+374\xa098\xa03262??\u202c.','K.Kh.'], dtype=object)

We can clean this up easily by mapping the known initials to themselves, which turns everything else into a null, and then setting those nulls to ‘K.Kh.’

mapping = {'D.Mi.':'D.Mi.', 'A.Ma.':'A.Ma.', 'C.Ma.':'C.Ma.', 'V.Aj.':'V.Aj.', 'A.Mh.':'A.Mh.', 'S.Ga.':'S.Ga.', 'R.Av.':'R.Av.', 'C.Sa.':'C.Sa.', 'H.Kh.':'H.Kh.', 'K.Kh.':'K.Kh.'}
df['initials']=df['initials'].map(mapping)
df = df.fillna('K.Kh.')
df['initials'].unique()
Output:
array(['K.Kh.', 'A.Ma.', 'H.Kh.', 'D.Mi.', 'V.Aj.', 'C.Ma.', 'C.Sa.','S.Ga.', 'R.Av.', 'A.Mh.'], dtype=object)

Very nice. Now all 10 participants are identified correctly.

The origins of the chat

Visualising the chat data

‘D.Mi.’ has been the most active in the chat with over 50,000 messages sent, a little over 20% of all chat messages. ‘H.Kh.’ is sitting at just 0.2%.

df.initials.value_counts(normalize=True)
Output:
D.Mi.    0.213249
A.Ma.    0.154663
C.Ma.    0.142228
V.Aj.    0.117726
K.Kh.    0.108335
A.Mh.    0.105051
S.Ga.    0.101634
R.Av.    0.040597
C.Sa.    0.014187
H.Kh.    0.002330
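
The charts themselves aren’t reproduced here, but a per-person bar chart of these shares can be drawn with the seaborn and matplotlib packages imported earlier. This is a sketch of one way to do it, not the exact plotting code used for the article:

# Sketch: bar chart of each participant's share of all messages.
counts = df['initials'].value_counts(normalize=True)
plt.figure(figsize=(10, 5))
sns.barplot(x=counts.index, y=counts.values)
plt.xlabel('Participant')
plt.ylabel('Share of messages')
plt.title('Share of messages sent per participant')
plt.tight_layout()
plt.show()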

2017 was quite the year for the chat: over 90,000 messages were sent that year. The chat, however, is on the decline, perhaps because everyone has graduated from university by now and is getting on with their lives. Or maybe our lives just aren’t as exciting as they used to be.

That said, in 2021 we have already sent almost 40% of the messages we sent in the whole of 2020, so could this be a rebound year?

As you might expect, the chat is used less in the summer months, most likely because we end up seeing each other in person more.

The first half of 2017 is mainly responsible for that year’s dominance on the leaderboard. The group chat was created midway through 2016, so the first half of 2017 is really just a continuation of that initial momentum. In fact, the first actual whole year of using the chat accounts for 44% of all chat messages, almost half!
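
The monthly breakdown below relies on a month_year column that isn’t shown being created above. One way to derive it from the ‘when’ timestamps produced by clean() would be:

# Derive a 'YYYY-MM' period label from the parsed timestamp column.
# (Assumed step: the original post doesn't show how month_year was built.)
df['month_year'] = df['when'].dt.to_period('M')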

df.month_year.value_counts(sort=False, normalize=True)
Output:
2016-06    0.030027
2016-07    0.013073
2016-08    0.017131
2016-09    0.018859
2016-10    0.020595
2016-11    0.035084
2016-12    0.053482
2017-01    0.064341
2017-02    0.050153
2017-03    0.047026
2017-04    0.039960
2017-05    0.048128
...

Let’s now look at actual full 12-month chat years instead, counting from June 2016 when the group was created. We can use a mapping.

mapping2 = {
    '2016-06':'1','2016-07':'1','2016-08':'1','2016-09':'1','2016-10':'1','2016-11':'1',
    '2016-12':'1','2017-01':'1','2017-02':'1','2017-03':'1','2017-04':'1','2017-05':'1',
    '2017-06':'2','2017-07':'2','2017-08':'2','2017-09':'2','2017-10':'2','2017-11':'2',
    '2017-12':'2','2018-01':'2','2018-02':'2','2018-03':'2','2018-04':'2','2018-05':'2',
    '2018-06':'3','2018-07':'3','2018-08':'3','2018-09':'3','2018-10':'3','2018-11':'3',
    '2018-12':'3','2019-01':'3','2019-02':'3','2019-03':'3','2019-04':'3','2019-05':'3',
    '2019-06':'4','2019-07':'4','2019-08':'4','2019-09':'4','2019-10':'4','2019-11':'4',
    '2019-12':'4','2020-01':'4','2020-02':'4','2020-03':'4','2020-04':'4','2020-05':'4',
    '2020-06':'5','2020-07':'5','2020-08':'5','2020-09':'5','2020-10':'5','2020-11':'5',
    '2020-12':'5','2021-01':'5','2021-02':'5','2021-03':'5'}
df['actual_year'] = df['month_year'].astype(str).map(mapping2)
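
Writing out every month by hand works, but the same grouping can also be computed arithmetically as the number of whole 12-month blocks elapsed since June 2016. This is an alternative sketch, not what was done above:

# Alternative sketch: chat "year" = completed 12-month blocks since June 2016, plus one.
months_since_start = (df['when'].dt.year - 2016) * 12 + (df['when'].dt.month - 6)
df['actual_year'] = (months_since_start // 12 + 1).astype(str)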

When I showed ‘D.Mi.’ his message count he suggested it was due to his Apple Watch increasing his message replies. However, as can be seen, he has been at the top of the leaderboard since the beginning.
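
One way to check that claim is to break the per-person message counts down by chat year using the actual_year column; again, this is a sketch rather than the exact plotting code behind the article’s chart:

# Sketch: message counts per participant for each chat year,
# to see whether 'D.Mi.' led both before and after getting the Apple Watch.
per_year = df.groupby(['actual_year', 'initials']).size().unstack(fill_value=0)
per_year.plot(kind='bar', figsize=(10, 5))
plt.xlabel('Chat year')
plt.ylabel('Messages sent')
plt.title('Messages per participant by chat year')
plt.tight_layout()
plt.show()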
