Introduction
Like most professionals in financial services, I subscribe to a range of market news and commentary emails. These are written by journalists, investment professionals and market commentators,
and provide daily perspectives on economics, geopolitics, social and behavioural sciences, etc., and the effect these are having, or are hypothesised to be having, on market dynamics.
I realised that a potentially very interesting and useful time series of data on current market topics and sentiment was being built up in my inbox from these emails.
My corporate email was not traditionally a data set I was used to querying, and my first challenge was to get hold of this data (or at least its metadata) in a way that
allowed me to do some interesting things with it.
The NLP on this data set is a topic for another post. However, throughout the process of curating a pandas dataframe containing elements of this market email data, I became curious about what my
Inbox looked like more broadly; in particular, what my Received and Sent emails could tell me about my office environment, my main collaborators and my own working style from a behavioural standpoint.
This post describes the Python code I used to scrape metadata from my Inbox and some of the learnings I obtained.
Implementing in Python
Link to my GitHub repo:
Email-Scraping
The first thing to do was to navigate to my email folder structure and pull something out. The pywin32 module was ideal in this case.
###########################################################
# Title: Querying emails
# Purpose: Extract metadata from outlook emails with Python
# Author: Thomas Handscomb
###########################################################
# import modules into session
import pandas as pd
import win32com.client
from tqdm import tqdm # Useful module for displaying a progress bar during long loops
# Define Outlook location
outlook = win32com.client.Dispatch("Outlook.application")
mapi = outlook.GetNamespace("MAPI")
# Find the folder number of the 'Thomas.Handscomb@[CompanyName].com' meta
# data folder to start with
for k in range(1, len(mapi.Folders)+1):
    try:
        fol = mapi.Folders.Item(k)
        if fol.name == 'Thomas.Handscomb@[CompanyName].com':
            folnum = k
            #print(folnum)
            break
    except Exception as e:
        print('Error:' + '(' + str(k) + ')')
        pass
print(folnum)
1
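If you are unsure which top-level folder index corresponds to your mailbox, a quick way to check is simply to print every top-level folder. This is a minimal sketch using the standard Folders Count, Item and name members of the Outlook object model, rather than searching for a hard-coded name:
# List every top-level folder name alongside its index so the mailbox
# folder can be identified by eye
for k in range(1, mapi.Folders.Count + 1):
    try:
        print(k, mapi.Folders.Item(k).name)
    except Exception:
        print(k, '<inaccessible folder>')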
Once you have determined the above folder number, find the 'Inbox' and 'Sent Items' folders within it.
Inboxnum, Sentnum = -1, -1

for l in range(1, 30):
    try:
        subfol = mapi.Folders.Item(folnum).Folders.Item(l)
        if Inboxnum > 0 and Sentnum > 0:
            break
        elif subfol.name == 'Inbox':
            Inboxnum = l
        elif subfol.name == 'Sent Items':
            Sentnum = l
    except Exception as e:
        print('Error at loop: %.f' % l)
        pass

print("%0.f, %0.f" % (Inboxnum, Sentnum))
2, 4
Once the folder numbers are defined, use them to specify the 'Inbox' and 'Sent Items' folders.
Inbox = mapi.Folders.Item(folnum).Folders.Item(Inboxnum)
Sent = mapi.Folders.Item(folnum).Folders.Item(Sentnum)

# Double check the names
if Inbox.name == 'Inbox' and Sent.name == 'Sent Items':
    print('Inbox and Sent folders assigned correctly')
else:
    print('An error has occurred')
Inbox and Sent folders assigned correctly
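As an aside, if the mailbox of interest is the default Outlook store, the MAPI namespace's GetDefaultFolder method resolves these folders directly without any searching (6 and 5 are the Outlook enumeration values for the Inbox and Sent Items folders respectively). A minimal sketch:
# Alternative: resolve the default store's folders directly
# 6 = olFolderInbox, 5 = olFolderSentMail in the Outlook object model
Inbox_default = mapi.GetDefaultFolder(6)
Sent_default = mapi.GetDefaultFolder(5)
print(Inbox_default.name, '|', Sent_default.name)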
Now that the Inbox and Sent Items folders have been correctly identified, the below loop constructs a dataframe by looping through all items
(i.e. emails) in the Inbox and extracting some metadata from each: the date received (approximated here by the last modification time), the sender and the subject.
# Now that the Inbox and Sent Items folders have been determined,
# create a blank data frame to store email metadata, in this case
# (date/time, sender name, email subject)
Inbox_col_names = ['Full Date', 'Date', 'Hour', 'Sender', 'Subject']
Inbox_df = pd.DataFrame(columns = Inbox_col_names)

# Loop through all Inbox.Items (i.e. emails);
# the tqdm wrapper puts a progress bar on the loop
for message in tqdm(Inbox.Items):
    try:
        Inbox_df.loc[len(Inbox_df)] = \
            [message.LastModificationTime.strftime("%Y-%m-%d %H:%M:%S")
            , message.LastModificationTime.strftime("%Y-%m-%d")
            , message.LastModificationTime.strftime("%H")
            , message.Sender
            , message.Subject]
    except:
        pass
# Confirm you are picking up all emails
Inbox_df.groupby(['Date']).size()

# Output data frame to review
Output_filepath = 'C:/Users'

Inbox_df.to_csv(Output_filepath+'/Inbox.csv'
                , encoding = 'utf-8'
                #, mode = 'a'
                , index = False
                , header = True)
and similarly for my Sent Items
Outbox_col_names = ['Full Date', 'Date', 'Hour', 'Recipient', 'Subject']
Outbox_df = pd.DataFrame(columns = Outbox_col_names)

for message in tqdm(Sent.Items):
    try:
        Outbox_df.loc[len(Outbox_df)] = \
            [message.LastModificationTime.strftime("%Y-%m-%d %H:%M:%S")
            , message.LastModificationTime.strftime("%Y-%m-%d")
            , message.LastModificationTime.strftime("%H")
            , message.To
            , message.Subject]
    except:
        pass

Outbox_df.to_csv(Output_filepath+'/Outbox.csv'
                 , encoding = 'utf-8'
                 #, mode = 'a'
                 , index = False
                 , header = True)
Organisational Working Patterns
The final steps above output two csv files, one each from Inbox_df and Outbox_df, summarising the email datestamp, sender/recipient and subject.
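If you are picking the analysis up in a fresh session, the exports can simply be read back into pandas. A minimal sketch, with paths as in the export step above:
# Reload the exported metadata; the 'Hour' column is kept as a
# zero-padded string ('00'-'23') so it sorts naturally
Inbox_df = pd.read_csv(Output_filepath + '/Inbox.csv', dtype = {'Hour': str})
Outbox_df = pd.read_csv(Output_filepath + '/Outbox.csv', dtype = {'Hour': str})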
Starting with my inbox, the chart below shows the aggregated distribution of the hours of the day at which colleagues send me emails.

This picture is broadly unsurprising: colleagues tend to come into the office and work through their inboxes first thing. Energised
after a lunchtime lull at midday, colleagues ramp up communication again from 2:00pm before tailing off towards the end of the day.
A few dedicated individuals continue late into the evening, and some in overseas offices continue overnight.
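For reference, the aggregation behind this kind of chart is a one-line groupby on the Hour column. A minimal plotting sketch (matplotlib is an assumption here; the original charts were not necessarily produced this way):
import matplotlib.pyplot as plt

# Count received emails by hour of day and plot the distribution;
# the same code on Outbox_df gives the sent-hour distribution shown later
hourly = Inbox_df.groupby('Hour').size().sort_index()
hourly.plot(kind = 'bar')
plt.xlabel('Hour of day')
plt.ylabel('Number of emails received')
plt.tight_layout()
plt.show()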
Sorting by sender illustrated clearly to me who my closest collaborators were (as well as those who spam me the most!). Each row in the chart below is a unique sender,
with an overlaid Pareto curve illustrating the cumulative proportion of emails they account for. Four colleagues send me 20% of all the emails I receive!
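The sender ranking and cumulative Pareto curve can be reproduced along these lines (again a sketch; the 20% threshold matches the observation above):
# Emails per sender, largest first, and the cumulative share each adds
sender_counts = Inbox_df.groupby('Sender').size().sort_values(ascending = False)
cumulative_share = sender_counts.cumsum() / sender_counts.sum()

# How many senders account for the first 20% of all received emails?
top_senders = (cumulative_share <= 0.20).sum()
print('%d senders account for 20%% of received emails' % top_senders)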
Learnings
What was more interesting to me was how I was responding to emails. Below is the distribution of my sent emails by hour of the day:

I generally followed the same pattern as the broader firm, with perhaps a less pronounced spike at 10:00am and a more pronounced one at 8:00pm, when I logged on again from home.
The real insight for me was just how many emails I was sending in the morning, typically my most creative and productive time of the day for doing data science.
Why was I sending so many emails during my most creative time?
Doing so was taking up valuable time to write them and disrupting my concentration when doing, or guiding others on, some difficult piece of data science.
Worse still, almost none of these emails needed a response in that window. Having reflected on this, I began blocking out 'no email sending' time in my diary in the mornings
to ensure I was at my most efficient when executing on a difficult piece of work.