Insight From The Office Episode

bismark boateng
Oct 16, 2021
Introduction

The Office is an American TV series adapted from a British mockumentary of the same name about office culture, which first aired in 2001. It depicts the everyday lives of office employees at the fictional Scranton branch of the Dunder Mifflin Paper Company.

The aim of this post is to gain insight by investigating the popularity and quality of The Office episodes over time.

The dataset used is datasets/office_projects.csv, which was downloaded from Kaggle.

The dataset contains information on a variety of characteristics of each episode. In detail, these are:

episode_number: Canonical episode number
season: Season in which the episode appeared
episode_title: Title of the episode
description: Description of the episode
ratings: Average IMDB rating.
votes: Number of votes
viewership_mil: Number of US viewers in millions
duration: Duration in number of minutes
release_date: Airdate
guest_stars: Guest stars in episode (if any)
director: Director of the episode
writers: Writers of the episode.
has_guests: True/False column for whether the episode contained guest stars.
scaled_ratings: The ratings scaled from 0 (worst-reviewed) to 1 (best-reviewed)

Let's peek into the dataset to learn the data types and other properties that will help us analyze the data.

To do so, we need to import certain libraries.

import matplotlib.pyplot as plt 
import pandas as pd

In the code above, pandas is imported to manipulate the dataset, and matplotlib is imported to visualize it.

Moving on, we need to read the dataset:

episodeDataset = pd.read_csv("datasets/office_projects.csv")
episodeDataset.info()

The code above reads the dataset into the variable "episodeDataset".

The second line of code prints a summary of the dataset. This is helpful as it tells us what the dataset is about: the number of columns, the datatype of each column, and so on.
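Beyond info(), a few other pandas calls give a quick overview of a table. The sketch below is only an illustration: it builds a tiny made-up DataFrame (the name sample and all of its values are assumptions, not rows from the Kaggle file) so that it runs without the CSV.

```python
import pandas as pd

# Tiny illustrative frame; the values are made up, not taken from the real dataset.
sample = pd.DataFrame({
    "episode_number": [0, 1, 2],
    "viewership_mil": [11.2, 6.0, 5.8],
    "has_guests": [False, True, False],
    "guest_stars": [None, "Guest A, Guest B", None],
})

print(sample.head())           # first rows of the table
print(sample.dtypes)           # data type of each column
missing = sample.isna().sum()  # number of missing values per column
print(missing)
```

On the real data, the same calls could be made on episodeDataset instead of sample.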

Now that we have seen the details of the dataset, let's visualize it using the matplotlib library.

fig = plt.figure() 
plt.plot(episodeDataset['episode_number'], episodeDataset['viewership_mil'])

This code plots episode_number on the x-axis and viewership_mil on the y-axis; these are the two columns we are most interested in.

The plot above gives us a visual representation of the dataset, but a line plot makes it a bit difficult to wrap our heads around the data.

We will use a scatter plot instead, as it gives us a much better understanding of the dataset:

fig = plt.figure() 
plt.scatter(episodeDataset['episode_number'], episodeDataset['viewership_mil'])

As it turns out, the scatter plot gives us a much better understanding of the dataset. Notice that one data point is really far from the others. What happened? Why is that?

We will get to know more about that later.

Before we start making filters to make the plot clear, we first need to subset the columns we are interested in working with.

# this subsets the episode numbers
episodeNumber = episodeDataset['episode_number']

# this subsets the number of viewers in millions
episodeViewership = episodeDataset['viewership_mil']

# this subsets the scaled ratings of the episodes
scaledRatings = episodeDataset['scaled_ratings']

After this, to get a clearer understanding of the dataset, we will set up filters that depend on the scaled rating of each episode, such that if the

rating < 0.25, the color will be red

rating >= 0.25 and < 0.50, the color will be orange

rating >= 0.50 and < 0.75, the color will be lightgreen

rating >= 0.75, the color will be darkgreen.

colors_str = []  # list of colors, one per episode

# loop through each scaled rating and append the matching color
for rating in scaledRatings:
    if rating < 0.25:
        colors_str.append("red")
    elif 0.25 <= rating < 0.50:
        colors_str.append("orange")
    elif 0.50 <= rating < 0.75:
        colors_str.append("lightgreen")
    elif rating >= 0.75:
        colors_str.append("darkgreen")
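The same thresholds can also be wrapped in a small helper function, which makes the rule easy to check on its own. This is only a sketch: the names rating_to_color, example_ratings and example_colors are hypothetical, and the ratings are made up for the demonstration.

```python
def rating_to_color(rating):
    """Map a scaled rating in [0, 1] to a color name using the thresholds above."""
    if rating < 0.25:
        return "red"
    if rating < 0.50:
        return "orange"
    if rating < 0.75:
        return "lightgreen"
    return "darkgreen"


# Hypothetical ratings used only to demonstrate the helper.
example_ratings = [0.1, 0.25, 0.6, 0.75, 1.0]
example_colors = [rating_to_color(r) for r in example_ratings]
print(example_colors)
# ['red', 'orange', 'lightgreen', 'darkgreen', 'darkgreen']
```

Note that a missing rating (NaN) fails every comparison and would therefore fall through to "darkgreen" in this sketch, so missing values should be handled before applying it.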

Now that we are done setting up the color filters, we will set up the marker sizes: each episode gets a different marker size depending on whether or not it has a guest star.

size_s = []  # list of marker sizes, one per episode

# subset the has_guests column
guests = episodeDataset['has_guests']
for guest in guests:
    if guest:
        size_s.append(250)  # larger marker for episodes with guest stars
    else:
        size_s.append(25)   # smaller marker for episodes without guest stars

Episodes with guest stars will have a larger marker than those without, which will help us understand the dataset much better.
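For a simple two-way choice like this, a list comprehension is a compact alternative to the loop. The sketch below uses made-up has_guests values (example_guests and example_sizes are hypothetical names) so that it runs on its own.

```python
# Hypothetical has_guests values used only for illustration.
example_guests = [True, False, False, True]

# 250 for episodes with guest stars, 25 otherwise.
example_sizes = [250 if has_guest else 25 for has_guest in example_guests]
print(example_sizes)
# [250, 25, 25, 250]
```

With the real data, the same expression could iterate over episodeDataset['has_guests'] instead of example_guests.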

We now need to add these filters to the plot and see how much easier the graph is to understand:

fig = plt.figure() 

# s accepts a list of marker sizes and c accepts a list of color names
plt.scatter(episodeNumber, episodeViewership, s=size_s, c=colors_str)

# label the x and y axes and add a title
plt.xlabel("Episode Number")
plt.ylabel("Viewership (Millions)")
plt.title("Popularity, Quality, and Guest Appearances on The Office")

plt.show()

Notice in the above code that we added some flavor to the plot: we labeled the x and y axes and included a title this time.

Let's see how the plot looks.

Here, our plot is clear and easy to read, even for someone without any knowledge of data analysis. It tells us that one of the episodes had more than 20 million viewers. How come it's so distinct from the others? Comment below what you think.
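One thing that could make such a plot even more self-explanatory is a legend for the colors. A single scatter call with a list of colors carries no per-color labels, so the sketch below builds one legend entry per rating band by hand with Line2D proxy handles. The points, the bands dictionary and the legend title are illustrative assumptions, not part of the original analysis.

```python
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D

# Made-up points standing in for episode number, viewership and color.
x = [1, 2, 3, 4]
y = [7.5, 6.1, 8.3, 9.0]
colors = ["red", "orange", "lightgreen", "darkgreen"]

fig, ax = plt.subplots()
ax.scatter(x, y, c=colors)

# One proxy handle per rating band, matching the thresholds used for the colors.
bands = {
    "red": "< 0.25",
    "orange": "0.25 to < 0.50",
    "lightgreen": "0.50 to < 0.75",
    "darkgreen": ">= 0.75",
}
handles = [
    Line2D([0], [0], marker="o", linestyle="", color=color, label=label)
    for color, label in bands.items()
]
legend = ax.legend(handles=handles, title="Scaled rating")
```

Calling plt.show() afterwards displays the figure, as in the plotting code above.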

Since we've determined that one of the episodes had far more views than the rest, we can also determine one of the top stars in that episode.

Let's see how to go about this:

# select the guest stars of the episode with the highest viewership
stars = episodeDataset[episodeDataset['viewership_mil'] == episodeViewership.max()]['guest_stars']

# guest_stars is a comma-separated string, so take the first name in it
top_star = stars.iloc[0].split(",")[0]
print(top_star)

output: 
Cloris Leachman

From the above code, we see that Cloris Leachman is one of the top stars in that episode.
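A slightly different way to reach the same kind of answer is idxmax, which returns the index label of the row with the largest value. The sketch below is a hedged illustration on a made-up table (sample, top_row and first_guest are hypothetical names, and the rows are not from the real dataset). It also guards against a missing guest_stars value, which would otherwise make split fail.

```python
import pandas as pd

# Made-up rows for illustration; not values from the real dataset.
sample = pd.DataFrame({
    "episode_title": ["Episode A", "Episode B", "Episode C"],
    "viewership_mil": [7.4, 21.0, 6.9],
    "guest_stars": [None, "Guest One, Guest Two", None],
})

# idxmax returns the index label of the row with the largest viewership.
top_row = sample.loc[sample["viewership_mil"].idxmax()]

# guest_stars may be missing, so only split when a value is present.
guest_field = top_row["guest_stars"]
first_guest = guest_field.split(",")[0].strip() if pd.notna(guest_field) else None
print(top_row["episode_title"], first_guest)
```

Working with the whole row also makes it easy to read other columns of the most-watched episode, such as its title.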
