

Visualizing The Office Dataset

The Office is an American mockumentary series that ran for 9 seasons and 201 episodes. Here we visualize the information in a dataset about its episodes, downloaded from Kaggle.

First, we import the necessary libraries.

import pandas as pd
import matplotlib.pyplot as plt

Then we load the CSV file into a DataFrame.


office_df = pd.read_csv('datasets/office_episodes.csv')

Next, we take a quick look at the columns in the DataFrame.
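A minimal sketch of how the columns can be inspected. The rows below are a hypothetical stand-in so the snippet runs on its own; only the column names are taken from the code later in this post, and the real data comes from `datasets/office_episodes.csv`.

```python
import pandas as pd

# Hypothetical stand-in rows (made-up values); the real DataFrame is
# loaded from datasets/office_episodes.csv.
office_df = pd.DataFrame({
    'episode_number': [0, 1, 2],
    'viewership_mil': [6.0, 9.5, 5.8],
    'scaled_ratings': [0.28, 0.53, 0.38],
    'has_guests': [False, True, False],
    'guest_stars': [None, 'Example Guest', None],
})

# List the column names, then show dtypes and non-null counts.
print(office_df.columns.tolist())
office_df.info()
```

`columns` gives just the names, while `info()` also shows each column's type and how many values are missing.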


Each point needs a color based on its rating and a size based on whether the episode has a guest appearance, so we create two empty lists to hold these values.

cols = []
sizes = []

We assign each row a color according to the range its scaled_ratings value falls in:

for ind, row in office_df.iterrows():
    if row['scaled_ratings'] < 0.25:
        cols.append('red')
    elif row['scaled_ratings'] < 0.50:
        cols.append('orange')
    elif row['scaled_ratings'] < 0.75:
        cols.append('lightgreen')
    else:
        cols.append('darkgreen')
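The same thresholds can also be applied without an explicit loop. The sketch below uses `numpy.select`, which picks the first condition that matches for each element; the `ratings` values are hypothetical and only there to make the snippet runnable.

```python
import numpy as np
import pandas as pd

# Hypothetical scaled ratings, one per threshold band.
ratings = pd.Series([0.10, 0.25, 0.60, 0.90])

# Conditions are checked in order, mirroring the if/elif chain.
conditions = [ratings < 0.25, ratings < 0.50, ratings < 0.75]
choices = ['red', 'orange', 'lightgreen']

# Anything that matches no condition (0.75 and above) gets the default.
cols = np.select(conditions, choices, default='darkgreen')
print(cols.tolist())
```

This is an alternative to the loop, not a replacement for it; both produce one color per row in the same order.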

Similarly, rows with a guest appearance are assigned a size of 250, and all other rows a size of 25.

for ind, row in office_df.iterrows():
    if row['has_guests'] == False:
        sizes.append(25)
    else:
        sizes.append(250)
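Because there are only two cases here, `numpy.where` is a compact alternative to the loop. This is a sketch on a hypothetical `has_guests` Series rather than the real column.

```python
import numpy as np
import pandas as pd

# Hypothetical guest flags for three episodes.
has_guests = pd.Series([False, True, False])

# 250 where the flag is True, 25 everywhere else.
sizes = np.where(has_guests, 250, 25)
print(sizes.tolist())
```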

For convenience, we add both lists to the DataFrame as new columns.

office_df['colors'] = cols
office_df['sizes'] = sizes

We split the DataFrame into two: one for episodes without guests and one for episodes with guests.

non_guest_df = office_df[office_df['has_guests'] == False]
guest_df = office_df[office_df['has_guests'] == True]

Before plotting, we set the figure size and plot style:

# Set the figure size and plot style        
plt.rcParams['figure.figsize'] = [11, 7]
plt.style.use('fivethirtyeight')

# Create the figure
fig = plt.figure()

We draw a scatter plot of episode_number against viewership_mil, once for each DataFrame. Both use the colors column; guest episodes are drawn as larger star markers.

plt.scatter(x=non_guest_df.episode_number, y=non_guest_df.viewership_mil, c=non_guest_df['colors'], s=25)
plt.scatter(x=guest_df.episode_number, y=guest_df.viewership_mil, c=guest_df['colors'], marker='*', s=250)
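For comparison, here is a minimal sketch of the same two-layer scatter plot written with Matplotlib's object-oriented interface (`plt.subplots`). The DataFrame is a hypothetical stand-in with made-up values, and the `Agg` backend is selected only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; an assumption for headless runs
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in data with the columns used in this post.
office_df = pd.DataFrame({
    'episode_number': [0, 1, 2, 3],
    'viewership_mil': [6.0, 9.5, 5.8, 5.4],
    'has_guests': [False, True, False, False],
    'colors': ['orange', 'lightgreen', 'orange', 'red'],
})

# Split on the boolean column; ~ inverts the mask.
guest_mask = office_df['has_guests']
non_guest_df = office_df[~guest_mask]
guest_df = office_df[guest_mask]

# One Axes, two scatter layers: small dots and large stars.
fig, ax = plt.subplots(figsize=(11, 7))
ax.scatter(non_guest_df['episode_number'], non_guest_df['viewership_mil'],
           c=non_guest_df['colors'], s=25)
ax.scatter(guest_df['episode_number'], guest_df['viewership_mil'],
           c=guest_df['colors'], marker='*', s=250)
ax.set_xlabel('Episode Number')
ax.set_ylabel('Viewership (Millions)')
```

Working with an explicit `ax` object makes it clear which Axes each call modifies, which helps once a figure has more than one subplot.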

Finally, we add a title and axis labels, then display the plot.

plt.title("Popularity, Quality, and Guest Appearances on the Office", fontsize=28)
plt.xlabel("Episode Number", fontsize=18)
plt.ylabel("Viewership (Millions)", fontsize=18)
plt.show()

Lastly, we find the episode with the highest viewership and look at the guest stars who appeared in it.

highest_view = office_df["viewership_mil"].max()

most_watched_dataframe = office_df.loc[office_df["viewership_mil"] == highest_view]

top_star = most_watched_dataframe[["guest_stars"]]
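Another way to get the same row is `idxmax`, which returns the index label of the largest value. The sketch below runs on hypothetical data, and it assumes that guest_stars stores names as a single comma-separated string, which the post does not confirm.

```python
import pandas as pd

# Hypothetical stand-in data (made-up values and guest names).
office_df = pd.DataFrame({
    'episode_number': [0, 1, 2],
    'viewership_mil': [6.0, 9.5, 5.8],
    'guest_stars': [None, 'Example Guest A, Example Guest B', None],
})

# idxmax gives the index label of the row with the highest viewership.
top_row = office_df.loc[office_df['viewership_mil'].idxmax()]

# Assumption: names are separated by ", " inside one string.
top_stars = top_row['guest_stars'].split(', ')
print(top_row['episode_number'], top_stars)
```

Note that `idxmax` returns only the first match if several episodes tie for the maximum, whereas the equality filter above keeps all of them.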