top of page

Data Scientist Program


Free Online Data Science Training for Complete Beginners.

No prior coding knowledge required!

NAICS; The North American Industry Classification System analysis project

The North American Industry Classification (NAICS) is the standard used by Federal statistical agencies in classifying business establishments for the purpose of collecting, analyzing, and publishing statistical data related to Canada, Mexico and the United States economy.

It gives us definition about the industrial structure of the three countries

The NAICS contains 15 csv files beginning with RTRA. These files contain employment data by industry at levels of aggregation, 2-digit NAICS, 3-digit and 4-digit.


The first thing to do in any data analysis project is to process the available data, clean and make sure the data is ready for analysis,

since the NAICS contain 15 csv files, we need to find a way to combine

all these dataset into one, making our analysis very effective.

We first start by combining the 2-digit NAICS datasets together.

In this project, we'll make use of python, so let's start by importing

all the necessary libraries that we'll need.

import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns 
import os 

The os module help direct us to the directory path of the NAICS data.

NB: The files for this project is downloaded locally, thus there will be a difference in the directory names, the process of cleaning and analysing is still the same.

Let's define our directory where we have the NAICS datasets


The NAICS dataset contains a data output template excel files which contains the industry names and dates, since this will be crucial to our analysis, we'll load the excel file .

#reading the output template file 
out_temp_file = 'Data_Output_Template.xlsx'
out_temp = pd.read_excel(os.path.join(DATE_DIR, out_temp_file), parse_dates = {'DATE':['SYEAR', 'SMTH']})

Next, we'll load LMO_Detailed_Industry excel sheet which contains Industry names and it's naic code.

#reading the LM0_Detailed industy csv file 
lmo_file = 'LMO_Detailed_Industries_by_NAICS.xlsx'
lmo = pd.read_excel(os.path.join(DATA_DIR, lmo_file))

# cleaning the lmo dataset lmo['NAICS'] = lmo['NAICS'].astype('str').str.replace('&','')lmo['NAICS'] = lmo['NAICS'].str.strip()lmo.head()

Now, we'll continue by creating a function that'll read the csv files of all

the different NAICS datasets, parse the dates and split the industry names from it's naic codes

#reading each csv file, parsing dates and joining the paths def read_csv(file):
''' This function reads the csv files 
    parse date into a datetime , 
    and splits the                 
    naics code from the industry names  '''
    r = pd.read_csv(file, parse_dates={'DATE':['SYEAR','SMTH']})
    r['NAICS'] = r['NAICS'].str.replace(']', '')
    r[['NAICS', 'CODE']] = r['NAICS'].str.split('[', n=1, expand=True)
    return r

we'll continue by cleaning the 2-digit dataset which will be the same as the 3-digit, and then merge all with the 4-digit dataset.

# 2 NAICS csv files 
file_00_05 = 'RTRA_Employ_2NAICS_00_05.csv'
file_06_10 = 'RTRA_Employ_2NAICS_06_10.csv'
file_11_15 = 'RTRA_Employ_2NAICS_11_15.csv'
file_16_20 = 'RTRA_Employ_2NAICS_16_20.csv'
file_97_99 = 'RTRA_Employ_2NAICS_97_99.csv'

# using the read_csv function, apply it to all the files
file1 = read_csv(os.path.join(DATA_DIR, file_00_05))
file2 = read_csv(os.path.join(DATA_DIR, file_06_10))
file3 = read_csv(os.path.join(DATA_DIR, file_11_15))
file4 = read_csv(os.path.join(DATA_DIR, file_16_20))
file5 = read_csv(os.path.join(DATA_DIR, file_97_99))

#concatenating all the two digit files and selecting the 
# date of interest
two_digit = pd.concat([file1, file2, file3, file4, file5])
two_digit = two_digit[two_digit['DATE'] < '2019-01-01']

#further cleaning 

#column of interest is NAICS CODE, DATE AND _EMPLOYMENT_, dropping any other columns 
two_digit = two_digit.drop('NAICS', axis=1)
#re assigning the NAICS NAME 
two_digit = two_digit.rename(columns={'CODE': 'NAICS'})

Finally we merge the two digit dataset with lmo on the naic column

data_2 = lmo.merge(two_digit, on='NAICS', how='right').reset_index(drop=True)data_2.head()

This process is repeated for the 3-digit and 4-digit dataset

After this is done, we get a new dataset, showing the few rows below;


A properly processed and cleaned is useless if we cannot get insight from it. That is what we want from a data after several hour of processing and cleaning, insight !

In this second part of the project, we will answer some questions which will give us insight into the data.

But first, let's visualize the employment in each sector, which industry has the higest employment ?

sns.barplot(x='LMO_Detailed_Industry', y='Employment', data=output_industries)
plt.title('Plotting Employment for each sector')

Construction has the higest employment rate, this insight opens doors to answer more questions about the data,

How has employment in construction evolved over time ?

construction = output_industries[output_industries['LMO_Detailed_Industry'] == 'Construction']

sns.lineplot(x=construction.index, y='Employment', data=construction)
plt.ylabel('Employment in Construction')
plt.title('How Employment in Construction has evolved over time')

This tells us that, looking at the big picture, employment increased over the years .

The next question; with this employment, how does it vary with the other employments in the industry

not_construction =
output_industries[output_industries['LMO_Detailed_Industry'] != 'Construction']

sns.lineplot(x=construction.index, y='Employment', data=construction, label='Construction')
sns.lineplot(x=not_construction.index, y='Employment', data=not_construction, label='other industries')

We see that, employment in construction was really high compared to other employment across industries

Hospital = output_industries[output_industries['LMO_Detailed_Industry'] == 'Hospitals']

sns.lineplot(x=Hospital.index, y=Hospital.Employment, data = Hospital)

We can see that there is variation of employment in the hospital sector across the year's, overall the employment rate in the hospital sector rallied up.

Let us look at how the Hospital sector evolved in relation to other sector in the industries

not_Hospital = output_industries[output_industries['LMO_Detailed_Industry'] != 'Hospitals']

sns.lineplot(x=Hospital.index, y=Hospital.Employment, data=Hospital, label='Hospital')
sns.lineplot(x=construction.index, y='Employment', data=construction, label='Construction')
sns.lineplot(x=not_Hospital.index, y='Employment', data=not_Hospital, label='other industries')

From the graph, whiles the Employment rate of the Hospital industry did increase over time gradually

the employment rate in the Hospital industry experienced a subtle increase over the years.

Our last question will be,

How has the Mining industry evolved over time ?

did it increase sharply or it was subtle?

was there a lot of variations or it looked like cycles ?

let us answer this question

Mining = output_industries[output_industries['LMO_Detailed_Industry'] == 'Mining']

sns.lineplot(x=Mining.index, y='Employment', data=Mining)plt.xlabel('Time')
plt.ylabel('Employment in Mining')
plt.title('How Employment in Mining has evolved over time')

Overall, the employment rate in the Mining industry did increase, despite the pit fall starting from 2000 through to 2008


This analysis fairly gives us insight into the NAICS dataset.

Thank you.


Recent Posts

See All
bottom of page