Omar Ahmed
- Sep 6, 2020
- 5 min read

The Analytics of NAICS.

Introduction to NAICS

(North American Industry Classification System)

The North American Industry Classification System (NAICS) is an industry classification system developed by the statistical agencies of Canada, Mexico and the United States. Created against the background of the North American Free Trade Agreement, it is designed to provide common definitions of the industrial structure of the three countries and a common statistical framework to facilitate the analysis of the three economies. NAICS is based on supply-side or production-oriented principles, to ensure that industrial data, classified to NAICS, are suitable for the analysis of production-related issues such as industrial performance.

Economic statistics describe the behaviour and activities of economic transactors and of the transactions that take place among them. The economic transactors for which NAICS is designed are businesses and other organizations engaged in the production of goods and services. They include farms, incorporated and unincorporated businesses and government business enterprises. They also include government institutions and agencies engaged in the production of marketed and non-marketed services, as well as organizations such as professional associations and unions and charitable or non-profit organizations and the employees of households. NAICS is a comprehensive system encompassing all economic activities. It has a hierarchical structure. At the highest level, it divides the economy into 20 sectors. At lower levels, it further distinguishes the different economic activities in which businesses are engaged.

SOURSES:

Dataset

RTRA (The Real Time Remote Access (RTRA) system- Data Repository by Statistics Canada
LMO Detailed Industries Data Repository by NAICS
Data Output Table by Datainsight

Imports and Datasets

import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
import math
from plotly.subplots import make_subplots
import plotly_express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
import empiricaldist as emp
import re
import os
%matplotlib inline
sns.set_style('darkgrid')

Dataset Used

North American Industry Classification System (NAICS) (LINK)

Dataset consists of time-series data from JAN 1997 up to JAN 2019 (Updated Annually).

Importing Datasets from Local files

RTRA

15 csv files RTRA_Employ_.csv :* RTRA (The Real Time Remote Access (RTRA) system

LMO Detailed Industries

LMO_Detailed_Industries_by_NAICS.xlsx : LMO Detailed Industries from Datainsight

Data sets consists of 3 main files:-

- An output template which is the target form of the data set. - A LMO file which is the focus classes desired for analysis. - RTRA CSV files which are the source of employment data.

Exploring Imported Datasets

Using .head(),.describe() and .info() methods of pandas

Cleaning LMO Detailed Industries

# Reading Data
df_LMO = pd.read_excel(r'D:\DATASCIENCE\Project 4\Datasets\\LMO_Detailed_Industries_by_NAICS.xlsx')

# Renaming columns
df_LMO.rename(columns={'NAICS':'code','LMO_Detailed_Industry':'NAICS'},inplace = True)

# Changing data types
df_LMO.code.astype('str')
# Cleaning strings
for i in range(len(df_LMO)):
    if '&' in str(df_LMO.code[i]):
        df_LMO.code[i] = df_LMO.code[i].replace(' & ',',')

# Splitting data
for i in range(len(df_LMO)):
    if ',' in str(df_LMO.code[i]):
        df_LMO.code[i] = str(df_LMO.code[i]).split(',')
# Reshaping Dataframe
df_LMO = df_LMO.explode(column='code').reset_index()[['NAICS','code']]

Cleaning RTRA CSV Files

df_2_3 = pd.DataFrame()
# Importing datafiles
files = os.listdir(r"D:\DATASCIENCE\Project 4\Datasets\CSVs\withname")
for j in files:
    df = pd.read_csv(os.path.join(r'D:\DATASCIENCE\Project 4\Datasets\CSVs\withname\\', j))

#Renaming columns
    df.rename(columns = {'_EMPLOYMENT_':'employment'},inplace=True)

#Extracting codes from NAICS using regex
    df['code'] = 'a'
    for i in range(len(df)):
        if '[' not in df.NAICS[i]:
            df.code[i] = '0000'
        else:
            x = re.search(r'(\[(\d+-?\d+(-?\d+)?)])',df.NAICS[i])
            df.code[i] = x.group(2)
            df.NAICS[i] = (df.NAICS[i].replace(x.group(1),'')).strip()

# Reshaping dataframe
    df = (df.set_index(['SYEAR', 'SMTH','NAICS','employment'])
       .apply(lambda x: x.str.split('-').explode())
       .reset_index())

    df = df[['SYEAR','SMTH','code','NAICS','employment']]
    df_2_3 = df_2_3.append(df)
    
        
df_4 = pd.DataFrame()
files = os.listdir(r"D:\DATASCIENCE\Project 4\Datasets\CSVs\withnoname")
for j in files:
    df = pd.read_csv(os.path.join(r'D:\DATASCIENCE\Project 4\Datasets\CSVs\withnoname\\', j))

    df.rename(columns = {'_EMPLOYMENT_':'employment'},inplace=True)
    df['code'] = df.NAICS

    df = df[['SYEAR','SMTH','code','NAICS','employment']]
    df_4 = df_4.append(df)
    
    
# Cleaning up and joining Datasets    
df_2_3.reset_index(drop=True,inplace=True)
df_4.reset_index(drop=True,inplace=True)
df_full = df_2_3.append(df_4)
df_full.reset_index(drop=True,inplace=True)

df_LMO.code = df_LMO.code.astype('int')
df_full.code = df_full.code.astype('int')

Exploring Cleaned Datasets

Using .head(),.describe() and .info() methods of pandas

Redefining DataFrame to Achieve Output Template.

Merging dataframes and limiting years to specified range.

LMO_full = pd.merge(df_LMO,df_full[['SYEAR','SMTH','code','employment']],on=['code'],how='left')
LMO_full = LMO_full[(LMO_full.SYEAR > 1996) & (LMO_full.SYEAR < 2019)]

Creating final output template file with filled values for employment.

output = LMO_full[['SYEAR','SMTH','NAICS','employment']].groupby(['NAICS','SYEAR','SMTH']).sum().reset_index()
output.sort_values(['SYEAR','SMTH','NAICS']).reset_index(drop=True)

output.SYEAR=output.SYEAR.astype('int')
output.SMTH = output.SMTH.astype('int')
output.employment = output.employment.astype('int')
output = output[['SYEAR','SMTH','NAICS','employment']]
output = output.sort_values(['SYEAR','SMTH','NAICS']).reset_index(drop=True)

Exploring output dataframe

Since Output dataframe matches output excel table row count therefore dataframe is cleaned.

Analysis Process

Creating timeseries DataFrame for Furthur Analysis.

timeseries = output
timeseries['date'] = pd.to_datetime(timeseries.SYEAR.astype(str) + '/' + timeseries.SMTH.astype(str) + '/01')
timeseries.drop(['SYEAR','SMTH'],inplace=True,axis=1)

Exploring created Dataframe

Inspecting Data and Answering questions that rose up.

1 - Pre-defined Questions:

How employment in Construction evolved overtime?
How employment in Construction evolved over time, compared to the total employment across all industries?

2 - Novel Questions:

Which are the most common top industries over the years?
How did the top 10 industries evolve overtime?
How did the bottom 5 industries evolve overtime?

How employment in Construction evolved overtime?

fig = px.line(timeseries[timeseries.NAICS == 'Construction'],x = 'date',y='employment',title='Evolution of Constraction over Time')
fig.show()

Conclusion:

Data shows steady fluctuations through out 1997 till early 2004.
A huge increase appeared in North America through out 2004 till early 2009.
A constant fluctuation started from 2009 till early 2017.
Data suggests future increase in rate of employment in Construction, yet this matter is to be further explored.

How employment in Construction evolved over time, compared to the total employment across all industries?

fig = go.Figure()
fig.add_trace(go.Scatter(x=timeseries[timeseries.NAICS == 'Construction'].date, y=timeseries[timeseries.NAICS == 'Construction'].employment,name = 'Construction'))
fig.add_trace(go.Scatter(x=timeseries.groupby('date').sum().reset_index().date, y=timeseries.groupby('date').sum().reset_index().employment,name = 'Total'))
fig.update_layout(title='Employment in Construction over time WRT total employment',
                   xaxis_title='Date',
                   yaxis_title='Employment')
fig.update_layout(yaxis_type="log")
fig.show()

Conclusion:

The Bump in Construction appears to slightly affect the total employment rate.
Total employment rate is showing slight and steady increase.

Most Common Top Industries over the years.

Finding the top ten industries each year and counting the occurrence of each industry in the top ten through the years

dicts = {}
years = list(timeseries.year.unique())
for i in years:
    temp = timeseries.groupby(['year','NAICS']).mean().reset_index()
    dicts[i] = list(temp[temp.year == i].nlargest(10,'employment').NAICS)
df = pd.DataFrame(dicts)
# pd.DataFrame(df.stack().value_counts()).reset_index()
fig = px.bar(pd.DataFrame(df.stack().value_counts()).reset_index(), y="index", x=0, orientation='h')
fig.update_layout(title='Frequency of top 10 Industries overtime',
                   xaxis_title='Count',
                   yaxis_title='Industries')
fig.show()

Conclusion:

Construction along with 8 other industries appeared in the top ten over the years equal times -22-.
Ambulatory health care services and other manufacturing appeared equal times at 11 times in the top ten.

How did the top 10 industries evolve overtime?

f = [a_tuple[0] for a_tuple in list(df.stack().value_counts().items())]
fig = go.Figure()
for i in f:
    
    fig.add_trace(go.Scatter(x=timeseries[timeseries.NAICS == i].date, y=timeseries[timeseries.NAICS == i].employment,name = i))
    
    
fig.update_layout(title='Top 10 Industries progression overtime',
                   xaxis_title='Date',
                   yaxis_title='Employment')
fig.show()

Conclusion:

'Other Retail Trades' is the top industry over time and Construction is closing in.
Construction and 'Food Services and drinking places' were close from 1997 till 2004 yet construction employment rate increased ever since.
'Repair, personal and non-profit services' was closely tied to Construction and 'Food Services and drinking places' yet continued to stabilize at the same rate.

How did the bottom 5 industries evolve overtime?

f = [a_tuple[0] for a_tuple in list(df.stack().value_counts().head().items())]
fig = go.Figure()
for i in f:
    
    fig.add_trace(go.Scatter(x=timeseries[timeseries.NAICS == i].date, y=timeseries[timeseries.NAICS == i].employment,name = i))
    
    
fig.update_layout(title='Bottom 5 Industries progression overtime',
                   xaxis_title='Date',
                   yaxis_title='Employment')
fig.show()

Conclusion:

'Fishing, hunting and trapping' is currently the Bottom most industry over time and 'Oil and gas extraction' is oscillating with it.
'Primary metal manufacturing' is in steady decline since 1997 with very close ties to 'Heritage institutions'.
'Ship and boat building' and 'Heritage institutions' has very similar progression over time.

Final Thoughts

Data shows steady fluctuations in 'Construction' through out 1997 till early 2004.
A huge increase in 'Construction' appeared in North America through out 2004 till early 2009.
A constant fluctuation in 'Construction' started from 2009 till early 2017.
Data suggests future increase in rate of employment in 'Construction', yet this matter is to be further explored.
The Bump in 'Construction' appears to slightly affect the total employment rate.
Total employment rate is showing slight and steady increase.
Construction along with 8 other industries appeared in the top ten over the years equal times -22-.
'Ambulatory health care services' and 'other manufacturing' appeared equal times at 11 times in the top ten.
'Other Retail Trades' is the top industry over time and Construction is closing in.
'Construction' and 'Food Services and drinking places' were close from 1997 till 2004 yet construction employment rate increased ever since.
'Repair, personal and non-profit services' was closely tied to Construction and 'Food Services and drinking places' yet continued to stabilize at the same rate.
Top most industry is 'Construction' industry with average employment rate around 172291.6 .
Bottom most industry is 'Fishing, hunting and trapping' with average employment rate around 2807.7.

Future Recommendations.

Further Study of forecasting of employment rates based on history of data is highly recommended in the future.

Data Limitations.

An indepth study of Future predictions and Machine Learning models is limited due to limited features in dataset.
A deeper aggregation of data is to be expected, a detailed location for each country in the given dataset would have aided in a deeper study.

datainsightonline.com

Data Scientist Program

Free Online Data Science Training for Complete Beginners.

No prior coding knowledge required!

The Analytics of NAICS.

Introduction to NAICS

(North American Industry Classification System)

SOURSES:

Dataset

Dataset Used

North American Industry Classification System (NAICS) (LINK)

Importing Datasets from Local files

RTRA

LMO Detailed Industries

Exploring Imported Datasets

Cleaning LMO Detailed Industries

Cleaning RTRA CSV Files

Exploring Cleaned Datasets

Redefining DataFrame to Achieve Output Template.

Analysis Process

Creating timeseries DataFrame for Furthur Analysis.

Inspecting Data and Answering questions that rose up.

Conclusion:

Conclusion:

Most Common Top Industries over the years.

Conclusion:

Conclusion:

Final Thoughts

Future Recommendations.

Data Limitations.

Recent Posts