

Analysis Of NAICS Timeseries

The North American Industry Classification System (NAICS) is an industry classification system developed by the statistical agencies of Canada, Mexico, and the United States. NAICS is designed to provide common definitions of the industrial structure of the three countries and a common statistical framework to facilitate the analysis of the three economies.


The structure of NAICS is hierarchical. The numbering system that has been adopted is a six-digit code, of which the first five digits are used to describe the NAICS levels that will be used by the three countries to produce comparable data. The first two digits designate the sector, the third digit designates the subsector, the fourth digit designates the industry group and the fifth digit designates the industry. The sixth digit is used to designate national industries. A zero as the sixth digit indicates that there is no further national detail. For example (see page 22), a 2-digit NAICS industry (e.g., 23 - Construction) is composed of some 3-digit NAICS industries (236 - Construction of buildings, 237 - Heavy and civil engineering construction, and a few more 3-digit NAICS industries). Similarly, a 3-digit NAICS industry (e.g., 236 - Construction of buildings), is composed of 4-digit NAICS industries (2361 - Residential building construction and 2362 - Non-residential building construction).
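The prefix structure described above can be illustrated with a short sketch. The helper name `naics_ancestors` and the `NAICS_LEVELS` table are hypothetical, introduced here only for illustration; the level names come from the paragraph above.

```python
# Level names follow the description above; the dictionary itself is an
# illustrative assumption, not part of any official NAICS tooling.
NAICS_LEVELS = {
    2: "sector",
    3: "subsector",
    4: "industry group",
    5: "industry",
    6: "national industry",
}


def naics_ancestors(code: str) -> dict:
    """Map each hierarchy level to the prefix of `code` that identifies it."""
    if not code.isdigit() or not 2 <= len(code) <= 6:
        raise ValueError(f"expected a 2- to 6-digit NAICS code, got {code!r}")
    return {name: code[:n] for n, name in NAICS_LEVELS.items() if n <= len(code)}


print(naics_ancestors("2361"))
# {'sector': '23', 'subsector': '236', 'industry group': '2361'}
```

So 2361 (Residential building construction) sits under subsector 236 (Construction of buildings), which in turn sits under sector 23 (Construction).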


In this post, we perform a time series analysis of employment data classified by the North American Industry Classification System (NAICS). First we import the dataset and prepare and clean the data; then we carry out an exploratory data analysis on the cleaned dataset.


The dataset contains:


Raw data: 15 CSV files whose names begin with RTRA. These files contain employment data by industry at different levels of aggregation: 2-digit, 3-digit, and 4-digit NAICS. The columns are as follows:



(i) SYEAR: Survey Year


(ii) SMTH: Survey Month


(iii) NAICS: Industry name with the associated NAICS code in brackets


(iv) _EMPLOYMENT_: Employment



LMO Detailed Industries by NAICS: An Excel file for mapping the RTRA data to the desired data. The first column lists 59 frequently used industries; the second column gives their NAICS definitions.


Data Output Template: An Excel file with an empty column for employment.


Data Wrangling:


Import the libraries used:

import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns 
import re
%matplotlib inline
import scipy

Read the output template as CSV:

df_output = pd.read_csv('Data_Output_Template - Sheet1.csv')
df_output.head(10)


Read the second file, 'LMO_Detailed_Industries_by_NAICS - LMO_Detailed_Industry.csv':

df_Details = pd.read_csv('LMO_Detailed_Industries_by_NAICS - LMO_Detailed_Industry.csv')
df_Details['NAICS_Codes'] = df_Details.NAICS.astype(str).str.replace('&',',')
df_Details = df_Details.drop(columns='NAICS')

df_Details.head(10)

Create an add_date function to help us later:

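def add_date(df1):
    # Build a 'YYYY-MM' label from the survey year and month columns
    datetime_str = df1.SYEAR.astype(str) + ' ' + df1.SMTH.astype(str)
    # An explicit format avoids relying on pandas guessing how to parse 'YYYY M'
    df1['DATE'] = pd.to_datetime(datetime_str, format='%Y %m').dt.strftime('%Y-%m')
    df1.set_index('DATE', inplace = True)
    return df1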


df_output = add_date(df_output)
df_output.head()
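To see what the resulting DATE index looks like, here is a minimal sketch on a made-up three-row frame. The values are invented for illustration; only the column names SYEAR, SMTH, and _EMPLOYMENT_ come from the data description above.

```python
import pandas as pd

# Toy rows (invented values) with the same year/month columns as the raw data
toy = pd.DataFrame({
    "SYEAR": [1997, 1997, 1998],
    "SMTH": [1, 12, 2],
    "_EMPLOYMENT_": [100, 110, 120],
})

# Same steps as add_date: join year and month, parse, format as 'YYYY-MM'
stamp = toy["SYEAR"].astype(str) + " " + toy["SMTH"].astype(str)
toy["DATE"] = pd.to_datetime(stamp, format="%Y %m").dt.strftime("%Y-%m")
toy = toy.set_index("DATE")

print(toy.index.tolist())
# ['1997-01', '1997-12', '1998-02']
```

Because the month is zero-padded, these string labels sort in chronological order even though they are not datetime objects.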


def Clean_Raw_Data(df):
    df['NAICS_Codes'] = df.NAICS\
                        .map(lambda x:x.split('[')[1].strip(']').replace('-', ','))
    df = df.drop(columns = 'NAICS', axis = 1)[df.SYEAR < 2019]
    return df
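The string handling inside Clean_Raw_Data is easier to follow on a couple of sample labels. The labels below are assumed examples of the "name [code]" layout described earlier, not rows taken from the files, and `extract_codes` is a hypothetical helper that repeats the lambda's steps.

```python
def extract_codes(label: str) -> str:
    """Apply the same string steps as the lambda in Clean_Raw_Data."""
    # Take the text after '[', drop the closing ']', turn '-' into ','
    return label.split('[')[1].strip(']').replace('-', ',')


print(extract_codes("Construction [23]"))      # 23
print(extract_codes("Manufacturing [31-33]"))  # 31,33
```

Two things are worth noting about this approach: a range such as 31-33 becomes the two endpoints "31,33" rather than every code in between, and a label without a bracket raises an IndexError because there is no second piece after the split.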



# Read the five 2-digit NAICS files and stack them into one frame
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
files_2_naics = ['RTRA_Employ_2NAICS_00_05.csv', 'RTRA_Employ_2NAICS_06_10.csv',  'RTRA_Employ_2NAICS_11_15.csv', 'RTRA_Employ_2NAICS_16_20.csv',  'RTRA_Employ_2NAICS_97_99.csv']
df2 = pd.concat([pd.read_csv(f) for f in files_2_naics], ignore_index=True)
Cleanedd_df2 = Clean_Raw_Data(df2)
Cleanedd_df2.head()




# Read the five 3-digit NAICS files and stack them into one frame
files_3_naics = ['RTRA_Employ_3NAICS_00_05.csv', 'RTRA_Employ_3NAICS_06_10.csv',  'RTRA_Employ_3NAICS_11_15.csv', 'RTRA_Employ_3NAICS_16_20.csv',  'RTRA_Employ_3NAICS_97_99.csv']
df3 = pd.concat([pd.read_csv(f) for f in files_3_naics], ignore_index=True)
# Keep only rows whose NAICS label contains at least three consecutive digits
df3 = df3[df3.NAICS.map(lambda x: len(re.findall('[0-9][0-9][0-9]', x))>0)]

Cleanedd_df3 = Clean_Raw_Data(df3)
Cleanedd_df3.head()





# Read the five 4-digit NAICS files and stack them into one frame
files_4_naics = ['RTRA_Employ_4NAICS_00_05.csv', 'RTRA_Employ_4NAICS_06_10.csv',  'RTRA_Employ_4NAICS_11_15.csv', 'RTRA_Employ_4NAICS_16_20.csv',  'RTRA_Emply_4NAICS_97_99.csv']
df4 = pd.concat([pd.read_csv(f) for f in files_4_naics], ignore_index=True)
# In these files the NAICS column already holds the bare code
df4.rename(columns = {'NAICS':'NAICS_Codes'},inplace = True)
df4 = df4[df4.SYEAR < 2019]
df4 = df4.astype({'NAICS_Codes':'str'})
df4.head()


Now we merge all the data frames together and check the result for missing values, data types, and so on.
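The merge code itself is not shown in this post, so the following is only one possible sketch of how such a merge could work, not the method actually used here. It assumes a mapping table shaped like df_Details (a comma-separated NAICS_Codes column) and raw rows shaped like the cleaned RTRA frames. All values and the "Example group" industry are made up.

```python
import pandas as pd

# Hypothetical mapping rows in the shape of df_Details
details = pd.DataFrame({
    "LMO_Detailed_Industry": ["Construction", "Example group"],
    "NAICS_Codes": ["23", "311 , 312"],
})

# One row per (industry, code): split the comma-separated list and explode it
mapping = details.assign(
    NAICS_Codes=details["NAICS_Codes"].str.split(",")
).explode("NAICS_Codes")
mapping["NAICS_Codes"] = mapping["NAICS_Codes"].str.strip()

# Hypothetical cleaned raw rows (invented employment values)
raw = pd.DataFrame({
    "SYEAR": [1997, 1997, 1997],
    "SMTH": [1, 1, 1],
    "NAICS_Codes": ["23", "311", "312"],
    "_EMPLOYMENT_": [1000, 200, 300],
})

# Attach the industry name to each raw row, then total employment per
# industry and month
merged = raw.merge(mapping, on="NAICS_Codes", how="inner")
output = (
    merged.groupby(["SYEAR", "SMTH", "LMO_Detailed_Industry"], as_index=False)["_EMPLOYMENT_"]
    .sum()
    .rename(columns={"_EMPLOYMENT_": "EMPLOYMENT"})
)
print(output)

# Basic checks of the kind mentioned above: missing values and column types
print(output.isna().sum())
print(output.dtypes)
```

The inner join silently drops raw rows whose code is not in the mapping, so comparing row counts before and after the merge is a reasonable sanity check.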

Exploratory Data Analysis

What are the top 20 industries?


sumOfEmployeInIndstry = Output_df.groupby('LMO_Detailed_Industry')['EMPLOYMENT'].sum().sort_values(ascending=False)
sumOfEmployeInIndstry2 = sumOfEmployeInIndstry.head(20)

plt.figure(figsize=(7,7))
sumOfEmployeInIndstry2.plot(kind = 'bar')
# In a vertical bar chart the industries are on the x-axis and employment is on the y-axis
plt.xlabel('Industry')
plt.ylabel('Employment')

How has employment in Construction evolved over time, and how does this compare to total employment across all industries?


construction = Output_df[Output_df['LMO_Detailed_Industry'] == 'Construction']
construction_A = pd.crosstab(construction.SMTH, construction.SYEAR, values = construction.EMPLOYMENT, aggfunc = 'sum')

plt.figure(figsize=(10,5))
sns.heatmap(construction_A,  cmap = 'plasma',linewidths = 0.5)
plt.title('Evolution of Employment in Construction')
construction_A

2004 marks a turning point in employment: before it, employment was under 140k, and it then increased over the years to reach 240k in 2018.
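The comparison with total employment asked for in the question can be sketched as Construction's share of the yearly total. The frame below is toy data with the same column names as Output_df; the numbers and the "Other" industry are invented for illustration.

```python
import pandas as pd

# Toy stand-in for Output_df (invented values)
toy = pd.DataFrame({
    "SYEAR": [2003, 2003, 2004, 2004],
    "LMO_Detailed_Industry": ["Construction", "Other", "Construction", "Other"],
    "EMPLOYMENT": [100, 900, 150, 850],
})

# Total employment per year across all industries
total = toy.groupby("SYEAR")["EMPLOYMENT"].sum()

# Construction employment per year
construction_only = toy[toy["LMO_Detailed_Industry"] == "Construction"]
construction_by_year = construction_only.groupby("SYEAR")["EMPLOYMENT"].sum()

# Share of total employment held by Construction in each year
share = (construction_by_year / total).rename("construction_share")
print(share)
```

Plotting `share` over the years would show whether Construction grew faster than the economy as a whole or simply grew along with it.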

Which year has the highest employment?


maxAndMinEmp = Output_df['EMPLOYMENT'].agg(['max','min'])

maxv = Output_df['EMPLOYMENT'].max()
maxv
minv = Output_df['EMPLOYMENT'].min()

Output_df[Output_df['EMPLOYMENT'] == maxv]

The maximum is in the Construction industry, with 118,000 employees in 1997.

Heritage institutions; Oil and gas extraction; and Fishing, hunting and trapping have the lowest employment of all industries.


What is the average employment for each industry over the years?


fig, ax = plt.subplots(1, 1, figsize=(12,10))
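# Note: this overwrites Output_df with its first 30 rows, so everything below uses only that subset
Output_df = Output_df.head(30)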
ax.plot(pd.crosstab(Output_df['LMO_Detailed_Industry'], [Output_df['SYEAR']], values = Output_df['EMPLOYMENT'], aggfunc='mean'))

plt.xticks(rotation='vertical')

plt.show()
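As a tabular complement to the plot, the averages can also be computed directly. This is a minimal sketch on toy data with the same column names as Output_df; the values and the "Other" industry are invented.

```python
import pandas as pd

# Toy stand-in for Output_df (invented values)
toy = pd.DataFrame({
    "LMO_Detailed_Industry": ["Construction", "Construction", "Other", "Other"],
    "SYEAR": [1997, 1998, 1997, 1998],
    "EMPLOYMENT": [100.0, 120.0, 40.0, 60.0],
})

# Mean employment per industry for each year: years as rows, industries as columns
yearly_mean = toy.pivot_table(
    index="SYEAR",
    columns="LMO_Detailed_Industry",
    values="EMPLOYMENT",
    aggfunc="mean",
)

# Mean employment per industry over the whole period, largest first
overall_mean = (
    toy.groupby("LMO_Detailed_Industry")["EMPLOYMENT"]
    .mean()
    .sort_values(ascending=False)
)
print(overall_mean)
```

With years on the index, calling `yearly_mean.plot()` would draw one line per industry over time, which is often easier to read than one line per year.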

maxAndMinEmp = Output_df['EMPLOYMENT'].agg(['max','min'])
maxAndMinEmp


max    118000.0
min      1750.0
Name: EMPLOYMENT, dtype: float64

The largest employment value is 118,000 and the smallest is 1,750.

Conclusions: The number of employees changes only slightly from year to year, with ordinary increases and decreases.

Construction is the largest industry by number of employees, at up to 120k.

Heritage institutions; Oil and gas extraction; and Fishing, hunting and trapping have the fewest employees of all industries.

