top of page
learn_data_science.jpg

Data Scientist Program

 

Free Online Data Science Training for Complete Beginners.
 


No prior coding knowledge required!

TIME SERIES ANALYSIS OF NAICS


NAICS

The North American Industry Classification System (NAICS) is the standard used by Federal statistical agencies in classifying business establishments for the purpose of collecting, analyzing, and publishing statistical data related to the U.S., Canada, and Mexico business economies. NAICS is intended to give similar definitions of the three countries' industrial structure as well as a common statistics framework to aid in the analysis of their economies.


HOW IT WORKS

A hierarchical structure is used by NAICS. The link of one thing to a specific category is known as a "hierarchy." It is as follows:

  • Sector: 2-digit code

    • Subsector: 3-digit code

      • Industry Group: 4-digit code

        • NAICS Industry: 5-digit code

          • National Industry: 6-digit code

For example, a 2-digit NAICS industry (e.g., 23 - Construction) is

composed of some 3-digit NAICS industries (236 - Construction of buildings, 237 - Heavy and civil engineering construction, and a few more 3-digit NAICS industries). Similarly, a 3-digit NAICS industry (e.g., 236 - Construction of buildings), is composed of 4-digit NAICS industries (2361 - Residential building construction and 2362 - Non-residential building construction).


DATA USED

  1. Raw Data:15 csv files beginning with RTRA. These files contain employment data by industry at different levels of aggregation, 2-digit NAICS, 3-digit NAICS, and 4-digit NAICS. The columns are as follows: SYEAR(Survey Year), SMTH(Survey Month), NAICS(Industry name and associated NAICS code in the bracket ), and _EMPLOYMENT_(Employment).

  2. Data Output Template: An excel file with an empty column for employment. You should fill the empty column with the data you prepared from your analysis.

  3. LMO Detailed Industries by NAICS: An excel file for mapping the RTRA data to the desired data. The first column of this file has a list of 59 industries that are frequently used. The second column has their NAICS definitions. Using these NAICS definitions and RTRA data, you would create a monthly employment data series from 1997 to 2018 for these 59 industries.

WORKING WITH THE DATA

The analysis of the data was done in Google Collaboratory.


IMPORTING NECESSARY LIBRARIES

Importing all the necessary libraries needed in the analysis of the data.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline 

UPLOADING ALL THE NECESSARY FILES INTO COLLAB

Uploading all the files needed for the analysis from my local computer onto Google Collaboratory.

from google.colab import files
files.upload()

Output of code:


LOADING, CHECKING, AND CLEANING THE LMO DETAILED INDUSTRIES BY NAICS

Loading the LMO_Detailed_Industries_by_NAICS.xlsx file and replacing the '&' sign with ',' sign. This is to make analysis simpler.

#Loading the contents of LMO_Detailed_Industries_by_NAICS
df_lmo_details = pd.read_excel('LMO_Detailed_Industries_by_NAICS.xlsx')
#Displaying the first 5 rows
df_lmo_details.head()

Output:

The presence of the '&' sign must be changed to ','.

#Converting to string and changing & to ,
df_lmo_details.NAICS = df_lmo_details.NAICS.astype(str).str.replace('&', ',')
#Printing first 5 rows to check if it is changed
df_lmo_details.head()

Output:

'&' replaced with ','.


LOADING DATA OUTPUT TEMPLATE

The given final output for the data that will be analyzed.

#Loading the contents of Data_Output_Template
df_output_tem = pd.read_excel('Data_Output_Template.xlsx')
#Displaying the first 5 rows
df_output_tem.head()

Output:

It has 4 main columns with the Employment column having no values.


LOADING AND CLEANING RTRA 2-DIGIT NAICS

Since there are 5 different files of 2-digit NAICS, it is prudent to find an easy way to automate that. Creating a list for all of them and looping through to load them then appending them to each other to form a single file will be more expedient.

#Making a list to house all the RTRA_Employ_2NAICS data
rtra_2_list = ['RTRA_Employ_2NAICS_00_05.csv', 'RTRA_Employ_2NAICS_06_10.csv', 'RTRA_Employ_2NAICS_11_15.csv', 'RTRA_Employ_2NAICS_16_20.csv']
#Loading the data from 1997 to 1999
df_2 = pd.read_csv('RTRA_Employ_2NAICS_97_99.csv')

#For loop to join all the files together
for files in rtra_2_list:
  df = pd.read_csv(files)
  df_2 = df_2.append(df, ignore_index=True)

#Displaying first 5 rows of the dataset
df_2.head()

Output:

Data successfully loaded.

It can be seen that the NAICS column has both strings and numbers and hence the need to divide them into words and numbers.

Splitting the NAICS column into NAICS and NAICS_CODE

#Splitting the NAICS column into NAICS and NAICS code at the [ index
df_2[['NAICS', 'NAICS_CODE']] = df_2['NAICS'].astype(str).str.split('[', expand=True)
#Displaying the first 5 rows
df_2.head()

Output:

NAICS_CODE column is produced but has a bracket which needs not to be there.

Getting ride of the last bracket and replacing '-' with ','.

#Getting rid of the last bracket the replacing - with ,
df_2['NAICS_CODE'] = df_2['NAICS_CODE'].astype(str).str.strip(']').replace('-', ',')
#Displaying the first 5 rows to check if the change has been effected
df_2.head()

Output:

Final output has the bracket removed as shown below.

There is the need to merge the cleaned 2-digit NAICS dataset with LMO detail dataset and then get rid of the missing rows.

#Merging the 2 digit NAICS dataset with the LMO Details dataset
df_2_lmo = df_2.merge(df_lmo_details, left_on='NAICS_CODE', right_on='NAICS', how='left')

#Displaying the first 10 rows
df_2_lmo.head(10)

Output:

Merged the two datasets.

The above dataset has NAICS_x and NAICS_y which need to be dropped.

#Dropping the NAICS_x and NAICS_y columns
df_2_lmo = df_2_lmo.drop(columns=['NAICS_x', 'NAICS_y'])

#Displaying the first 10 rows again
df_2_lmo.head(10)

Output:

There are some NaN values in the LMO_Detailed_Industry column which need to be dealt with and also a better indexing.

The index must be changed into the year and month format.

combined= df_2_lmo.SMTH.astype('str').str.cat(df_2_lmo.SYEAR.astype('str'),sep=' ')
df_2_lmo['DATE']= pd.to_datetime(combined).dt.strftime('%Y-%m')
df_2_lmo.set_index('DATE',inplace=True)

#Dropping missing values
df_2_lmo = df_2_lmo.dropna()
df_2_lmo

Output:

This is the final output for 2-digit NAICS dataset. The above processes will be carried out for 3-digits NAICS and 4-digits NAICS.

The 3-digits NAICS final output is:

The 4-digit NAICS final output is:

Afterwards, all the 3 datasets are joined to give the final data for analysis.

#Adding the datasets together
df_2_3 = df_2_lmo.append(df_3_lmo)
df_naics_set = df_2_3.append(df_4_lmo)
df_naics_set

Output:

Since the instruction wanted analysis from 1997 to 2018, it is best to choose these and omit the 2019 and other year.

#Choosing rows with indices less than 2019
df_naics_97_18 = df_naics_set[(df_naics_set.index).astype(str) < '2019']

#Displaying some of the rows
df_naics_97_18

Output:

Now loading of the data output template and making a date column as index.

#Forming right date format from SMTH and SYEAR and making it the index of the dataset
month_year= df_output_tem.SMTH.astype('str').str.cat(df_output_tem.SYEAR.astype('str'),sep=' ')
df_output_tem['DATE']= pd.to_datetime(month_year).dt.strftime('%Y-%m')
df_output_tem.set_index('DATE',inplace=True)

df_output_tem

Output:

Now, it is best to merge data from 1997 to 2018 with the data output template.

#Merging datasets together
final_data = df_output_tem.merge(df_naics_97_18, left_on=['DATE','LMO_Detailed_Industry'], right_on=['DATE','LMO_Detailed_Industry'], how='left')
#Displaying dataset
final_data

Output:

Now there is a need to drop the SYEAR_y and SMTH_y. And rename SYEAR_x and SMTH_x to SYEAR and SMTH respectively.

#Dropping SYEAR_y and SMTH_y 
final_data = final_data.drop(columns=['SYEAR_y', 'SMTH_y'], axis=1)

#Fusing _EMPLOYMENT_ into Employment
final_data['Employment'] = final_data._EMPLOYMENT_

#Dropping _EMPLOYMENT_ column
final_data = final_data.drop('_EMPLOYMENT_', axis=1)

#Renaming SYEAR_x and SMTH_x columns to SYEAR and SMTH
final_data.rename(columns={'SYEAR_x': 'SYEAR', 'SMTH_x':'SMTH'},inplace=True)

final_data

Output:

Now we check for NaN values.

#Checking for missing data
final_data.isnull().sum()

Output:

Since there are null values. It is best to take care of them. We will fill them with 0s.

#Filling NAN values with 0
final_data['Employment']=final_data['Employment'].fillna(0).astype(int)
final_data['NAICS_CODE']=final_data['NAICS_CODE'].fillna(0).astype(int)

final_data

The final data output is:


INSIGHTS FROM DATA

As a data scientist trying to acquire the best information from this data for my company, I will channel my findings into giving the answers for 5 basic questions. The questions are as follows:

  1. How employment in construction evolved over time and how this compares to the total employment across all countries?

  2. How is the total employment of each industry from 1997 to 2018?

  3. How is the trend of employment in the real estate rental and leasing industry?

  4. What are the top 10 employment industries?

  5. What has the trend been over the years in the hospital industry?

HOW EMPLOYMENT IN CONSTRUCTION INDUSTRY HAS EVOLVED

To check the evolution of employment of the construction industry, it is prudent to take only the construction industry into consideration.

#Taking only rows with Construction as LMO_Detailed_Industry
construction = final_data[final_data['LMO_Detailed_Industry'] == 'Construction']

#Displaying the columns
construction

Output: