Pandas Techniques for Data Manipulation in Python

apply function

DataFrame.apply passes each column (or each row) of a DataFrame to a function and collects the results, while Series.apply calls the function on every value of a single column. Using apply instead of an explicit loop keeps the code shorter and easier to read.

DataFrame.apply(func, axis=0, raw=False, result_type=None, args=(), **kwargs)

func: the function to apply to each column or row.

axis (0 or 'index', 1 or 'columns'; default 0): with 0 or 'index' the function is applied to each column; with 1 or 'columns' it is applied to each row.

raw (default False): determines whether each row or column is passed as a Series or as an ndarray object. If raw=False, each row or column is passed to the function as a Series; if raw=True, the function receives ndarray objects instead.

args: tuple of positional arguments to pass to func in addition to the array/Series.

**kwargs: additional keyword arguments to pass to func.
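To make the parameters described above concrete, here is a minimal sketch on a small made-up frame. The helper scale_and_shift and its parameter names (factor, offset) are hypothetical and exist only for illustration.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame([[2, 3, 4, 5]] * 3, columns=["A", "B", "C", "D"])

# Hypothetical helper: `factor` arrives through args, `offset` through **kwargs.
def scale_and_shift(values, factor, offset=0):
    return values * factor + offset

# Default axis=0 and raw=False: each column is passed to the function as a Series.
scaled = demo.apply(scale_and_shift, args=(10,), offset=1)

# raw=True: the function receives a plain NumPy array instead of a Series.
is_array = demo.apply(lambda col: isinstance(col, np.ndarray), raw=True)

# axis=1: each row is passed; here we compute max - min for every row.
row_range = demo.apply(lambda row: row.max() - row.min(), axis=1)

print(scaled)
print(is_array)
print(row_range)
```

Passing raw=True can be faster when the function is a pure NumPy reduction, because no Series object has to be built for each column or row.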


import pandas as pd
import numpy as np
df3 = pd.DataFrame([[2,3,4,5]] * 3, columns=['A', 'B','C','D'])
df3


df4 = df3.apply(np.sqrt)

# Sum each column on its own, so you get 4 values
df3.apply(np.sum, axis=0)
A     6
B     9
C    12
D    15
dtype: int64
# Sum each row, so you get 3 values (one per row)
df3.apply(np.sum, axis=1)
0    14
1    14
2    14
dtype: int64

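# dfi is assumed to be the third-party dataframe_image package
import dataframe_image as dfi

df = pd.read_csv('traffic.csv')
df.head()
# Save the first rows of the table as an image
dfi.export(df.head(), 'df.png')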

def use_apply(i):
    # Map the gender code to a readable label; anything else is "NotFound"
    j = "NotFound"
    if i == 'M':
        j = "Male"
    elif i == "F":
        j = "Female"
    return j

result = df['driver_gender'].apply(use_apply)
result
0          Male
1          Male
2          Male
3          Male
4        Female
          ...  
91736    Female
91737    Female
91738      Male
91739    Female
91740      Male
Name: driver_gender, Length: 91741, dtype: object

As the output above shows, every value in the driver_gender column has been replaced according to the function.
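For a simple one-to-one recoding like this, a dictionary lookup with Series.map is a common alternative to writing a function. This is a different technique from apply, shown only for comparison. The sketch uses a small made-up Series in place of the driver_gender column, since the contents of traffic.csv are not reproduced here.

```python
import pandas as pd

# Made-up stand-in for df['driver_gender'] (the real data is not shown here).
genders = pd.Series(["M", "F", "M", None, "X"], name="driver_gender")

labels = {"M": "Male", "F": "Female"}

# map() looks each value up in the dictionary; values without an entry become
# NaN, which fillna() replaces with the same default that use_apply() returns.
result = genders.map(labels).fillna("NotFound")
print(result.tolist())
```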


pandas.DataFrame.agg

This function lets you run several aggregation operations in a single call, which shortens your code.


DataFrame.agg(func=None, axis=0, *args, **kwargs)

func: the function to apply. It can also be a function name given as a string, a list of functions, or a dictionary that maps column names to functions.

axis: 0 (the default) applies the function to each column; 1 applies it to each row.

*args: positional arguments to pass to func.

**kwargs: keyword arguments to pass to func.
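The different forms of func and the axis parameter can be sketched as follows. The frame demo is made up for illustration.

```python
import pandas as pd

demo = pd.DataFrame({"A": [1, 4, 7], "B": [2, 5, 8], "C": [3, 6, 9]})

# A list of names computes every function for every column.
summary = demo.agg(["min", "max", "sum"])

# A dictionary selects different functions for each column; combinations
# that were not requested show up as NaN in the result.
per_column = demo.agg({"A": ["sum", "min"], "B": ["min", "max"]})

# axis=1 aggregates across the columns of each row.
row_totals = demo.agg("sum", axis=1)

print(summary)
print(per_column)
print(row_totals)
```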



df3 = pd.DataFrame([[1, 2, 3],
                    [4, 5, 6],
                    [7, 8, 9],
                    [np.nan, np.nan, np.nan],
                    [3, 4, 5],
                    [8, 9, 6]],
                   columns=['A', 'B', 'C'])
df3.agg(['max', 'sum'])


# Pass the name 'sum' rather than the built-in sum, which newer pandas versions warn about
df3.agg('sum')
A    23.0
B    28.0
C    29.0
dtype: float64

df3.agg('min')
A    1.0
B    2.0
C    3.0
dtype: float64

Merge Dataframe


If you have two datasets and want to work with them together to get combined results, use merge.


pandas.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)

left: the first DataFrame

right: the second DataFrame

how: the type of merge; 'inner' is the default, and the other options are 'outer', 'left', 'right' and 'cross'

how='left': use only the keys from the left DataFrame

how='right': use only the keys from the right DataFrame

how='inner': use the intersection of the keys from the two DataFrames

how='outer': use the union of the keys from the two DataFrames

how='cross': build the cross product of the rows of the two DataFrames

on: the column name to join on; it must exist in both DataFrames

left_on: column or index level names to join on in the left DataFrame.

right_on: column or index level names to join on in the right DataFrame.

left_index: default False; if True, use the index of the left DataFrame as the join key

right_index: default False; if True, use the index of the right DataFrame as the join key

suffixes: suffixes appended to overlapping column names so that the columns of the two DataFrames can be told apart; the default is ('_x', '_y')
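As an illustration of left_on, right_on, suffixes and indicator, here is a minimal sketch. The employees and reviews tables and their column names are invented for this example.

```python
import pandas as pd

# Made-up tables: the key column has a different name on each side,
# and both tables contain a column called 'score'.
employees = pd.DataFrame({"emp_id": [1, 2, 3],
                          "name": ["Aya", "Lisa", "David"],
                          "score": [7, 8, 9]})
reviews = pd.DataFrame({"staff_id": [2, 3, 4],
                        "score": [70, 80, 90]})

merged = pd.merge(
    employees, reviews,
    how="outer",                     # keep keys from both sides
    left_on="emp_id",                # key column in the left table
    right_on="staff_id",             # key column in the right table
    suffixes=("_emp", "_review"),    # rename the overlapping 'score' columns
    indicator=True,                  # add a '_merge' column with each row's origin
)
print(merged)
```

The _merge column reports left_only, right_only or both for each row, which is handy for checking which keys failed to match.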


df4 = pd.DataFrame({'dfk': ['foo', 'bar', 'baz', 'foo'],
                    'value': [1, 2, 3, 5]})
df5 = pd.DataFrame({'dfk': ['foo', 'bar', 'baz', 'foo'],
                    'value': [5, 6, 7, 8]})
pd.merge(df4, df5, on='dfk')  # inner join: the intersection of the keys


pd.merge(df4, df5, how='outer', on='dfk')

df4.merge(df5, how='cross')  # cross product
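Because the key 'foo' appears twice in each frame, these merges return more rows than either input. The sketch below rebuilds the same two frames under the names left and right (chosen here only to keep the block self-contained) and checks the row counts.

```python
import pandas as pd

left = pd.DataFrame({"dfk": ["foo", "bar", "baz", "foo"],
                     "value": [1, 2, 3, 5]})
right = pd.DataFrame({"dfk": ["foo", "bar", "baz", "foo"],
                      "value": [5, 6, 7, 8]})

inner = pd.merge(left, right, on="dfk")
cross = left.merge(right, how="cross")

# 'foo' occurs twice on each side, so it yields 2 x 2 = 4 rows;
# 'bar' and 'baz' yield one row each, for 6 rows in total.
print(len(inner))            # 6
# A cross join pairs every left row with every right row: 4 x 4 = 16 rows.
print(len(cross))            # 16
# The overlapping non-key column gets the default suffixes.
print(list(inner.columns))   # ['dfk', 'value_x', 'value_y']
```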

pandas.isnull

Detects missing values: returns True for each cell that is NaN (or None) and False otherwise.

df3.isnull()
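The Boolean mask is usually combined with sum or any to count or filter missing values. A minimal sketch on a small made-up frame:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame([[1, 2, 3],
                     [np.nan, np.nan, np.nan],
                     [3, np.nan, 5]],
                    columns=["A", "B", "C"])

mask = demo.isnull()

# True counts as 1, so summing the mask counts missing cells per column.
missing_per_column = mask.sum()

# any(axis=1) flags rows that contain at least one missing value.
rows_with_missing = demo[mask.any(axis=1)]
complete_rows = demo[~mask.any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
print(complete_rows)
```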

pandas.unique(values)

Returns the unique values in order of first appearance; the result is not sorted.

# Note that the output is not sorted
pd.unique(pd.Series([4, 5, 7, 8, 9, 99, 4, 5, 4, 2, 33]))
array([ 4,  5,  7,  8,  9, 99,  2, 33], dtype=int64)
# Wrapped in a Series: newer pandas versions warn when a plain list is passed
pd.unique(pd.Series([("m", "n"), ("z", "x"), ("n", "v"), ("z", "x")]))
# Note that (a, b) != (b, a)
array([('m', 'n'), ('z', 'x'), ('n', 'v')], dtype=object)
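To see the ordering difference in practice, the following sketch compares pd.unique with NumPy's np.unique, which does sort its result. The comparison with NumPy is added here for illustration only.

```python
import numpy as np
import pandas as pd

values = pd.Series([4, 5, 7, 8, 9, 99, 4, 5, 4, 2, 33])

in_order = pd.unique(values)                  # order of first appearance
also_in_order = values.unique()               # the Series method behaves the same way
sorted_unique = np.unique(values.to_numpy())  # NumPy returns sorted values

print(in_order)
print(sorted_unique)
```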


melt in pandas

Used to reshape data from wide format to long format.

m = {"Name": ["Aya", "Lisa", "David"], "ID": [1, 2, 3], "Role": ["CEO", "Editor", "Author"]}

df = pd.DataFrame(m)

print(df)
print('\n________________________________________\n')

df_melted = pd.melt(df, id_vars=["ID"], value_vars=["Name", "Role"])

print(df_melted)
    Name  ID    Role
0    Aya   1     CEO
1   Lisa   2  Editor
2  David   3  Author

________________________________________

   ID variable   value
0   1     Name     Aya
1   2     Name    Lisa
2   3     Name   David
3   1     Role     CEO
4   2     Role  Editor
5   3     Role  Author

We can use pivot to unmelt the DataFrame.

m = {"Name": ["Aya", "Lisa", "David"], "ID": [1, 2, 3], "Role": ["CEO", "Editor", "Author"]}

df = pd.DataFrame(m)

print(df)
print('\n________________________________________\n')

melted = pd.melt(df, id_vars=["ID"], value_vars=["Name", "Role"], var_name="Attribute", value_name="Value")

print(melted)
print('\n________________________________________\n')

# unmelting using pivot()

unmelted = melted.pivot(index='ID', columns='Attribute')

print(unmelted)
    Name  ID    Role
0    Aya   1     CEO
1   Lisa   2  Editor
2  David   3  Author

________________________________________

   ID Attribute   Value
0   1      Name     Aya
1   2      Name    Lisa
2   3      Name   David
3   1      Role     CEO
4   2      Role  Editor
5   3      Role  Author

________________________________________
           Value        
Attribute   Name    Role
ID                      
1            Aya     CEO
2           Lisa  Editor
3          David  Author
unmelted = unmelted['Value'].reset_index()
unmelted.columns.name = None
print(unmelted)
   ID   Name    Role
0   1    Aya     CEO
1   2   Lisa  Editor
2   3  David  Author
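The same round trip can be written more compactly. In this sketch, passing values="Value" to pivot avoids the extra column level, so the result does not need to be indexed by 'Value' afterwards. The variable name restored is chosen only for this example.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Aya", "Lisa", "David"],
                   "ID": [1, 2, 3],
                   "Role": ["CEO", "Editor", "Author"]})

melted = pd.melt(df, id_vars=["ID"], value_vars=["Name", "Role"],
                 var_name="Attribute", value_name="Value")

restored = (
    melted.pivot(index="ID", columns="Attribute", values="Value")
          .reset_index()               # turn the ID index back into a column
          .rename_axis(None, axis=1)   # drop the leftover 'Attribute' label
)
print(restored)
```

Note that pivot requires each (ID, Attribute) pair to be unique; with duplicates it raises an error, and pivot_table with an aggregation function is the usual fallback.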

