pandas techniques
1) Apply function
The pandas apply function lets you pass a function and apply it to every value of a pandas Series, or to every row or column of a DataFrame. It is a convenient way to transform or segregate data according to custom conditions, which is why it is widely used in data science and machine learning.
With DataFrame.apply, the objects passed to the function are Series objects whose index is either the DataFrame’s index (axis=0) or the DataFrame’s columns (axis=1). By default (result_type=None), the final return type is inferred from the return type of the applied function.
First we import the needed libraries; this is the first step whenever we use pandas functions:
import pandas as pd
import numpy as np
Creating a DataFrame with five identical rows:
df = pd.DataFrame([[1, 2, 3]] * 5, columns=['A', 'B', 'C'])
print(df)
output:
   A  B  C
0  1  2  3
1  1  2  3
2  1  2  3
3  1  2  3
4  1  2  3
Applying a function to each column (axis=0):
df.apply(np.sum, axis=0)
output:
A 5
B 10
C 15
dtype: int64
Applying a function to each row (axis=1):
df.apply(np.sum, axis=1)
output:
0 6
1 6
2 6
3 6
4 6
dtype: int64
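Beyond built-in reducers like np.sum, apply also accepts your own functions. The following is a minimal sketch, assuming a small hypothetical scores table (the column names and thresholds are made up for illustration), showing a value-wise apply on a Series and row-wise and column-wise apply on a DataFrame:

```python
import pandas as pd

# Hypothetical data, not from the example above.
scores = pd.DataFrame({"math": [40, 75, 90], "art": [65, 55, 85]})

# Series.apply calls the function once per value.
grade = scores["math"].apply(lambda v: "pass" if v >= 50 else "fail")

# axis=1: each row arrives as a Series indexed by the column names.
best_subject = scores.apply(lambda row: row.idxmax(), axis=1)

# axis=0: each column arrives as a Series indexed by the row labels.
spread = scores.apply(lambda col: col.max() - col.min(), axis=0)

print(grade.tolist())         # ['fail', 'pass', 'pass']
print(best_subject.tolist())  # ['art', 'math', 'math']
print(spread.to_dict())       # {'math': 50, 'art': 30}
```

For simple arithmetic, a vectorized expression is usually faster than apply; apply is most useful when the logic does not map onto a built-in operation.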
2) Boolean indexing
Boolean indexing is used when you want to filter the values of a column based on conditions on another set of columns. For instance, suppose we want a list of all students who are not scholars and who got a loan. Boolean indexing handles this directly.
In Python, False equals 0 and True equals 1, so booleans behave like integers in comparisons and arithmetic:
0==False
output:
True
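Because of this, the results of several comparisons can be added together to count how many are true:
c = 10
(c > 1) + (c < 20) + (c == 12)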
output:
2
A boolean value can also be used as an index into a list or tuple (False selects position 0, True selects position 1):
state = True
state = (True , False)[state]
state
output:
False
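The same idea carries over to pandas: a comparison on a column yields a boolean Series, and indexing with it keeps only the rows where it is True. Here is a minimal sketch of the students example mentioned at the start of this section, assuming a hypothetical table whose column names (name, scholar, loan) are invented for illustration:

```python
import pandas as pd

# Hypothetical student records; names and columns are illustrative only.
students = pd.DataFrame({
    "name": ["Ana", "Ben", "Cleo", "Dev"],
    "scholar": [True, False, False, True],
    "loan": [True, True, False, False],
})

# ~ negates a boolean Series and & combines two of them element-wise.
mask = ~students["scholar"] & students["loan"]

# Indexing with the mask keeps only the rows where it is True.
non_scholars_with_loan = students.loc[mask, "name"]

print(mask.tolist())                    # [False, True, False, False]
print(non_scholars_with_loan.tolist())  # ['Ben']
```

Note that pandas masks are combined with &, | and ~ rather than and, or and not. When a condition is itself a comparison (for example students["age"] > 18, with a hypothetical age column), wrap it in parentheses before combining.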
3) Is null function
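pd.isna (pd.isnull is an alias for it) detects missing values in a scalar or array-like object.
It takes a scalar or array-like object and indicates whether values are missing: NaN in numeric arrays, None or NaN in object arrays, and NaT in datetime-like arrays.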
import pandas as pd
import numpy as np
pd.isna('dog')
output:
False
Is pd.NA a null value?
pd.isna(pd.NA)
output:
True
Is np.nan a null value?
pd.isna(np.nan)
output:
True
Construct an array with null values and pass it to the isna function:
array = np.array([[1, np.nan, 3], [4, 5, np.nan]])
pd.isna(array)
output:
array([[False,  True, False],
       [False, False,  True]])
Construct a DataFrame and pass it to the isna function:
df = pd.DataFrame([['ant', 'bee', 'cat'], ['dog', None, 'fly']])
pd.isna(df)
output:
       0      1      2
0  False  False  False
1  False   True  False
Check a single column of the DataFrame:
pd.isna(df[1])
output:
0 False
1 True
Name: 1, dtype: bool
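In practice the boolean result of isna is rarely the end goal; it is usually summed or used as a filter. The sketch below reuses the same small DataFrame and shows one common way to count missing values per column and to separate incomplete rows from complete ones (the variable names are just illustrative):

```python
import pandas as pd

df = pd.DataFrame([["ant", "bee", "cat"], ["dog", None, "fly"]])

# True counts as 1, so summing the mask counts missing values per column.
missing_per_column = df.isna().sum()

# any(axis=1) flags rows that contain at least one missing value.
rows_with_missing = df[df.isna().any(axis=1)]

# notna is the inverse of isna; all(axis=1) keeps fully populated rows.
complete_rows = df[df.notna().all(axis=1)]

print(missing_per_column.to_dict())      # {0: 0, 1: 1, 2: 0}
print(rows_with_missing.index.tolist())  # [1]
print(complete_rows.index.tolist())      # [0]
```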
4) Get dummies
get_dummies converts a categorical variable into dummy/indicator variables: one column per category, marking which category each row belongs to.
Convert a Series of letters into dummy variables:
import pandas as pd
import numpy as np
s = pd.Series(list('abca'))
pd.get_dummies(s)
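output (pandas 2.x prints booleans; older versions print 1 and 0 instead):
       a      b      c
0   True  False  False
1  False   True  False
2  False  False   True
3   True  False  False
Convert a list that contains a missing value; by default NaN gets no column of its own: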
s1 = ['a', 'b', np.nan]
pd.get_dummies(s1)
output:
       a      b
0   True  False
1  False   True
2  False  False
With dummy_na=True, the null values get their own column when converting to dummy variables:
pd.get_dummies(s1, dummy_na=True)
output:
       a      b    NaN
0   True  False  False
1  False   True  False
2  False  False   True
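get_dummies is most often applied to a column of a DataFrame rather than to a bare list. The following is a minimal sketch, assuming a hypothetical table with a color column and a price column; it encodes only the categorical column and passes dtype=int so the indicators come out as 0/1 regardless of the pandas version:

```python
import pandas as pd

# Hypothetical data: one categorical column and one numeric column.
df = pd.DataFrame({"color": ["red", "blue", "red"], "price": [10, 20, 15]})

# Encode only "color"; the new columns are named <prefix>_<category>.
encoded = pd.get_dummies(df, columns=["color"], prefix="color", dtype=int)

print(list(encoded.columns))         # ['price', 'color_blue', 'color_red']
print(encoded["color_red"].tolist())   # [1, 0, 1]
print(encoded["color_blue"].tolist())  # [0, 1, 0]
```

The untouched columns are kept in front, and the dummy columns follow in sorted category order.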
5) Cut function
Use cut when you need to segment and sort data values into bins. This function is also useful for going from a continuous variable to a categorical variable. For example, cut could convert ages to groups of age ranges. It supports binning into an equal number of bins or into a pre-specified array of bins.
The first argument is the input array to be binned; it must be 1-dimensional.
Discretize into three equal-width bins:
import pandas as pd
import numpy as np
pd.cut(np.array([1, 7, 5, 4, 6, 3]), 3)
output:
[(0.994, 3.0], (5.0, 7.0], (3.0, 5.0], (3.0, 5.0], (5.0, 7.0], (0.994, 3.0]]
Categories (3, interval[float64, right]): [(0.994, 3.0] < (3.0, 5.0] < (5.0, 7.0]]
Discretize into three equal-width bins and also return the bin edges (retbins=True):
pd.cut(np.array([1, 7, 5, 4, 6, 3]), 3, retbins=True)
output:
([(0.994, 3.0], (5.0, 7.0], (3.0, 5.0], (3.0, 5.0], (5.0, 7.0], (0.994, 3.0]]
Categories (3, interval[float64, right]): [(0.994, 3.0] < (3.0, 5.0] < (5.0, 7.0]],
array([0.994, 3. , 5. , 7. ]))
Assign custom labels to the three bins instead of interval notation:
pd.cut(np.array([1, 7, 5, 4, 6, 3]), 3, labels=["bad", "medium", "good"])
output:
['bad', 'good', 'medium', 'medium', 'good', 'bad']
Categories (3, object): ['bad' < 'medium' < 'good']
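The examples above let cut choose the bin edges. For the age-range use case mentioned at the start of this section, you would typically pass the edges yourself. Here is a minimal sketch, assuming hypothetical ages and arbitrarily chosen cut-offs (17 and 64) and group names:

```python
import pandas as pd

# Hypothetical ages; the edges and labels below are illustrative choices.
ages = pd.Series([5, 17, 18, 34, 64, 65, 80])

# Bins are right-inclusive by default: (0, 17], (17, 64], (64, 120].
groups = pd.cut(ages, bins=[0, 17, 64, 120],
                labels=["child", "adult", "senior"])

print(groups.tolist())
# ['child', 'child', 'adult', 'adult', 'adult', 'senior', 'senior']
```

Because the intervals are closed on the right, age 17 falls in the first bin and age 64 in the second. Passing right=False would instead make each bin include its left edge.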