amrali150
Apr 10, 2022 · 2 min read
import numpy as np
import pandas as pd

string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data
string_data.isnull()
Output is:
0    False
1    False
2     True
3    False
dtype: bool
The built-in Python None value is also treated as NA in object arrays:
string_data[0] = None
string_data.isnull()
Output is:
0     True
1    False
2     True
3    False
dtype: bool
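As a quick check of the statement above, the following sketch builds a small object Series containing both None and np.nan and confirms that isnull flags each of them. The Series contents and the names values and na_mask are illustrative choices, not taken from the original example.

```python
import numpy as np
import pandas as pd

# Object-dtype Series holding a string, a Python None, and a float NaN
values = pd.Series(['aardvark', None, np.nan], dtype=object)

# isnull marks both None and NaN as missing
na_mask = values.isnull()
print(na_mask.tolist())

# The top-level pd.isnull function also accepts scalars
print(pd.isnull(None), pd.isnull(np.nan))
```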
There are a few ways to filter out missing data. While you always have the option to do it by hand using pandas.isnull and boolean indexing, the dropna method can be helpful. On a Series, it returns the Series with only the non-null data and index values:
from numpy import nan as NA
data = pd.Series([1, NA, 3.5, NA, 7])
data.dropna()
Output is:
0    1.0
2    3.5
4    7.0
dtype: float64
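The by-hand alternative mentioned earlier can be sketched as follows. This is a minimal illustration on the same values as the Series above; the names by_hand and with_dropna are placeholders chosen for this sketch.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 3.5, np.nan, 7])

# Boolean indexing: keep only the entries that pd.isnull does not flag
by_hand = data[~pd.isnull(data)]

# dropna should select exactly the same values and index labels
with_dropna = data.dropna()

print(by_hand.equals(with_dropna))
```

Both approaches keep the original index labels, so the result is not renumbered.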
With DataFrame objects, things are a bit more complex. You may want to drop rows or columns that are all NA or only those containing any NAs. dropna by default drops any row containing a missing value:
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
[NA, NA, NA], [NA, 6.5, 3.]])
cleaned = data.dropna()
cleaned
Output is:
     0    1    2
0  1.0  6.5  3.0
Passing how='all' will only drop rows that are all NA:
data.dropna(how='all')
Output is:
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0
To drop columns in the same way, pass axis=1:
data[4] = NA
data
data.dropna(axis=1, how='all')
Output is:
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
A related way to filter out DataFrame rows tends to concern time series data. Suppose you want to keep only rows containing at least a certain number of non-NA observations. You can indicate this with the thresh argument:
df = pd.DataFrame(np.random.randn(7, 3))
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA
df
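The output is a 7 × 3 frame of random values (they differ on every run) in which rows 0 to 3 of column 1 and rows 0 and 1 of column 2 are NaN.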
df.dropna()
Only rows 4 to 6 remain, because each of rows 0 to 3 contains at least one NaN.
df.dropna(thresh=2)
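Rows 2 to 6 remain: thresh=2 keeps every row with at least two non-NA values, so only rows 0 and 1, which have a single non-NA value each, are dropped.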
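Because the frame above is random, the effect of thresh is easier to verify on fixed data. The sketch below uses a small hypothetical frame (the column names a, b, c and all values are invented for illustration) and counts the non-NA values per row before applying thresh=2.

```python
import numpy as np
import pandas as pd

# Hypothetical fixed values so the result is reproducible
frame = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, 5.0, 6.0],
    "c": [np.nan, 7.0, np.nan, 8.0],
})

# Number of non-NA values in each row
non_na_per_row = frame.notnull().sum(axis=1)

# thresh=2 keeps rows that have at least two non-NA values
kept = frame.dropna(thresh=2)

print(non_na_per_row.tolist())
print(kept.index.tolist())
```

Row 0 has only one non-NA value, so it is the only row dropped.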
Rather than filtering out missing data (and potentially discarding other data along with it), you may want to fill in the “holes” in any number of ways. For most purposes, the fillna method is the workhorse function to use. Calling fillna with a constant replaces missing values with that value:
df.fillna(0)
The output is the same frame with every NaN replaced by 0.0.
Calling fillna with a dict, you can use a different fill value for each column:
df.fillna({1: 0.5, 2: 0})
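In the output, the NaNs in column 1 are replaced by 0.5 and those in column 2 by 0.0; column 0 has no missing values and is unchanged.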
fillna returns a new object by default, but passing inplace=True modifies the existing object instead:
_ = df.fillna(0, inplace=True)
df
df itself now contains 0.0 wherever it previously had NaN.
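The fillna variants described above can be checked on a small fixed frame. This is a sketch under assumed data: the column names x and y and the fill values are hypothetical, chosen only to make the result easy to read.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "x": [1.0, np.nan, 3.0],
    "y": [np.nan, 5.0, np.nan],
})

# One constant for every column
filled_const = frame.fillna(0)

# A different fill value per column, keyed by column label
filled_dict = frame.fillna({"x": -1.0, "y": 0.5})

# Without inplace=True the original frame keeps its NaNs
print(frame.isnull().sum().sum())
print(filled_dict)
```

Keeping the default (no inplace) leaves the original data available in case a different fill strategy is needed later.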
Duplicate rows may be found in a DataFrame for any number of reasons. Here is an example:
data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
'k2': [1, 1, 2, 3, 3, 4, 4]})
data
Output is:
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
6  two   4
The DataFrame method duplicated returns a boolean Series indicating whether each row is a duplicate (has been observed in a previous row) or not:
data.duplicated()
Output is:
0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool
Relatedly, drop_duplicates returns a DataFrame containing only the rows for which duplicated is False:
data.drop_duplicates()
Output is:
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
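The relationship between the two methods can be made explicit with boolean indexing. The sketch below rebuilds the same frame and compares the two results; the names mask, by_mask, and deduped are placeholders introduced here.

```python
import pandas as pd

data = pd.DataFrame({"k1": ["one", "two"] * 3 + ["two"],
                     "k2": [1, 1, 2, 3, 3, 4, 4]})

# True for rows that repeat an earlier row
mask = data.duplicated()

# Keeping the rows where the mask is False mirrors drop_duplicates()
by_mask = data[~mask]
deduped = data.drop_duplicates()

print(by_mask.equals(deduped))
```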
To see more details visit GitHub here