![How to Clean Data with Pandas in Three Steps](https://www.dataapplab.com/wp-content/uploads/2020/07/微信截图_20200722160952.jpg)
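How to Clean Data with Pandas in Three Steps
Data science is not all fancy charts. It is a set of tools we use to clean, explore, and model data in order to extract meaningful information about the real world. Getting real-world information starts with real data, but real data is often incomplete and messy.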
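1. Dropping variables
- Run a correlation analysis on the feature variables
- Check how many rows each feature variable is missing. If a variable is missing 90% of its data points, it is probably best to drop it entirely
- Consider the nature of the variable itself. From a practical standpoint, is this feature variable actually useful? Drop it only when you are sure it will not help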
import numpy as np
import pandas as pd
# Computing correlation coefficients between each feature and the output
x_cols = [col for col in data.columns if col not in ['output']]
for col in x_cols:
    corr_coeffs = np.corrcoef(data[col].values, data.output.values)
# Get the number of missing values in each column / feature variable
data.isnull().sum()
# Drop a feature variable (axis=1 means "drop a column")
data = data.drop('feature_name', axis=1)
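The first two checks in the list above can be combined into a small helper. This is a minimal sketch, not the original author's code: the function name `drop_sparse_columns`, the toy DataFrame, and the default threshold of 0.9 (taken from the 90% rule of thumb) are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, threshold=0.9):
    """Drop columns whose fraction of missing values is >= threshold."""
    missing_fraction = df.isnull().mean()  # per-column share of NaN values
    to_drop = missing_fraction[missing_fraction >= threshold].index
    return df.drop(columns=to_drop)


# Hypothetical toy data: "mostly_empty" is missing 19 of its 20 values
height = np.linspace(60, 79, 20)
data = pd.DataFrame({
    "height": height,
    "mostly_empty": [1.0] + [np.nan] * 19,
    "output": height * 2 + 1,  # perfectly linear in height, for illustration
})

cleaned = drop_sparse_columns(data, threshold=0.9)

# Pearson correlation of each remaining feature with the target column
correlations = cleaned.drop(columns="output").corrwith(cleaned["output"])

print(list(cleaned.columns))
print(correlations)
```

A correlation near zero is only a hint that a feature may be uninformative. The third check, whether the variable makes practical sense, still needs human judgment.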
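2. Handling missing values
- Fill the missing rows with an arbitrary value
- Fill the missing rows with a value computed from statistics of the data
- Ignore the rows with missing values
If you have settled on a good default value, you can go with the first approach. If you can derive a value through some statistical analysis, however, the second approach is usually preferable, since it is at least backed by the data. If the dataset is large enough that you can afford to lose some rows, the last option works too. Before dropping anything, take a quick look at the data to make sure those data points are not critical.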
# Filling in NaN values of a particular feature variable
avg_height = 67 # Maybe this is a good number
data["height"] = data["height"].fillna(avg_height)
# Filling in NaN values with a calculated one
avg_height = data["height"].median() # This is probably more accurate
data["height"] = data["height"].fillna(avg_height)
# Dropping rows with missing values
# Here we check which rows of "height" aren't null
# and only keep those
data = data[pd.notnull(data['height'])]
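To see how the three options differ, they can be run side by side on a tiny example. This is a sketch under assumptions: the toy heights and the default value of 67 are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: heights in inches with two missing entries
data = pd.DataFrame({"height": [60.0, np.nan, 70.0, 66.0, np.nan, 90.0]})

# Option 1: fill with a fixed default chosen by hand
filled_default = data["height"].fillna(67)

# Option 2: fill with a statistic computed from the observed values
median_height = data["height"].median()  # NaN values are skipped by default
filled_median = data["height"].fillna(median_height)

# Option 3: drop the rows where "height" is missing
dropped = data.dropna(subset=["height"])

print(filled_default.tolist())
print(filled_median.tolist())
print(len(dropped))
```

Here the median of the observed heights is 68.0 while the mean is 71.5, pulled up by the 90-inch value. That is one reason a median can be a safer fill value when outliers are present.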
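3. Formatting the data
- Standardize data formats, including acronyms, capitalization, and style
- Discretize continuous data, or vice versa
Let's demonstrate with Pandas: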
# Formatting data
data['state'] = data['state'].str.upper()  # Uppercase the whole string
data['state'] = data['state'].replace(  # Map spelling variants to one form
    to_replace=["CA", "C.A", "CALI"],
    value="CALIFORNIA")  # assumed target value; the source omits this argument
# Dates and times are quite common in large datasets
# Converting all strings to datetime objects is good standardisation practice
# Here, the data["time"] strings will look like "2019-01-15", which is exactly
# how we set the "format" argument below
# errors='coerce' turns unparseable strings into NaT ('ignore' is deprecated in recent pandas)
data["time"] = pd.to_datetime(data["time"], format='%Y-%m-%d', errors='coerce')
# Discretising continuous variables
# Height is in inches and we make a few bins with labels
data["height"] = pd.cut(data['height'],
                        bins=[0, 48, 60, 66, 72, 78, 100],
                        labels=["Super Short", "Short", "Average", "Above Average", "Tall", "Super Tall"])
# Replace string values with numbers that our
# ML models will be able to handle
mapping = {'Adult Woman': 1, 'Adult Man': 2, "Child": 3}
data["Person"] = data["Person"].replace(mapping)
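It helps to check how `pd.cut` treats the bin edges before relying on the labels. The sketch below reuses the bins from the snippet above on a few made-up heights (the sample values are an assumption) and then converts the labels into integer codes, which is one way to get from categories back to numbers.

```python
import pandas as pd

# Hypothetical sample heights in inches; 60 sits exactly on a bin edge
heights = pd.Series([45, 60, 63, 70, 75, 90], name="height")

# Bins are right-inclusive by default: (0, 48], (48, 60], (60, 66], ...
height_bins = pd.cut(
    heights,
    bins=[0, 48, 60, 66, 72, 78, 100],
    labels=["Super Short", "Short", "Average", "Above Average", "Tall", "Super Tall"],
)

# Integer codes that follow the order of the labels (0 = "Super Short")
height_codes = height_bins.cat.codes

print(height_bins.tolist())
print(height_codes.tolist())
```

Because the intervals are closed on the right, a height of exactly 60 is labelled "Short" rather than "Average". Values outside the outermost edges (0 and 100 here) would become NaN.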
Original author: George Seif
Translator: Yi Chen