[Python for Data Science] Pandas Basics Cheat Sheet (2023)

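The Pandas library is one of the most powerful and effective libraries in Python. It is built on NumPy and provides easy-to-use data structures and data analysis tools for the Python programming language. If you want to read more about Python, see these articles:
How do you do motion detection with Python?
TabPy: Tableau and Python join forces!
A few lines of Python code to make your mailbox automated and smart!
Python 3.11 is fresh out: come see the highlights!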

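Read on for an overview of the features and tools that Pandas provides.

Pandas data structures
Dropping
Sorting & ranking
Retrieving Series/DataFrame information
DataFrame statistics
Selection
Applying functions
Data alignment
Input/output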

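Pandas data structures

The Pandas library is built around two main data structures. The first is a one-dimensional array called a Series, and the second is a two-dimensional table called a DataFrame.

Series: a one-dimensional labeled array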

>>> import pandas as pd
>>> s = pd.Series([3, -5, 7, 4], index=['a', 'b', 'c', 'd'])
>>> s
a    3
b   -5
c    7
d    4

DataFrame: a two-dimensional labeled data structure

>>> data = {'Country': ['Belgium', 'India', 'Brazil'], 'Capital': ['Brussels', 'New Delhi', 'Brasilia'], 'Population': ['111907', '1303021', '208476']}
>>> df = pd.DataFrame(data, columns=['Country', 'Capital', 'Population'])
>>> df
   Country    Capital Population
0  Belgium   Brussels     111907
1    India  New Delhi    1303021
2   Brazil   Brasilia     208476
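
The Population values in this example are quoted, so Pandas stores them as strings rather than numbers. As a minimal sketch, assuming you want to do numeric work on that column, it can be converted with astype. The name df_numeric is introduced here only for illustration.

```python
import pandas as pd

data = {
    'Country': ['Belgium', 'India', 'Brazil'],
    'Capital': ['Brussels', 'New Delhi', 'Brasilia'],
    'Population': ['111907', '1303021', '208476'],  # strings, as in the example
}
df = pd.DataFrame(data, columns=['Country', 'Capital', 'Population'])

# Work on a copy so the original df keeps its string column.
df_numeric = df.copy()
df_numeric['Population'] = df_numeric['Population'].astype(int)

# The converted column supports arithmetic and numeric comparisons.
total_population = df_numeric['Population'].sum()
print(total_population)
```

Keeping the conversion on a copy means the rest of the cheat sheet, which displays the string column, still applies to df unchanged.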

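Dropping

In this section you will learn how to remove specific values from a Series and how to remove columns or rows from a DataFrame.

The s and df defined above are used as the example Series and DataFrame in this section.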

>>> s
a    3
b   -5
c    7
d    4
>>> df
   Country    Capital Population
0  Belgium   Brussels     111907
1    India  New Delhi    1303021
2   Brazil   Brasilia     208476
· Drop values from rows (axis=0)
>>> s.drop(['a', 'c'])
b   -5
d    4
· Drop values from columns (axis=1)
>>> df.drop('Country', axis=1)
     Capital Population
0   Brussels     111907
1  New Delhi    1303021
2   Brasilia     208476
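
One detail worth checking for yourself is that drop returns a new object and leaves the original untouched. The sketch below illustrates this with a reduced version of the example data; the variable names s_dropped, df_no_country, and df_no_first_row are made up for this illustration.

```python
import pandas as pd

s = pd.Series([3, -5, 7, 4], index=['a', 'b', 'c', 'd'])
df = pd.DataFrame({
    'Country': ['Belgium', 'India', 'Brazil'],
    'Capital': ['Brussels', 'New Delhi', 'Brasilia'],
})

# drop() returns a new object; s and df themselves are left unchanged.
s_dropped = s.drop(['a', 'c'])
df_no_country = df.drop('Country', axis=1)  # equivalent to df.drop(columns='Country')
df_no_first_row = df.drop(0, axis=0)        # drop the row labeled 0

print(list(s_dropped.index))
print(list(df_no_country.columns))
print(list(df_no_first_row.index))
```

To keep the result, assign it back to a name, for example s = s.drop(['a', 'c']).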

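Sorting & ranking

In this section you will learn how to sort a DataFrame by index or by column, and how to rank column values.

The df defined above is used as the example DataFrame in this section.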

>>> df
   Country    Capital Population
0  Belgium   Brussels     111907
1    India  New Delhi    1303021
2   Brazil   Brasilia     208476
· Sort by labels along an axis
>>> df.sort_index()
   Country    Capital Population
0  Belgium   Brussels     111907
1    India  New Delhi    1303021
2   Brazil   Brasilia     208476
· Sort by values along an axis
>>> df.sort_values(by='Country')
   Country    Capital Population
0  Belgium   Brussels     111907
2   Brazil   Brasilia     208476
1    India  New Delhi    1303021
· Assign ranks to entries
>>> df.rank()
   Country  Capital  Population
0      1.0      2.0         1.0
1      3.0      3.0         2.0
2      2.0      1.0         3.0
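
In the rank output for this example, Population is ranked as text, so '1303021' sorts before '208476'. The sketch below shows how the picture changes under the assumption that Population is stored as integers instead; the names by_population and population_rank are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['Belgium', 'India', 'Brazil'],
    'Population': [111907, 1303021, 208476],  # integers here, not strings
})

# Sort with the largest population first.
by_population = df.sort_values(by='Population', ascending=False)

# Rank a single numeric column; rank 1.0 is the smallest value.
population_rank = df['Population'].rank()

print(by_population)
print(population_rank)
```

With numeric values, Brazil ranks second and India third, which is the ordering most readers would expect.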

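Retrieving Series/DataFrame information

In this section you will learn how to retrieve information about a DataFrame, including its dimensions, column names, column types, and index range.

The df defined above is used as the example DataFrame in this section.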

>>> df
   Country    Capital Population
0  Belgium   Brussels     111907
1    India  New Delhi    1303021
2   Brazil   Brasilia     208476
· (rows, columns)
>>> df.shape
(3, 3)
· Describe index
>>> df.index
RangeIndex(start=0, stop=3, step=1)
· Describe DataFrame columns
>>> df.columns
Index(['Country', 'Capital', 'Population'], dtype='object')
· Info on DataFrame
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
Country       3 non-null object
Capital       3 non-null object
Population    3 non-null object
dtypes: object(3)
memory usage: 152.0+ bytes
· Number of non-NA values
>>> df.count()
Country       3
Capital       3
Population    3
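
Because count reports non-NA values, its result only differs from the row count when data is missing. The sketch below makes that visible by assuming one missing capital, which is not part of the original example.

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['Belgium', 'India', 'Brazil'],
    'Capital': ['Brussels', 'New Delhi', None],  # one missing value, for illustration only
})

n_rows, n_cols = df.shape   # (rows, columns)
column_names = list(df.columns)
non_na = df.count()         # non-NA values per column

print(n_rows, n_cols)
print(column_names)
print(non_na)
```

Here shape still reports three rows, while count reports only two values for Capital.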

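DataFrame statistics

In this section you will learn how to retrieve statistics for a DataFrame, including the sum, minimum, maximum, and mean of each column.

The df below is used as the example DataFrame in this section.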

>>> df = pd.DataFrame({'Even': [2, 4, 6], 'Odd': [1, 3, 5]})
>>> df
   Even  Odd
0     2    1
1     4    3
2     6    5
· Sum of values
>>> df.sum()
Even    12
Odd      9
· Cumulative sum of values
>>> df.cumsum()
   Even  Odd
0     2    1
1     6    4
2    12    9
· Minimum value
>>> df.min()
Even    2
Odd     1
· Maximum value
>>> df.max()
Even    6
Odd     5
· Summary statistics
>>> df.describe()
       Even  Odd
count   3.0  3.0
mean    4.0  3.0
std     2.0  2.0
min     2.0  1.0
25%     3.0  2.0
50%     4.0  3.0
75%     5.0  4.0
max     6.0  5.0
· Mean of values
>>> df.mean()
Even    4.0
Odd     3.0
· Median of values
>>> df.median()
Even    4.0
Odd     3.0
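
All of the statistics in this section are computed per column by default. As a small sketch, the same methods accept axis=1 to work across each row instead; the variable names here are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Even': [2, 4, 6], 'Odd': [1, 3, 5]})

column_sums = df.sum()      # one value per column
row_sums = df.sum(axis=1)   # one value per row
column_std = df.std()       # sample standard deviation (ddof=1) per column

print(column_sums)
print(row_sums)
print(column_std)
```

The standard deviation of 2.0 matches the std row in the describe output because both use the sample formula.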

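Selection

In this section you will learn how to select specific values from a Series or a DataFrame.

The s and df defined above are used as the example Series and DataFrame in this section.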

>>> s
a    3
b   -5
c    7
d    4
>>> df
   Country    Capital Population
0  Belgium   Brussels     111907
1    India  New Delhi    1303021
2   Brazil   Brasilia     208476
· Get one element
>>> s['b']
-5
· Get a subset of a DataFrame
>>> df[1:]
  Country    Capital Population
1   India  New Delhi    1303021
2  Brazil   Brasilia     208476
· Select a single value by row and column position
>>> df.iloc[0, 0]
'Belgium'
· Select a single value by row and column labels
>>> df.loc[0, 'Country']
'Belgium'
· Select a single row or a subset of rows (.ix was removed in pandas 1.0; use .loc)
>>> df.loc[2]
Country         Brazil
Capital       Brasilia
Population      208476
· Select a single column or a subset of columns
>>> df.loc[:, 'Capital']
0     Brussels
1    New Delhi
2     Brasilia
· Select rows and columns
>>> df.loc[1, 'Capital']
'New Delhi'
· Filter rows with a boolean condition (Population holds strings, so convert it first)
>>> df[df['Population'].astype(int) > 120000]
  Country    Capital Population
1   India  New Delhi    1303021
2  Brazil   Brasilia     208476
· Set index a of Series s to 6
>>> s['a'] = 6
>>> s
a    6
b   -5
c    7
d    4
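
As a sketch, the selections in this section can be collected into one runnable script using .loc (label-based) and .iloc (position-based), assuming a numeric Population column. Older Pandas code sometimes uses the .ix indexer for these lookups; it was removed in pandas 1.0. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['Belgium', 'India', 'Brazil'],
    'Capital': ['Brussels', 'New Delhi', 'Brasilia'],
    'Population': [111907, 1303021, 208476],  # integers, so numeric filters work directly
})

first_cell = df.iloc[0, 0]             # by integer position
row = df.loc[2]                        # one row, by index label
capitals = df.loc[:, 'Capital']        # one column, all rows
value = df.loc[1, 'Capital']           # one cell, by row and column label
large = df[df['Population'] > 120000]  # boolean filter on a numeric column

print(first_cell, value)
print(large)
```

The distinction matters once the index is no longer 0, 1, 2: .loc always looks up labels, while .iloc always counts positions.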

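Applying functions

In this section you will learn how to apply a function to all values of a DataFrame or of a specific column.

The df below is used as the example DataFrame in this section.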

>>> df
   Even  Odd
0     2    1
1     4    3
2     6    5
· Apply function
>>> df.apply(lambda x: x * 2)
   Even  Odd
0     4    2
1     8    6
2    12   10
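
It may help to see what the function passed to apply actually receives. In the sketch below, DataFrame.apply hands the function one column at a time by default, one row at a time with axis=1, and Series.apply hands it one value at a time. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Even': [2, 4, 6], 'Odd': [1, 3, 5]})

# Default (axis=0): the function is called once per column, with a Series.
doubled = df.apply(lambda col: col * 2)

# axis=1: the function is called once per row.
row_totals = df.apply(lambda row: row['Even'] + row['Odd'], axis=1)

# Series.apply: the function is called once per value of a single column.
even_squared = df['Even'].apply(lambda v: v ** 2)

print(doubled)
print(row_totals)
print(even_squared)
```

For simple arithmetic like this, df * 2 gives the same result as the first call and is usually faster; apply is most useful when the function has no vectorized equivalent.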

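Data alignment

In this section you will learn how to add, subtract, and divide two Series that have different indexes.

The Series s and s3 below are used as the examples in this section.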

>>> s
a    6
b   -5
c    7
d    4
>>> s3 = pd.Series([7, -2, 3], index=['a', 'c', 'd'])
>>> s3
a    7
c   -2
d    3
· Internal data alignment
>>> s + s3
a    13.0
b     NaN
c     5.0
d     7.0
# NA values are introduced at the index labels that don't overlap
· Arithmetic operations with fill methods
>>> s.add(s3, fill_value=0)
a    13.0
b    -5.0
c     5.0
d     7.0
>>> s.sub(s3, fill_value=2)
a   -1.0
b   -7.0
c    9.0
d    1.0
>>> s.div(s3, fill_value=4)
a    0.857143
b   -1.250000
c   -3.500000
d    1.333333
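
The role of fill_value is easiest to see on the label that exists in only one Series. In this sketch, label b is missing from s3, so fill_value stands in for the missing operand before the arithmetic is done. The result names are illustrative.

```python
import pandas as pd

s = pd.Series([6, -5, 7, 4], index=['a', 'b', 'c', 'd'])
s3 = pd.Series([7, -2, 3], index=['a', 'c', 'd'])  # no label 'b'

total = s + s3                       # 'b' has no partner, so the result is NaN
filled = s.add(s3, fill_value=0)     # 'b': -5 + 0
diff = s.sub(s3, fill_value=2)       # 'b': -5 - 2
quotient = s.div(s3, fill_value=4)   # 'b': -5 / 4

print(total)
print(filled)
print(diff)
print(quotient)
```

Labels present in both Series are unaffected by fill_value; only the missing side of a pair is replaced.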

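Input/output

In this section you will learn how to use Pandas to read CSV files, Excel files, and SQL query results into Python. You will also learn how to write a DataFrame out to a CSV file, an Excel file, or a SQL table.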

Read CSV file

>>> pd.read_csv('file.csv')

Write to CSV file

>>> df.to_csv('myDataFrame.csv')

Read Excel file

>>> pd.read_excel('file.xlsx')

Write to Excel file

>>> df.to_excel('dir/myDataFrame.xlsx')

Read multiple sheets from the same file

>>> xlsx = pd.ExcelFile('file.xls')
>>> df = pd.read_excel(xlsx, 'Sheet1')

Read SQL Query

>>> from sqlalchemy import create_engine
>>> engine = create_engine('sqlite:///:memory:')
>>> pd.read_sql('SELECT * FROM my_table;', engine)
>>> pd.read_sql_table('my_table', engine)

Write to SQL table

>>> df.to_sql('myDF', engine)
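
To try the read and write calls without creating files, the sketch below does a round trip through an in-memory buffer and an in-memory SQLite database. It assumes the standard-library sqlite3 connection in place of the SQLAlchemy engine used in the snippets, and the table name my_table and the Even/Odd data are only for illustration.

```python
import io
import sqlite3

import pandas as pd

df = pd.DataFrame({'Even': [2, 4, 6], 'Odd': [1, 3, 5]})

# CSV round trip through an in-memory text buffer instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)  # index=False keeps the row labels out of the CSV
buffer.seek(0)
df_from_csv = pd.read_csv(buffer)

# SQL round trip through an in-memory SQLite database.
conn = sqlite3.connect(':memory:')
df.to_sql('my_table', conn, index=False)
df_from_sql = pd.read_sql('SELECT * FROM my_table', conn)
conn.close()

print(df_from_csv.equals(df))
print(df_from_sql.equals(df))
```

Without index=False, to_csv and to_sql also write the index as an extra column, which then shows up as an unnamed or index column when the data is read back.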

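Conclusion

Both now and for the foreseeable future, Python is a leader in the data science field. Knowing Pandas, one of its most powerful and effective libraries, is usually a requirement for today's data scientists.

Use this cheat sheet as a guide while you are learning and you will build a solid grasp of the Pandas library. Thanks for reading. You can also subscribe to our YouTube channel for open classes on the big data industry: https://www.youtube.com/channel/UCa8NLpvi70mHVsW4J_x9OeQ; and follow us on LinkedIn to grow your network: https://www.linkedin.com/company/dataapplab/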

Original author: Zita
Translator: 高佑兮
Art editor: 过儿
Proofreader: Chuang
Original article: https://levelup.gitconnected.com/pandas-basics-cheat-sheet-2023-python-for-data-science-b59fb7786b4d

November 24, 2022 | Blog | Tags: Data Science
