跟着Kaggle学习Pandas

Kaggle给初学者们提供了简单的上手教学，在此将其关于Pandas的课程提炼出来，虽然并没有多少内容，但对于做数据分析应该够用了
Kaggle原文链接：https://www.kaggle.com/learn/pandas
Pandas官方文档更为详细，举例也很友好：http://pandas.pydata.org/pandas-docs/stable/reference/index.html
例子中用到的数据我已在下面链接中分享出来：
链接：https://pan.baidu.com/s/1zalqh7qRPb_L92oKRPPg_Q
提取码：4198

1. Creating, Reading & Writing

1.1 Creating data

Pandas中有两个核心对象，DataFrame和Series

1.1.1 DataFrame

DataFrame是二维表格，竖列称为column，横行称为index，例如创建一个名为fruits的DataFrame对象：

import pandas as pd

# 方法一：先按行输入表格数据，然后分别对columns和index作补充说明
fruits = pd.DataFrame([[10, 20], [30, 40]], columns=["Apples", "Bananas"], index=["Price", "Amount"])

# 方法二：按列来输入每一个column的数据，然后对index作单独补充说明
fruits = pd.DataFrame({"Apples": [10, 30], "Bananas": [20, 40]}, index=["Price", "Amount"])

在这里插入图片描述

1.1.2 Series

Series是一维表格，即只有单列column，可以把一个DataFrame看作是多个Series组合起来的合体，它们总是相互关联。不对column单独命名，只有一个总的表格名，例如创建一个名为things的Series对象：

# 方法类似，先对表格内容进行输入，然后对index和name补充说明
things = pd.Series([1, 2, 3], index=["Milk", "Eggs", "Spam"], name="Dinner")

things

在这里插入图片描述
注意：在上述创建过程中，如果index不作特殊说明，那么就会是从0开始的连续自然数

1.2 Reading file

数据可以存储在多种文件格式中，目前最基础的是CSV格式的文件，CSV（comma-seperated values）格式是数据被逗号隔开的一种表格，我们使用read_csv()函数把数据读取到一个DataFrame对象中：

# 此时pandas会自动加入从0开始的index
wine_reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv")

# 如果原数据有自己的index，比如在第一列，则可用index_col=0来描述，这样第一列就会作为index
wine_reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)

另一种常见的数据格式是SQL（Structured Query Language），它的存储能力相当惊人，SQL有很多不同种类，每一种都需要各自的connector，它的读取没有CSV那么方便，目前在Kaggle唯一支持的种类是SQLite，应用举例：

import sqlite3

conn = sqlite3.connect("../input/188-million-us-wildfires/FPA_FOD_20170508.sqlite")
fires = pd.read_sql_query("SELECT * FROM fires", conn)
# SELECT是每个SQL开始陈述时的固定表达，星号*意为everything，FROM fires表示只从数据中读取名为fires的表格

1.3 Writing file

使用to_csv()函数将数据写入CSV格式的文件：

wine_reviews.to_csv("wine_reviews.csv")

2. Indexing, Selecting & Assigning

2.1 Naive accessors

简单的访问数据：可以直接显示整个DataFrame，若数据量较大最好设置展示的最大行数；也可以使用 head() 来展示前几行数据

import pandas as pd

reviews = pd.read_csv("wine-reviews/winemag-data-130k-v2.csv", index_col=0)
pd.set_option("display.max_rows", 2)    # 数据量大，这里设置为最多显示5行

reviews

在这里插入图片描述

reviews.head(n=1)           # 若不指定，则默认n=5

在这里插入图片描述
选取特定的某一列数据，可以用 DataFrame.column 或者 DataFrame[“column”]

reviews.country   
reviews['country']

在这里插入图片描述
选取特定的某一行某一列的元素，可以用 DataFrame[“column”][“index”]

reviews['country'][0]

‘Italy’

2.2 DataFrame.iloc

DataFrame.iloc基于数据的数字索引位置（integer-location）来检索数据，也可用布尔值（boolean array）进行检索

reviews.iloc[0]               # 选取第一行

在这里插入图片描述

reviews.iloc[:, 0]            # 选取第一列，:表示所有

在这里插入图片描述

reviews.iloc[:3, 0]           # 选取第一列的前3个数据

在这里插入图片描述

reviews.iloc[[0, 3, 5], 0]    # 选取第一列的第1、4、6个数据，使用列表检索要加[]

在这里插入图片描述

reviews.iloc[-1]              # 选取最后一行：若是负数，则是从数据末尾开始计算，类似于python的列表功能

在这里插入图片描述

# 基于布尔值进行检索选取数据，注意布尔的数量要与index数量一致，下例表示选取第1行和第4行数据
reviews.head().iloc[[True, False, False, True, False]]

在这里插入图片描述

2.3 DataFrame.loc

DataFrame.loc可以基于标签（label）或布尔值（boolean array）对行列进行检索，还可以使用逻辑符号进行条件检索

reviews.loc[0, 'country']                     # 选取第1行第1列的数据

‘Italy’

reviews.loc[:, ['taster_name', 'points']]     # 选取某两列的所有数据

在这里插入图片描述

# 基于布尔值进行检索选取数据，注意布尔的数量要与index数量一致，下例表示选取第1行和第4行数据
reviews.head().loc[[True, False, False, True, False]]

在这里插入图片描述
总结一下iloc与loc的异同：

相同点：都是先行后列的检索顺序，都支持使用布尔值检索
iloc适用于基于数字索引进行检索；loc可以使用表格的标签进行检索，还可以进行条件检索
对于一个range（比如1:10），iloc是前闭后开（即1到9），loc是前后都闭（即1到10）

2.4 条件检索

reviews.loc[reviews.country == 'Italy']

在这里插入图片描述

reviews.loc[(reviews.country == 'Italy') & (reviews.points >= 90)]     # 在Pandas里“与”用“&”表示

在这里插入图片描述

reviews.loc[(reviews.country == 'Italy') | (reviews.points >= 90)]     # 在Pandas里“或”用“|”表示

在这里插入图片描述
isin() 用来选取存在于列表中的数据

reviews.loc[reviews.country.isin(['Italy', 'France'])]

在这里插入图片描述
isnull() 与 notnull() 用来确认数据是否为空

reviews.loc[reviews.price.notnull()]

在这里插入图片描述

2.5 Assigning data 赋值

reviews['critic'] = 'everyone'
reviews['critic']

在这里插入图片描述

3. Summary functions & maps

import pandas as pd

reviews = pd.read_csv("wine-reviews/winemag-data-130k-v2.csv", index_col=0)

reviews.points.describe()              # 描述分数的一些情况

在这里插入图片描述

reviews.points.mean()                   # 分数的平均值

88.44713820775404

reviews.points.median()                 # 分数的中值

88.0

reviews.points.unique()                 # 分数的不同数值

array([ 87, 86, 85, 88, 92, 91, 90, 89, 83, 82, 81, 80, 100, 98, 97, 96, 95, 93, 94, 84, 99], dtype=int64)

pd.set_option("display.max_rows", 5)

reviews.taster_name.value_counts()      # 分数的不同数值分别出现了多少次

在这里插入图片描述

4. Grouping & Sorting

import pandas as pd

reviews = pd.read_csv("wine-reviews/winemag-data-130k-v2.csv", index_col=0)
pd.set_option("display.max_rows", 5)

reviews.groupby('points').points.count()    # 将分数升序排列，并统计出对应的个数

在这里插入图片描述

reviews.groupby("price").points.max()      # 将价格升序排列，并统计出对应的最高分数

在这里插入图片描述

reviews.groupby('country').price.agg([min, max])      # agg可以运行多个函数

在这里插入图片描述

# groupby可以有多个index
countries_reviewed = reviews.groupby(['country', 'province']).description.agg([len])
countries_reviewed

在这里插入图片描述

countries_reviewed.reset_index()     # reset_index()转换为常规的dataframe类型

在这里插入图片描述

countries_reviewed.reset_index().sort_values(by='len')      # 默认升序

在这里插入图片描述

countries_reviewed.reset_index().sort_values(by='len', ascending=False)

在这里插入图片描述

5. Data types & dealing with missing data

5.1 Data types

import pandas as pd

reviews = pd.read_csv("wine-reviews/winemag-data-130k-v2.csv", index_col=0)
pd.set_option("display.max_rows", 2)

reviews.dtypes

在这里插入图片描述

reviews.points.dtype

在这里插入图片描述

reviews.points.astype('float64')       # 将points的int64类型转换为float64类型

5.2 Missing data handling

reviews.country.isnull().sum()

63

reviews[reviews.country.isnull()]

在这里插入图片描述

reviews.region_2.fillna("Unknown")       # fillna用于将缺失的NA值替换成别的

在这里插入图片描述

DataFrame.dropna()                          # 去掉有缺失项的行
DataFrame.dropna(axis='columns')            # 去掉有缺失项的列
DataFrame.dropna(how='all')                 # 去掉所有项均缺失的行
DataFrame.dropna(thresh=2)                  # 保留至少有两个非空项的行

还有一种处理数据的方法：replace，例如DataFrame.replace(“A”, “B”)，意为将A替换为B
关于replace有很多用法，具体可查看官方文档:http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html#pandas.DataFrame.replace

6. Renaming & Combining

这部分用的比较少，就不详细写了：）

reviews.rename(columns={'points': 'score'})
reviews.rename(index={0: 'firstEntry', 1: 'secondEntry'})

combined_products = pd.concat([gaming_products, movie_products])

left = canadian_youtube.set_index(['title', 'trending_date'])
right = british_youtube.set_index(['title', 'trending_date'])

left.join(right, lsuffix='_CAN', rsuffix='_UK')