Kaggle给初学者们提供了简单的上手教学,在此将其关于Pandas的课程提炼出来,虽然并没有多少内容,但对于做数据分析应该够用了
Kaggle原文链接:https://www.kaggle.com/learn/pandas
Pandas官方文档更为详细,举例也很友好:http://pandas.pydata.org/pandas-docs/stable/reference/index.html
例子中用到的数据我已在下面链接中分享出来:
链接:https://pan.baidu.com/s/1zalqh7qRPb_L92oKRPPg_Q
提取码:4198
1. Creating, Reading & Writing
1.1 Creating data
Pandas中有两个核心对象,DataFrame和Series
1.1.1 DataFrame
DataFrame是二维表格,竖列称为column,横行称为index,例如创建一个名为fruits的DataFrame对象:
import pandas as pd
fruits = pd. DataFrame( [ [ 10 , 20 ] , [ 30 , 40 ] ] , columns= [ "Apples" , "Bananas" ] , index= [ "Price" , "Amount" ] )
fruits = pd. DataFrame( { "Apples" : [ 10 , 30 ] , "Bananas" : [ 20 , 40 ] } , index= [ "Price" , "Amount" ] )
1.1.2 Series
Series是一维表格,即只有单列column,可以把一个DataFrame看作是多个Series组合起来的合体,它们总是相互关联。不对column单独命名,只有一个总的表格名,例如创建一个名为things的Series对象:
things = pd. Series( [ 1 , 2 , 3 ] , index= [ "Milk" , "Eggs" , "Spam" ] , name= "Dinner" )
things
注意:在上述创建过程中,如果index不作特殊说明,那么就会是从0开始的连续自然数
1.2 Reading file
数据可以存储在多种文件格式中,目前最基础的是CSV格式的文件,CSV(comma-seperated values)格式是数据被逗号隔开的一种表格,我们使用read_csv()函数把数据读取到一个DataFrame对象中:
wine_reviews = pd. read_csv( "../input/wine-reviews/winemag-data-130k-v2.csv" )
wine_reviews = pd. read_csv( "../input/wine-reviews/winemag-data-130k-v2.csv" , index_col= 0 )
另一种常见的数据格式是SQL(Structured Query Language),它的存储能力相当惊人,SQL有很多不同种类,每一种都需要各自的connector,它的读取没有CSV那么方便,目前在Kaggle唯一支持的种类是SQLite,应用举例:
import sqlite3
conn = sqlite3. connect( "../input/188-million-us-wildfires/FPA_FOD_20170508.sqlite" )
fires = pd. read_sql_query( "SELECT * FROM fires" , conn)
1.3 Writing file
使用to_csv()函数将数据写入CSV格式的文件:
wine_reviews. to_csv( "wine_reviews.csv" )
2. Indexing, Selecting & Assigning
2.1 Naive accessors
简单的访问数据:可以直接显示整个DataFrame,若数据量较大最好设置展示的最大行数;也可以使用 head() 来展示前几行数据
import pandas as pd
reviews = pd. read_csv( "wine-reviews/winemag-data-130k-v2.csv" , index_col= 0 )
pd. set_option( "display.max_rows" , 2 )
reviews
reviews. head( n= 1 )
选取特定的某一列数据,可以用 DataFrame.column 或者 DataFrame[“column”]
reviews. country
reviews[ 'country' ]
选取特定的某一行某一列的元素,可以用 DataFrame[“column”][“index”]
reviews[ 'country' ] [ 0 ]
‘Italy’
2.2 DataFrame.iloc
DataFrame.iloc基于数据的数字索引位置(integer-location)来检索数据,也可用布尔值(boolean array)进行检索
reviews. iloc[ 0 ]
reviews. iloc[ : , 0 ]
reviews. iloc[ : 3 , 0 ]
reviews. iloc[ [ 0 , 3 , 5 ] , 0 ]
reviews. iloc[ - 1 ]
reviews. head( ) . iloc[ [ True , False , False , True , False ] ]
2.3 DataFrame.loc
DataFrame.loc可以基于标签(label)或布尔值(boolean array)对行列进行检索,还可以使用逻辑符号进行条件检索
reviews. loc[ 0 , 'country' ]
‘Italy’
reviews. loc[ : , [ 'taster_name' , 'points' ] ]
reviews. head( ) . loc[ [ True , False , False , True , False ] ]
总结一下iloc与loc的异同:
相同点:都是先行后列的检索顺序,都支持使用布尔值检索
iloc适用于基于数字索引进行检索;loc可以使用表格的标签进行检索,还可以进行条件检索
对于一个range(比如1:10),iloc是前闭后开(即1到9),loc是前后都闭(即1到10)
2.4 条件检索
reviews. loc[ reviews. country == 'Italy' ]
reviews. loc[ ( reviews. country == 'Italy' ) & ( reviews. points >= 90 ) ]
reviews. loc[ ( reviews. country == 'Italy' ) | ( reviews. points >= 90 ) ]
isin() 用来选取存在于列表中的数据
reviews. loc[ reviews. country. isin( [ 'Italy' , 'France' ] ) ]
isnull() 与 notnull() 用来确认数据是否为空
reviews. loc[ reviews. price. notnull( ) ]
2.5 Assigning data 赋值
reviews[ 'critic' ] = 'everyone'
reviews[ 'critic' ]
3. Summary functions & maps
import pandas as pd
reviews = pd. read_csv( "wine-reviews/winemag-data-130k-v2.csv" , index_col= 0 )
reviews. points. describe( )
reviews. points. mean( )
88.44713820775404
reviews. points. median( )
88.0
reviews. points. unique( )
array([ 87, 86, 85, 88, 92, 91, 90, 89, 83, 82, 81, 80, 100, 98, 97, 96, 95, 93, 94, 84, 99], dtype=int64)
pd. set_option( "display.max_rows" , 5 )
reviews. taster_name. value_counts( )
4. Grouping & Sorting
import pandas as pd
reviews = pd. read_csv( "wine-reviews/winemag-data-130k-v2.csv" , index_col= 0 )
pd. set_option( "display.max_rows" , 5 )
reviews. groupby( 'points' ) . points. count( )
reviews. groupby( "price" ) . points. max ( )
reviews. groupby( 'country' ) . price. agg( [ min , max ] )
countries_reviewed = reviews. groupby( [ 'country' , 'province' ] ) . description. agg( [ len ] )
countries_reviewed
countries_reviewed. reset_index( )
countries_reviewed. reset_index( ) . sort_values( by= 'len' )
countries_reviewed. reset_index( ) . sort_values( by= 'len' , ascending= False )
5. Data types & dealing with missing data
5.1 Data types
import pandas as pd
reviews = pd. read_csv( "wine-reviews/winemag-data-130k-v2.csv" , index_col= 0 )
pd. set_option( "display.max_rows" , 2 )
reviews. dtypes
reviews. points. dtype
reviews. points. astype( 'float64' )
5.2 Missing data handling
reviews. country. isnull( ) . sum ( )
63
reviews[ reviews. country. isnull( ) ]
reviews. region_2. fillna( "Unknown" )
DataFrame. dropna( )
DataFrame. dropna( axis= 'columns' )
DataFrame. dropna( how= 'all' )
DataFrame. dropna( thresh= 2 )
还有一种处理数据的方法:replace,例如DataFrame.replace(“A”, “B”),意为将A替换为B
关于replace有很多用法,具体可查看官方文档:http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html#pandas.DataFrame.replace
6. Renaming & Combining
这部分用的比较少,就不详细写了:)
reviews. rename( columns= { 'points' : 'score' } )
reviews. rename( index= { 0 : 'firstEntry' , 1 : 'secondEntry' } )
combined_products = pd. concat( [ gaming_products, movie_products] )
left = canadian_youtube. set_index( [ 'title' , 'trending_date' ] )
right = british_youtube. set_index( [ 'title' , 'trending_date' ] )
left. join( right, lsuffix= '_CAN' , rsuffix= '_UK' )