Python for Data Analysis, Chapter 7: Data Wrangling

Posted by an anonymous user, 2021-5-28 02:12
<div class="blogpost-body" id="cnblogs_post_body">
<p>Code for Chapter 7 of "Python for Data Analysis" (《利用Python进行数据分析》).</p>
<pre class="blockcode"># -*- coding:utf-8 -*-
# Python for Data Analysis, Chapter 7: Data Wrangling


import pandas as pd
import numpy as np
import time

start = time.time()
# 1. Combining datasets: there are three approaches, merge, join and concat
# 1.1 Database-style DataFrame merges (merge and join)
# merge connects the rows of two DataFrames based on one or more keys
df1 = pd.DataFrame({
    'key': ['a', 'b', 'b', 'c', 'd'],
    'data1': [1, 2, 3, 4, 5]
})
df2 = pd.DataFrame({
    'key': ['a', 'b', 'c'],
    'data2': [6, 7, 8]
})
# on names the join column explicitly; if omitted, the shared column names are used
df = pd.merge(df1, df2, on='key')
print(df)
# When the two DataFrames have no column name in common, pass the join
# columns separately with left_on and right_on
df1 = pd.DataFrame({
    'key1': ['a', 'b', 'b', 'c', 'd'],
    'data1': [1, 2, 3, 4, 5]
})
df2 = pd.DataFrame({
    'key2': ['a', 'b', 'c'],
    'data2': [6, 7, 8]
})
df = pd.merge(df1, df2, left_on='key1', right_on='key2')
print(df)
# By default merge keeps the intersection of the keys (the 'd' row of df1 is dropped)
# The how keyword selects the join type: intersection ('inner'), union ('outer'),
# all keys of the left DataFrame ('left') or all keys of the right one ('right')
df = pd.merge(df1, df2, left_on='key1', right_on='key2', how='outer')
print(df)
# The suffixes keyword renames overlapping column names that are not join keys
df1 = pd.DataFrame({
    'key': ['a', 'b', 'b', 'c', 'd'],
    'same': range(5),
    'data1': [1, 2, 3, 4, 5]
})
df2 = pd.DataFrame({
    'key': ['a', 'b', 'c'],
    'same': range(3, 0, -1),
    'data2': [6, 7, 8]
})
# The suffixes are appended to the overlapping column names automatically
df = pd.merge(df1, df2, on='key', suffixes=('_left', '_right'))
print(df)
print('1-1····························')
# The sort keyword controls whether the result is sorted by the join keys.
# Recent pandas versions default to sort=False; the older versions the book
# describes defaulted to True. Leaving sorting off is faster on large datasets.
# 1.2 Merging on the index
# If the join key of a DataFrame is its index, say so with merge's
# left_index and right_index keywords
df1 = pd.DataFrame({
    'key': ['a', 'b', 'c'],
    'data1': range(3)
})
df2 = pd.DataFrame({'data2': range(3, 0, -1)},
                   index=['a', 'b', 'd'])
# If needed, both sides can use their index as the join key
df = pd.merge(df1, df2, left_on='key', right_index=True)
print(df)
# The DataFrame.join method is a more convenient way to merge on the index
df1 = pd.DataFrame({'data1': range(3)},
                   index=['a', 'b', 'c'])
df2 = pd.DataFrame({'data2': range(3, 0, -1)},
                   index=['a', 'b', 'd'])
df = df1.join(df2)
print(df)  # a left join by default; use the how keyword to choose another type
# join also accepts a list of DataFrames
df3 = pd.DataFrame({'data3': [6, 9, 3]},
                   index=['a', 'b', 'g'])
df = df1.join([df2, df3], how='outer')
print(df)
print('1-2···········································')
# 1.3 Concatenating along an axis with concat (concatenate)
# Instead of matching on keys, concat glues objects together along an axis,
# which is more intuitive
df1 = pd.Series(range(2), index=['a', 'b'])
df2 = pd.Series(range(3), index=['c', 'd', 'e'])
df = pd.concat([df1, df2])
print(df)
# concat works along axis=0 by default; pass axis to choose explicitly
df = pd.concat([df1, df2], axis=1)
print(df)  # concatenating along axis 1 produces a DataFrame
# Along axis 1 the indexes are combined as a union by default (join='outer')
# Older pandas versions had a join_axes keyword for specifying the result index;
# it was removed in pandas 1.0, so reindex the result instead
df = pd.concat([df1, df2], axis=1).reindex(['a', 'b', 'g'])
print(df)
# To tell the pieces apart in the result, label them with the keys argument
df = pd.concat([df1, df2], keys=['df1', 'df2'])
print(df)  # the result has a hierarchical index (MultiIndex)
# When concatenating horizontally (axis=1), the keys become the column labels
# of the resulting DataFrame
df = pd.concat([df1, df2], keys=
</pre>
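<p>The effect of the keys argument on each axis, and one way to pick the result index now that join_axes is gone, can be sketched as follows. This is a minimal illustration, not the post's own code: the names s1, s2, stacked, side_by_side and selected are made up for the example, and using reindex in place of join_axes is an assumption about what the original call intended.</p>

```python
import pandas as pd

# Two Series with disjoint indexes, mirroring the data used in section 1.3
s1 = pd.Series(range(2), index=['a', 'b'])
s2 = pd.Series(range(3), index=['c', 'd', 'e'])

# axis=0 (the default): the keys become the outer level of a MultiIndex on the rows
stacked = pd.concat([s1, s2], keys=['s1', 's2'])

# axis=1: the keys become the column labels of the resulting DataFrame
side_by_side = pd.concat([s1, s2], axis=1, keys=['s1', 's2'])

# Choosing the row labels after the fact with reindex;
# labels missing from the data (here 'g') are filled with NaN
selected = side_by_side.reindex(['a', 'b', 'g'])

print(stacked)
print(side_by_side)
print(selected)
```

<p>Reindexing after the concatenation keeps the example independent of the pandas version, at the cost of building the full outer-joined frame first.</p>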