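This post covers the basic ways of working with the data in a Series or DataFrame. The IPython sessions below assume that numpy has been imported as np and that Series and DataFrame have been imported from pandas.
Reindexing is done with the reindex method, which creates a new object conformed to a new index. Calling reindex on a Series rearranges the values according to the given index. Any index value that is not present in the original Series gets the missing value NaN by default, but a replacement value can be supplied through the fill_value argument.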
In [105]: obj = Series([4.5,7.2,5,6],index=['d','b','a','c'])

In [106]: obj
Out[106]:
d    4.5
b    7.2
a    5.0
c    6.0
dtype: float64

In [107]: obj2=obj.reindex(['a','b','c','d','e'],fill_value=0)

In [108]: obj2
Out[108]:
a    5.0
b    7.2
c    6.0
d    4.5
e    0.0
dtype: float64

For ordered data such as time series, reindexing may call for some interpolation or filling of values. The method option handles this; the value ffill fills forward.
In [109]: obj3 = Series(['blue','purple','yellow'],index=[0,2,4])

In [110]: obj3.reindex(range(6),method='ffill')
Out[110]:
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

The table below lists the values available for the method option.
Argument             Description
ffill or pad         Fill (carry) values forward
bfill or backfill    Fill (carry) values backward

Arguments of the reindex function
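Argument      Description
index         New sequence to use as the index
method        Interpolation (fill) method
fill_value    Value to substitute for the missing data introduced by reindexing
limit         Maximum number of entries to fill when forward- or backward-filling
copy          If True (the default), always copy the underlying data; if False, the data is not copied when the new index is equal to the old one
level         Match a simple Index on a level of a MultiIndex; otherwise select a subset of it

Indexing, selection, and filtering on a Series
Series indexing works much like NumPy array indexing, except that the index labels of the Series can be used in place of integers.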
In [5]: obj = Series(np.arange(4.),index=['a','b','c','d'])

In [6]: obj
Out[6]:
a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

In [7]: obj['a']
Out[7]: 0.0

In [8]: obj[2:4]
Out[8]:
c    2.0
d    3.0
dtype: float64

In [9]: obj['b':'c']=5

In [10]: obj
Out[10]:
a    0.0
b    5.0
c    5.0
d    3.0
dtype: float64

In [11]: obj['b':'c']=[5,6]

In [12]: obj
Out[12]:
a    0.0
b    5.0
c    6.0
d    3.0
dtype: float64

Note that a slice by label includes its endpoint, unlike a slice by position. Because a DataFrame is a two-dimensional table, indexing into it retrieves one or more columns.
In [13]: data = DataFrame(np.arange(16).reshape((4,4)),index=['Ohio','Colorado','Utah','New York'],columns=['one','two','three','four'])

In [14]: data
Out[14]:
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

In [15]: data['two']
Out[15]:
Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int32

In [17]: data[['two','three']]
Out[17]:
          two  three
Ohio        1      2
Colorado    5      6
Utah        9     10
New York   13     14

In [18]: data[data['three']>5]
Out[18]:
          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

In [19]: data[data<5]=0

In [20]: data
Out[20]:
          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

The ix indexer adds label-based indexing on the rows of a DataFrame. (ix was deprecated in pandas 0.20 and removed in 1.0; on current versions, use loc for labels and iloc for integer positions.)
In [21]: data.ix[['Colorado','Utah'],[3,0,1]]
Out[21]:
          four  one  two
Colorado     7    0    5
Utah        11    8    9

In [22]: data.ix[:'Utah','two']
Out[22]:
Ohio        0
Colorado    5
Utah        9
Name: two, dtype: int32

pandas offers many ways to select and rearrange data; for details, see http://blog.csdn.net/u011707148/article/details/76822877
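Since ix no longer exists in current pandas, the two selections above can be reproduced roughly as follows. This is a sketch rather than the original author's code: the names subset, col_two and subset_by_pos are made up for illustration, and it assumes pandas 1.0 or later.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["Ohio", "Colorado", "Utah", "New York"],
    columns=["one", "two", "three", "four"],
)
data[data < 5] = 0  # same masking step as In [19]

# Rows by label, columns by position: translate the positions into labels first.
subset = data.loc[["Colorado", "Utah"], data.columns[[3, 0, 1]]]

# Label slice on the rows (the endpoint 'Utah' is included), one column by label.
col_two = data.loc[:"Utah", "two"]

# The same block selected purely by integer position with iloc.
subset_by_pos = data.iloc[[1, 2], [3, 0, 1]]

print(subset)
print(col_two)
```

Splitting the old ix behaviour between loc and iloc makes it explicit whether a selector is a label or a position, which is the ambiguity that ix left to guesswork.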
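pandas can perform arithmetic between objects that have different indexes. When two objects are added and some index pairs do not match, the index of the result is the union of the index pairs.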
In [24]: s1 = Series([7.3,-2.5,3.4,1.5],index=['a','c','d','e'])

In [25]: s2 = Series([-2.1,3.6,-1.5,4,3.1],index=['a','c','e','f','g'])

In [26]: s1
Out[26]:
a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64

In [27]: s2
Out[27]:
a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

In [28]: s1+s2
Out[28]:
a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

Index values that do not overlap get NA values in the result, and missing values propagate through any further arithmetic.
For a DataFrame, alignment takes place on both the rows and the columns.
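When the NaN values at non-overlapping labels are not wanted, the method form of the operator accepts a fill value. The sketch below goes beyond the original text: it uses Series.add with fill_value, and the names plain and filled are illustrative only.

```python
import pandas as pd

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=["a", "c", "d", "e"])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=["a", "c", "e", "f", "g"])

# Operator form: labels present on only one side become NaN.
plain = s1 + s2

# Method form: a label missing on one side is treated as 0 before adding,
# so 'd', 'f' and 'g' keep the value from the side that has them.
filled = s1.add(s2, fill_value=0)

print(plain)
print(filled)
```

A label that is missing from both operands would still come out as NaN; fill_value only fills in for the side that lacks the label.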
In [31]: df1 = DataFrame(np.arange(9).reshape((3,3)),columns=list('bcd'),index=['Ohio','Texas','Colorado'])

In [32]: df2 = DataFrame(np.arange(12.).reshape((4,3)),columns=list('bde'),index=['Utah','Ohio','Texas','Oregon'])

In [33]: df1
Out[33]:
          b  c  d
Ohio      0  1  2
Texas     3  4  5
Colorado  6  7  8

In [34]: df2
Out[34]:
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0

In [35]: df1+df2
Out[35]:
            b   c     d   e
Colorado  NaN NaN   NaN NaN
Ohio      3.0 NaN   6.0 NaN
Oregon    NaN NaN   NaN NaN
Texas     9.0 NaN  12.0 NaN
Utah      NaN NaN   NaN NaN

Operations between DataFrame and Series
As with NumPy arrays, arithmetic between a DataFrame and a Series is well defined. As a starting point, consider the difference between a two-dimensional array and one of its rows:
In [56]: arr = np.arange(12.).reshape((3,4))

In [57]: arr
Out[57]:
array([[  0.,   1.,   2.,   3.],
       [  4.,   5.,   6.,   7.],
       [  8.,   9.,  10.,  11.]])

In [58]: arr[0]
Out[58]: array([ 0.,  1.,  2.,  3.])

In [59]: arr-arr[0]
Out[59]:
array([[ 0.,  0.,  0.,  0.],
       [ 4.,  4.,  4.,  4.],
       [ 8.,  8.,  8.,  8.]])

This is broadcasting. By default, arithmetic between a DataFrame and a Series matches the index of the Series against the columns of the DataFrame and broadcasts down the rows.
NumPy ufuncs (element-wise array methods) also work on pandas objects:
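The default matching rule just described can be sketched with a small frame. The data and the names first_row, by_row, col_d and by_col are illustrative assumptions, and the second half (sub with axis="index") is an addition that the original text does not cover.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12.0).reshape((4, 3)),
    columns=list("bde"),
    index=["Utah", "Ohio", "Texas", "Oregon"],
)

# A row taken from the frame is a Series indexed by the column labels b, d, e.
first_row = frame.iloc[0]

# Default behaviour: match the Series index on the columns,
# then subtract that row from every row of the frame.
by_row = frame - first_row

# To match on the row labels instead, use the method form and name the axis.
col_d = frame["d"]
by_col = frame.sub(col_d, axis="index")

print(by_row)
print(by_col)
```

Here every row of by_row is the original row minus (0, 1, 2), while every row of by_col comes out as (-1, 0, 1), because column d is subtracted from each column in turn.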
In [61]: frame = DataFrame(np.random.randn(4,3),columns=list('bde'),index=['Utah','Ohio','Texas','Oregon'])

In [62]: frame
Out[62]:
               b         d         e
Utah    0.351351 -0.609639 -0.262186
Ohio   -1.794275 -0.846146  0.187630
Texas   1.031743 -0.438754  0.312941
Oregon -1.316956 -1.182678  0.211529

In [63]: np.abs(frame)
Out[63]:
               b         d         e
Utah    0.351351  0.609639  0.262186
Ohio    1.794275  0.846146  0.187630
Texas   1.031743  0.438754  0.312941
Oregon  1.316956  1.182678  0.211529

Another common operation is applying a function to the one-dimensional arrays formed by each column or each row.
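In [64]: f = lambda x:x.max()-x.min()

In [65]: frame.apply(f)
Out[65]:
b    2.826018
d    0.743924
e    0.575127
dtype: float64

In [66]: frame.apply(f,axis=1)
Out[66]:
Utah      0.960990
Ohio      1.981905
Texas     1.470496
Oregon    1.528485
dtype: float64

lambda defines an anonymous function. It does not make a program run any faster; it only makes the code more concise. If something can be done with a for…in…if expression, I firmly avoid lambda. When I do use lambda, it should not contain a loop; if a loop is needed, I would rather define a named function, which makes the code reusable and easier to read.
With apply, axis=0 (the default) calls the function once per column, and axis=1 calls it once per row.
Besides a scalar, the function passed to apply can also return a Series made up of several values.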
In [67]: def f(x):
    ...:     return Series([x.min(),x.max()],index=['min','max'])
    ...:

In [68]: frame.apply(f)
Out[68]:
            b         d         e
min -1.794275 -1.182678 -0.262186
max  1.031743 -0.438754  0.312941

Element-wise Python functions can be used as well, for example to turn each value in frame into a formatted string:
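In [69]: format = lambda x:'%.2f' %x

In [70]: frame.applymap(format)
Out[70]:
            b      d      e
Utah     0.35  -0.61  -0.26
Ohio    -1.79  -0.85   0.19
Texas    1.03  -0.44   0.31
Oregon  -1.32  -1.18   0.21

In pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map, which does the same thing.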
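For code that has to run on both older and newer pandas, the element-wise formatting step can be written along these lines. This is a sketch under stated assumptions: the two-row frame, the helper two_decimals and the names elementwise, formatted and formatted_b are illustrative, and the hasattr check is one possible way to choose between DataFrame.map (pandas 2.1 and later) and applymap.

```python
import pandas as pd

frame = pd.DataFrame(
    [[0.351351, -0.609639], [-1.794275, -0.846146]],
    columns=["b", "d"],
    index=["Utah", "Ohio"],
)

def two_decimals(x):
    """Format a single number with two decimal places."""
    return f"{x:.2f}"

# Prefer DataFrame.map where it exists; fall back to applymap on older pandas.
elementwise = frame.map if hasattr(frame, "map") else frame.applymap
formatted = elementwise(two_decimals)

# A single column is a Series, whose map method is element-wise on any version.
formatted_b = frame["b"].map(two_decimals)

print(formatted)
print(formatted_b)
```

A named function is used here instead of a lambda bound to the name format, which would shadow the Python built-in of the same name.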