pandas官方文档cookbook（5）中MissValue&groupby翻译

xiaoxiao2021-02-28 68

文档版本：0.20.3 这些例子是用python3.4写出来的。对于较早的python版本需要对代码做些相应的调整。 Pandas（pd）和Numpy（np）是唯一两个默认导入的包。其余的包会显示导入给新用户看。若有翻译不当的地方，请多多指教。

这份文档中的例子都是从Stack-Overflow和Github中别人提问的比较经典的问题，作者从中进行提炼与总结。

缺失值

反转时间序列把缺失值替换成前一个值。df.index[::-1]表示把df的index进行倒序处理，::表示选择全列，-1表示依次递减逆序。

In [82]: df = pd.DataFrame(np.random.randn(6,1), index=pd.date_range('2013-08-01', periods=6, freq='B'), columns=list('A')) In [83]: df.loc[df.index[3], 'A'] = np.nan In [84]: df Out[84]: A 2013-08-01 -1.054874 2013-08-02 -0.179642 2013-08-05 0.639589 2013-08-06 NaN 2013-08-07 1.906684 2013-08-08 0.104050 In [85]: df.reindex(df.index[::-1]).ffill() Out[85]: A 2013-08-08 0.104050 2013-08-07 1.906684 2013-08-06 1.906684 2013-08-05 0.639589 2013-08-02 -0.179642 2013-08-01 -1.054874

分组

不同于agg函数，apply函数是可以在一个数据框的子集中获取所有列进行处理的函数。

In [86]: df = pd.DataFrame({'animal': 'cat dog cat fish dog cat cat'.split(), ....: 'size': list('SSMMMLL'), ....: 'weight': [8, 10, 11, 1, 20, 12, 12], ....: 'adult' : [False] * 5 + [True] * 2}); df ....: Out[86]: adult animal size weight 0 False cat S 8 1 False dog S 10 2 False cat M 11 3 False fish M 1 4 False dog M 20 5 True cat L 12 6 True cat L 12 #List the size of the animals with the highest weight. In [87]: df.groupby('animal').apply(lambda subf: subf['size'][subf['weight'].idxmax()]) Out[87]: animal cat L dog M fish M dtype: object

get_group的使用方法

In [88]: gb = df.groupby(['animal']) In [89]: gb.get_group('cat') Out[89]: adult animal size weight 0 False cat S 8 2 False cat M 11 5 True cat L 12 6 True cat L 12

在分组后对一个列中不同的分类变量进行不同的操作

In [90]: def GrowUp(x): ....: avg_weight = sum(x[x['size'] == 'S'].weight * 1.5) ....: avg_weight += sum(x[x['size'] == 'M'].weight * 1.25) ....: avg_weight += sum(x[x['size'] == 'L'].weight) ....: avg_weight /= len(x) ....: return pd.Series(['L',avg_weight,True], index=['size', 'weight', 'adult']) ....: In [91]: expected_df = gb.apply(GrowUp) In [92]: expected_df Out[92]: size weight adult animal cat L 12.4375 True dog L 20.0000 True fish L 1.2500 True

expanding与apply的组合使用

In [93]: S = pd.Series([i / 100.0 for i in range(1,11)]) In [94]: def CumRet(x,y): ....: return x * (1 + y) ....: In [95]: def Red(x): ....: return functools.reduce(CumRet,x,1.0) ....: In [96]: S.expanding().apply(Red) Out[96]: 0 1.010000 1 1.030200 2 1.061106 3 1.103550 4 1.158728 5 1.228251 6 1.314229 7 1.419367 8 1.547110 9 1.701821 dtype: float64

在分组后替换一些值为每个分组的平均值（这里注意apply与transform的差别）

In [97]: df = pd.DataFrame({'A' : [1, 1, 2, 2], 'B' : [1, -1, 1, 2]}) In [98]: gb = df.groupby('A') In [99]: def replace(g): ....: mask = g < 0 ....: g.loc[mask] = g[~mask].mean() ....: return g ....: In [100]: gb.transform(replace) Out[100]: B 0 1.0 1 1.0 2 1.0 3 2.0

对聚合的数据进行分组排序

In [101]: df = pd.DataFrame({'code': ['foo', 'bar', 'baz'] * 2, .....: 'data': [0.16, -0.21, 0.33, 0.45, -0.59, 0.62], .....: 'flag': [False, True] * 3}) .....: In [102]: code_groups = df.groupby('code') In [103]: agg_n_sort_order = code_groups[['data']].transform(sum).sort_values(by='data') In [104]: sorted_df = df.loc[agg_n_sort_order.index] In [105]: sorted_df Out[105]: code data flag 1 bar -0.21 True 4 bar -0.59 False 0 foo 0.16 False 3 foo 0.45 True 2 baz 0.33 False 5 baz 0.62 True

创建多重聚合列

In [106]: rng = pd.date_range(start="2014-10-07",periods=10,freq='2min') In [107]: ts = pd.Series(data = list(range(10)), index = rng) In [108]: def MyCust(x): .....: if len(x) > 2: .....: return x[1] * 1.234 .....: return pd.NaT .....: In [109]: mhc = {'Mean' : np.mean, 'Max' : np.max, 'Custom' : MyCust} In [110]: ts.resample("5min").apply(mhc) Out[110]: Custom 2014-10-07 00:00:00 1.234 2014-10-07 00:05:00 NaT 2014-10-07 00:10:00 7.404 2014-10-07 00:15:00 NaT Max 2014-10-07 00:00:00 2 2014-10-07 00:05:00 4 2014-10-07 00:10:00 7 2014-10-07 00:15:00 9 Mean 2014-10-07 00:00:00 1 2014-10-07 00:05:00 3.5 2014-10-07 00:10:00 6 2014-10-07 00:15:00 8.5 dtype: object In [111]: ts Out[111]: 2014-10-07 00:00:00 0 2014-10-07 00:02:00 1 2014-10-07 00:04:00 2 2014-10-07 00:06:00 3 2014-10-07 00:08:00 4 2014-10-07 00:10:00 5 2014-10-07 00:12:00 6 2014-10-07 00:14:00 7 2014-10-07 00:16:00 8 2014-10-07 00:18:00 9 Freq: 2T, dtype: int64

创建一个计数的列并且放置在数据框中

In [112]: df = pd.DataFrame({'Color': 'Red Red Red Blue'.split(), .....: 'Value': [100, 150, 50, 50]}); df .....: Out[112]: Color Value 0 Red 100 1 Red 150 2 Red 50 3 Blue 50 In [113]: df['Counts'] = df.groupby(['Color']).transform(len) In [114]: df Out[114]: Color Value Counts 0 Red 100 3 1 Red 150 3 2 Red 50 3 3 Blue 50 1

在index的基础上进行分组后对某一个组的值位移

In [115]: df = pd.DataFrame( .....: {u'line_race': [10, 10, 8, 10, 10, 8], .....: u'beyer': [99, 102, 103, 103, 88, 100]}, .....: index=[u'Last Gunfighter', u'Last Gunfighter', u'Last Gunfighter', .....: u'Paynter', u'Paynter', u'Paynter']); df .....: Out[115]: beyer line_race Last Gunfighter 99 10 Last Gunfighter 102 10 Last Gunfighter 103 8 Paynter 103 10 Paynter 88 10 Paynter 100 8 In [116]: df['beyer_shifted'] = df.groupby(level=0)['beyer'].shift(1) In [117]: df Out[117]: beyer line_race beyer_shifted Last Gunfighter 99 10 NaN Last Gunfighter 102 10 99.0 Last Gunfighter 103 8 102.0 Paynter 103 10 NaN Paynter 88 10 103.0 Paynter 100 8 88.0

从每个分组中选取最大值的行

In [118]: df = pd.DataFrame({'host':['other','other','that','this','this'], .....: 'service':['mail','web','mail','mail','web'], .....: 'no':[1, 2, 1, 2, 1]}).set_index(['host', 'service']) .....: In [119]: mask = df.groupby(level=0).agg('idxmax') In [120]: df_count = df.loc[mask['no']].reset_index() In [121]: df_count Out[121]: host service no 0 other web 2 1 that mail 1 2 this mail 2

像Python的itertools.groupby一样分组

In [122]: df = pd.DataFrame([0, 1, 0, 1, 1, 1, 0, 1, 1], columns=['A']) In [123]: df.A.groupby((df.A != df.A.shift()).cumsum()).groups Out[123]: {1: Int64Index([0], dtype='int64'), 2: Int64Index([1], dtype='int64'), 3: Int64Index([2], dtype='int64'), 4: Int64Index([3, 4, 5], dtype='int64'), 5: Int64Index([6], dtype='int64'), 6: Int64Index([7, 8], dtype='int64')} In [124]: df.A.groupby((df.A != df.A.shift()).cumsum()).cumsum() Out[124]: 0 0 1 1 2 0 3 1 4 2 5 3 6 0 7 1 8 2 Name: A, dtype: int64

转载请注明原文地址: https://www.6miu.com/read-46700.html

技术

最新回复(0)

pandas官方文档cookbook（5）中MissValue&amp;groupby翻译

缺失值

分组

技术

pandas官方文档cookbook（5）中MissValue&groupby翻译