# Python Data Analysis in Practice, Notes: pandas in Depth, Data Manipulation (1)

xiaoxiao 2021-02-28

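《Python数据分析实战》 (Python Data Analysis in Practice)

1. Data preparation

2. Merging

Combining the rows of two DataFrame objects by matching values in one or more key columns, much like a SQL JOIN, is called merging in pandas. The function that performs it is merge().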

    frame1
    >>
            id  price
    0     ball  12.33
    1   pencil  11.44
    2      pen  33.21
    3      mug  13.23
    4  ashtray  33.62

    frame2
    >>
       color      id
    0  white  pencil
    1    red  pencil
    2    red    ball
    3  black     pen

    # merge() performs the merge.
    pd.merge(frame1, frame2)
    >>
           id  price  color
    0    ball  12.33    red
    1  pencil  11.44  white
    2  pencil  11.44    red
    3     pen  33.21  black
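A minimal runnable sketch of the merge above. The two frames are rebuilt from the printed output, so the constructor calls are an assumption about how they were created, and the name `merged` is only illustrative.

```python
import pandas as pd

# Rebuild the two frames shown in the printed output above.
frame1 = pd.DataFrame({
    'id': ['ball', 'pencil', 'pen', 'mug', 'ashtray'],
    'price': [12.33, 11.44, 33.21, 13.23, 33.62],
})
frame2 = pd.DataFrame({
    'color': ['white', 'red', 'red', 'black'],
    'id': ['pencil', 'pencil', 'ball', 'pen'],
})

# With no `on` argument, merge() joins on the columns both frames share.
# Here the only shared column is 'id'.
merged = pd.merge(frame1, frame2)
print(merged)
```

Only ids present in both frames survive, and `pencil` appears twice because it matches two rows of `frame2`.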

    frame1 = pd.DataFrame({
        'id': ['ball', 'pencil', 'pen', 'mug', 'ashtray'],
        'color': ['white', 'red', 'red', 'black', 'green'],
        'brand': ['OMG', 'ABC', 'ABC', 'POD', 'POD']
    })
    frame2 = pd.DataFrame({
        'id': ['pencil', 'pencil', 'ball', 'pen'],
        'brand': ['OMG', 'POD', 'ABC', 'POD']
    })

    # 1. Every column of frame2 (id and brand) also exists in frame1, and merge()
    #    uses all shared columns as keys by default. No pair of rows matches on
    #    both id and brand, so the result is an empty DataFrame.
    pd.merge(frame1, frame2)
    >>
    Empty DataFrame
    Columns: [brand, color, id]
    Index: []

    # 2. Use the on option to name the column the merge is based on.
    pd.merge(frame1, frame2, on='id')
    >>
      brand_x  color      id brand_y
    0     OMG  white    ball     ABC
    1     ABC    red  pencil     OMG
    2     ABC    red  pencil     POD
    3     ABC    red     pen     POD

    pd.merge(frame1, frame2, on='brand')
    >>
      brand  color     id_x    id_y
    0   OMG  white     ball  pencil
    1   ABC    red   pencil    ball
    2   ABC    red      pen    ball
    3   POD  black      mug  pencil
    4   POD  black      mug     pen
    5   POD  green  ashtray  pencil
    6   POD  green  ashtray     pen
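The sketch below illustrates why the default merge comes back empty: it lists the columns the two frames share, which merge() uses as keys when `on` is omitted. The `suffixes` argument is an optional extra, not something these notes use; it simply replaces the default `_x`/`_y` suffixes. The variable names are illustrative.

```python
import pandas as pd

frame1 = pd.DataFrame({
    'id': ['ball', 'pencil', 'pen', 'mug', 'ashtray'],
    'color': ['white', 'red', 'red', 'black', 'green'],
    'brand': ['OMG', 'ABC', 'ABC', 'POD', 'POD'],
})
frame2 = pd.DataFrame({
    'id': ['pencil', 'pencil', 'ball', 'pen'],
    'brand': ['OMG', 'POD', 'ABC', 'POD'],
})

# Columns present in both frames become the default merge keys.
shared = [col for col in frame1.columns if col in frame2.columns]
print(shared)

# No row pair agrees on both 'id' and 'brand', so nothing is kept.
default_merge = pd.merge(frame1, frame2)
print(default_merge.empty)

# Merging on 'id' alone keeps both 'brand' columns; suffixes= names them.
by_id = pd.merge(frame1, frame2, on='id', suffixes=('_left', '_right'))
print(by_id.columns.tolist())
```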

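    # When the key column has a different name in each frame, use left_on and
    # right_on. Here the id column of frame2 is first renamed to sid.
    frame2.columns = ['brand', 'sid']
    pd.merge(frame1, frame2, left_on='id', right_on='sid')
    >>
      brand_x  color      id brand_y     sid
    0     OMG  white    ball     ABC    ball
    1     ABC    red  pencil     OMG  pencil
    2     ABC    red  pencil     POD  pencil
    3     ABC    red     pen     POD     pen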

By default merge() performs an inner join: the keys in the results above are the intersection of the keys in the two frames.

    # Restore the column names of frame2.
    frame2.columns = ['brand', 'id']

    # 1. Inner join (the default)
    pd.merge(frame1, frame2, on='id')
    >>
      brand_x  color      id brand_y
    0     OMG  white    ball     ABC
    1     ABC    red  pencil     OMG
    2     ABC    red  pencil     POD
    3     ABC    red     pen     POD

    # 2. Outer join
    pd.merge(frame1, frame2, on='id', how='outer')
    >>
      brand_x  color       id brand_y
    0     OMG  white     ball     ABC
    1     ABC    red   pencil     OMG
    2     ABC    red   pencil     POD
    3     ABC    red      pen     POD
    4     POD  black      mug     NaN
    5     POD  green  ashtray     NaN

    # 3. Left join
    pd.merge(frame1, frame2, on='id', how='left')
    >>
      brand_x  color       id brand_y
    0     OMG  white     ball     ABC
    1     ABC    red   pencil     OMG
    2     ABC    red   pencil     POD
    3     ABC    red      pen     POD
    4     POD  black      mug     NaN
    5     POD  green  ashtray     NaN

    # 4. Right join
    pd.merge(frame1, frame2, on='id', how='right')
    >>
      brand_x  color      id brand_y
    0     OMG  white    ball     ABC
    1     ABC    red  pencil     OMG
    2     ABC    red  pencil     POD
    3     ABC    red     pen     POD

    # 5. To merge on several keys, pass a list of keys to the on option.
    pd.merge(frame1, frame2, on=['id', 'brand'], how='outer')
    >>
      brand  color       id
    0   OMG  white     ball
    1   ABC    red   pencil
    2   ABC    red      pen
    3   POD  black      mug
    4   POD  green  ashtray
    5   OMG    NaN   pencil
    6   POD    NaN   pencil
    7   ABC    NaN     ball
    8   POD    NaN      pen
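As a rough way to compare the four join types, the sketch below counts the rows each one returns for the same two frames. The `indicator=True` option is not used in these notes; it is included here only as one possible way to see which frame each row of an outer join came from. `row_counts`, `tagged` and `left_only` are illustrative names.

```python
import pandas as pd

frame1 = pd.DataFrame({
    'id': ['ball', 'pencil', 'pen', 'mug', 'ashtray'],
    'color': ['white', 'red', 'red', 'black', 'green'],
    'brand': ['OMG', 'ABC', 'ABC', 'POD', 'POD'],
})
frame2 = pd.DataFrame({
    'id': ['pencil', 'pencil', 'ball', 'pen'],
    'brand': ['OMG', 'POD', 'ABC', 'POD'],
})

# Count the rows produced by each join type when merging on 'id'.
row_counts = {
    how: len(pd.merge(frame1, frame2, on='id', how=how))
    for how in ('inner', 'outer', 'left', 'right')
}
print(row_counts)

# indicator=True adds a '_merge' column naming the source of each row.
tagged = pd.merge(frame1, frame2, on='id', how='outer', indicator=True)
left_only = tagged.loc[tagged['_merge'] == 'left_only', 'id'].tolist()
print(left_only)
```

Every id in `frame2` also occurs in `frame1`, which is why the right join returns the same four rows as the inner join, while the left and outer joins add `mug` and `ashtray`.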

    # Merge on the row index instead of a column.
    pd.merge(frame1, frame2, right_index=True, left_index=True)
    >>
      brand_x  color    id_x brand_y    id_y
    0     OMG  white    ball     OMG  pencil
    1     ABC    red  pencil     POD  pencil
    2     ABC    red     pen     ABC    ball
    3     POD  black     mug     POD     pen
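A small sketch of the index-based merge, assuming the same two frames as in the earlier examples. Because the merge is still an inner join, only index labels present in both frames (0 to 3) are kept.

```python
import pandas as pd

frame1 = pd.DataFrame({
    'id': ['ball', 'pencil', 'pen', 'mug', 'ashtray'],
    'color': ['white', 'red', 'red', 'black', 'green'],
    'brand': ['OMG', 'ABC', 'ABC', 'POD', 'POD'],
})
frame2 = pd.DataFrame({
    'id': ['pencil', 'pencil', 'ball', 'pen'],
    'brand': ['OMG', 'POD', 'ABC', 'POD'],
})

# Rows are paired by index label; the key columns play no role here,
# so both 'id' and 'brand' get the _x / _y suffixes.
by_index = pd.merge(frame1, frame2, left_index=True, right_index=True)
print(by_index)
```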

    frame1.join(frame2)

pandas raises an error here, because some column names of frame1 overlap with those of frame2. Rename the columns of frame2 before calling join().

    frame2.columns = ['brand2', 'id2']
    frame1.join(frame2)
    >>
      brand  color       id brand2     id2
    0   OMG  white     ball    OMG  pencil
    1   ABC    red   pencil    POD  pencil
    2   ABC    red      pen    ABC    ball
    3   POD  black      mug    POD     pen
    4   POD  green  ashtray    NaN     NaN
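The sketch below reproduces the overlap error and shows one alternative to renaming: the `lsuffix` and `rsuffix` options of join(). That alternative is not part of these notes, and the suffix strings and variable names are arbitrary choices for illustration.

```python
import pandas as pd

frame1 = pd.DataFrame({
    'id': ['ball', 'pencil', 'pen', 'mug', 'ashtray'],
    'color': ['white', 'red', 'red', 'black', 'green'],
    'brand': ['OMG', 'ABC', 'ABC', 'POD', 'POD'],
})
frame2 = pd.DataFrame({
    'id': ['pencil', 'pencil', 'ball', 'pen'],
    'brand': ['OMG', 'POD', 'ABC', 'POD'],
})

# join() refuses to guess how to name overlapping columns.
error = None
try:
    frame1.join(frame2)
except ValueError as exc:
    error = str(exc)
print(error)

# Supplying suffixes resolves the overlap without renaming frame2.
joined = frame1.join(frame2, lsuffix='_1', rsuffix='_2')
print(joined)
```

join() aligns on the index and keeps every row of the calling frame, so the fifth row of `frame1` is retained with NaN in the columns that come from `frame2`.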

3. Concatenating

    array1
    >>
    array([[0, 1, 2],
           [3, 4, 5],
           [6, 7, 8]])

    array2
    >>
    array([[ 6,  7,  8],
           [ 9, 10, 11],
           [12, 13, 14]])

    # NumPy's concatenate() function
    np.concatenate([array1, array2], axis=0)
    >>
    array([[ 0,  1,  2],
           [ 3,  4,  5],
           [ 6,  7,  8],
           [ 6,  7,  8],
           [ 9, 10, 11],
           [12, 13, 14]])

    np.concatenate([array1, array2], axis=1)
    >>
    array([[ 0,  1,  2,  6,  7,  8],
           [ 3,  4,  5,  9, 10, 11],
           [ 6,  7,  8, 12, 13, 14]])

    # pandas' concat() function
    ser1
    >>
    1    0.636
    2    0.345
    3    0.157
    4    0.070

    ser2
    >>
    5    0.411
    6    0.359
    7    0.987
    8    0.329

    pd.concat([ser1, ser2])
    >>
    1    0.636
    2    0.345
    3    0.157
    4    0.070
    5    0.411
    6    0.359
    7    0.987
    8    0.329

    # concat() concatenates along axis=0 by default and returns a Series.
    # With axis=1 the result is a DataFrame.
    pd.concat([ser1, ser2], axis=1)
    >>
           0      1
    1  0.636    NaN
    2  0.345    NaN
    3  0.157    NaN
    4  0.070    NaN
    5    NaN  0.411
    6    NaN  0.359
    7    NaN  0.987
    8    NaN  0.329
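A runnable sketch of the concatenations above. The arrays are built with `np.arange`, which is an assumption about how they were created; the Series values are copied from the printed output. The result names are illustrative.

```python
import numpy as np
import pandas as pd

array1 = np.arange(9).reshape(3, 3)        # values 0..8
array2 = np.arange(6, 15).reshape(3, 3)    # values 6..14

rows = np.concatenate([array1, array2], axis=0)   # stack vertically
cols = np.concatenate([array1, array2], axis=1)   # stack horizontally
print(rows.shape, cols.shape)

ser1 = pd.Series([0.636, 0.345, 0.157, 0.070], index=[1, 2, 3, 4])
ser2 = pd.Series([0.411, 0.359, 0.987, 0.329], index=[5, 6, 7, 8])

stacked = pd.concat([ser1, ser2])            # axis=0: one longer Series
side = pd.concat([ser1, ser2], axis=1)       # axis=1: a DataFrame
print(type(stacked).__name__, type(side).__name__)
```

With `axis=1` the two Series share no index labels, so each column of `side` is NaN wherever the other Series supplies the value.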

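    # join='inner' keeps only the index labels common to all inputs.
    # ser3 is assumed to be the axis=1 result above: pd.concat([ser1, ser2], axis=1)
    pd.concat([ser1, ser3], axis=1, join='inner')
    >>
           0      0   1
    1  0.636  0.636 NaN
    2  0.345  0.345 NaN
    3  0.157  0.157 NaN
    4  0.070  0.070 NaN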
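The notes never define `ser3`. The sketch below assumes it is the DataFrame produced by concatenating `ser1` and `ser2` along `axis=1`, which is consistent with the printed output, and shows how `join='inner'` restricts the result to the shared index labels.

```python
import pandas as pd

ser1 = pd.Series([0.636, 0.345, 0.157, 0.070], index=[1, 2, 3, 4])
ser2 = pd.Series([0.411, 0.359, 0.987, 0.329], index=[5, 6, 7, 8])

# Assumption: ser3 is the side-by-side concatenation of ser1 and ser2.
ser3 = pd.concat([ser1, ser2], axis=1)

# Only labels 1..4 exist in both ser1 and ser3, so only they survive.
inner = pd.concat([ser1, ser3], axis=1, join='inner')
print(inner)
```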

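    # The keys option builds a hierarchical index on the concatenation axis.
    pd.concat([ser1, ser2], keys=[1, 2])
    >>
    1  1    0.636
       2    0.345
       3    0.157
       4    0.070
    2  5    0.411
       6    0.359
       7    0.987
       8    0.329

    # With axis=1 the keys become the column names.
    pd.concat([ser1, ser2], axis=1, keys=[1, 2])
    >>
           1      2
    1  0.636    NaN
    2  0.345    NaN
    3  0.157    NaN
    4  0.070    NaN
    5    NaN  0.411
    6    NaN  0.359
    7    NaN  0.987
    8    NaN  0.329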
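A short sketch of the `keys` option of concat(), using the Series values printed earlier. It shows that the outer index level can be used to recover each original piece; `labeled` and `wide` are illustrative names.

```python
import pandas as pd

ser1 = pd.Series([0.636, 0.345, 0.157, 0.070], index=[1, 2, 3, 4])
ser2 = pd.Series([0.411, 0.359, 0.987, 0.329], index=[5, 6, 7, 8])

# keys= adds an outer index level that labels each input.
labeled = pd.concat([ser1, ser2], keys=[1, 2])
print(labeled.index.nlevels)

# Selecting an outer label returns the piece that carried that key.
print(labeled.loc[1])

# Along axis=1 the same keys are used as column names.
wide = pd.concat([ser1, ser2], axis=1, keys=[1, 2])
print(wide.columns.tolist())
```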

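4. Combining

combine_first() combines two Series objects and aligns their data at the same time: values from the calling Series take priority, and the other Series only fills in index labels that are missing or NaN.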

    ser1
    >>
    1    0.942
    2    0.033
    3    0.886
    4    0.809
    5    0.800

    ser2
    >>
    2    0.739
    4    0.225
    5    0.709
    6    0.214

    ser1.combine_first(ser2)
    >>
    1    0.942
    2    0.033
    3    0.886
    4    0.809
    5    0.800
    6    0.214

    ser2.combine_first(ser1)
    >>
    1    0.942
    2    0.739
    3    0.886
    4    0.225
    5    0.709
    6    0.214

    # A partial overlap: combine only the first three elements of each Series.
    ser1[:3].combine_first(ser2[:3])
    >>
    1    0.942
    2    0.033
    3    0.886
    4    0.225
    5    0.709
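The sketch below reruns the combine_first() calls with values like those printed above. It uses `.iloc[:3]` for the positional slice, which is a stylistic choice made here to keep the slice unambiguous on an integer index; the result names are illustrative.

```python
import pandas as pd

ser1 = pd.Series([0.942, 0.033, 0.886, 0.809, 0.800], index=[1, 2, 3, 4, 5])
ser2 = pd.Series([0.739, 0.225, 0.709, 0.214], index=[2, 4, 5, 6])

# The caller's values win; the argument only fills missing labels.
first = ser1.combine_first(ser2)     # label 6 comes from ser2
second = ser2.combine_first(ser1)    # labels 1 and 3 come from ser1

# Partial overlap: first three elements of each Series, by position.
partial = ser1.iloc[:3].combine_first(ser2.iloc[:3])

print(first, second, partial, sep='\n\n')
```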

5. Pivoting:

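    ser5
    >>
    white  ball      0
           pen       1
           pencil    2
    black  ball      3
           pen       4
           pencil    5
    red    ball      6
           pen       7
           pencil    8

    # unstack() moves the innermost index level to the columns.
    ser5.unstack()
    >>
           ball  pen  pencil
    white     0    1       2
    black     3    4       5
    red       6    7       8

    # unstack(0) moves the outermost level (level 0) to the columns instead.
    ser5.unstack(0)
    >>
            white  black  red
    ball        0      3    6
    pen         1      4    7
    pencil      2      5    8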
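A sketch of the two unstack() calls. The notes do not show how `ser5` was built; here it is assumed to come from stacking a 3x3 DataFrame, which yields the same two-level index. The names `frame`, `inner` and `outer` are illustrative.

```python
import numpy as np
import pandas as pd

# Assumption: ser5 is the stacked form of this small frame.
frame = pd.DataFrame(
    np.arange(9).reshape(3, 3),
    index=['white', 'black', 'red'],
    columns=['ball', 'pen', 'pencil'],
)
ser5 = frame.stack()          # Series indexed by (color, item)

inner = ser5.unstack()        # innermost level (item) becomes the columns
outer = ser5.unstack(0)       # level 0 (color) becomes the columns

print(inner)
print(outer)
```

The two results are transposes of each other: `inner` has colors as rows, while `outer` has items as rows.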

    longframe
    >>
       color  item  value
    0  white  ball  0.091
    1  white   pen  0.495
    2  white   mug  0.956
    3    red  ball  0.394
    4    red   pen  0.501
    5    red   mug  0.561
    6  black  ball  0.879
    7  black   pen  0.610
    8  black   mug  0.093

pandas provides the pivot() function to convert a long-format DataFrame to wide format. It takes as arguments the column or columns to use as keys.

    # Use the color column as the primary key and the item column as the second key.
    wideframe = longframe.pivot(index='color', columns='item')
    wideframe
    >>
           value
    item    ball    mug    pen
    color
    black  0.879  0.093  0.610
    red    0.394  0.561  0.501
    white  0.091  0.956  0.495
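A runnable sketch of the long-to-wide conversion, with `longframe` rebuilt from the printed table. Unlike the call in the notes, it also passes `values='value'`, an optional argument that yields flat column labels instead of a two-level column index; that choice is made here for illustration only.

```python
import pandas as pd

longframe = pd.DataFrame({
    'color': ['white'] * 3 + ['red'] * 3 + ['black'] * 3,
    'item': ['ball', 'pen', 'mug'] * 3,
    'value': [0.091, 0.495, 0.956, 0.394, 0.501, 0.561, 0.879, 0.610, 0.093],
})

# One row per color, one column per item, cells taken from 'value'.
wideframe = longframe.pivot(index='color', columns='item', values='value')
print(wideframe)

# A single cell can then be read by (row label, column label).
print(wideframe.loc['black', 'mug'])
```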

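6. Removing:

The del command and the drop() function

    frame1
    >>
           ball  pen  pencil
    white     0    1       2
    black     3    4       5
    red       6    7       8

    # To remove a column, apply the del command to the DataFrame and give the column name.
    del frame1['ball']
    frame1
    >>
           pen  pencil
    white    1       2
    black    4       5
    red      7       8

    # To remove an unwanted row, call drop() with the index label as the argument.
    # drop() returns a new DataFrame and leaves frame1 itself unchanged.
    frame1.drop('white')
    >>
           pen  pencil
    black    4       5
    red      7       8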
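The sketch below contrasts the two removal styles: `del` changes the frame in place, while drop() returns a new object. The frame is rebuilt from the printed output, and the final `axis=1` call (dropping a column with drop()) is an extra illustration that does not appear in the notes.

```python
import numpy as np
import pandas as pd

frame1 = pd.DataFrame(
    np.arange(9).reshape(3, 3),
    index=['white', 'black', 'red'],
    columns=['ball', 'pen', 'pencil'],
)

# del removes the column from frame1 itself.
del frame1['ball']

# drop() returns a new frame; frame1 still contains the 'white' row.
without_white = frame1.drop('white')

# drop() can also remove a column when axis=1 is given.
without_pen = frame1.drop('pen', axis=1)

print(frame1, without_white, without_pen, sep='\n\n')
```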