10 minutes of pandas

xiaoxiao, 2025-10-09

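This is the "10 minutes to pandas" tutorial that circulates online; I typed the whole thing out by hand here.

P.S. Ten minutes? Hardly. It took me ages to type all of this, and I am not happy about it.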

First, import the libraries:

    import pandas as pd
    import numpy as np
    from matplotlib import pyplot as plt

Creating objects

Create a Series by passing a list of values and letting pandas build a default integer index:

    s = pd.Series([1,2,3,4,5,np.nan,6])
    s

    0    1.0
    1    2.0
    2    3.0
    3    4.0
    4    5.0
    5    NaN
    6    6.0
    dtype: float64

Missing values are usually created with np.nan.
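As a small illustration of the point above (a sketch only; the names `ints` and `with_nan` are made up for this example and are not part of the original tutorial), putting np.nan into an otherwise integer list makes the whole Series float64, which is why the output above shows 1.0, 2.0 and so on:

```python
import numpy as np
import pandas as pd

ints = pd.Series([1, 2, 3])            # all integers: integer dtype
with_nan = pd.Series([1, 2, np.nan])   # np.nan is a float, so the dtype becomes float64

print(ints.dtype)
print(with_nan.dtype)
print(with_nan.isna().tolist())        # True marks the missing entry
```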

Create a DataFrame by passing a numpy array together with a datetime index and labeled columns:

    datas = pd.date_range('20180501',periods=6)
    datas

    DatetimeIndex(['2018-05-01', '2018-05-02', '2018-05-03', '2018-05-04',
                   '2018-05-05', '2018-05-06'],
                  dtype='datetime64[ns]', freq='D')

    df = pd.DataFrame(np.random.rand(6,4),index=datas,columns=list('ABCD'))
    df

                       A         B         C         D
    2018-05-01  0.225279  0.735710  0.009639  0.408149
    2018-05-02  0.705007  0.900239  0.551207  0.165471
    2018-05-03  0.608068  0.332345  0.519019  0.181947
    2018-05-04  0.921958  0.626003  0.945828  0.357211
    2018-05-05  0.304423  0.836494  0.731351  0.678947
    2018-05-06  0.342860  0.053211  0.670777  0.186546

np.random.randn: draws random numbers from the standard normal distribution.
np.random.rand: draws random numbers uniformly from the interval [0, 1).
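To make the difference between the two generators concrete, here is a minimal sketch. The seed and the sample size of 10000 are arbitrary choices for this illustration and do not come from the original tutorial:

```python
import numpy as np

np.random.seed(0)                    # fixed seed, only so the run is repeatable
uniform = np.random.rand(10000)      # uniform on [0, 1)
normal = np.random.randn(10000)      # standard normal: mean 0, standard deviation 1

print(uniform.min() >= 0, uniform.max() < 1)   # rand never leaves [0, 1)
print((normal < 0).any())                      # randn also produces negative values
print(round(uniform.mean(), 2), round(normal.mean(), 2))
```

This is why the df above, built with rand, contains only values between 0 and 1, while the later examples built with randn contain negative numbers.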

Create a DataFrame from a dict:

    df2 = pd.DataFrame({'A':1.,
                        'B':pd.Timestamp('20180101'),
                        'C':pd.Series(1,index=list(range(4)),dtype='float32'),
                        'D':np.array([3] * 4,dtype='int32'),
                        'E':pd.Categorical(["test","train","test","train"]),
                        'F':'foo'})
    df2

         A          B    C  D      E    F
    0  1.0 2018-01-01  1.0  3   test  foo
    1  1.0 2018-01-01  1.0  3  train  foo
    2  1.0 2018-01-01  1.0  3   test  foo
    3  1.0 2018-01-01  1.0  3  train  foo

pd.Timestamp: a timestamp, comparable to datetime in Python.
pd.Series: every column of a DataFrame is a Series.
np.array: a numpy array can be used to build a column as well.
pd.Categorical: categorical values.

The resulting columns have different dtypes:

    df2.dtypes

    A           float64
    B    datetime64[ns]
    C           float32
    D             int32
    E          category
    F            object
    dtype: object

Viewing data

    df.head()    # the first five rows of the DataFrame

                       A         B         C         D
    2018-05-01  0.225279  0.735710  0.009639  0.408149
    2018-05-02  0.705007  0.900239  0.551207  0.165471
    2018-05-03  0.608068  0.332345  0.519019  0.181947
    2018-05-04  0.921958  0.626003  0.945828  0.357211
    2018-05-05  0.304423  0.836494  0.731351  0.678947

    df.tail(3)   # the last three rows

                       A         B         C         D
    2018-05-04  0.921958  0.626003  0.945828  0.357211
    2018-05-05  0.304423  0.836494  0.731351  0.678947
    2018-05-06  0.342860  0.053211  0.670777  0.186546

Display the index, the column names, and the underlying values:

    df.index     # the index

    DatetimeIndex(['2018-05-01', '2018-05-02', '2018-05-03', '2018-05-04',
                   '2018-05-05', '2018-05-06'],
                  dtype='datetime64[ns]', freq='D')

    df.columns   # the column names

    Index(['A', 'B', 'C', 'D'], dtype='object')

    df.values    # the values

    array([[0.22527927, 0.73570966, 0.00963855, 0.40814939],
           [0.70500688, 0.90023873, 0.55120699, 0.16547071],
           [0.60806789, 0.33234503, 0.51901872, 0.18194742],
           [0.92195751, 0.62600349, 0.94582788, 0.35721069],
           [0.304423  , 0.83649383, 0.73135067, 0.67894676],
           [0.34286019, 0.05321105, 0.67077744, 0.18654591]])

describe() shows a quick statistical summary:

    df.describe()

                  A         B         C         D
    count  6.000000  6.000000  6.000000  6.000000
    mean   0.517932  0.580667  0.571303  0.329712
    std    0.271382  0.326662  0.314447  0.199088
    min    0.225279  0.053211  0.009639  0.165471
    25%    0.314032  0.405760  0.527066  0.183097
    50%    0.475464  0.680857  0.610992  0.271878
    75%    0.680772  0.811298  0.716207  0.395415
    max    0.921958  0.900239  0.945828  0.678947

count: number of non-missing values
mean: average
std: standard deviation
min: minimum
25%: 25th percentile (lower quartile)
50%: median
75%: 75th percentile (upper quartile)
max: maximum

Transposing the data:

    df.T

       2018-05-01  2018-05-02  2018-05-03  2018-05-04  2018-05-05  2018-05-06
    A    0.225279    0.705007    0.608068    0.921958    0.304423    0.342860
    B    0.735710    0.900239    0.332345    0.626003    0.836494    0.053211
    C    0.009639    0.551207    0.519019    0.945828    0.731351    0.670777
    D    0.408149    0.165471    0.181947    0.357211    0.678947    0.186546

Sorting by an axis:

    df.sort_index(axis=1,ascending=False)

                       D         C         B         A
    2018-05-01  0.408149  0.009639  0.735710  0.225279
    2018-05-02  0.165471  0.551207  0.900239  0.705007
    2018-05-03  0.181947  0.519019  0.332345  0.608068
    2018-05-04  0.357211  0.945828  0.626003  0.921958
    2018-05-05  0.678947  0.731351  0.836494  0.304423
    2018-05-06  0.186546  0.670777  0.053211  0.342860

axis: which axis's labels to sort by; 0 sorts by the index (row labels), 1 sorts by the column names.
ascending: a boolean; True for ascending order, False for descending.

Sorting by values:

    df.sort_values('B')

                       A         B         C         D
    2018-05-06  0.342860  0.053211  0.670777  0.186546
    2018-05-03  0.608068  0.332345  0.519019  0.181947
    2018-05-04  0.921958  0.626003  0.945828  0.357211
    2018-05-01  0.225279  0.735710  0.009639  0.408149
    2018-05-05  0.304423  0.836494  0.731351  0.678947
    2018-05-02  0.705007  0.900239  0.551207  0.165471

Slicing

Selecting a single column yields a Series; df['A'] is equivalent to df.A:

    df['A']

    2018-05-01    0.225279
    2018-05-02    0.705007
    2018-05-03    0.608068
    2018-05-04    0.921958
    2018-05-05    0.304423
    2018-05-06    0.342860
    Freq: D, Name: A, dtype: float64

    df.A

    2018-05-01    0.225279
    2018-05-02    0.705007
    2018-05-03    0.608068
    2018-05-04    0.921958
    2018-05-05    0.304423
    2018-05-06    0.342860
    Freq: D, Name: A, dtype: float64

Slicing with [] selects rows:

    df[:3]

                       A         B         C         D
    2018-05-01  0.225279  0.735710  0.009639  0.408149
    2018-05-02  0.705007  0.900239  0.551207  0.165471
    2018-05-03  0.608068  0.332345  0.519019  0.181947

Selection by label

Use a label to get a cross-section of the data:

    df.loc[datas[0]]

    A    0.225279
    B    0.735710
    C    0.009639
    D    0.408149
    Name: 2018-05-01 00:00:00, dtype: float64

    df.loc[datas[2]]

    A    0.608068
    B    0.332345
    C    0.519019
    D    0.181947
    Name: 2018-05-03 00:00:00, dtype: float64

Selecting on multiple axes by label:

    df.loc[:,['A','B']]

                       A         B
    2018-05-01  0.225279  0.735710
    2018-05-02  0.705007  0.900239
    2018-05-03  0.608068  0.332345
    2018-05-04  0.921958  0.626003
    2018-05-05  0.304423  0.836494
    2018-05-06  0.342860  0.053211

With a label slice, both endpoints are included:

    df.loc['20180502':'20180504',['A','C']]

                       A         C
    2018-05-02  0.705007  0.551207
    2018-05-03  0.608068  0.519019
    2018-05-04  0.921958  0.945828

    df.loc['20180505',['A','C']]

    A    0.304423
    C    0.731351
    Name: 2018-05-05 00:00:00, dtype: float64
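The inclusive endpoints of a label slice are easy to confuse with positional slicing, where the stop position is excluded as in ordinary Python. A minimal sketch of the contrast follows; the frame `demo` and its integer contents are made up for this illustration:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20180501', periods=6)
demo = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

by_label = demo.loc['20180502':'20180504']   # label slice: both endpoints included
by_position = demo.iloc[1:3]                 # positional slice: stop position excluded

print(len(by_label))      # rows for May 2, 3 and 4
print(len(by_position))   # rows at positions 1 and 2 only
```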

Getting a single value:

    df.loc[datas[0],'A']

    0.22527926638468565

at can be used for the same purpose:

    df.at[datas[0],'A']

    0.22527926638468565

Selection by position

Select by passing an integer position:

    df.iloc[3]

    A    0.921958
    B    0.626003
    C    0.945828
    D    0.357211
    Name: 2018-05-04 00:00:00, dtype: float64

Slicing by position:

    df.iloc[0:6,0:3]

                       A         B         C
    2018-05-01  0.225279  0.735710  0.009639
    2018-05-02  0.705007  0.900239  0.551207
    2018-05-03  0.608068  0.332345  0.519019
    2018-05-04  0.921958  0.626003  0.945828
    2018-05-05  0.304423  0.836494  0.731351
    2018-05-06  0.342860  0.053211  0.670777

Selecting with lists of integer positions:

    df.iloc[[0,2,4],[1,3]]

                       B         D
    2018-05-01  0.735710  0.408149
    2018-05-03  0.332345  0.181947
    2018-05-05  0.836494  0.678947

    df.iloc[:,1:3]

                       B         C
    2018-05-01  0.735710  0.009639
    2018-05-02  0.900239  0.551207
    2018-05-03  0.332345  0.519019
    2018-05-04  0.626003  0.945828
    2018-05-05  0.836494  0.731351
    2018-05-06  0.053211  0.670777

Getting a single value:

    df.iloc[1,1]

    0.9002387294615217

Boolean indexing

Values that do not satisfy the condition are replaced with NaN:

    df[df>0.2]

                       A         B         C         D
    2018-05-01  0.225279  0.735710       NaN  0.408149
    2018-05-02  0.705007  0.900239  0.551207       NaN
    2018-05-03  0.608068  0.332345  0.519019       NaN
    2018-05-04  0.921958  0.626003  0.945828  0.357211
    2018-05-05  0.304423  0.836494  0.731351  0.678947
    2018-05-06  0.342860       NaN  0.670777       NaN
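A whole-frame mask such as df>0.2 keeps the shape and blanks out cells, whereas a mask built from a single column filters rows. The sketch below contrasts the two; the frame `demo` and the threshold 0.4 are invented for this illustration:

```python
import pandas as pd

demo = pd.DataFrame({'A': [0.1, 0.5, 0.9], 'B': [0.8, 0.3, 0.6]})

masked = demo[demo > 0.4]                            # same shape, failing cells become NaN
rows = demo[demo['A'] > 0.4]                         # Series mask: keeps matching rows only
both = demo[(demo['A'] > 0.4) & (demo['B'] > 0.4)]   # combine conditions with & and parentheses

print(masked)
print(rows)
print(both)
```

The parentheses around each condition are needed because & binds more tightly than > in Python.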

Filtering with isin():

    df2 = df.copy()
    df2['E'] = ['one','two','three','four','five','six']
    df2

                       A         B         C         D      E
    2018-05-01  0.225279  0.735710  0.009639  0.408149    one
    2018-05-02  0.705007  0.900239  0.551207  0.165471    two
    2018-05-03  0.608068  0.332345  0.519019  0.181947  three
    2018-05-04  0.921958  0.626003  0.945828  0.357211   four
    2018-05-05  0.304423  0.836494  0.731351  0.678947   five
    2018-05-06  0.342860  0.053211  0.670777  0.186546    six

    df2['E'].isin(['one','five'])

    2018-05-01     True
    2018-05-02    False
    2018-05-03    False
    2018-05-04    False
    2018-05-05     True
    2018-05-06    False
    Freq: D, Name: E, dtype: bool

    df2[df2['E'].isin(['one','five'])]

                       A         B         C         D     E
    2018-05-01  0.225279  0.735710  0.009639  0.408149   one
    2018-05-05  0.304423  0.836494  0.731351  0.678947  five

Setting values

Assigning a Series as a new column aligns it with the DataFrame by index:

    s1 = pd.Series([1,2,3,4,5,6],index=pd.date_range('20180501',periods=6))
    s1

    2018-05-01    1
    2018-05-02    2
    2018-05-03    3
    2018-05-04    4
    2018-05-05    5
    2018-05-06    6
    Freq: D, dtype: int64

    df['F'] = s1

Setting a value by label:

    df.loc[datas[0],'A'] = 0

Setting a value by position:

    df.iloc[0,1] = 0

The result of these operations:

    df

                       A         B         C         D  F
    2018-05-01  0.000000  0.000000  0.009639  0.408149  1
    2018-05-02  0.705007  0.900239  0.551207  0.165471  2
    2018-05-03  0.608068  0.332345  0.519019  0.181947  3
    2018-05-04  0.921958  0.626003  0.945828  0.357211  4
    2018-05-05  0.304423  0.836494  0.731351  0.678947  5
    2018-05-06  0.342860  0.053211  0.670777  0.186546  6

Missing data

pandas mainly uses np.nan to represent missing data. By default, missing values are left out of computations.
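What "left out of computations" means in practice can be seen in a short sketch. The Series `values` is made up for this illustration; skipna is the pandas parameter that controls the behavior:

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, np.nan, 3.0])

print(values.mean())               # NaN is skipped: (1 + 3) / 2
print(values.mean(skipna=False))   # with skipna=False the NaN propagates
print(values.count())              # count() only counts non-missing entries
```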

Reindexing lets you change, add, or delete the labels on a given axis. It returns a copy of the data.

    df1 = df.reindex(index=datas[0:4],columns=list(df.columns) + ['E'])
    df1.loc[datas[0]:datas[1],'E'] = 1
    df1

                       A         B         C         D  F    E
    2018-05-01  0.000000  0.000000  0.009639  0.408149  1  1.0
    2018-05-02  0.705007  0.900239  0.551207  0.165471  2  1.0
    2018-05-03  0.608068  0.332345  0.519019  0.181947  3  NaN
    2018-05-04  0.921958  0.626003  0.945828  0.357211  4  NaN

Dropping every row that has missing data:

    df1.dropna()

                       A         B         C         D  F    E
    2018-05-01  0.000000  0.000000  0.009639  0.408149  1  1.0
    2018-05-02  0.705007  0.900239  0.551207  0.165471  2  1.0

Filling in missing data:

    df1.fillna(value=5)

                       A         B         C         D  F    E
    2018-05-01  0.000000  0.000000  0.009639  0.408149  1  1.0
    2018-05-02  0.705007  0.900239  0.551207  0.165471  2  1.0
    2018-05-03  0.608068  0.332345  0.519019  0.181947  3  5.0
    2018-05-04  0.921958  0.626003  0.945828  0.357211  4  5.0

Testing for missing values, which returns a boolean mask:

    df1.isna()

                    A      B      C      D      F      E
    2018-05-01  False  False  False  False  False  False
    2018-05-02  False  False  False  False  False  False
    2018-05-03  False  False  False  False  False   True
    2018-05-04  False  False  False  False  False   True

Operations with functions

    df.mean()    # statistics per column

    A    0.480386
    B    0.458049
    C    0.571303
    D    0.329712
    F    3.500000
    dtype: float64

    df.mean(1)   # statistics per row

    2018-05-01    0.283558
    2018-05-02    0.864385
    2018-05-03    0.928276
    2018-05-04    1.370200
    2018-05-05    1.510243
    2018-05-06    1.450679
    Freq: D, dtype: float64

    df.apply(np.cumsum)

                       A         B         C         D   F
    2018-05-01  0.000000  0.000000  0.009639  0.408149   1
    2018-05-02  0.705007  0.900239  0.560846  0.573620   3
    2018-05-03  1.313075  1.232584  1.079864  0.755568   6
    2018-05-04  2.235032  1.858587  2.025692  1.112778  10
    2018-05-05  2.539455  2.695081  2.757043  1.791725  15
    2018-05-06  2.882315  2.748292  3.427820  1.978271  21

apply runs the function on every column of the DataFrame.
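apply works column by column by default, and passing axis=1 makes it work row by row instead. A minimal sketch, in which the frame `demo` and the max-minus-min function are chosen purely for illustration:

```python
import pandas as pd

demo = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

# Default axis=0: the function receives each column as a Series
col_range = demo.apply(lambda col: col.max() - col.min())

# axis=1: the function receives each row as a Series
row_range = demo.apply(lambda row: row.max() - row.min(), axis=1)

print(col_range.to_dict())   # {'A': 2, 'B': 20}
print(row_range.tolist())    # [9, 18, 27]
```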

Histogramming

value_counts() counts how often each value occurs:

    s = pd.Series(np.random.randint(1,7,size=9))
    s

    0    6
    1    1
    2    1
    3    5
    4    1
    5    3
    6    2
    7    1
    8    1
    dtype: int32

    s.value_counts()

    1    5
    6    1
    5    1
    3    1
    2    1
    dtype: int64

String methods

    s = pd.Series(['A','B','C','Al',np.nan,'dOg'])
    s.str.lower()

    0      a
    1      b
    2      c
    3     al
    4    NaN
    5    dog
    dtype: object

Concatenating

    s1 = pd.Series(['a','b'])
    s2 = pd.Series(['c','d'])
    pd.concat([s1,s2])

    0    a
    1    b
    0    c
    1    d
    dtype: object

Setting the ignore_index option to True discards the existing index and builds a fresh one for the result:

    pd.concat([s1,s2],ignore_index=True)

    0    a
    1    b
    2    c
    3    d
    dtype: object

Use the keys option to add a hierarchical index as the outermost level of the data:

    pd.concat([s1,s2],keys=['s1','s2'])

    s1  0    a
        1    b
    s2  0    c
        1    d
    dtype: object

    pd.concat([s1,s2],keys=['s1','s2'],ignore_index=True)   # ignore_index=True discards the keys

    0    a
    1    b
    2    c
    3    d
    dtype: object

Merging

    left = pd.DataFrame({'key':['foo','bar'],'lval':[1,2]})
    right = pd.DataFrame({'key':['foo','bar'],'rval':[3,4]})
    left

       key  lval
    0  foo     1
    1  bar     2

    right

       key  rval
    0  foo     3
    1  bar     4

    pd.merge(left,right)   # joins on the shared column 'key'

       key  lval  rval
    0  foo     1     3
    1  bar     2     4
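In the example above every key appears on both sides, so the join type does not matter. The sketch below changes one key on the right (the value 'baz' is made up for this illustration) to show how the how parameter of pd.merge decides which rows survive:

```python
import pandas as pd

left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'baz'], 'rval': [3, 4]})

inner = pd.merge(left, right, on='key')                # default how='inner': shared keys only
outer = pd.merge(left, right, on='key', how='outer')   # keys from both sides, gaps become NaN

print(inner)
print(outer)
```

The row order of the outer result can differ between pandas versions, so it is safer not to rely on it.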

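Appending

Append a row to a DataFrame:

    df = pd.DataFrame(np.random.rand(4,4),columns=['A','B','C','D'])
    df

              A         B         C         D
    0  0.304266  0.558159  0.699805  0.964887
    1  0.083297  0.228968  0.825672  0.483591
    2  0.497066  0.203718  0.894997  0.830234
    3  0.001110  0.323248  0.066382  0.074556

    s = df.iloc[3]
    s

    A    0.001110
    B    0.323248
    C    0.066382
    D    0.074556
    Name: 3, dtype: float64

    df.append(s,ignore_index=True)   # pandas versions before 2.0 only

              A         B         C         D
    0  0.304266  0.558159  0.699805  0.964887
    1  0.083297  0.228968  0.825672  0.483591
    2  0.497066  0.203718  0.894997  0.830234
    3  0.001110  0.323248  0.066382  0.074556
    4  0.001110  0.323248  0.066382  0.074556

This is a bit like appending an element to a list, except that a new DataFrame is returned and df itself is unchanged. Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; on current versions use pd.concat.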
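Since DataFrame.append is no longer available in pandas 2.0 and later, the same row append can be written with pd.concat. This is one possible equivalent, not the original tutorial's code; the name `appended` is made up here:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(4, 4), columns=['A', 'B', 'C', 'D'])
s = df.iloc[3]

# s.to_frame().T turns the Series into a one-row DataFrame with columns A..D,
# so concat stacks it below df; ignore_index=True renumbers the rows 0..4.
appended = pd.concat([df, s.to_frame().T], ignore_index=True)

print(appended)
```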

Grouping

    df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                              'foo', 'bar', 'foo', 'foo'],
                       'B' : ['one', 'one', 'two', 'three',
                              'two', 'two', 'one', 'three'],
                       'C' : np.random.randn(8),
                       'D' : np.random.randn(8)})
    df

         A      B         C         D
    0  foo    one  0.015713 -0.276890
    1  bar    one  0.563566  0.089973
    2  foo    two -1.203656  2.242553
    3  bar  three -0.254199 -1.358523
    4  foo    two -0.625421  0.252078
    5  bar    two  0.461810 -2.049906
    6  foo    one -1.272169  0.447615
    7  foo  three -0.100721  0.131472

    df.groupby('A')[['C','D']].sum()

                C         D
    A
    bar  0.771176 -3.318457
    foo -3.186254  2.796829

Columns C and D are selected explicitly because pandas 2.0 and later no longer drop the non-numeric column B automatically when summing.

    df.groupby(['A','B']).sum()

                      C         D
    A   B
    bar one    0.563566  0.089973
        three -0.254199 -1.358523
        two    0.461810 -2.049906
    foo one   -1.256456  0.170726
        three -0.100721  0.131472
        two   -1.829077  2.494631
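sum() is only one of the aggregations a group can be reduced with. As a hedged sketch, agg accepts a dict that maps each column to its own aggregation; the small frame `toy` and its values are made up so the results are easy to check by hand:

```python
import pandas as pd

toy = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                    'C': [1.0, 2.0, 3.0, 4.0],
                    'D': [10.0, 20.0, 30.0, 40.0]})

# Sum column C and average column D within each group of A
summary = toy.groupby('A').agg({'C': 'sum', 'D': 'mean'})

print(summary)
```

For group bar this gives C = 2 + 4 = 6 and D = (20 + 40) / 2 = 30; for group foo, C = 4 and D = 20.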

Pivot tables

    df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
                       'B' : ['A', 'B', 'C'] * 4,
                       'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                       'D' : np.random.randn(12),
                       'E' : np.random.randn(12)})
    df

            A  B    C         D         E
    0     one  A  foo  0.167128 -0.501473
    1     one  B  foo -1.218322 -0.875397
    2     two  C  foo -1.327522 -1.608971
    3   three  A  bar -0.917783 -0.537453
    4     one  B  bar -0.803415 -0.129088
    5     one  C  bar -0.686571  0.123554
    6     two  A  foo  0.051545  1.138850
    7   three  B  foo  0.138666  0.396274
    8     one  C  foo  0.840112  0.820482
    9     one  A  bar  0.452267 -1.411540
    10    two  B  bar -1.000297  1.037715
    11  three  C  bar  2.481947 -1.184744

    pd.pivot_table(df,values='D',index=['A','B'],columns=['C'])

    C             bar       foo
    A     B
    one   A  0.452267  0.167128
          B -0.803415 -1.218322
          C -0.686571  0.840112
    three A -0.917783       NaN
          B       NaN  0.138666
          C  2.481947       NaN
    two   A       NaN  0.051545
          B -1.000297       NaN
          C       NaN -1.327522
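In the table above each (A, B, C) combination occurs at most once, so no aggregation is visible. When several rows land in the same cell, pivot_table combines them, using the mean by default. A small sketch with made-up data (the frame `toy` is hypothetical) shows this and the aggfunc parameter:

```python
import pandas as pd

toy = pd.DataFrame({'A': ['one', 'one', 'two'],
                    'C': ['foo', 'foo', 'bar'],
                    'D': [1.0, 3.0, 5.0]})

# Two rows fall into the cell (one, foo); the default aggregation is the mean
mean_table = pd.pivot_table(toy, values='D', index='A', columns='C')

# aggfunc='sum' adds them up instead
sum_table = pd.pivot_table(toy, values='D', index='A', columns='C', aggfunc='sum')

print(mean_table)
print(sum_table)
```

Cells with no matching rows, such as (one, bar), come out as NaN, just like in the larger table above.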

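Reading and writing data

    pd.read_csv('foo.csv')                         # read data from a CSV file ('foo.csv' is a placeholder path)
    df.to_excel('foo.xlsx', sheet_name='Sheet1')   # write data to an Excel file

Note that read_csv is a function of the pandas module (pd.read_csv), not a DataFrame method, and that to_excel writes a file rather than reading one.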
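A round trip makes the read and write directions easy to tell apart. The sketch below writes to an in-memory buffer instead of a real file so that it runs anywhere; the frame `original` and its contents are made up, and a file path such as 'foo.csv' could be used in place of the buffer:

```python
import io
import pandas as pd

original = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y']})

buffer = io.StringIO()
original.to_csv(buffer, index=False)   # DataFrame method: writes CSV text
buffer.seek(0)                         # rewind before reading back
restored = pd.read_csv(buffer)         # top-level pandas function: reads CSV text

print(restored)
```

index=False keeps the row labels out of the CSV, so reading the data back does not create an extra unnamed column.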