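This is the "10 minutes to pandas" tutorial from the web, typed out by hand here.
PS: Ten minutes? No way. It took me ages to type all of this, and I could cry.
First, import the libraries: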
    import pandas as pd
    import numpy as np
    from matplotlib import pyplot as plt

Create a Series by passing a list of values and letting pandas create a default integer index:
    s = pd.Series([1, 2, 3, 4, 5, np.nan, 6])
    s

    0    1.0
    1    2.0
    2    3.0
    3    4.0
    4    5.0
    5    NaN
    6    6.0
    dtype: float64

Missing values are usually created with np.nan.
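To make the effect of np.nan more concrete, here is a minimal sketch (the variable name values is illustrative and not from the tutorial). It shows that a single NaN turns an otherwise integer list into a float64 Series, and how to count the missing entries.

```python
import numpy as np
import pandas as pd

# Same values as in the Series above; "values" is an illustrative name.
values = pd.Series([1, 2, 3, 4, 5, np.nan, 6])

print(values.dtype)         # float64, because NaN is a floating-point value
print(values.isna().sum())  # number of missing entries
print(values.count())       # count() only counts non-missing entries
```

This is why the integers in the output above are displayed as 1.0, 2.0 and so on.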
Create a DataFrame by passing a NumPy array, with a datetime index and labeled columns:
    datas = pd.date_range('20180501', periods=6)
    datas

    DatetimeIndex(['2018-05-01', '2018-05-02', '2018-05-03', '2018-05-04',
                   '2018-05-05', '2018-05-06'],
                  dtype='datetime64[ns]', freq='D')

    df = pd.DataFrame(np.random.rand(6, 4), index=datas, columns=list('ABCD'))
    df

                       A         B         C         D
    2018-05-01  0.225279  0.735710  0.009639  0.408149
    2018-05-02  0.705007  0.900239  0.551207  0.165471
    2018-05-03  0.608068  0.332345  0.519019  0.181947
    2018-05-04  0.921958  0.626003  0.945828  0.357211
    2018-05-05  0.304423  0.836494  0.731351  0.678947
    2018-05-06  0.342860  0.053211  0.670777  0.186546

np.random.randn: generates random numbers from the standard normal distribution.
np.random.rand: generates uniformly distributed random numbers in [0, 1).

A DataFrame can also be created from a dict:
    df2 = pd.DataFrame({'A': 1.,
                        'B': pd.Timestamp('20180101'),
                        'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                        'D': np.array([3] * 4, dtype='int32'),
                        'E': pd.Categorical(["test", "train", "test", "train"]),
                        'F': 'foo'})
    df2

         A          B    C  D      E    F
    0  1.0 2018-01-01  1.0  3   test  foo
    1  1.0 2018-01-01  1.0  3  train  foo
    2  1.0 2018-01-01  1.0  3   test  foo
    3  1.0 2018-01-01  1.0  3  train  foo

pd.Timestamp: a timestamp, the pandas counterpart of Python's datetime.
pd.Series: every column of a DataFrame is a Series.
np.array: a column can also be built from a NumPy array.
pd.Categorical: categorical values.

The resulting columns have different dtypes:
    df2.dtypes

    A           float64
    B    datetime64[ns]
    C           float32
    D             int32
    E          category
    F            object
    dtype: object

View the index, the column names, and the underlying values:
    df.index  # show the index

    DatetimeIndex(['2018-05-01', '2018-05-02', '2018-05-03', '2018-05-04',
                   '2018-05-05', '2018-05-06'],
                  dtype='datetime64[ns]', freq='D')

    df.columns  # show the column names

    Index(['A', 'B', 'C', 'D'], dtype='object')

    df.values  # show the values

    array([[0.22527927, 0.73570966, 0.00963855, 0.40814939],
           [0.70500688, 0.90023873, 0.55120699, 0.16547071],
           [0.60806789, 0.33234503, 0.51901872, 0.18194742],
           [0.92195751, 0.62600349, 0.94582788, 0.35721069],
           [0.304423  , 0.83649383, 0.73135067, 0.67894676],
           [0.34286019, 0.05321105, 0.67077744, 0.18654591]])

describe() shows a quick statistical summary:
    df.describe()

                  A         B         C         D
    count  6.000000  6.000000  6.000000  6.000000
    mean   0.517932  0.580667  0.571303  0.329712
    std    0.271382  0.326662  0.314447  0.199088
    min    0.225279  0.053211  0.009639  0.165471
    25%    0.314032  0.405760  0.527066  0.183097
    50%    0.475464  0.680857  0.610992  0.271878
    75%    0.680772  0.811298  0.716207  0.395415
    max    0.921958  0.900239  0.945828  0.678947

count: number of non-missing values
mean: average
std: standard deviation
min: minimum
25%: first quartile (25th percentile)
50%: median
75%: third quartile (75th percentile)
max: maximum

Transpose the data:
    df.T

       2018-05-01 00:00:00  2018-05-02 00:00:00  2018-05-03 00:00:00  2018-05-04 00:00:00  2018-05-05 00:00:00  2018-05-06 00:00:00
    A             0.225279             0.705007             0.608068             0.921958             0.304423             0.342860
    B             0.735710             0.900239             0.332345             0.626003             0.836494             0.053211
    C             0.009639             0.551207             0.519019             0.945828             0.731351             0.670777
    D             0.408149             0.165471             0.181947             0.357211             0.678947             0.186546

Sort by an axis:
    df.sort_index(axis=1, ascending=False)

                       D         C         B         A
    2018-05-01  0.408149  0.009639  0.735710  0.225279
    2018-05-02  0.165471  0.551207  0.900239  0.705007
    2018-05-03  0.181947  0.519019  0.332345  0.608068
    2018-05-04  0.357211  0.945828  0.626003  0.921958
    2018-05-05  0.678947  0.731351  0.836494  0.304423
    2018-05-06  0.186546  0.670777  0.053211  0.342860

axis: which labels to sort by; 0 sorts by the row index, 1 sorts by the column names.
ascending: a boolean choosing ascending or descending order.

Sort by values:
    df.sort_values('B')

                       A         B         C         D
    2018-05-06  0.342860  0.053211  0.670777  0.186546
    2018-05-03  0.608068  0.332345  0.519019  0.181947
    2018-05-04  0.921958  0.626003  0.945828  0.357211
    2018-05-01  0.225279  0.735710  0.009639  0.408149
    2018-05-05  0.304423  0.836494  0.731351  0.678947
    2018-05-02  0.705007  0.900239  0.551207  0.165471

Selecting a single column yields a Series; df['A'] is equivalent to df.A:
    df['A']

    2018-05-01    0.225279
    2018-05-02    0.705007
    2018-05-03    0.608068
    2018-05-04    0.921958
    2018-05-05    0.304423
    2018-05-06    0.342860
    Freq: D, Name: A, dtype: float64

    df.A

    2018-05-01    0.225279
    2018-05-02    0.705007
    2018-05-03    0.608068
    2018-05-04    0.921958
    2018-05-05    0.304423
    2018-05-06    0.342860
    Freq: D, Name: A, dtype: float64

    df[:3]

                       A         B         C         D
    2018-05-01  0.225279  0.735710  0.009639  0.408149
    2018-05-02  0.705007  0.900239  0.551207  0.165471
    2018-05-03  0.608068  0.332345  0.519019  0.181947

Use a label to get a cross-section of the data:
    df.loc[datas[0]]

    A    0.225279
    B    0.735710
    C    0.009639
    D    0.408149
    Name: 2018-05-01 00:00:00, dtype: float64

    df.loc[datas[2]]

    A    0.608068
    B    0.332345
    C    0.519019
    D    0.181947
    Name: 2018-05-03 00:00:00, dtype: float64

Select on multiple axes by label:
    df.loc[:, ['A', 'B']]

                       A         B
    2018-05-01  0.225279  0.735710
    2018-05-02  0.705007  0.900239
    2018-05-03  0.608068  0.332345
    2018-05-04  0.921958  0.626003
    2018-05-05  0.304423  0.836494
    2018-05-06  0.342860  0.053211

With label slicing, both endpoints are included:
    df.loc['20180502':'20180504', ['A', 'C']]

                       A         C
    2018-05-02  0.705007  0.551207
    2018-05-03  0.608068  0.519019
    2018-05-04  0.921958  0.945828

    df.loc['20180505', ['A', 'C']]

    A    0.304423
    C    0.731351
    Name: 2018-05-05 00:00:00, dtype: float64

Get a single value:
    df.loc[datas[0], 'A']

    0.22527926638468565

You can also use at:

    df.at[datas[0], 'A']

    0.22527926638468565

Select by integer position:
    df.iloc[3]

    A    0.921958
    B    0.626003
    C    0.945828
    D    0.357211
    Name: 2018-05-04 00:00:00, dtype: float64

Slice by position:
    df.iloc[0:6, 0:3]

                       A         B         C
    2018-05-01  0.225279  0.735710  0.009639
    2018-05-02  0.705007  0.900239  0.551207
    2018-05-03  0.608068  0.332345  0.519019
    2018-05-04  0.921958  0.626003  0.945828
    2018-05-05  0.304423  0.836494  0.731351
    2018-05-06  0.342860  0.053211  0.670777

Select with lists of integer positions:
    df.iloc[[0, 2, 4], [1, 3]]

                       B         D
    2018-05-01  0.735710  0.408149
    2018-05-03  0.332345  0.181947
    2018-05-05  0.836494  0.678947

    df.iloc[:, 1:3]

                       B         C
    2018-05-01  0.735710  0.009639
    2018-05-02  0.900239  0.551207
    2018-05-03  0.332345  0.519019
    2018-05-04  0.626003  0.945828
    2018-05-05  0.836494  0.731351
    2018-05-06  0.053211  0.670777

Get a single value:
    df.iloc[1, 1]

    0.9002387294615217

Filter with isin():
    df2 = df.copy()
    df2['E'] = ['one', 'two', 'three', 'four', 'five', 'six']
    df2

                       A         B         C         D      E
    2018-05-01  0.225279  0.735710  0.009639  0.408149    one
    2018-05-02  0.705007  0.900239  0.551207  0.165471    two
    2018-05-03  0.608068  0.332345  0.519019  0.181947  three
    2018-05-04  0.921958  0.626003  0.945828  0.357211   four
    2018-05-05  0.304423  0.836494  0.731351  0.678947   five
    2018-05-06  0.342860  0.053211  0.670777  0.186546    six

    df2['E'].isin(['one', 'five'])

    2018-05-01     True
    2018-05-02    False
    2018-05-03    False
    2018-05-04    False
    2018-05-05     True
    2018-05-06    False
    Freq: D, Name: E, dtype: bool

    df2[df2['E'].isin(['one', 'five'])]

                       A         B         C         D     E
    2018-05-01  0.225279  0.735710  0.009639  0.408149   one
    2018-05-05  0.304423  0.836494  0.731351  0.678947  five

Set a value by label:
    df.loc[datas[0], 'A'] = 0

Set a value by position:

    df.iloc[0, 1] = 0

Take a look at the result of these operations:
    df

                       A         B         C         D  F
    2018-05-01  0.000000  0.000000  0.009639  0.408149  1
    2018-05-02  0.705007  0.900239  0.551207  0.165471  2
    2018-05-03  0.608068  0.332345  0.519019  0.181947  3
    2018-05-04  0.921958  0.626003  0.945828  0.357211  4
    2018-05-05  0.304423  0.836494  0.731351  0.678947  5
    2018-05-06  0.342860  0.053211  0.670777  0.186546  6

pandas primarily uses np.nan to represent missing data. By default, missing values are excluded from computations.
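As a small illustration of missing values being excluded by default, here is a sketch with a made-up Series (the names readings, mean_default, mean_strict, and total are illustrative, not from the tutorial). The skipna parameter of the reduction methods controls this behaviour.

```python
import numpy as np
import pandas as pd

# A made-up Series with one missing value.
readings = pd.Series([1.0, np.nan, 3.0])

mean_default = readings.mean()             # NaN is skipped: (1 + 3) / 2
mean_strict = readings.mean(skipna=False)  # NaN propagates into the result
total = readings.sum()                     # NaN is skipped here as well

print(mean_default, mean_strict, total)
```

Passing skipna=False is useful when a missing value should make the whole result missing rather than being silently ignored.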
Reindexing lets you change, add, or delete the index on a specified axis. It returns a copy of the data.
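Before the tutorial's own example, here is a minimal sketch on a tiny made-up frame showing both points: labels that did not exist before are filled with NaN, and the original object is left untouched. The names frame and reindexed and the labels x, y, z, w are illustrative assumptions.

```python
import pandas as pd

# A tiny made-up frame with three labeled rows.
frame = pd.DataFrame({"A": [1, 2, 3]}, index=["x", "y", "z"])

# Drop row "z", add a new row "w" and a new column "B".
reindexed = frame.reindex(index=["x", "y", "w"], columns=["A", "B"])

# Modifying the reindexed copy does not change the original frame.
reindexed.loc["x", "A"] = 100

print(reindexed)
print(frame)
```

Row w and column B come back as NaN because there was no data for them in the original frame.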
    df1 = df.reindex(index=datas[0:4], columns=list(df.columns) + ['E'])
    df1.loc[datas[0]:datas[1], 'E'] = 1
    df1

                       A         B         C         D  F    E
    2018-05-01  0.000000  0.000000  0.009639  0.408149  1  1.0
    2018-05-02  0.705007  0.900239  0.551207  0.165471  2  1.0
    2018-05-03  0.608068  0.332345  0.519019  0.181947  3  NaN
    2018-05-04  0.921958  0.626003  0.945828  0.357211  4  NaN

Drop any rows that have missing data:
    df1.dropna()

                       A         B         C         D  F    E
    2018-05-01  0.000000  0.000000  0.009639  0.408149  1  1.0
    2018-05-02  0.705007  0.900239  0.551207  0.165471  2  1.0

Fill in missing data:
    df1.fillna(value=5)

                       A         B         C         D  F    E
    2018-05-01  0.000000  0.000000  0.009639  0.408149  1  1.0
    2018-05-02  0.705007  0.900239  0.551207  0.165471  2  1.0
    2018-05-03  0.608068  0.332345  0.519019  0.181947  3  5.0
    2018-05-04  0.921958  0.626003  0.945828  0.357211  4  5.0

Check which values are missing; this returns booleans:
    df1.isna()

                    A      B      C      D      F      E
    2018-05-01  False  False  False  False  False  False
    2018-05-02  False  False  False  False  False  False
    2018-05-03  False  False  False  False  False   True
    2018-05-04  False  False  False  False  False   True

The check is applied to every element of the DataFrame.
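Because isna() works element by element, its boolean result can be aggregated. The sketch below uses a small made-up frame (the names frame, missing_per_column, rows_with_data, and filled are illustrative) to count missing values per column, drop only fully empty rows, and fill each column with its own value.

```python
import numpy as np
import pandas as pd

# A made-up frame: row 1 is entirely missing, row 0 is partly missing.
frame = pd.DataFrame({"A": [1.0, np.nan, 3.0],
                      "B": [np.nan, np.nan, 6.0]})

# True counts as 1, so summing the mask counts missing values per column.
missing_per_column = frame.isna().sum()

# how="all" drops a row only when every value in it is missing.
rows_with_data = frame.dropna(how="all")

# A dict passed to fillna gives a separate fill value for each column.
filled = frame.fillna({"A": 0.0, "B": -1.0})

print(missing_per_column)
print(rows_with_data)
print(filled)
```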
Setting the ignore_index option to True discards the existing index and resets it in the result.
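The concat calls that follow use two Series named s1 and s2 whose definitions do not appear in this section. Judging from the printed output they hold the strings a, b and c, d, so the sketch below assumes exactly that. It also shows what the index looks like when ignore_index is left at its default.

```python
import pandas as pd

# Assumed definitions, inferred from the printed output; not in the original text.
s1 = pd.Series(["a", "b"])
s2 = pd.Series(["c", "d"])

# Default behaviour: each input keeps its own labels, so 0 and 1 appear twice.
stacked = pd.concat([s1, s2])
print(list(stacked.index))

# ignore_index=True throws those labels away and renumbers from 0.
renumbered = pd.concat([s1, s2], ignore_index=True)
print(list(renumbered.index))
```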
    pd.concat([s1, s2], ignore_index=True)

    0    a
    1    b
    2    c
    3    d
    dtype: object

Use the keys option to add a hierarchical index at the outermost level of the data:
    pd.concat([s1, s2], keys=['s1', 's2'])

    s1  0    a
        1    b
    s2  0    c
        1    d
    dtype: object

    pd.concat([s1, s2], keys=['s1', 's2'], ignore_index=True)

    0    a
    1    b
    2    c
    3    d
    dtype: object

Append rows to a DataFrame:
    df = pd.DataFrame(np.random.rand(4, 4), columns=['A', 'B', 'C', 'D'])
    df

              A         B         C         D
    0  0.304266  0.558159  0.699805  0.964887
    1  0.083297  0.228968  0.825672  0.483591
    2  0.497066  0.203718  0.894997  0.830234
    3  0.001110  0.323248  0.066382  0.074556

    s = df.iloc[3]
    s

    A    0.001110
    B    0.323248
    C    0.066382
    D    0.074556
    Name: 3, dtype: float64

    df.append(s, ignore_index=True)

              A         B         C         D
    0  0.304266  0.558159  0.699805  0.964887
    1  0.083297  0.228968  0.825672  0.483591
    2  0.497066  0.203718  0.894997  0.830234
    3  0.001110  0.323248  0.066382  0.074556
    4  0.001110  0.323248  0.066382  0.074556

This is a bit like appending an element to a list. Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so this call only works on older versions.
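On pandas 2.0 and later, where DataFrame.append is no longer available, the same row append can be written with pd.concat. The following is a minimal sketch rather than the tutorial's own code; the names rng, frame, last_row, and extended are illustrative, and the seed is only there to make the run repeatable.

```python
import numpy as np
import pandas as pd

# Illustrative data; a seeded generator keeps the example repeatable.
rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.random((4, 4)), columns=list("ABCD"))

# Take the last row as a Series, as in the example above.
last_row = frame.iloc[3]

# Turn the Series into a one-row DataFrame, then concatenate.
# ignore_index=True renumbers the rows 0..4.
extended = pd.concat([frame, last_row.to_frame().T], ignore_index=True)

print(extended)
```

The to_frame().T step matters: concatenating the bare Series along the rows would treat its labels A to D as row labels instead of column names.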