Data Extraction (2): Getting Started with the pandas Library

xiaoxiao2021-02-28 146

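The pandas library: http://pandas.pydata.org

pandas is a third-party Python library that provides high-performance, easy-to-use data types and analysis tools. It is conventionally imported as:

    import pandas as pd

Data types: Series, DataFrame.
Operations built on these types: basic operations, arithmetic operations, statistical (feature) operations, and relational operations.

NumPy: the base data type is ndarray. It focuses on the structural representation of data: dimensions and the relationships among the data.
pandas: the extended data types are Series and DataFrame. It focuses on the application-level representation of data: the relationship between data and its index.

1. The Series type

A Series is a one-dimensional array with "labels".

Creation. A Series can be built from:
    a Python list
    a scalar value
    a Python dict
    an ndarray
    other functions, such as range()

    import numpy as np
    import pandas as pd

    # Scalar: pass index= explicitly; it decides the labels and how many times the value is repeated
    s = pd.Series(25, index=['a','b','c'])
    s
    Out[295]:
    a    25
    b    25
    c    25
    dtype: int64

    # Dict 1: the keys become the index
    s = pd.Series({'a':8, 'b':7})
    s
    Out[297]:
    a    8
    b    7
    dtype: int64

    # Dict 2: only the labels listed in index= are kept; labels with no dict entry become NaN
    s2 = pd.Series({'c':8, 'd':7}, index=['a','b','c','d'])
    s2
    Out[299]:
    a    NaN
    b    NaN
    c    8.0
    d    7.0
    dtype: float64

    # ndarray (the integer dtype shown depends on the platform and NumPy version)
    n1 = pd.Series(np.arange(5))
    n1
    Out[303]:
    0    0
    1    1
    2    2
    3    3
    4    4
    dtype: int32

    n2 = pd.Series(np.arange(5), index=np.arange(9,4,-1))
    n2
    Out[309]:
    9    0
    8    1
    7    2
    6    3
    5    4
    dtype: int32

Operations:
    .index                returns an Index object
    .values               returns an ndarray
    b[n]                  returns a single value
    b[m:n]                slicing returns a Series
    b[b > b.median()]     boolean filtering returns a Series
    np.exp(b)             NumPy functions return a Series
    b.get('f', 100)       returns the value at label 'f', or 100 if that label does not exist
    'c' in b              tests the index labels only, not the values; once a custom index is set, the automatic positions 0, 1, ... are not matched

Alignment: Series + Series automatically aligns the data on the index labels. A label present in only one operand yields NaN in the result.

Names: a Series object and its index can each have a name, stored in the .name attribute. Neither is set by default.

Modification: a Series can be modified at any time, and the change takes effect immediately.

2. The DataFrame type

A DataFrame is a table-like data type in which each column may hold a different value type. It has both a row index and a column index. It is mostly used for two-dimensional data, but can also express multi-dimensional data.

Creation:
    1) From an ndarray:
       d = pd.DataFrame(np.arange(10).reshape(2,5))
    2) From a dict:
       dt = {'one': pd.Series([1,2,3], index=['a','b','c']),
             'two': pd.Series([9,8,7,6], index=['a','b','c','d'])}
       d = pd.DataFrame(dt)
       pd.DataFrame(dt, index=['b','c','d'], columns=['two', 'three'])   # 'three' is not in dt, so that column is all NaN

Access:
    d['columnname']                  selects a column
    d.loc['indexname']               selects a row (the original notes used d.ix, which was removed in pandas 1.0)
    d['columnname']['indexname']     selects a single value

Operations:
    .reindex(index=None, columns=None, ...)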
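Reorder the rows with .reindex(index=['', '']); reorder the columns with .reindex(columns=['', '', '', '']). The remaining parameters are not covered here; see the official documentation.

Both index and columns are Index objects (immutable). Common methods:
    .append(idx)          concatenates another Index object, producing a new Index
    .difference(idx)      computes the set difference, producing a new Index (written as .diff(idx) in older pandas)
    .intersection(idx)    computes the intersection
    .union(idx)           computes the union
    .delete(loc)          removes the element at position loc
    .insert(loc, e)       inserts element e at position loc

    # d is assumed to have at least three columns and five rows
    nc = d.columns.delete(2)
    ni = d.index.insert(5, 'c0')
    # method='ffill' requires the existing index of d to be monotonic
    nd = d.reindex(index=ni, columns=nc, method='ffill')

Arithmetic: operands are aligned on their row and column labels, non-overlapping positions are filled with NaN, and the result is floating point by default.
    .add(d, **kwargs)     +
    .sub(d, **kwargs)     -
    .mul(d, **kwargs)     *
    .div(d, **kwargs)     /
Operations between objects of different dimensions use broadcasting. Between a two-dimensional and a one-dimensional object, the one-dimensional operand is matched along axis 1 by default (against the column labels, and applied to every row); pass axis=0 to the method form to match against the row labels instead.

Comparison: operands of the same dimension must have the same size; operands of different dimensions are broadcast.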
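The reindexing, alignment, and broadcasting rules above can be sketched as follows. This is a minimal illustration, not the author's own example: the frames a, b, and d, the Series row and col, and the label 'c6' are all made up for the demonstration, and fill_value is an optional parameter of the method form that the notes do not mention.

```python
import numpy as np
import pandas as pd

# Hypothetical frames with different shapes (default integer labels)
a = pd.DataFrame(np.arange(12).reshape(3, 4))
b = pd.DataFrame(np.arange(20).reshape(4, 5))

# Label alignment: the result spans the union of labels (4 rows x 5 columns).
# Cells missing from either operand become NaN, so the result is float.
total = a + b

# The method form accepts fill_value, which stands in for a missing operand.
filled = a.add(b, fill_value=0)

# 2-D with 1-D: the Series is matched against the column labels by default,
# so it is subtracted from every row of a.
row = pd.Series(np.arange(4))
by_columns = a - row

# axis=0 matches the Series against the row labels instead,
# so it is subtracted from every column of a.
col = pd.Series(np.arange(3))
by_rows = a.sub(col, axis=0)

# Reindexing with Index methods, on a hypothetical 5 x 4 frame.
d = pd.DataFrame(np.arange(20).reshape(5, 4),
                 index=['c1', 'c2', 'c3', 'c4', 'c5'],
                 columns=['w', 'x', 'y', 'z'])
nc = d.columns.delete(2)        # new Index without 'y'
ni = d.index.insert(5, 'c6')    # new Index with 'c6' appended
# The new row 'c6' is forward-filled from 'c5'; this works because
# d.index and d.columns are both monotonic.
nd = d.reindex(index=ni, columns=nc, method='ffill')
```

Since Index objects are immutable, delete and insert return new Index objects and leave d itself unchanged; only the reindex call produces a frame with the new shape.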

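1. Sorting data

    .sort_index(axis=0, ascending=True)            sorts along the given axis by index labels; ascending by default
    .sort_values(by, axis=0, ascending=True)       DataFrame: sorts along the given axis by values; ascending by default
    Series.sort_values(axis=0, ascending=True)     Series: sorts by values
by names the column (or row) to sort on. NaN values are always placed at the end.

2. Basic statistical functions

    .sum(), .count(), .mean(), .median()
    .var() (variance), .std(), .min(), .max()
    .argmin(), .argmax()     Series only: integer position of the minimum / maximum
    .idxmin(), .idxmax()     index label of the minimum / maximum (also available on DataFrame, where they work column by column)
    .describe()              summary statistics for each column along axis 0

3. Cumulative statistical functions

    .cumsum(), .cumprod(), .cummax(), .cummin()    computed column by column
Rolling (window) functions operate on w adjacent elements:
    .rolling(w).sum(), .rolling(w).mean(), .rolling(w).var(), .rolling(w).std(), .rolling(w).min(), .rolling(w).max()

4. Correlation analysis

Measures of correlation:
    1) Covariance, cov(X, Y)
    2) Pearson correlation coefficient r, in the range [-1, 1]. For |r|: 0.8-1.0 very strong correlation; 0.6-0.8 strong; 0.4-0.6 moderate; 0.2-0.4 weak; 0.0-0.2 very weak or no correlation.

    .cov()     computes the covariance matrix
    .corr()    computes the correlation matrix; supports the Pearson, Spearman, and Kendall coefficients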
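A short sketch of the sorting, statistics, window, and correlation calls listed above. The frame df and its values are invented for illustration; y is deliberately set to exactly 2 * x on the rows where x is present, so the Pearson coefficient is easy to check by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical data: five rows, two columns, one missing value in x
df = pd.DataFrame(
    {'x': [3.0, 1.0, np.nan, 4.0, 2.0],
     'y': [6.0, 2.0, 5.0, 8.0, 4.0]},
    index=['e', 'a', 'd', 'b', 'c'],
)

# Sorting: by index labels, then by the values of column x (descending).
# The row whose x is NaN stays at the end in both directions.
by_label = df.sort_index()
by_value = df.sort_values(by='x', ascending=False)

# Position versus label of the maximum of a Series
s = df['y']
pos = s.argmax()      # integer position
label = s.idxmax()    # index label

# Summary statistics per column; count ignores NaN
summary = df.describe()

# Cumulative sum and a rolling sum over windows of 2 adjacent elements.
# The first rolling entry is NaN because its window is incomplete.
running = s.cumsum()
window = s.rolling(2).sum()

# Covariance and correlation matrices (Pearson is the default method);
# rows with NaN are excluded pair by pair.
cov_matrix = df.cov()
corr_matrix = df.corr()
r = df['x'].corr(df['y'])
```

Here r falls in the "very strong" band of the scale above, as expected for data on a straight line; real data will rarely be this clean.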

When reposting, please credit the original URL: https://www.6miu.com/read-27320.html
