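I have recently been working on Tencent's advertising competition and used Pandas as my data-processing tool. I am recording the methods I used here to share them.
Reference: http://jingyan.baidu.com/season/43456?pn=0
R language tutorial: https://edu.aliyun.com/course/27/lesson/list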
The index_col parameter of pd.read_csv specifies which column to use as the index.
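To see what index_col changes, here is a minimal sketch that reads the same CSV text twice, once without and once with index_col. The sample rows and the appID column name are made up for illustration; only userID comes from the file used in this post.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for user_installedapps.csv.
# The appID column name and all values are assumptions.
csv_text = "userID,appID\n1,357\n1,360\n2,357\n"

# Without index_col, pandas builds a default integer index
# and userID stays an ordinary column.
df_default = pd.read_csv(io.StringIO(csv_text))

# With index_col, userID becomes the row index and is no longer a column.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col='userID')

print(df_default.columns.tolist())
print(df_indexed.index.name)
print(df_indexed.columns.tolist())
```

After the second call, userID can only be reached through the index (for example with groupby(level='userID')), not through df_indexed['userID'].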
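The following few lines of code do the equivalent of a database group by followed by a count. The SQL statement with the same function is:

    select userID, count(*) from user_installedapps group by userID;

Note: the code below groups with groupby(level=...), so the grouping column must be the index column. To group by an ordinary column instead, pass its name, as in df.groupby('userID').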
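
    import pandas as pd

    # Read the CSV with userID as the index so it can be grouped by level
    df = pd.read_csv('./original_data/user_installedapps.csv', index_col='userID')
    grouped_userid = df.groupby(level='userID')
    print(grouped_userid.count())
    # To compute totals instead of counts, change this to sum()
    # print(grouped_userid.sum())

Below is the script I wrote; adapt it as needed.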

    import pandas as pd

    conf = {
        # The path of the input csv file
        'input_file_path': './original_data/user_installedapps.csv',
        # The index column(s) of the DataFrame, required because groupby(level=...) is used below
        'index_col': ['userID'],
        # The column(s) to group by (not read by the script, which groups on index_col)
        'group_by_col': ['userID'],
        # The path of the output csv file
        'output_file_path': r'./processed_data/user_installedapps_count.csv'
    }

    # <1> File Import and Preparation *****************************
    print('Importing File')
    df = pd.read_csv(conf['input_file_path'], index_col=conf['index_col'])
    grouped_df = df.groupby(level=conf['index_col'])
    print('Importing File Finished')

    # <2> Data Processing Period *********************************
    # The function could be changed as required
    print('Data Processing')
    grouped_df_count = grouped_df.count()
    # grouped_df_sum = grouped_df.sum()
    print('Data Processing Finished')

    # <3> File Export ********************************************
    print('Exporting File')
    grouped_df_count.to_csv(conf['output_file_path'], encoding='gbk')
    # grouped_df_sum.to_csv(conf['output_file_path'], encoding='gbk')
    print('Exporting File Finished')
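
As a small check of how the grouping column relates to the index, the sketch below runs the same count on an in-memory DataFrame in two ways: grouping on an ordinary column, and grouping on an index level. The rows and the appID column are invented for illustration, and one value is deliberately missing to show how count() and size() differ.

```python
import pandas as pd

# Hypothetical sample rows; the appID column and all values are assumptions.
df = pd.DataFrame({
    'userID': [1, 1, 2, 3, 3, 3],
    'appID': [357, 360, 357, 10, None, 12],
})

# Group on an ordinary column: userID does not have to be the index.
count_by_column = df.groupby('userID').count()

# Group on an index level: here userID must be set as the index first.
count_by_level = df.set_index('userID').groupby(level='userID').count()

# size() counts rows per group, including rows with missing values.
rows_per_user = df.groupby('userID').size()

print(count_by_column)
print(count_by_level)
print(rows_per_user)
```

Both groupings give the same counts. Note that count() counts non-missing values in each column, while size() counts rows per group, so size() is the closer match to SQL count(*) when the data contains missing values.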