Crawling tweets with the official Twitter API and the tweepy library

xiaoxiao  2021-02-28  28

tweepy official documentation

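Approach:
1. Work user by user: crawl all of a user's tweets.
2. Use the user's id to look up the Twitter ids of that user's friends, and add them to the list of users waiting to be crawled.
3. Repeat steps 1 and 2.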
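The three steps amount to a breadth-first walk over the friend graph. A minimal sketch of that loop, assuming two hypothetical callables, `fetch_tweets` and `fetch_friends`, stand in for the real API calls (they are not tweepy functions), and a hypothetical `max_users` cap so the walk terminates:

```python
from collections import deque


def crawl(seed_ids, fetch_tweets, fetch_friends, max_users=100):
    """Crawl tweets user by user, growing the queue with each user's friends."""
    queue = deque(seed_ids)
    seen = set(seed_ids)          # every id that was ever queued
    results = {}
    while queue and len(results) < max_users:   # step 3: repeat
        user_id = queue.popleft()
        results[user_id] = fetch_tweets(user_id)        # step 1
        for friend_id in fetch_friends(user_id):        # step 2
            if friend_id not in seen:
                seen.add(friend_id)
                queue.append(friend_id)
    return results


# Toy friend graph with made-up ids, standing in for Twitter.
FRIENDS = {1: [2, 3], 2: [1, 3], 3: []}
result = crawl([1], lambda u: ["tweet from %d" % u], lambda u: FRIENDS.get(u, []))
print(sorted(result))  # [1, 2, 3]
```

Tracking `seen` separately from the queue keeps a user from being queued twice when several people follow the same account.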

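Notes:
1. Crawling Twitter data requires getting past the firewall; ss (shadowsocks) is recommended. The code needs an HTTP/HTTPS proxy. With a SOCKS proxy the browser can get through, but the code fails with:

tweepy.error.TweepError: Failed to send request: HTTPSConnectionPool(host='api.twitter.com', port=443): Max retries exceeded with url: ....

This means the HTTPS connection failed. To route terminal traffic through the proxy, see "Mac命令行终端下使用shadowsocks翻墙" (using shadowsocks from the Mac command-line terminal). Then pass the proxy to tweepy.API, using the port your proxy listens on:

api = tweepy.API(auth, proxy="127.0.0.1:1080")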
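Because the SOCKS/HTTP mix-up only shows up as a connection error at request time, it can help to validate the proxy setting up front. The sketch below is an assumption-laden helper (`http_proxy_address` is a hypothetical name, not part of tweepy); it only inspects the string and cannot tell which protocol the port actually speaks:

```python
from urllib.parse import urlsplit


def http_proxy_address(proxy):
    """Return (host, port) for an HTTP(S) proxy string, rejecting SOCKS URLs."""
    # Accept both "127.0.0.1:1080" and "http://127.0.0.1:1080".
    url = proxy if "://" in proxy else "http://" + proxy
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        raise ValueError("expected an HTTP(S) proxy, got scheme %r" % parts.scheme)
    if parts.hostname is None or parts.port is None:
        raise ValueError("proxy must include host and port: %r" % proxy)
    return parts.hostname, parts.port


print(http_proxy_address("127.0.0.1:1080"))  # ('127.0.0.1', 1080)
```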

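2. Using the official API requires registering an application first to obtain authorization. Register on the "Twitter应用程序" (Twitter Apps) page; the name and description can be anything, and there was no review wait. After filling in the form you get consumer_key, consumer_secret, access_token and access_token_secret, which are needed when requesting data.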

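3. The official API is rate limited; see [Rate limits-Twitter Development] for details. The number of requests allowed per window differs between user authentication and application authentication. The limits for the two calls I use, user_timeline() and the friends lookup (friends_ids() in the code), are as follows:

[Figure: rate limits for the two calls]

So the call frequency of the two endpoints has to be coordinated.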
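One way to reason about coordinating the two endpoints is as a per-window budget. The sketch below assumes 15-minute windows as the default and uses made-up limit values purely for illustration; take the real numbers from the rate limits page. The function names are hypothetical, and it assumes one friends request per user, as in the code in this post:

```python
def min_interval(limit_per_window, window_seconds=15 * 60):
    """Smallest spacing between calls that stays inside one window's limit."""
    return window_seconds / limit_per_window


def users_per_window(timeline_limit, friends_limit, timeline_calls_per_user):
    """How many users one window can fully process (tweets plus friends)."""
    by_timeline = timeline_limit // timeline_calls_per_user
    by_friends = friends_limit      # assumes one friends request per user
    return min(by_timeline, by_friends)


# Illustrative numbers only, not the official limits.
print(min_interval(10))              # 90.0 seconds between calls
print(users_per_window(100, 10, 5))  # 10, the friends endpoint is the bottleneck
```

Whichever endpoint yields the smaller number sets the pace, which is why the friends lookup is worth batching separately from the timeline crawl.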

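4. When the number of requests exceeds the limit, an exception is raised and the program exits. The fix is to set the wait_on_rate_limit and wait_on_rate_limit_notify parameters of tweepy.API to True.

When the limit is reached, the program then waits automatically and prints a notice.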

api = tweepy.API(auth, proxy="127.0.0.1:1080", wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
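To make the waiting behaviour concrete, here is a hand-rolled sketch of the same idea: catch the rate-limit error, sleep until the window reopens, retry. It is not how tweepy implements it; `RateLimitExceeded`, `call_with_wait` and the `reset_at` attribute are hypothetical stand-ins, and `sleep`/`now` are injectable so the example runs instantly:

```python
import time


class RateLimitExceeded(Exception):
    """Stand-in for a client library's rate-limit error (hypothetical)."""

    def __init__(self, reset_at):
        super().__init__("rate limit exceeded")
        self.reset_at = reset_at  # epoch seconds when the window reopens


def call_with_wait(func, *args, sleep=time.sleep, now=time.time, **kwargs):
    """Call func; on a rate-limit error, wait for the reset and try again."""
    while True:
        try:
            return func(*args, **kwargs)
        except RateLimitExceeded as exc:
            delay = max(0.0, exc.reset_at - now())
            print("rate limit reached, sleeping %.0f seconds" % delay)
            sleep(delay)


# Fake endpoint: fails once with a reset 5 seconds away, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] == 1:
        raise RateLimitExceeded(reset_at=105.0)
    return "ok"


slept = []
result = call_with_wait(flaky, sleep=slept.append, now=lambda: 100.0)
print(result, slept)  # ok [5.0]
```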

5. API requests return data in JSON format. [Figure: example JSON response]
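As a rough illustration of working with such a response, the sketch below parses a hand-written sample and flattens it into a CSV-style row. The sample is trimmed to a few of the fields the crawler saves and its values are made up; `to_row` is a hypothetical helper:

```python
import json

# Hand-written sample with made-up values, not real API output.
raw = """
{"id": 1, "text": "hello", "lang": "en", "retweet_count": 2,
 "user": {"id": 42, "screen_name": "example"}}
"""


def to_row(tweet):
    """Flatten one tweet dict into a list, tolerating missing fields."""
    user = tweet.get("user", {})
    return [
        tweet.get("id"),
        tweet.get("text"),
        tweet.get("lang"),
        tweet.get("retweet_count"),
        user.get("id"),
        user.get("screen_name"),
    ]


row = to_row(json.loads(raw))
print(row)  # [1, 'hello', 'en', 2, 42, 'example']
```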

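6. Some users do not allow their data to be fetched, and the request fails with "Not authorized.". Handle this in the except block and skip that user. The tweepy.error messages are also described in the official documentation linked above.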
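The skip-on-error pattern can be sketched without any network access. Here `ApiError` is a hypothetical stand-in for the library's exception type, with a `reason` attribute like the one the crawler checks; anything other than "Not authorized." is re-raised rather than silently skipped:

```python
class ApiError(Exception):
    """Stand-in for the library's error type (hypothetical)."""

    def __init__(self, reason):
        super().__init__(reason)
        self.reason = reason


def crawl_all(user_ids, fetch):
    """Fetch every user, skipping the ones that refuse access."""
    done, skipped = {}, []
    for user_id in user_ids:
        try:
            done[user_id] = fetch(user_id)
        except ApiError as exc:
            if exc.reason == "Not authorized.":
                skipped.append(user_id)   # protected account: move on
            else:
                raise                     # unknown failure: do not hide it
    return done, skipped


def fake_fetch(user_id):
    if user_id == 2:
        raise ApiError("Not authorized.")
    return ["tweet from %d" % user_id]


done, skipped = crawl_all([1, 2, 3], fake_fetch)
print(sorted(done), skipped)  # [1, 3] [2]
```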

Code

import csv
import os
import threading

import tweepy

# Credentials from the Twitter application page
consumer_key = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
consumer_secret = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
access_token = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
access_token_secret = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'

user_ids = [25073877, 198599889]  # users waiting to be crawled
old_ids = []                      # tweets crawled, friends not yet expanded
oldest = []                       # tweets crawled and friends expanded
lock = threading.Lock()


def make_api():
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    return tweepy.API(auth, proxy="127.0.0.1:1080",
                      wait_on_rate_limit=True, wait_on_rate_limit_notify=True)


def tweet_row(tweet):
    # One CSV row per tweet: tweet fields first, then fields of its author.
    user = tweet.user
    return [tweet.id, tweet.text, tweet.created_at, tweet.lang, tweet.place,
            tweet.geo, tweet.source, tweet.truncated, tweet.favorite_count,
            tweet.favorited, tweet.in_reply_to_screen_name,
            tweet.in_reply_to_status_id, tweet.in_reply_to_user_id,
            tweet.is_quote_status, tweet.retweet_count, tweet.retweeted,
            user.id, user.name, user.screen_name, user.statuses_count,
            user.time_zone, user.url, user.notifications,
            user.profile_background_image_url, user.profile_image_url,
            user.profile_image_url_https, user.location,
            user.contributors_enabled, user.created_at, user.default_profile,
            user.default_profile_image, user.description,
            user.favourites_count, user.follow_request_sent,
            user.followers_count, user.following, user.friends_count,
            user.geo_enabled]


def get_tweets():
    with lock:
        api = make_api()
        while user_ids:
            user_id = user_ids[0]
            print('crawling user %s data...' % user_id)
            try:
                tweets = []
                new_tweets = api.user_timeline(user_id=user_id, count=200)
                # Page backwards with max_id; stop on an empty page, which
                # also covers users who have no tweets at all.
                while len(new_tweets) > 0:
                    tweets.extend(new_tweets)
                    old = tweets[-1].id - 1
                    new_tweets = api.user_timeline(user_id=user_id, count=200, max_id=old)
                print('%s tweets downloaded' % len(tweets))
                with open('./data1/%s_tweets.csv' % user_id, 'w',
                          encoding='utf-8', newline='') as file:
                    csv.writer(file).writerows(tweet_row(t) for t in tweets)
                print('saved data')
            except tweepy.TweepError as e:
                if e.reason == 'Not authorized.':
                    print('this user not authorized.')
                else:
                    print(e)
            # Dequeue the user whether or not the crawl succeeded, so that a
            # failing user cannot stall the loop.
            user_ids.remove(user_id)
            old_ids.append(user_id)


def get_friends():
    with lock:
        print('getting user friends id...')
        api = make_api()
        ids = []
        # Expand at most 10 users per round. The original wrote old_ids[10],
        # which selects a single element; a slice is assumed to be intended.
        for user in old_ids[:10]:
            try:
                for idd in api.friends_ids(user):
                    if (idd not in old_ids and idd not in user_ids
                            and idd not in oldest and idd not in ids):
                        ids.append(idd)
            except tweepy.TweepError as e:
                if e.reason == 'Not authorized.':
                    print('this user not authorized.')
                else:
                    print(e)
            old_ids.remove(user)
            oldest.append(user)
        user_ids.extend(ids)
        print('done!')
        with open('crawled and expened user.txt', 'w', encoding='utf-8') as file:
            file.write(' '.join(str(x) for x in oldest))


if __name__ == '__main__':
    os.makedirs('./data1', exist_ok=True)
    # old_ids.txt holds space-separated ids of users crawled in an earlier run.
    with open('old_ids.txt', 'r', encoding='utf-8') as file:
        old_ids.extend(int(x) for x in file.read().split())
    while len(user_ids) > 0:
        t1 = threading.Thread(target=get_tweets)
        t2 = threading.Thread(target=get_friends)
        t1.start()
        t1.join()
        t2.start()
        t2.join()
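The paging loop in get_tweets relies on max_id: each request asks for tweets whose id is at most one below the oldest id collected so far, and the loop ends on an empty page. The sketch below replays that logic against a fake in-memory timeline; `fake_timeline` and `fetch_all` are hypothetical names and the ids are made up:

```python
def fake_timeline(all_ids, count, max_id=None):
    """Return up to `count` ids, newest first, with id <= max_id."""
    eligible = [i for i in all_ids if max_id is None or i <= max_id]
    return sorted(eligible, reverse=True)[:count]


def fetch_all(all_ids, count):
    """Collect every id by paging backwards until a page comes back empty."""
    collected = []
    page = fake_timeline(all_ids, count)
    while page:
        collected.extend(page)
        oldest_seen = collected[-1] - 1   # one below the oldest id so far
        page = fake_timeline(all_ids, count, max_id=oldest_seen)
    return collected


ids = fetch_all([101, 105, 110, 120, 133], count=2)
print(ids)  # [133, 120, 110, 105, 101]
```

Subtracting 1 matters because max_id is inclusive: without it, the oldest tweet of each page would be returned again at the top of the next one.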