For a crawling job at work I need to fetch proxies, but pulling from a single proxy API is too slow for my needs, so I fetch from two proxy APIs in parallel threads to speed things up (more APIs would make it even faster).
The idea: the two APIs run at the same time, each proxy IP they return is recorded in a set and then pushed into a queue, first in, first out. Before enqueuing, we check whether the IP already exists in the set (i.e. it has been used before; if so it is discarded), which guarantees that every proxy we hand out is unique.
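One detail worth noting: both producer threads do a check-then-add on the shared set, which is not atomic, so in the worst case the same IP could slip through twice. A minimal sketch of the same dedup-then-enqueue idea with an explicit lock (the lock and the helper name enqueue_if_new are my additions, not part of the original code):

import threading
from queue import Queue

ip_set = set()
que_proxy = Queue()
ip_lock = threading.Lock()  # added here to make the check-then-add step atomic

def enqueue_if_new(ip, proxies):
    """Enqueue a proxy only if its IP has never been seen before."""
    with ip_lock:
        if ip in ip_set:      # already used once, discard it
            return False
        ip_set.add(ip)        # remember it so it is never reused
    que_proxy.put(proxies)    # FIFO queue of ready-to-use proxies
    return True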
1. First we set up the queue and the set. If the queue is empty (used up), sleep for 2 seconds and wait for new proxies to be inserted.
ip_set = set()
que_proxy = Queue()


# Take a usable proxy from the queue
def useful_proxies():
    while 1:
        if not que_proxy.empty():
            return que_proxy.get()
        else:
            print("queue empty. please wait for 2 seconds")
            time.sleep(2)
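As an aside, the empty-check-plus-sleep polling could also be replaced by the queue's own blocking get, which wakes up as soon as a producer inserts something instead of always sleeping a fixed 2 seconds. A possible variant of the same function:

from queue import Empty

def useful_proxies():
    while True:
        try:
            # block until a proxy arrives, giving up every 2 seconds to print a hint
            return que_proxy.get(timeout=2)
        except Empty:
            print("queue empty. please wait for 2 seconds")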
2. Each API fetches proxies and pushes them into the queue (fill in your own API URLs, of course).

# Fetch a proxy IP from the first API
def get_proxies():
    while 1:
        text = requests.get('your proxies API').text
        try:
            ip = re.findall(r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}:\d{2,5})', text)[0]
        except IndexError:
            ip = ''
        if ip != '':
            proxies = {
                "http": "http://" + ip,
                "https": "https://" + ip
            }
            # Not in the set yet, so it is a new proxy
            if ip not in ip_set:
                print("Got crawler IP %s" % str(ip))
                # Add it to the set
                ip_set.add(ip)
                # Push it into the queue
                que_proxy.put(proxies)
        else:
            print("No IP returned, sleeping for 2 seconds")
            time.sleep(2)


# Fetch plain proxies from the second API
def get_origin():
    while 1:
        time.sleep(1)
        old_proxies = []
        # Fetch
        jsonstr = requests.get('your proxies API', timeout=2).text
        if "data" in jsonstr:
            jsontrees = json.loads(jsonstr)
            for data in jsontrees["data"]:
                ip = str(data["ip"]) + ":" + str(data["port"])
                old_proxies.append(ip)
            # Validate
            for ip in old_proxies:
                proxies = {
                    'https': "https://" + str(ip),
                    'http': "http://" + str(ip)
                }
                if ip not in ip_set:
                    ip_set.add(ip)
                    try:
                        if requests.get('http://www.baidu.com/', proxies=proxies, timeout=2).status_code == 200:
                            # Push it into the queue
                            print("successful proxies")
                            print(proxies)
                            que_proxy.put(proxies)
                    except requests.RequestException:
                        print("fail proxies")
                        print(proxies)
        else:
            print("Too many requests to origin ip. Wait for 3 seconds.")
            time.sleep(3)
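As mentioned at the top, more APIs mean more speed. Any extra source only has to follow the same contract: parse out "ip:port", skip IPs already in ip_set, and put a proxies dict on que_proxy. A hedged template (the function name get_third_source, the URL placeholder and the one-proxy-per-line response format are assumptions for illustration, not a real API):

# Template for a third proxy source; plug in your own API and parsing logic
def get_third_source():
    while 1:
        try:
            text = requests.get('your third proxies API', timeout=2).text
        except requests.RequestException:
            time.sleep(2)
            continue
        # assumption: this API returns one "ip:port" per line
        for ip in text.splitlines():
            ip = ip.strip()
            if ip and ip not in ip_set:
                ip_set.add(ip)
                que_proxy.put({
                    "http": "http://" + ip,
                    "https": "https://" + ip
                })
        time.sleep(1)

Register it as one more thread next to t1 and t2 in the next step and the queue fills correspondingly faster.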
3. Create the threads.

threads = []
# First proxy API
t1 = threading.Thread(target=get_proxies)
threads.append(t1)
# Second proxy API
t2 = threading.Thread(target=get_origin)
threads.append(t2)

4. That's it. All that's left is to test it in the main function, as sketched below.
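A minimal sketch of that test (the same code appears in the complete listing in step 5): start both producer threads as daemons so they exit together with the main program, then pull a few proxies off the queue and time how long it takes.

if __name__ == '__main__':
    start_time = time.time()
    for t in threads:
        t.daemon = True   # daemon threads die with the main thread
        t.start()
    for i in range(5):
        print(useful_proxies())
    print("elapsed:", time.time() - start_time)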
5. The complete code (again, fill in your own API URLs).
import json
import random
import re
import time
from queue import Queue
import threading

import requests

ip_set = set()
que_proxy = Queue()


# Take a usable proxy from the queue
def useful_proxies():
    while 1:
        if not que_proxy.empty():
            return que_proxy.get()
        else:
            # print("queue empty. please wait for 2 seconds")
            time.sleep(2)


# Proxy API 2
def get_origin():
    while 1:
        time.sleep(1)
        old_proxies = []
        # Fetch
        jsonstr = requests.get('your proxies API', timeout=2).text
        if "data" in jsonstr:
            jsontrees = json.loads(jsonstr)
            for data in jsontrees["data"]:
                ip = str(data["ip"]) + ":" + str(data["port"])
                old_proxies.append(ip)
            # Validate
            for ip in old_proxies:
                proxies = {
                    'https': "https://" + str(ip),
                    'http': "http://" + str(ip)
                }
                if ip not in ip_set:
                    ip_set.add(ip)
                    try:
                        if requests.get('http://www.baidu.com/', proxies=proxies, timeout=2).status_code == 200:
                            # Push it into the queue
                            # print("successful proxies")
                            # print(proxies)
                            que_proxy.put(proxies)
                    except requests.RequestException:
                        # print("fail proxies")
                        # print(proxies)
                        pass
        else:
            # print("Too many requests to origin ip. Wait for 3 seconds.")
            time.sleep(3)


# Proxy API 1
def get_proxies():
    while 1:
        text = requests.get('your proxies API').text
        try:
            ip = re.findall(r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}:\d{2,5})', text)[0]
        except IndexError:
            ip = ''
        if ip != '':
            proxies = {
                "http": "http://" + ip,
                "https": "https://" + ip
            }
            if ip not in ip_set:
                # print("Got crawler IP %s" % str(ip))
                ip_set.add(ip)
                que_proxy.put(proxies)
        else:
            # print("No IP returned, sleeping for 2 seconds")
            time.sleep(2)


threads = []
t1 = threading.Thread(target=get_proxies)
threads.append(t1)
t2 = threading.Thread(target=get_origin)
threads.append(t2)

if __name__ == '__main__':
    start_time = time.time()
    for t in threads:
        t.daemon = True
        t.start()
    for i in range(5):
        proxies = useful_proxies()
        print(proxies)
    end_time = time.time()
    print("Elapsed time:")
    print(end_time - start_time)

Test result: 5 proxies in about 16 seconds, a decent speedup. If you think it is still not fast enough, suggestions are welcome!
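One tweak that would likely shave off more time: get_origin validates candidate IPs one at a time, and every dead proxy can block for the full 2-second timeout. Validating a batch concurrently is a possible improvement, not part of the original code; a sketch using concurrent.futures (the helper names check_and_enqueue and validate_batch are hypothetical):

from concurrent.futures import ThreadPoolExecutor

def check_and_enqueue(ip):
    """Validate one candidate proxy and enqueue it if it works."""
    proxies = {"http": "http://" + ip, "https": "https://" + ip}
    try:
        if requests.get('http://www.baidu.com/', proxies=proxies, timeout=2).status_code == 200:
            que_proxy.put(proxies)
    except requests.RequestException:
        pass

def validate_batch(candidates):
    # check up to 10 candidates at once instead of one by one
    new_ips = [ip for ip in candidates if ip not in ip_set]
    ip_set.update(new_ips)
    with ThreadPoolExecutor(max_workers=10) as pool:
        pool.map(check_and_enqueue, new_ips)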