最近在抓取某网站时候,遇到了验证码的问题,想要破解一下验证码,示例验证码如下
这种验证码位置不固定,并且有倾斜,垂直投影会有交叉的部分,所以不适合用垂直切割的办法,受图像处理中洪水算法的启发,想到一种思路
洪水算法 参见 http://blog.csdn.net/Trent1985/article/details/44904431
常见实现办法有递归,扫描线算法等,本文针对的验证码不适合用扫描线算法,遂用递归实现
以下是核心代码:
import numpy as np import cv2 from PIL import Image def getPoint(x,y,data,subdata=None): a=[0,-1,0,1,0,-2,0,2,0,-3,0,3,0,-4,0,4,0,-5,0,5] b=[1,0,-1,0,2,0,-2,0,3,0,-3,0,4,0,-4,0,5,0,-5,0] width,height=data.shape if subdata is None: subdata=[] if x>5 and y<height-5 and y>5 and x<width-5: for i in range(20): if data[x+a[i]][y+b[i]]==1: subdata.append((x+a[i],y+b[i])) data[x+a[i]][y+b[i]]=2 getPoint(x+a[i],y+b[i],data,subdata) subdata.append((x,y)) def getcell(data): list1=[] index=0 flag=True for y in range(data.shape[1]): for x in range(data.shape[0]): if data[x][y]==1: if list1: for i in range(len(list1)): if (x,y) in list1[i]: flag=False if not flag: continue list1.append([]) getPoint(x,y,data,list1[index])#调用流水算法 index+=1 else : continue for index in range(len(list1)): l=list1[index][0][0] t=list1[index][0][1] r=list1[index][0][0] b=list1[index][0][1] for i in list1[index]: x=i[0] y=i[1] l=min(l,x) t=min(t,y) r=max(r,x) b=max(b,y) w=r-l+1 h=b-t+1 if (w*h <8):#去除小色块 continue img0=np.zeros([w,h])#创建全0矩阵 for x,y in list1[index]: img0[x-l][y-t]=1 img0[img0<1]=255 img1=Image.fromarray(img0) img1=img1.convert('RGB') img1.save('img2/'+str(index)+'.png') if __name__=="__main__": filename='captcha1.png' data=cv2.imread(filename,2) allimg=getcell(data) 以上是我个人实现的解决方案,虽然算不上最优,但基本能够满足需求效果图
9: 5: 4: n:
有不足之处欢迎指出,也欢迎提出更优的解决方案
