Regular expressions for removing HTML tags

xiaoxiao 2024-10-02

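When building an index of HTML files, we need to strip the HTML tags from them, such as <a ...></a>, <script></script>, <style></style>, and so on. In general, matching them with regular expressions is the quickest and most convenient way.

The following regular expressions match these tags: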

<\s*script.*?>[^<>]*?<\s*/\s*script\s*>

or, so that script bodies containing < or > are also matched:

<\s*script.*?>[\s\S]*?<\s*/\s*script\s*>

Similarly, for style blocks:

<\s*style[^>]*>[^<>]*?<\s*/\s*style\s*>

To match case-insensitively, add the corresponding flag when compiling the pattern. In Java, for example:

Pattern p = Pattern.compile(regx, Pattern.CASE_INSENSITIVE);
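As a minimal sketch of how the patterns above might be put together, the class below compiles the script and style expressions with Pattern.CASE_INSENSITIVE and removes whole blocks before dropping the remaining tags. The class name RegexTagStripper, the strip method, and the generic <[^>]*> pattern are assumptions made for illustration and do not come from the original post. The [\s\S]*? variant is used for both block bodies because a script body such as "a < b" would defeat [^<>]*?.

```java
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class RegexTagStripper {
    // Script and style blocks, body included; [\s\S]*? also spans '<', '>' and newlines.
    private static final Pattern SCRIPT = Pattern.compile(
            "<\\s*script.*?>[\\s\\S]*?<\\s*/\\s*script\\s*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern STYLE = Pattern.compile(
            "<\\s*style[^>]*>[\\s\\S]*?<\\s*/\\s*style\\s*>", Pattern.CASE_INSENSITIVE);
    // Assumption (not from the post): any remaining "<...>" run is treated as a tag.
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]*>");

    static String strip(String html) {
        // Remove script and style blocks first so their bodies do not end up in the index.
        String text = SCRIPT.matcher(html).replaceAll("");
        text = STYLE.matcher(text).replaceAll("");
        return ANY_TAG.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<SCRIPT type=\"text/javascript\">if (a < b) { run(); }</SCRIPT>"
                + "<style>p { color: red; }</style><p>Hello <a href=\"#\">world</a></p>";
        System.out.println(strip(html)); // prints "Hello world"
    }
}
```

The order matters here: if the generic tag pattern ran first, the script and style bodies would be left behind as plain text.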

However, in a case like the following, it seems impossible to remove the HTML tags with a regular expression:

<IMG onmousewheel="return bbimg(this)" style="CURSOR: pointer" onclick=javascript:window.open(this.src); alt="同学 16P" src="http://xxx.com/xxx.jpg" onload="javascript:if(this.width style=" border=0 ? cursor: pointer?>
<IMG onmousewheel="return imgzoom(this);" onmouseover="if(this.width>screen.width*0.7) {this.resized=true; this.width=screen.width*0.7; this.style.cursor='hand';}" onclick="window.open('http://xxx.com/xxx.jpg');" alt="按这里可在新视窗开启或按 CTRL+Mouse卷动可进行放大/缩细"src="http://xxx-teen-tv.com/ntc-53/02.jpg" onload="if(this.width > screen.width*0.7) {this.resized=true;this.width=screen.width*0.7}" border=0>

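Here the tags carry JavaScript code inside their attributes, and that code itself contains greater-than or less-than signs, so a pattern that stops at the first > cuts the tag short.

For pages like this, it seems the only option is to run the HTML through a parser, and regular expressions will not work? I hope someone knowledgeable who passes by can answer this. Thanks.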
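One possible middle ground between a plain regular expression and a full HTML parser is a small hand-written scanner that remembers whether it is inside a quoted attribute value, so that a > or < inside quotes does not end the tag. The sketch below is only an illustration under stated assumptions: the class QuoteAwareTagStripper and its strip method are hypothetical names, attribute values are assumed to have balanced quotes, and every < in the input is assumed to start a tag. It does not handle comments, CDATA, or script bodies.

```java
// Hypothetical helper, for illustration only.
class QuoteAwareTagStripper {
    static String strip(String html) {
        StringBuilder out = new StringBuilder();
        boolean inTag = false; // true between '<' and the '>' that closes the tag
        char quote = 0;        // active quote character inside a tag, or 0 if none
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (!inTag) {
                if (c == '<') {
                    inTag = true;      // assumption: every '<' opens a tag
                } else {
                    out.append(c);     // plain text outside any tag is kept
                }
            } else if (quote != 0) {
                if (c == quote) {
                    quote = 0;         // closing quote of the attribute value
                }
            } else if (c == '"' || c == '\'') {
                quote = c;             // opening quote: '<' and '>' are now plain data
            } else if (c == '>') {
                inTag = false;         // a '>' outside quotes ends the tag
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<IMG onmouseover=\"if(this.width>screen.width*0.7) "
                + "{this.resized=true;}\" border=0>caption";
        System.out.println(strip(html)); // prints "caption"
    }
}
```

A > inside an unquoted attribute value, or a quote that is never closed, would still confuse this scanner, so for arbitrary real-world pages a dedicated HTML parser is likely the more robust choice.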

When reposting, please cite the original address: https://www.6miu.com/read-5017957.html
