Parsing HTML: how to get around the <> inside a script
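As the title says.
My HTML contains a block of script code, and the part highlighted in red contains <>. Every time I parse it with SGMLParser, it raises an error.
How can I get around the <> inside the script?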
<div ...>
....
<script>
function cutLength(str, maxLen, appended, appendLength){
    appended = appended||"...";
    appendLength = appendLength||2;
    str = str.replace(/<!.*?>/g, "");
    if (len(str) > maxLen){
        do{
            str = str.substring(0, str.length-1);
        }while(str && (len(str)+appendLength > maxLen));
        if (str.lastIndexOf("</") != str.lastIndexOf("<")){
            str = str.substring(0, str.lastIndexOf("<"))+str.substring(str.lastIndexOf(">")+1);
        }
        return str+appended;
    }
    return str;
}
</script></div>
Error:
Traceback (most recent call last):
  File "/Users/Bonnie/Desktop/test_url.py", line 54, in <module>
    lister.feed(content)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/sgmllib.py", line 104, in feed
    self.goahead(0)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/sgmllib.py", line 174, in goahead
    k = self.parse_declaration(i)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/markupbase.py", line 98, in parse_declaration
    decltype, j = self._scan_name(j, i)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/markupbase.py", line 392, in _scan_name
    % rawdata[declstartpos:declstartpos+20])
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/sgmllib.py", line 111, in error
    raise SGMLParseError(message)
sgmllib.SGMLParseError: expected name token at '<!.*?>/g, "");\n\t\tif '
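------Solution approach----------------------
SGMLParser has a literal attribute. When the script tag is detected, set literal to 1 and the problem goes away. I have not worked out exactly why this works; you would need to read the sgmllib source for that.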
# Python 2 only: sgmllib was removed in Python 3.
from sgmllib import SGMLParser


class URLLister(SGMLParser):
    def __init__(self):
        SGMLParser.__init__(self)
        self.data = []
        self.getdata = False

    def handle_data(self, text):
        # Every text chunk is printed; it is stored only while getdata is set.
        print 'data:' + text
        if self.getdata:
            self.data.append(text)
            print text

    def start_script(self, attr):
        print "script found"
        # Treat what follows as literal text rather than markup.
        self.literal = 1

    def start_div(self, attr):
        print "div started"


# Same script as in the question, with "<a>" as the replacement string
# to check that a tag inside the script is not parsed as HTML.
content = '<div><script>function cutLength(str, maxLen, appended, appendLength){appended = appended||"...";appendLength = appendLength||2;str = str.replace(/<!.*?>/g, "<a>");if (len(str) > maxLen){do{str = str.substring(0, str.length-1);}while(str && (len(str)+appendLength > maxLen));if (str.lastIndexOf("</") != str.lastIndexOf("<")){str = str.substring(0, str.lastIndexOf("<"))+str.substring(str.lastIndexOf(">")+1);}return str+appended;}return str;}</script></div><div>2nd</div>'
lister = URLLister()
lister.feed(content)
print lister.data
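Output:
As you can see, the <a> tag embedded in the script above was not treated as part of the HTML; it ended up in the data instead.
Also, the second div after the script was recognized correctly.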
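
For anyone reading this on Python 3, where sgmllib no longer exists: here is a rough sketch of the same check using the standard html.parser module instead. This is not the code from the reply above, only a swapped-in alternative. HTMLParser already treats the body of a script element as raw text, so there is no literal flag to set. The class name ScriptAwareParser and its attributes (tags, script_parts, text_parts, in_script) are made up for illustration.

```python
from html.parser import HTMLParser


class ScriptAwareParser(HTMLParser):
    """Hypothetical helper: records start tags and keeps script text apart from page text."""

    def __init__(self):
        super().__init__()
        self.tags = []          # start tags seen, in order
        self.script_parts = []  # raw text found inside <script>
        self.text_parts = []    # text found everywhere else
        self.in_script = False

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Data may arrive in several chunks, so collect now and join later.
        if self.in_script:
            self.script_parts.append(data)
        else:
            self.text_parts.append(data)


# Same test string as in the reply above.
content = (
    '<div><script>function cutLength(str, maxLen, appended, appendLength)'
    '{appended = appended||"...";appendLength = appendLength||2;'
    'str = str.replace(/<!.*?>/g, "<a>");'
    'if (len(str) > maxLen){do{str = str.substring(0, str.length-1);}'
    'while(str && (len(str)+appendLength > maxLen));'
    'if (str.lastIndexOf("</") != str.lastIndexOf("<"))'
    '{str = str.substring(0, str.lastIndexOf("<"))'
    '+str.substring(str.lastIndexOf(">")+1);}'
    'return str+appended;}return str;}</script></div><div>2nd</div>'
)

parser = ScriptAwareParser()
parser.feed(content)
parser.close()

script_text = "".join(parser.script_parts)
page_text = "".join(parser.text_parts)

print(parser.tags)   # the <a> inside the script should not show up here
print(script_text)
print(page_text)
```

The point of joining script_parts at the end is that handle_data is allowed to deliver text in pieces, so checking a single chunk for the script body would be fragile.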