The Road of Mathematics (3) - Machine Learning (3) - Machine Learning Algorithms - Bayes' Theorem (3)

Automatically extracting web page links by category

>>> runfile(r'K:ook_prog ext_bayes.py', wdir=r'K:ook_prog')
. . . . . 
Crawling automobile-category pages: http://finance.chinanews.com/auto/gd.shtml
http://www.chinanews.com/auto/2013/09-17/5295068.shtml
http://www.chinanews.com/auto/2013/09-17/5294694.shtml
http://www.chinanews.com/auto/2013/09-17/5294292.shtml
http://www.chinanews.com/auto/2013/09-17/5294285.shtml
http://www.chinanews.com/auto/2013/09-17/5294279.shtml
http://www.chinanews.com/auto/2013/09-17/5294275.shtml
http://www.chinanews.com/auto/2013/09-17/5294268.shtml
http://www.chinanews.com/auto/2013/09-17/5294261.shtml
http://www.chinanews.com/auto/2013/09-17/5294247.shtml
http://www.chinanews.com/auto/2013/09-17/5294242.shtml

........

.......



Crawling military-category pages: http://www.chinanews.com/mil/news.shtml
http://www.chinanews.com/mil/2013/09-17/5295038.shtml
http://www.chinanews.com/mil/2013/09-17/5295037.shtml
http://www.chinanews.com/mil/2013/09-17/5295021.shtml
http://www.chinanews.com/mil/2013/09-17/5295016.shtml
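The link-extraction step that produces output like the above can be sketched as follows. This is only a minimal illustration, not the script used in this post: the names `LinkCollector` and `extract_article_links` are hypothetical, the URL pattern is an assumption inferred from the printed links, and the sketch is written for Python 3 with the standard library only. It works on an HTML string, so downloading the index page is left out.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_article_links(index_html, index_url, section):
    """Return de-duplicated absolute article URLs for one section, e.g. 'auto'."""
    # Assumed URL shape, inferred from the printed links:
    # /<section>/<yyyy>/<mm-dd>/<id>.shtml
    pattern = re.compile(r"/%s/\d{4}/\d{2}-\d{2}/\d+\.shtml$" % re.escape(section))
    collector = LinkCollector()
    collector.feed(index_html)
    links = []
    for href in collector.hrefs:
        url = urljoin(index_url, href)  # resolve relative links
        if pattern.search(url) and url not in links:
            links.append(url)
    return links


# Made-up index page for illustration only.
sample_html = """
<ul>
  <li><a href="/auto/2013/09-17/5295068.shtml">story 1</a></li>
  <li><a href="http://www.chinanews.com/auto/2013/09-17/5294694.shtml">story 2</a></li>
  <li><a href="/auto/2013/09-17/5295068.shtml">story 1 again</a></li>
  <li><a href="/mil/news.shtml">other section</a></li>
</ul>
"""
auto_links = extract_article_links(
    sample_html, "http://finance.chinanews.com/auto/gd.shtml", "auto")
for link in auto_links:
    print(link)
```

Filtering by a per-section URL pattern is one simple way to keep article pages and drop navigation links; a real index page may need a different pattern.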

Extracting page body text by category

.................

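<title>Army cook becomes missile fuel technician after 18 years of diligent study, prepares for "Chang'e-3"-中新网</title>
<title>China's domestically built aircraft carrier electromagnetic catapult revealed; combat capability to rise 2-3 fold-中新网</title>
<title>Mechanized infantry squad leader spots target in confrontation drill, praised as "Sharp-Eyed Hero" (photo)-中新网</title>
<title>Spain arrests a terrorist group leader suspected of suicide attacks-中新网</title>
<title>Yan Xiaodong, former political commissar of the Shaanxi Armed Police Corps, promoted to political commissar of the * Corps-中新网</title>
<title>PLA barracks stairs widened to 4 meters; assembly speed doubles-中新网</title>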
............
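As a rough illustration of pulling a title and body text out of one downloaded page, here is a minimal sketch. It assumes that the body text lives in `<p>` elements, which may not hold for every page, and it is not necessarily how the script in this post does it. `TitleAndBodyParser` and `extract_title_and_body` are hypothetical names, and the code targets Python 3 with the standard library only.

```python
from html.parser import HTMLParser


class TitleAndBodyParser(HTMLParser):
    """Collects the <title> text and the text inside <p> elements."""

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.paragraphs = []
        self._in_title = False
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "p":
            self._in_p = True
            self.paragraphs.append([])  # start a new paragraph buffer

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        elif self._in_p:
            self.paragraphs[-1].append(data)


def extract_title_and_body(page_html):
    """Return (title, body); body joins the non-empty paragraphs with newlines."""
    parser = TitleAndBodyParser()
    parser.feed(page_html)
    parser.close()
    title = "".join(parser.title_parts).strip()
    paragraphs = ["".join(parts).strip() for parts in parser.paragraphs]
    body = "\n".join(p for p in paragraphs if p)
    return title, body


# Made-up page for illustration only.
sample_page = (
    "<html><head><title>Sample headline</title></head>"
    "<body><div class='nav'>menu</div>"
    "<p>First paragraph.</p><p>Second <b>bold</b> paragraph.</p>"
    "</body></html>"
)
title, body = extract_title_and_body(sample_page)
print(title)
print(body)
```

Text outside `<p>` elements, such as the navigation `<div>` in the sample, is ignored, which is a crude way to skip menus and footers.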

All content on this blog is original; if you repost it, please credit the source.

http://blog.csdn.net/myhaspl/

Next step: extract terms from the body text

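# Extract body-text terms by category
import jieba

# ybtxt[ci][cj]: body text of page cj in category ci
# f_stop_seg_list: stop-word list
# (both are assumed to be defined earlier in the script)
yb_txt=[]
for ci in xrange(0,len(ybtxt)):
    yb_txt.append([])
    for cj in xrange(0,len(ybtxt[ci])):
        # start an empty term list for this page
        yb_txt[ci].append([])
        my_str = ybtxt[ci][cj]
        my_txt=jieba.cut(my_str)
        for myword in my_txt:
            # keep terms that are not stop words and are longer than 2 characters
            if not(myword.strip() in f_stop_seg_list) and len(myword.strip())>2:
                yb_txt[ci][cj].append(myword)
                print ".",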
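The nested structure that the loop above builds (category, then page, then terms) can also be shown as a small stand-alone function. This is a sketch under stated assumptions rather than the code used in this post: `extract_terms` is a hypothetical name, the tokenizer is passed in as a parameter so the example runs without jieba (whitespace splitting stands in for `jieba.cut` here), and the code is written for Python 3. Unlike the loop above, it stores the stripped form of each term.

```python
def extract_terms(pages_by_category, tokenize, stop_words, min_len=3):
    """Return terms[ci][cj]: the kept terms of page cj in category ci."""
    terms = []
    for pages in pages_by_category:
        category_terms = []
        for text in pages:
            kept = []
            for word in tokenize(text):
                word = word.strip()
                # Same filter as the loop above: not a stop word and
                # longer than 2 characters (min_len=3).
                if word not in stop_words and len(word) >= min_len:
                    kept.append(word)
            category_terms.append(kept)
        terms.append(category_terms)
    return terms


# Made-up English sample; str.split is only a stand-in tokenizer.
sample_pages = [
    ["the electric sedan has a new battery", "suv sales rise"],
    ["navy carrier drill"],
]
sample_stop_words = {"the", "has", "new"}
sample_terms = extract_terms(sample_pages, str.split, sample_stop_words)
print(sample_terms)
```

With jieba installed, `jieba.cut` could be passed as the `tokenize` argument in place of `str.split`.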