Getting all visible text from a web page with Selenium

Question:

I've been googling this all day without finding the answer, so apologies in advance if this has already been answered.

I'm trying to get all visible text from a large number of different websites. The reason is that I want to process the text to eventually categorize the websites.

After a couple of days of research, I decided that Selenium was my best chance. I've found a way to grab all the text with Selenium, but unfortunately the same text is being grabbed multiple times:

from selenium import webdriver
from selenium.webdriver.common.by import By
import codecs

filen = codecs.open('outoput.txt', encoding='utf-8', mode='w+')

driver = webdriver.Firefox()
driver.get("http://www.examplepage.com")

# Every element in the document, including nested ones
allelements = driver.find_elements(By.XPATH, "//*")

ferdigtxt = []

for i in allelements:
    # Skip text that has already been written
    if i.text in ferdigtxt:
        pass
    else:
        ferdigtxt.append(i.text)
        filen.writelines(i.text)

filen.close()
driver.quit()

The if condition inside the for loop is an attempt at eliminating the problem of fetching the same text multiple times. However, it only works as planned on some web pages. (It also makes the script A LOT slower.)

I'm guessing the reason for my problem is that when I ask for the inner text of an element, I also get the inner text of the elements nested inside it.
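That guess can be checked without a browser. The sketch below is only an illustration, not Selenium itself: it uses the standard library's xml.etree.ElementTree on a made-up snippet, and the names SNIPPET, full_text and own_text_pieces are hypothetical. It assumes that an element's text, as reported by Selenium, roughly corresponds to the text of the element plus all of its descendants.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet standing in for a real page
SNIPPET = "<div><p>Hello <b>world</b></p><p>Bye</p></div>"


def full_text(element):
    """Text of an element including all of its descendants."""
    return "".join(element.itertext())


def own_text_pieces(root):
    """Text attached directly to each node, so every piece is seen once."""
    pieces = []
    for element in root.iter():
        # .text sits before the first child, .tail after the closing tag
        if element.text and element.text.strip():
            pieces.append(element.text.strip())
        if element.tail and element.tail.strip():
            pieces.append(element.tail.strip())
    return pieces


root = ET.fromstring(SNIPPET)
nested = [full_text(e) for e in root.iter()]
once = own_text_pieces(root)

print(nested)  # ['Hello worldBye', 'Hello world', 'world', 'Bye']
print(once)    # ['Hello', 'world', 'Bye']
```

Every string in nested is different, so an equality check like the one in my loop keeps all of them even though the same words appear several times.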

Is there any way around this? Is there some sort of master element whose inner text I can grab? Or is there a completely different way that would let me reach my goal? Any help would be greatly appreciated, as I'm out of ideas for this one.

The reason I used Selenium and not Mechanize and Beautiful Soup is that I wanted JavaScript-rendered text.

Using lxml, you might try something like this:

import contextlib
import selenium.webdriver as webdriver
import lxml.html as LH
import lxml.html.clean as clean  # newer lxml releases need the lxml_html_clean package for this

url = "http://www.yahoo.com"
ignore_tags = ('script', 'noscript', 'style')
with contextlib.closing(webdriver.Firefox()) as browser:
    browser.get(url)  # Load page
    content = browser.page_source
    cleaner = clean.Cleaner()
    content = cleaner.clean_html(content)
    # Keep a copy of the cleaned HTML for inspection
    with open('/tmp/source.html', 'w', encoding='utf-8') as f:
        f.write(content)
    doc = LH.fromstring(content)
    with open('/tmp/result.txt', 'w', encoding='utf-8') as f:
        for elt in doc.iterdescendants():
            if elt.tag in ignore_tags:
                continue
            # .text precedes the first child; .tail follows the closing tag
            text = elt.text or ''
            tail = elt.tail or ''
            words = ' '.join((text, tail)).strip()
            if words:
                f.write(words + '\n')

This seems to get almost all of the text on www.yahoo.com, except for text in images and some text that changes over time (perhaps done with JavaScript and page refreshes).
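If lxml is not available, the same idea (visit each piece of text once and skip script, noscript and style content) can be sketched with the standard library's html.parser instead. This is a different parser from the one used above, and TextCollector and extract_text are hypothetical names chosen for illustration. Like the lxml version, it works from the markup alone, so it cannot tell whether CSS hides an element.

```python
from html.parser import HTMLParser

IGNORE_TAGS = {"script", "noscript", "style"}


class TextCollector(HTMLParser):
    """Collects text nodes, skipping anything inside IGNORE_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0  # > 0 while inside an ignored tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in IGNORE_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in IGNORE_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Each text node arrives here exactly once, so nothing is duplicated
        words = data.strip()
        if words and not self.skip_depth:
            self.chunks.append(words)


def extract_text(html):
    """Return the non-empty text chunks of an HTML string, in document order."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

With Selenium, the string to pass in would be the rendered source, for example extract_text(browser.page_source), so JavaScript-generated markup is included.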