Getting all visible text from a webpage using Selenium

Question:

I've been googling this all day without finding the answer, so apologies in advance if this has already been answered.

I'm trying to get all visible text from a large number of different websites. The reason is that I want to process the text to eventually categorize the websites.

After a couple of days of research, I decided that Selenium was my best chance. I've found a way to grab all the text with Selenium; unfortunately, the same text is being grabbed multiple times:

from selenium import webdriver
import codecs

filen = codecs.open('outoput.txt', encoding='utf-8', mode='w+')

driver = webdriver.Firefox()

driver.get("http://www.examplepage.com")

allelements = driver.find_elements_by_xpath("//*")

ferdigtxt = []

for i in allelements:
    if i.text in ferdigtxt:
        pass
    else:
        ferdigtxt.append(i.text)
        filen.writelines(i.text)

filen.close()

driver.quit()

The if condition inside the for loop is an attempt at eliminating the problem of fetching the same text multiple times. However, it only works as planned on some web pages. (It also makes the script A LOT slower.)
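As an aside on the slowdown: testing membership in a list rescans the whole list for every element, so a set is usually the cheaper way to remember what has already been written. The sketch below is only an illustration with a hypothetical helper name (unique_in_order) and made-up sample strings. It does not address the nested-text duplication, and the repeated i.text calls to the browser may well cost more than the list lookups.

```python
def unique_in_order(texts):
    """Return the non-empty strings from texts, keeping only first occurrences."""
    seen = set()      # constant-time membership checks on average
    result = []       # preserves the order in which texts were first seen
    for text in texts:
        if text and text not in seen:
            seen.add(text)
            result.append(text)
    return result


# Made-up strings standing in for the values of element.text
sample = ["Home", "News", "Home", "", "News", "Contact"]
print(unique_in_order(sample))
```

With Selenium, each element's text would be read once into a variable before being passed to a helper like this, instead of calling i.text several times per element.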

I'm guessing the reason for my problem is that, when asking for the inner text of an element, I also get the inner text of the elements nested inside the element in question.
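That guess can be illustrated without a browser. The sketch below uses the standard library's xml.etree.ElementTree on a small, made-up, well-formed snippet as a stand-in for a rendered page. Collecting the full text of every element shows the same words turning up once per ancestor, which is roughly what happens when .text is read for every match of "//*". The function name and the snippet are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical snippet standing in for a rendered page
SNIPPET = "<div><p>Hello <b>world</b></p><p>Bye</p></div>"


def full_text_per_element(markup):
    """Return (tag, text) pairs where text includes all descendant text."""
    root = ET.fromstring(markup)
    # itertext() walks the element and everything nested inside it
    return [(el.tag, "".join(el.itertext())) for el in root.iter()]


for tag, text in full_text_per_element(SNIPPET):
    print(tag, repr(text))
```

Here "world" is reported three times (for div, the first p, and b), and none of the three strings are equal, which is why an exact-match check on whole strings cannot remove the repetition.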

Is there any way around this? Is there some sort of master element I can grab the inner text of? Or a completely different approach that would let me reach my goal? Any help would be greatly appreciated, as I'm out of ideas for this one.

The reason I used Selenium rather than Mechanize and Beautiful Soup is that I want the JavaScript-rendered text.

Answer:

Using lxml, you might try something like this:

import contextlib
import selenium.webdriver as webdriver
import lxml.html as LH
# Note: recent lxml releases may need the separate lxml_html_clean package for this import
import lxml.html.clean as clean

url = "http://www.yahoo.com"
ignore_tags = ('script', 'noscript', 'style')
with contextlib.closing(webdriver.Firefox()) as browser:
    browser.get(url)  # Load page
    content = browser.page_source  # HTML after the browser has run the page's JavaScript
    cleaner = clean.Cleaner()
    content = cleaner.clean_html(content)
    # Save the cleaned HTML for inspection (Python 3: write text, not bytes)
    with open('/tmp/source.html', 'w', encoding='utf-8') as f:
        f.write(content)
    doc = LH.fromstring(content)
    with open('/tmp/result.txt', 'w', encoding='utf-8') as f:
        for elt in doc.iterdescendants():
            if elt.tag in ignore_tags:
                continue
            # text: content before the first child; tail: content after the closing tag
            text = elt.text or ''
            tail = elt.tail or ''
            words = ' '.join((text, tail)).strip()
            if words:
                f.write(words + '\n')

This seems to get almost all of the text on www.yahoo.com, except for text in images and some text that changes over time (perhaps done with JavaScript and refresh).
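The reason this avoids the duplicates is that each element contributes only its own text and tail, never the text of its descendants, so every piece of text is emitted once. A minimal, browser-free sketch of that idea follows, using the standard library's xml.etree.ElementTree (which has the same text/tail model) on a made-up, well-formed snippet. The helper name own_text_pieces and the sample markup are assumptions for illustration, not part of the code above, and unlike that code the sketch keeps the tail text that follows an ignored element.

```python
import xml.etree.ElementTree as ET

IGNORE_TAGS = {"script", "noscript", "style"}

# Hypothetical snippet standing in for cleaned page source
SNIPPET = (
    "<body><h1>Title</h1><p>Hello <b>world</b> again</p>"
    "<script>var x = 1;</script><p>Bye</p></body>"
)


def own_text_pieces(markup):
    """Collect each piece of text exactly once, skipping ignored tags."""
    pieces = []

    def visit(el, parent_skipped):
        skipped = parent_skipped or el.tag in IGNORE_TAGS
        # el.text is the text before the element's first child
        if not skipped and el.text and el.text.strip():
            pieces.append(el.text.strip())
        for child in el:
            visit(child, skipped)
            # child.tail follows the child's closing tag, inside this element
            if not skipped and child.tail and child.tail.strip():
                pieces.append(child.tail.strip())

    visit(ET.fromstring(markup), False)
    return pieces


print(own_text_pieces(SNIPPET))
```

Real pages are rarely well-formed XML, which is why the answer parses with lxml.html. The sketch only shows the traversal.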
