Python + web scraping + Scrapy: how do I get the links to all movies from an IMDb page?
I have to scrape all the movies from this IMDb page: https://www.imdb.com/list/ls055386972/
My approach is to first scrape the href value of every <a href="/title/tt0068646/?ref_=ttls_li_tt"> element, i.e., extract the /title/tt0068646/?ref_=ttls_li_tt portion, and then prepend 'https://www.imdb.com' to build the complete URL of the movie, i.e., https://www.imdb.com/title/tt0068646/?ref_=ttls_li_tt. But whenever I run response.xpath('//h3[@class]/a[@href]').extract(), it returns the whole <a> element, including the movie title, instead of just the link:
[u'<a href="/title/tt0068646/?ref_=ttls_li_tt">The Godfather</a>', u'<a href="/title/tt0108052/?ref_=ttls_li_tt">Schindler\'s List</a>', ...]
I want only the "/title/tt0068646/?ref_=ttls_li_tt" portion.
How should I proceed?
import requests
from bs4 import BeautifulSoup

# Download the list page and parse the HTML.
page = requests.get("https://www.imdb.com/list/ls055386972/")
soup = BeautifulSoup(page.content, 'html.parser')

# Each movie title sits in an <h3 class="lister-item-header"> whose <a> holds the link.
movies = soup.find_all('h3', attrs={'class': 'lister-item-header'})
for movie in movies:
    print(movie.a['href'])
Output:
/title/tt0068646/?ref_=ttls_li_tt
/title/tt0108052/?ref_=ttls_li_tt
/title/tt0050083/?ref_=ttls_li_tt
/title/tt0118799/?ref_=ttls_li_tt
...
/title/tt0088763/?ref_=ttls_li_tt
/title/tt0266543/?ref_=ttls_li_tt
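To get the complete movie URLs mentioned in the question, the relative links still need the site prefix. A minimal sketch using the standard library's urljoin follows; the BASE_URL constant and the to_absolute helper are names made up for illustration, and the sample list just reuses the first two links from the output above.

```python
from urllib.parse import urljoin

# Assumed site root, taken from the URL in the question.
BASE_URL = "https://www.imdb.com"


def to_absolute(hrefs, base=BASE_URL):
    """Join each relative href with the base URL."""
    return [urljoin(base, href) for href in hrefs]


# First two relative links from the output above.
sample = [
    "/title/tt0068646/?ref_=ttls_li_tt",
    "/title/tt0108052/?ref_=ttls_li_tt",
]
full_urls = to_absolute(sample)
for url in full_urls:
    print(url)
```

urljoin is used instead of plain string concatenation so that a missing or doubled slash between the two parts does not produce a broken URL.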