Python + web scraping + Scrapy: how do I get the links to all the movies from an IMDb page?

Problem description:

I have to scrape all the movies from this IMDb page: https://www.imdb.com/list/ls055386972/

My approach is to first scrape the href value of each link such as <a href="/title/tt0068646/?ref_=ttls_li_tt">, i.e., extract the /title/tt0068646/?ref_=ttls_li_tt portion, and then prepend 'https://www.imdb.com' to build the complete movie URL, i.e., https://www.imdb.com/title/tt0068646/?ref_=ttls_li_tt. But whenever I run response.xpath('//h3[@class]/a[@href]').extract(), it returns the whole element, including the movie title: [u'<a href="/title/tt0068646/?ref_=ttls_li_tt">The Godfather</a>', u'<a href="/title/tt0108052/?ref_=ttls_li_tt">Schindler\'s List</a>......]. I want only the "/title/tt0068646/?ref_=ttls_li_tt" portion.

How should I proceed?

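# Alternative to Scrapy: fetch the page with requests and parse it with BeautifulSoup.
import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.imdb.com/list/ls055386972/")
soup = BeautifulSoup(page.content, 'html.parser')

# Each movie title sits in an <h3 class="lister-item-header">; its first <a> holds the link.
movies = soup.find_all('h3', attrs={'class': 'lister-item-header'})
for movie in movies:
    # Print only the href attribute, not the whole <a> element.
    print(movie.a['href'])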

Output:

/title/tt0068646/?ref_=ttls_li_tt
/title/tt0108052/?ref_=ttls_li_tt
/title/tt0050083/?ref_=ttls_li_tt
/title/tt0118799/?ref_=ttls_li_tt
.
.
.
.
/title/tt0088763/?ref_=ttls_li_tt
/title/tt0266543/?ref_=ttls_li_tt
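
The links printed above are relative. To get the complete movie URLs the question asks for, they can be joined with the site root. Here is a minimal sketch using the standard library's urllib.parse.urljoin. BASE_URL and relative_links are illustrative names, and the two links are copied from the output above.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.imdb.com"

# Two of the relative links from the output above.
relative_links = [
    "/title/tt0068646/?ref_=ttls_li_tt",
    "/title/tt0108052/?ref_=ttls_li_tt",
]

# urljoin resolves each relative path against the site root.
full_urls = [urljoin(BASE_URL, link) for link in relative_links]

for url in full_urls:
    print(url)
```

In the BeautifulSoup loop, the same call could be applied to movie.a['href'] before printing.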