How to automate Amazon AWS EC2 instances for scraping

Question:

Hi, I'd like to set up multiple Amazon EC2 instances to scrape data from arbitrary sites. The way I imagine it, one Amazon instance acts as a master that programmatically sets up other instances to do the scraping. Right now I have PHP scripts that scrape the way I want, but how can I set up my master server to...

1) create other EC2 instances

2) communicate between the master server and the slave servers

You could build this yourself: have your master launch worker instances when needed, send them scrape requests, and terminate them when they are no longer needed. You would then have to code all of the orchestration yourself and also try to make it highly available. That's not a good way to do this. Instead, you should take advantage of AWS features.

You could use a combination of SQS and Auto Scaling Groups. Your master instance would add scrape requests to an SQS queue, and an Auto Scaling Group triggered on SQS queue depth would launch new worker instances. This automates launching workers (scrapers) when the workload is high and terminating them when the workload is low. Each worker instance would pull a scrape request from the SQS queue, do the scraping work, and then repeat.
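As a rough illustration of that pull-scrape-repeat loop, here is a minimal Python sketch. It assumes an SQS client object with `send_message`, `receive_message`, and `delete_message` methods called the way boto3's SQS client is called. The `InMemoryQueue` class, the `{"url": ...}` message format, the `max_polls` parameter, and the function names are all hypothetical, there only so the sketch runs without AWS credentials. The Auto Scaling side (scaling on queue depth) is configuration rather than code and is not shown.

```python
import json


class InMemoryQueue:
    """Hypothetical stand-in exposing the three SQS client calls used below."""

    def __init__(self):
        self._messages = []
        self._next_id = 0

    def send_message(self, QueueUrl, MessageBody):
        self._next_id += 1
        handle = str(self._next_id)
        self._messages.append({"ReceiptHandle": handle, "Body": MessageBody})
        return {"MessageId": handle}

    def receive_message(self, QueueUrl, MaxNumberOfMessages=1, WaitTimeSeconds=0):
        batch = self._messages[:MaxNumberOfMessages]
        # Like SQS, omit the "Messages" key when nothing is waiting.
        return {"Messages": batch} if batch else {}

    def delete_message(self, QueueUrl, ReceiptHandle):
        self._messages = [
            m for m in self._messages if m["ReceiptHandle"] != ReceiptHandle
        ]
        return {}


def enqueue_scrape_requests(sqs, queue_url, urls):
    """Master side: put one scrape request per URL on the queue."""
    for url in urls:
        sqs.send_message(QueueUrl=queue_url, MessageBody=json.dumps({"url": url}))


def run_worker(sqs, queue_url, scrape, max_polls=None):
    """Worker side: pull a request, scrape it, delete the message, repeat."""
    results = []
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        response = sqs.receive_message(
            QueueUrl=queue_url, MaxNumberOfMessages=1, WaitTimeSeconds=20
        )
        for message in response.get("Messages", []):
            request = json.loads(message["Body"])
            results.append(scrape(request["url"]))
            # Delete only after the scrape succeeded, so a message held by a
            # crashed worker becomes visible to other workers again.
            sqs.delete_message(
                QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"]
            )
    return results


if __name__ == "__main__":
    queue = InMemoryQueue()
    enqueue_scrape_requests(queue, "demo-queue", ["http://example.com/a"])
    print(run_worker(queue, "demo-queue", scrape=len, max_polls=2))
```

On a real worker you would pass `boto3.client("sqs")` and your actual queue URL instead of the stand-in, and leave `max_polls` unset so the loop keeps running until the Auto Scaling Group terminates the instance.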

Another way to do this would be to use AWS Lambda. You can trigger Lambda functions from SQS or from SNS. Have your master add scrape requests to an SQS queue, or publish them to an SNS topic, and then drive a web-scraper Lambda function (written in JavaScript) from that queue or topic.
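To give a feel for what such a function might look like, here is a small sketch. It is written in Python rather than JavaScript (Lambda supports both runtimes), and it assumes each request is a JSON document of the form `{"url": ...}`, which is my own convention and not something AWS prescribes. SQS-triggered events carry the payload in `Records[i]["body"]`, while SNS-triggered events carry it in `Records[i]["Sns"]["Message"]`. The `fetch` helper, the `fetch_page` parameter, and the returned size map are hypothetical placeholders for the real scraping logic.

```python
import json
import urllib.request


def fetch(url, timeout=10):
    """Download a page body; a placeholder for the real scraping logic."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def extract_requests(event):
    """Return the scrape requests carried by an SQS or SNS Lambda event."""
    requests = []
    for record in event.get("Records", []):
        if "Sns" in record:
            payload = record["Sns"]["Message"]  # SNS-triggered invocation
        else:
            payload = record["body"]  # SQS-triggered invocation
        requests.append(json.loads(payload))
    return requests


def handler(event, context, fetch_page=fetch):
    """Lambda entry point: fetch every URL in the event, report page sizes."""
    sizes = {}
    for request in extract_requests(event):
        sizes[request["url"]] = len(fetch_page(request["url"]))
    return sizes
```

Lambda only ever passes `event` and `context`; the extra `fetch_page` argument exists so the handler can be exercised locally with a fake fetcher, without touching the network.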

Personally, I would investigate the Lambda route first.