Using NLTK corpora with AWS Lambda functions in Python
I'm running into a problem using NLTK corpora (stop words in particular) in AWS Lambda. I know the corpora need to be downloaded, so I ran nltk.download('stopwords') and included the result in the zip file used to upload the Lambda modules, under nltk_data/corpora/stopwords.
The usage in the code is as follows:
from nltk.corpus import stopwords
stopwords = stopwords.words('english')
nltk.data.path.append("/nltk_data")
This returns the following error in the Lambda log output:
module initialization error:
**********************************************************************
Resource u'corpora/stopwords' not found. Please use the NLTK
Downloader to obtain the resource: >>> nltk.download()
Searched in:
- '/home/sbx_user1062/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- '/nltk_data'
**********************************************************************
I have also tried to load the data directly by including
nltk.data.load("/nltk_data/corpora/stopwords/english")
which produces a different error:
module initialization error: Could not determine format for file:///stopwords/english based on its file
extension; use the "format" argument to specify the format explicitly.
It's possible that Lambda has a problem loading the data from the zip and needs it stored externally, say on S3, but that seems a bit strange.
Does anyone know where I could be going wrong?
I had the same problem before, and I solved it with an environment variable.
- Run "nltk.download()" and copy the downloaded data to the root folder of your AWS Lambda application. (The folder should be called "nltk_data".)
- In the user interface of your Lambda function (in the AWS console), add the environment variable "NLTK_DATA" = "./nltk_data". Please see the image.
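If you would rather not rely on the console setting, the same idea can probably be expressed in code. The sketch below is only an illustration under some assumptions: the "nltk_data" folder sits at the root of the deployment package, the helper names (bundled_nltk_data_dir, configure_nltk_data, handler) are made up for this example, and the code falls back to the current working directory when the LAMBDA_TASK_ROOT variable is not set. The likely reason the path "/nltk_data" from the question fails is that it points at the filesystem root, while Lambda unpacks the zip elsewhere (normally /var/task).

```python
import os


def bundled_nltk_data_dir(base_dir=None):
    """Return the path of the nltk_data folder shipped inside the package.

    Hypothetical helper: base_dir defaults to LAMBDA_TASK_ROOT (the directory
    Lambda unpacks the code into), falling back to the working directory.
    """
    base = base_dir or os.environ.get("LAMBDA_TASK_ROOT") or os.getcwd()
    return os.path.join(base, "nltk_data")


def configure_nltk_data(base_dir=None):
    """Point NLTK at the bundled data before any corpus is accessed."""
    data_dir = bundled_nltk_data_dir(base_dir)
    # NLTK reads NLTK_DATA when it builds its search path at import time.
    os.environ["NLTK_DATA"] = data_dir
    return data_dir


def handler(event, context):
    data_dir = configure_nltk_data()

    # Import nltk only after the environment variable is set.
    import nltk
    if data_dir not in nltk.data.path:
        # Covers the case where nltk was already imported earlier.
        nltk.data.path.append(data_dir)

    from nltk.corpus import stopwords
    english_stopwords = stopwords.words("english")
    return {"stopword_count": len(english_stopwords)}
```

The important part is the ordering: the search path has to be configured before stopwords.words('english') is called, otherwise NLTK only looks in its default locations.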