Get file names inside a zip archive on an FTP server without downloading the whole archive

Question:

I have a lot of zip archives on a remote FTP server, with sizes up to 20 TB. I just need the file names inside those zip archives, so that I can plug them into my Python scripts.

Is there any way to get just the file names, without actually downloading the files and extracting them on my local machine? If so, can someone point me to the right library/package?

You can implement a file-like object that reads data from the FTP server instead of from a local file, and pass it to the ZipFile constructor in place of a (local) file name.

A trivial implementation can look like this:

from ftplib import FTP
try:
    from ssl import SSLSocket
except ImportError:
    SSLSocket = None

class FtpFile:

    def __init__(self, ftp, name):
        self.ftp = ftp
        self.name = name
        self.size = ftp.size(name)
        self.pos = 0

    def seek(self, offset, whence=0):
        if whence == 0:
            self.pos = offset
        elif whence == 1:
            self.pos += offset
        elif whence == 2:
            self.pos = self.size + offset
        return self.pos

    def tell(self):
        return self.pos

    def read(self, size=None):
        if size is None:
            size = self.size - self.pos
        data = b""

        # based on FTP.retrbinary 
        # (but allows stopping after certain number of bytes read)
        self.ftp.voidcmd('TYPE I')
        cmd = "RETR {}".format(self.name)
        conn = self.ftp.transfercmd(cmd, self.pos)
        try:
            while len(data) < size:
                buf = conn.recv(min(size - len(data), 8192))
                if not buf:
                    break
                data += buf
            # shutdown ssl layer (can be removed if not using TLS/SSL)
            if SSLSocket is not None and isinstance(conn, SSLSocket):
                conn.unwrap()
        finally:
            conn.close()
        try:
            self.ftp.voidresp()
        except Exception:
            pass
        self.pos += len(data)
        return data

And then you can use it like this:

import zipfile

ftp = FTP(host, user, passwd)
ftp.cwd(path)

ftpfile = FtpFile(ftp, "archive.zip")
zip = zipfile.ZipFile(ftpfile)
print(zip.namelist())


The above implementation is rather trivial and inefficient. It starts numerous (three at minimum) downloads of small chunks of data just to retrieve the list of contained files. It can be optimized by reading and caching larger chunks. But it should give you the idea.

In particular, you can make use of the fact that you are going to read the listing only. The listing is located at the end of a ZIP archive. So you can just download the last (approximately) 10 KB worth of data at the start, and you will be able to fulfill all read calls out of that cache.

Knowing that, you can actually do a small hack. As the listing is at the end of the archive, you can download only the end of the archive. While the downloaded ZIP will be broken, it can still be listed. This way, you won't need the FtpFile class at all. You can even download the listing to memory (BytesIO).

from io import BytesIO
import zipfile

zipstring = BytesIO()
name = "archive.zip"
size = ftp.size(name)
ftp.retrbinary("RETR " + name, zipstring.write, rest=size - 10*1024)

zip = zipfile.ZipFile(zipstring)

print(zip.namelist())

If you get a BadZipFile exception because 10 KB is too small to contain the whole listing, you can retry the code with a larger chunk.