subprocess.Popen stdin read file

Question:

I'm trying to call a process on a file after part of it has been read. For example:

import subprocess

with open('in.txt', 'r') as a, open('out.txt', 'w') as b:
    header = a.readline()  # buffered read: pulls more than one line from the descriptor
    subprocess.call(['sort'], stdin=a, stdout=b)

This works fine if I don't read anything from a before calling subprocess.call, but if I read anything from it first, the subprocess doesn't see any input. This is using Python 2.7.3. I can't find anything in the documentation that explains this behaviour, and a (very) brief glance at the subprocess source didn't reveal a cause.

Answer:

If you open the file unbuffered then it works:

import subprocess

with open('in.txt', 'rb', 0) as a, open('out.txt', 'w') as b:
    header = a.readline()
    rc = subprocess.call(['sort'], stdin=a, stdout=b)

The subprocess module works at the file descriptor level (the low-level, unbuffered I/O of the operating system). It can work with os.pipe(), socket.socket(), pty.openpty(), or anything else with a valid .fileno() method, if the OS supports it.
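To illustrate the file descriptor point, here is a minimal sketch that hands a raw os.pipe() descriptor to a child process as its stdin. It assumes Python 3; the helper name upper_via_pipe and the small child script run through sys.executable are made up for this example and are not part of the original answer.

```python
import os
import subprocess
import sys


def upper_via_pipe(data):
    """Feed `data` to a child process through a raw os.pipe() descriptor."""
    read_fd, write_fd = os.pipe()
    # Hypothetical child: upper-cases whatever arrives on its stdin.
    child = [sys.executable, "-c",
             "import sys; sys.stdout.write(sys.stdin.read().upper())"]
    # stdin is a plain integer descriptor; no Python file object is involved.
    proc = subprocess.Popen(child, stdin=read_fd, stdout=subprocess.PIPE)
    os.close(read_fd)          # the parent keeps only the write end
    os.write(write_fd, data)   # small payload, fits in the pipe buffer
    os.close(write_fd)         # signal end-of-file to the child
    out, _ = proc.communicate()
    return out


result = upper_via_pipe(b"hello\n")
```

Because no buffered Python object sits between the parent and the descriptor, nothing is read ahead and the child sees every byte that was written.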

Mixing buffered and unbuffered I/O on the same file is not recommended.

On Python 2, calling file.flush() causes the output to appear, e.g.:

import subprocess
# 2nd
with open(__file__) as file:
    header = file.readline()
    file.seek(file.tell()) # synchronize (for io.open and Python 3)
    file.flush()           # synchronize (for C stdio-based file on Python 2)
    rc = subprocess.call(['cat'], stdin=file)

The issue can be reproduced without the subprocess module, using os.read():

#!/usr/bin/env python
# 2nd
import os

with open(__file__) as file: #XXX fully buffered text file EATS INPUT
    file.readline() # ignore header line
    os.write(1, os.read(file.fileno(), 1<<20))

If the buffer size is small then the rest of the file is printed:

#!/usr/bin/env python
# 2nd
import os

bufsize = 2 #XXX MAY EAT INPUT
with open(__file__, 'rb', bufsize) as file:
    file.readline() # ignore header line
    os.write(2, os.read(file.fileno(), 1<<20))

It eats more input if the first line size is not evenly divisible by bufsize.

The default bufsize and bufsize=1 (line-buffered) behave similarly on my machine: the beginning of the file vanishes (around 4 KB).

For all buffer sizes, file.tell() reports the position at the beginning of the 2nd line. Using next(file) instead of file.readline() makes file.tell() return around 5K on my machine on Python 2, due to the read-ahead buffer bug (io.open() gives the expected 2nd-line position).
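The gap between what file.tell() reports and where the descriptor really is can be observed directly. The following is a sketch for Python 3 only (the Python 2 stdio-based file object may behave differently); the helper name offsets_after_readline and the temporary demo file are made up for illustration.

```python
import os
import tempfile


def offsets_after_readline(path):
    """Return (first line, logical position, descriptor offset) after one readline()."""
    with open(path, "rb") as f:  # default buffering
        first = f.readline()
        logical = f.tell()  # position as seen by the buffered file object
        # Query the descriptor's own offset without moving it.
        actual = os.lseek(f.fileno(), 0, os.SEEK_CUR)
    return first, logical, actual


# Hypothetical demo file: a short header followed by a few data lines.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")
    with open(path, "wb") as f:
        f.write(b"header\n" + b"data line\n" * 3)
    first, logical, actual = offsets_after_readline(path)
```

Here logical points at the start of the 2nd line, while actual is further along because the buffered reader has already pulled more data from the descriptor. Anything that reads from the descriptor directly, such as a child process or os.read(), starts at actual.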

Trying file.seek(file.tell()) before the subprocess call doesn't help on Python 2 with the default stdio-based file objects. It works with the open() functions from the io and _pyio modules on Python 2, and with the default open() (also io-based) on Python 3.

Trying the io and _pyio modules on Python 2 and Python 3, with and without file.flush(), produces various results. This confirms that mixing buffered and unbuffered I/O on the same file descriptor is not a good idea.
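If the header has to be read through a buffered file object anyway, one possible workaround (not from the original answer) is to move the descriptor's offset explicitly with os.lseek() before starting the child. This is a sketch assuming Python 3 and a seekable regular file; run_on_rest and the echo child command are hypothetical names used only for this example.

```python
import os
import subprocess
import sys
import tempfile


def run_on_rest(path, argv):
    """Read the first line in Python, then give the rest of the file to a child."""
    with open(path, "rb") as f:
        header = f.readline()
        # f.tell() is the logical position (start of the 2nd line); the
        # descriptor itself is further ahead because of read-ahead buffering,
        # so rewind it explicitly. The file object is not read again afterwards.
        os.lseek(f.fileno(), f.tell(), os.SEEK_SET)
        out = subprocess.check_output(argv, stdin=f)
    return header, out


# Hypothetical demo: the child simply copies its stdin to its stdout.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "in.txt")
    with open(path, "wb") as f:
        f.write(b"header\nb\na\n")
    echo = [sys.executable, "-c",
            "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())"]
    header, rest = run_on_rest(path, echo)
```

With the question's data, argv would be ['sort']. Opening the file unbuffered, as shown earlier, remains the simpler option because it avoids the mismatch instead of repairing it.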