Load a CSV column into a numpy memmap (fast)
I have a csv file with two columns, holding measurements from an oscilloscope:
Model,MSO4034
Firmware Version,2.48
# ... (15 lines of header) ...
-5.0000000e-02,-0.0088
-4.9999990e-02,0.0116
-4.9999980e-02,0.006
-4.9999970e-02,-0.0028
-4.9999960e-02,-0.002
-4.9999950e-02,-0.0028
-4.9999940e-02,0.0092
-4.9999930e-02,-0.0072
-4.9999920e-02,-0.0008
-4.9999910e-02,-0.0056
This data I'd like to load into a numpy array. I could use np.loadtxt:
np.loadtxt('data.csv', delimiter=',', skiprows=15, usecols=[1])
However, my data file is huge (100 MSamples), which would take numpy over half an hour to load and parse: at 21.5 ms per 1000 lines, 100e6 lines work out to roughly 2150 s, i.e. about 36 minutes.
My preferred approach would be to directly create a memory-map file for numpy, which just consists of the binary values concatenated into a single file. It basically is the array in memory, except that it lives on disk rather than in RAM.
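For illustration, a minimal sketch of what I mean (the file name data.bin and the float64 dtype are my own choices here): the values are written to disk as raw bytes once, and np.memmap then exposes the file as an array without loading it into RAM:

import numpy as np

# write a few float64 values as raw bytes (hypothetical file name data.bin)
values = np.array([-0.0088, 0.0116, 0.006], dtype=np.float64)
values.tofile('data.bin')

# map the file; elements are only read from disk when accessed
arr = np.memmap('data.bin', dtype=np.float64, mode='r')
print(arr[:3])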
Is there any convenient way of doing this? Using Linux, I could tail away the header and cut out the second column, but I'd still need to parse the values' string representation before writing them into a binary file on disk:
$ tail -n +16 data.csv | cut -d',' -f2
-0.0088
0.0116
0.006
-0.0028
-0.002
-0.0028
0.0092
-0.0072
-0.0008
-0.0056
Is there any Linux command for parsing the string representation of floats and writing them to disk?
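The best I've come up with so far is piping that output into a short Python helper (a sketch; the script name to_binary.py and the float64 dtype are my own choices), which still pays the Python parsing cost:

$ tail -n +16 data.csv | cut -d',' -f2 | python to_binary.py data.bin

# to_binary.py (hypothetical helper): read float strings from stdin,
# write their float64 bytes to the file given as the first argument.
import sys
import numpy as np

with open(sys.argv[1], 'wb') as out:
    while True:
        # read roughly 1 MB worth of text lines so memory use stays bounded
        lines = sys.stdin.readlines(1000000)
        if not lines:
            break
        # parse the decimal strings to float64 and append the raw bytes
        out.write(np.asarray(lines, dtype=np.float64).tobytes())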
I'd also recommend using Pandas' CSV parser, but instead of reading the whole file into memory in one go I would iterate over it in chunks and write these to a memory-mapped array on the fly:
import numpy as np
from numpy.lib.format import open_memmap
import pandas as pd

# make some test data
data = np.random.randn(100000, 2)
np.savetxt('/tmp/data.csv', data, delimiter=',', header='foo,bar')

# we need to specify the shape and dtype in advance, but it would be cheap to
# allocate an array with more rows than required since memmap files are sparse.
mmap = open_memmap('/tmp/arr.npy', mode='w+', dtype=np.double, shape=(100000, 2))

# parse at most 10000 rows at a time, write them to the memmapped array
n = 0
for chunk in pd.read_csv('/tmp/data.csv', chunksize=10000):
    mmap[n:n + chunk.shape[0]] = chunk.values
    n += chunk.shape[0]

print(np.allclose(data, mmap))
# True
You can adjust the chunk size according to how much of the file you can fit in memory at a time. Bear in mind that you'll need to hold the raw text as well as the converted values in memory while you parse a chunk.
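For the file in the question specifically, something along these lines ought to work (a sketch: the total sample count n_samples is assumed, and skiprows/usecols/header are set so that only the second column gets parsed):

import numpy as np
from numpy.lib.format import open_memmap
import pandas as pd

n_samples = 100000000  # assumed total number of samples (100 MSamples)

# 1-D memmapped output array holding only the second column
out = open_memmap('scope.npy', mode='w+', dtype=np.double, shape=(n_samples,))

n = 0
reader = pd.read_csv('data.csv', skiprows=15, header=None, usecols=[1],
                     dtype=np.double, chunksize=100000)
for chunk in reader:
    vals = chunk.values.ravel()
    out[n:n + vals.shape[0]] = vals
    n += vals.shape[0]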