Reading a space-delimited file with missing values in Python/Pandas
I am trying to read a space-delimited file in Python using read_csv from pandas. It works when I specify delimiter=" ". The problem arises when a column has missing values: the blank where the value should be is treated as more delimiter, so the missing value is silently skipped.
Is there a way to resolve this problem?
1600 1141.0000 020006 600 1141.0000 69.0000 OAUC 0.0000
1 1070.5000 020032 1 1070.5000 400.0000 0.0000
You can see that the second row is missing a value in the column that holds OAUC in the first row. The uneven spacing between columns makes this harder. The columns are fixed-width, so I can tell that some value is missing, but I have not yet been able to work out which one.
I agree with Justin that cleaning the file up first is the best way to be sure of getting it right. If you can skim your results to verify them, then this hack might get the job done in this case.
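# 'data.txt' is a placeholder path; a regex separator requires the Python parsing engine
pd.read_csv('data.txt', header=None, sep=r'\s{1,7}', engine='python')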
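To see why the hack can work, and where it breaks, here is a small sketch using re.split with the intended pattern \s{1,7}. The rows and gap widths are made up for illustration: a run of up to 7 spaces counts as one separator, so a longer run splits into several separators with empty fields in between.

```python
import re

SEP = re.compile(r"\s{1,7}")  # at most 7 whitespace characters per separator

# Made-up rows for illustration; the gap widths are assumptions, not the real file
full_row = "1600 1141.0000 020006 600 1141.0000 69.0000 OAUC 0.0000"
gap_row = "1 1070.5000 020032 1 1070.5000 400.0000" + " " * 8 + "0.0000"
wide_row = "1 1070.5000 020032 1 1070.5000 400.0000" + " " * 15 + "0.0000"

fields_full = SEP.split(full_row)
fields_gap = SEP.split(gap_row)    # 8 spaces -> 7 + 1 -> one empty field
fields_wide = SEP.split(wide_row)  # 15 spaces -> 7 + 7 + 1 -> two empty fields

print(len(fields_full), len(fields_gap), len(fields_wide))  # 8 8 9
print(fields_gap[6] == "")  # True: the missing value shows up as an empty string
```

The 8-space gap happens to produce exactly one empty field, but the 15-space gap produces one too many, and a gap of 7 or fewer spaces would swallow the missing value entirely. Whether the hack gives the right answer depends on the actual spacing in the file.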
I'll say again: this is not a great idea. If you just want to get a smallish data set loaded, it will do the job. But if you can't verify that it worked, you are better off using read_fwf and carefully specifying colspecs, or following Justin's advice and cleaning up the file.
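For the read_fwf route, a minimal sketch might look like the following. The column widths and offsets are assumptions, since the exact layout of the real file isn't visible in the question; you would need to measure the real character positions and adjust colspecs to match.

```python
import io
import pandas as pd

# Hypothetical layout: fields of these widths, separated by two spaces.
# The real file's column positions are unknown here, so treat these as placeholders.
ROW = "{:>4}  {:>9}  {:>6}  {:>4}  {:>9}  {:>8}  {:<4}  {:>6}"
rows = [
    ("1600", "1141.0000", "020006", "600", "1141.0000", "69.0000", "OAUC", "0.0000"),
    ("1", "1070.5000", "020032", "1", "1070.5000", "400.0000", "", "0.0000"),
]
text = "\n".join(ROW.format(*r) for r in rows) + "\n"

# Half-open (start, end) character offsets of each column in the assumed layout
colspecs = [(0, 4), (6, 15), (17, 23), (25, 29), (31, 40), (42, 50), (52, 56), (58, 64)]

# dtype={2: str} keeps the leading zero in values such as 020006
df = pd.read_fwf(io.StringIO(text), colspecs=colspecs, header=None, dtype={2: str})
print(df)
```

Because each column is read from fixed character positions, the blank in the second row comes back as NaN in column 6 instead of shifting the later values to the left.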