如何使用大 pandas 分割键值,从文本文件中获取值?
问题描述:
我有这样的输入文本文件:
I'm having input text file like this :
Input.txt -
1=88|2=1438|3=KKK|4=7.7|5=00|7=66|8=a
1=13|2=1388|3=DDD|4=157.73|6=00|7=08|8=b|9=k
我想拆分这个键和值对,并以下列格式显示:
I want to split this key and value pairs and showing in the format like this :
Output.txt -
index[0]
1 88
2 1438
3 kkk
4 7.7
5 00
6
7 66
8 a
9
index[1]
1 13
2 1438
3 DDD
4 157.73
5
6 00
7 08
8 b
9 k
查看索引[0] 6和9条记录的值为空,因为6在另一列可用,但不在此。像这样在索引[1]第5条记录是空白。
see In the index[0] 6 and 9 th record's values are Blank because 6 is available in the other column but not in this. Like this in index[1] 5th record is Blank.
程序代码
df = pd.read_csv(inputfile, index_col=None, names=['text'])
#spliting two times with respect to (= & |) and saving into stack
s = df.text.str.split('|', expand=True).stack().str.split('=', expand=True)
#giving index's as empty string ('') i.e. for removing
s.columns = ['idx','']
#rename_axis(None) for excluding index values
dfs = [g.set_index('idx').rename_axis(None) for i, g in s.groupby(level=0)]
#length for iterating through list
dfs_length = len(dfs)
#opening output file
with open(outputfile + 'output.txt','w') as file_obj:
i = 0
while i < dfs_length:
#index of each column
s = '\nindex[%d]\n'%i
#writing index to file
file_obj.write(str(s))
#print '\nindex[%d]'%i
#print dfs[i]
#wriring actual contents to file
file_obj.write(str(dfs[i])+'\n')
i = i + 1
得到这个输出:
output.txt -
index[0]
1 88
2 1438
3 kkk
4 7.7
5 00
7 66
8 a
index[1]
1 13
2 1438
3 DDD
4 157.73
6 00
7 08
8 b
9 k
我只收到可用的记录在输入文本文件中。如何将记录值保留为空白?
I'm getting only that records which are available in the input text files. How can I keep record value as a Blank?
答
可以使用 .str.extract ()
函数与生成的RegEx结合:
you can do it using .str.extract()
function in conjunction with a generated RegEx:
pat = r'(?:1=)?(?P<a1>[^\|]*)?'
# you may want to adjust the right bound of the range interval
for i in range(2, 12):
pat += r'(?:\|{0}=)?(?P<a{0}>[^\|]*)?'.format(i)
new = df.val.str.extract(pat, expand=True)
测试:
In [178]: df
Out[178]:
val
0 1=88|2=1438|3=KKK|4=7.7|5=00|7=66|8=a
1 1=13|2=1388|3=DDD|4=157.73|6=00|7=08|8=b|9=k
2 1=11|3=33|5=55
In [179]: new
Out[179]:
a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11
0 88 1438 KKK 7.7 00 66 a
1 13 1388 DDD 157.73 00 08 b k
2 11 33 55