Python/Pandas: Drop rows from a data frame when a string matches an entry in a list
I have a .csv file of contact information that I import as a pandas data frame.
>>> import pandas as pd
>>>
>>> df = pd.read_csv('data.csv')
>>> df.head()
   fName   lName                    email   title
0   John   Smith         jsmith@gmail.com     CEO
1    Joe   Schmo      jschmo@business.com  Bagger
2   Some  Person  some.person@hotmail.com   Clerk
After importing the data, I'd like to drop rows where one field contains one of several substrings in a list. For example:
to_drop = ['Clerk', 'Bagger']
for i in range(len(df)):
    for k in range(len(to_drop)):
        if to_drop[k] in df.title[i]:
            # some code to drop the rows from the data frame
df.to_csv("results.csv")
What is the preferred way to do this in Pandas? Should this even be a post-processing step, or is it better to filter the rows out before they are written to the data frame in the first place? My thought was that the data would be easier to manipulate once it is in a data frame object.
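Use isin and pass it your list of terms to search for. You can then negate the boolean mask using ~, which filters out those rows. Note that isin compares whole cell values, so it only drops rows whose title is exactly equal to one of the terms: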
In [6]:
to_drop = ['Clerk', 'Bagger']
df[~df['title'].isin(to_drop)]
Out[6]:
  fName  lName             email title
0  John  Smith  jsmith@gmail.com   CEO
Another method is to join the terms with | so that they become a single regex, and use the vectorised str.contains, which matches substrings rather than whole values:
In [8]:
df[~df['title'].str.contains('|'.join(to_drop))]
Out[8]:
  fName  lName             email title
0  John  Smith  jsmith@gmail.com   CEO
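One caveat with the joined-regex approach: the terms are interpreted as regular expressions, and missing titles produce NaN in the mask. A minimal sketch of a more defensive variant follows, assuming the terms are meant as literal substrings. The sample frame here is made up for illustration and is not the data from the question.

```python
import re

import pandas as pd

# Hypothetical sample data mirroring the question's columns.
df = pd.DataFrame({
    "fName": ["John", "Joe", "Some", "Ann"],
    "title": ["CEO", "Bagger", "Senior Clerk", None],
})

to_drop = ["Clerk", "Bagger"]

# Escape each term so characters such as '.' or '+' are matched literally.
pattern = "|".join(re.escape(term) for term in to_drop)

# na=False treats a missing title as "no match", so that row is kept.
mask = df["title"].str.contains(pattern, na=False)
filtered = df[~mask]
```

Here "Senior Clerk" is dropped because str.contains matches substrings, whereas isin would have kept it.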
IMO it will be easier, and probably faster, to perform the filtering as a post-processing step, because if you decide to filter while reading then you are iteratively growing the dataframe, which is not efficient.
Alternatively, you can read the csv in chunks, filter out the rows you don't want, and append each chunk to your output csv.
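The chunked approach could look roughly like the sketch below. The function name drop_matching_rows and the default chunk size are made up for illustration, and the terms are assumed to be literal substrings. It relies on the chunksize parameter of read_csv and on to_csv with mode="a" for appending.

```python
import re

import pandas as pd


def drop_matching_rows(src, dst, column, terms, chunksize=10_000):
    """Copy src to dst, skipping rows whose `column` contains any term.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    # Build one regex from the literal terms.
    pattern = "|".join(re.escape(term) for term in terms)
    first = True
    # Read the filtered column as strings so .str works on every chunk.
    reader = pd.read_csv(src, chunksize=chunksize, dtype={column: str})
    for chunk in reader:
        keep = chunk[~chunk[column].str.contains(pattern, na=False)]
        # Overwrite and write the header once, then append without it.
        keep.to_csv(dst, mode="w" if first else "a", header=first, index=False)
        first = False
```

With the file names from the question, a call would be drop_matching_rows("data.csv", "results.csv", "title", ["Clerk", "Bagger"]). Only one chunk is held in memory at a time, which matters mainly when the input is too large to load at once.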