How to remove rows from a csv file based on a list of values in another file?
I have two files:

candidates.csv:

id,value
1,123
4,1
2,5
50,5

blacklist.csv:

1
2
5
3
10
I'd like to remove all rows from candidates.csv in which the first column (id) has a value contained in blacklist.csv. id is always numeric. In this case I'd like my output to look like this:
id,value
4,1
50,5
So far, my script for identifying the duplicate lines looks like this:
cat candidates.csv | cut -d \, -f 1 | grep -f blacklist.csv -w
This gives me the output:
1
2
Now I somehow need to pipe this information back into sed/awk/gawk/... to delete the duplicates, but I don't know how. Any ideas how I can continue from here? Or is there a better solution altogether? My only restriction is that it has to run in bash.
What about:
awk -F, '(NR==FNR){a[$1];next}!($1 in a)' blacklist.csv candidates.csv
How does this work?
An awk program is a series of pattern-action pairs, written as:
condition { action }
condition { action }
...
where condition is typically an expression and action a series of commands. Here, the first condition-action pairs read:
- (NR==FNR){a[$1];next} if the total record count NR equals the record count of the current file FNR (i.e. if we are reading the first file), store all values in array a and skip to the next record (do not do anything else)
- !($1 in a) if the first field is not in the array a, then perform the default action, which is to print the line. This only takes effect on the second file, as the condition of the first condition-action pair does not hold there.
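Putting it together on the sample data from the question, as a quick sanity check (the two files are recreated here so the snippet is self-contained):

```shell
#!/usr/bin/env bash
# Recreate the sample files from the question.
printf 'id,value\n1,123\n4,1\n2,5\n50,5\n' > candidates.csv
printf '1\n2\n5\n3\n10\n' > blacklist.csv

# First pass (NR==FNR) loads blacklist ids into array a; second pass
# prints only candidate rows whose first field is not in that array.
awk -F, '(NR==FNR){a[$1];next}!($1 in a)' blacklist.csv candidates.csv
# Prints:
# id,value
# 4,1
# 50,5
```

Note that the header line is kept automatically, since the string "id" never appears in the blacklist.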