如何在powershell中合并两个具有相同标题的csv文件并丢弃重复的行

问题描述：

我正在从 NetApp Performance Manager 软件 (OPM) 收集性能计数器.OPM 在 MySQL 数据库中保存了 30 天的数据.所以我必须输入两个查询来检索数据:

I am collecting performance counters from NetApp Performance Manager software (OPM). OPM saves 30 days worth of data in MySQL database. So i have to put two queries to retrieve the data:

每月 30 日第一次查询并保存在 csv 文件中.
每月 1 号进行第二次查询并保存在 csv 文件中.

然后合并两个csv文件，如果一个月有31天就得到数据.

Then merge the two csv files to get data if there are 31 days in a month.

两个文件如下所示:

"Time","objid","cpuBusy","cifsOps","avgLatency"
"2016:06:04 00:04","72","50.6196","2069.11","7622.1"
"2016:06:04 00:09","72","30.2233","2075.94","7633.27"
"2016:06:04 00:14","72","35.2559","1980.64","8352.17"

当我使用以下代码合并两个 csv 文件时.我从相同的数据/时间得到重复的行.

When i merge the two csv files with below code. I get duplicate rows with data from same data/time.

@(Import-Csv au2004npsa003-mm-business.csv) + @(Import-Csv au2004npsa003-nn-business.csv) | export-csv joined.csv -NoTypeInformation

如何在不获取重复数据的情况下合并两个 csv 文件?我试过 select -unique 但是，它只给出一行.

How can i merge the two csv files without getting duplicate data? I have tried select -unique however, it gives just one row.

答

至于 为什么 Select-Object -Unique 不起作用:

As for why Select-Object -Unique didn't work:

Select-Object -Unique，当给定引用类型(字符串除外)的实例时，比较它们的.ToString() 值 以确定唯一性.

Select-Object -Unique, when given instances of reference types (other than strings), compares their .ToString() values in order to determine uniqueness.

[pscustomobject] 实例，例如 Import-Csv 创建的实例，遗憾的是 返回 空字符串来自他们的.ToString()方法.

[pscustomobject] instances, such as the ones Import-Csv creates, regrettably return the empty string from their .ToString() method.

因此，所有输入对象都比较相同，并且只会返回第一个输入对象.

Thus, all input objects compare the same, and only the first input object is ever returned.

S9uare 有用的Select-Object -Property * -Unique 方法通过强制所有属性单独比较，但附带一个性能警告:
输入对象被有效地重新创建，在这种情况下比较所有属性值是多余的，因为比较Time值就足够了；对于大型输入文件，处理可能需要很长时间.

S9uare's helpful Select-Object -Property * -Unique approach overcomes this problem by forcing all properties to be compared invidually, but comes with a performance caveat:
The input objects are effectively recreated, and comparing all property values is overkill in this case, because comparing Time values would suffice; with large input files, processing can take a long time.

由于手头的数据来自 CSV 文件，性能问题可以通过字符串处理来解决，使用 Get-Content 而不是导入-CSV:

Since the data at hand comes from CSV files, the performance problem can be helped with string processing, using Get-Content rather than Import-Csv:

Get-Content au2004npsa003-mm-business.csv, au2004npsa003-nn-business.csv |
  Select-Object -Unique |
    Set-Content -Encoding ASCII joined.csv

^{请注意，我使用 -Encoding ASCII 来模仿 Export-Csv 的默认行为；根据需要更改.}

^{Note that I'm using -Encoding ASCII to mimic Export-Csv's default behavior; change as needed.}

对于字符串形式的输入对象，Select-Object -Unique 按预期工作 - 并且速度更快.
但是请注意，对于大型输入文件，您可能会耗尽内存，因为 Select-Object 需要构建一个包含所有行的内存数据结构以确定唯一性.

With input objects that are strings, Select-Object -Unique works as expected - and is faster.
Note, however, that with large input files that you may run out of memory, given that Select-Object needs to build up an in-memory data structure containing all rows in order to determine uniqueness.

如何在powershell中合并两个具有相同标题的csv文件并丢弃重复的行

相关推荐