How do I remove duplicates matching a specific condition in MongoDB?
For example, I have the following documents in my collection:
{
"_id" : "GuqXmAkkARqhBDqhy",
"beatmapset_id" : "342537",
"version" : "MX",
"diff_approach" : "5",
"artist" : "Yousei Teikoku",
"title" : "Kokou no Sousei",
"difficultyrating" : "3.5552737712860107"
}
{
"_id" : "oHLT7KqsB7bztBGvu",
"beatmapset_id" : "342537",
"version" : "HD",
"diff_approach" : "5",
"artist" : "Yousei Teikoku",
"title" : "Kokou no Sousei",
"difficultyrating" : "2.7515676021575928"
}
{
"_id" : "GbotZfrPEwW69FkGD",
"beatmapset_id" : "342537",
"version" : "NM",
"diff_approach" : "5",
"artist" : "Yousei Teikoku",
"title" : "Kokou no Sousei",
"difficultyrating" : "0"
}
These documents have the same beatmapset_id key. I want to delete all duplicates but keep the document with the highest difficultyrating.
I tried db.collection.ensureIndex({beatmapset_id: 1}, {unique: true, dropDups: true}), but it keeps a random document, and I want the condition above.
How can I do that?
First you need to update your documents and change difficultyrating and beatmapset_id to floating-point numbers. To do that, loop over each document with the .forEach method and update each one using "Bulk" operations for maximum efficiency.
var bulk = db.collection.initializeOrderedBulkOp();
var count = 0;
db.collection.find().forEach(function(doc) {
    bulk.find({ '_id': doc._id }).update({
        '$set': {
            'beatmapset_id': parseFloat(doc.beatmapset_id),
            'difficultyrating': parseFloat(doc.difficultyrating)
        }
    });
    count++;
    // Execute in batches of 100 to keep the operation queue small.
    if (count % 100 === 0) {
        bulk.execute();
        bulk = db.collection.initializeOrderedBulkOp();
    }
});
// Flush the remaining operations, if any; executing an empty bulk throws.
if (count % 100 !== 0) {
    bulk.execute();
}
Now, since the "dropDups" syntax for index creation was deprecated as of MongoDB 2.6 and removed in MongoDB 3.0, this is how you can remove the duplicates.
The main idea here is to first sort your documents by difficultyrating in descending order, group them by beatmapset_id, keep the first _id in each group (the highest-rated one), and remove the rest.
bulk = db.collection.initializeUnorderedBulkOp();
count = 0;
db.collection.aggregate([
    // Sort so the document with the highest difficultyrating comes first.
    { '$sort': { 'difficultyrating': -1 }},
    // Group by beatmapset_id and collect the _ids in sorted order.
    { '$group': { '_id': '$beatmapset_id', 'ids': { '$push': '$_id' }, 'count': { '$sum': 1 }}},
    // Only groups with more than one document contain duplicates.
    { '$match': { 'count': { '$gt': 1 }}}
]).forEach(function(doc) {
    doc.ids.shift();    // drop the first (highest-rated) _id so it is kept
    bulk.find({ '_id': { '$in': doc.ids }}).remove();
    count++;
    // Execute in batches of 100 to keep the operation queue small.
    if (count % 100 === 0) {
        bulk.execute();
        bulk = db.collection.initializeUnorderedBulkOp();
    }
});
// Flush the remaining operations, if any.
if (count % 100 !== 0) {
    bulk.execute();
}
This answer covers the topic in more detail.