How to find the most recently modified files and delete them with shell code

Question:

I need some help with some shell code. I currently have this code:

find $dirname -type f -exec md5sum '{}' ';' | sort | uniq --all-repeated=separate -w 33 | cut -c 35-

This code finds duplicated files (files with the same content) in a given directory. I need to update it so that it finds the most recently modified file (by date) in each list of duplicates, prints that file name, and gives me the opportunity to delete that file from the terminal.

Answer

Here's a "naive" solution implemented in bash (except for two external commands: md5sum, of course, and stat, which is used only for the user's comfort and is not part of the algorithm). It implements a 100% Bash quicksort (that I'm kind of proud of):

#!/bin/bash

# Finds similar (based on md5sum) files (recursively) in given
# directory. If several files with same md5sum are found, sort
# them by modified date (most recent first) and prompt user for
# deletion of all but the most recent

die() {
   printf >&2 '%s\n' "$@"
   exit 1
}

quicksort_files_by_mod_date() {
    if ((!$#)); then
        qs_ret=()
        return
    fi
    # the return array is qs_ret
    local first=$1
    shift
    local newers=()
    local olders=()
    qs_ret=()
    for i in "$@"; do
        if [[ $i -nt $first ]]; then
            newers+=( "$i" )
        else
            olders+=( "$i" )
        fi
    done
    quicksort_files_by_mod_date "${newers[@]}"
    newers=( "${qs_ret[@]}" )
    quicksort_files_by_mod_date "${olders[@]}"
    olders=( "${qs_ret[@]}" )
    qs_ret=( "${newers[@]}" "$first" "${olders[@]}" )
}

[[ -n $1 ]] || die "Must give an argument"
[[ -d $1 ]] || die "Argument must be a directory"

dirname=$1

shopt -s nullglob
shopt -s globstar

declare -A files
declare -A hashes

for file in "$dirname"/**; do
    [[ -f $file ]] || continue
    read md5sum _ < <(md5sum -- "$file")
    files[$file]=$md5sum
    ((hashes[$md5sum]+=1))
done

has_found=0
for hash in "${!hashes[@]}"; do
    ((hashes[$hash]>1)) || continue
    files_with_same_md5sum=()
    for file in "${!files[@]}"; do
        [[ ${files[$file]} = $hash ]] || continue
        files_with_same_md5sum+=( "$file" )
    done
    has_found=1
    echo "Found ${hashes[$hash]} files with md5sum=$hash, sorted by modified (most recent first):"
    # sort them by modified date (using quicksort :p)
    quicksort_files_by_mod_date "${files_with_same_md5sum[@]}"
    for file in "${qs_ret[@]}"; do
      printf "   %s %s\n" "$(stat --printf '%y' -- "$file")" "$file"
    done
    read -rp "Do you want to remove all but the most recent? [yn] " answer
    if [[ ${answer,,} = y ]]; then
       echo rm -fv -- "${qs_ret[@]:1}"
    fi
done

if ((!has_found)); then
    echo "Didn't find any similar files in directory \`$dirname'. Yay."
fi

I guess the script is self-explanatory (you can read it like a story). It uses the best practices I know of, and is 100% safe regarding any silly characters in file names (e.g., spaces, newlines, file names starting with hyphens, file names ending with a newline, etc.).

It uses bash's globs, so it might be a bit slow if you have a bloated directory tree.
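For comparison, here is a minimal sketch of a commonly used alternative way to walk the tree: find with -print0, read back with a NUL delimiter so that odd file names survive. This is not part of the script above; the demo directory and file names are made up, and whether this is actually faster than the glob on a given tree is something to measure rather than assume.

```shell
#!/bin/bash
# Sketch only: demo_dir and the file names below are made up.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/sub dir"
printf 'x\n' > "$demo_dir/sub dir/file one"
printf 'y\n' > "$demo_dir/-dash"

# Collect every regular file, NUL-delimited so that spaces and
# newlines in names are preserved.
found=()
while IFS= read -r -d '' file; do
    found+=( "$file" )
done < <(find "$demo_dir" -type f -print0)

printf 'found %d regular file(s)\n' "${#found[@]}"
rm -rf -- "$demo_dir"
```

Two differences to keep in mind if you swap this in: find also returns hidden files, which the glob skips unless dotglob is set, and -type f does not match symbolic links to files, whereas [[ -f $file ]] does.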

There is some error checking, but many checks are missing, so don't use it as-is in production! (Adding them is a trivial but rather tedious task.)

The algorithm is as follows: scan each file in the given directory tree; for each file, compute its md5sum and store it in two associative arrays:

files, with the file names as keys and the md5sums as values.
hashes, with the md5sums as keys and, as values, the number of files that have that md5sum.
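To make the two arrays concrete, here is a small sketch that fills them for a throwaway directory. The loop body is the same as in the script; the demo directory and its three files are made up for illustration.

```shell
#!/bin/bash
# Sketch only: demo_dir and its files are made up.
demo_dir=$(mktemp -d)
printf 'same\n'  > "$demo_dir/a.txt"
printf 'same\n'  > "$demo_dir/b.txt"
printf 'other\n' > "$demo_dir/c.txt"

declare -A files    # file name -> md5sum
declare -A hashes   # md5sum    -> number of files with that md5sum

for file in "$demo_dir"/*; do
    [[ -f $file ]] || continue
    read md5sum _ < <(md5sum -- "$file")
    files[$file]=$md5sum
    ((hashes[$md5sum]+=1))
done

# Print each md5sum with its file count (iteration order is unspecified).
for hash in "${!hashes[@]}"; do
    printf '%s -> %d file(s)\n' "$hash" "${hashes[$hash]}"
done

rm -rf -- "$demo_dir"
```

Here a.txt and b.txt share one entry in hashes with a count of 2, which is what marks them as duplicates in the second pass.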

After this is done, we scan through all the md5sums found, keep only those that correspond to more than one file, collect all files with each such md5sum, quicksort them by modification date, and prompt the user.
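The second pass relies on two bash features: the -nt ("newer than") file test, which the quicksort uses to compare modification times, and the array slice "${qs_ret[@]:1}", which is what gets passed to rm. A minimal sketch of both, on made-up files with explicitly set timestamps:

```shell
#!/bin/bash
# Sketch only: demo_dir and the file names are made up.
# touch -t sets an explicit modification time (YYYYMMDDhhmm).
demo_dir=$(mktemp -d)
touch -t 202001010000 "$demo_dir/old copy"
touch -t 202101010000 "$demo_dir/middle copy"
touch -t 202201010000 "$demo_dir/new copy"

# [[ a -nt b ]] is true when a was modified more recently than b.
if [[ "$demo_dir/new copy" -nt "$demo_dir/old copy" ]]; then
    newer_result="new copy is newer"
fi
echo "$newer_result"

# Pretend this is the sorted result, most recent first.
sorted=( "$demo_dir/new copy" "$demo_dir/middle copy" "$demo_dir/old copy" )

# The slice ${sorted[@]:1} is every element except the first,
# i.e. every duplicate except the most recently modified one.
to_remove=( "${sorted[@]:1}" )
printf 'would remove: %s\n' "${to_remove[@]##*/}"

rm -rf -- "$demo_dir"
```

So with three duplicates, answering y removes the two older copies and keeps the newest one.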

A nice side effect when no duplicates are found: the script informs the user about it.

I would not say it's the most efficient way of doing things (it might be better in, e.g., Perl), but it's really a lot of fun, surprisingly easy to read and follow, and you can potentially learn a lot by studying it!

It uses a few bashisms and features that are only available in bash version ≥ 4.
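The bash 4 features in question are associative arrays (declare -A), the globstar option, and the ${answer,,} lowercase expansion. If you want the script to fail early on an older bash, a guard along these lines could go near the top; this is a sketch, and the function name is made up:

```shell
#!/bin/bash
# Sketch only: bash_is_at_least_4 is a hypothetical helper name.
# BASH_VERSINFO[0] holds the major version of the running bash.
bash_is_at_least_4() {
    ((BASH_VERSINFO[0] >= 4))
}

if ! bash_is_at_least_4; then
    printf >&2 '%s\n' "This script needs bash 4 or newer (found $BASH_VERSION)"
    exit 1
fi
```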

Hope this helps!

Remark. If date on your system has the -r switch, you can replace the stat command with:

date -r "$file"

Remark. I left the echo in front of rm. Remove it if you're happy with how the script behaves. Then you'll have a script that uses 3 external commands :).
