最好的算法在字符串数组中删除重复项

问题描述:

今天在学校老师要求我们实现了一个重复,删除算法。这并不难,每个人都想出了以下解决方案(伪code):

Today at school the teacher asked us to implement a duplicate-deletion algorithm. It's not that difficult, and everyone came up with the following solution (pseudocode):

for i from 1 to n - 1
    for j from i + 1 to n
        if v[i] == v[j] then remove(v, v[j])    // remove(from, what)
    next j
next i

此算法中的计算复杂度 N(N-1)/ 2 。 (我们在高中的时候,我们还没有谈到大O,但它似乎是为O(n ^ 2))。该解决方案出现,当然丑陋,缓慢的,所以我试图code东西速度快:

The computational complexity for this algo is n(n-1)/2. (We're in high school, and we haven't talked about big-O, but it seems to be O(n^2)). This solution appears ugly and, of course, slow, so I tried to code something faster:

procedure binarySearch(vector, element, *position)
    // this procedure searches for element in vector, returning
    // true if found, false otherwise. *position will contain the
    // element's place (where it is or where it should be)
end procedure

----

// same type as v
vS = new array[n]

for i from 1 to n - 1
    if binarySearch(vS, v[i], &p) = true then
        remove(v, v[i])
    else
        add(vS, v[i], p)      // adds v[i] in position p of array vS
    end if
next i

这样 VS 将包含所有我们已经通过了元素。如果元素 v [电流] 在这个数组,那么它是一个重复的和被删除。对于二进制搜索的计算复杂度的log(n)和主回路(第二片段)是 N 。因此,整个CC是的n * log(n)的如果我没有记错。

This way vS will contain all the elements we've already passed. If element v[i] is in this array, then it is a duplicate and is removed. The computational complexity for the binary search is log(n) and for the main loop (second snippet) is n. Therefore the whole CC is n*log(n) if I'm not mistaken.

然后,我有关于使用二叉树另一个想法,但我不能把它放下。
基本上,我的问题是:

Then I had another idea about using a binary tree, but I can't put it down.
Basically my questions are:

  • 是我的CC计算吧? (如果没有,为什么?)
  • 有没有更快的方法呢?

感谢

最简单的解决方法是将简单的排序阵列(需要为O(n log n)的标准执行,如果你可以使用它们。否则,可考虑使一个简单的随机快速排序(code是即使在*))。

The easiest solution will be to simply sort the array (takes O(n log n) with standard implementation if you may use them. otherwise consider making an easy randomized quicksort (code is even on wikipedia)).

然后扫描其中的一个额外的时间。 在此期间,扫描简单的消除连续相同的元素。

Afterwards scan it for one additional time. During that scan simple eliminate consecutive identical elements.

。 只是想迭代一次在你的阵列,每个元素检查,如果它在你的HashSet。

If you want to do it in O(n), you can also use a HashSet with elements you have already seen. Just iterate once over your array, for each element check if it is in your HashSet.

如果它不在那里,添加它。 如果是在那里,从数组中取出。

If it isn't in there, add it. If it is in there, remove it from the array.

请注意,这将需要一些额外的存储器和哈希将有利于你的运行一个常数因子。 Althought的时间复杂度比较好,实际运行时将只磺酰基会更快,一旦你超过一定的数组大小

Note, that this will take some additional memory and the hashing will have a constant factor that contributes to your runtime. Althought the time complexity is better, the practical runtime will only be onyl be faster once you exceed a certain array size