NSPredicate vs. NSString: which is better/faster for finding superstrings?

Problem description:

I have a large number of strings that I'm searching to see if a given substring exists. It seems there are two reasonable ways to do this.

Option 1: Use the NSString method rangeOfString: and test whether the returned .location is NSNotFound:

NSRange range = [string rangeOfString:substring];
return (range.location != NSNotFound);

Option 2: Use the NSPredicate CONTAINS syntax:

NSPredicate *predicate = [NSPredicate predicateWithFormat:@"SELF CONTAINS %@", substring];
return [predicate evaluateWithObject:string];

Which method is better, or is there a good Option 3 that I'm completely missing? No, I'm not sure exactly what I mean by "better", but possibly I mean faster when iterated over many, many strings.

You should benchmark and time any solution that uses NSPredicate because in my experience NSPredicate can be very slow.
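A rough way to time the two options from the question against each other is sketched below. The helper names (CountWithRangeOfString, CountWithPredicate, TimeBoth) are made up for illustration, and a single wall-clock pass like this is only a crude indication; results will vary with the data and the OS version.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: counts the strings that contain `needle`, using -rangeOfString:.
static NSUInteger CountWithRangeOfString(NSArray *strings, NSString *needle) {
    NSUInteger count = 0;
    for (NSString *string in strings) {
        if ([string rangeOfString:needle].location != NSNotFound) {
            count++;
        }
    }
    return count;
}

// Hypothetical helper: the same count, using an NSPredicate built once up front.
static NSUInteger CountWithPredicate(NSArray *strings, NSString *needle) {
    NSPredicate *predicate = [NSPredicate predicateWithFormat:@"SELF CONTAINS %@", needle];
    NSUInteger count = 0;
    for (NSString *string in strings) {
        if ([predicate evaluateWithObject:string]) {
            count++;
        }
    }
    return count;
}

// Runs each approach once over the same array and logs the elapsed wall-clock time.
static void TimeBoth(NSArray *strings, NSString *needle) {
    CFAbsoluteTime start = CFAbsoluteTimeGetCurrent();
    NSUInteger rangeCount = CountWithRangeOfString(strings, needle);
    CFAbsoluteTime middle = CFAbsoluteTimeGetCurrent();
    NSUInteger predicateCount = CountWithPredicate(strings, needle);
    CFAbsoluteTime end = CFAbsoluteTimeGetCurrent();

    NSLog(@"rangeOfString: %lu matches in %.4f s", (unsigned long)rangeCount, middle - start);
    NSLog(@"NSPredicate:   %lu matches in %.4f s", (unsigned long)predicateCount, end - middle);
}
```

Run it over an array that resembles your real data, and repeat the measurement a few times before drawing any conclusions.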

For simplicity, I would go with a simple for (NSString *string in stringsArray) { } type of loop. The loop body would contain a simple rangeOfString: check. You might be able to improve the performance of this by a few percent by using CFStringFind(), but you'll only see a benefit if you're searching through lots of strings. The advantage of using CFStringFind() is that you avoid the (very small) Objective-C message dispatch overhead. Again, it's usually only a win to switch to that when you're searching "a lot" of strings (for some always-changing value of "a lot"), and you should always benchmark to be sure. Prefer the simpler Objective-C rangeOfString: way if you can.
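For reference, a CFStringFind() version of the check might look roughly like this. The function name StringContains is hypothetical, and the __bridge casts assume the file is compiled with ARC.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns YES if `substring` occurs anywhere in `string`.
// NSString and CFStringRef are toll-free bridged, so the casts cost nothing at run time.
static BOOL StringContains(NSString *string, NSString *substring) {
    CFRange found = CFStringFind((__bridge CFStringRef)string,
                                 (__bridge CFStringRef)substring,
                                 0); // 0 = no compare options: a plain, case-sensitive search
    return found.location != kCFNotFound;
}
```

Inside the loop you would call StringContains(string, substring) where you previously sent rangeOfString:. Whether that is measurably faster is exactly the kind of thing to benchmark.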

A much more complicated approach is to use the ^Blocks feature with the NSEnumerationConcurrent option. NSEnumerationConcurrent is only a hint that you'd like the enumeration to happen concurrently if possible, and an implementation is free to ignore this hint if it can't support concurrent enumeration. However, your standard NSArray is most likely going to implement concurrent enumeration. In practice, this has the effect of dividing up all the objects in the NSArray and splitting them across the available CPUs. You need to be careful about how you mutate state and objects that are accessed by the ^Block across multiple threads. Here's one potential way of doing it:

// Be sure to #include <libkern/OSAtomic.h>

__block volatile OSSpinLock spinLock = OS_SPINLOCK_INIT;
__block NSMutableArray *matchesArray = [NSMutableArray array];

[stringsToSearchArray enumerateObjectsWithOptions:NSEnumerationConcurrent usingBlock:^(id obj, NSUInteger idx, BOOL *stop) {
    NSRange matchedRange = [obj rangeOfString:@"this"];
    if (matchedRange.location != NSNotFound) {
        OSSpinLockLock((volatile OSSpinLock * volatile)&spinLock);
        [matchesArray addObject:obj];
        OSSpinLockUnlock((volatile OSSpinLock * volatile)&spinLock);
    }
}];

// At this point, matchesArray will contain all the strings that had a match.

This uses a lightweight OSSpinLock to make sure only one thread has access to and updates matchesArray at a time. You can use the same CFStringFind() suggestion from above here as well.
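One caveat: OSSpinLock has since been deprecated (as of macOS 10.12 and iOS 10). A sketch of the same idea using NSLock, swapped in here as a simple replacement, could look like this. The function name ConcurrentMatches is hypothetical, and the code assumes ARC.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the strings containing `needle`, searching concurrently.
// The order of the returned strings is not guaranteed to match the input order.
static NSArray *ConcurrentMatches(NSArray *strings, NSString *needle) {
    NSMutableArray *matches = [NSMutableArray array];
    NSLock *lock = [[NSLock alloc] init];

    [strings enumerateObjectsWithOptions:NSEnumerationConcurrent
                              usingBlock:^(id obj, NSUInteger idx, BOOL *stop) {
        if ([obj rangeOfString:needle].location != NSNotFound) {
            // NSMutableArray is not thread safe, so serialize every mutation.
            [lock lock];
            [matches addObject:obj];
            [lock unlock];
        }
    }];

    return matches;
}
```

The lock is only taken when a match is found, so the searching itself still runs in parallel.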

Also, you should be aware that rangeOfString: won't, by itself, match on "word boundaries". In the example above, I used the word "this", which would match the string "A paleolithist walked in to the bar..." even though it does not contain the word "this".

The simplest solution to this little wrinkle is to use an ICU regular expression and take advantage of its "enhanced word breaking" functionality. To do this, you have a few options:


  • NSRegularExpression, available on iOS 4.0 and later (and Mac OS X 10.7 and later).
  • RegexKitLite, via RegexKitLite-4.0.tar.bz2
  • NSPredicate, via SELF MATCHES '(?w)\b...\b'. The advantage to this is that it requires nothing extra (i.e., RegexKitLite) and is available on all(?) versions of Mac OS X, and iOS > 3.0.

The following code shows how to use the enhanced word breaking functionality in ICU regular expressions via NSPredicate:

NSString *searchForString = @"this";
NSString *regexString = [NSString stringWithFormat:@".*(?w:\\b\\Q%@\\E\\b).*", searchForString];
NSPredicate *wordBoundaryRegexPredicate = [NSPredicate predicateWithFormat:@"SELF MATCHES %@", regexString];
NSArray *matchesArray = [stringsToSearchArray filteredArrayUsingPredicate:wordBoundaryRegexPredicate];

You can make the search case insensitive by replacing the (?w: in regexString with (?wi:.

The regex, if you're interested, basically says:


  • .*(?w:...).* says "match anything up to and after the (?w:...) part" (i.e., we're only interested in the (?w:...) part).
  • (?w:...) says "Turn on the ICU enhanced word breaking / finding feature inside the parentheses".
  • \\b...\\b (which is really only a single backslash each; any backslash has to be backslash-escaped when it's inside a @"" string literal) says "Match at a word boundary".
  • \\Q...\\E says "Treat the text starting immediately after \Q and up to \E as literal text (think "Quote" and "End")". In other words, any characters in the "quoted literal text" do not have their special regex meaning.

The reason for the \Q...\E is that you probably want to match the literal characters in searchForString. Without this, searchForString would be treated as part of the regex. As an example, if searchForString was "this?", then without \Q...\E it would not match the literal string "this?", but either "thi" or "this", which is probably not what you want. :)
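For comparison, the NSRegularExpression option from the list above could be sketched like this. The function name StringsContainingWord is hypothetical. It uses +escapedPatternForString: in place of \Q...\E, and the NSRegularExpressionUseUnicodeWordBoundaries option in place of (?w:...).

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the strings that contain `word` as a whole word.
static NSArray *StringsContainingWord(NSArray *strings, NSString *word) {
    // Escape the search term so characters such as ? or . lose their regex meaning.
    NSString *escaped = [NSRegularExpression escapedPatternForString:word];
    NSString *pattern = [NSString stringWithFormat:@"\\b%@\\b", escaped];

    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern
                                                  options:NSRegularExpressionUseUnicodeWordBoundaries
                                                    error:&error];
    if (regex == nil) {
        NSLog(@"Could not compile pattern %@: %@", pattern, error);
        return [NSArray array];
    }

    NSMutableArray *matches = [NSMutableArray array];
    for (NSString *string in strings) {
        NSRange wholeString = NSMakeRange(0, [string length]);
        if ([regex firstMatchInString:string options:0 range:wholeString] != nil) {
            [matches addObject:string];
        }
    }
    return matches;
}
```

Unlike SELF MATCHES, which has to match the entire string, firstMatchInString:options:range: searches within the string, so the leading and trailing .* are not needed here. Adding NSRegularExpressionCaseInsensitive to the options should give the case-insensitive variant.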