使用星号PHP脚本匹配电话前缀的最快方法

使用星号PHP脚本匹配电话前缀的最快方法

问题描述:

and thanks in advance for the help.

Background - I am writing a PHP script that needs to figure out the destination the caller is trying to reach. Telephony prefixes are strings that identify a destination. For each call, the program must find the longest prefix that matches the string. For example the number 30561234567 would be matched by 305 but not by 3057 or 304. If 3056 existed it would be the preferred match.

After investigating several data structures, a tree in which each node stores a digit and contains pointers to the other 10 possible choices seems ideal. This could be implemented as an array of arrays, where the script could check 3, find an array there, then check 0 on that new array, find another array and so on until it finds a value instead of an array. This value would be the destination id (the output of the script).

Requirements - Performance is absolutely critical. Time spent checking these prefixes delays the call, and each server has to handle large amounts of calls, so the data structure must be stored in memory. There are approximately 6000 prefixes at the moment.

Problem - The script is run each time the server receives a call, so the data must be held in a cache server of some sort. After checking memcached and APC (Advanced PHP Cache), I decided to use APC because it is [much faster][3] (it is a local memory store)

The problem I have is that the array of arrays can become up to 10 arrays deep, and will be a very complex data structure that, if I add to the cache as an object, will take a long time to de-serialize.

However if I add every single array to the cache separately (with some logical naming structure to be able to find it easily like 3 for array 3, then 30 for array 30, 305 for the array that follows that patch etc...) I will have to fetch different arrays from the cache many times (up to 10 per call), making me hit the cache too often.

Am I going about this the right way? Maybe there is another solution? Or is one of the methods I have proposed superior to the other?

Thank you for you input,

Alex

并提前感谢您的帮助。 p>

背景 - 我正在编写一个PHP脚本,需要找出调用者试图达到的目的地。 电话号码前缀是标识目的地的字符串。 对于每个调用,程序必须找到与字符串匹配的最长前缀。 例如,数字30561234567将匹配305但不匹配3057或304.如果存在3056,则它将是首选匹配。 p>

在调查多个数据结构之后,每个节点都有一个树 存储一个数字并包含指向其他10个可能选择的指针似乎是理想的。 这可以实现为一个数组数组,其中脚本可以检查3,在那里找到一个数组,然后在该新数组上检查0,找到另一个数组,依此类推,直到找到一个值而不是数组。 此值将是目标ID(脚本的输出)。 p>

要求 strong> - 性能绝对至关重要。 检查这些前缀所花费的时间会延迟调用,并且每个服务器都必须处理大量调用,因此数据结构必须存储在内存中。 目前大约有6000个前缀。 p>

问题 strong> - 每次服务器接到呼叫时都会运行脚本,因此数据必须保存在 某种缓存服务器。 在检查了memcached和APC(高级PHP缓存)之后,我决定使用APC,因为它[更快] [3](它是一个本地内存存储器) p>

我遇到的问题是 数组数组最多可以变为10个数组,并且将是一个非常复杂的数据结构,如果我作为对象添加到缓存中,将需要很长时间来反序列化。 p> \ n

但是如果我将每个单独的数组分别添加到缓存中(使用一些逻辑命名结构可以很容易地找到它,就像数组3中的3一样,那么30代表数组30,305代表该补丁之后的数组等) ..)我将不得不从缓存中多次获取不同的数组(每次调用最多10次),这使得我经常点击缓存。 p>

我是否正确地采用了这种方式 ? 也许有另一种解决方案? 或者我提出的方法之一优于另一方法? p>

感谢您输入, p>

Alex p>

Here is some sample code for an N-ary tree structure;

class PrefixCache {
 const EOS = 'eos';
 protected $data;

 function __construct() {
  $this->data = array();
  $this->data[self::EOS] = false;
 }

 function addPrefix($str) {
  $str = (string) $str;
  $len = strlen($str);

  for ($i=0, $t =& $this->data; $i<$len; ++$i) {
   $ch = $str[$i];

   if (!isset($t[$ch])) {
    $t[$ch] = array();
    $t[$ch][self::EOS] = false;
   }

   $t =& $t[$ch];
  }

  $t[self::EOS] = true;
 }

 function matchPrefix($str) {
  $str = (string) $str;
  $len = strlen($str);

  $so_far = '';
  $best = '';

  for ($i=0, $t =& $this->data; $i<$len; ++$i) {
   $ch = $str[$i];

   if (!isset($t[$ch]))
    return $best;
   else {
    $so_far .= $ch;
    if ($t[$ch][self::EOS])
     $best = $so_far;

    $t =& $t[$ch];     
   }
  }

  return false; // string not long enough - potential longer matches remain
 }

 function dump() {
  print_r($this->data);
 }
}

this can then be called as

$pre = new PrefixCache();

$pre->addPrefix('304');
$pre->addPrefix('305');
// $pre->addPrefix('3056');
$pre->addPrefix('3057');

echo $pre->matchPrefix('30561234567');

which performs as required (returns 305; if 3056 is uncommented, returns 3056 instead).

Note that I add a terminal-flag to each node; this avoids false partial matches, ie if you add prefix 3056124 it will properly match 3056 instead of returning 305612.

The best way to avoid reloading each time is to turn it into a service; however, before doing so I would measure run-times with APC. It may well be fast enough as is.

Alex: your answer is absolutely correct - but not applicable to this question :)

I do it by using a hashtable of string, destination where the keys are strings that represent the prefix of the destination. The critical factor is that the hashtable must be sorted so that the longest strings are checked first. As soon as a matching prefix is found the call destination is known.

I actually also have a round of regular expressions that for more convoluted destinations and check the regular expressions before the destination prefixes.

I haven't measured how long it takes to get a match but I suspect 15ms max. The whole process of checking the desitnation and then the user's balance and finally setting a call time limit takes around 150ms. In my case I am using FastAGI and C# Windows service. As long as you take less than 500ms it will be impercetible to your users.

I also run a telephony application... what I have done is provided an internal REST API to call, this is what will cache known phone numbers and do all of the prefix checking.

Also I assume that you are looking for country codes as well. There are only a few overlapping country codes with the NANP. So I look first for a NANP, and do a quick match on the number of following numbers (7) to make sure, otherwise I fall back on to a country code. I then have a a rough idea of how many numbers in a telephone number each country is supposed to have through a regular expression.

I'm using an XML document and matching on XPath, then caching the XPath result when possible.

The cool thing about using a REST API as well, is that it can be used to clean up numbers before I store them in the DB for billing.

It's not an exact science but it seems to work.

The way i see it, using a simple array structure should work ok...

Sample code: (note that for performance the prefixes are the keys in the array, not values)

// $prefixes = array(3=>1, 30=>1, 304=>1,305=>1,3056=>1,306=>1,31=>1, 40=>1);

function matchNumber($number)
{
  $prefixes = getPrefixesFromCache();

  $number = "$number";
  // try to find the longest prefix matching $number
  while ($number != '') {
    if (isset($keys[$number]))
      break;
    // not found yet, subtract last digit
    $number = substr($number, 0, -1);
  }
  return $number;
}

Another way would be to query the cache directly for the number - in this case, it could be further optimized:

  1. split number string in 2.
  2. query that string in the cache.
  3. if it doesn't exist, goto 1
  4. while it exists, store that value as result, and add another digit.

Snippet: (note that query_cache_for() should be replaced by whatever function your caching mechanism uses)

function matchNumber($number)
{
  $temp = "$number";
  $found = false;
  while (1) {
    $temp = substr($temp, 0, ceil(strlen($temp)/2) );
    $found = query_cache_for($temp);
    if ($found)
      break;
    if (strlen($temp) == 1)
      return FALSE; // should not happen!
  }
  while ($found) {
    $result = $temp;
    // add another digit
    $temp .= substr($number, strlen($temp), 1);
    $found = query_cache_for($temp);
  }
  return $result;
}

This approach has the obvious advantage that each prefix is a single element in the cache - the key could be 'asterix_prefix_<number>' for example, the value is unimportant (1 works).

Finding the longest common subsequence is a classical application of dynamic programming. The solution is O(n). http://en.wikipedia.org/wiki/Longest_common_subsequence_problem

Since you're only working with numbers, maybe working directly with strings is inefficient.

You could perform a binary search algorithm. If you store all your prefixes (numerically), padded to 15 places and then in order, you can scan 6000 codes in approximately log2(6000)~=13 steps.

For example if you have the following codes:

  • 01, 0127, 01273, 0809, 08

You would store the following in an array:

  1. 010000000000000
  2. 012700000000000
  3. 012730000000000
  4. 080000000000000
  5. 080900000000000

The steps would be:

  1. Strip incoming number down to 15 places.
  2. Perform binary search to find the nearest lowest code (and it's index in the array above)
  3. Look up the length of the code in a separate array (using the index)

Some sample code to see it in action:

// Example for prefixes 0100,01,012,0127,0200
$prefixes = array('0100','0101','0120','0127','0200');
$prefix_lengths = array(4,2,3,4,4);
$longest_length_prefix = 4;

echo GetPrefix('01003508163');

function GetPrefix($number_to_check) {
    global $prefixes;
    global $prefix_lengths;
    global $longest_length_prefix;

    $stripped_number = substr($number_to_check, 0, $longest_length_prefix);

    // Binary search
    $window_floor = 0;
    $window_ceiling = count($prefixes)-1;
    $prefix_index = -1;

    do {
        $mid_point = ($window_floor+$window_ceiling)>>1;

        if ($window_floor==($window_ceiling-1)) {
            if ($stripped_number>=$prefixes[$window_ceiling]) {
                $prefix_index=$window_ceiling;
                break;
            } elseif ($stripped_number>=$prefixes[$window_floor]) {
                $prefix_index=$window_floor;
                break;
            } else {
                break;
            }
        } else {
            if ($stripped_number==$prefixes[$mid_point]) {
                $prefix_index=$mid_point;
                break;
            } elseif ($stripped_number<$prefixes[$mid_point]) {
                $window_ceiling=$mid_point;
            } else {
                $window_floor=$mid_point;
            }
        }
    } while (true);

    if ($prefix_index==-1 || substr($number_to_check, 0, $prefix_lengths[$prefix_index])!=substr($prefixes[$prefix_index],0, $prefix_lengths[$prefix_index])) {
        return 'invalid prefix';
    } else {
        return substr($prefixes[$prefix_index], 0, $prefix_lengths[$prefix_index]);
    }
}