使用排列在MySQL中搜索

使用排列在MySQL中搜索

问题描述:

I need help.
I have a table where only two columns are: ID and NAME and these data:

ID | NAME
1    HOME
2    GAME
3    LINK

And I want show e.g. row with name: HOME if user search: HOME or OMEH or EMOH or HMEO, etc... - all permutations from word HOME.

I can't save to mysql all these permutations and search in this columns, because some words will be a too big (9-10 chars) and more than 40 MB for each 9 chars words.

我需要帮助。
我有一个只有两列的表:ID和NAME以及这些数据: p>

  ID |  NAME 
1 HOME 
2 GAME 
3 LINK 
  code>  pre> 
 
 

我想要展示,例如 名称为行的行:如果用户搜索,则为HOME:HOME或OMEH或EMOH或HMEO等... - 来自单词HOME的所有排列。 p>

我无法将所有这些排列保存到mysql并在此列中搜索,因为有些单词会太大(9-10个字符),每个9个字符超过40 MB 单词。 p> div>

One way to solve this problem is to store the sorted set of characters in each name in your database as an additional column and then sort the string the user inputs before searching e.g. database has

ID   NAME   CHARS
1    HOME   EHMO
2    GAME   AEGM
3    LINK   IKLN

Then when searching in PHP you would do this:

$search = 'MEHO';                // user input = MEHO
$chars = str_split($search);
sort($chars);
$search = implode('', $chars);   // now contains EHMO
$sql = "SELECT ID, NAME FROM table1 WHERE CHARS = '$search'";
// perform query etc.

Output

ID   NAME
1    HOME

This sounds like a "please do my homework for me" question. It is hard to conceive what real world problem this is applicable to and there is no standard solution. It is OK to ask for help with your homework here, but you should state that this is the case.

more than 40 MB for each 9 chars words

Your maths is a bit wonky, but indeed the storage does not scale well. OTOH leaving aside the amount of storage, in terms of the processing workload it does scale well as a solution.

You could simply brute-force a dynamic query:

 function mkqry($word)
 {
     $qry="SELECT * FROM yourtable WHERE 1 ";
     $last=strlen($word);
     for ($x=0; $x<$last; $x==) {
          $qry.=" AND word LIKE '%" . substr($word, $x, 1) . "%'";
     } 
     return $qry;
 }

However this will always result in a full table scan (slow) and won't correctly handle cases where a letter occurs twice in a word.

The solution is to use an indexing function which is independent of the order in which the characters appear - a non-cryptographic hash. An obvious candidate would be to XOR the characters together, although this only results in a one character identifier which is not very selective. So I would suggest simply adding the character codes:

 function pos_ind_hash($word)
 {
     $sum=0;
     for ($x=0; $x<$last; $x==) {
         $sum+=ord(substr($word, $x));
     }
     return $sum;
 }

 function mkqry($word)
 {
     $qry="SELECT * FROM yourtable WHERE 1 ";
     $last=strlen($word);
     for ($x=0; $x<$last; $x==) {
          $qry.=" AND word LIKE '%" . substr($word, $x, 1) . "%'";
     }
     $qry.=" AND yourtable.hash=" .  pos_ind_hash($word);
     return $qry;
 }

Note that the hash mechanism here does not uniquely identify a single word, but is specific enough to reduce the volume to the point where an index (on the hash) would be effective.

Multiplying rather than adding would create fewer collisions but at a greater risk of overflowing (which would create ambiguity between implementations).

But both the hash and the single character LIKE only reduce the number of potential matches. To get the query to behave definitively, you need to go further. You could add an attribute to the table (and to the index with the hash)containing the string length - this would be more selective (i.e. improve effectiveness of the index) but still not definitive.

For a definitive method you would need to specify in your query that the data does NOT contain characters which are NOT in the word you are looking for.

The wrong way to do that would be to add a loop specifying "AND NOT LIKE....".

A valid way of doing that would be to add a test in the query which replaces all the letters in the table attribute which appear in the word you are searching for which results in a zero length string.