How to find the longest substring that occurs in every element of an array?

Problem description:

I have a collection of texts from some authors. Each author has a unique signature or link that occurs in all of their texts.

Example for Author1:

$texts=['sdsadsad daSDA DDASd asd aSD Sd dA  SD ASD sadasdasds sadasd

@jhsad.sadas.com sdsdADSA sada',
'KDJKLFFD GFDGFDHGF GFHGFDHGFH GFHFGH Lklfgfd gdfsgfdsg  df gfdhgf g  
hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf',
'jhjkfsdg fdgdf sfds hgfj j kkjjfghgkjf hdkjtkj lfdjfg hkgfl  
@jhsad.sadas.com dsfjdshflkds kg lsfdkg;fdgl'];

Expected output for Author1 is: @jhsad.sadas.com

Example for Author2:

$texts=['This is some random string representative of non-signature text.

This is the
*author\'s* signature.',
'Different message body text.      This is the
*author\'s* signature.

This is an afterthought that expresses that a signature is not always at the end.',
'Finally, this is unwanted stuff. This is the
*author\'s* signature.'];

Expected output for Author2 is:

This is the
 *author's* signature.

Note in particular that there are no reliable identifying characters (or positions) that signify the start or end of the signature. It could be a URL, a Twitter mention, or any kind of plain text, of any length, containing any sequence of characters, and it may occur at the start, end, or middle of the string.

I am seeking a method that will extract the longest substring that exists in all $texts elements for a single author.

It is expected, for the sake of this task, that all authors WILL have a signature substring that exists in every post/text.

IDEA: I'm thinking of converting words to vectors and measuring the similarity between the texts, for example using cosine similarity to find the signatures. I think the solution must be something along these lines.

mickmackusa's commented code captures the essence of what is desired, but I would like to see if there are other ways to achieve the desired result.


Answer

Here is my thinking:

1. Sort an author's collection of posts by string length (ascending) so that you are working from smaller texts to larger texts.
2. Split each post's text on one or more white-space characters, so that you are only handling wholly non-white-space substrings during processing.
3. Find matching substrings that occur in each subsequent post versus an ever-narrowing array of substrings (overlaps).
4. Group the consecutive matching substrings by analyzing their index value.
5. "Reconstitute" the grouped consecutive substrings into their original string form (trimmed of leading and trailing white-space characters, of course).
6. Sort the reconstituted strings by string length (descending) so that the longest string is assigned the 0 index.
7. Print to screen the substring that is assumed to be the author's signature (as a best guess) based on commonality and length.
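
The grouping step depends on array_intersect() keeping the keys of its first argument, so every surviving word still carries its position in the shortest text. A minimal sketch of the splitting, intersecting and grouping steps, using two short hypothetical texts (the sample strings and the variable names $first, $second, $groups, $start and $prev are illustrative assumptions):

```php
<?php
// Hypothetical sample texts, not taken from the question.
$first  = preg_split('/\s+/', 'Thanks for reading. Cheers, Ann', -1, PREG_SPLIT_NO_EMPTY);
$second = preg_split('/\s+/', 'for what it is worth: Cheers, Ann', -1, PREG_SPLIT_NO_EMPTY);

// array_intersect() preserves the keys of its first argument,
// so each remaining word keeps its index from the first text.
$overlaps = array_intersect($first, $second);

// Group words whose indexes are consecutive; a gap starts a new group.
$groups = [];
$start = null;
$prev = null;
foreach ($overlaps as $i => $word) {
    if ($start === null || $i - $prev > 1) {
        $start = $i;
    }
    $prev = $i;
    $groups[$start][] = $word;
}

var_export($groups);
```

Here "for" survives on its own at index 1, while "Cheers," and "Ann" sit at the adjacent indexes 3 and 4 and therefore end up in one group.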

Code:

$posts['Author1']=['sdsadsad daSDA DDASd asd aSD Sd dA  SD ASD sadasdasds sadasd

@jhsad.sadas.com sdsdADSA sada',
'KDJKLFFD GFDGFDHGF GFHGFDHGFH GFHFGH Lklfgfd gdfsgfdsg  df gfdhgf g  
hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf',
'jhjkfsdg fdgdf sfds hgfj j kkjjfghgkjf hdkjtkj lfdjfg hkgfl  
@jhsad.sadas.com dsfjdshflkds kg lsfdkg;fdgl'];

$posts['Author2']=['This is some random string representative of non-signature text.

This is the
 *author\'s* signature.',
        'Different message body text.      This is the
 *author\'s* signature.

    This is an afterthought that expresses that a signature is not always at the end.',
        'Finally, this is unwanted stuff. This is the
 *author\'s* signature.'];

foreach($posts as $author=>$texts){
    echo "Author: $author
";

    usort($texts,function($a,$b){return strlen($a)-strlen($b);}); // sort ASC by strlen; mb_strlen probably isn't advantageous
    var_export($texts);
    echo "
";

    foreach($texts as $index=>$string){
        $words=preg_split('/\s+/',$string,-1,PREG_SPLIT_NO_EMPTY);  // all non-white-space substrings of this text (-1 means no limit; passing NULL is deprecated as of PHP 8.1)
        if(!$index){
            $overlaps=$words;  // declare with all non-white-space substrings from first text
        }else{
            $overlaps=array_intersect($overlaps,$words);  // filter word bank using narrowing number of words
        }
    }
    var_export($overlaps);
    echo "
";

    // batch consecutive substrings
    $group=null;
    $consecutives=[];  // clear previous iteration's data
    foreach($overlaps as $i=>$word){
        if($group===null || $i-$last>1){
            $group=$i;
        }
        $last=$i;
        $consecutives[$group][]=$word;
    }
    var_export($consecutives);
    echo "
";

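    $potential_signatures=[];  // clear previous iteration's data
    foreach($consecutives as $words){
        // match potential signatures in first text for measurement:
        $pattern='/'.implode('\s+',array_map(function($word){return preg_quote($word,'/');},$words)).'/';  // make every word literal, including any / delimiter character
        if(preg_match_all($pattern,$texts[0],$out)){
            $potential_signatures=array_merge($potential_signatures,$out[0]);  // collect matches from every group, not only the last one
        }
    }
    usort($potential_signatures,function($a,$b){return strlen($b)-strlen($a);}); // sort DESC by strlen; mb_strlen probably isn't advantageous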

    echo "Assumed Signature: {$potential_signatures[0]}

";
}

Output:

Author: Author1
array (
  0 => 'sdsadsad daSDA DDASd asd aSD Sd dA  SD ASD sadasdasds sadasd

@jhsad.sadas.com sdsdADSA sada',
  1 => 'jhjkfsdg fdgdf sfds hgfj j kkjjfghgkjf hdkjtkj lfdjfg hkgfl  
@jhsad.sadas.com dsfjdshflkds kg lsfdkg;fdgl',
  2 => 'KDJKLFFD GFDGFDHGF GFHGFDHGFH GFHFGH Lklfgfd gdfsgfdsg  df gfdhgf g  
hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf',
)
array (
  11 => '@jhsad.sadas.com',
)
array (
  11 => 
  array (
    0 => '@jhsad.sadas.com',
  ),
)
Assumed Signature: @jhsad.sadas.com

Author: Author2
array (
  0 => 'Finally, this is unwanted stuff. This is the
 *author\'s* signature.',
  1 => 'This is some random string representative of non-signature text.

This is the
 *author\'s* signature.',
  2 => 'Different message body text.      This is the
 *author\'s* signature.

    This is an afterthought that expresses that a signature is not always at the end.',
)
array (
  2 => 'is',
  5 => 'This',
  6 => 'is',
  7 => 'the',
  8 => '*author\'s*',
  9 => 'signature.',
)
array (
  2 => 
  array (
    0 => 'is',
  ),
  5 => 
  array (
    0 => 'This',
    1 => 'is',
    2 => 'the',
    3 => '*author\'s*',
    4 => 'signature.',
  ),
)
Assumed Signature: This is the
 *author's* signature.
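
The approach above compares whole words. For comparison, a character-level search for the longest common substring could be sketched as below. This is not part of the code above: longestCommonSubstring is a hypothetical helper name, the search is brute force (slow on long texts), and it compares bytes via strlen(), substr() and strpos(), so the result may carry leading or trailing white-space that you would want to trim().

```php
<?php
// Hypothetical helper: brute-force search for the longest substring
// of the shortest text that also occurs in every other text.
function longestCommonSubstring(array $texts): string
{
    if (!$texts) {
        return '';
    }
    // Start from the shortest text: the result cannot be longer than it.
    usort($texts, function ($a, $b) { return strlen($a) - strlen($b); });
    $shortest = array_shift($texts);
    $length = strlen($shortest);

    // Try candidate lengths from longest to shortest; the first hit wins.
    for ($size = $length; $size > 0; --$size) {
        for ($offset = 0; $offset + $size <= $length; ++$offset) {
            $candidate = substr($shortest, $offset, $size);
            foreach ($texts as $text) {
                if (strpos($text, $candidate) === false) {
                    continue 2; // missing from one text: try the next candidate
                }
            }
            return $candidate;
        }
    }
    return '';
}

// Hypothetical sample texts.
echo longestCommonSubstring([
    'abc @sig.example xyz',
    'q @sig.example',
    '@sig.example zz top',
]);
```

When several candidates share the maximum length, this sketch returns the leftmost one in the shortest text.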

Answer

If the signature is always a token that starts with @, as in the Author1 example, you can use preg_match() with a regex to extract it from a text.

$str = "KDJKLFFD GFDGFDHGF GFHGFDHGFH GFHFGH Lklfgfd gdfsgfdsg df gfdhgf g hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf";

preg_match("/\@[^\s]+/", $str, $match);

var_dump($match); //Will output the signature
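
The pattern only inspects a single string. To confirm that the same token occurs in every text, the matches could be intersected across the texts. A sketch, assuming the signature is always an @-prefixed token (the sample texts and the variable names $common and $found are illustrative assumptions):

```php
<?php
// Hypothetical sample texts; assumes the signature starts with @.
$texts = [
    'first post @jhsad.sadas.com more text @other.example',
    'second post @jhsad.sadas.com',
    '@jhsad.sadas.com third post',
];

$common = null;
foreach ($texts as $text) {
    // Collect every @-prefixed token in this text.
    preg_match_all('/@\S+/', $text, $matches);
    $found = array_unique($matches[0]);
    // Keep only the tokens that were also present in all earlier texts.
    $common = $common === null ? $found : array_intersect($common, $found);
}
$common = array_values($common);

var_export($common);
```

Only tokens present in all texts remain in $common; a signature that is not an @-prefixed token would not be found this way.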
