How to find the longest substring that occurs in every element of an array?
I have a collection of texts from some authors. Each author has a unique signature or link that occurs in all of their texts.
Example for Author1:
$texts=['sdsadsad daSDA DDASd asd aSD Sd dA SD ASD sadasdasds sadasd @jhsad.sadas.com sdsdADSA sada', 'KDJKLFFD GFDGFDHGF GFHGFDHGFH GFHFGH Lklfgfd gdfsgfdsg df gfdhgf g hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf', 'jhjkfsdg fdgdf sfds hgfj j kkjjfghgkjf hdkjtkj lfdjfg hkgfl @jhsad.sadas.com dsfjdshflkds kg lsfdkg;fdgl'];
Expected output for Author1 is:
@jhsad.sadas.com
Example for Author2:
$texts=['This is some random string representative of non-signature text. This is the *author\'s* signature.', 'Different message body text. This is the *author\'s* signature. This is an afterthought that expresses that a signature is not always at the end.', 'Finally, this is unwanted stuff. This is the *author\'s* signature.'];
Expected output for Author2 is:
This is the *author's* signature.
Note in particular that there are no reliable identifying characters (or positions) that signify the start or end of the signature. It could be a URL, a Twitter mention, or any kind of plain text, of any length, containing any sequence of characters, and it may occur at the start, end, or middle of the string.
I am seeking a method that will extract the longest substring that exists in all $texts elements for a single author.
It is expected, for the sake of this task, that all authors WILL have a signature substring that exists in every post/text.
IDEA: I'm thinking of converting words to vectors and measuring the similarity between the texts, for example with cosine similarity, to find the signatures. I think the solution must be something along these lines.
mickmackusa's commented code captures the essence of what is desired, but I would like to see if there are other ways to achieve the desired result.
Here is my thinking:
- Sort an author's collection of posts by string length (ascending) so that you are working from smaller texts to larger texts.
- Split each post's text on one or more white-space characters, so that you are only handling wholly non-white-space substrings during processing.
- Find matching substrings that occur in each subsequent post versus an ever-narrowing array of substrings (overlaps).
- Group the consecutive matching substrings by analyzing their index value.
- "Reconstitute" the grouped consecutive substrings into their original string form (trimmed of leading and trailing white-space characters, of course).
- Sort the reconstituted strings by string length (descending) so that the longest string is assigned the 0 index.
- Print to screen the substring that is assumed to be the author's signature (as a best guess) based on commonality and length.
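The grouping step works because array_intersect() preserves the keys of its first argument, so words that were adjacent in the shortest text keep adjacent indexes. A minimal illustration of that one idea, using made-up words that are not taken from the posts:

```php
<?php
// Hypothetical sample data, for illustration only.
$first  = preg_split('/\s+/', 'foo bar sig one two baz', -1, PREG_SPLIT_NO_EMPTY);
$second = preg_split('/\s+/', 'sig one two qux foo', -1, PREG_SPLIT_NO_EMPTY);

// array_intersect() keeps the keys of its first argument.
$overlaps = array_intersect($first, $second);
// $overlaps is [0 => 'foo', 2 => 'sig', 3 => 'one', 4 => 'two']

// A gap larger than 1 between two keys starts a new group.
$groups = [];
$group  = null;
$last   = null;
foreach ($overlaps as $i => $word) {
    if ($group === null || $i - $last > 1) {
        $group = $i;
    }
    $last = $i;
    $groups[$group][] = $word;
}
// $groups is [0 => ['foo'], 2 => ['sig', 'one', 'two']]
```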
Code: (Demo)
$posts['Author1']=['sdsadsad daSDA DDASd asd aSD Sd dA SD ASD sadasdasds sadasd
@jhsad.sadas.com sdsdADSA sada',
'KDJKLFFD GFDGFDHGF GFHGFDHGFH GFHFGH Lklfgfd gdfsgfdsg df gfdhgf g
hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf',
'jhjkfsdg fdgdf sfds hgfj j kkjjfghgkjf hdkjtkj lfdjfg hkgfl
@jhsad.sadas.com dsfjdshflkds kg lsfdkg;fdgl'];
$posts['Author2']=['This is some random string representative of non-signature text.
This is the
*author\'s* signature.',
'Different message body text. This is the
*author\'s* signature.
This is an afterthought that expresses that a signature is not always at the end.',
'Finally, this is unwanted stuff. This is the
*author\'s* signature.'];
foreach ($posts as $author => $texts) {
    echo "Author: $author\n";
    // sort ASC by strlen so the shortest text seeds the word bank; mb_strlen probably isn't advantageous
    usort($texts, function ($a, $b) { return strlen($a) - strlen($b); });
    var_export($texts);
    echo "\n";

    foreach ($texts as $index => $string) {
        // -1 means "no limit"; passing NULL here is deprecated as of PHP 8.1
        $words = preg_split('/\s+/', $string, -1, PREG_SPLIT_NO_EMPTY);
        if (!$index) {
            $overlaps = $words; // declare with all non-white-space substrings from first text
        } else {
            $overlaps = array_intersect($overlaps, $words); // filter word bank; keys of the first text are preserved
        }
    }
    var_export($overlaps);
    echo "\n";

    // batch consecutive substrings
    $group = null;
    $consecutives = []; // clear previous iteration's data
    foreach ($overlaps as $i => $word) {
        if ($group === null || $i - $last > 1) { // a gap in the indexes starts a new group
            $group = $i;
        }
        $last = $i;
        $consecutives[$group][] = $word;
    }
    var_export($consecutives);
    echo "\n";

    $potential_signatures = []; // reset per author and collect from every group
    foreach ($consecutives as $words) {
        // preg_quote() makes each word literal and also escapes the "/" delimiter (e.g. in URLs)
        $quoted = array_map(function ($w) { return preg_quote($w, '/'); }, $words);
        // match potential signatures in first text for measurement:
        if (preg_match_all('/' . implode('\s+', $quoted) . '/', $texts[0], $out)) {
            $potential_signatures = array_merge($potential_signatures, $out[0]);
        }
    }
    // sort DESC by strlen; mb_strlen probably isn't advantageous
    usort($potential_signatures, function ($a, $b) { return strlen($b) - strlen($a); });
    echo "Assumed Signature: {$potential_signatures[0]}\n";
}
Output:
Author: Author1
array (
0 => 'sdsadsad daSDA DDASd asd aSD Sd dA SD ASD sadasdasds sadasd
@jhsad.sadas.com sdsdADSA sada',
1 => 'jhjkfsdg fdgdf sfds hgfj j kkjjfghgkjf hdkjtkj lfdjfg hkgfl
@jhsad.sadas.com dsfjdshflkds kg lsfdkg;fdgl',
2 => 'KDJKLFFD GFDGFDHGF GFHGFDHGFH GFHFGH Lklfgfd gdfsgfdsg df gfdhgf g
hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf',
)
array (
11 => '@jhsad.sadas.com',
)
array (
11 =>
array (
0 => '@jhsad.sadas.com',
),
)
Assumed Signature: @jhsad.sadas.com
Author: Author2
array (
0 => 'Finally, this is unwanted stuff. This is the
*author\'s* signature.',
1 => 'This is some random string representative of non-signature text.
This is the
*author\'s* signature.',
2 => 'Different message body text. This is the
*author\'s* signature.
This is an afterthought that expresses that a signature is not always at the end.',
)
array (
2 => 'is',
5 => 'This',
6 => 'is',
7 => 'the',
8 => '*author\'s*',
9 => 'signature.',
)
array (
2 =>
array (
0 => 'is',
),
5 =>
array (
0 => 'This',
1 => 'is',
2 => 'the',
3 => '*author\'s*',
4 => 'signature.',
),
)
Assumed Signature: This is the
*author's* signature.
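The approach above only compares whole white-space-delimited words. Since the question literally asks for the longest substring shared by all texts, a character-level brute force is another option. The following is only a sketch: the function name longestCommonSubstring is made up for illustration, it compares bytes (so multibyte characters could be split), and it is slow on long texts because it tries every substring of the shortest text.

```php
<?php
/**
 * Hypothetical helper: returns the longest substring present in every
 * string of $texts, or '' if there is none. Byte-based comparison.
 */
function longestCommonSubstring(array $texts): string
{
    if (!$texts) {
        return '';
    }
    // Any common substring must fit inside the shortest text.
    usort($texts, function ($a, $b) { return strlen($a) - strlen($b); });
    $shortest = array_shift($texts);
    $max = strlen($shortest);

    // Try candidates from longest to shortest, left to right.
    for ($len = $max; $len > 0; --$len) {
        for ($start = 0; $start + $len <= $max; ++$start) {
            $candidate = substr($shortest, $start, $len);
            foreach ($texts as $text) {
                if (strpos($text, $candidate) === false) {
                    continue 2; // missing from one text: move to the next start position
                }
            }
            return $candidate; // first hit at this length is a longest match
        }
    }
    return '';
}

// Illustrative data, not from the posts above.
$sample = ['xx SIG-123 yy', 'SIG-123 zz', 'q SIG-123'];
echo trim(longestCommonSubstring($sample)), "\n"; // SIG-123
```

The raw result can include the white-space that surrounds the signature in every text, which is why the usage line trims it.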
You can use preg_match() with a regex to achieve this, provided the signature is an @-prefixed token like the one in the Author1 example:
$str = "KDJKLFFD GFDGFDHGF GFHGFDHGFH GFHFGH Lklfgfd gdfsgfdsg df gfdhgf g hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf";
preg_match("/\@[^\s]+/", $str, $match);
var_dump($match); //Will output the signature
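A single text can contain several @-prefixed tokens, so one way to firm this up is to collect the candidates from every text and keep only those that appear in all of them. A minimal sketch under that assumption; the sample strings are shortened stand-ins for the Author1 texts, and "@other" is an invented decoy:

```php
<?php
// Shortened, illustrative stand-ins for the Author1 texts.
$texts = [
    'sadasd @jhsad.sadas.com sdsdADSA @other',
    'jhg @jhsad.sadas.com sfgff fsdfdsf',
    'hkgfl @jhsad.sadas.com dsfjdshflkds',
];

$common = null;
foreach ($texts as $text) {
    // Collect every "@" followed by a run of non-white-space characters.
    preg_match_all('/@\S+/', $text, $matches);
    $handles = array_unique($matches[0]);
    // Keep only the handles seen in every text so far.
    $common = $common === null ? $handles : array_intersect($common, $handles);
}
$common = array_values($common); // reindex from 0

print_r($common); // contains only '@jhsad.sadas.com'
```

This still does not cover signatures that are plain sentences, such as the Author2 example.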