How can I search an XML file for multiple keywords with PHP and return the containing tag?

Problem description:

I have an xml file like this, which stores video subtitles:

<videos>
    <video>
        <id>1</id>
        <enSub>Hello Foo! Good morning!</enSub>
        <cnSub>你好 Foo! 早上好!</cnSub>
    </video>
    <video>
        <id>2</id>
        <enSub>Hello Bar! Good afternoon!</enSub>
        <cnSub>你好 Bar! 下午好!</cnSub>
    </video>
</videos>

I want to search this XML for several keywords at once. For example, if I enter "hello morning" in the search box, the result should be the video element with id "1".

I suspect that XPath in PHP can only match a single keyword at a time and has to iterate through the whole tree, and I'm not confident that I can write a function that performs well.

I tried an external service, Google Custom Search, to search my site, but that does not work because I don't have a separate page for each video: I pass the video id as a parameter to a single video-play page.

I also thought of regular expressions, but I don't know how to handle keywords that appear in a different order.

So is there a search engine I can use to search for multiple keywords and pinpoint a video? I designed this to help my users quickly find a video they have watched.

I googled a lot. It's really slow, and sometimes I can't access Google at all from where I am in China. I tried "multiple keywords search xml" as the search terms. Maybe my English isn't good enough for Google to understand my intent. I hope you understand my question.

Thank you so much!!

Please see my example code below on how to accomplish this.

<?php
$xml = <<<XML
<videos>
    <video>
        <id>1</id>
        <enSub>Hello Foo! Good morning!</enSub>
        <cnSub>你好 Foo! 早上好!</cnSub>
    </video>
    <video>
        <id>2</id>
        <enSub>Hello Bar! Good afternoon!</enSub>
        <cnSub>你好 Bar! 下午好!</cnSub>
    </video>
</videos>
XML;
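// Lowercase the XML so we can do a case-insensitive search. Note that this
// also lowercases tag names (enSub becomes ensub), and the search text
// must be lowercase as well.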
$xml = strtolower($xml);
// Create a DOMDocument based on the xml.
$dom = new DOMDocument;
$dom->loadXML($xml);
// Create an xpath based on the dom document so we can search it.
$xpath = new DOMXpath($dom);
// Search for any video tag that contains the text good morning.
$nodes = $xpath->query('//video[contains(.,\'good morning\')]');
// Iterate all nodes
foreach($nodes as $node){
    // find the ID node and print its content.
    var_dump($xpath->query('id',$node)->item(0)->textContent);
}

-- Edit

I reread your post and it looks like you're using keywords and not strings. If that's the case, then try this snippet on for size:

<?php
$xml = <<<XML
<videos>
    <video>
        <id>1</id>
        <enSub>Hello Foo! Good morning!</enSub>
        <cnSub>你好 Foo! 早上好!</cnSub>
    </video>
    <video>
        <id>2</id>
        <enSub>Hello Bar! Good afternoon!</enSub>
        <cnSub>你好 Bar! 下午好!</cnSub>
    </video>
</videos>
XML;
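// Lowercase the XML so we can do a case-insensitive search. Note that this
// also lowercases tag names (enSub becomes ensub), and the keywords
// must be lowercase as well.
$xml = strtolower($xml);
// Create a DOMDocument based on the xml.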
$dom = new DOMDocument;
$dom->loadXML($xml);
// Create an xpath based on the dom document so we can search it.
$xpath = new DOMXpath($dom);
// Define the search keywords
$searchKeywords = array('good','hello');
// Iterate all of them to make them into valid xpath
$searchKeywords = array_map(
    function($keyword){
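        // XPath 1.0 has no escape sequence for quotes, so a backslash does not help.
        // Use double quotes as the delimiter when the keyword contains a single quote.
        if (strpos($keyword,'\'') === false) {
            return 'contains(.,\''.$keyword.'\')';
        }
        return 'contains(.,"'.str_replace('"','',$keyword).'")';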
    },
    $searchKeywords
);
// Implode all the keywords using "and"; you could change this to be
// an "or" condition if you so desire.
$searchKeywords = implode(' and ',$searchKeywords);
// The search keywords now look like contains(.,'good') and contains(.,'hello')
// Search for any video tag that contains all of the keywords.
$nodes = $xpath->query('//video['.$searchKeywords.']');
// Iterate all nodes
foreach($nodes as $node){
    // find the ID node and print its content.
    var_dump($xpath->query('id',$node)->item(0)->textContent);
}
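
Lowercasing the whole XML string also lowercases the tag names and the stored subtitles. As an alternative, here is a minimal sketch that leaves the document untouched and lowercases only inside the XPath expression, using the XPath 1.0 translate() function. The helper name findVideoIds is hypothetical and not part of the answer above, and the approach only folds the ASCII letters A-Z, so accented letters would need different handling.

```php
<?php
$xml = <<<XML
<videos>
    <video>
        <id>1</id>
        <enSub>Hello Foo! Good morning!</enSub>
        <cnSub>你好 Foo! 早上好!</cnSub>
    </video>
    <video>
        <id>2</id>
        <enSub>Hello Bar! Good afternoon!</enSub>
        <cnSub>你好 Bar! 下午好!</cnSub>
    </video>
</videos>
XML;

// Hypothetical helper: returns the ids of all <video> elements whose text
// contains every keyword, ignoring ASCII case.
function findVideoIds(string $xml, array $keywords): array
{
    $dom = new DOMDocument;
    $dom->loadXML($xml);
    $xpath = new DOMXPath($dom);

    $upper = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';
    $lower = 'abcdefghijklmnopqrstuvwxyz';

    $conditions = array();
    foreach ($keywords as $keyword) {
        // Quotes are dropped because XPath 1.0 cannot escape them
        // inside a string literal (a simplification for this sketch).
        $keyword = str_replace(array('"', "'"), '', strtolower($keyword));
        $conditions[] = "contains(translate(., '$upper', '$lower'), '$keyword')";
    }
    if (!$conditions) {
        return array();
    }

    $ids = array();
    $query = '//video[' . implode(' and ', $conditions) . ']';
    foreach ($xpath->query($query) as $node) {
        // The tag names keep their original case, so 'id' still matches.
        $ids[] = $xpath->query('id', $node)->item(0)->textContent;
    }
    return $ids;
}

print_r(findVideoIds($xml, array('HELLO', 'morning')));
```

Because the keywords are joined with "and", their order in the search box does not matter.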

First of all, your XML is messy: the opening and closing tags have to match. You can use DOMDocument to work with XML.

$searchStr ="hello afternoon";
$searchArr = explode(" ",$searchStr);
$result = array();
$xmlData = "<videos>
    <video>
        <id>1</id>
        <enSub>Hello Foo! Good morning!</enSub>
        <cnSub>你好 Foo! 早上好!</cnSub>
    </video>
    <video>
        <id>2</id>
        <enSub>Hello Bar! Good afternoon!</enSub>
        <cnSub>你好 Bar! 下午好!</cnSub>
    </video>
</videos>";

$dom = new DOMDocument();
$dom->loadXML($xmlData);
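foreach ($dom->documentElement->childNodes as $node) {
    // Skip whitespace text nodes; only the <video> elements are of interest.
    if ($node->nodeType == XML_ELEMENT_NODE) {
        $enSub = $node->getElementsByTagName('enSub')->item(0)->nodeValue;
        $cnSub = $node->getElementsByTagName('cnSub')->item(0)->nodeValue;
        $id = $node->getElementsByTagName('id')->item(0)->nodeValue;
        foreach ($searchArr as $val) {
            // stripos() is case-insensitive. Compare with !== false, because a
            // match at position 0 would otherwise be treated as "not found".
            // A video is added as soon as any one keyword matches ("or" search).
            if (stripos($enSub, $val) !== false) {
                $result[$id] = array(
                    'id' => $id,
                    'enSub' => $enSub,
                    'cnSub' => $cnSub
                );
            }
        }
    }
}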
echo "<pre>";
print_r($result);


I guess you could use a search server like Elasticsearch. It uses Lucene to index any kind of content. The indexed content can then be queried via a JSON API.

This of course only makes sense when you are constantly working with a large amount of data.

The other approach would be to parse the xml and build up an array which has each term in the sub-tag as an index. The value would then be an array containing the ids of the movies which have that term in their respective tag. Basically you are building up a simple data index of your own.
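
Building such an index could look roughly like the sketch below. It assumes the XML layout from the question, and the helper name buildIndex is hypothetical. Terms are lowercased so that lookups can be case-insensitive, which is an assumption beyond what is described here: search terms would need the same lowercasing before the lookup. Splitting on non-letter characters works for space-separated words only, so a run of Chinese characters is indexed as a single term.

```php
<?php
$xml = <<<XML
<videos>
    <video>
        <id>1</id>
        <enSub>Hello Foo! Good morning!</enSub>
        <cnSub>你好 Foo! 早上好!</cnSub>
    </video>
    <video>
        <id>2</id>
        <enSub>Hello Bar! Good afternoon!</enSub>
        <cnSub>你好 Bar! 下午好!</cnSub>
    </video>
</videos>
XML;

// Hypothetical helper: builds array(term => array(video ids)) in one pass.
function buildIndex(string $xml): array
{
    $index = array();
    foreach (simplexml_load_string($xml)->video as $video) {
        $id = (int) $video->id;
        $text = strtolower((string) $video->enSub . ' ' . (string) $video->cnSub);
        // Split on anything that is not a letter or a digit.
        $terms = preg_split('/[^\p{L}\p{N}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
        // array_unique() avoids listing the same id twice for one term.
        foreach (array_unique($terms) as $term) {
            $index[$term][] = $id;
        }
    }
    return $index;
}

$index = buildIndex($xml);
print_r($index);
```

For a larger file the resulting array could be cached, for example with serialize() or json_encode(), so the XML is parsed only when it changes.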

You could then query your index like this:

<?php

$index = array(
    'Hello' => array(1,3),
    'World' => array(1),
    'Good' => array(2),
    'Morning' => array(2),
    'Vietnam' => array(2,3),
);

$searchTerms = array('Hello', 'World');

$found = null;
foreach($searchTerms as $term){
    if(array_key_exists($term, $index)){
        if(is_null($found)){
            $found = $index[$term];
        } else {
            $found = array_intersect($found, $index[$term]);
        }
    } else {
        $found = array();
        break;
    }
}

print_r($found);

The main benefit of this approach is that you would only have to traverse the xml document once while having a rather fast search. BTW - if you want to treat the search terms with OR instead of AND you can use array_merge and array_unique instead of array_intersect.
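
As a rough sketch of that OR variant, reusing the same sample index as above: every id that matches at least one term is collected, and duplicates are removed at the end. The array_values() call is an addition for tidy, consecutive keys and is not required for correctness.

```php
<?php

$index = array(
    'Hello' => array(1,3),
    'World' => array(1),
    'Good' => array(2),
    'Morning' => array(2),
    'Vietnam' => array(2,3),
);

$searchTerms = array('Hello', 'World');

$found = array();
foreach($searchTerms as $term){
    // Unknown terms are simply skipped in an OR search.
    if(array_key_exists($term, $index)){
        $found = array_merge($found, $index[$term]);
    }
}
// array_unique() keeps the original keys, so reindex afterwards.
$found = array_values(array_unique($found));

print_r($found);
```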

Somewhere in the middle would be the approach to set up a real database like MySQL and do the above search in a query.

It really depends on what you want to accomplish.