How do I parse a space-delimited string in PHP?

Question:

Part of the PHP application I'm building parses an RSS feed of upcoming jobs and internships. The <description> for each feed entry is a series of tags or labels containing four standard pieces of information:

  1. Internship or job
  2. Full or part time
  3. Type (one of 4 types: Local Gov, HR, Non-profit, Other)
  4. Name of organization

However, everything is space-delimited, turning each entry into a mess like this:

  • Internship Full time Local Gov NASA
  • Job Part time HR Deloitte
  • Job Full time Non-profit United Way

I'm trying to parse each line and use the pieces of the string as variables. If this list were delimited in any standard way, I could easily use something like list($job, $time, $type, $name) = explode(",", $description) to parse the string and use the pieces individually.

I can't do that with this data, though. If I use explode(" ") I'll get lots of useless variables ("Full", "time", "Local", "Gov", for example).

Though the list isn't delimited, the first three pieces of information are standard and can only be one of 2–4 different options, essentially creating a dictionary of allowable terms (except the last one—the name of the organization—which is variable). Because of this it seems like I should be able to parse these strings, but I can't think of the best/cleanest/fastest way to do it.

preg_replace seems like it would require lots of messy regexes; a series of if/then statements (if the string contains "Local Gov" set $type to "Local Gov") seems tedious and would only capture the first three variables.

So, what's the most efficient way to parse a non-delimited string against a partial dictionary of allowed strings?

Update: I have no control over the structure of the incoming feed data. If I could I'd totally delimit this, but it's sadly not possible…

Update 2: To clarify, the first three options can only be the following:

  1. Internship | Job
  2. Full time | Part time
  3. Local Gov | HR | Non-profit | Other

That's the pseudo dictionary I'm talking about. I need to somehow strip those strings out of the main string and use what's left over as the organization name.

It's just a matter of getting your hands dirty it seems:

$input = 'Internship Full time Local Gov NASA';

// Preconfigure known data here; these will end up
// in the output array with the keys defined here
$known_data = array(
    'job'  => array('Internship', 'Job'),
    'time' => array('Full time', 'Part time'),
    // add more known strings here
);

$parsed = array();
foreach($known_data as $key => $options) {
    foreach($options as $option) {
        if(substr($input, 0, strlen($option)) == $option) {
            // Skip recognized token and next space
            $input = substr($input, strlen($option) + 1);
            $parsed[$key] = $option;
            break;
        }
    }
}

// Drop all remaining tokens into $parsed with numeric
// keys; you could do something else with them if desired
$parsed += explode(' ', $input);

Try exploding the string on ' ', then loop over the pieces with foreach and strip out the keywords; you will probably have to explode again on ' ' afterwards.

function startsWith($key, $data) {
   // get the length of the key we are looking for
   $len = strlen($key);
   // Check if the key matches the initial portion of the string
   if ($key === substr($data, 0, $len)) {
      // if yes return the remainder of the string
      return substr($data, $len);
   } else {
      // return false
      return false;
   }
}

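This lets you check whether the string starts with a given term and, if it does, carry on processing the remainder that the function returns. Note that the remainder still begins with the separating space.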
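As a rough illustration of how startsWith() might be applied to the three dictionaries from the question, here is a minimal sketch. The takeTerm() helper is hypothetical (it is not part of the answer above), and startsWith() is repeated in condensed form only so the snippet runs on its own. It assumes the fields always appear in the order job, time, type.

```php
<?php
// Repeated (condensed) from above so this sketch runs on its own.
function startsWith($key, $data) {
    $len = strlen($key);
    return ($key === substr($data, 0, $len)) ? substr($data, $len) : false;
}

// Hypothetical helper: peel one dictionary term off the front of $data.
// Returns array(matched term, remainder), or null if nothing matched.
function takeTerm(array $options, $data) {
    foreach ($options as $option) {
        $rest = startsWith($option, $data);
        // Compare against false explicitly: an empty remainder is still a match.
        if ($rest !== false) {
            return array($option, ltrim($rest));
        }
    }
    return null;
}

$data = 'Job Part time HR Deloitte';
list($job, $data)  = takeTerm(array('Internship', 'Job'), $data);
list($time, $data) = takeTerm(array('Full time', 'Part time'), $data);
list($type, $data) = takeTerm(array('Local Gov', 'HR', 'Non-profit', 'Other'), $data);
$name = $data; // whatever is left over is the organization name

echo "$job, $time, $type, $name\n"; // prints "Job, Part time, HR, Deloitte"
```

Because the check is a plain prefix comparison, a term such as "HR" would also match the start of a longer word; if that matters for your feed, test for a following space as well.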

<?php

$a = array (
'Internship Full time Local Gov NASA',
'Job Part time HR Deloitte',
'Job Full time Non-profit United Way',
);


foreach ($a as $s)
{
    if (preg_match ('/(Internship|Job)\s+(Part time|Full time)\s+(Local Gov|HR|Non-profit|Other)\s+(.*)/', $s, $match))
    {
        array_shift ($match);
        list($job, $time, $type, $name) =  $match;

        echo "$job, $time, $type, $name\n";
    }

}

Obviously, the optimal thing to do would be to change the RSS feed to use a different delimiter or (even better) put the four items into separate tags/elements/attributes/whatever.

But assuming that's not possible: Given what you describe, I would focus on making the code clear to read and maintain (and modify) at the expense of performance and compactness. The code will be larger, and it won't scale well if you go from 4 fields to 40 fields, but if you are confident that things won't change so much, you and anyone who has to take over maintaining the code will be happier. (Include a comment explaining the space-delimiting problem so that people understand why you did it the way you did.)

So, rethink the problem. Instead of parsing the string all at once, figure out how to pull just the first item off. (I would match each of the possibilities with preg_match() using ^ in the regexp to indicate that the match has to appear at the start of the string. If the regexp is really long because the dictionary is big but there are no special chars to worry about, consider storing the dictionary as an array and using implode() to create a string delimited by | to use as your regexp.)

Do that three times for the first three elements (probably removing each match from the string as you go); whatever remains is the fourth element.

Maybe put the retrieval of each element into its own function, and have each of those call one shared helper that is passed the relevant dictionary. The shared helper can then do the implode() and pull the matching substring off the front of the string.

Something like that, anyway. It won't be compact code, but someone reading it will be able to tell what's going on and the regexps won't be too crazy.
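A minimal sketch of that idea might look like the following. The function names (pullTerm, getJob, getTime, getType) are made up for illustration, and the code assumes the three fields always appear in this order. It also runs each term through preg_quote(), which goes slightly beyond what is described above but guards against special characters in the dictionary.

```php
<?php
// Hypothetical shared helper: remove the first dictionary term from the
// start of $input and return it, or return null if no term matches.
function pullTerm(array $dictionary, &$input) {
    // Escape each term, then join with | to build the alternation.
    $alternatives = implode('|', array_map(function ($term) {
        return preg_quote($term, '/');
    }, $dictionary));

    // ^ anchors the match to the start; \s* swallows the separating space.
    if (preg_match('/^(' . $alternatives . ')\s*/', $input, $match)) {
        $input = (string) substr($input, strlen($match[0]));
        return $match[1];
    }
    return null;
}

// The feed has no real delimiter, so fields are peeled off one at a time.
function getJob(&$input)  { return pullTerm(array('Internship', 'Job'), $input); }
function getTime(&$input) { return pullTerm(array('Full time', 'Part time'), $input); }
function getType(&$input) { return pullTerm(array('Local Gov', 'HR', 'Non-profit', 'Other'), $input); }

$input = 'Job Full time Non-profit United Way';
$job  = getJob($input);
$time = getTime($input);
$type = getType($input);
$name = $input; // whatever is left is the organization name

echo "$job, $time, $type, $name\n"; // prints "Job, Full time, Non-profit, United Way"
```

Each getter returns null when its field is missing, so the caller can decide whether to skip the entry or fall back to a default.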

If the dictionary above is complete, you can just take out non-functional words.

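// Remove the filler words together with their leading space,
// so no double spaces are left behind
$input = str_replace(array(' time', ' Gov'), '', $input);

Now each of the first three fields is a single word, and you can split on spaces.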
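To make the follow-up step concrete, here is a small sketch. It assumes the filler words are removed together with their leading space, and that the organization name itself never contains " time" or " Gov"; if it does, this shortcut will damage the name.

```php
<?php
$input = 'Job Full time Non-profit United Way';

// Assumption: " time" and " Gov" only ever occur in the first three fields.
$input = str_replace(array(' time', ' Gov'), '', $input);

// A limit of 4 keeps a multi-word organization name in one piece.
list($job, $time, $type, $name) = explode(' ', $input, 4);

echo "$job, $time, $type, $name\n"; // prints "Job, Full, Non-profit, United Way"
```

Note that $time now holds "Full" or "Part" and $type holds "Local" rather than "Local Gov", so map them back to the full labels if you need those.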