preg_match() + regex does not work on a TXT file


Problem description:

Example 1:

I have a PDF document and used the online PDF Parser (www.pdfparser.org) to extract all of its content as text. I saved that content to a TXT file manually and tried to filter some data with a regular expression. Everything worked normally.


Example 2:

To automate the process, I downloaded the PDF Parser API and wrote a script that does the following:

1) Converts the PDF to text using the parseFile() API method.
2) Saves the content to a TXT file.
3) Tries to filter this TXT using a regular expression.


Example 1 worked and returned:

array (size=2)
  'mora_dia' =>
    array (size=1)
      0 => string 'R$ 3.44' (length=7)
  'multa' =>
    array (size=1)
      0 => string 'R$ 17,21' (length=8)

Example 2 did not work:

array (size=2)
  'mora_dia' =>
    array (size=0)
      empty
  'multa' =>
    array (size=0)
      empty

The data in the two TXT files looks identical, so why does the second example not work? (I have also tried doing this without saving to a TXT file, but that did not work either.)

Below are the codes of my two examples:

Example 1:

$data = file_get_contents('exemplo_01.txt');

$regex = [
    'mora_dia' => '/R\$ [0-9]{1,}\.[0-9]{1,}/i',
    'multa'    => '/R\$ [0-9]{1,}\,[0-9]{1,}/i'
];

foreach($regex as $title => $ex)
{
    preg_match($ex, $data, $matches[$title]);
}

var_dump($matches);

Example 2:

$parser = new \Smalot\PdfParser\Parser();
    $pdf = $parser->parseFile($PDFFile);
    $pages = $pdf->getPages();

    foreach ($pages as $page) {
        $PDFParse = $page->getText();
    }

    $txtName = __DIR__ . '/files/Txt/' . md5(uniqid(rand(), true)) . '.txt';
    $file  = fopen($txtName, 'w+');
    fwrite($file, $PDFParse);
    fclose($file);

    $dataTxt = file_get_contents($txtName);

    $regex = [
        'mora_dia' => '/R\$ [0-9]{1,}\.[0-9]{1,}/i',
        'multa'    => '/R\$ [0-9]{1,}\,[0-9]{1,}/i'
    ];

    foreach($regex as $title => $ex)
    {
        preg_match($ex, $dataTxt, $matches[$title]);
    }

Copying and pasting the output text manually seems to have changed its contents. Based on the pastebin output, the direct-to-file version contains non-breaking space characters rather than regular spaces. The non-breaking space is code point U+00A0 (hex 0xA0, decimal 160), as opposed to a regular space, U+0020 (hex 0x20, decimal 32).

In fact, it looks as though all the space characters in the direct-to-file example are non-breaking U+00A0 spaces.
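The mismatch can be reproduced with a minimal sketch. It assumes the extracted text is UTF-8, where the non-breaking space is stored as the two bytes 0xC2 0xA0. The sample strings are made up for illustration and are not taken from your PDF.

```php
<?php
// Hypothetical samples: the same visible text with two kinds of space.
$withSpace = "Mora: R$ 3.44";          // regular space, byte 0x20
$withNbsp  = "Mora: R$\xC2\xA03.44";   // non-breaking space, bytes 0xC2 0xA0 in UTF-8

// The original pattern from the question, which expects a regular space.
$original = '/R\$ [0-9]{1,}\.[0-9]{1,}/i';

// preg_match() returns 1 on a match and 0 when nothing matches.
$matchesSpace = preg_match($original, $withSpace);
$matchesNbsp  = preg_match($original, $withNbsp);

var_dump($matchesSpace, $matchesNbsp);
```

The pattern finds the amount only in the string that contains the regular space, which mirrors the difference between your two examples.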

To adapt your regular expression so that it accepts either type of space, put the hex code into a [] character class together with the regular space character, as in [ \xA0]. You will also need the /u flag, so that the pattern and the subject are both treated as UTF-8 and \xA0 stands for the code point U+00A0 rather than a single byte.

$regex = [
    'mora_dia' => '/R\$[ \xA0][0-9]{1,}\.[0-9]{1,}/iu',
    'multa'    => '/R\$[ \xA0][0-9]{1,},[0-9]{1,}/iu'
];

(Note: the comma does not require backslash-escaping.)

This works correctly, using your raw pastebin as input:

$str = file_get_contents('http://pastebin.com/raw.php?i=H7D5xJBH');
preg_match('/R\$[ \xa0][0-9]{1,}\.[0-9]{1,}/ui', $str, $matches);
var_dump($matches);

// Prints:
array(1) {
  [0] =>
  string(8) "R$ 3.44"
}
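
The same check can be sketched without the network request. The sample strings below are hypothetical and assume UTF-8 input:

```php
<?php
// Hypothetical samples, one per kind of space (UTF-8 encoded).
$samples = [
    'regular space'      => "R$ 3.44",
    'non-breaking space' => "R$\xC2\xA03.44",
];

// With /u, \xA0 in the pattern means the code point U+00A0,
// so the class [ \xA0] accepts either kind of space.
$pattern = '/R\$[ \xA0][0-9]{1,}\.[0-9]{1,}/iu';

$results = [];
foreach ($samples as $label => $text) {
    // Record whether each sample matched.
    $results[$label] = preg_match($pattern, $text) === 1;
}

var_dump($results);
```

Both samples should match, so one pattern covers the manually pasted text as well as the text written directly from the parser.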

A different solution might be to replace the non-breaking spaces with regular spaces in the entire text before applying your original regular expression:

// Replace all non-breaking spaces with regular spaces in the
// text string read from the file...
// In UTF-8, the non-breaking space (U+00A0) is encoded as the two
// bytes 0xC2 0xA0, and both are needed to replace it successfully.
$dataTxt = str_replace("\xC2\xA0", " ", $dataTxt);
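
As a rough end-to-end sketch of this approach, assuming the text is UTF-8 (so U+00A0 is the byte pair 0xC2 0xA0) and using a made-up sample string in place of your file contents:

```php
<?php
// Hypothetical sample with a UTF-8 non-breaking space after "R$".
$dataTxt = "Multa: R$\xC2\xA017,21";

// Swap the two-byte sequence for a plain space.
$normalized = str_replace("\xC2\xA0", ' ', $dataTxt);

// The original 'multa' pattern from the question now matches unchanged.
preg_match('/R\$ [0-9]{1,}\,[0-9]{1,}/i', $normalized, $matches);

var_dump($matches);
```

Normalizing once up front keeps every later pattern simple, at the cost of losing the distinction between the two kinds of space.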

Whenever you have input that you expect to be identical and that looks identical, be sure to inspect it with a tool capable of displaying each character's hex code. In this case, I copied your samples from pastebin into files and inspected them with Vim, where I have set up hex and ASCII display for the character under the cursor.
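
If you prefer to stay in PHP, bin2hex() gives a similar view. This is only a sketch with hypothetical sample strings, assuming UTF-8 text:

```php
<?php
// Hypothetical samples that look alike when printed.
$manual = "R$ 3.44";          // regular space
$direct = "R$\xC2\xA03.44";   // UTF-8 non-breaking space

// bin2hex() shows every byte as two hex digits, so hidden differences appear.
$hexManual = bin2hex($manual);
$hexDirect = bin2hex($direct);

echo $hexManual, PHP_EOL;
echo $hexDirect, PHP_EOL;

// The byte counts differ too, although both render as seven visible characters.
var_dump(strlen($manual), strlen($direct));
```

Look for 20 (regular space) versus c2a0 (non-breaking space) in the hex output.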

 $PDFParse ='';
 foreach ($pages as $page) {
     $PDFParse = $PDFParse.$page->getText();
 }

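Initialize $PDFParse as an empty string and append each page's text, as above; the loop in the question overwrites it on every iteration, so only the last page is kept. After fwrite(), also try calling fflush($file).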
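
A minimal sketch of that flow, using a hypothetical array of strings in place of the values returned by $page->getText() and a temporary file instead of your files/Txt directory:

```php
<?php
// Hypothetical stand-in for the strings returned by $page->getText().
$pageTexts = ["Page one: R$ 3.44\n", "Page two: R$ 17,21\n"];

// Start from an empty string and append, so earlier pages are not overwritten.
$PDFParse = '';
foreach ($pageTexts as $text) {
    $PDFParse .= $text;
}

// Write to a temporary file (hypothetical location), flush, then close.
$txtName = tempnam(sys_get_temp_dir(), 'pdf');
$file = fopen($txtName, 'w+');
fwrite($file, $PDFParse);
fflush($file);
fclose($file);

// Read the file back and clean up.
$dataTxt = file_get_contents($txtName);
unlink($txtName);
```

The text read back should contain every page, not just the last one.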