2016-06-19 1 views
0

PHP preg_split on spaces, but not inside tags

I am using preg_split("/\"[^\"]*\"(*SKIP)(*F)|\x20/", $input_line); and when I run it on phpliveregex.com it produces this array:

array(10 
    0=><b>test</b> 
    1=>or 
    2=><em>oh 
    3=>yeah</em> 
    4=>and 
    5=><i> 
    6=>oh 
    7=>yeah 
    8=></i> 
    9=>"ye we 'hold' it" 
) 

This is not what I want. The input should be split on spaces only outside HTML tags, like this:

array(6 
    0=><b>test</b> 
    1=>or 
    2=><em>oh yeah</em> 
    3=>and 
    4=><i>oh yeah</i> 
    5=>"ye we 'hold' it" 
) 

In this regex I have only managed to add an exception for double quotes, and I really need help adding others, such as the tags <img/><a></a><pre></pre><code></code><strong></strong><b></b><em></em><i></i>

Any explanation of how this regex works would also be appreciated.

+0

Just use '<' and '>': 'preg_split("/<[^<]*>(*SKIP)(*F)| /", $input_line);' –

+2

Use DOMDocument and DOMXPath. –

Answers

1

It is easier to use DOMDocument, because you don't need to describe what an HTML tag is or what it looks like. You only need to check the nodeType. When the node is a text node, split it with preg_match_all (which is more convenient than designing a pattern for preg_split):

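$html = 'spaces in a text node <b>test</b> or <em>oh yeah</em> and <i>oh yeah</i>
"ye we \'hold\' it"
"unclosed double quotes at the end';

// Wrap the fragment in a <div> so it has a single root element;
// LIBXML_HTML_NOIMPLIED stops libxml from adding <html>/<body> around it.
$dom = new DOMDocument;
$dom->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED);

// Direct children of the wrapper <div>: text nodes and elements.
$nodeList = $dom->documentElement->childNodes;

$results = [];

foreach ($nodeList as $childNode) {
    if ($childNode->nodeType === XML_TEXT_NODE) {
        // Text node: collect bare words and double-quoted strings
        // (a whitespace-only node yields no tokens and is skipped).
        if (preg_match_all('~[^\s"]+|"[^"]*"?~', $childNode->nodeValue, $m)) {
            $results = array_merge($results, $m[0]);
        }
    } else {
        // Any other node (here: an element) is kept whole as one item.
        $results[] = $dom->saveHTML($childNode);
    }
}

print_r($results);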

Note: I chose a default behavior for a quoted part that is left unclosed (no closing quote); feel free to change it.
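To illustrate what changing that default might look like, here is a minimal sketch comparing the pattern above with a stricter variant. The variable names ($text, $lenient, $strict) and the stricter pattern are only assumptions for the illustration, not part of the original code.

```php
<?php
// Sample text node content with an unclosed double quote at the end.
$text = 'and "ye we \'hold\' it" "unclosed double quotes';

// Pattern from the answer: the closing quote is optional, so the
// unclosed part is kept as one token up to the end of the text.
preg_match_all('~[^\s"]+|"[^"]*"?~', $text, $lenient);

// Assumed variant: the closing quote is required, so a stray opening
// quote is dropped and the words after it become separate tokens.
preg_match_all('~[^\s"]+|"[^"]*"~', $text, $strict);

print_r($lenient[0]); // last token: "unclosed double quotes (with its opening quote)
print_r($strict[0]);  // last tokens: unclosed, double, quotes
```

Which of the two is preferable depends on how unbalanced quotes should be treated in your input.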

Note 2: Sometimes the LIBXML_ constants are not defined. You can work around this by testing for the constant beforehand and defining it if needed:

if (!defined('LIBXML_HTML_NOIMPLIED')) 
    define('LIBXML_HTML_NOIMPLIED', 8192); 
+0

Finally, exactly what I needed. Thanks for saving my day, @Casimir –

+0

Oh, I need help: this code works perfectly on 'localhost', but after moving it to the server it has a problem with DOM. 'Warning: DOMDocument::loadHTML() expects parameter 2 to be long, string given in /home/u74' –

+0

@Al-Jazary: in some configurations the 'LIBXML_...' constants are not defined (that is why you get this message). In that case, test whether the constant exists at the start of the script with 'defined('constantName')' and, if it returns 'false', define it using 'define(...)'. The value of 'LIBXML_HTML_NOIMPLIED' is 8192. –

0

Description

Instead of using a split command, simply match the sections you want

<(?:(?:img)(?=[\s>\/])(?:[^>=]|=(?:'[^']*'|"[^"]*"|[^'"\s>]*))*\s?\/?>|(a|span|pre|code|strong|b|em|i)(?=[\s>\\])(?:[^>=]|=(?:'[^']*'|"[^"]*"|[^'"\s>]*))*\s?\/?>.*?<\/\1>)|(?:"[^"]*"|[^"<]*)*

Example

Live Demo

https://regex101.com/r/bK8iL3/1
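To try the expression from PHP rather than in the online demo, something like the following sketch could be used. The `~` delimiter, the nowdoc, the sample input and the variable names are assumptions made for the illustration. Note that this expression returns each run of text between tags as a single match, so a text run containing several words is not split on its spaces.

```php
<?php
// Pattern copied from the answer; "~" is an arbitrary delimiter choice,
// and the nowdoc keeps every backslash and quote exactly as written.
$pattern = <<<'REGEX'
~<(?:(?:img)(?=[\s>\/])(?:[^>=]|=(?:'[^']*'|"[^"]*"|[^'"\s>]*))*\s?\/?>|(a|span|pre|code|strong|b|em|i)(?=[\s>\\])(?:[^>=]|=(?:'[^']*'|"[^"]*"|[^'"\s>]*))*\s?\/?>.*?<\/\1>)|(?:"[^"]*"|[^"<]*)*~
REGEX;

$input = '<b>test</b> or <em>oh yeah</em> and <i>oh yeah</i> "ye we \'hold\' it"';

preg_match_all($pattern, $input, $m);

// The last alternative can match an empty string, so trim each match
// and drop the empty ones ("strlen" as callback keeps non-empty strings).
$parts = array_values(array_filter(array_map('trim', $m[0]), 'strlen'));

print_r($parts);
```

For this particular input the six items are the three elements, the words "or" and "and", and the quoted string.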

Sample text

Note the difficult edge case in the second paragraph

<b>test</b> or <strong> this </strong><em> oh yeah </em> and <i>oh yeah</i> Here we are "ye we 'hold' it" 

some<img/>gfsf<a html="droids.html" onmouseover=' var x=" Not the droid I am looking for " ; '>droides</a><pre></pre><code></code><strong></strong><b></b><em></em><i></i> 

Sample matches

MATCH 1 
0. [0-11] `<b>test</b>` 

MATCH 2 
0. [11-15] ` or ` 

MATCH 3 
0. [15-38] `<strong> this </strong>` 

MATCH 4 
0. [38-56] `<em> oh yeah </em>` 

MATCH 5 
0. [56-61] ` and ` 

MATCH 6 
0. [61-75] `<i>oh yeah</i>` 

MATCH 7 
0. [75-111] ` Here we are "ye we 'hold' it" some` 

MATCH 8 
0. [111-117] `<img/>` 

MATCH 9 
0. [117-121] `gfsf` 

MATCH 10 
0. [121-213] `<a html="droids.html" onmouseover=' var x=" Not the droid I am looking for " ; '>droides</a>` 

MATCH 11 
0. [213-224] `<pre></pre>` 

MATCH 12 
0. [224-237] `<code></code>` 

MATCH 13 
0. [237-254] `<strong></strong>` 

MATCH 14 
0. [254-261] `<b></b>` 

MATCH 15 
0. [261-270] `<em></em>` 

MATCH 16 
0. [270-277] `<i></i>` 

Explanation

NODE      EXPLANATION 
---------------------------------------------------------------------- 
    <      '<' 
---------------------------------------------------------------------- 
    (?:      group, but do not capture: 
---------------------------------------------------------------------- 
    (?:      group, but do not capture: 
---------------------------------------------------------------------- 
     img      'img' 
---------------------------------------------------------------------- 
    )      end of grouping 
---------------------------------------------------------------------- 
    (?=      look ahead to see if there is: 
---------------------------------------------------------------------- 
     [\s>\/]     any character of: whitespace (\n, \r, 
           \t, \f, and " "), '>', '\/' 
---------------------------------------------------------------------- 
    )      end of look-ahead 
---------------------------------------------------------------------- 
    (?:      group, but do not capture (0 or more 
          times (matching the most amount 
          possible)): 
---------------------------------------------------------------------- 
     [^>=]     any character except: '>', '=' 
---------------------------------------------------------------------- 
    |      OR 
---------------------------------------------------------------------- 
     =      '=' 
---------------------------------------------------------------------- 
     (?:      group, but do not capture: 
---------------------------------------------------------------------- 
     '      '\'' 
---------------------------------------------------------------------- 
     [^']*     any character except: ''' (0 or more 
           times (matching the most amount 
           possible)) 
---------------------------------------------------------------------- 
     '      '\'' 
---------------------------------------------------------------------- 
     |      OR 
---------------------------------------------------------------------- 
     "      '"' 
---------------------------------------------------------------------- 
     [^"]*     any character except: '"' (0 or more 
           times (matching the most amount 
           possible)) 
---------------------------------------------------------------------- 
     "      '"' 
---------------------------------------------------------------------- 
     |      OR 
---------------------------------------------------------------------- 
     [^'"\s>]*    any character except: ''', '"', 
           whitespace (\n, \r, \t, \f, and " 
           "), '>' (0 or more times (matching 
           the most amount possible)) 
---------------------------------------------------------------------- 
    )      end of grouping 
---------------------------------------------------------------------- 
    )*      end of grouping 
---------------------------------------------------------------------- 
    \s?      whitespace (\n, \r, \t, \f, and " ") 
          (optional (matching the most amount 
          possible)) 
---------------------------------------------------------------------- 
    \/?      '/' (optional (matching the most amount 
          possible)) 
---------------------------------------------------------------------- 
    >      '>' 
---------------------------------------------------------------------- 
    |      OR 
---------------------------------------------------------------------- 
    (      group and capture to \1: 
---------------------------------------------------------------------- 
     a      'a' 
---------------------------------------------------------------------- 
    |      OR 
---------------------------------------------------------------------- 
     span      'span' 
---------------------------------------------------------------------- 
    |      OR 
---------------------------------------------------------------------- 
     pre      'pre' 
---------------------------------------------------------------------- 
    |      OR 
---------------------------------------------------------------------- 
     code      'code' 
---------------------------------------------------------------------- 
    |      OR 
---------------------------------------------------------------------- 
     strong     'strong' 
---------------------------------------------------------------------- 
    |      OR 
---------------------------------------------------------------------- 
     b      'b' 
---------------------------------------------------------------------- 
    |      OR 
---------------------------------------------------------------------- 
     em      'em' 
---------------------------------------------------------------------- 
    |      OR 
---------------------------------------------------------------------- 
     i      'i' 
---------------------------------------------------------------------- 
    )      end of \1 
---------------------------------------------------------------------- 
    (?=      look ahead to see if there is: 
---------------------------------------------------------------------- 
     [\s>\\]     any character of: whitespace (\n, \r, 
           \t, \f, and " "), '>', '\\' 
---------------------------------------------------------------------- 
    )      end of look-ahead 
---------------------------------------------------------------------- 
    (?:      group, but do not capture (0 or more 
          times (matching the most amount 
          possible)): 
---------------------------------------------------------------------- 
     [^>=]     any character except: '>', '=' 
---------------------------------------------------------------------- 
    |      OR 
---------------------------------------------------------------------- 
     =      '=' 
---------------------------------------------------------------------- 
     (?:      group, but do not capture: 
---------------------------------------------------------------------- 
     '      '\'' 
---------------------------------------------------------------------- 
     [^']*     any character except: ''' (0 or more 
           times (matching the most amount 
           possible)) 
---------------------------------------------------------------------- 
     '      '\'' 
---------------------------------------------------------------------- 
     |      OR 
---------------------------------------------------------------------- 
     "      '"' 
---------------------------------------------------------------------- 
     [^"]*     any character except: '"' (0 or more 
           times (matching the most amount 
           possible)) 
---------------------------------------------------------------------- 
     "      '"' 
---------------------------------------------------------------------- 
     |      OR 
---------------------------------------------------------------------- 
     [^'"\s>]*    any character except: ''', '"', 
           whitespace (\n, \r, \t, \f, and " 
           "), '>' (0 or more times (matching 
           the most amount possible)) 
---------------------------------------------------------------------- 
    )      end of grouping 
---------------------------------------------------------------------- 
    )*      end of grouping 
---------------------------------------------------------------------- 
    \s?      whitespace (\n, \r, \t, \f, and " ") 
          (optional (matching the most amount 
          possible)) 
---------------------------------------------------------------------- 
    \/?      '/' (optional (matching the most amount 
          possible)) 
---------------------------------------------------------------------- 
    >      '>' 
---------------------------------------------------------------------- 
    .*?      any character (0 or more times (matching 
          the least amount possible)) 
---------------------------------------------------------------------- 
    <      '<' 
---------------------------------------------------------------------- 
    \/      '/' 
---------------------------------------------------------------------- 
    \1      what was matched by capture \1 
---------------------------------------------------------------------- 
    >      '>' 
---------------------------------------------------------------------- 
)      end of grouping 
---------------------------------------------------------------------- 
|      OR 
---------------------------------------------------------------------- 
    (?:      group, but do not capture (0 or more times 
          (matching the most amount possible)): 
---------------------------------------------------------------------- 
    "      '"' 
---------------------------------------------------------------------- 
    [^"]*     any character except: '"' (0 or more 
          times (matching the most amount 
          possible)) 
---------------------------------------------------------------------- 
    "      '"' 
---------------------------------------------------------------------- 
    |      OR 
---------------------------------------------------------------------- 
    [^"<]*     any character except: '"', '<' (0 or 
          more times (matching the most amount 
          possible)) 
---------------------------------------------------------------------- 
)*      end of grouping 
----------------------------------------------------------------------
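
If you would rather keep preg_split, the (*SKIP)(*F) idea from the question can probably be extended in the same spirit. The left side of the alternation matches everything that must stay in one piece; (*SKIP) records the position reached, (*F) forces that match to fail, and the engine resumes after the skipped text, so only the spaces matched by the right side act as split points. The pattern below is a hypothetical sketch with a simplified tag list, not a tested general solution: it does not handle nested elements of the same name or a '>' inside an attribute value.

```php
<?php
// Hypothetical pattern: skip quoted strings, whole paired elements and
// single tags, then split on the spaces that remain.
$pattern = '~(?:"[^"]*"'                                  // a double-quoted string
         . '|<(a|pre|code|strong|b|em|i)\b[^>]*>.*?</\1>' // a paired element with its content
         . '|<[^>]*>'                                     // any other single tag, e.g. <img/>
         . ')(*SKIP)(*F)'                                 // discard what was matched so far
         . '|\x20+~s';                                    // otherwise split on spaces

$input = '<b>test</b> or <em>oh yeah</em> and <i>oh yeah</i> "ye we \'hold\' it"';

// PREG_SPLIT_NO_EMPTY drops empty pieces, e.g. from leading spaces.
$tokens = preg_split($pattern, $input, -1, PREG_SPLIT_NO_EMPTY);

print_r($tokens);
```

With this input the result has six items, with '<em>oh yeah</em>' and '<i>oh yeah</i>' each kept whole.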