PHP: Split a multibyte string (word) into separate characters

20

Try a regular expression with the 'u' option, for example:

$chars = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);
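The 'u' modifier makes PCRE treat both the pattern and the subject as UTF-8, so this one-liner only applies to UTF-8 input. For other encodings, one possible workaround is to convert to UTF-8, split, and convert back. The sketch below assumes the mbstring extension is available; the helper name `split_chars_any_encoding` is hypothetical and not part of the original answer.

```php
<?php
// Hypothetical helper: split a string in any mbstring-supported encoding
// by round-tripping through UTF-8 (assumes ext/mbstring is loaded).
function split_chars_any_encoding($string, $encoding = 'UTF-8')
{
    $isUtf8 = strtoupper($encoding) === 'UTF-8';
    $utf8 = $isUtf8 ? $string : mb_convert_encoding($string, 'UTF-8', $encoding);

    // With the 'u' modifier, preg_split() returns false on invalid UTF-8.
    $chars = preg_split('//u', $utf8, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        throw new InvalidArgumentException('Input could not be read as ' . $encoding);
    }
    if ($isUtf8) {
        return $chars;
    }

    // Convert every character back to the caller's encoding.
    return array_map(function ($char) use ($encoding) {
        return mb_convert_encoding($char, $encoding, 'UTF-8');
    }, $chars);
}

// Usage: a BIG-5 string built from a UTF-8 literal for the demonstration.
$big5 = mb_convert_encoding('中文', 'BIG-5', 'UTF-8');
var_dump(count(split_chars_any_encoding($big5, 'BIG-5')));
```

Checking the return value for `false` matters here, because an invalid UTF-8 subject would otherwise be silently treated as "no characters".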

2010-03-31 21:56:49 user187291

+0

This only works with UTF-8 encoding. –

+0

I agree with Petr. I tried it with BIG5 and it doesn't work! –

9

An ugly way to do it is the following:

mb_internal_encoding("UTF-8"); // this IS A MUST!! PHP has trouble with multibyte 
           // when no internal encoding is set! 
$string = "....."; 
$chars = array(); 
for ($i = 0; $i < mb_strlen($string); $i++) { 
    $chars[] = mb_substr($string, $i, 1); // only one char to go to the array 
}

You could also try mb_split, setting the internal encoding before calling it.
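On PHP 7.4 and later, mbstring also ships `mb_str_split()`, which performs the same loop natively. The following is a minimal sketch, assuming mbstring is loaded; the wrapper name `mb_chars` is hypothetical, and it falls back to the `mb_substr` loop above on older versions. Passing the encoding explicitly avoids depending on `mb_internal_encoding()`.

```php
<?php
// Hypothetical wrapper: split into chunks of $length characters.
function mb_chars($string, $length = 1, $encoding = 'UTF-8')
{
    // mb_str_split() exists from PHP 7.4 onwards.
    if (function_exists('mb_str_split')) {
        return mb_str_split($string, $length, $encoding);
    }

    // Fallback: the mb_substr() loop, with the length computed only once.
    $chunks = array();
    $total = mb_strlen($string, $encoding);
    for ($i = 0; $i < $total; $i += $length) {
        $chunks[] = mb_substr($string, $i, $length, $encoding);
    }
    return $chunks;
}

print_r(mb_chars('主楼怎么走'));    // one character per element
print_r(mb_chars('abcdefg', 5));   // chunks of up to five characters
```

The `$length` parameter also covers splitting text into equal parts of several characters rather than single ones.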

2010-03-31 20:46:26 bisko

+0

'mb_internal_encoding("UTF-8");' This helped me a lot. – ivkremer

+0

I love this answer, since I had a hard time finding a simple and safe way to split text into equal parts. Not just into single characters, but into parts. I only had to change the increment to $i += 5 and pass 5 as the last parameter of mb_substr, and my text was split into strings of 5 UTF-8 characters. Thanks a lot. –

+0

Excellent solution :) – clarkk

3

You can use the grapheme functions (PHP 5.3 or intl 1.0) and IntlBreakIterator (PHP 5.5 or intl 3.0). The following code shows the difference between the intl functions on one side and mbstring and PCRE on the other.

// http://www.php.net/manual/function.grapheme-strlen.php 
$string = "a\xCC\x8A" // 'a' + COMBINING RING ABOVE (U+030A), canonically equivalent to U+00E5 
     ."o\xCC\x88"; // 'o' + COMBINING DIAERESIS (U+0308), canonically equivalent to U+00F6 

$expected = ["a\xCC\x8A", "o\xCC\x88"]; 
$expected2 = ["a", "\xCC\x8A", "o", "\xCC\x88"]; 

var_dump(
    $expected === str_to_array($string), 
    $expected === str_to_array2($string), 
    $expected2 === str_to_array3($string), 
    $expected2 === str_to_array4($string), 
    $expected2 === str_to_array5($string) 
); 

function str_to_array($string) 
{ 
    $length = grapheme_strlen($string); 
    $ret = []; 

    for ($i = 0; $i < $length; $i += 1) { 
     $ret[] = grapheme_substr($string, $i, 1); 
    } 

    return $ret; 
} 

function str_to_array2($string) 
{ 
    $it = IntlBreakIterator::createCharacterInstance('en_US'); 
    $it->setText($string); 

    $ret = []; 
    $prev = 0; 

    foreach ($it as $pos) { 

     $char = substr($string, $prev, $pos - $prev); 

     if ('' !== $char) { 
      $ret[] = $char; 
     } 

     $prev = $pos; 
    } 

    return $ret; 
} 

function str_to_array3($string) 
{ 
    $it = IntlBreakIterator::createCodePointInstance(); 
    $it->setText($string); 

    $ret = []; 
    $prev = 0; 

    foreach ($it as $pos) { 

     $char = substr($string, $prev, $pos - $prev); 

     if ('' !== $char) { 
      $ret[] = $char; 
     } 

     $prev = $pos; 
    } 

    return $ret; 
} 

function str_to_array4($string) 
{ 
    $length = mb_strlen($string, "UTF-8"); 
    $ret = []; 

    for ($i = 0; $i < $length; $i += 1) { 
     $ret[] = mb_substr($string, $i, 1, "UTF-8"); 
    } 

    return $ret; 
} 

function str_to_array5($string) { 
    return preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY); 
}
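If the intl extension is not available, the grapheme-versus-code-point distinction can still be observed with PCRE alone, because the `\X` escape matches one extended grapheme cluster. This is a swapped-in technique rather than one of the functions above, and the helper name `split_graphemes` is hypothetical.

```php
<?php
// Hypothetical helper: split into grapheme clusters using PCRE's \X.
function split_graphemes($string)
{
    // preg_match_all() returns false when the subject is not valid UTF-8.
    if (preg_match_all('/\X/u', $string, $matches) === false) {
        return array();
    }
    return $matches[0];
}

$string = "a\xCC\x8A" . "o\xCC\x88"; // two base letters, each with a combining mark

$graphemes  = split_graphemes($string);
$codePoints = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);

printf(
    "%d graphemes, %d code points, %d bytes\n",
    count($graphemes),
    count($codePoints),
    strlen($string)
);
// prints "2 graphemes, 4 code points, 6 bytes"
```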

In a production environment, you should replace invalid byte sequences with the substitute character, since almost none of the grapheme and mbstring functions can handle invalid byte sequences. If you are interested, see my earlier answer.
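A minimal sketch of such a scrubbing step, assuming the mbstring extension and a UTF-8 internal encoding (the default on current PHP versions). The helper name `scrub_utf8` is hypothetical, and this is not necessarily the approach taken in the earlier answer mentioned above.

```php
<?php
// Hypothetical helper: replace invalid UTF-8 byte sequences with U+FFFD.
function scrub_utf8($string)
{
    if (mb_check_encoding($string, 'UTF-8')) {
        return $string; // already valid, nothing to do
    }

    // Temporarily use REPLACEMENT CHARACTER as the substitute character.
    $previous = mb_substitute_character();
    mb_substitute_character(0xFFFD);

    // Converting UTF-8 to UTF-8 substitutes every invalid sequence.
    $clean = mb_convert_encoding($string, 'UTF-8', 'UTF-8');

    mb_substitute_character($previous); // restore the caller's setting
    return $clean;
}

$dirty = "a\xFFb"; // 0xFF can never appear in UTF-8
var_dump(mb_check_encoding(scrub_utf8($dirty), 'UTF-8'));
```

Scrubbing once at the input boundary means the splitting functions can then assume valid input.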

If performance is not a concern, htmlspecialchars and htmlspecialchars_decode can be used. The advantage of this approach is that it supports various encodings other than UTF-8.

function str_to_array6($string, $encoding = 'UTF-8') 
{ 
    $ret = []; 
    str_replace_callback($string, function($char, $index) use (&$ret) { $ret[] = $char; return ''; }, $encoding); 
    return $ret; 
} 

function str_replace_callback($string, $callable, $encoding = 'UTF-8') 
{ 
    $string = str_scrub($string, $encoding); 
    $str_size = strlen($string); // measure after scrubbing, which can change the byte length 

    $ret = ''; 
    $char = ''; 
    $index = 0; 

    for ($pos = 0; $pos < $str_size; ++$pos) { 

     $char .= $string[$pos]; 

     if (str_check_encoding($char, $encoding)) { 

      $ret .= $callable($char, $index); 
      $char = ''; 
      ++$index; 
     } 

    } 

    return $ret; 
} 

function str_check_encoding($string, $encoding = 'UTF-8') 
{ 
    $string = (string) $string; 
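    // Decode with the same quote flag used for encoding; before PHP 8.1 the decoder default was ENT_COMPAT. 
    return $string === htmlspecialchars_decode(htmlspecialchars($string, ENT_QUOTES, $encoding), ENT_QUOTES); 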
} 

function str_scrub($string, $encoding = 'UTF-8') 
{ 
    return htmlspecialchars_decode(htmlspecialchars($string, ENT_SUBSTITUTE, $encoding)); 
}
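The check above relies on a documented behaviour of `htmlspecialchars()`: without `ENT_SUBSTITUTE` or `ENT_IGNORE`, it returns an empty string when the input is not valid in the given encoding. A small sketch of that behaviour in isolation; the helper name `is_valid_in_encoding` is hypothetical.

```php
<?php
// Hypothetical helper isolating the validity test used by the htmlspecialchars approach.
function is_valid_in_encoding($bytes, $encoding = 'UTF-8')
{
    // An empty input is trivially valid; otherwise an empty result
    // signals an invalid or incomplete byte sequence.
    return $bytes === '' || htmlspecialchars($bytes, ENT_QUOTES, $encoding) !== '';
}

$complete = "\xE4\xB8\xBB"; // U+4E3B encoded as three bytes
$partial  = "\xE4\xB8";     // the same sequence with its last byte missing

var_dump(is_valid_in_encoding($complete));
var_dump(is_valid_in_encoding($partial));
```

This is why the loop can append one byte at a time and emit a character as soon as the accumulated bytes pass the check.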

If you want to learn the UTF-8 specification, byte manipulation is a good way to practice.

function str_to_array7($string) 
{ 
    // REPLACEMENT CHARACTER (U+FFFD) 
    $substitute = "\xEF\xBF\xBD"; 
    $size = strlen($string); 
    $ret = []; 

    for ($i = 0; $i < $size; $i += 1) { 

     if ($string[$i] <= "\x7F") { 

      $ret[] = $string[$i]; 

     } elseif ("\xC2" <= $string[$i] && $string[$i] <= "\xDF") { 

      if (!isset($string[$i+1])) { 

       $ret[] = $substitute; 
       return $ret; 

      } elseif ($string[$i+1] < "\x80" || "\xBF" < $string[$i+1]) { 

       $ret[] = $substitute; 

      } else { 

       $ret[] = substr($string, $i, 2); 
       $i += 1; 

      } 

     } elseif ("\xE0" <= $string[$i] && $string[$i] <= "\xEF") { 

      $left = "\xE0" === $string[$i] ? "\xA0" : "\x80"; 
      $right = "\xED" === $string[$i] ? "\x9F" : "\xBF"; 

      if (!isset($string[$i+1])) { 

       $ret[] = $substitute; 
       return $ret; 

      } elseif ($string[$i+1] < $left || $right < $string[$i+1]) { 

       $ret[] = $substitute; 

      } else { 

       if (!isset($string[$i+2])) { 

        $ret[] = $substitute; 
        return $ret; 

       } elseif ($string[$i+2] < "\x80" || "\xBF" < $string[$i+2]) { 

        $ret[] = $substitute; 
        $i += 1; 

       } else { 

        $ret[] = substr($string, $i, 3); 
        $i += 2; 

       } 

      } 

     } elseif ("\xF0" <= $string[$i] && $string[$i] <= "\xF4") { 

      $left = "\xF0" === $string[$i] ? "\x90" : "\x80"; 
      $right = "\xF4" === $string[$i] ? "\x8F" : "\xBF"; 

      if (!isset($string[$i+1])) { 

       $ret[] = $substitute; 
       return $ret; 

      } elseif ($string[$i+1] < $left || $right < $string[$i+1]) { 

       $ret[] = $substitute; 

      } else { 

       if (!isset($string[$i+2])) { 

        $ret[] = $substitute; 
        return $ret; 

       } elseif ($string[$i+2] < "\x80" || "\xBF" < $string[$i+2]) { 

        $ret[] = $substitute; 
        $i += 1; 

       } else { 

        if (!isset($string[$i+3])) { 

         $ret[] = $substitute; 
         return $ret; 

        } elseif ($string[$i+3] < "\x80" || "\xBF" < $string[$i+3]) { 

         $ret[] = $substitute; 
         $i += 2; 

        } else { 

         $ret[] = substr($string, $i, 4); 
         $i += 3; 

        } 

       } 

      } 

     } else { 

      $ret[] = $substitute; 

     } 

    } 

    return $ret; 

}
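The branch structure of the function above follows the lead-byte ranges of well-formed UTF-8. As a compact illustration of just that part, the sketch below maps a lead byte to the expected sequence length. The helper name `utf8_sequence_length` is hypothetical, and unlike the full function it does not validate continuation bytes.

```php
<?php
// Hypothetical helper: expected sequence length for a UTF-8 lead byte,
// or 0 if the byte cannot start a well-formed sequence.
function utf8_sequence_length($lead)
{
    $byte = ord($lead);

    if ($byte <= 0x7F) {
        return 1; // ASCII
    }
    if ($byte >= 0xC2 && $byte <= 0xDF) {
        return 2; // 0xC0 and 0xC1 would be overlong encodings
    }
    if ($byte >= 0xE0 && $byte <= 0xEF) {
        return 3;
    }
    if ($byte >= 0xF0 && $byte <= 0xF4) {
        return 4; // above 0xF4 would exceed U+10FFFF
    }
    return 0; // continuation byte (0x80-0xBF) or invalid lead byte
}

// Usage on input already known to be valid UTF-8.
$string = '主a';
$chars = array();
for ($i = 0, $size = strlen($string); $i < $size; $i += $len) {
    $len = max(1, utf8_sequence_length($string[$i]));
    $chars[] = substr($string, $i, $len);
}
print_r($chars);
```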

The results of the performance test between these functions are below (elapsed seconds for 10,000 calls each).

grapheme 
0.12967610359192 
IntlBreakIterator::createCharacterInstance 
0.17032408714294 
IntlBreakIterator::createCodePointInstance 
0.079245090484619 
mbstring 
0.081080913543701 
preg_split 
0.043133974075317 
htmlspecialchars 
0.25599694252014 
byte manipulation 
0.13132810592651

The benchmark code is here.

$string = '主楼怎么走'; 

foreach (timer([ 
    'grapheme' => 'str_to_array', 
    'IntlBreakIterator::createCharacterInstance' => 'str_to_array2', 
    'IntlBreakIterator::createCodePointInstance' => 'str_to_array3', 
    'mbstring' => 'str_to_array4', 
    'preg_split' => 'str_to_array5', 
    'htmlspecialchars' => 'str_to_array6', 
    'byte manipulation' => 'str_to_array7' 
], 
[$string]) as $desc => $time) { 

    echo $desc, PHP_EOL, 
     $time, PHP_EOL; 
} 

function timer(array $callables, array $arguments, $repeat = 10000) { 

    $ret = []; 
    $save = $repeat; 

    foreach ($callables as $key => $callable) { 

     $start = microtime(true); 

     do { 

      array_map($callable, $arguments); 

     } while($repeat -= 1); 

     $stop = microtime(true); 
     $ret[$key] = $stop - $start; 
     $repeat = $save; 

    } 

    return $ret; 
}

2013-05-30 13:53:36 masakielastic
