模块

UTF8

A port of phputf8 to a unified set of files. Provides multi-byte aware replacement string functions.

For UTF-8 support to work correctly, the following requirements must be met:

  • PCRE needs to be compiled with UTF-8 support (--enable-utf8)
  • Support for Unicode properties is highly recommended (--enable-unicode-properties)
  • UTF-8 conversion will be much more reliable if the iconv extension is loaded
  • The mbstring extension is highly recommended, but must not be overloading string functions

This file is licensed differently from the rest of BootPHP. As a port of phputf8, this file is released under the LGPL.

package
BootPHP
category
Base
author
Tinsh
copyright
© 2005-2013 Kilofox Studio

该类在 SYSPATH/classes/utf8.php 第 24 行声明。

属性

public static array $called

List of called methods that have had their required file included.

array(0) 

public static boolean $server_utf8

Does the server support UTF-8 natively?

NULL

方法

public __construct( ) (在 UTF8 中定义)

源代码

public function __construct()
{
	if ( self::$server_utf8 === null  )
	{
		// Determine if this server supports UTF-8 natively
		self::$server_utf8 = extension_loaded('mbstring');
	}
}

public static clean( mixed $var [, string $charset = NULL ] ) (在 UTF8 中定义)

Recursively cleans arrays, objects, && strings. Removes ASCII control codes && converts to the requested charset while silently discarding incompatible characters.

 self::clean($_GET); // Clean GET data

This method requires Iconv

参数

  • mixed $var required - Variable to clean
  • string $charset = NULL - Character set, defaults to BootPHP::$charset

Tags

返回值

  • mixed

源代码

public static function clean($var, $charset = null)
{
	if ( !$charset )
	{
		// Use the application character set
		$charset = BootPHP::$charset;
	}
	if ( is_array($var) || is_object($var) )
	{
		foreach( $var as $key => $val )
		{
			// Recursion!
			$var[self::clean($key)] = self::clean($val);
		}
	}
	elseif ( is_string($var) && $var !== '' )
	{
		// Remove control characters
		$var = self::strip_ascii_ctrl($var);
		if ( !self::is_ascii($var) )
		{
			// Disable notices
			$error_reporting = error_reporting(~E_NOTICE);
			// iconv is expensive, so it is only used when needed
			$var = iconv($charset, $charset.'//IGNORE', $var);
			// Turn notices back on
			error_reporting($error_reporting);
		}
	}
	return $var;
}

public static from_unicode( array $arr ) (在 UTF8 中定义)

Takes an array of ints representing the Unicode characters && returns a UTF-8 string. Astral planes are supported i.e. the ints in the input can be > 0xFFFF. Occurrances of the BOM are ignored. Surrogates are not allowed.

 $str = self::to_unicode($array);

The Original Code is Mozilla Communicator client code. The Initial Developer of the Original Code is Netscape Communications Corporation. Portions created by the Initial Developer are Copyright (C) 1998 the Initial Developer. Ported to PHP by Henri Sivonen hsivonen@iki.fi, see http://hsivonen.iki.fi/php-utf8/ Slight modifications to fit with phputf8 library by Harry Fuecks hfuecks@gmail.com.

参数

  • array $arr required - Unicode code points representing a string

返回值

  • string - Utf8 string of characters
  • boolean - False if a code point cannot be found

源代码

public static function from_unicode($arr)
{
	ob_start();
	$keys = array_keys($arr);
	foreach( $keys as $k )
	{
		// ASCII range (including control chars)
		if ( ($arr[$k] >= 0) && ($arr[$k] <= 0x007f) )
		{
			echo chr($arr[$k]);
		}
		// 2 byte sequence
		elseif ( $arr[$k] <= 0x07ff )
		{
			echo chr(0xc0 | ($arr[$k] >> 6));
			echo chr(0x80 | ($arr[$k] & 0x003f));
		}
		// Byte order mark (skip)
		elseif ( $arr[$k] == 0xFEFF )
		{
			// nop -- zap the BOM
		}
		// Test for illegal surrogates
		elseif ( $arr[$k] >= 0xD800 && $arr[$k] <= 0xDFFF )
		{
			// Found a surrogate
			throw new BootPHP_Exception("UTF8::from_unicode: Illegal surrogate at index: ':index', value: ':value'", array(
				':index' => $k,
				':value' => $arr[$k],
			));
		}
		// 3 byte sequence
		elseif ( $arr[$k] <= 0xffff )
		{
			echo chr(0xe0 | ($arr[$k] >> 12));
			echo chr(0x80 | (($arr[$k] >> 6) & 0x003f));
			echo chr(0x80 | ($arr[$k] & 0x003f));
		}
		// 4 byte sequence
		elseif ( $arr[$k] <= 0x10ffff )
		{
			echo chr(0xf0 | ($arr[$k] >> 18));
			echo chr(0x80 | (($arr[$k] >> 12) & 0x3f));
			echo chr(0x80 | (($arr[$k] >> 6) & 0x3f));
			echo chr(0x80 | ($arr[$k] & 0x3f));
		}
		// Out of range
		else
		{
			throw new BootPHP_Exception("UTF8::from_unicode: Codepoint out of Unicode range at index: ':index', value: ':value'", array(
				':index' => $k,
				':value' => $arr[$k],
			));
		}
	}
	$result = ob_get_contents();
	ob_end_clean();
	return $result;
}

public static is_ascii( mixed $str ) (在 UTF8 中定义)

Tests whether a string contains only 7-bit ASCII bytes. This is used to determine when to use native functions || UTF-8 functions.

 $ascii = self::is_ascii($str);

参数

  • mixed $str required - String || array of strings to check

返回值

  • boolean

源代码

public static function is_ascii($str)
{
	if ( is_array($str) )
	{
		$str = implode($str);
	}
	return !preg_match('/[^\x00-\x7F]/S', $str);
}

public static ltrim( string $str [, string $charlist = NULL ] ) (在 UTF8 中定义)

Strips whitespace (or other UTF-8 characters) from the beginning of a string. This is a UTF8-aware version of ltrim.

 $str = self::ltrim($str);

参数

  • string $str required - Input string
  • string $charlist = NULL - String of characters to remove

Tags

  • Author - Andreas Gohr

返回值

  • string

源代码

public static function ltrim($str, $charlist = null)
{
	if ( $charlist === null )
		return ltrim($str);
	if ( self::is_ascii($charlist) )
		return ltrim($str, $charlist);
	$charlist = preg_replace('#[-\[\]:\\\\^/]#', '\\\\$0', $charlist);
	return preg_replace('/^['.$charlist.']+/u', '', $str);
}

public static ord( string $chr ) (在 UTF8 中定义)

Returns the unicode ordinal for a character. This is a UTF8-aware version of ord.

 $digit = self::ord($character);

参数

  • string $chr required - UTF-8 encoded character

Tags

  • Author - Harry Fuecks

返回值

  • integer

源代码

public static function ord($chr)
{
	$ord0 = ord($chr);
	if ( $ord0 >= 0 && $ord0 <= 127 )
		return $ord0;
	if ( !isset($chr[1]) )
	{
		throw new BootPHP_Exception('Short sequence - at least 2 bytes expected, only 1 seen');
	}
	$ord1 = ord($chr[1]);
	if ( $ord0 >= 192 && $ord0 <= 223 )
		return ($ord0 - 192) * 64 + ($ord1 - 128);
	if ( !isset($chr[2]) )
	{
		throw new BootPHP_Exception('Short sequence - at least 3 bytes expected, only 2 seen');
	}
	$ord2 = ord($chr[2]);
	if ( $ord0 >= 224 && $ord0 <= 239 )
		return ($ord0 - 224) * 4096 + ($ord1 - 128) * 64 + ($ord2 - 128);
	if ( !isset($chr[3]) )
	{
		throw new BootPHP_Exception('Short sequence - at least 4 bytes expected, only 3 seen');
	}
	$ord3 = ord($chr[3]);
	if ( $ord0 >= 240 && $ord0 <= 247 )
		return ($ord0 - 240) * 262144 + ($ord1 - 128) * 4096 + ($ord2-128) * 64 + ($ord3 - 128);
	if ( !isset($chr[4]) )
	{
		throw new BootPHP_Exception('Short sequence - at least 5 bytes expected, only 4 seen');
	}
	$ord4 = ord($chr[4]);
	if ( $ord0 >= 248 && $ord0 <= 251 )
		return ($ord0 - 248) * 16777216 + ($ord1-128) * 262144 + ($ord2 - 128) * 4096 + ($ord3 - 128) * 64 + ($ord4 - 128);
	if ( !isset($chr[5]) )
	{
		throw new BootPHP_Exception('Short sequence - at least 6 bytes expected, only 5 seen');
	}
	if ( $ord0 >= 252 && $ord0 <= 253 )
		return ($ord0 - 252) * 1073741824 + ($ord1 - 128) * 16777216 + ($ord2 - 128) * 262144 + ($ord3 - 128) * 4096 + ($ord4 - 128) * 64 + (ord($chr[5]) - 128);
	if ( $ord0 >= 254 && $ord0 <= 255 )
	{
		throw new BootPHP_Exception("Invalid UTF-8 with surrogate ordinal ':ordinal'", array(
			':ordinal' => $ord0,
		));
	}
}

public static rtrim( string $str [, string $charlist = NULL ] ) (在 UTF8 中定义)

Strips whitespace (or other UTF-8 characters) from the end of a string. This is a UTF8-aware version of rtrim.

 $str = self::rtrim($str);

参数

  • string $str required - Input string
  • string $charlist = NULL - String of characters to remove

Tags

  • Author - Andreas Gohr

返回值

  • string

源代码

public static function rtrim($str, $charlist = null)
{
	if ( $charlist === null )
		return rtrim($str);
	if ( self::is_ascii($charlist) )
		return rtrim($str, $charlist);
	$charlist = preg_replace('#[-\[\]:\\\\^/]#', '\\\\$0', $charlist);
	return preg_replace('/['.$charlist.']++$/uD', '', $str);
}

public static str_ireplace( string|array $search , string|array $replace , string|array $str [, integer & $count = NULL ] ) (在 UTF8 中定义)

Returns a string || an array with all occurrences of search in subject (ignoring case) && replaced with the given replace value. This is a UTF8-aware version of str_ireplace.

This function is very slow compared to the native version. Avoid using it when possible.

参数

  • string|array $search required - Text to replace
  • string|array $replace required - Replacement text
  • string|array $str required - Subject text
  • byref integer $count = NULL - Number of matched && replaced needles will be returned via this parameter which is passed by reference

Tags

  • Author - Harry Fuecks

    返回值

    • string - If the input was a string
    • array - If the input was an array

    源代码

    public static function str_ireplace($search, $replace, $str, & $count = null)
    {
    	if ( self::is_ascii($search) && self::is_ascii($replace) && self::is_ascii($str) )
    		return str_ireplace($search, $replace, $str, $count);
    	if ( is_array($str) )
    	{
    		foreach( $str as $key => $val )
    		{
    			$str[$key] = self::str_ireplace($search, $replace, $val, $count);
    		}
    		return $str;
    	}
    	if ( is_array($search) )
    	{
    		$keys = array_keys($search);
    		foreach( $keys as $k )
    		{
    			if ( is_array($replace) )
    			{
    				if ( array_key_exists($k, $replace) )
    				{
    					$str = self::str_ireplace($search[$k], $replace[$k], $str, $count);
    				}
    				else
    				{
    					$str = self::str_ireplace($search[$k], '', $str, $count);
    				}
    			}
    			else
    			{
    				$str = self::str_ireplace($search[$k], $replace, $str, $count);
    			}
    		}
    		return $str;
    	}
    	$search = self::strtolower($search);
    	$str_lower = self::strtolower($str);
    	$total_matched_strlen = 0;
    	$i = 0;
    	while(preg_match('/(.*?)'.preg_quote($search, '/').'/s', $str_lower, $matches))
    	{
    		$matched_strlen = strlen($matches[0]);
    		$str_lower = substr($str_lower, $matched_strlen);
    		$offset = $total_matched_strlen + strlen($matches[1]) + ($i * (strlen($replace) - 1));
    		$str = substr_replace($str, $replace, $offset, strlen($search));
    		$total_matched_strlen += $matched_strlen;
    		$i++;
    	}
    	$count += $i;
    	return $str;
    }

public static str_pad( string $str , integer $final_str_length [, string $pad_str = string(1) " " , string $pad_type = integer 1 ] ) (在 UTF8 中定义)

Pads a UTF-8 string to a certain length with another string. This is a UTF8-aware version of str_pad.

 $str = self::str_pad($str, $length);

参数

  • string $str required - Input string
  • integer $final_str_length required - Desired string length after padding
  • string $pad_str = string(1) " " - String to use as padding
  • string $pad_type = integer 1 - Padding type: STR_PAD_RIGHT, STR_PAD_LEFT, || STR_PAD_BOTH

Tags

  • Author - Harry Fuecks

返回值

  • string

源代码

public static function str_pad($str, $final_str_length, $pad_str = ' ', $pad_type = STR_PAD_RIGHT)
{
	if ( self::is_ascii($str) && self::is_ascii($pad_str) )
		return str_pad($str, $final_str_length, $pad_str, $pad_type);
	$str_length = self::strlen($str);
	if ( $final_str_length <= 0 || $final_str_length <= $str_length )
		return $str;
	$pad_str_length = self::strlen($pad_str);
	$pad_length = $final_str_length - $str_length;
	if ( $pad_type == STR_PAD_RIGHT )
	{
		$repeat = ceil($pad_length / $pad_str_length);
		return self::substr($str.str_repeat($pad_str, $repeat), 0, $final_str_length);
	}
	if ( $pad_type == STR_PAD_LEFT )
	{
		$repeat = ceil($pad_length / $pad_str_length);
		return self::substr(str_repeat($pad_str, $repeat), 0, floor($pad_length)).$str;
	}
	if ( $pad_type == STR_PAD_BOTH )
	{
		$pad_length /= 2;
		$pad_length_left = floor($pad_length);
		$pad_length_right = ceil($pad_length);
		$repeat_left = ceil($pad_length_left / $pad_str_length);
		$repeat_right = ceil($pad_length_right / $pad_str_length);
		$pad_left = self::substr(str_repeat($pad_str, $repeat_left), 0, $pad_length_left);
		$pad_right = self::substr(str_repeat($pad_str, $repeat_right), 0, $pad_length_right);
		return $pad_left.$str.$pad_right;
	}
	throw new BootPHP_Exception("UTF8::str_pad: Unknown padding type (:pad_type)", array(
			':pad_type' => $pad_type,
		));
}

public static str_split( string $str [, integer $split_length = integer 1 ] ) (在 UTF8 中定义)

Converts a UTF-8 string to an array. This is a UTF8-aware version of str_split.

 $array = self::str_split($str);

参数

  • string $str required - Input string
  • integer $split_length = integer 1 - Maximum length of each chunk

Tags

  • Author - Harry Fuecks

返回值

  • array

源代码

public static function str_split($str, $split_length = 1)
{
	$split_length = (int) $split_length;
	if ( self::is_ascii($str) )
		return str_split($str, $split_length);
	if ( $split_length < 1 )
		return false;
	if ( self::strlen($str) <= $split_length )
		return array($str);
	preg_match_all('/.{'.$split_length.'}|[^\x00]{1,'.$split_length.'}$/us', $str, $matches);
	return $matches[0];
}

public static strcasecmp( string $str1 , string $str2 ) (在 UTF8 中定义)

Case-insensitive UTF-8 string comparison. This is a UTF8-aware version of strcasecmp.

 $compare = self::strcasecmp($str1, $str2);

参数

  • string $str1 required - String to compare
  • string $str2 required - String to compare

Tags

  • Author - Harry Fuecks

返回值

  • integer - Less than 0 if str1 is less than str2
  • integer - Greater than 0 if str1 is greater than str2
  • integer - 0 if they are equal

源代码

public static function strcasecmp($str1, $str2)
{
	if ( self::is_ascii($str1) && self::is_ascii($str2) )
		return strcasecmp($str1, $str2);
	$str1 = self::strtolower($str1);
	$str2 = self::strtolower($str2);
	return strcmp($str1, $str2);
}

public static strcspn( string $str , string $mask [, integer $offset = NULL , integer $length = NULL ] ) (在 UTF8 中定义)

Finds the length of the initial segment not matching mask. This is a UTF8-aware version of strcspn.

 $found = self::strcspn($str, $mask);

参数

  • string $str required - Input string
  • string $mask required - Mask for search
  • integer $offset = NULL - Start position of the string to examine
  • integer $length = NULL - Length of the string to examine

Tags

  • Author - Harry Fuecks

返回值

  • integer - Length of the initial segment that contains characters not in the mask

源代码

public static function strcspn($str, $mask, $offset = null, $length = null)
{
	if ( $str == '' || $mask == '' )
		return 0;
	if ( self::is_ascii($str) && self::is_ascii($mask) )
		return ($offset === null) ? strcspn($str, $mask) : (($length === null) ? strcspn($str, $mask, $offset) : strcspn($str, $mask, $offset, $length));
	if ( $offset !== null || $length !== null )
	{
		$str = self::substr($str, $offset, $length);
	}
	// Escape these characters:  - [ ] . : \ ^ /
	// The . && : are escaped to prevent possible warnings about POSIX regex elements
	$mask = preg_replace('#[-[\].:\\\\^/]#', '\\\\$0', $mask);
	preg_match('/^[^'.$mask.']+/u', $str, $matches);
	return isset($matches[0]) ? self::strlen($matches[0]) : 0;
}

public static strip_ascii_ctrl( string $str ) (在 UTF8 中定义)

Strips out device control codes in the ASCII range.

 $str = self::strip_ascii_ctrl($str);

参数

  • string $str required - String to clean

返回值

  • string

源代码

public static function strip_ascii_ctrl($str)
{
	return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]+/S', '', $str);
}

public static strip_non_ascii( string $str ) (在 UTF8 中定义)

Strips out all non-7bit ASCII bytes.

 $str = self::strip_non_ascii($str);

参数

  • string $str required - String to clean

返回值

  • string

源代码

public static function strip_non_ascii($str)
{
	return preg_replace('/[^\x00-\x7F]+/S', '', $str);
}

public static stristr( string $str , string $search ) (在 UTF8 中定义)

Case-insenstive UTF-8 version of strstr. Returns all of input string from the first occurrence of needle to the end. This is a UTF8-aware version of stristr.

 $found = self::stristr($str, $search);

参数

  • string $str required - Input string
  • string $search required - Needle

Tags

  • Author - Harry Fuecks

返回值

  • string - Matched substring if found
  • false - If the substring was not found

源代码

public static function stristr($str, $search)
{
	if ( self::is_ascii($str) && self::is_ascii($search) )
		return stristr($str, $search);
	if ( $search == '' )
		return $str;
	$str_lower = self::strtolower($str);
	$search_lower = self::strtolower($search);
	preg_match('/^(.*?)'.preg_quote($search_lower, '/').'/s', $str_lower, $matches);
	if ( isset($matches[1]) )
		return substr($str, strlen($matches[1]));
	return false;
}

public static strlen( string $str ) (在 UTF8 中定义)

Returns the length of the given string. This is a UTF8-aware version of strlen.

 $length = self::strlen($str);

参数

  • string $str required - String being measured for length

Tags

返回值

  • integer

源代码

public static function strlen($str)
{
	if ( self::$server_utf8 )
		return mb_strlen($str, BootPHP::$charset);
	if ( self::is_ascii($str) )
		return strlen($str);
	return strlen(utf8_decode($str));
}

public static strpos( string $str , string $search [, integer $offset = integer 0 ] ) (在 UTF8 中定义)

Finds position of first occurrence of a UTF-8 string. This is a UTF8-aware version of strpos.

 $position = self::strpos($str, $search);

参数

  • string $str required - Haystack
  • string $search required - Needle
  • integer $offset = integer 0 - Offset from which character in haystack to start searching

Tags

返回值

  • integer - Position of needle
  • boolean - False if the needle is not found

源代码

public static function strpos($str, $search, $offset = 0)
{
	if ( self::$server_utf8 )
		return mb_strpos($str, $search, $offset, BootPHP::$charset);
	$offset = (int) $offset;
	if ( self::is_ascii($str) && self::is_ascii($search) )
		return strpos($str, $search, $offset);
	if ( $offset == 0 )
	{
		$array = explode($search, $str, 2);
		return isset($array[1]) ? self::strlen($array[0]) : false;
	}
	$str = self::substr($str, $offset);
	$pos = self::strpos($str, $search);
	return ($pos === false) ? false : ($pos + $offset);
}

public static strrev( string $str ) (在 UTF8 中定义)

Reverses a UTF-8 string. This is a UTF8-aware version of strrev.

 $str = self::strrev($str);

参数

  • string $str required - String to be reversed

Tags

  • Author - Harry Fuecks

返回值

  • string

源代码

public static function strrev($str)
{
	if ( self::is_ascii($str) )
		return strrev($str);
	preg_match_all('/./us', $str, $matches);
	return implode('', array_reverse($matches[0]));
}

public static strrpos( string $str , string $search [, integer $offset = integer 0 ] ) (在 UTF8 中定义)

Finds position of last occurrence of a char in a UTF-8 string. This is a UTF8-aware version of strrpos.

 $position = self::strrpos($str, $search);

参数

  • string $str required - Haystack
  • string $search required - Needle
  • integer $offset = integer 0 - Offset from which character in haystack to start searching

Tags

返回值

  • integer - Position of needle
  • boolean - False if the needle is not found

源代码

public static function strrpos($str, $search, $offset = 0)
{
	if ( self::$server_utf8 )
		return mb_strrpos($str, $search, $offset, BootPHP::$charset);
	$offset = (int) $offset;
	if ( self::is_ascii($str) && self::is_ascii($search) )
		return strrpos($str, $search, $offset);
	if ( $offset == 0 )
	{
		$array = explode($search, $str, -1);
		return isset($array[0]) ? self::strlen(implode($search, $array)) : false;
	}
	$str = self::substr($str, $offset);
	$pos = self::strrpos($str, $search);
	return ($pos === false) ? false : ($pos + $offset);
}

public static strspn( string $str , string $mask [, integer $offset = NULL , integer $length = NULL ] ) (在 UTF8 中定义)

Finds the length of the initial segment matching mask. This is a UTF8-aware version of strspn.

 $found = self::strspn($str, $mask);

参数

  • string $str required - Input string
  • string $mask required - Mask for search
  • integer $offset = NULL - Start position of the string to examine
  • integer $length = NULL - Length of the string to examine

Tags

  • Author - Harry Fuecks

返回值

  • integer - Length of the initial segment that contains characters in the mask

源代码

public static function strspn($str, $mask, $offset = null, $length = null)
{
	if ( $str == '' || $mask == '' )
		return 0;
	if ( self::is_ascii($str) && self::is_ascii($mask) )
		return ($offset === null) ? strspn($str, $mask) : (($length === null) ? strspn($str, $mask, $offset) : strspn($str, $mask, $offset, $length));
	if ( $offset !== null || $length !== null )
	{
		$str = self::substr($str, $offset, $length);
	}
	// Escape these characters:  - [ ] . : \ ^ /
	// The . && : are escaped to prevent possible warnings about POSIX regex elements
	$mask = preg_replace('#[-[\].:\\\\^/]#', '\\\\$0', $mask);
	preg_match('/^[^'.$mask.']+/u', $str, $matches);
	return isset($matches[0]) ? self::strlen($matches[0]) : 0;
}

public static substr( string $str , integer $offset [, integer $length = NULL ] ) (在 UTF8 中定义)

Returns part of a UTF-8 string. This is a UTF8-aware version of substr.

 $sub = self::substr($str, $offset);

参数

  • string $str required - Input string
  • integer $offset required - Offset
  • integer $length = NULL - Length limit

Tags

返回值

  • string

源代码

public static function substr($str, $offset, $length = null)
{
	if ( self::$server_utf8 )
		return ($length === null) ? mb_substr($str, $offset, mb_strlen($str), BootPHP::$charset) : mb_substr($str, $offset, $length, BootPHP::$charset);
	if ( self::is_ascii($str) )
		return ($length === null) ? substr($str, $offset) : substr($str, $offset, $length);
	// Normalize params
	$str	= (string) $str;
	$strlen = self::strlen($str);
	$offset = (int) ($offset < 0) ? max(0, $strlen + $offset) : $offset; // Normalize to positive offset
	$length = ($length === null) ? null : (int) $length;
	// Impossible
	if ( $length === 0 || $offset >= $strlen || ($length < 0 && $length <= $offset - $strlen) )
		return '';
	// Whole string
	if ( $offset == 0 && ($length === null || $length >= $strlen) )
		return $str;
	// Build regex
	$regex = '^';
	// Create an offset expression
	if ( $offset > 0 )
	{
		// PCRE repeating quantifiers must be less than 65536, so repeat when necessary
		$x = (int) ($offset / 65535);
		$y = (int) ($offset % 65535);
		$regex .= ($x == 0) ? '' : ('(?:.{65535}){'.$x.'}');
		$regex .= ($y == 0) ? '' : ('.{'.$y.'}');
	}
	// Create a length expression
	if ( $length === null )
	{
		$regex .= '(.*)'; // No length set, grab it all
	}
	// Find length from the left (positive length)
	elseif ( $length > 0 )
	{
		// Reduce length so that it can't go beyond the end of the string
		$length = min($strlen - $offset, $length);
		$x = (int) ($length / 65535);
		$y = (int) ($length % 65535);
		$regex .= '(';
		$regex .= ($x == 0) ? '' : ('(?:.{65535}){'.$x.'}');
		$regex .= '.{'.$y.'})';
	}
	// Find length from the right (negative length)
	else
	{
		$x = (int) (-$length / 65535);
		$y = (int) (-$length % 65535);
		$regex .= '(.*)';
		$regex .= ($x == 0) ? '' : ('(?:.{65535}){'.$x.'}');
		$regex .= '.{'.$y.'}';
	}
	preg_match('/'.$regex.'/us', $str, $matches);
	return $matches[1];
}

public static substr_replace( string $str , string $replacement , integer $offset [, $length = NULL ] ) (在 UTF8 中定义)

Replaces text within a portion of a UTF-8 string. This is a UTF8-aware version of substr_replace.

 $str = self::substr_replace($str, $replacement, $offset);

参数

  • string $str required - Input string
  • string $replacement required - Replacement string
  • integer $offset required - Offset
  • unknown $length = NULL

Tags

  • Author - Harry Fuecks

返回值

  • string

源代码

public static function substr_replace($str, $replacement, $offset, $length = null)
{
	if ( self::is_ascii($str) )
		return ($length === null) ? substr_replace($str, $replacement, $offset) : substr_replace($str, $replacement, $offset, $length);
	$length = ($length === null) ? self::strlen($str) : (int) $length;
	preg_match_all('/./us', $str, $str_array);
	preg_match_all('/./us', $replacement, $replacement_array);
	array_splice($str_array[0], $offset, $length, $replacement_array[0]);
	return implode('', $str_array[0]);
}

public static to_unicode( string $str ) (在 UTF8 中定义)

Takes an UTF-8 string && returns an array of ints representing the Unicode characters. Astral planes are supported i.e. the ints in the output can be > 0xFFFF. Occurrences of the BOM are ignored. Surrogates are not allowed.

 $array = self::to_unicode($str);

The Original Code is Mozilla Communicator client code. The Initial Developer of the Original Code is Netscape Communications Corporation. Portions created by the Initial Developer are Copyright (C) 1998 the Initial Developer. Ported to PHP by Henri Sivonen hsivonen@iki.fi, see http://hsivonen.iki.fi/php-utf8/ Slight modifications to fit with phputf8 library by Harry Fuecks hfuecks@gmail.com

参数

  • string $str required - UTF-8 encoded string

返回值

  • array - Unicode code points
  • false - If the string is invalid

源代码

public static function to_unicode($str)
{
	// Cached expected number of octets after the current octet until the beginning of the next UTF8 character sequence
	$m_state = 0;
	// Cached Unicode character
	$m_ucs4  = 0;
	// Cached expected number of octets in the current sequence
	$m_bytes = 1;
	$out = array();
	$len = strlen($str);
	for($i = 0; $i < $len; $i++)
	{
		$in = ord($str[$i]);
		if ( $m_state == 0 )
		{
			// When m_state is zero we expect either a US-ASCII character || a multi-octet sequence.
			if ( 0 == (0x80 & $in) )
			{
				// US-ASCII, pass straight through.
				$out[] = $in;
				$m_bytes = 1;
			}
			elseif ( 0xC0 == (0xE0 & $in) )
			{
				// First octet of 2 octet sequence
				$m_ucs4 = $in;
				$m_ucs4 = ($m_ucs4 & 0x1F) << 6;
				$m_state = 1;
				$m_bytes = 2;
			}
			elseif ( 0xE0 == (0xF0 & $in) )
			{
				// First octet of 3 octet sequence
				$m_ucs4 = $in;
				$m_ucs4 = ($m_ucs4 & 0x0F) << 12;
				$m_state = 2;
				$m_bytes = 3;
			}
			elseif ( 0xF0 == (0xF8 & $in) )
			{
				// First octet of 4 octet sequence
				$m_ucs4 = $in;
				$m_ucs4 = ($m_ucs4 & 0x07) << 18;
				$m_state = 3;
				$m_bytes = 4;
			}
			elseif ( 0xF8 == (0xFC & $in) )
			{
				/** First octet of 5 octet sequence.
				 *
				 * This is illegal because the encoded codepoint must be either
				 * (a) not the shortest form or
				 * (b) outside the Unicode range of 0-0x10FFFF.
				 * Rather than trying to resynchronize, we will carry on until the end
				 * of the sequence && let the later error handling code catch it.
				 **/
				$m_ucs4 = $in;
				$m_ucs4 = ($m_ucs4 & 0x03) << 24;
				$m_state = 4;
				$m_bytes = 5;
			}
			elseif ( 0xFC == (0xFE & $in) )
			{
				// First octet of 6 octet sequence, see comments for 5 octet sequence.
				$m_ucs4 = $in;
				$m_ucs4 = ($m_ucs4 & 1) << 30;
				$m_state = 5;
				$m_bytes = 6;
			}
			else
			{
				// Current octet is neither in the US-ASCII range nor a legal first octet of a multi-octet sequence.
				trigger_error('self::to_unicode: Illegal sequence identifier in UTF-8 at byte '.$i, E_USER_WARNING);
				return false;
			}
		}
		else
		{
			// When m_state is non-zero, we expect a continuation of the multi-octet sequence
			if ( 0x80 == (0xC0 & $in) )
			{
				// Legal continuation
				$shift = ($m_state - 1) * 6;
				$tmp = $in;
				$tmp = ($tmp & 0x0000003F) << $shift;
				$m_ucs4 |= $tmp;
				// End of the multi-octet sequence. mUcs4 now contains the final Unicode codepoint to be output
				if ( 0 == --$m_state )
				{
					// Check for illegal sequences && codepoints
					// From Unicode 3.1, non-shortest form is illegal
					if ( ((2 == $m_bytes) && ($m_ucs4 < 0x0080) ) OR
						((3 == $m_bytes) && ($m_ucs4 < 0x0800)) OR
						((4 == $m_bytes) && ($m_ucs4 < 0x10000)) OR
						(4 < $m_bytes) OR
						// From Unicode 3.2, surrogate characters are illegal
						(($m_ucs4 & 0xFFFFF800) == 0xD800) OR
						// Codepoints outside the Unicode range are illegal
						($m_ucs4 > 0x10FFFF))
					{
						trigger_error('self::to_unicode: Illegal sequence || codepoint in UTF-8 at byte '.$i, E_USER_WARNING);
						return false;
					}
					if ( 0xFEFF != $m_ucs4 )
					{
						// BOM is legal but we don't want to output it
						$out[] = $m_ucs4;
					}
					// Initialize UTF-8 cache
					$m_state = 0;
					$m_ucs4  = 0;
					$m_bytes = 1;
				}
			}
			else
			{
				// ((0xC0 & (*in) != 0x80) && (m_state != 0))
				// Incomplete multi-octet sequence
				throw new BootPHP_Exception("UTF8::to_unicode: Incomplete multi-octet sequence in UTF-8 at byte ':byte'", array(
					':byte' => $i,
				));
			}
		}
	}
	return $out;
}

public static transliterate_to_ascii( string $str [, integer $case = integer 0 ] ) (在 UTF8 中定义)

Replaces special/accented UTF-8 characters by ASCII-7 "equivalents".

 $ascii = self::transliterate_to_ascii($utf8);

参数

  • string $str required - String to transliterate
  • integer $case = integer 0 - -1 lowercase only, +1 uppercase only, 0 both cases

Tags

  • Author - Andreas Gohr

返回值

  • string

源代码

public static trim( string $str [, string $charlist = NULL ] ) (在 UTF8 中定义)

Strips whitespace (or other UTF-8 characters) from the beginning and end of a string. This is a UTF8-aware version of trim.

 $str = self::trim($str);

参数

  • string $str required - Input string
  • string $charlist = NULL - String of characters to remove

Tags

  • Author - Andreas Gohr

返回值

  • string

源代码

public static function trim($str, $charlist = null)
{
	if ( $charlist === null )
		return trim($str);
	return self::ltrim(self::rtrim($str, $charlist), $charlist);
}