<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<HTML
><HEAD
><TITLE
>Pattern Syntax</TITLE
><META
NAME="GENERATOR"
CONTENT="Modular DocBook HTML Stylesheet Version 1.79"><LINK
REL="HOME"
TITLE="PHP Manual"
HREF="index.html"><LINK
REL="UP"
TITLE="Regular Expression Functions (Perl-Compatible)"
HREF="ref.pcre.html"><LINK
REL="PREVIOUS"
TITLE="Pattern Modifiers"
HREF="reference.pcre.pattern.modifiers.html"><LINK
REL="NEXT"
TITLE="preg_grep"
HREF="function.preg-grep.html"><META
HTTP-EQUIV="Content-type"
CONTENT="text/html; charset=UTF-8"></HEAD
><BODY
CLASS="refentry"
BGCOLOR="#FFFFFF"
TEXT="#000000"
LINK="#0000FF"
VLINK="#840084"
ALINK="#0000FF"
><DIV
CLASS="NAVHEADER"
><TABLE
SUMMARY="Header navigation table"
WIDTH="100%"
BORDER="0"
CELLPADDING="0"
CELLSPACING="0"
><TR
><TH
COLSPAN="3"
ALIGN="center"
>PHP Manual</TH
></TR
><TR
><TD
WIDTH="10%"
ALIGN="left"
VALIGN="bottom"
><A
HREF="reference.pcre.pattern.modifiers.html"
ACCESSKEY="P"
>Prev</A
></TD
><TD
WIDTH="80%"
ALIGN="center"
VALIGN="bottom"
></TD
><TD
WIDTH="10%"
ALIGN="right"
VALIGN="bottom"
><A
HREF="function.preg-grep.html"
ACCESSKEY="N"
>Next</A
></TD
></TR
></TABLE
><HR
ALIGN="LEFT"
WIDTH="100%"></DIV
><H1
><A
NAME="reference.pcre.pattern.syntax"
></A
>Pattern Syntax</H1
><DIV
CLASS="refnamediv"
><A
NAME="AEN170738"
></A
>Pattern Syntax&nbsp;--&nbsp;Describes the syntax of Perl-compatible regular expressions</DIV
><DIV
CLASS="refsect1"
><A
NAME="AEN170741"
></A
><H2
>Description</H2
><P
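>&#13;   The PCRE library is a set of functions that implement regular expression
   pattern matching using the same syntax and semantics as Perl 5, with a few
   differences (see below). The current implementation corresponds to Perl 5.005.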
  </P
></DIV
><DIV
CLASS="refsect1"
><A
NAME="AEN170744"
></A
><H2
>Differences From Perl</H2
><P
>&#13;   The differences described here are with respect to Perl 5.005.
  <P
></P
><OL
TYPE="1"
><LI
><P
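>&#13;     By default, a whitespace character is any character that the C library
     function isspace() recognizes, though PCRE can be compiled with alternative
     character type tables. Normally isspace() matches space, formfeed, newline,
     carriage return, horizontal tab, and vertical tab. Perl 5 no longer includes
     vertical tab in its set of whitespace characters. The \v escape that was in
     the Perl documentation for a long time was never in fact recognized, but the
     character itself was treated as whitespace at least up to 5.002. In 5.004
     and 5.005 it does not match \s.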
    </P
></LI
><LI
><P
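>&#13;     PCRE does not allow repeat quantifiers on lookahead assertions. Perl
     permits them, but they do not mean what you might think. For example, (?!a){3}
     does not assert that the next three characters are not "a". It just asserts three times that the next character is not "a".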
    </P
></LI
><LI
><P
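>&#13;     Capturing subpatterns that occur inside negative lookahead assertions are counted, but their entries in the offsets vector are never set. Perl
     sets its numerical variables from any such patterns that are matched before the assertion fails to match something, but only if the negative lookahead assertion contains just one branch.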
     
    </P
></LI
><LI
><P
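>&#13;     Though binary zero characters are supported in the subject string, they are not allowed in a pattern string, because it is passed as a normal
     C string, terminated by a binary zero. The escape sequence "\x00" can be used in the pattern to represent a binary zero.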
    </P
></LI
><LI
><P
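>&#13;      The following Perl escape sequences are not supported: \l, \u, \L, \U. In fact these are implemented by
      Perl's general string handling and are not part of its pattern matching engine.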
     </P
></LI
><LI
><P
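>&#13;      The Perl \G assertion is not supported, as it is not relevant to single pattern matches. (This describes the original PCRE library; the section on backslash assertions below lists \G as available since PHP 4.3.3.)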
     </P
></LI
><LI
><P
>&#13;      Fairly obviously, PCRE does not support the (?{code}) construction.
     </P
></LI
><LI
><P
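>&#13;      There are some peculiarities in Perl 5.005_02 concerned with the settings of
      captured strings when part of a pattern is repeated. For example, matching
      "aba" against the pattern /^(a(b)?)+$/ sets $2 to the value "b", but matching
      "aabbaa" against /^(aa(bb)?)+$/ leaves $2 unset. However, if the pattern is changed to
      /^(aa(b(b))?)+$/ then $2 (and $3) are set. In
      Perl 5.004 $2 is set in both cases, and that is also
      <TT
CLASS="constant"
><B
>TRUE</B
></TT
> of PCRE. If Perl later changes to a different, consistent behavior, PCRE may change to follow.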
     </P
></LI
><LI
><P
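>&#13;     Another as yet unresolved discrepancy is that in Perl 5.005_02 the pattern
     /^(a)?(?(1)a|b)+$/ matches the string "a", whereas in PCRE
     it does not. However, in both Perl and PCRE, matching /^(a)?a/
     against "a" leaves $1 unset.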
    </P
></LI
><LI
><P
>&#13;      PCRE provides some extensions to the Perl regular expression facilities:
      <P
></P
><OL
TYPE="a"
><LI
><P
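>&#13;         Although lookbehind assertions must match fixed length strings, each alternative branch of a lookbehind assertion can match a string of a different length. Perl
         5.005 requires all branches to have the same length.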
        </P
></LI
><LI
><P
>&#13;         If
         <A
HREF="reference.pcre.pattern.modifiers.html"
>PCRE_DOLLAR_ENDONLY</A
>
         is set and
         <A
HREF="reference.pcre.pattern.modifiers.html"
>PCRE_MULTILINE</A
> is not set, the
         $ meta-character matches only at the very end of the string.
        </P
></LI
><LI
><P
>&#13;         If
         <A
HREF="reference.pcre.pattern.modifiers.html"
>PCRE_EXTRA</A
> is set, a backslash followed by a letter with no special meaning is treated as an error.
        </P
></LI
><LI
><P
>&#13;         If
         <A
HREF="reference.pcre.pattern.modifiers.html"
>PCRE_UNGREEDY</A
> is set, the greediness of the repetition quantifiers
         is inverted: they are not greedy by default, but become greedy if followed by a question mark.
        </P
></LI
></OL
>
     </P
></LI
></OL
>
  </P
></DIV
><DIV
CLASS="refsect1"
><A
NAME="regexp.reference"
></A
><H2
>Regular Expression Details</H2
><DIV
CLASS="refsect2"
><A
NAME="regexp.introduction"
></A
><H3
>Introduction</H3
><P
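>&#13;    The syntax and semantics of the regular expressions supported by PCRE are described below. Regular expressions are also described in the Perl
    documentation and in a number of other books, some of which have copious examples. Jeffrey
    Friedl's "Mastering Regular Expressions", published by O'Reilly
    (ISBN 1-56592-257-3), covers them in great detail. The description here is intended as reference documentation.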
   </P
><P
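>&#13;    A regular expression is a pattern that is matched against a subject string from left to right. Most characters stand for themselves in a pattern and match the corresponding characters in the subject. As a trivial example, the pattern
    <TT
CLASS="literal"
>The quick brown fox</TT
> matches a portion of a subject string that is identical to itself.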
   </P
></DIV
><DIV
CLASS="refsect2"
><A
NAME="regexp.reference.meta"
></A
><H3
>Meta-characters</H3
><P
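>&#13;    The power of regular expressions comes from the ability to include alternatives and repetitions in the pattern. These are encoded in the pattern by the use of <SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>meta-characters</I
></SPAN
>, which do not stand for themselves but instead are interpreted in some special way.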
   </P
><P
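>&#13;    There are two different sets of meta-characters: those that are recognized anywhere in the pattern except within square brackets, and those that are recognized within square brackets. Outside square brackets, the meta-characters are as follows: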
    <P
></P
><DIV
CLASS="variablelist"
><DL
><DT
><SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>\</I
></SPAN
></DT
><DD
><P
>&#13;        general escape character with several uses
       </P
></DD
><DT
><SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>^</I
></SPAN
></DT
><DD
><P
>&#13;        assert start of subject (or start of line in multiline mode, i.e. immediately after a newline)
       </P
></DD
><DT
><SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>$</I
></SPAN
></DT
><DD
><P
>&#13;        assert end of subject (or end of line in multiline mode, i.e. immediately before a newline)
       </P
></DD
><DT
><SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>.</I
></SPAN
></DT
><DD
><P
>&#13;        match any character except newline (by default)
       </P
></DD
><DT
><SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>[</I
></SPAN
></DT
><DD
><P
>&#13;        start character class definition
       </P
></DD
><DT
><SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>]</I
></SPAN
></DT
><DD
><P
>&#13;        end character class definition
       </P
></DD
><DT
><SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>|</I
></SPAN
></DT
><DD
><P
>&#13;        start of an alternative branch
       </P
></DD
><DT
><SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>(</I
></SPAN
></DT
><DD
><P
>&#13;        start subpattern
       </P
></DD
><DT
><SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>)</I
></SPAN
></DT
><DD
><P
>&#13;        end subpattern
       </P
></DD
><DT
><SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>?</I
></SPAN
></DT
><DD
><P
>&#13;        extends the meaning of (, also the 0 or 1 quantifier, also the quantifier minimizer
       </P
></DD
><DT
><SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>*</I
></SPAN
></DT
><DD
><P
>&#13;        0 or more quantifier
       </P
></DD
><DT
><SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>+</I
></SPAN
></DT
><DD
><P
>&#13;        1 or more quantifier
       </P
></DD
><DT
><SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>{</I
></SPAN
></DT
><DD
><P
>&#13;        start min/max quantifier
       </P
></DD
><DT
><SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>}</I
></SPAN
></DT
><DD
><P
>&#13;        end min/max quantifier
       </P
></DD
></DL
></DIV
>
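    The part of a pattern that is in square brackets is called a "character class". In a character class the only meta-characters are: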
    <P
></P
><DIV
CLASS="variablelist"
><DL
><DT
><SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>\</I
></SPAN
></DT
><DD
><P
>&#13;        general escape character
       </P
></DD
><DT
><SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>^</I
></SPAN
></DT
><DD
><P
>&#13;        negate the class, but only if it is the first character
       </P
></DD
><DT
><SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>-</I
></SPAN
></DT
><DD
><P
>&#13;        indicates a character range
       </P
></DD
><DT
><SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>]</I
></SPAN
></DT
><DD
><P
>&#13;        terminates the character class
       </P
></DD
></DL
></DIV
>
    The following sections describe the use of each of the meta-characters.
   </P
></DIV
><DIV
CLASS="refsect2"
><A
NAME="regexp.reference.backslash"
></A
><H3
>Backslash (\)</H3
><P
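>&#13;    The backslash character has several uses. Firstly, if it is followed by a non-alphanumeric character, it takes away any special meaning that character may have. This use of backslash as an escape character applies both inside and outside character classes.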
   </P
><P
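>&#13;    For example, to match a "*" character, write "\*" in the pattern. This applies whether or not the following character would otherwise be interpreted as a meta-character, so it is always safe to precede a non-alphanumeric character with "\" to specify that it stands for itself. In particular, to match a backslash, write "\\".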
   </P
><DIV
CLASS="note"
><BLOCKQUOTE
CLASS="note"
><P
><B
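>Note: </B
>
     Backslash has a special meaning in single and double quoted PHP <A
HREF="language.types.string.html#language.types.string.syntax"
>strings</A
>. Thus, to match \ the regular expression must be
     \\, which has to be written as "\\\\" or '\\\\' in PHP code.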
    </P
></BLOCKQUOTE
></DIV
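><P
>The note above can be checked with a minimal sketch. The subject string and variable names are made up for illustration; only preg_match() is taken from the manual.</P
>

```php
<?php
// The regex needed is \\ (one escaped backslash).
// In a single-quoted PHP string every backslash is doubled again,
// so the pattern literal contains four backslashes.
$pattern = '/\\\\/';
$subject = 'C:\\temp';          // the actual string value is C:\temp

$found = preg_match($pattern, $subject, $m);
echo $found, ' ', $m[0], "\n";  // prints the match count and the matched backslash
```

<P
>Writing only two backslashes in the PHP literal would hand PCRE a single trailing backslash, which is not a valid pattern.</P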
><P
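>&#13;    If a pattern is compiled with the
    <A
HREF="reference.pcre.pattern.modifiers.html"
>PCRE_EXTENDED</A
>
    option, whitespace in the pattern (other than in a character class) and characters between a "#" outside a character class and the next newline character are ignored. An escaping backslash can be used to include a whitespace or "#" character as part of the pattern.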
   </P
><P
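>&#13;    A second use of backslash provides a way of encoding non-printing characters in patterns in a visible manner. There is no restriction on the appearance of non-printing characters, apart from the binary zero that terminates a pattern, but when a pattern is being prepared in a text editor it is usually easier to use one of the following escape sequences than the binary character it represents: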
   </P
><P
>&#13;    <P
></P
><DIV
CLASS="variablelist"
><DL
><DT
><SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>\a</I
></SPAN
></DT
><DD
><P
>&#13;        alarm, that is, the BEL character (0x07)
       </P
></DD
><DT
><SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>\cx</I
></SPAN
></DT
><DD
><P
>&#13;        "control-x", where x is any character
       </P
></DD
><DT
><SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>\e</I
></SPAN
></DT
><DD
><P
>&#13;        escape (0x1B)
       </P
></DD
><DT
><SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>\f</I
></SPAN
></DT
><DD
><P
>&#13;        formfeed (0x0C)
       </P
></DD
><DT
><SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>\n</I
></SPAN
></DT
><DD
><P
>&#13;        newline (0x0A)
       </P
></DD
><DT
><SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>\r</I
></SPAN
></DT
><DD
><P
>&#13;        carriage return (0x0D)
       </P
></DD
><DT
><SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>\t</I
></SPAN
></DT
><DD
><P
>&#13;        tab (0x09)
       </P
></DD
><DT
><SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>\xhh</I
></SPAN
></DT
><DD
><P
>&#13;        character with hexadecimal code hh
       </P
></DD
><DT
><SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>\ddd</I
></SPAN
></DT
><DD
><P
>&#13;        character with octal code ddd, or a back reference
       </P
></DD
></DL
></DIV
>
   </P
><P
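>&#13;    The precise effect of "<TT
CLASS="literal"
>\cx</TT
>" is as follows: if "<TT
CLASS="literal"
>x</TT
>" is a lower case letter, it is converted to upper case. Then bit
    6 of the character (0x40) is inverted. Thus "<TT
CLASS="literal"
>\cz</TT
>" becomes
    0x1A, but "<TT
CLASS="literal"
>\c{</TT
>" becomes
    0x3B, while "<TT
CLASS="literal"
>\c;</TT
>" becomes 0x7B.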
   </P
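><P
>The arithmetic behind the control-character escape can be reproduced in a few lines. This is only a sketch: controlChar() is a hypothetical helper written for this illustration, not part of PHP or PCRE.</P
>

```php
<?php
// Hypothetical helper that mirrors the rule described above:
// upper-case the character, then flip bit 6 (0x40).
function controlChar(string $x): int
{
    return ord(strtoupper($x)) ^ 0x40;
}

printf("%02X %02X %02X\n", controlChar('z'), controlChar('{'), controlChar(';'));

// Cross-check one case against the regex engine itself.
$engineAgrees = preg_match('/\cz/', "\x1A");
```

<P
>The three values printed correspond to the three escapes discussed in the paragraph above.</P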
><P
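>&#13;    After "<TT
CLASS="literal"
>\x</TT
>", up to two hexadecimal digits are read (letters can be in upper or lower case). In <SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>UTF-8
    mode</I
></SPAN
>, "<TT
CLASS="literal"
>\x{...}</TT
>" is allowed, where the contents of the braces is a string of hexadecimal digits. The original hexadecimal escape sequence,
    <TT
CLASS="literal"
>\xhh</TT
>, matches a two-byte UTF-8 character if its value is greater than 127.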
   </P
><P
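>&#13;    After "<TT
CLASS="literal"
>\0</TT
>", up to two further octal digits are read. In both cases, if there are fewer than two digits, just those that are present are used. Thus the sequence "<TT
CLASS="literal"
>\0\x\07</TT
>" specifies two binary zeros followed by a
    BEL character. Make sure to supply two digits after the initial zero if the character that follows is itself an octal digit.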
   </P
><P
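>&#13;    The handling of a backslash followed by a digit other than 0
    is complicated. Outside a character class, PCRE
    reads it and any following digits as a decimal number. If the number is less than
    10, or if there have been at least that many previous capturing left parentheses in the expression, the entire sequence is taken as a <SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>back reference</I
></SPAN
>. A description of how this works is given later, following the discussion of parenthesized subpatterns.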
   </P
><P
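>&#13;    Inside a character class, or if the decimal number is greater than 9
    and there have not been that many capturing subpatterns, PCRE re-reads up to three octal digits following the backslash and generates a single byte from the least significant
    8 bits of the value. Any subsequent digits stand for themselves. For example: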
   </P
><P
>&#13;    <P
></P
><DIV
CLASS="variablelist"
><DL
><DT
><SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>\040</I
></SPAN
></DT
><DD
><P
>&#13;        is another way of writing a space
       </P
></DD
><DT
><SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>\40</I
></SPAN
></DT
><DD
><P
>&#13;        is the same, provided there are fewer than 40 previous capturing subpatterns
       </P
></DD
><DT
><SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>\7</I
></SPAN
></DT
><DD
><P
>&#13;        is always a back reference
       </P
></DD
><DT
><SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>\11</I
></SPAN
></DT
><DD
><P
>&#13;        might be a back reference, or another way of writing a tab
       </P
></DD
><DT
><SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>\011</I
></SPAN
></DT
><DD
><P
>&#13;        is always a tab
       </P
></DD
><DT
><SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>\0113</I
></SPAN
></DT
><DD
><P
>&#13;        is a tab followed by the character "3"
       </P
></DD
><DT
><SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>\113</I
></SPAN
></DT
><DD
><P
>&#13;        is the character with octal code 113 (since there can be no more than 99 back references)
       </P
></DD
><DT
><SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>\377</I
></SPAN
></DT
><DD
><P
>&#13;        is a byte consisting entirely of 1 bits
       </P
></DD
><DT
><SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>\81</I
></SPAN
></DT
><DD
><P
>&#13;        is either a back reference, or a binary zero followed by the two characters "8" and "1"
       </P
></DD
></DL
></DIV
>
   </P
><P
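>&#13;    Note that octal values of 100 or greater must not be introduced by a leading zero, because no more than three octal digits are ever read after the backslash.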
   </P
><P
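>&#13;    All the sequences that define a single byte value can be used both inside and outside character classes. In addition, inside a character class, the sequence "<TT
CLASS="literal"
>\b</TT
>" is interpreted as the backspace character (0x08), while outside a character class it has a different meaning (see below).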
   </P
><P
>&#13;    The third use of backslash is for specifying generic character types:
   </P
><P
>&#13;    <P
></P
><DIV
CLASS="variablelist"
><DL
><DT
><SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>\d</I
></SPAN
></DT
><DD
><P
>&#13;        any decimal digit
       </P
></DD
><DT
><SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>\D</I
></SPAN
></DT
><DD
><P
>&#13;        any character that is not a decimal digit
       </P
></DD
><DT
><SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>\s</I
></SPAN
></DT
><DD
><P
>&#13;        any whitespace character
       </P
></DD
><DT
><SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>\S</I
></SPAN
></DT
><DD
><P
>&#13;        any character that is not a whitespace character
       </P
></DD
><DT
><SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>\w</I
></SPAN
></DT
><DD
><P
>&#13;        any "word" character
       </P
></DD
><DT
><SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>\W</I
></SPAN
></DT
><DD
><P
>&#13;        any "non-word" character
       </P
></DD
></DL
></DIV
>
   </P
><P
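>&#13;    Each pair of escape sequences partitions the complete set of characters into two disjoint sets. Any given character matches one, and only one, of each pair.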
   </P
><P
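>&#13;    A "word" character is any letter or digit or the underscore character, that is, any character which can be part of a
    Perl "<TT
CLASS="literal"
>word</TT
>". The definition of letters and digits is controlled by
    PCRE's character tables, and may vary if locale-specific matching is taking place. For example, in the
    "fr" (French) locale, some character codes greater than 128 are used for accented letters, and these are matched by
    <TT
CLASS="literal"
>\w</TT
>.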
   </P
><P
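>&#13;    These character type sequences can appear both inside and outside character classes. Each matches one character of the appropriate type. If the current matching point is at the end of the subject string, all of them fail, since there is no character to match.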
   </P
><P
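>&#13;    The fourth use of backslash is for certain simple assertions. An assertion specifies a condition that has to be met at a particular point in a match, without consuming any characters from the subject string. The use of subpatterns for more complicated assertions is described below. The backslashed assertions are: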
   </P
><P
>&#13;    <P
></P
><DIV
CLASS="variablelist"
><DL
><DT
><SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>\b</I
></SPAN
></DT
><DD
><P
>&#13;        word boundary
       </P
></DD
><DT
><SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>\B</I
></SPAN
></DT
><DD
><P
>&#13;        not a word boundary
       </P
></DD
><DT
><SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>\A</I
></SPAN
></DT
><DD
><P
>&#13;        start of subject (independent of multiline mode)
       </P
></DD
><DT
><SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>\Z</I
></SPAN
></DT
><DD
><P
>&#13;        end of subject, or before a newline at the end (independent of multiline mode)
       </P
></DD
><DT
><SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>\z</I
></SPAN
></DT
><DD
><P
>&#13;        end of subject (independent of multiline mode)
       </P
></DD
><DT
><SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>\G</I
></SPAN
></DT
><DD
><P
>first matching position in subject</P
></DD
></DL
></DIV
>
   </P
><P
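>&#13;    These assertions may not appear in character classes (but note that
    "<TT
CLASS="literal"
>\b</TT
>" has a different meaning, namely the backspace character, inside a character class).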
   </P
><P
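>&#13;    A word boundary is a position in the subject string where the current character and the previous character do not both match
    <TT
CLASS="literal"
>\w</TT
> or <TT
CLASS="literal"
>\W</TT
> (i.e. one matches
    <TT
CLASS="literal"
>\w</TT
> and the other matches
    <TT
CLASS="literal"
>\W</TT
>), or the start or end of the string if the first or last character, respectively, matches
    \w.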
   </P
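><P
>A small sketch of the word-boundary rule, using an invented subject string. It counts how often "cat" occurs as a whole word.</P
>

```php
<?php
// "cat" inside "concat" is preceded by the word character "n",
// so there is no boundary before it and it is not counted.
$count = preg_match_all('/\bcat\b/', 'cat concat cat.', $m);
echo $count, "\n";
```

<P
>The first occurrence is bounded by the start of the string and a space, the last one by a space and a full stop.</P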
><P
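>&#13;    The <TT
CLASS="literal"
>\A</TT
>, <TT
CLASS="literal"
>\Z</TT
>, and <TT
CLASS="literal"
>\z</TT
>
    assertions differ from the traditional circumflex and dollar (described below) in that they only ever match at the very start and end of the subject string, whatever options are set. They are not affected by the
    <A
HREF="reference.pcre.pattern.modifiers.html"
>PCRE_NOTBOL</A
> or
    <A
HREF="reference.pcre.pattern.modifiers.html"
>PCRE_NOTEOL</A
>
    options. The difference between <TT
CLASS="literal"
>\Z</TT
> and
    <TT
CLASS="literal"
>\z</TT
> is that <TT
CLASS="literal"
>\Z</TT
>
    matches before a newline that is the last character of the string as well as at the end of the string, whereas
    <TT
CLASS="literal"
>\z</TT
> matches only at the end.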
   </P
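><P
>The difference between the end-of-subject assertions shows up when the subject ends in a newline. The following sketch uses an invented subject; the D modifier is PHP's spelling of PCRE_DOLLAR_ENDONLY.</P
>

```php
<?php
$subject = "abc\n";

$results = [
    'Z'        => preg_match('/abc\Z/', $subject),  // before the final newline: matches
    'z'        => preg_match('/abc\z/', $subject),  // only the very end: no match
    'dollar'   => preg_match('/abc$/', $subject),   // default dollar behaves like \Z
    'dollar_D' => preg_match('/abc$/D', $subject),  // with D, dollar behaves like \z
];
print_r($results);
```

<P
>When a trailing newline must be rejected, for instance while validating input, the stricter forms are usually the safer choice.</P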
><P
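>&#13;    The <TT
CLASS="literal"
>\G</TT
> assertion is true only when the current matching position is at the start point of the match, as specified by the
    <A
HREF="function.preg-match.html"
><B
CLASS="function"
>preg_match()</B
></A
> <CODE
CLASS="parameter"
>offset</CODE
>
    argument. When the value of <CODE
CLASS="parameter"
>offset</CODE
>
    is non-zero, it differs from <TT
CLASS="literal"
>\A</TT
>. It is available since PHP 4.3.3.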
   </P
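><P
>A minimal sketch of the difference with a non-zero offset. The subject string is invented; the fifth argument of preg_match() is the offset described above.</P
>

```php
<?php
$subject = 'xxfoo';

// \G is satisfied at the position where matching starts (offset 2).
$withG = preg_match('/\Gfoo/', $subject, $m, 0, 2);

// \A still refers to the real start of the subject (position 0).
$withA = preg_match('/\Afoo/', $subject, $m, 0, 2);

var_dump($withG, $withA);
```

<P
>This is why passing an offset is not the same as matching against a substring of the subject.</P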
><P
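>&#13;    <TT
CLASS="literal"
>\Q</TT
> and <TT
CLASS="literal"
>\E</TT
> can be used since PHP 4.3.3
    to ignore regular expression metacharacters in the pattern. For example, <TT
CLASS="literal"
>\w+\Q.$.\E$</TT
>
    matches one or more word characters, followed by the literal text <TT
CLASS="literal"
>.$.</TT
>
    and anchored at the end of the string.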
   </P
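><P
>The quoting example above can be tried directly. The two subject strings are invented for illustration.</P
>

```php
<?php
// Everything between \Q and \E is taken literally,
// so the dots and the dollar sign lose their special meaning.
$pattern = '/\w+\Q.$.\E$/';

$hit  = preg_match($pattern, 'abc.$.');   // ends in the literal text .$.
$miss = preg_match($pattern, 'abcX$X');   // a dot inside \Q...\E is not a wildcard

var_dump($hit, $miss);
```

<P
>For quoting text that comes from a variable, PHP's preg_quote() function is the usual alternative.</P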
></DIV
><DIV
CLASS="refsect2"
><A
NAME="regexp.reference.unicode"
></A
><H3
>Unicode character properties</H3
><P
>&#13;    Since PHP 4.4.0 and 5.1.0, three
    additional escape sequences to match generic character types are available
    when <SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>UTF-8 mode</I
></SPAN
> is selected. They are:
   </P
><P
></P
><DIV
CLASS="variablelist"
><DL
><DT
><SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>\p{xx}</I
></SPAN
></DT
><DD
><P
>a character with the xx property</P
></DD
><DT
><SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>\P{xx}</I
></SPAN
></DT
><DD
><P
>a character without the xx property</P
></DD
><DT
><SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>\X</I
></SPAN
></DT
><DD
><P
>an extended Unicode sequence</P
></DD
></DL
></DIV
><P
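>&#13;    The property names represented by <TT
CLASS="literal"
>xx</TT
> above are limited to the Unicode
    general category properties. Each character has exactly one such property, specified by a two-letter abbreviation. For compatibility with
    Perl, negation can be specified by including a circumflex between the opening brace and the property name. For example,
    <TT
CLASS="literal"
>\p{^Lu}</TT
> is the same as <TT
CLASS="literal"
>\P{Lu}</TT
>.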
   </P
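><P
>A short sketch of property matching. It assumes a PHP build whose PCRE has Unicode property support and uses the u modifier to select UTF-8 mode; the sample word is invented.</P
>

```php
<?php
$word = "\xC3\x89lan";  // the UTF-8 bytes of "Elan" with an accented capital E

// One upper case letter followed by lower case letters only.
$capitalised = preg_match('/^\p{Lu}\p{Ll}+$/u', $word);

// The two negated spellings are interchangeable.
$negatedCaret = preg_match('/^\p{^Lu}$/u', 'a');
$negatedUpper = preg_match('/^\P{Lu}$/u', 'a');

var_dump($capitalised, $negatedCaret, $negatedUpper);
```

<P
>Without the u modifier the accented letter would be seen as two separate bytes rather than one character.</P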
><P
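>&#13;    If only one letter is specified with <TT
CLASS="literal"
>\p</TT
> or <TT
CLASS="literal"
>\P</TT
>,
    it includes all the properties that start with that letter. In this case, in the absence of negation, the curly brackets are optional. The following two examples have the same effect: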
   </P
><P
CLASS="literallayout"
><br>
&nbsp;&nbsp;&nbsp;&nbsp;\p{L}<br>
&nbsp;&nbsp;&nbsp;&nbsp;\pL<br>
&nbsp;&nbsp;&nbsp;</P
><DIV
CLASS="table"
><A
NAME="AEN171137"
></A
><P
><B
>Table 1. Supported property codes</B
></P
><TABLE
BORDER="1"
CLASS="CALSTABLE"
><COL><COL><TBODY
><TR
><TD
><TT
CLASS="literal"
>C</TT
></TD
><TD
>Other</TD
></TR
><TR
><TD
><TT
CLASS="literal"
>Cc</TT
></TD
><TD
>Control</TD
></TR
><TR
><TD
><TT
CLASS="literal"
>Cf</TT
></TD
><TD
>Format</TD
></TR
><TR
><TD
><TT
CLASS="literal"
>Cn</TT
></TD
><TD
>Unassigned</TD
></TR
><TR
><TD
><TT
CLASS="literal"
>Co</TT
></TD
><TD
>Private use</TD
></TR
><TR
><TD
><TT
CLASS="literal"
>Cs</TT
></TD
><TD
>Surrogate</TD
></TR
><TR
><TD
><TT
CLASS="literal"
>L</TT
></TD
><TD
>Letter</TD
></TR
><TR
><TD
><TT
CLASS="literal"
>Ll</TT
></TD
><TD
>Lower case letter</TD
></TR
><TR
><TD
><TT
CLASS="literal"
>Lm</TT
></TD
><TD
>Modifier letter</TD
></TR
><TR
><TD
><TT
CLASS="literal"
>Lo</TT
></TD
><TD
>Other letter</TD
></TR
><TR
><TD
><TT
CLASS="literal"
>Lt</TT
></TD
><TD
>Title case letter</TD
></TR
><TR
><TD
><TT
CLASS="literal"
>Lu</TT
></TD
><TD
>Upper case letter</TD
></TR
><TR
><TD
><TT
CLASS="literal"
>M</TT
></TD
><TD
>Mark</TD
></TR
><TR
><TD
><TT
CLASS="literal"
>Mc</TT
></TD
><TD
>Spacing mark</TD
></TR
><TR
><TD
><TT
CLASS="literal"
>Me</TT
></TD
><TD
>Enclosing mark</TD
></TR
><TR
><TD
><TT
CLASS="literal"
>Mn</TT
></TD
><TD
>Non-spacing mark</TD
></TR
><TR
><TD
><TT
CLASS="literal"
>N</TT
></TD
><TD
>Number</TD
></TR
><TR
><TD
><TT
CLASS="literal"
>Nd</TT
></TD
><TD
>Decimal number</TD
></TR
><TR
><TD
><TT
CLASS="literal"
>Nl</TT
></TD
><TD
>Letter number</TD
></TR
><TR
><TD
><TT
CLASS="literal"
>No</TT
></TD
><TD
>Other number</TD
></TR
><TR
><TD
><TT
CLASS="literal"
>P</TT
></TD
><TD
>Punctuation</TD
></TR
><TR
><TD
><TT
CLASS="literal"
>Pc</TT
></TD
><TD
>Connector punctuation</TD
></TR
><TR
><TD
><TT
CLASS="literal"
>Pd</TT
></TD
><TD
>Dash punctuation</TD
></TR
><TR
><TD
><TT
CLASS="literal"
>Pe</TT
></TD
><TD
>Close punctuation</TD
></TR
><TR
><TD
><TT
CLASS="literal"
>Pf</TT
></TD
><TD
>Final punctuation</TD
></TR
><TR
><TD
><TT
CLASS="literal"
>Pi</TT
></TD
><TD
>Initial punctuation</TD
></TR
><TR
><TD
><TT
CLASS="literal"
>Po</TT
></TD
><TD
>Other punctuation</TD
></TR
><TR
><TD
><TT
CLASS="literal"
>Ps</TT
></TD
><TD
>Open punctuation</TD
></TR
><TR
><TD
><TT
CLASS="literal"
>S</TT
></TD
><TD
>Symbol</TD
></TR
><TR
><TD
><TT
CLASS="literal"
>Sc</TT
></TD
><TD
>Currency symbol</TD
></TR
><TR
><TD
><TT
CLASS="literal"
>Sk</TT
></TD
><TD
>Modifier symbol</TD
></TR
><TR
><TD
><TT
CLASS="literal"
>Sm</TT
></TD
><TD
>Mathematical symbol</TD
></TR
><TR
><TD
><TT
CLASS="literal"
>So</TT
></TD
><TD
>Other symbol</TD
></TR
><TR
><TD
><TT
CLASS="literal"
>Z</TT
></TD
><TD
>Separator</TD
></TR
><TR
><TD
><TT
CLASS="literal"
>Zl</TT
></TD
><TD
>Line separator</TD
></TR
><TR
><TD
><TT
CLASS="literal"
>Zp</TT
></TD
><TD
>Paragraph separator</TD
></TR
><TR
><TD
><TT
CLASS="literal"
>Zs</TT
></TD
><TD
>Space separator</TD
></TR
></TBODY
></TABLE
></DIV
><P
>&#13;    Extended properties such as "Greek" or "InMusicalSymbols" are not supported by PCRE.
   </P
><P
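>&#13;    Specifying case-insensitive (caseless) matching does not affect these escape sequences. For example, <TT
CLASS="literal"
>\p{Lu}</TT
>
    always matches only upper case letters.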
   </P
><P
>&#13;    The <TT
CLASS="literal"
>\X</TT
> escape matches any number of Unicode characters that form an extended Unicode sequence. <TT
CLASS="literal"
>\X</TT
>
    is equivalent to <TT
CLASS="literal"
>(?&#62;\PM\pM*)</TT
>.
   </P
><P
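>&#13;    That is, it matches a character without the "mark" property, followed by zero or more characters with the "mark" property, and treats the sequence as an atomic group (see below). Characters with the "mark" property are typically accents that affect the preceding character.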
   </P
><P
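>&#13;    Matching characters by Unicode property is not fast, because PCRE
    has to search a structure that contains data for over fifteen thousand characters. That is why in
    PCRE the traditional escape sequences such as <TT
CLASS="literal"
>\d</TT
> and
    <TT
CLASS="literal"
>\w</TT
> do not use Unicode properties.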
   </P
></DIV
><DIV
CLASS="refsect2"
><A
NAME="regexp.reference.circudollar"
></A
><H3
>Circumflex (^) and dollar ($)</H3
><P
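>&#13;    Outside a character class, in the default matching mode, the circumflex character is an assertion which is true only if the current matching point is at the start of the subject string. Inside a character class, circumflex has an entirely different meaning (see below).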
   </P
><P
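>&#13;    Circumflex need not be the first character of the pattern if a number of alternatives are involved, but it should be the first thing in each alternative in which it appears if the pattern is ever to match that branch. If all possible alternatives start with a circumflex, that is, if the pattern is constrained to match only at the start of the subject, it is said to be an "anchored" pattern. (There are also other constructs that can cause a pattern to be anchored.)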
   </P
><P
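>&#13;    A dollar character is an assertion which is
    <TT
CLASS="constant"
><B
>TRUE</B
></TT
> only if the current matching point is at the end of the subject string, or immediately before a newline character that is the last character in the string (by default). Dollar need not be the last character of the pattern if a number of alternatives are involved, but it should be the last item in any branch in which it appears. Dollar has no special meaning in a character class.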
   </P
><P
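>&#13;    The meaning of dollar can be changed so that it matches only at the very end of the string, by setting the
    <A
HREF="reference.pcre.pattern.modifiers.html"
>PCRE_DOLLAR_ENDONLY</A
>
    option at compile or matching time. This does not affect the \Z assertion.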
   </P
><P
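>&#13;    The meanings of the circumflex and dollar characters are changed if the
    <A
HREF="reference.pcre.pattern.modifiers.html"
>PCRE_MULTILINE</A
>
    option is set. When this is the case, they match immediately after and immediately before an internal
    "\n" character, respectively, in addition to matching at the start and end of the subject string. For example, the pattern
    /^abc$/ matches the subject string
    "def\nabc" in multiline mode, but not otherwise. Consequently, patterns that are anchored in single line mode because all branches start with
    "^" are not anchored in multiline mode. If
    <A
HREF="reference.pcre.pattern.modifiers.html"
>PCRE_MULTILINE</A
> is set, the
    <A
HREF="reference.pcre.pattern.modifiers.html"
>PCRE_DOLLAR_ENDONLY</A
>
    option is ignored.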
   </P
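><P
>The multiline example from the paragraph above, as a runnable sketch. The m modifier is PHP's spelling of PCRE_MULTILINE.</P
>

```php
<?php
$subject = "def\nabc";

$single = preg_match('/^abc$/', $subject);   // ^ only matches at the very start
$multi  = preg_match('/^abc$/m', $subject);  // ^ also matches after the internal newline

var_dump($single, $multi);
```

<P
>Only the second call finds a match, because "abc" starts on the second line of the subject.</P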
><P
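>&#13;    Note that the sequences \A, \Z, and \z can be used to match the start and end of the subject in both modes, and if all branches of a pattern start with
    \A it is always anchored, whether or not
    <A
HREF="reference.pcre.pattern.modifiers.html"
>PCRE_MULTILINE</A
> is set.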
   </P
></DIV
><DIV
CLASS="refsect2"
><A
NAME="regexp.reference.dot"
></A
><H3
>Full stop (.)</H3
><P
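>&#13;    Outside a character class, a dot in the pattern matches any one character in the subject, including a non-printing character, but not (by default) newline. If the
    <A
HREF="reference.pcre.pattern.modifiers.html"
>PCRE_DOTALL</A
>
    option is set, then dots match newlines as well. The handling of dot is entirely independent of the handling of circumflex and dollar, the only relationship being that they both involve newline characters. Dot has no special meaning in a character class.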
   </P
><P
>&#13;    <SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>\C</I
></SPAN
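> can be used to match a single byte. It makes sense in
    <SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>UTF-8 mode</I
></SPAN
>, where the full stop matches a whole character, which may consist of multiple bytes.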
   </P
></DIV
><DIV
CLASS="refsect2"
><A
NAME="regexp.reference.squarebrackets"
></A
><H3
>Square brackets ([])</H3
><P
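>&#13;    An opening square bracket introduces a character class, terminated by a closing square bracket. A closing square bracket on its own is not special. If a closing square bracket is required as a member of the class, it should be the first data character in the class (after an initial circumflex, if present) or escaped with a backslash.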
   </P
><P
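>&#13;    A character class matches a single character in the subject. The character must be in the set of characters defined by the class, unless the first character in the class is a circumflex, in which case the subject character must not be in the set defined by the class. If a circumflex is actually required as a member of the class, ensure it is not the first character, or escape it with a backslash.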
   </P
><P
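>&#13;    For example, the character class [aeiou] matches any lower case vowel, while [^aeiou]
    matches any character that is not a lower case vowel. Note that a circumflex is just a convenient notation for specifying the characters which are in the class by enumerating those that are not. It is not an assertion: it still consumes a character from the subject string, and fails if the current position is at the end of the string.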
   </P
><P
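>&#13;    When caseless matching is set, any letters in a class represent both their upper case and lower case versions, so for example, a caseless
    [aeiou] matches "A" as well as "a", and a caseless
    [^aeiou] does not match "A", whereas a case-sensitive version would.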
   </P
><P
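>&#13;    The newline character is never treated in any special way in character classes, whatever the setting of the
   <A
HREF="reference.pcre.pattern.modifiers.html"
>PCRE_DOTALL</A
> or
   <A
HREF="reference.pcre.pattern.modifiers.html"
>PCRE_MULTILINE</A
>
   options is. A class such as [^a] always matches a newline.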
   </P
><P
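>&#13;    The minus (hyphen) character can be used to specify a range of characters in a character class. For example, [d-m]
    matches any letter between d and m, inclusive. If a minus character is required in a class, it must be escaped with a backslash or appear in a position where it cannot be interpreted as indicating a range, typically as the first or last character in the class.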
   </P
><P
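>&#13;    It is not possible to have the literal character "]" as the end character of a range. A pattern such as
    [W-]46] is interpreted as a class of two characters ("W" and "-") followed by the literal string
    "46]", so it would match "W46]" or "-46]". However, if the
    "]" is escaped with a backslash it is interpreted as the end of the range, so
    [W-\]46] is interpreted as a single class containing a range followed by two separate characters. The octal or hexadecimal representation of
    "]" can also be used to end a range.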
   </P
><P
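>&#13;    Ranges operate in ASCII collating sequence. They can also be used for characters specified numerically, for example
    [\000-\037]. If a range that includes letters is used when caseless matching is set, it matches the letters in either case. For example,
    [W-c] is equivalent to [][\^_`wxyzabc], matched caselessly, and if character tables for the
    "fr" locale are in use, [\xc8-\xcb] matches accented E characters in both cases.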
   </P
><P
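>&#13;    The character types \d, \D, \s, \S, \w, and \W
    may also appear in a character class, and add the characters that they match to the class. For example, [\dABCDEF]
    matches any hexadecimal digit. A circumflex can conveniently be used with the upper case character types to specify a more restricted set of characters than the matching lower case type. For example, the class
    [^\W_] matches any letter or digit, but not underscore.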
   </P
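><P
>The two classes mentioned above can be exercised with a brief sketch. The subject strings are invented for illustration.</P
>

```php
<?php
// [^\W_] : word characters minus the underscore, i.e. letters and digits.
preg_match_all('/[^\W_]/', 'a_b-1', $alnum);

// [\dABCDEF] : a digit or one of the upper case hexadecimal letters.
$isHex  = preg_match('/^[\dABCDEF]+$/', '7F3A');
$notHex = preg_match('/^[\dABCDEF]+$/', '7G');

print_r($alnum[0]);
var_dump($isHex, $notHex);
```

<P
>The underscore and the hyphen in the first subject are both skipped: one is excluded explicitly, the other is not a word character at all.</P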
><P
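>&#13;    All non-alphanumeric characters other than \, -, ^ (at the start) and the terminating ]
    are non-special in character classes, but it does no harm if they are escaped.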
   </P
></DIV
><DIV
CLASS="refsect2"
><A
NAME="regexp.reference.verticalbar"
></A
><H3
>Vertical bar (|)</H3
><P
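>&#13;    Vertical bar characters are used to separate alternative patterns. For example, the pattern
    <TABLE
BORDER="0"
BGCOLOR="#E0E0E0"
CELLPADDING="5"
><TR
><TD
><PRE
CLASS="screen"
>gilbert|sullivan</PRE
></TD
></TR
></TABLE
>
    matches either "gilbert" or "sullivan".
    Any number of alternatives may appear, and an empty alternative is permitted (matching the empty string). The matching process tries each alternative in turn, from left to right, and the first one that succeeds is used. If the alternatives are within a subpattern (defined below), "succeeds" means matching the rest of the main pattern as well as the alternative in the subpattern.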
   </P
></DIV
><DIV
CLASS="refsect2"
><A
NAME="regexp.reference.internal-options"
></A
><H3
>Internal option setting</H3
><P
>&#13;    The settings of <A
HREF="reference.pcre.pattern.modifiers.html"
>PCRE_CASELESS</A
>, <A
HREF="reference.pcre.pattern.modifiers.html"
>PCRE_MULTILINE</A
>, <A
HREF="reference.pcre.pattern.modifiers.html"
>PCRE_DOTALL</A
>, <A
HREF="reference.pcre.pattern.modifiers.html"
>PCRE_EXTRA</A
>, and
    <A
HREF="reference.pcre.pattern.modifiers.html"
>PCRE_EXTENDED</A
>
    can be changed from within the pattern by a sequence of Perl option letters enclosed between
    "(?" and ")". The option letters are:
    <DIV
CLASS="table"
><A
NAME="AEN171347"
></A
><P
><B
>Table 2. Internal option letters</B
></P
><TABLE
BORDER="1"
CLASS="CALSTABLE"
><COL><COL><TBODY
><TR
><TD
><TT
CLASS="literal"
>i</TT
></TD
><TD
>for <A
HREF="reference.pcre.pattern.modifiers.html"
>PCRE_CASELESS</A
></TD
></TR
><TR
><TD
><TT
CLASS="literal"
>m</TT
></TD
><TD
>for <A
HREF="reference.pcre.pattern.modifiers.html"
>PCRE_MULTILINE</A
></TD
></TR
><TR
><TD
><TT
CLASS="literal"
>s</TT
></TD
><TD
>for <A
HREF="reference.pcre.pattern.modifiers.html"
>PCRE_DOTALL</A
></TD
></TR
><TR
><TD
><TT
CLASS="literal"
>x</TT
></TD
><TD
>for <A
HREF="reference.pcre.pattern.modifiers.html"
>PCRE_EXTENDED</A
></TD
></TR
><TR
><TD
><TT
CLASS="literal"
>U</TT
></TD
><TD
>for <A
HREF="reference.pcre.pattern.modifiers.html"
>PCRE_UNGREEDY</A
></TD
></TR
><TR
><TD
><TT
CLASS="literal"
>X</TT
></TD
><TD
>for <A
HREF="reference.pcre.pattern.modifiers.html"
>PCRE_EXTRA</A
></TD
></TR
></TBODY
></TABLE
></DIV
>
   </P
><P
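>&#13;    For example, (?im) sets caseless, multiline matching. It is also possible to unset these options by preceding the letter with a hyphen. A combined setting and unsetting such as
    (?im-sx), which sets
    <A
HREF="reference.pcre.pattern.modifiers.html"
>PCRE_CASELESS</A
> and
    <A
HREF="reference.pcre.pattern.modifiers.html"
>PCRE_MULTILINE</A
> while unsetting
    <A
HREF="reference.pcre.pattern.modifiers.html"
>PCRE_DOTALL</A
> and
    <A
HREF="reference.pcre.pattern.modifiers.html"
>PCRE_EXTENDED</A
>, is also permitted. If a letter appears both before and after the hyphen, the option is unset.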
   </P
><P
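>&#13;    When an option change occurs at top level (that is, not inside subpattern parentheses), the change applies to the remainder of the pattern that follows. So
    <TT
CLASS="literal"
>/ab(?i)c/</TT
> matches only "abc" and
    "abC". This behavior was changed in PCRE 4.0, which is bundled since PHP 4.3.3. Before those versions,
    <TT
CLASS="literal"
>/ab(?i)c/</TT
> would perform as
    <TT
CLASS="literal"
>/abc/i</TT
> (e.g. matching "ABC" and "aBc").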
   </P
><P
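>&#13;    If an option change occurs inside a subpattern, the effect is different. This is a change of behavior in
    Perl 5.005. An option change inside a subpattern affects only that part of the subpattern that follows it, so
    <TT
CLASS="literal"
>(a(?i)b)c</TT
>
    matches "abc" and "aBc" and no other strings (assuming
    <A
HREF="reference.pcre.pattern.modifiers.html"
>PCRE_CASELESS</A
> is not used). By this means, options can be made to have different settings in different parts of the pattern. Any changes made in one alternative do carry on into subsequent branches within the same subpattern. For example,
    <TT
CLASS="literal"
>(a(?i)b|c)</TT
>
    matches "ab", "aB", "c", and "C", even though when matching "C"
    the first branch is abandoned before the option setting. This is because the effects of option settings happen at compile time. There would be some very weird behavior otherwise.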
   </P
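><P
>The scoping rules for inline options can be sketched as follows. The subject strings are invented, and the anchors were added only to keep each test unambiguous.</P
>

```php
<?php
// Top level: (?i) applies to the rest of the pattern only.
$top = [
    preg_match('/^ab(?i)c$/', 'abC'),    // "c" is caseless
    preg_match('/^ab(?i)c$/', 'aBc'),    // "b" is still case-sensitive
];

// Inside a subpattern: the change ends at the closing parenthesis.
$sub = [
    preg_match('/^(a(?i)b)c$/', 'aBc'),  // "b" is caseless
    preg_match('/^(a(?i)b)c$/', 'abC'),  // "c" lies outside the subpattern
];

// The setting carries into later alternatives of the same subpattern.
$alt = preg_match('/^(a(?i)b|c)$/', 'C');

print_r($top);
print_r($sub);
var_dump($alt);
```

<P
>When the whole pattern should be caseless, the i modifier after the closing delimiter is simpler than an inline option.</P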
><P
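>&#13;    The PCRE-specific options
    <A
HREF="reference.pcre.pattern.modifiers.html"
>PCRE_UNGREEDY</A
> and
    <A
HREF="reference.pcre.pattern.modifiers.html"
>PCRE_EXTRA</A
>
    can be changed in the same way as the Perl-compatible options by using the characters
    U and X respectively. The (?X) flag setting is special in that it must always occur earlier in the pattern than any of the additional features it turns on. It is best put at the start.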
   </P
></DIV
><DIV
CLASS="refsect2"
><A
NAME="regexp.reference.subpatterns"
></A
><H3
>Subpatterns</H3
><P
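>&#13;    Subpatterns are delimited by parentheses (round brackets), which can be nested. Marking part of a pattern as a subpattern does two things: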
   </P
><P
>&#13;    1. It localizes a set of alternatives. For example, the pattern
    <TABLE
BORDER="0"
BGCOLOR="#E0E0E0"
CELLPADDING="5"
><TR
><TD
><PRE
CLASS="screen"
>cat(aract|erpillar|)</PRE
></TD
></TR
></TABLE
>
    matches one of the words "cat", "cataract", or "caterpillar".
    Without the parentheses, it would match "cataract", "erpillar" or the empty string.
   </P
><P
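>&#13;    2. It sets up the subpattern as a capturing subpattern (as defined above). When the whole pattern matches, the portion of the subject string that matched the subpattern is passed back to the caller via the
    <B
CLASS="function"
>pcre_exec()</B
> <SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>ovector</I
></SPAN
>
    argument. Opening parentheses are counted from left to right (starting from 1) to obtain the numbers of the capturing subpatterns.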
   </P
><P
>&#13;    For example, if the string "the red king" is matched against the pattern
    <TABLE
BORDER="0"
BGCOLOR="#E0E0E0"
CELLPADDING="5"
><TR
><TD
><PRE
CLASS="screen"
>the ((red|white) (king|queen))</PRE
></TD
></TR
></TABLE
>
    the captured substrings are "red king", "red",
    and "king", and are numbered 1, 2, and 3.
   </P
><P
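>&#13;    The fact that plain parentheses fulfill two functions is not always helpful. There are often times when a grouping subpattern is required without a capturing requirement. If an opening parenthesis is followed by
    "?:", the subpattern does not do any capturing, and is not counted when computing the number of any subsequent capturing subpatterns. For example, if the string
    "the white queen" is matched against the pattern <TT
CLASS="literal"
>the ((?:red|white) (king|queen))</TT
>, the captured substrings are
    "white queen" and "queen", and are numbered 1 and 2. The maximum number of captured substrings is
    99, and the maximum number of all subpatterns, both capturing and non-capturing, is 200.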
   </P
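><P
>Both numbering examples above can be reproduced with preg_match(), which returns the captured substrings in its third argument. The variable names are chosen for this sketch only.</P
>

```php
<?php
// Three capturing groups, numbered by their opening parenthesis.
preg_match('/the ((red|white) (king|queen))/', 'the red king', $all);

// (?: ... ) groups without capturing, so later groups move up by one.
preg_match('/the ((?:red|white) (king|queen))/', 'the white queen', $some);

print_r($all);
print_r($some);
```

<P
>In both arrays index 0 holds the text matched by the whole pattern; the numbered groups start at index 1.</P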
><P
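>&#13;    As a convenient shorthand, if any option settings are required at the start of a non-capturing subpattern, the option letters may appear between the
    "?" and the ":". Thus the two patterns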
   </P
><P
CLASS="literallayout"
><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(?i:saturday|sunday)<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(?:(?i)saturday|sunday)<br>
&nbsp;&nbsp;&nbsp;</P
><P
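>&#13;    match exactly the same set of strings. Because alternative branches are tried from left to right, and options are not reset until the end of the subpattern is reached, an option setting in one branch does affect subsequent branches, so the above patterns match
    "SUNDAY" as well as "Saturday".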
   </P
><P
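>&#13;    Since PHP 4.3.3 it is possible to name a subpattern using the syntax <TT
CLASS="literal"
>(?P&#60;name&#62;pattern)</TT
>.
    The array of matches then contains the match indexed by the subpattern name as well as by its number.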
   </P
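><P
>A minimal sketch of named subpatterns. The group names and the subject string are invented for illustration.</P
>

```php
<?php
preg_match('/(?P<year>\d{4})-(?P<month>\d{2})/', '2008-04', $m);

// Each named group appears twice: once under its name, once under its number.
echo $m['year'], ' ', $m[1], "\n";
echo $m['month'], ' ', $m[2], "\n";
```

<P
>Named keys keep the calling code readable when groups are later added to or removed from the pattern.</P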
></DIV
><DIV
CLASS="refsect2"
><A
NAME="regexp.reference.repetition"
></A
><H3
>Repetition</H3
><P
>&#13;    Repetition is specified by quantifiers, which can follow any of the following items:
    <P
></P
><UL
><LI
><P
>a single character, possibly escaped</P
></LI
><LI
><P
>the . metacharacter</P
></LI
><LI
><P
>a character class</P
></LI
><LI
><P
>a back reference (see next section)</P
></LI
><LI
><P
>a parenthesized subpattern (unless it is an assertion - see below)</P
></LI
></UL
>
  </P
><P
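>&#13;   The general repetition quantifier specifies a minimum and maximum number of permitted matches, by giving the two numbers in curly brackets (braces), separated by a comma. The numbers must be less than
   65536, and the first must be less than or equal to the second. For example, <TT
CLASS="literal"
>z{2,4}</TT
>
   matches "zz", "zzz", or "zzzz". A closing brace on its own is not a special character. If the second number is omitted, but the comma is present, there is no upper limit. If the second number and the comma are both omitted, the quantifier specifies an exact number of required matches. Thus
   <TT
CLASS="literal"
>[aeiou]{3,}</TT
> matches at least 3 successive vowels, but may match many more, while <TT
CLASS="literal"
>\d{8}</TT
>
   matches exactly 8 digits. An opening curly bracket that appears in a position where a quantifier is not allowed, or one that does not match the syntax of a quantifier, is taken as a literal character. For example,
   {,6} is not a quantifier, but a literal string of four characters.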
  </P
><P
>&#13;   The quantifier {0} is permitted, causing the expression to behave as if the previous item and the quantifier were not present.
  </P
><P
>&#13;   For convenience (and historical compatibility) the three most common quantifiers have single-character abbreviations:
   <DIV
CLASS="table"
><A
NAME="AEN171434"
></A
><P
><B
>Table 3. Single-character quantifiers</B
></P
><TABLE
BORDER="1"
CLASS="CALSTABLE"
><COL><COL><TBODY
><TR
><TD
><TT
CLASS="literal"
>*</TT
></TD
><TD
>equivalent to <TT
CLASS="literal"
>{0,}</TT
></TD
></TR
><TR
><TD
><TT
CLASS="literal"
>+</TT
></TD
><TD
>equivalent to <TT
CLASS="literal"
>{1,}</TT
></TD
></TR
><TR
><TD
><TT
CLASS="literal"
>?</TT
></TD
><TD
>equivalent to <TT
CLASS="literal"
>{0,1}</TT
></TD
></TR
></TBODY
></TABLE
></DIV
>
  </P
><P
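>&#13;   It is possible to construct infinite loops by following a subpattern that can match no characters with a quantifier that has no upper limit, for example: <TT
CLASS="literal"
>(a?)*</TT
>.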
  </P
><P
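>&#13;   Earlier versions of Perl and PCRE
   used to give an error at compile time for such patterns. However, because there are cases where this can be useful, such patterns are now accepted, but if any repetition of the subpattern does in fact match no characters, the loop is forcibly broken.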
  </P
><P
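>&#13;   By default, the quantifiers are "greedy", that is, they match as much as possible (up to the maximum number of permitted times) without causing the rest of the pattern to fail. The classic example of where this gives problems is in trying to match comments in
   C programs. These appear between the sequences /* and */, and within the sequence individual *
   and / characters may appear. An attempt to match C comments by applying the pattern <TT
CLASS="literal"
>/\*.*\*/</TT
>
   to the string <TT
CLASS="literal"
>/* first comment */  not comment  /* second comment */</TT
>
   fails, because it matches the entire string due to the greediness of the .* item.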
  </P
><P
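>&#13;   However, if a quantifier is followed by a question mark, it ceases to be greedy and instead matches the minimum number of times possible, so the pattern

       <TT
CLASS="literal"
>/\*.*?\*/</TT
>

   does the right thing with the C comments. The meaning of the various quantifiers is not otherwise changed, just the preferred number of matches. Do not confuse this use of question mark with its use as a quantifier in its own right. Because it has two uses, it can sometimes appear doubled, as in

       <TT
CLASS="literal"
>\d??\d</TT
>

   which matches one digit by preference, but can match two if that is the only way the rest of the pattern matches.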
  </P
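><P
>The comment example can be run as a short sketch. The # character is used as the pattern delimiter here only so that the slashes inside the pattern need no escaping.</P
>

```php
<?php
$src = '/* first comment */  not comment  /* second comment */';

preg_match('#/\*.*\*/#', $src, $greedy);   // .* runs on to the last */
preg_match('#/\*.*?\*/#', $src, $lazy);    // .*? stops at the first */

// The doubled question mark: an optional digit that prefers to be absent.
preg_match('/\d??\d/', '12', $digit);

echo $greedy[0], "\n", $lazy[0], "\n", $digit[0], "\n";
```

<P
>Only the lazy version isolates the first comment; the greedy one returns the whole subject string.</P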
><P
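>&#13;   If the <A
HREF="reference.pcre.pattern.modifiers.html"
>PCRE_UNGREEDY</A
> option is set (an option which is not available in Perl),
   the quantifiers are not greedy by default, but individual ones can be made greedy by following them with a question mark. In other words, it inverts the default behavior.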
  </P
><P
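>&#13;   Quantifiers followed by <TT
CLASS="literal"
>+</TT
> are "possessive". They consume as many characters as possible and never give any back to let the rest of the pattern match. Thus
   <TT
CLASS="literal"
>.*abc</TT
> matches "aabc" but
   <TT
CLASS="literal"
>.*+abc</TT
> does not, because <TT
CLASS="literal"
>.*+</TT
>
   has already consumed the whole string. Possessive quantifiers can be used to speed up processing since PHP 4.3.3.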
  </P
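><P
>The possessive example above, as a two-line sketch:</P
>

```php
<?php
$greedy     = preg_match('/.*abc/', 'aabc');   // .* backtracks and leaves "abc" for the rest
$possessive = preg_match('/.*+abc/', 'aabc');  // .*+ keeps everything, so "abc" can never match

var_dump($greedy, $possessive);
```

<P
>The speed benefit comes from the same property: the engine does not retry shorter matches that cannot succeed anyway.</P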
><P
>&#13;     When a parenthesized subpattern is quantified with a minimum
     repeat count that is greater than 1 or with a limited maximum,
     more storage is required for the compiled pattern, in
     proportion to the size of the minimum or maximum.
  </P
><P
>&#13;     If a pattern starts with .* or  .{0,}  and  the  <A
HREF="reference.pcre.pattern.modifiers.html"
>PCRE_DOTALL</A
>
     option (equivalent to Perl's /s) is set, thus allowing the .
     to match newlines, then the pattern is implicitly  anchored,
     because whatever follows will be tried against every character
     position in the subject string, so there is no point  in
     retrying  the overall match at any position after the first.
     PCRE treats such a pattern as though it were preceded by \A.
     In  cases where it is known that the subject string contains
     no newlines, it is worth setting <A
HREF="reference.pcre.pattern.modifiers.html"
>PCRE_DOTALL</A
>  when  the  pattern begins with .* in order to
     obtain this optimization, or
     alternatively using ^ to indicate anchoring explicitly.
  </P
><P
>&#13;     When a capturing subpattern is repeated, the value  captured
     is the substring that matched the final iteration. For example, after

       <TT
CLASS="literal"
>(tweedle[dume]{3}\s*)+</TT
>

     has matched "tweedledum tweedledee" the value  of  the  captured
     substring  is  "tweedledee".  However,  if  there are
     nested capturing  subpatterns,  the  corresponding  captured
     values  may  have been set in previous iterations. For example,
     after

       <TT
CLASS="literal"
>/(a|(b))+/</TT
>

     matches "aba" the value of the second captured substring  is
     "b".
   </P
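><P
>   The two capture examples above can be checked with a short PHP sketch. The array names are arbitrary; only preg_match() itself is taken as given.
  </P
>

```php
<?php
// The repeated group keeps only what its final iteration matched.
preg_match('/(tweedle[dume]{3}\s*)+/', 'tweedledum tweedledee', $outer);

// The nested group (b) took no part in the last iteration,
// so it still holds the value set in the previous one.
preg_match('/(a|(b))+/', 'aba', $nested);

echo $outer[1], "\n";                    // tweedledee
echo $nested[1], ' ', $nested[2], "\n";  // a b
```

<P
>   In the second pattern the outer group ends up holding "a" from the final iteration, while the inner group keeps the "b" set in the second iteration.
  </P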
></DIV
><DIV
CLASS="refsect2"
><A
NAME="regexp.reference.back-references"
></A
><H3
>Back references</H3
><P
>&#13;     Outside a character class, a backslash followed by  a  digit
     greater  than  0  (and  possibly  further  digits) is a back
     reference to a capturing subpattern  earlier  (i.e.  to  its
     left)  in  the  pattern,  provided there have been that many
     previous capturing left parentheses.
  </P
><P
>&#13;     However, if the decimal number following  the  backslash  is
     less  than  10,  it is always taken as a back reference, and
     causes an error only if there are not  that  many  capturing
     left  parentheses in the entire pattern. In other words, the
     parentheses that are referenced need not be to the  left  of
     the  reference  for  numbers  less  than 10. See the section
     entitled "Backslash" above for further details of  the  handling
     of digits following a backslash.
  </P
><P
>&#13;     A back reference matches whatever actually matched the  capturing
     subpattern in the current subject string, rather than
     anything matching the subpattern itself. So the pattern

       <TT
CLASS="literal"
>(sens|respons)e and \1ibility</TT
>

     matches "sense and sensibility" and "response and  responsibility",
     but  not  "sense  and  responsibility". If caseful
     matching is in force at the time of the back reference, then
     the case of letters is relevant. For example,

       <TT
CLASS="literal"
>((?i)rah)\s+\1</TT
>

     matches "rah rah" and "RAH RAH", but  not  "RAH  rah",  even
     though  the  original  capturing subpattern is matched caselessly.
  </P
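><P
>   As a rough illustration, the two back reference patterns above can be tried against their example subjects. The anchors ^ and $ and the array names are additions made for this sketch so that each subject is tested as a whole.
  </P
>

```php
<?php
$sense = [];
foreach (['sense and sensibility', 'response and responsibility', 'sense and responsibility'] as $subject) {
    // \1 must repeat the text that group 1 actually matched.
    $sense[$subject] = preg_match('/^(sens|respons)e and \1ibility$/', $subject);
}

$rah = [];
foreach (['rah rah', 'RAH RAH', 'RAH rah'] as $subject) {
    // (?i) applies only inside the group; the back reference itself is caseful.
    $rah[$subject] = preg_match('/^((?i)rah)\s+\1$/', $subject);
}

print_r($sense); // 1, 1, 0
print_r($rah);   // 1, 1, 0
```

<P
>   Single-quoted PHP strings are used so that \1 reaches PCRE unchanged.
  </P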
><P
>&#13;     There may be more than one back reference to the  same  subpattern.
     If  a  subpattern  has not actually been used in a
     particular match, then any  back  references  to  it  always
     fail. For example, the pattern

       <TT
CLASS="literal"
>(a|(bc))\2</TT
>

     always fails if it starts to match  "a"  rather  than  "bc".
     Because  there  may  be up to 99 back references, all digits
     following the backslash are taken as  part  of  a  potential
     back reference number. If the pattern continues with a digit
     character, then some delimiter must be used to terminate the
     back reference. If the <A
HREF="reference.pcre.pattern.modifiers.html"
>PCRE_EXTENDED</A
>  option is set, this can
     be whitespace.  Otherwise an empty comment can be used.
  </P
><P
>&#13;     A back reference that occurs inside the parentheses to which
     it  refers  fails when the subpattern is first used, so, for
     example, (a\1) never matches.  However, such references  can
     be useful inside repeated subpatterns. For example, the pattern

       <TT
CLASS="literal"
>(a|b\1)+</TT
>

     matches any number of "a"s and also "aba", "ababbaa" etc.  At
     each iteration of the subpattern, the back reference matches
     the character string corresponding to  the  previous  iteration.
     In order for this to work, the pattern must be such
     that the first iteration does not need  to  match  the  back
     reference.  This  can  be  done using alternation, as in the
     example above, or by a quantifier with a minimum of zero.
   </P
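><P
>   A small sketch of a back reference inside a repeated group. The pattern is anchored for this illustration so that the whole subject has to be consumed, and the subjects are chosen by hand.
  </P
>

```php
<?php
$pattern = '/^(a|b\1)+$/';

$results = [];
foreach (['aaa', 'aba', 'ababbaa', 'abb'] as $subject) {
    // Each "b" must be followed by whatever the previous iteration matched.
    $results[$subject] = preg_match($pattern, $subject);
}

print_r($results); // aaa => 1, aba => 1, ababbaa => 1, abb => 0
```

<P
>   For "ababbaa" the iterations match "a", "ba", "bba" and "a" in turn; "abb" fails because the second "b" is not followed by the "a" captured by the first iteration.
  </P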
></DIV
><DIV
CLASS="refsect2"
><A
NAME="regexp.reference.assertions"
></A
><H3
>Assertions</H3
><P
>&#13;     An assertion is  a  test  on  the  characters  following  or
     preceding  the current matching point that does not actually
     consume any characters. The simple assertions coded  as  \b,
     \B,  \A,  \Z,  \z, ^ and $ are described above. More complicated
     assertions are coded as  subpatterns.  There  are  two
     kinds:  those that look ahead of the current position in the
     subject string, and those that look behind it.
   </P
><P
>&#13;     An assertion subpattern is matched in the normal way, except
     that  it  does not cause the current matching position to be
     changed. Lookahead assertions start with  (?=  for  positive
     assertions and (?! for negative assertions. For example,

       <TT
CLASS="literal"
>\w+(?=;)</TT
>

     matches a word followed by a semicolon, but does not include
     the semicolon in the match, and

       <TT
CLASS="literal"
>foo(?!bar)</TT
>

     matches any occurrence of "foo"  that  is  not  followed  by
     "bar". Note that the apparently similar pattern

       <TT
CLASS="literal"
>(?!foo)bar</TT
>

     does not find an occurrence of "bar"  that  is  preceded  by
     something other than "foo"; it finds any occurrence of "bar"
     whatsoever, because the assertion  (?!foo)  is  always  <TT
CLASS="constant"
><B
>TRUE</B
></TT
>
     when  the  next  three  characters  are  "bar". A lookbehind
     assertion is needed to achieve this effect.
   </P
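><P
>   The lookahead examples above can be sketched as follows. The subject strings and variable names are made up for the illustration.
  </P
>

```php
<?php
// Words that are followed by a semicolon; the semicolon is not part of the match.
preg_match_all('/\w+(?=;)/', 'alpha; beta gamma;', $words);

// Only the "foo" that is not followed by "bar" is reported, with its offset.
preg_match_all('/foo(?!bar)/', 'foobar foobaz', $foo, PREG_OFFSET_CAPTURE);

// (?!foo)bar still finds "bar" after "foo": the assertion looks ahead, not behind.
$misleading = preg_match('/(?!foo)bar/', 'foobar');

print_r($words[0]);        // alpha, gamma
echo $foo[0][0][1], "\n";  // 7
echo $misleading, "\n";    // 1
```

<P
>   The offset 7 is the start of "foobaz" in the second subject.
  </P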
><P
>&#13;     Lookbehind assertions start with (?&#60;=  for  positive  assertions
     and (?&#60;! for negative assertions. For example,

       <TT
CLASS="literal"
>(?&#60;!foo)bar</TT
>

     does find an occurrence of "bar" that  is  not  preceded  by
     "foo". The contents of a lookbehind assertion are restricted
     such that all the strings  it  matches  must  have  a  fixed
     length.  However, if there are several alternatives, they do
     not all have to have the same fixed length. Thus

       <TT
CLASS="literal"
>(?&#60;=bullock|donkey)</TT
>

     is permitted, but

       <TT
CLASS="literal"
>(?&#60;!dogs?|cats?)</TT
>

     causes an error at compile time. Branches  that  match  different
     length strings are permitted only at the top level of
     a lookbehind assertion. This is an extension  compared  with
     Perl  5.005,  which  requires all branches to match the same
     length of string. An assertion such as

       <TT
CLASS="literal"
>(?&#60;=ab(c|de))</TT
>

     is not permitted, because its single  top-level  branch  can
     match two different lengths, but it is acceptable if rewritten
     to use two top-level branches:

       <TT
CLASS="literal"
>(?&#60;=abc|abde)</TT
>

     The implementation of lookbehind  assertions  is,  for  each
     alternative,  to  temporarily move the current position back
     by the fixed width and then  try  to  match.  If  there  are
     insufficient  characters  before  the  current position, the
     match is deemed to fail.  Lookbehinds  in  conjunction  with
     once-only  subpatterns can be particularly useful for matching
     at the ends of strings; an example is given at  the  end
     of the section on once-only subpatterns.
   </P
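><P
>   A short sketch of the permitted lookbehind forms described above. The trailing " cart" and the subject strings are invented so that the assertions have something to anchor to.
  </P
>

```php
<?php
// Negative lookbehind: "bar" must not be preceded by "foo".
$afterFoo = preg_match('/(?<!foo)bar/', 'foobar');
$afterBaz = preg_match('/(?<!foo)bar/', 'bazbar');

// Top-level alternatives may have different (but each fixed) lengths.
$donkey = preg_match('/(?<=bullock|donkey) cart/', 'donkey cart');
$horse  = preg_match('/(?<=bullock|donkey) cart/', 'horse cart');

echo "$afterFoo $afterBaz $donkey $horse\n"; // 0 1 1 0
```

<P
>   The sketch sticks to patterns that the text above describes as permitted, since how the rejected forms are reported depends on the PCRE version in use.
  </P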
><P
>&#13;     Several assertions (of any sort) may  occur  in  succession.
     For example,

       <TT
CLASS="literal"
>(?&#60;=\d{3})(?&#60;!999)foo</TT
>

     matches "foo" preceded by three digits that are  not  "999".
     Notice  that each of the assertions is applied independently
     at the same point in the subject string. First  there  is  a
     check  that  the  previous  three characters are all digits,
     then there is a check that the same three characters are not
     "999". This pattern does not match "foo" preceded by six
     characters, the first three of which are digits and the last three
     of  which  are  not  "999".  For  example,  it doesn't match
     "123abcfoo". A pattern to do that is

      <TT
CLASS="literal"
>(?&#60;=\d{3}...)(?&#60;!999)foo</TT
>
   </P
><P
>&#13;     This time the first assertion looks  at  the  preceding  six
     characters,  checking  that  the first three are digits, and
     then the second assertion checks that  the  preceding  three
     characters are not "999".
   </P
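><P
>   The difference between the two patterns can be checked with a sketch like the following. The variable names and the extra subjects "123foo", "999foo" and "123999foo" are assumptions added for the illustration.
  </P
>

```php
<?php
$adjacent = '/(?<=\d{3})(?<!999)foo/';     // both assertions test the same three characters
$spaced   = '/(?<=\d{3}...)(?<!999)foo/';  // digits first, then any three characters

$a1 = preg_match($adjacent, '123foo');
$a2 = preg_match($adjacent, '999foo');
$a3 = preg_match($adjacent, '123abcfoo');

$s1 = preg_match($spaced, '123abcfoo');
$s2 = preg_match($spaced, '123999foo');

echo "$a1 $a2 $a3 | $s1 $s2\n"; // 1 0 0 | 1 0
```

<P
>   Each assertion is evaluated at the position just before "foo", which is why the first pattern rejects "123abcfoo".
  </P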
><P
>&#13;     Assertions can be nested in any combination. For example,

       <TT
CLASS="literal"
>(?&#60;=(?&#60;!foo)bar)baz</TT
>

     matches an occurrence of "baz" that  is  preceded  by  "bar"
     which in turn is not preceded by "foo", while

       <TT
CLASS="literal"
>(?&#60;=\d{3}...(?&#60;!999))foo</TT
>

     is another pattern which matches  "foo"  preceded  by  three
     digits and any three characters that are not "999".
   </P
><P
>&#13;     Assertion subpatterns are not capturing subpatterns, and may
     not  be  repeated,  because  it makes no sense to assert the
     same thing several times. If any kind of assertion  contains
     capturing  subpatterns  within it, these are counted for the
     purposes of numbering the capturing subpatterns in the whole
     pattern.   However,  substring capturing is carried out only
     for positive assertions, because it does not make sense  for
     negative assertions.
   </P
><P
>&#13;     Assertions count towards the maximum  of  200  parenthesized
     subpatterns.
   </P
></DIV
><DIV
CLASS="refsect2"
><A
NAME="regexp.reference.onlyonce"
></A
><H3
>Once-only subpatterns</H3
><P
>&#13;     With both maximizing and minimizing repetition,  failure  of
     what  follows  normally  causes  the repeated item to be
     re-evaluated to see if a different number of repeats allows the
     rest  of  the  pattern  to  match. Sometimes it is useful to
     prevent this, either to change the nature of the  match,  or
     to cause it to fail earlier than it otherwise might, when the
     author of the pattern knows there is no  point  in  carrying
     on.
   </P
><P
>&#13;     Consider, for example, the pattern \d+foo  when  applied  to
     the subject line

       <TT
CLASS="literal"
>123456bar</TT
>
   </P
><P
>&#13;     After matching all 6 digits and then failing to match "foo",
     the normal action of the matcher is to try again with only 5
     digits matching the \d+ item, and then with 4,  and  so  on,
     before ultimately failing. Once-only subpatterns provide the
     means for specifying that once a portion of the pattern  has
     matched,  it  is  not to be re-evaluated in this way, so the
     matcher would give up immediately on failing to match  "foo"
     the  first  time.  The  notation  is another kind of special
     parenthesis, starting with (?&#62; as in this example:

       <TT
CLASS="literal"
>(?&#62;\d+)foo</TT
>
   </P
><P
>&#13;     This kind of parenthesis "locks up" the  part of the pattern
     it  contains once it has matched, and a failure further into
     the pattern is prevented from backtracking  into  it.
     Backtracking  past  it to previous items, however, works as normal.
   </P
><P
>&#13;     An alternative description is that a subpattern of this type
     matches  the  string  of  characters that an identical standalone
     pattern would match, if anchored at the current point
     in the subject string.
   </P
><P
>&#13;     Once-only subpatterns are not capturing subpatterns.  Simple
     cases  such as the above example can be thought of as a maximizing
     repeat that must  swallow  everything  it  can.  So,
     while both \d+ and \d+? are prepared to adjust the number of
     digits they match in order to make the rest of  the  pattern
     match, (?&#62;\d+) can only match an entire sequence of digits.
   </P
><P
>&#13;     This construction can of course contain arbitrarily  complicated
     subpatterns, and it can be nested.
   </P
><P
>&#13;     Once-only subpatterns can be used in conjunction with
     look-behind  assertions  to specify efficient matching at the end
     of the subject string. Consider a simple pattern such as

       <TT
CLASS="literal"
>abcd$</TT
>

     when applied to a long string which does not match.  Because
     matching  proceeds  from  left  to right, PCRE will look for
     each "a" in the subject and then see if what follows matches
     the rest of the pattern. If the pattern is specified as

       <TT
CLASS="literal"
>^.*abcd$</TT
>

     then the initial .* matches the entire string at first,  but
     when  this  fails  (because  there  is no following "a"), it
     backtracks to match all but the last character, then all but
     the  last  two  characters, and so on. Once again the search
     for "a" covers the entire string, from right to left, so  we
     are no better off. However, if the pattern is written as

       <TT
CLASS="literal"
>^(?&#62;.*)(?&#60;=abcd)</TT
>

     then there can be no backtracking for the .*  item;  it  can
     match  only  the  entire  string.  The subsequent lookbehind
     assertion does a single test on the last four characters. If
     it  fails,  the  match  fails immediately. For long strings,
     this approach makes a significant difference to the processing time.
   </P
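><P
>   A minimal sketch of how a once-only subpattern changes what can match, and of the end-of-string idiom described above. The subjects and variable names are illustrative only.
  </P
>

```php
<?php
// \d+ can give back its last digit so that the literal 5 matches.
$backtracks = preg_match('/\d+5/', '12345');

// (?>\d+) swallows every digit and never gives one back, so the literal 5 cannot match.
$locked = preg_match('/(?>\d+)5/', '12345');

// The atomic .* takes the whole line; the lookbehind then tests only the last four characters.
$endsWith = preg_match('/^(?>.*)(?<=abcd)/', 'xxxxabcd');
$doesNot  = preg_match('/^(?>.*)(?<=abcd)/', 'abcdxxxx');

echo "$backtracks $locked $endsWith $doesNot\n"; // 1 0 1 0
```

<P
>   The sketch shows only the change in matching behaviour; it does not measure the difference in processing time.
  </P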
><P
>&#13;     When a pattern contains an unlimited repeat inside a subpattern
     that can itself be repeated an unlimited number of
     times, the use of a once-only subpattern is the only way  to
     avoid  some  failing matches taking a very long time indeed.
     The pattern

       <TT
CLASS="literal"
>(\D+|&#60;\d+&#62;)*[!?]</TT
>

     matches an unlimited number of substrings that  either  consist
     of  non-digits,  or digits enclosed in &#60;&#62;, followed by
     either ! or ?. When it matches, it runs quickly. However, if
     it is applied to

       <TT
CLASS="literal"
>aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa</TT
>

     it takes a long  time  before  reporting  failure.  This  is
     because the string can be divided between the two repeats in
     a large number of ways, and all have to be tried. (The example
     used  [!?]  rather  than a single character at the end,
     because both PCRE and Perl have an optimization that  allows
     for  fast  failure  when  a  single  character is used. They
     remember the last single character that is  required  for  a
     match,  and  fail early if it is not present in the string.)
     If the pattern is changed to

       <TT
CLASS="literal"
>((?&#62;\D+)|&#60;\d+&#62;)*[!?]</TT
>

     sequences of non-digits cannot be broken, and  failure  happens quickly.
   </P
></DIV
><DIV
CLASS="refsect2"
><A
NAME="regexp.reference.conditional"
></A
><H3
>Conditional subpatterns</H3
><P
>&#13;     It is possible to cause the matching process to obey a  subpattern
     conditionally  or to choose between two alternative
     subpatterns, depending on the result  of  an  assertion,  or
     whether  a previous capturing subpattern matched or not. The
     two possible forms of conditional subpattern are
   </P
><P
CLASS="literallayout"
><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(?(condition)yes-pattern)<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(?(condition)yes-pattern|no-pattern)<br>
&nbsp;&nbsp;&nbsp;</P
><P
>&#13;     If the condition is satisfied, the yes-pattern is used; otherwise
     the  no-pattern  (if  present) is used. If there are
     more than two alternatives in the subpattern, a compile-time
     error occurs.
   </P
><P
>&#13;    There are three kinds of condition. If the text between the
     parentheses  consists  of  a  sequence  of  digits, then the
     condition is satisfied if the capturing subpattern  of  that
     number  has  previously matched. Consider the following pattern,
     which contains non-significant white space to make  it
     more  readable  (assume  the  <A
HREF="reference.pcre.pattern.modifiers.html"
>PCRE_EXTENDED</A
>   option)  and to
     divide it into three parts for ease of discussion:

       <TT
CLASS="literal"
>( \( )?    [^()]+    (?(1) \) )</TT
>
   </P
><P
>&#13;     The first part matches an optional opening parenthesis,  and
     if  that character is present, sets it as the first captured
     substring. The second part matches one  or  more  characters
     that  are  not  parentheses. The third part is a conditional
     subpattern that tests whether the first set  of  parentheses
     matched or not. If they did, that is, if the subject started
     with an opening parenthesis, the condition is <TT
CLASS="constant"
><B
>TRUE</B
></TT
>,  and  so
     the  yes-pattern  is  executed  and a closing parenthesis is
     required. Otherwise, since no-pattern is  not  present,  the
     subpattern  matches  nothing.  In  other words, this pattern
     matches a sequence of non-parentheses,  optionally  enclosed
     in parentheses.
   </P
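><P
>   The pattern above can be tried in PHP roughly as follows. The anchors ^ and $ are an addition made for this sketch so that unbalanced subjects are rejected as a whole, and the x modifier stands in for PCRE_EXTENDED.
  </P
>

```php
<?php
$pattern = '/^ ( \( )? [^()]+ (?(1) \) ) $/x';

$results = [];
foreach (['abc', '(abc)', '(abc', 'abc)'] as $subject) {
    // A closing parenthesis is required only if group 1 matched an opening one.
    $results[$subject] = preg_match($pattern, $subject);
}

print_r($results); // abc => 1, (abc) => 1, (abc => 0, abc) => 0
```

<P
>   Without the anchors, the last two subjects would still match on the "abc" part alone.
  </P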
><P
>&#13;     If the condition is the string <TT
CLASS="literal"
>(R)</TT
>, it is satisfied if
     a recursive call to the pattern or subpattern has been made. At "top
     level", the condition is false.
   </P
><P
>&#13;     If the condition is not a sequence of digits or (R), it must be  an
     assertion.  This  may be a positive or negative lookahead or
     lookbehind assertion. Consider this pattern, again  containing
     non-significant  white space, and with the two alternatives on
     the second line:
   </P
><P
CLASS="literallayout"
><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(?(?=[^a-z]*[a-z])<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;\d{2}-[a-z]{3}-\d{2}&nbsp;&nbsp;|&nbsp;&nbsp;\d{2}-\d{2}-\d{2}&nbsp;)<br>
&nbsp;&nbsp;&nbsp;</P
><P
>&#13;     The condition is a positive lookahead assertion that matches
     an optional sequence of non-letters followed by a letter. In
     other words, it tests for  the  presence  of  at  least  one
     letter  in the subject. If a letter is found, the subject is
     matched against  the  first  alternative;  otherwise  it  is
     matched  against the second. This pattern matches strings in
     one of the two forms dd-aaa-dd or dd-dd-dd,  where  aaa  are
     letters and dd are digits.
   </P
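><P
>   A sketch of the assertion-based condition above, written on one line and anchored with ^ and $ for the purposes of the illustration. The test subjects are invented.
  </P
>

```php
<?php
$pattern = '/^(?(?=[^a-z]*[a-z])\d{2}-[a-z]{3}-\d{2}|\d{2}-\d{2}-\d{2})$/';

$results = [];
foreach (['12-abc-34', '12-34-56', '12-345-67', 'ab-cde-fg'] as $subject) {
    // A letter anywhere ahead selects the first alternative, otherwise the second.
    $results[$subject] = preg_match($pattern, $subject);
}

print_r($results); // 12-abc-34 => 1, 12-34-56 => 1, 12-345-67 => 0, ab-cde-fg => 0
```

<P
>   The last subject contains letters, so only the first alternative is tried, and it fails on the leading digits.
  </P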
></DIV
><DIV
CLASS="refsect2"
><A
NAME="regexp.reference.comments"
></A
><H3
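>Comments</H3
><P
>    The sequence (?# marks the start of a comment, which continues up to the next closing parenthesis. Nested parentheses are not permitted. The characters that make up a comment play no part in the pattern matching at all.
   </P
><P
>    If the <A
HREF="reference.pcre.pattern.modifiers.html"
>PCRE_EXTENDED</A
>
    option is set, an unescaped # character outside a character class introduces a comment that continues up to the next newline character in the pattern.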
   </P
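><P
>   Both comment styles can be sketched in PHP as follows. The date-like pattern and the variable names are invented for the example, and the x modifier is used as PHP's spelling of PCRE_EXTENDED.
  </P
>

```php
<?php
// An inline comment: everything from (?# to the next ) is ignored.
$inline = preg_match('/ab(?# this text is ignored)c/', 'abc');

// With the x modifier, whitespace is ignored and # starts a comment up to the end of the line.
$pattern = '/
    \d{4}   # year
    -       # separator
    \d{2}   # month
/x';
$extended = preg_match($pattern, '2008-01');

echo "$inline $extended\n"; // 1 1
```

<P
>   Note that the comments must not contain the pattern delimiter, here the slash.
  </P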
></DIV
><DIV
CLASS="refsect2"
><A
NAME="regexp.reference.recursive"
></A
><H3
>Recursive patterns</H3
><P
>&#13;     Consider the problem of matching a  string  in  parentheses,
     allowing  for  unlimited nested parentheses. Without the use
     of recursion, the best that can be done is to use a  pattern
     that  matches  up  to some fixed depth of nesting. It is not
     possible to handle an arbitrary nesting depth. Perl 5.6  has
     provided   an  experimental  facility  that  allows  regular
     expressions to recurse (among other things).  The  special
     item (?R) is  provided for  the specific  case of recursion.
     This PCRE  pattern  solves the  parentheses  problem (assume
     the <A
HREF="reference.pcre.pattern.modifiers.html"
>PCRE_EXTENDED</A
>
     option is set so that white space is
     ignored):

       <TT
CLASS="literal"
>\( ( (?&#62;[^()]+) | (?R) )* \)</TT
>
   </P
><P
>&#13;     First it matches an opening parenthesis. Then it matches any
     number  of substrings which can either be a sequence of
     non-parentheses, or a recursive  match  of  the  pattern  itself
     (i.e. a correctly parenthesized substring). Finally there is
     a closing parenthesis.
   </P
><P
>&#13;     This particular example pattern  contains  nested  unlimited
     repeats, and so the use of a once-only subpattern for matching
     strings of non-parentheses is  important  when  applying
     the  pattern to strings that do not match. For example, when
     it is applied to

       <TT
CLASS="literal"
>(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()</TT
>

     it yields "no match" quickly. However, if a  once-only  subpattern
     is  not  used,  the match runs for a very long time
     indeed because there are so many different ways the + and  *
     repeats  can carve up the subject, and all have to be tested
     before failure can be reported.
   </P
><P
>&#13;     The values set for any capturing subpatterns are those  from
     the outermost level of the recursion at which the subpattern
     value is set. If the pattern above is matched against

       <TT
CLASS="literal"
>(ab(cd)ef)</TT
>

     the value for the capturing parentheses is  "ef",  which  is
     the  last  value  taken  on  at the top level. If additional
     parentheses are added, giving

       <TT
CLASS="literal"
>\( ( ( (?&#62;[^()]+) | (?R) )* ) \)</TT
>
     then the string they capture
     is "ab(cd)ef", the contents of the top level parentheses. If
     there are more than 15 capturing parentheses in  a  pattern,
     PCRE  has  to  obtain  extra  memory  to store data during a
     recursion, which it does by using  pcre_malloc,  freeing  it
     via  pcre_free  afterwards. If no memory can be obtained, it
     saves data for the first 15 capturing parentheses  only,  as
     there is no way to give an out-of-memory error from within a
     recursion.
   </P
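><P
>   The capture behaviour described above can be observed with a sketch like this one. The array names are arbitrary, and the x modifier is used so that the white space in the patterns is ignored.
  </P
>

```php
<?php
$subject = '(ab(cd)ef)';

// One capturing group around the alternatives: it keeps the last top-level iteration.
preg_match('/\( ( (?>[^()]+) | (?R) )* \)/x', $subject, $single);

// An extra pair of parentheses around the whole repeat captures the full contents.
preg_match('/\( ( ( (?>[^()]+) | (?R) )* ) \)/x', $subject, $wrapped);

echo $single[0], "\n";   // (ab(cd)ef)
echo $single[1], "\n";   // ef
echo $wrapped[1], "\n";  // ab(cd)ef
```

<P
>   Values captured inside the recursive call for "(cd)" do not survive at the top level, which is why "cd" appears in neither result on its own.
  </P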
><P
>&#13;      Since PHP 4.3.3, <TT
CLASS="literal"
>(?1)</TT
>, <TT
CLASS="literal"
>(?2)</TT
> and so on can be used
      for recursive subpatterns too. It is also possible to use named
      subpatterns: <TT
CLASS="literal"
>(?P&#62;foo)</TT
>.
   </P
><P
>&#13;      If the syntax for a recursive subpattern reference (either by number or
      by name) is used outside the parentheses to which it refers, it operates
      like a subroutine in a programming language. An earlier example
      pointed out that the pattern
      <TT
CLASS="literal"
>(sens|respons)e and \1ibility</TT
>
      matches "sense and sensibility" and "response and responsibility", but
      not "sense and responsibility". If instead the pattern
      <TT
CLASS="literal"
>(sens|respons)e and (?1)ibility</TT
>
      is used, it does match "sense and responsibility" as well as the other
      two strings. Such references must, however, follow the subpattern to
      which they refer.
   </P
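><P
>   A brief sketch contrasting a back reference with a subroutine-style reference. The anchors and the group name "stem" are hypothetical additions for this example.
  </P
>

```php
<?php
$subject = 'sense and responsibility';

// \1 must repeat the text "sens" that group 1 matched, so this fails.
$backref = preg_match('/^(sens|respons)e and \1ibility$/', $subject);

// (?1) re-runs the subpattern itself, so either alternative may match the second time.
$subroutine = preg_match('/^(sens|respons)e and (?1)ibility$/', $subject);

// The same idea with a named subpattern.
$named = preg_match('/^(?P<stem>sens|respons)e and (?P>stem)ibility$/', 'response and sensibility');

echo "$backref $subroutine $named\n"; // 0 1 1
```

<P
>   In each pattern the reference follows the subpattern it refers to, as the text above requires.
  </P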
></DIV
><DIV
CLASS="refsect2"
><A
NAME="regexp.reference.performances"
></A
><H3
>Performance</H3
><P
>&#13;     Certain items that may appear in patterns are more efficient
     than  others.  It is more efficient to use a character class
     like [aeiou] than a set of alternatives such as (a|e|i|o|u).
     In  general,  the  simplest  construction  that provides the
     required behaviour is usually the  most  efficient.  Jeffrey
     Friedl's  book contains a lot of discussion about optimizing
     regular expressions for efficient performance.
   </P
><P
>&#13;     When a pattern begins with .* and the <A
HREF="reference.pcre.pattern.modifiers.html"
>PCRE_DOTALL</A
>  option  is
     set,  the  pattern  is implicitly anchored by PCRE, since it
     can match only at the start of a subject string. However, if
     <A
HREF="reference.pcre.pattern.modifiers.html"
>PCRE_DOTALL</A
>   is not set, PCRE cannot make this optimization,
     because the . metacharacter does not then match  a  newline,
     and if the subject string contains newlines, the pattern may
     match from the character immediately following one  of  them
     instead of from the very start. For example, the pattern

       <TT
CLASS="literal"
>(.*) second</TT
>

     matches the subject "first\nand second" (where \n stands for
     a newline character) with the first captured substring being
     "and". In order to do this, PCRE  has  to  retry  the  match
     starting after every newline in the subject.
   </P
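><P
>   The effect of PCRE_DOTALL on this example can be sketched as follows, using the s modifier as PHP's spelling of that option. The array names are arbitrary.
  </P
>

```php
<?php
$subject = "first\nand second";

// Without DOTALL the dot stops at the newline, so the match starts after it.
preg_match('/(.*) second/', $subject, $plain);

// With DOTALL (the s modifier) the dot crosses the newline and the match starts at offset 0.
preg_match('/(.*) second/s', $subject, $dotall);

echo $plain[1], "\n";                     // and
var_dump($dotall[1] === "first\nand");    // bool(true)
```

<P
>   The sketch shows only where the match starts in each case; it makes no claim about how much time either call takes.
  </P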
><P
>&#13;     If you are using such a pattern with subject strings that do
     not  contain  newlines,  the best performance is obtained by
     setting <A
HREF="reference.pcre.pattern.modifiers.html"
>PCRE_DOTALL</A
>, or starting the  pattern  with  ^.*  to
     indicate  explicit anchoring. That saves PCRE from having to
     scan along the subject looking for a newline to restart at.
   </P
><P
>&#13;     Beware of patterns that contain nested  indefinite  repeats.
     These  can  take a long time to run when applied to a string
     that does not match. Consider the pattern fragment

       <TT
CLASS="literal"
>(a+)*</TT
>
   </P
><P
>&#13;     This can match "aaaa" in 16 different ways, and this number
     increases  very  rapidly  as  the string gets longer. (The *
     repeat can match 0, 1, 2, 3, or 4 times,  and  for  each  of
     those  cases other than 0, the + repeats can match different
     numbers of times.) When the remainder of the pattern is such
     that  the entire match is going to fail, PCRE has in principle
     to try every possible variation, and this  can  take  an
     extremely long time.
   </P
><P
>&#13;     An optimization catches some of the more simple  cases  such
     as

       <TT
CLASS="literal"
>(a+)*b</TT
>

     where a literal character follows. Before embarking  on  the
     standard matching procedure, PCRE checks that there is a "b"
     later in the subject string, and if there is not,  it  fails
     the  match  immediately. However, when there is no following
     literal this optimization cannot be used. You  can  see  the
     difference by comparing the behaviour of

       <TT
CLASS="literal"
>(a+)*\d</TT
>

     with the pattern above. The pattern above, which ends in the
     literal "b", fails almost instantly when applied to a whole line
     of "a" characters, whereas the \d variant takes an appreciable
     time with strings longer than about 20 characters.
   </P
></DIV
></DIV
></BODY
></HTML
>