Regular Expressions

2015-09-19

1. Rules

Metacharacters

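Code/Syntax	Description
\	Escape character; for example, \. stands for a literal dot
#	Comment; in comments mode (for example Pattern.COMMENTS in Java), text from # to the end of the line is ignored
()	Marks the start and end of a subexpression
|	Alternation; the alternatives are tried from left to right, and the match succeeds if any one of them matches
\num	Backreference, where num is a positive integer; for example, \1 matches the same text that capturing group 1 matched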

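.	Matches any single character except line terminators (such as \n and \r)
[-]	Matches a character range inside brackets, e.g. [a-z]
[…]	Matches any one of the characters listed in []
\w	word: matches a letter, digit, or underscore; equivalent to [A-Za-z0-9_]
\s	space: matches a whitespace character, including line breaks; equivalent to [ \f\n\r\t\v]
\d	digit: matches a digit character; equivalent to [0-9]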

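[^…]	Matches any character not listed in []
\W	Matches any non-word character; equivalent to [^A-Za-z0-9_]
\S	Matches any non-whitespace character; equivalent to [^ \f\n\r\t\v]
\D	Matches any non-digit character; equivalent to [^0-9]
[\s\S]	Matches any character at all, including line breaks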
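
The character classes above can be checked with a small sketch. The class name CharClassSketch and the helper full are hypothetical names introduced only for this illustration; Pattern.matches is the standard java.util.regex call and requires the whole input to match.

```java
import java.util.regex.Pattern;

public class CharClassSketch {
    // Hypothetical helper: true when the entire input matches the regex.
    static boolean full(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(full("\\w+", "abc_123"));   // true: letters, digits, underscore
        System.out.println(full("\\s", " "));          // true: a space is whitespace
        System.out.println(full("\\D+", "abc"));       // true: no digits
        System.out.println(full(".", "\n"));           // false: . skips line terminators by default
        System.out.println(full("[\\s\\S]", "\n"));    // true: the class covers every character
        System.out.println(full("[^a-z]", "A"));       // true: A is not in a-z
    }
}
```

Note that [\s\S] needs the brackets: without them, \s\S would require a whitespace character followed by a non-whitespace character.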

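Boundaries

Code/Syntax	Description
^	Matches the start of a line (in Java, only the start of the input unless multiline mode is enabled)
$	Matches the end of a line (in Java, only the end of the input unless multiline mode is enabled)
\b	Matches a word boundary (start or end of a word)
\B	Matches a position that is not a word boundary
\A	Matches the start of the string
\Z	Matches the end of the string, before a final line terminator if there is one (in Java, \z matches the absolute end)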
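
A minimal sketch of how these anchors behave in Java, assuming the default flags unless Pattern.MULTILINE is passed. The class name AnchorSketch and the helper count are hypothetical and exist only for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnchorSketch {
    // Hypothetical helper: counts how many times the regex is found in the input.
    static int count(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String text = "pro one\npro two\n";
        System.out.println(count("^pro", 0, text));                   // 1: start of input only
        System.out.println(count("^pro", Pattern.MULTILINE, text));   // 2: start of every line
        System.out.println(count("\\Apro", Pattern.MULTILINE, text)); // 1: \A ignores multiline mode
        System.out.println(count("two\\Z", 0, text));                 // 1: \Z allows a final line terminator
        System.out.println(count("two\\z", 0, text));                 // 0: \z is the absolute end
        System.out.println(count("\\bpro\\b", 0, text));              // 2: whole word "pro"
    }
}
```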

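Quantifiers

When a regular expression contains a quantifier that allows repetition, it normally matches as many characters as possible; this is the default greedy mode.
Sometimes we need the opposite, matching as few characters as possible; this is lazy matching. Appending a question mark ? to a repeating quantifier switches it to lazy mode.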

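Code/Syntax	Description
*	Repeats zero or more times, greedy
+	Repeats one or more times, greedy
?	Repeats zero or one time, greedy
{n}	Repeats exactly n times
{n,}	Repeats n or more times, greedy
{n,m}	Repeats n to m times, greedy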

*?	Repeats zero or more times, lazy
+?	Repeats one or more times, lazy
??	Repeats zero or one time, lazy
{n,m}?	Repeats n to m times, lazy
{n,}?	Repeats n or more times, lazy
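
The difference between the two tables can be seen by asking for the first match of a greedy quantifier and of its lazy counterpart on the same input. This is a small sketch; QuantifierSketch and firstMatch are hypothetical names used only here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuantifierSketch {
    // Hypothetical helper: returns the first substring the regex finds, or null.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("a{2,4}", "aaaaa"));   // aaaa: greedy takes the maximum
        System.out.println(firstMatch("a{2,4}?", "aaaaa"));  // aa: lazy takes the minimum
        System.out.println(firstMatch("a+", "aaaaa"));       // aaaaa
        System.out.println(firstMatch("a+?", "aaaaa"));      // a
        System.out.println(firstMatch("\\[.*\\]", "[x] and [y]"));   // [x] and [y]
        System.out.println(firstMatch("\\[.*?\\]", "[x] and [y]"));  // [x]
    }
}
```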

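Groups

Code/Syntax	Description
(re)	Capturing group: matches the expression in parentheses and captures the matched text
(?:re)	Non-capturing group: matches the expression in parentheses but does not capture or store the result
(?=re)	Positive lookahead: the position must be followed by re
(?!re)	Negative lookahead: the position must not be followed by re
(?<=re)	Positive lookbehind: the position must be preceded by re
(?<!re)	Negative lookbehind: the position must not be preceded by re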
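
As a rough illustration of the table above, the sketch below counts capturing groups and applies each kind of lookaround in a replacement. GroupSketch, groupCount, and replace are hypothetical names; the sample strings are made up for the example.

```java
import java.util.regex.Pattern;

public class GroupSketch {
    // Hypothetical helper: number of capturing groups declared by the regex.
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    // Hypothetical helper: replaces every match of the regex with "N".
    static String replace(String regex, String input) {
        return input.replaceAll(regex, "N");
    }

    public static void main(String[] args) {
        System.out.println(groupCount("(\\d+)-(\\d+)"));    // 2: both groups capture
        System.out.println(groupCount("(?:\\d+)-(\\d+)"));  // 1: (?:...) is not counted

        // Lookarounds check the surrounding text without consuming it.
        System.out.println(replace("\\d+(?=USD)", "100USD, 200EUR"));  // NUSD, 200EUR
        System.out.println(replace("(?<=\\$)\\d+", "$15 and 20"));     // $N and 20
        System.out.println(replace("(?<!\\$)\\b\\d+", "$15 and 20"));  // $15 and N
    }
}
```

In each replacement the text inside the lookaround (USD or $) stays in the output, because a lookaround only tests a position.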

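How the Regex Engine Matches

Regex engines fall broadly into two classes: DFA (deterministic finite automaton) and NFA (non-deterministic finite automaton). A DFA engine is text-directed: it walks through the input string and checks each character against the regular expression, so it never backtracks and matches quickly, but it does not support capturing groups. Most languages therefore use an NFA engine, which is regex-directed: it walks through the regular expression and tries to match each part against the input string.
The string "abc" has 3 characters and 4 positions (before a, between a and b, between b and c, and after c). A subexpression that matches character content is said to consume characters; a subexpression that matches only a position is zero-width.
A zero-width expression starts and ends its match at the same position, whereas an expression that consumes characters ends at a later position than it starts. For the expression as a whole, the engine usually begins trying at position 0 of the string; when an attempt fails, the engine moves the starting position forward and tries again.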
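
The idea of characters versus positions can be made visible with a short sketch: the empty pattern matches at every position, a zero-width assertion reports the same start and end index, and a consuming pattern reports different ones. PositionSketch, markPositions, and spans are hypothetical names introduced for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PositionSketch {
    // Inserts a marker at every position where the empty pattern matches.
    static String markPositions(String s) {
        return s.replaceAll("", "|");
    }

    // Returns "start-end" for each match, separated by spaces.
    static String spans(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        StringBuilder sb = new StringBuilder();
        while (m.find()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(m.start()).append('-').append(m.end());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(markPositions("abc")); // |a|b|c|  (3 characters, 4 positions)
        System.out.println(spans("\\b", "abc"));  // 0-0 3-3  (zero-width: start equals end)
        System.out.println(spans("b", "abc"));    // 1-2      (consumes one character)
        System.out.println(spans("c", "abc"));    // 2-3      (found after failing at positions 0 and 1)
    }
}
```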

2. Java Version

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexBasic {
    public static void main(String[] args) {

        String s1 ="321";
        System.out.println(s1.matches("0|[1-9]\\d*")); // true: non-negative integer without leading zeros
        System.out.println(s1.matches("\\d{3,}")); // true: at least 3 digits

        String s2 = "11.1";
        System.out.println(s2.matches("(-?\\d+)(\\.\\d+)?")); // true: number with optional sign and fraction
        System.out.println(s2.matches("[0-9]+(\\.[0-9]{2})?")); // false: a fraction, if present, must have exactly two digits

        String s3 = "汉字";
        System.out.println(s3.matches("[\\u4e00-\\u9fa5]+")); // true: Chinese characters

        String s4 = "_xxx123";
        System.out.println(s4.matches("\\w{3,20}")); // true: 3 to 20 word characters

        String email ="iherr@163.com";
        System.out.println(email.matches("\\w+@\\w+\\.\\w+")); // true: simplified email check

        String ip="49.0.0.1";
        System.out.println(ip.matches("(25[0-5]|2[0-4]\\d|[01]?\\d?\\d)(\\.(25[0-5]|2[0-4]\\d|[01]?\\d?\\d)){3}")); // true: IPv4 address, each part 0-255

        String phone="+86-13888888888";
        System.out.println(phone.matches("[\\+]?\\d{2,3}[-\\s]?1[3-9]\\d{9}")); // true: country code, optional separator, 11-digit mobile number

        // Capture the area code and the number separately
        Pattern p = Pattern.compile("(\\d{3,4})-(\\d{7,8})");

        Matcher m = p.matcher("010-66666666");
        if (m.matches()) {
            String g1 = m.group(1);
            String g2 = m.group(2);
            System.out.println(g1); // 010
            System.out.println(g2); // 66666666
        } else {
            System.out.println("no match");
        }

        // Greedy matching
        Pattern p2 = Pattern.compile("<(.+)>");
        Matcher m2 = p2.matcher("<h2>test</h2>");
        while (m2.find()){
            System.out.println(m2.group(1)); // h2>test</h2
            System.out.println(m2.start()); // 0
            System.out.println(m2.end()); // 13
        }

        // Lazy matching
        Pattern p3 = Pattern.compile("<(.*?)>");
        Matcher m3 = p3.matcher("<h2>test</h2>");
        while (m3.find()){
            System.out.println(m3.group(1)); // h2, then /h2
        }

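        // (re): matches the expression in parentheses and captures the match
        Pattern p4 = Pattern.compile("<h2>(.*)-(he.*)</h2>");
        Matcher m4 = p4.matcher("<h2>test-herr</h2>");
        while (m4.find()){
            System.out.println(m4.group(1)); // test
            System.out.println(m4.group(2)); // herr
        }

        // (?:re): matches the expression in parentheses but does not capture or store the result
        Pattern p5 = Pattern.compile("<h2>(.*)-(?:he.*)</h2>");
        Matcher m5 = p5.matcher("<h2>test-herr</h2>");
        while (m5.find()){
            System.out.println(m5.group(1)); // test
            //System.out.println(m5.group(2)); // throws IndexOutOfBoundsException: there is no group 2
        }

        // \1 and \2: backreferences to the text captured by groups 1 and 2
        Pattern p6 = Pattern.compile("(\\d+)([a-z]+)\\1\\2");
        Matcher m6 = p6.matcher("123abc123ab");
        // No match here, so nothing is printed: after "123abc123" the regex still needs
        // the group 2 text again, and no split of the input satisfies both backreferences.
        // The input "123abc123abc" would match and print 123 and abc.
        while (m6.find()){
            System.out.println(m6.group(1));
            System.out.println(m6.group(2));
        }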

        String str = "hello herr";
        System.out.println(str.replaceFirst("\\w+" , "■"));    // prints ■ herr: greedy mode matches as many characters as possible
        System.out.println(str.replaceFirst("\\w+?" , "■"));    // prints ■ello herr: lazy mode matches as few characters as possible

        // "Lookahead" and "lookbehind" refer to the engine's matching order: from the start of the string toward the end is "ahead"
        System.out.println(str.replaceFirst("el(?=l)" , "■"));    // prints h■lo herr: positive lookahead, el is followed by l
        System.out.println(str.replaceFirst("el(?=m)" , "■"));    // prints hello herr: el is not followed by m, so nothing matches
        System.out.println(str.replaceFirst("el(?!m)" , "■"));    // prints h■lo herr: negative lookahead, el is not followed by m
        System.out.println(str.replaceFirst("(?<=h)el" , "■"));    // prints h■lo herr: positive lookbehind, el is preceded by h
        System.out.println(str.replaceFirst("(?<!a|b|c)el" , "■"));    // prints h■lo herr: negative lookbehind, el is not preceded by a, b, or c

        // Boundaries
        String str2="process macpro";
        System.out.println(str2.replaceAll("(pro)" , "■")); // ■cess mac■
        System.out.println(str2.replaceAll("(pro\\b)" , "■")); // process mac■
        System.out.println(str2.replaceAll("(^pro)" , "■")); // ■cess macpro
        System.out.println(str2.replaceAll("(pro$)" , "■")); // process mac■
    }
}