Tuesday 13 January 2015

Difference between (.*) and (.*?) in Regular Expressions


Quantifiers

  •  (.)   -  matches any single character except newline.
  •  (*)  -  between zero and unlimited times, as many times as possible, giving back as needed  [greedy]
  •  (.*) -  matches zero or more characters (except newline), as many as possible
  • (.*?) - matches zero or more characters (except newline), as few as possible
  • (*?)  - between zero and unlimited times, as few times as possible, expanding as needed [lazy]
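The definitions above can be checked with a small sketch. It assumes Python's built-in re module with default flags (so "." does not match a newline); the variable names are only illustrative.

```python
import re

# "." matches any single character except a newline (with default flags).
dot_on_newline = re.fullmatch(r".", "\n")   # no match, so this is None
dot_on_letter = re.fullmatch(r".", "a")     # a match object

# ".*" may match zero characters, so it also matches the empty string.
star_on_empty = re.fullmatch(r".*", "")     # a match object

# Same input, same start position: greedy takes as much as it can,
# lazy takes as little as it can (here, nothing at all).
greedy_text = re.match(r".*", "abc").group(0)
lazy_text = re.match(r".*?", "abc").group(0)

print(repr(greedy_text))  # 'abc'
print(repr(lazy_text))    # ''
```

On its own, with nothing after it in the pattern, (.*?) is satisfied by an empty match; it only expands when the rest of the pattern forces it to.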



Example:

Test String: <xyz>vikram</xyz><abc>ffasdfsaf</abc><xyz>awdfafsd</xyz>safasf

1. Regular Expression: <xyz>(.*)<\/xyz>
    Output: "vikram</xyz><abc>ffasdfsaf</abc><xyz>awdfafsd"

2. Regular Expression: <xyz>(.*?)<\/xyz>
    Output: "vikram"

In the first regular expression, (.*) is greedy: it first consumes everything up to the end of the Test String, and then the regex engine backtracks from the end, one character at a time, until the right boundary </xyz> can match. The match therefore ends at the last </xyz> in the string.

In the second regular expression, (.*?) is lazy: it starts by matching nothing and checks for the right boundary. If the boundary is not there, it expands by one character and checks again, repeating until the right boundary matches. The match therefore ends at the first </xyz> after the opening tag.
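The two outputs above can be reproduced with a short sketch. It assumes Python's re module; the patterns and Test String are taken from the example, while the variable names are only illustrative.

```python
import re

test_string = "<xyz>vikram</xyz><abc>ffasdfsaf</abc><xyz>awdfafsd</xyz>safasf"

# Greedy: the group runs up to the LAST </xyz> in the string.
greedy = re.search(r"<xyz>(.*)<\/xyz>", test_string)

# Lazy: the group stops at the FIRST </xyz> after the opening tag.
lazy = re.search(r"<xyz>(.*?)<\/xyz>", test_string)

print(greedy.group(1))  # vikram</xyz><abc>ffasdfsaf</abc><xyz>awdfafsd
print(lazy.group(1))    # vikram
```

The backslash before "/" is kept as in the example; it is needed in flavors that use "/" as the pattern delimiter and is harmless in Python.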

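Use (.*?) instead of (.*) when you want to extract only the text up to the nearest right boundary. Lazy matching is not always faster, but it returns the shortest match, which is usually what is wanted when pulling values out from between delimiters.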
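As a rough illustration of extraction, the sketch below collects every <xyz> value from the Test String. It assumes Python's re.findall, which returns the captured group for each non-overlapping match; the variable names are hypothetical.

```python
import re

test_string = "<xyz>vikram</xyz><abc>ffasdfsaf</abc><xyz>awdfafsd</xyz>safasf"

# Lazy: each <xyz>...</xyz> pair is matched separately.
lazy_values = re.findall(r"<xyz>(.*?)<\/xyz>", test_string)

# Greedy: one match swallows everything between the first <xyz>
# and the last </xyz>, so only a single item comes back.
greedy_values = re.findall(r"<xyz>(.*)<\/xyz>", test_string)

print(lazy_values)         # ['vikram', 'awdfafsd']
print(len(greedy_values))  # 1
```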

