How Regex Works in Python in Detail


Regular expressions (regex):

Regular expressions (regex) are a powerful tool for matching patterns in text. In Python, the re module provides support for working with them. Below is a detailed explanation of how regex works in Python:


1. Regex Patterns

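Regex patterns are usually written as raw strings (r'...') so that backslashes reach the regex engine unchanged instead of being interpreted as Python string escapes. For example, r'\d' is the two characters \ and d.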

a) Basic Patterns

. : Matches any character except a newline.

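\d : Matches any digit. For str patterns this includes all Unicode decimal digits by default; with re.ASCII it matches only 0-9.

\D : Matches any non-digit.

\w : Matches any word character: letters, digits, and underscore. For str patterns this includes Unicode letters by default; with re.ASCII it matches only a-z, A-Z, 0-9, and _.

\W : Matches any non-word character.

\s : Matches any whitespace character (space, tab, newline, carriage return, form feed, vertical tab).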

\S : Matches any non-whitespace character.
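
As a small illustration of the basic patterns above, the sketch below runs a few of them over a made-up sample string (the text and variable names are only for demonstration):

```python
import re

text = "Order 66 shipped\ton 2024-05-01"

digits = re.findall(r'\d', text)              # every single digit
words = re.findall(r'\w+', text)              # runs of word characters
spaces = re.findall(r'\s', text)              # each whitespace character
first_three = re.match(r'...', text).group()  # any three characters

print(digits)       # ['6', '6', '2', '0', '2', '4', '0', '5', '0', '1']
print(words)        # ['Order', '66', 'shipped', 'on', '2024', '05', '01']
print(spaces)       # [' ', ' ', '\t', ' ']
print(first_three)  # Ord
```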

b) Quantifiers

* : Matches 0 or more occurrences.

+ : Matches 1 or more occurrences.

? : Matches 0 or 1 occurrence.

{n} : Matches exactly n occurrences.

{n,} : Matches n or more occurrences.

{n,m} : Matches between n and m occurrences.
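
To see how the quantifiers differ, the following sketch tests each one against the same handful of sample strings. The samples are invented for this example, and re.fullmatch() is used so that the whole string has to fit the pattern:

```python
import re

samples = ['ac', 'abc', 'abbc', 'abbbbc']
patterns = [r'ab*c', r'ab+c', r'ab?c', r'ab{2}c', r'ab{2,}c', r'ab{1,2}c']

results = {}
for pattern in patterns:
    # fullmatch() succeeds only if the whole sample fits the pattern.
    results[pattern] = [s for s in samples if re.fullmatch(pattern, s)]

for pattern, matched in results.items():
    print(f"{pattern:10} {matched}")
```

Each quantifier applies only to the item directly before it, here the single character b.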

c) Anchors

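^ : Matches the start of the string (and, with re.MULTILINE, the start of each line).

$ : Matches the end of the string, or just before a trailing newline at the end of the string (and, with re.MULTILINE, the end of each line).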
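
A short sketch of the anchors in action, using a made-up three-line string. It also shows how re.MULTILINE changes what ^ and $ match, and that $ tolerates one trailing newline:

```python
import re

text = "cat\ncatalog\nbobcat"

starts = re.findall(r'^cat', text)                     # start of the string only
line_starts = re.findall(r'^cat', text, re.MULTILINE)  # start of every line
ends = re.findall(r'cat$', text)                       # end of the string only
line_ends = re.findall(r'cat$', text, re.MULTILINE)    # end of every line

# $ also matches just before a newline that ends the string.
before_newline = bool(re.search(r'cat$', 'bobcat\n'))

print(starts)          # ['cat']
print(line_starts)     # ['cat', 'cat']
print(ends)            # ['cat']
print(line_ends)       # ['cat', 'cat']
print(before_newline)  # True
```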

d) Character Classes

[abc] : Matches any one of the characters a, b, or c.

[^abc] : Matches any character except a, b, or c.

[a-z] : Matches any lowercase letter in the range a to z.
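
The following sketch tries each kind of character class on invented sample text: a fixed set, a range combined with \d, and a negated set:

```python
import re

text = "Gray or grey? Room B12, seat c7."

spellings = re.findall(r'[Gg]r[ae]y', text)          # sets of allowed characters
codes = re.findall(r'[A-Za-z]\d+', text)             # a letter range, then digits
chunks = re.findall(r'[^aeiou\s]+', 'hello world')   # anything but vowels/whitespace

print(spellings)  # ['Gray', 'grey']
print(codes)      # ['B12', 'c7']
print(chunks)     # ['h', 'll', 'w', 'rld']
```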

e) Groups and Capturing

(...) : Groups part of the pattern and captures the text it matches.

(?:...) : Groups without capturing.
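
To illustrate the difference between capturing and non-capturing groups, here is a minimal sketch with made-up input strings. Only the parts wrapped in plain parentheses show up in groups():

```python
import re

# Three capturing groups: year, month, day.
date = re.search(r'(\d{4})-(\d{2})-(\d{2})', 'Released on 2024-05-01.')
print(date.group(0))   # 2024-05-01  (the whole match)
print(date.group(1))   # 2024        (first captured group)
print(date.groups())   # ('2024', '05', '01')

# (?:Mr|Ms) groups the alternatives without capturing them,
# so the surname is the only captured group.
name = re.search(r'(?:Mr|Ms)\. (\w+)', 'Ask Ms. Lovelace')
print(name.groups())   # ('Lovelace',)
```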

f) Alternation

| : Matches either the pattern before or after the |.
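
A brief sketch of alternation on invented strings. Note that | has low precedence, so it splits the entire pattern unless it is enclosed in a group:

```python
import re

pets = re.findall(r'cat|dog', 'cat, dog, catalog')
print(pets)      # ['cat', 'dog', 'cat']

# Without a group, the alternatives are "I like cats" and "dogs".
ungrouped = re.findall(r'I like cats|dogs', 'I like dogs')
print(ungrouped)  # ['dogs']

# With a non-capturing group, only the last word alternates.
grouped = re.findall(r'I like (?:cats|dogs)', 'I like dogs')
print(grouped)    # ['I like dogs']
```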


2. Importing the re Module

To use regular expressions in Python, import the re module from the standard library:


import re


3. Match Objects

When a pattern matches, functions such as re.match() and re.search() return a match object. It has several useful methods:

group() : Returns the matched string.

start() : Returns the starting position of the match.

end() : Returns the ending position of the match (the index just past its last character).

span() : Returns a tuple (start, end).

4. Basic Regex Functions

The re module provides several functions to work with regex:

a) re.match()

Checks for a match only at the beginning of the string.

Returns a match object if successful, otherwise None.

result = re.match(r'hello', 'hello world')
if result:
    print("Match found!")
else:
    print("No match.")

Output:

Match found!
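
For contrast, the sketch below shows the case where the pattern occurs in the string but not at the beginning, so re.match() returns None. It also shows that re.match() does not need to consume the whole string:

```python
import re

not_at_start = re.match(r'world', 'hello world')
print(not_at_start)        # None, because 'world' is not at position 0

prefix = re.match(r'hello', 'hello world')
print(prefix.group())      # hello  (the rest of the string is ignored)
```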

b) re.search()

Searches for a match anywhere in the string.

Returns a match object if successful, otherwise None.

result = re.search(r'world', 'hello world')
if result:
    print("Match found!")
else:
    print("No match.")

Output:

Match found!
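
One point worth illustrating: re.search() stops at the first match, even if the pattern occurs several times. A minimal sketch with made-up input:

```python
import re

first = re.search(r'\d+', 'rooms 12, 34 and 56')
print(first.group())   # 12  (only the first occurrence)
print(first.span())    # (6, 8)

missing = re.search(r'\d+', 'no digits here')
print(missing)         # None
```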

c) re.findall()

Finds all occurrences of the pattern in the string.

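Returns a list of all non-overlapping matches as strings. If the pattern contains capturing groups, the list holds the group contents instead.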

result = re.findall(r'\d+', '3 apples, 5 bananas, 10 cherries')
print(result)

Output:

['3', '5', '10']
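
The result of re.findall() changes shape when the pattern contains capturing groups. The sketch below, using an invented key=value string, shows the three cases:

```python
import re

text = 'alice=30, bob=25'

no_groups = re.findall(r'\w+=\d+', text)        # whole matches
one_group = re.findall(r'\w+=(\d+)', text)      # only the captured part
two_groups = re.findall(r'(\w+)=(\d+)', text)   # one tuple per match

print(no_groups)   # ['alice=30', 'bob=25']
print(one_group)   # ['30', '25']
print(two_groups)  # [('alice', '30'), ('bob', '25')]
```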

d) re.finditer()

Similar to re.findall(), but returns an iterator of match objects.

matches = re.finditer(r'\d+', '3 apples, 5 bananas, 10 cherries')
for match in matches:
    print(match.group())

Output:

3
5
10
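
Because re.finditer() yields match objects rather than plain strings, each result also carries its position. A small sketch on the same sample sentence:

```python
import re

text = '3 apples, 5 bananas, 10 cherries'

# Collect each number together with its (start, end) span.
positions = [(m.group(), m.span()) for m in re.finditer(r'\d+', text)]
print(positions)   # [('3', (0, 1)), ('5', (10, 11)), ('10', (21, 23))]
```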

e) re.sub()

Replaces occurrences of the pattern with a specified string.

result = re.sub(r'\d+', 'X', '3 apples, 5 bananas, 10 cherries')
print(result)


Output:

X apples, X bananas, X cherries
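
Beyond a fixed replacement string, re.sub() also accepts backreferences to captured groups, a function that computes each replacement, and a count limit. The sketch below shows all three on made-up inputs:

```python
import re

# Backreferences \1, \2, \3 reuse the captured groups in a new order.
swapped = re.sub(r'(\d{4})-(\d{2})-(\d{2})', r'\3/\2/\1', 'Due 2024-05-01')
print(swapped)   # Due 01/05/2024

# A function receives each match object and returns the replacement text.
doubled = re.sub(r'\d+', lambda m: str(int(m.group()) * 2), '3 apples, 5 bananas')
print(doubled)   # 6 apples, 10 bananas

# count limits how many occurrences are replaced.
limited = re.sub(r'\d+', 'X', '1 2 3', count=1)
print(limited)   # X 2 3
```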

f) re.split()

Splits the string by the occurrences of the pattern.

result = re.split(r'\d+', '3 apples, 5 bananas, 10 cherries')
print(result)

Output:

['', ' apples, ', ' bananas, ', ' cherries']
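
A few more re.split() behaviors, sketched on invented strings: splitting on several delimiters at once, limiting the number of splits with maxsplit, and keeping the delimiters by wrapping the pattern in a capturing group:

```python
import re

colors = re.split(r'[,;]\s*', 'red, green;blue,  yellow')
print(colors)    # ['red', 'green', 'blue', 'yellow']

head_tail = re.split(r',\s*', 'a, b, c', maxsplit=1)
print(head_tail)  # ['a', 'b, c']

# A capturing group makes the matched separators part of the result.
with_seps = re.split(r'(\d+)', '3 apples, 5 bananas')
print(with_seps)  # ['', '3', ' apples, ', '5', ' bananas']
```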


Example: Inspecting a Match Object

result = re.search(r'\d+', '3 apples, 5 bananas, 10 cherries')
if result:
    print(f"Matched: {result.group()}, Start: {result.start()}, End: {result.end()}")

Output:

Matched: 3, Start: 0, End: 1
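
Match objects also give access to individual groups, by number or by name. The sketch below uses named groups on a made-up string; the group names user and domain are chosen only for this example:

```python
import re

m = re.search(r'(?P<user>[\w.]+)@(?P<domain>[\w.]+)', 'mail alice@example.com now')

print(m.group())         # alice@example.com  (the whole match)
print(m.group('user'))   # alice              (group by name)
print(m.group(2))        # example.com        (group by number)
print(m.groupdict())     # {'user': 'alice', 'domain': 'example.com'}
print(m.span())          # (5, 22)
```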

5. Flags

Flags modify the behavior of regex functions. Common flags include:

re.IGNORECASE (re.I) : Case-insensitive matching.

re.MULTILINE (re.M) : Allows ^ and $ to match the start/end of each line.

re.DOTALL (re.S) : Allows . to match newline characters.


result = re.findall(r'hello', 'Hello world', re.IGNORECASE)
print(result)

Output:

['Hello']
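
The other two flags can be illustrated the same way. The sketch below uses a made-up two-line string and also shows that flags can be combined with the | operator:

```python
import re

text = "first line\nSecond line"

default_start = re.findall(r'^\w+', text)
multiline_start = re.findall(r'^\w+', text, re.MULTILINE)
print(default_start)     # ['first']
print(multiline_start)   # ['first', 'Second']

# Without DOTALL, . stops at the newline, so this cannot match.
no_dotall = re.search(r'first.*Second', text)
with_dotall = re.search(r'first.*Second', text, re.DOTALL)
print(no_dotall)              # None
print(with_dotall.group())    # the match spans both lines

combined = re.findall(r'^second', text, re.IGNORECASE | re.MULTILINE)
print(combined)          # ['Second']
```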

Example: Extracting Email Addresses

text = "Contact us at support@example.com or sales@example.org."
# Require a dot-separated domain so the sentence-ending period is not captured.
emails = re.findall(r'[\w.-]+@[\w-]+(?:\.[\w-]+)+', text)
print(emails)

Output:

['support@example.com', 'sales@example.org']

Example: Validating a Phone Number

def validate_phone(number):
    # fullmatch() requires the whole string to fit the pattern,
    # so a trailing newline is rejected as well.
    pattern = r'\d{3}-\d{3}-\d{4}'
    return bool(re.fullmatch(pattern, number))

print(validate_phone('123-456-7890'))  # True
print(validate_phone('123-4567'))      # False

Example: Extracting Phone Numbers and Emails from a String

import re

# Triple quotes are needed because the text spans several lines.
my_string = """Hello, my name is John, and you can reach me at 123-456-7890.
My office number is (987) 654-3210. If I'm unavailable, call my assistant at +1-800-555-0199.
We also have a support line: 800.123.4567.
support1@example1.com Our international number is +44 20 7946 0958.
You can also contact us at 999-888-7777 or (555) 123-4567.
If you prefer, send a text to 123 456 7890 or email us at support@example.com."""

# Optional country code, optional parentheses around the area code,
# and '-', '.', or whitespace as separators.
phone_pattern = r"\+?\d{0,3}[-.\s]?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}"
phone_numbers = re.findall(phone_pattern, my_string)

print(phone_numbers)

# Require a dot-separated domain so a sentence-ending period is not captured.
emails = re.findall(r'[\w.-]+@[\w-]+(?:\.[\w-]+)+', my_string)
print(emails)
