Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.
–Jamie Zawinski, 1997
I don’t actually agree with Mr. Zawinski – I’ve been using regular expressions successfully for over two decades, and I have done a lot of useful work with them. However, I do admit that they are cryptic and tricky. Here is a regular expression to parse a string like “Jan. 15, 2014” or “Aug. 27, 1990”:
\b([A-Z][a-z]{2})\.?\s+(\d+),\s+(\d{4})
If you want to retrieve the month, day, and year by name rather than numeric index, it would look like this:
\b(?P<MONTH>[A-Z][a-z]{2})\.?\s+(?P<DAY>\d+),\s+(?P<YEAR>\d{4})
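In Python, the named-group version would be used with the re module like this (a quick sketch; the sample sentence is my own):

```python
import re

# Compile the named-group pattern and pull fields out by name.
pattern = re.compile(r'\b(?P<MONTH>[A-Z][a-z]{2})\.?\s+(?P<DAY>\d+),\s+(?P<YEAR>\d{4})')
match = pattern.search('The deadline is Jan. 15, 2014.')
if match:
    print(match.group('MONTH'), match.group('DAY'), match.group('YEAR'))  # Jan 15 2014
```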
As it turns out, there are other ways to parse text. You could create a tailor-made parser that iterates over the text character by character, with logic for finding your target. This is tedious, error-prone, and time-consuming, and almost no one does it.
You could use string methods such as split(), startswith(), endswith(), etc., to grab pieces and then analyze them. This is likewise tedious, error-prone, and time-consuming, but some people do take this route because they are scared of regular expressions.
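The string-methods route for a date like “Jan. 15, 2014” might look like this minimal sketch (the helper name is my own, and it handles only the happy path):

```python
def parse_date(text):
    # Split on whitespace, then strip the trailing punctuation by hand.
    month_part, day_part, year = text.split()   # e.g. ['Jan.', '15,', '2014']
    month = month_part.rstrip('.')
    day = day_part.rstrip(',')
    return month, day, year

print(parse_date('Jan. 15, 2014'))   # ('Jan', '15', '2014')
```

Every variation in the input (extra commas, missing periods) means more special-case code, which is exactly the tedium described above.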
A better option is the pyparsing module. This article describes how to apply pyparsing to everyday text parsing tasks.
Pyparsing is a Python module for creating text parsers. It was developed by Paul McGuire. Install with pip for most versions of Python. (Note: The Anaconda Python Bundle, highly recommended for serious Python developers, includes pyparsing.)
First, you create a grammar to specify what should be matched. Then, you call a parsing function from the grammar and it returns text tokens while automatically skipping over white space. Pyparsing provides many functions for specifying what should be matched, how items should repeat, and more.
Note: the examples are written with Python 3, but should work identically in Python 2 if you convert the print() function back into the print statement.
The first step in using Pyparsing is to define a grammar. A grammar defines exactly what the target text can contain. The best way to do this is in a “top-down” manner, specifying the entire target, then refining what each component means until you get down to literal characters.
The usual notation for grammars is called Backus-Naur Form, or BNF for short. You don’t have to worry about following the rules exactly with BNF, but it is convenient for describing how things fit together.
The basic form is
symbol ::= expression
This means that symbol is composed of the parts specified in the expression. For instance, a person’s name can be specified as
name ::= first-name [middle-name] last-name
first-name ::= alpha | first-name alpha    # one or more alphas
last-name ::= alpha+                       # shorter way to express "one or more alphas"
alpha ::= 'a' | 'b' | 'c' etc.
This means that a name consists of a first name, an optional middle name (brackets indicate optional components), and a last name. A first name consists of one or more alphabetic characters, as does a last name. The plus sign means “one or more”. The pipe symbol means “or”. Pyparsing has predefined symbols for letters, digits, and other common sets of characters.
There are more details, but you get the idea.
If you are interested in the gory details of BNF, see http://en.wikipedia.org/wiki/Backus%E2%80%93Naur_Form
To use pyparsing, you need to import it. For convenience, you can import all the names from pyparsing:
from pyparsing import *
Let’s start with a US Social Security Number. The format is DDD-DD-DDDD, where D is any digit. Using BNF, the grammar for an SSN is
ssn ::= num+ '-' num+ '-' num+
num ::= '0' | '1' | '2' etc.
In pyparsing, you can represent a fixed number of digits with the expression
Word(nums, exact=N)
A “Word” is a sequence of characters surrounded by white space or punctuation.
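A quick standalone check of how Word behaves with an exact count (the variable name is my own):

```python
from pyparsing import Word, nums

# Word(nums, exact=3) matches exactly three consecutive digits.
three_digits = Word(nums, exact=3)
print(three_digits.parseString('123'))   # ['123']
```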
Since a dash is a literal character (a “primitive”, in the parsing world), you don’t need to do anything special with it. You can give the literal character a name, to help make your parser readable.
Thus, you have:
dash = '-'
ssn = Word(nums, exact=3) + dash + Word(nums, exact=2) + dash + Word(nums, exact=4)
To use the parser, call parseString() on the parser object:
target = '123-45-6789'
result = ssn.parseString(target)
print(result)
This gives the result
['123', '-', '45', '-', '6789']
For most applications, you would want the entire SSN as a single string, so you can add the Combine() function, which glues all of the tokens from the parser passed to it into a single token.
ssn = Combine(Word(nums, exact=3) + dash + Word(nums, exact=2) + dash + Word(nums, exact=4))
So now the result is
['123-45-6789']
An example script using this parser:
from pyparsing import *

'''
grammar:
ssn ::= nums+ '-' nums+ '-' nums+
nums ::= '0' | '1' | '2' etc.
'''

dash = '-'
ssn_parser = Combine(
    Word(nums, exact=3) + dash +
    Word(nums, exact=2) + dash +
    Word(nums, exact=4)
)

input_string = """
xxx 225-92-8416 yyy
103-33-3929 zzz
028-91-0122
"""

for match, start, stop in ssn_parser.scanString(input_string):
    print(match, start, stop)

The output will be:

['225-92-8416'] 5 16
['103-33-3929'] 21 32
['028-91-0122'] 37 48
There are four functions that you can call from a parser to do the actual parsing.
parseString – parses input text from the beginning; ignores extra trailing text.
scanString – looks through input text and generates matches; similar to re.finditer()
searchString – like scanString, but returns a list of tokens
transformString – like scanString, but specifies replacements for tokens
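To get a feel for the last two, here is a brief sketch using the SSN parser from above (the sample text and replacement string are my own choices):

```python
from pyparsing import Combine, Word, nums

ssn_parser = Combine(
    Word(nums, exact=3) + '-' + Word(nums, exact=2) + '-' + Word(nums, exact=4)
)

text = 'Records: 225-92-8416 and 103-33-3929.'

# searchString collects every match into a list of token lists.
print(ssn_parser.searchString(text))

# A parse action plus transformString redacts every SSN in place.
ssn_parser.setParseAction(lambda tokens: 'XXX-XX-XXXX')
print(ssn_parser.transformString(text))   # Records: XXX-XX-XXXX and XXX-XX-XXXX.
```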
Let’s say you had a configuration file that looked like this:
sample.cfg
city=Atlanta
state=Georgia
population=5522942
To parse a string in the format “KEY=VALUE”, there are 3 components: the key, the equals sign, and the value. When parsing such an expression, you don’t really need the equals sign in the results. The Suppress() function will parse a token without putting it in the results.
To make it a little easier to access individual tokens, you can provide names for the tokens, either with the setResultsName() function, or by just calling the parser with the name as its argument, which can be done when the parser is defined. Assigning names to tokens is the preferred approach.
from pyparsing import *

key = Word(alphanums)('key')
equals = Suppress('=')
value = Word(alphanums)('value')
kvexpression = key + equals + value

with open('sample.cfg') as config_in:
    config_data = config_in.read()

for match in kvexpression.scanString(config_data):
    result = match[0]
    print("{0} is {1}".format(result.key, result.value))
The output will be:
city is Atlanta
state is Georgia
population is 5522942
URLs are, of course, frequently used in everyday life. Without them, the Internet would be a vast mishmash. Oh, wait. It already is. Well, without URLs, the vast mishmash would have no directional signage. In this section, I’ll show you how to parse a complete URL.
For details on the syntax of a URL, see http://en.wikipedia.org/wiki/URI_scheme#Generic_syntax.
A URL consists of the following segments: scheme, user info, host, port, path, query, and fragment.
Most of those segments are optional. Only the scheme and host name/address are required.
Spaces are not allowed in URLs and are translated to a plus sign.
Many punctuation characters are not allowed as well, and are encoded as a percent sign, plus their ASCII value as a two-character hex number.
The allowed set of characters is letters, digits, and underscores, plus the following: - . ~ % +.
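The standard library handles this encoding for you; a quick aside (not pyparsing) showing it in both directions:

```python
from urllib.parse import quote_plus, unquote_plus

# Spaces become '+', and disallowed punctuation becomes %XX hex escapes.
print(quote_plus('Theodore Roosevelt!'))      # Theodore+Roosevelt%21
print(unquote_plus('Theodore+Roosevelt%21'))  # Theodore Roosevelt!
```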
An informal BNF grammar for parsing URLs:
url ::= scheme '://' [userinfo] host [port] [path] [query] [fragment]
scheme ::= http | https | ftp | file
userinfo ::= url_chars+ ':' url_chars+ '@'
host ::= alphanums | host ('.' alphanums)
port ::= ':' nums
path ::= url_chars+
query ::= '?' query_pairs
query_pairs ::= query_pair | (query_pairs '&' query_pair)
query_pair ::= url_chars+ '=' url_chars+
fragment ::= '#' url_chars
url_chars ::= alphanums + '-_.~%+'
Using the grammar, start at the bottom: the smallest components, since they have to be defined first. Work your way up, combining components, until you have completely defined the target text:
url_chars = alphanums + '-_.~%+'
fragment = Combine((Suppress('#') + Word(url_chars)))('fragment')
scheme = oneOf('http https ftp file')('scheme')
host = Combine(delimitedList(Word(url_chars), '.'))('host')
port = Suppress(':') + Word(nums)('port')
user_info = (
    Word(url_chars)('username') + Suppress(':') +
    Word(url_chars)('password') + Suppress('@')
)
query_pair = Group(Word(url_chars) + Suppress('=') + Word(url_chars))
query = Group(Suppress('?') + delimitedList(query_pair, '&'))('query')
path = Combine(
    Suppress('/') + OneOrMore(~query + Word(url_chars + '/'))
)('path')
url_parser = (
    scheme + Suppress('://') + Optional(user_info) +
    host + Optional(port) + Optional(path) +
    Optional(query) + Optional(fragment)
)
The final program is a module with a url_parser object:
url_parser.py
from pyparsing import *

'''
URL grammar

url ::= scheme '://' [userinfo] host [port] [path] [query] [fragment]
scheme ::= http | https | ftp | file
userinfo ::= url_chars+ ':' url_chars+ '@'
host ::= alphanums | host ('.' alphanums)
port ::= ':' nums
path ::= url_chars+
query ::= '?' query_pairs
query_pairs ::= query_pair | (query_pairs '&' query_pair)
query_pair ::= url_chars+ '=' url_chars+
fragment ::= '#' url_chars
url_chars ::= alphanums + '-_.~%+'
'''

url_chars = alphanums + '-_.~%+'
fragment = Combine((Suppress('#') + Word(url_chars)))('fragment')
scheme = oneOf('http https ftp file')('scheme')
host = Combine(delimitedList(Word(url_chars), '.'))('host')
port = Suppress(':') + Word(nums)('port')
user_info = (
    Word(url_chars)('username') + Suppress(':') +
    Word(url_chars)('password') + Suppress('@')
)
query_pair = Group(Word(url_chars) + Suppress('=') + Word(url_chars))
query = Group(Suppress('?') + delimitedList(query_pair, '&'))('query')
path = Combine(
    Suppress('/') + OneOrMore(~query + Word(url_chars + '/'))
)('path')
url_parser = (
    scheme + Suppress('://') + Optional(user_info) +
    host + Optional(port) + Optional(path) +
    Optional(query) + Optional(fragment)
)
I’ll use a list of assorted URLs to test the parser:
from url_parser import url_parser

test_urls = [
    'http://www.notarealsite.com',
    'http://www.notarealsite.com/',
    'http://www.notarealsite.com:1234/',
    'http://bob:%[email protected]:1234/',
    'http://www.notarealsite.com/presidents',
    'http://www.notarealsite.com/presidents/byterm?term=26&name=Roosevelt',
    'http://www.notarealsite.com/presidents/26',
    'http://www.notarealsite.com/us/indiana/gary/population',
    'ftp://ftp.info.com/downloads',
    'http://www.notarealsite.com#moose',
    'http://bob:[email protected]:8080/presidents/byterm?term=26&name=Roosevelt#bio',
]

fmt = '{0:10s} {1}'

for test_url in test_urls:
    print("URL:", test_url)
    tokens = url_parser.parseString(test_url)
    print(tokens, '\n')
    print(fmt.format("Scheme:", tokens.scheme))
    print(fmt.format("User name:", tokens.username))
    print(fmt.format("Password:", tokens.password))
    print(fmt.format("Host:", tokens.host))
    print(fmt.format("Port:", tokens.port))
    print(fmt.format("Path:", tokens.path))
    print("Query:")
    for key, value in tokens.query:
        print("\t{} ==> {}".format(key, value))
    print(fmt.format('Fragment:', tokens.fragment))
    print('-' * 60, '\n')
The output will be (for brevity, only one parsed URL shown):
URL: http://bob:[email protected]:8080/presidents/byterm?term=26&name=Roosevelt#bio ['http', 'bob', 's3cr3t', 'www.notarealsite.com', '8080', 'presidents/byterm', [['term', '26'], ['name', 'Roosevelt']], 'bio'] Scheme: http User name: bob Password: s3cr3t Host: www.notarealsite.com Port: 8080 Path: presidents/byterm Query: term ==> 26 name ==> Roosevelt Fragment: bio
Any parser (including the individual parsers that make up the “main” parser) can have an action associated with it. When the parser is used, it calls the function with a list of the scanned tokens. If the function returns a list of tokens, it replaces the original tokens. If it returns ‘None’, the tokens are not modified. This can be used to convert numeric strings into actual numbers, to clean up and normalize names, or to completely replace or delete tokens. Here is a parser that scans a movie title starting with “A Fistful of “, and uppercases the word that comes next:
from pyparsing import *

def upper_case_it(tokens):
    return [t.upper() for t in tokens]

prefix = 'A Fistful of' + White()
fist_contents = Word(alphas)
fist_contents.setParseAction(upper_case_it)
title_parser = Combine(prefix + fist_contents)

for title in (
    'A Fistful of Dollars',
    'A Fistful of Spaghetti',
    'A Fistful of Doughnuts',
):
    print(title_parser.parseString(title))
The output will be:
['A Fistful of DOLLARS']
['A Fistful of SPAGHETTI']
['A Fistful of DOUGHNUTS']
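Parse actions are also the standard way to convert numeric strings into actual numbers, as mentioned above. A small sketch (the parser and names are my own illustration):

```python
from pyparsing import Word, nums

# A parse action that replaces each matched digit string with a real int.
integer = Word(nums)
integer.setParseAction(lambda tokens: int(tokens[0]))

result = (integer + integer).parseString('26 1901')
print(result[0] + result[1])   # 1927 -- real ints, not strings
```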
Pyparsing is a mature, powerful alternative to regular expressions for parsing text into tokens and retrieving or replacing those tokens.
Pyparsing can parse things that regular expressions cannot, such as nested fields. It is really more similar to traditional parsing tools such as lex and yacc. In other words, while you can look for tags and pull data out of HTML with regular expressions, you couldn’t validate an HTML file with them. However, you could do it with pyparsing.
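To give a taste of nested matching, pyparsing ships a nestedExpr helper for balanced, arbitrarily deep delimiters, which no regular expression can match in general (the sample input is my own):

```python
from pyparsing import nestedExpr

# nestedExpr matches balanced parentheses to any depth,
# returning the contents as nested lists.
nested = nestedExpr('(', ')')
print(nested.parseString('(a (b (c d)) e)'))
```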
i. Yes, standard Python wisdom says ‘from module import *’ is evil, but in my opinion, the convenience outweighs the risk, at least for pyparsing.
ii. Why didn’t I say URI, which is more technically correct? Because “URL” is the most common way to refer to an address like ‘http://www.python.org“. URIs are a little more generic than URLs, and I wanted to keep things simple.
Author: John Strickler, one of Accelebrate’s Python instructors
Accelebrate offers private, on-site Python training.