Pyparseltongue: Parsing Text with Pyparsing

May 19, 2015 in Python Articles

Written by John Strickler


Text Parsing Tools

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

–Jamie Zawinski, 1997

I don’t actually agree with Mr. Zawinski – I’ve been using regular expressions successfully for over two decades, and I have done a lot of useful work with them. However, I do admit that they are cryptic and tricky. Here is a regular expression to parse a string like “Jan. 15, 2014” or “Aug. 27, 1990”:

\b([A-Z][a-z]{2})\.?\s+(\d+),\s+(\d{4})

If you want to retrieve the month, day, and year by name rather than numeric index, it would look like this:

\b(?P<MONTH>[A-Z][a-z]{2})\.?\s+(?P<DAY>\d+),\s+(?P<YEAR>\d{4})
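In Python, retrieving the named groups looks like this (a minimal sketch):

import re

date_re = re.compile(r'\b(?P<MONTH>[A-Z][a-z]{2})\.?\s+(?P<DAY>\d+),\s+(?P<YEAR>\d{4})')

match = date_re.search('The meeting is on Jan. 15, 2014.')
if match:
  print(match.group('MONTH'), match.group('DAY'), match.group('YEAR'))  # Jan 15 2014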

As it turns out, there are other ways to parse text. You could create a tailor-made parser that iterates over the text character by character, with logic for finding your target. This is tedious, error-prone, time-consuming, and almost no one does it.

You could use string methods such as split(), startswith(), endswith(), etc., to grab pieces and then analyze them. This is likewise tedious, error-prone, and time-consuming, but some people do take this route because they are scared of regular expressions.
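For instance, a hand-rolled string-methods version of the date parser above might look like the following sketch; note how much bookkeeping it needs compared to the one-line pattern:

def parse_date(text):
  month = day = year = None
  for chunk in text.split():
    if chunk.rstrip('.').isalpha():
      month = chunk.rstrip('.')
    elif chunk.endswith(',') and chunk[:-1].isdigit():
      day = chunk[:-1]
    elif chunk.isdigit() and len(chunk) == 4:
      year = chunk
  return month, day, year

print(parse_date('Jan. 15, 2014'))  # ('Jan', '15', '2014')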

A better option is the pyparsing module. This article describes how to apply pyparsing to everyday text parsing tasks.

About PyParsing

Pyparsing is a Python module for creating text parsers. It was developed by Paul McGuire. Install it with pip (pip install pyparsing) for most versions of Python. (Note: the Anaconda Python distribution, highly recommended for serious Python developers, includes pyparsing.)

First, you create a grammar to specify what should be matched. Then, you call a parsing function from the grammar and it returns text tokens while automatically skipping over white space. Pyparsing provides many functions for specifying what should be matched, how items should repeat, and more.
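Here is the classic greeting parser from the pyparsing documentation, just to show the flavor:

from pyparsing import Word, alphas

greeting = Word(alphas) + ',' + Word(alphas) + '!'
print(greeting.parseString('Hello, World!'))  # ['Hello', ',', 'World', '!']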

Note: the examples are written with Python 3, but should work identically in Python 2 if you convert the print() function back into the print statement.

Defining a Grammar

The first step in using Pyparsing is to define a grammar. A grammar defines exactly what the target text can contain. The best way to do this is in a “top-down” manner, specifying the entire target, then refining what each component means until you get down to literal characters.

The usual notation for grammars is called Backus-Naur Form, or BNF for short. You don’t have to worry about following the rules exactly with BNF, but it is convenient for describing how things fit together.

The basic form is

symbol ::= expression

This means that symbol is composed of the parts specified in the expression. For instance, a person’s name can be specified as

name ::= first_name [middle_name] last_name
first_name ::= alpha | first_name alpha  # one or more alphas
last_name ::= alpha+                     # shorter way to express "one or more alphas"
alpha ::= 'a' | 'b' | 'c' ...

This means that a name consists of a first name, an optional middle name (brackets indicate optional components), and a last name. A first name consists of one or more alphabetic characters, as does a last name. The plus sign means “one or more”. The pipe symbol means “or”. Pyparsing has predefined symbols for letters, digits, and other common sets of characters.
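A rough pyparsing rendering of this grammar (a sketch; pyparsing does not backtrack, so the optional middle name needs a FollowedBy lookahead to keep it from swallowing the last name):

from pyparsing import Word, alphas, Optional, FollowedBy

alpha_word = Word(alphas)
name = (
  alpha_word('first')
  + Optional(alpha_word('middle') + FollowedBy(alpha_word))
  + alpha_word('last')
)

print(name.parseString('John Quincy Adams'))  # ['John', 'Quincy', 'Adams']
print(name.parseString('Grace Hopper'))       # ['Grace', 'Hopper']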

There are more details, but you get the idea.

If you are interested in the gory details of BNF, see http://en.wikipedia.org/wiki/Backus%E2%80%93Naur_Form

A Simple Parser

To use pyparsing, you need to import it. For convenience, you can import all the names from pyparsing:

from pyparsing import *

Let’s start with a US Social Security Number. The format is DDD-DD-DDDD, where D is any digit.  Using BNF, the grammar for an SSN is

ssn ::= num+ '-' num+ '-' num+
num ::= '0' | '1' | '2' ...

In pyparsing, you can represent a fixed number of digits with the expression

Word(nums, exact=N)

A “Word” is a contiguous sequence of characters drawn from a given set (here, nums), ended by white space or by any character outside the set.
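For instance, a quick look at Word in isolation:

from pyparsing import Word, nums

three_digits = Word(nums, exact=3)
print(three_digits.parseString('123'))    # ['123']
print(three_digits.parseString('  123'))  # leading white space is skipped: ['123']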

Since a dash is a literal character (a “primitive”, in the parsing world), you don’t need to do anything special with it. You can give the literal character a name, to help make your parser readable.

Thus, you have:

dash = '-'
ssn = Word(nums, exact=3) + dash + Word(nums, exact=2) + dash + Word(nums, exact=4)

To use the parser, call parseString() on the parser object:

target = '123-45-6789'
result = ssn.parseString(target)
print(result)

This gives the result

['123', '-', '45', '-', '6789']

For most applications, you would want the entire SSN as a single string, so you can add the Combine() function, which glues all of the tokens from the parser passed to it into a single token.

ssn = Combine(Word(nums, exact=3) + dash + Word(nums, exact=2) + dash + Word(nums, exact=4))

So now the result is

['123-45-6789']

An example script using this parser:

from pyparsing import *
'''
grammar:
ssn ::= num+ '-' num+ '-' num+
num ::= '0' | '1' | '2' ...
'''
dash = '-'

ssn_parser = Combine(
  Word(nums, exact=3)
  + dash
  + Word(nums, exact=2)
  + dash
  + Word(nums, exact=4)
)

input_string = """
  xxx 225-92-8416 yyy
  103-33-3929 zzz 028-91-0122
"""

for match, start, stop in ssn_parser.scanString(input_string):
  print(match, start, stop)

The output will be:

['225-92-8416'] 9 20
['103-33-3929'] 29 40
['028-91-0122'] 45 56

A Few More Niceties

There are four functions that you can call from a parser to do the actual parsing.

parseString – parses input text from the beginning; ignores extra trailing text.
scanString – looks through input text and generates matches; similar to re.finditer().
searchString – like scanString, but returns a list of token lists.
transformString – like scanString, but specifies replacements for tokens.
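A quick sketch contrasting the last two:

from pyparsing import Word, nums

number = Word(nums)
print(number.searchString('a 12 b 345'))  # [['12'], ['345']]

# transformString replaces each match with the output of its parse action
number.setParseAction(lambda tokens: '#' * len(tokens[0]))
print(number.transformString('a 12 b 345'))  # a ## b ###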

Let’s say you had a configuration file that looked like this:

sample.cfg

city=Atlanta
state=Georgia
population=5522942

To parse a string in the format “KEY=VALUE”, there are three components: the key, the equals sign, and the value. When parsing such an expression, you don’t really need the equals sign in the results. The Suppress() function will parse a token without putting it in the results.

To make it a little easier to access individual tokens, you can provide names for the tokens, either with the setResultsName() function, or by just calling the parser with the name as its argument, which can be done when the parser is defined. Assigning names to tokens is the preferred approach.

from pyparsing import *

key = Word(alphanums)('key')
equals = Suppress('=')
value = Word(alphanums)('value')

kvexpression = key + equals + value

with open('sample.cfg') as config_in:
  config_data = config_in.read()

for match in kvexpression.scanString(config_data):
  result = match[0]
  print("{0} is {1}".format(result.key, result.value))

The output will be:

city is Atlanta
state is Georgia
population is 5522942
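As an aside, pyparsing's dictOf() helper can do the same job while collecting the pairs into a dict-like result; a minimal sketch:

from pyparsing import Word, alphanums, Suppress, dictOf

key = Word(alphanums)
value = Suppress('=') + Word(alphanums)
config = dictOf(key, value)

result = config.parseString('city=Atlanta\nstate=Georgia\npopulation=5522942')
print(result['city'])   # Atlanta
print(result.asDict())  # {'city': 'Atlanta', 'state': 'Georgia', 'population': '5522942'}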

How to Parse a URL

URLs are, of course, frequently used in everyday life. Without them, the Internet would be a vast mishmash. Oh, wait. It already is. Well, without URLs, the vast mishmash would have no directional signage. In this section, I’ll show you how to parse a complete URL.

For details on the syntax of a URL, see http://en.wikipedia.org/wiki/URI_scheme#Generic_syntax.

The URL decomposed

A URL consists of the following segments:

  • scheme (AKA protocol)
  • user information
  • host name/address
  • port number
  • path
  • query
  • fragment

Most of those segments are optional. Only the scheme and host name/address are required.

Spaces are not allowed in URLs; in query strings they are conventionally encoded as a plus sign (elsewhere as %20).

Many punctuation characters are not allowed either, and are encoded as a percent sign followed by their ASCII value as a two-character hex number.

The allowed set of characters is letters, digits, and underscores, plus the following: - . ~ % +
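You can see these encoding rules in action with the standard library's urllib.parse module:

from urllib.parse import quote_plus, unquote_plus

print(quote_plus('Theodore Roosevelt & co.'))      # Theodore+Roosevelt+%26+co.
print(unquote_plus('Theodore+Roosevelt+%26+co.'))  # Theodore Roosevelt & co.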

The URL parser grammar

An informal BNF grammar for parsing URLs:

url ::= scheme '://' [userinfo] host [port] [path] [query] [fragment]
scheme ::= 'http' | 'https' | 'ftp' | 'file'
userinfo ::= url_chars+ ':' url_chars+ '@'
host ::= alphanums | host '.' alphanums
port ::= ':' nums
path ::= '/' url_chars+
query ::= '?' query_pairs
query_pairs ::= query_pair | query_pairs '&' query_pair
query_pair ::= url_chars+ '=' url_chars+
fragment ::= '#' url_chars+
url_chars ::= alphanums | '-' | '_' | '.' | '~' | '%' | '+'

Building the pieces

Using the grammar, start at the bottom: the smallest components, since they have to be defined first.  Work your way up, combining components, until you have completely defined the target text:

url_chars = alphanums + '-_.~%+'
fragment = Combine(Suppress('#') + Word(url_chars))('fragment')
scheme = oneOf('http https ftp file')('scheme')
host = Combine(delimitedList(Word(url_chars), '.'))('host')
port = Suppress(':') + Word(nums)('port')
user_info = (
  Word(url_chars)('username')
  + Suppress(':')
  + Word(url_chars)('password')
  + Suppress('@')
)

query_pair = Group(Word(url_chars) + Suppress('=') + Word(url_chars))
query = Group(Suppress('?') + delimitedList(query_pair, '&'))('query')

path = Combine(
  Suppress('/')
  + OneOrMore(~query + Word(url_chars + '/'))  # ~query is a negative lookahead: stop once a query string begins
)('path')

url_parser = (
  scheme
  + Suppress('://')
  + Optional(user_info)
  + host
  + Optional(port)
  + Optional(path)
  + Optional(query)
  + Optional(fragment)
)

Putting it all together

The final program is a module with a url_parser object:

url_parser.py

from pyparsing import *

'''
URL grammar
  url ::= scheme '://' [userinfo] host [port] [path] [query] [fragment]
  scheme ::= 'http' | 'https' | 'ftp' | 'file'
  userinfo ::= url_chars+ ':' url_chars+ '@'
  host ::= alphanums | host '.' alphanums
  port ::= ':' nums
  path ::= '/' url_chars+
  query ::= '?' query_pairs
  query_pairs ::= query_pair | query_pairs '&' query_pair
  query_pair ::= url_chars+ '=' url_chars+
  fragment ::= '#' url_chars+
  url_chars ::= alphanums | '-' | '_' | '.' | '~' | '%' | '+'
'''

url_chars = alphanums + '-_.~%+'

fragment = Combine(Suppress('#') + Word(url_chars))('fragment')

scheme = oneOf('http https ftp file')('scheme')
host = Combine(delimitedList(Word(url_chars), '.'))('host')
port = Suppress(':') + Word(nums)('port')
user_info = (
  Word(url_chars)('username')
  + Suppress(':')
  + Word(url_chars)('password')
  + Suppress('@')
)

query_pair = Group(Word(url_chars) + Suppress('=') + Word(url_chars))
query = Group(Suppress('?') + delimitedList(query_pair, '&'))('query')

path = Combine(
  Suppress('/')
  + OneOrMore(~query + Word(url_chars + '/'))  # ~query is a negative lookahead: stop once a query string begins
)('path')

url_parser = (
  scheme
  + Suppress('://')
  + Optional(user_info)
  + host
  + Optional(port)
  + Optional(path)
  + Optional(query)
  + Optional(fragment)
)

Proof of concept

I’ll use a list of assorted URLs to test the parser:

from url_parser import url_parser

test_urls = [
  'http://www.notarealsite.com',
  'http://www.notarealsite.com/',
  'http://www.notarealsite.com:1234/',
  'http://bob:%[email protected]:1234/',
  'http://www.notarealsite.com/presidents',
  'http://www.notarealsite.com/presidents/byterm?term=26&name=Roosevelt',
  'http://www.notarealsite.com/presidents/26',
  'http://www.notarealsite.com/us/indiana/gary/population',
  'ftp://ftp.info.com/downloads',
  'http://www.notarealsite.com#moose',
  'http://bob:[email protected]:8080/presidents/byterm?term=26&name=Roosevelt#bio',
]

fmt = '{0:10s} {1}'

for test_url in test_urls:

  print("URL:", test_url)

  tokens = url_parser.parseString(test_url)

  print(tokens, '\n')
  print(fmt.format("Scheme:", tokens.scheme))
  print(fmt.format("User name:", tokens.username))
  print(fmt.format("Password:", tokens.password))
  print(fmt.format("Host:", tokens.host))
  print(fmt.format("Port:", tokens.port))
  print(fmt.format("Path:", tokens.path))
  print("Query:")
  for key, value in tokens.query:
    print("\t{} ==> {}".format(key, value))
  print(fmt.format('Fragment:', tokens.fragment))
  print('-' * 60, '\n')

The output will be (for brevity, only one parsed URL shown):

URL: http://bob:[email protected]:8080/presidents/byterm?term=26&name=Roosevelt#bio

['http', 'bob', 's3cr3t', 'www.notarealsite.com', '8080', 'presidents/byterm', [['term', '26'], ['name', 'Roosevelt']], 'bio']

Scheme:    http
User name: bob
Password:  s3cr3t
Host:      www.notarealsite.com
Port:      8080
Path:      presidents/byterm
Query:
  term ==> 26
  name ==> Roosevelt
Fragment:  bio

Taking Action

Any parser (including the individual parsers that make up the “main” parser) can have an action associated with it. When the parser is used, it calls the function with a list of the scanned tokens. If the function returns a list of tokens, they replace the original tokens; if it returns None, the tokens are not modified. This can be used to convert numeric strings into actual numbers, to clean up and normalize names, or to completely replace or delete tokens.

Here is a parser that scans a movie title starting with “A Fistful of”, and uppercases the word that comes next:

from pyparsing import *

def upper_case_it(tokens):
  return [t.upper() for t in tokens]

prefix = 'A Fistful of' + White()
fist_contents = Word(alphas)

fist_contents.setParseAction(upper_case_it)

title_parser = Combine(prefix + fist_contents)

for title in (
  'A Fistful of Dollars',
  'A Fistful of Spaghetti',
  'A Fistful of Doughnuts',
):
  print(title_parser.parseString(title))

The output will be:

['A Fistful of DOLLARS']
['A Fistful of SPAGHETTI']
['A Fistful of DOUGHNUTS']

Conclusion

Pyparsing is a mature, powerful alternative to regular expressions for parsing text into tokens and retrieving or replacing those tokens.

Pyparsing can parse things that regular expressions cannot, such as nested fields. It is really more similar to traditional parsing tools such as lex and yacc. In other words, while you can look for tags and pull data out of HTML with regular expressions, you couldn’t validate an HTML file with them. However, you could do it with pyparsing.
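As a parting illustration, here is one of those regex-impossible jobs – arbitrarily nested parentheses – handled by pyparsing's nestedExpr() helper:

from pyparsing import nestedExpr

print(nestedExpr().parseString('(a (b c) d)'))  # [['a', ['b', 'c'], 'd']]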


i. Yes, standard Python wisdom says ‘from module import *’ is evil, but in my opinion, the convenience outweighs the risk, at least for pyparsing.

ii. Why didn’t I say URI, which is more technically correct? Because “URL” is the most common way to refer to an address like ‘http://www.python.org’. URIs are a little more generic than URLs, and I wanted to keep things simple.

iii. For brevity, I only included a few common values for the scheme, leaving out nearly all of the over 200 scheme values registered with the IANA.

Author: John Strickler, one of Accelebrate’s Python instructors

Accelebrate offers private, on-site Python training.

