In Part 1, we explored some of the great power offered by the pandas framework. The whole process, from importing data through exploring, cleaning and basic wrangling, is a simple matter. Time series, handling of missing data and subsequent visualization of the results can be performed in a single line!
But what about more powerful tools, such as grouping, aggregation, joins, advanced string methods and other functions? Are they just as simple to carry out? And where are you most likely to see a UFO? All this—and more—in Part 2!
We have a nice longitudinal dataset that spans from 1400 to 2018 A.D. How does the rate of UFO reports change over time?
We will have to aggregate over year and count the number of occurrences. We do this aggregation with the .groupby() operation:
ufo_df.groupby('year')['posted'].count().reset_index().plot(x='year', y='posted', xlim=[1940.0, 2018.0], legend=False)
What happened in the mid-2000's? Does the introduction of so-called smart phones (while my own phone seems to keep getting dumber) have anything to do with the rate of sightings?
The first smart phone was introduced around the middle of 2007 (more or less):
ufo_df['smartphone_epoch'] = ['pre-cellphone' if x < 2007 else 'post-cellphone' for x in ufo_df['year']]
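As an aside, the same labels could be assigned without a list comprehension, using NumPy's vectorized where() (a minimal sketch, assuming NumPy is available alongside pandas):

import numpy as np

# Vectorized equivalent of the list comprehension above
ufo_df['smartphone_epoch'] = np.where(ufo_df['year'] < 2007,
                                      'pre-cellphone', 'post-cellphone')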
ufo_df[ufo_df['smartphone_epoch'] == 'post-cellphone']['month'].plot.hist(bins=12, color='blue', alpha=0.5, label="post-cellphone", legend=True)
ufo_df[ufo_df['smartphone_epoch'] == 'pre-cellphone']['month'].plot.hist(bins=12, color='red', alpha=0.5, label="pre-cellphone", legend=True)
Note: again, because the plot is really a matplotlib object, you could also have obtained the legend by calling it 'directly':
plt.legend()
Recall the median of report counts occurred in 2009, so there are more post-2007 reports in total. Interestingly, the reports from 2007 onwards peak in July, rather than in the well-defined June peak for the pre-2007 reports. There also appears to be a relatively heightened rate of reports introduced in January. Any thoughts on why that would be? This is not a rhetorical question!
One of the columns displaying a great deal of variability is the 'shapes' field. How are these shapes distributed?
ufo_df.groupby('shape')['posted'].count().sort_values(ascending=False).head(20).plot.bar()
By far the most common shape is 'Light'. Cigar-shaped objects were way down (12th) on the list, which surprised me, as that is the descriptor that always rang out 'technical UFO description' to me as a kid.
Often in the EDA phase, it becomes clear that further cleaning is required. This is certainly the case here!
Recall we noted that there were 227 cities missing. This hasn't been a problem until now, because we didn't care about the city names. We need to decide how to handle these missing data.
In addition, there are quite a few other tasks required to clean up the city names. For example, there are many entries that attempt to clarify or condition with commentary in parentheses:
ufo_df.dropna()[ufo_df['city'].dropna().str.contains(r"\)$")][['date_time', 'city', 'year', 'month']].sample(5)

                  date_time                                 city  year  month
95593    2001-03-0 19:45:00                Henderson (Las Vegas)  2001      3
104680   1995-12-0 23:10:00                     Salluit (Canada)  1995     12
83712    2003-11-2 11:30:00                   Calhoun (north of)  2003     11
51673    2010-08-0 00:30:00  North Kenai (Daniels Lake east end)  2010      8
52326    2010-06-1 21:18:00                Hatchet Lake (Canada)  2010      6
Note the use of vectorized string methods and string matching with regular expressions (regex): the "\)$" means "look for a ')' at the very end of the string". More regexes to come!
There are:
ufo_df.dropna()[ufo_df['city'].dropna().str.contains(r"\)$")].count()[0]

8,673 of these observations!
This includes the following mysterious gem:
((town name temporarily deleted)), OK
This is the kind of thing we came here for right here!
Note the double use of .dropna(), which prevents data misalignment; pandas enforces this, so you'll get an error if the boolean mask and the DataFrame index don't line up.
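As an aside, one way to avoid the double .dropna() altogether (a minimal sketch, using the same column names) is to tell str.contains() to treat missing values as non-matches, so the boolean mask stays aligned with the full DataFrame:

# na=False makes missing city names count as "no match"
ufo_df[ufo_df['city'].str.contains(r"\)$", na=False)].sample(5)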
Speaking of which, we'll get rid of those no-good city NaNs. The .dropna() method has some helpful parameters to specify rows/columns and in-place filtering:
ufo_df.dropna(subset=['city'], inplace=True)
Perhaps we could split the city strings by " (" and retain the 0th element:
ufo_df['city'] = ufo_df['city'].str.split(sep=" (")[0]  # this won't do what we want

Unfortunately, pandas' Series.str.split() does not mirror the built-in str.split(): it has no sep keyword (the argument is called pat, and is treated as a regular expression), and you need the .str[0] accessor, rather than [0], to take the first piece of each split.
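For completeness, a pandas-native version might look something like this (a sketch; the pattern is treated as a regular expression, hence the escaped parenthesis, and .str[0] keeps the part before the first " ("):

# Peek at everything before the first " (" in each city name
ufo_df['city'].str.split(r" \(").str[0].head()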
While we are at it, there are a few entries with non-standard capitalization (such as 'AmArillo'). We enforce title case with str.title():
ufo_df['city'] = [x.title().split(sep=" (")[0] for x in ufo_df['city']]
There are still 20 cities with parentheses, and at least one with '{' characters. I was hoping it wouldn't come to this...
...but we're going to have to wade back, deeper into the murky waters of regular expressions (regex):
import re
Remove anything following a ( or {:
ufo_df['city'] = [re.split(r"\s*[\(\{]", x.title())[0] for x in ufo_df['city']]
To translate: \s matches whitespace, and * means 'zero or more of the preceding element'. Characters inside the [] are treated as alternatives (a logical OR) and can also specify ranges. The ( and { each need to be escaped with a \ because they are otherwise meaningful to the regex interpreter. So "\s*[\(\{]" roughly translates to 'any optional whitespace, followed directly by a ( or a {'. In this case, we use this as a delimiter to split the string, and then we discard anything after this (complicated) delimiter.
You have probably worked out by now that regular expressions are (relatively) straight-forward to write, but often painful to read!
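To see the pattern in action on a single, made-up example string:

re.split(r"\s*[\(\{]", "Henderson (Las Vegas)")[0]   # -> 'Henderson'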
Check that nothing went awry (as often happens when prototyping regexes):
ufo_df.dropna()[ufo_df['city'].dropna().str.contains(r"\)$|\(")]
There are only two entries with (these types of) entry errors left. We can manually input them:
ufo_df.loc[115268, 'city'] = 'Aliquippa' # The largest 'city' in Beaver County, PA
ufo_df.loc[25404, 'city'] = 'Cedar Rapids'
So, how do the cities look now? Let's look at city names containing the string "New York":
ufo_df[ufo_df['city'].str.contains("New York")]['city'].unique()

array(['New York City', 'New York', 'New York Mills', 'West New York',
       'New York State Thruway / Catskill', 'New York City, Manhattan',
       'New York City/Far Rockaway', 'New York City/Staten Island',
       'New York City/Philadelphia', 'New York/Philadelphia',
       'New York City/Central Park', 'New York/San Francisco',
       'Central New York', 'East New York', 'New York Worlds Fair'],
      dtype=object)
Still some way to go.
There are plenty of locations given as lying between main population centers, often separated by a '/' or an '&'. In most cases, the larger settlement is to the left of the slash or ampersand. Regex to the rescue again!
ufo_df['city'] = [re.sub(r"\s*/.*", "", x.title()) for x in ufo_df['city']]
ufo_df['city'] = [re.sub(r"\s*&.*", "", x.title()) for x in ufo_df['city']]
ufo_df['city'] = [re.sub("^[Bb]etween", ",", x.title()) for x in ufo_df['city']]
CHECK. Always check when transforming data for the first time.
ufo_df[ufo_df['city'].str.contains("New York")]['city'].unique()
Much better. However, there are a lot of cities with helpful-but-unhelpful directional modifiers, such as "East Los Angeles". Similarly, there are some suffixes, such as in "San Francisco Area" that could be purged. There are as many ways to do this as there are to skin a Xenomorph.
I approached this by defining two functions to perform these cleaning operations:
def purgeDirectionalModifiers(df, city_name):
    """Strip off directional prefixes North, South etc. from city names in DataFrame df"""
    direction_list = ["North", "N.", "East", "E.", "South", "S.", "West", "W."]
    direction_regex = re.compile(' |'.join(direction_list) + ' ')
    # Only touch the rows whose city name contains city_name
    match_idx = df[df['city'].str.contains(city_name)].index
    df.loc[match_idx, 'city'] = [re.sub(direction_regex, "", x) for x in df.loc[match_idx, 'city']]


def purgeCitySuffix(df, city_name):
    """Strip specifiers for city, county etc."""
    suffix_list = ["City", "County", "Area", "Bay", "Airport", "D.C.", "Dc", ","]
    suffix_regex = re.compile(' ' + '$| '.join(suffix_list) + "$")
    match_idx = df[df['city'].str.contains(city_name)].index
    df.loc[match_idx, 'city'] = [re.sub(suffix_regex, "", x) for x in df.loc[match_idx, 'city']]
I then obtained a list of cities that have the worst of these problems[7], and cleaned them:
cities_to_clean = ["Sacramento", "Seattle", "Milwaukee", "Baltimore",
                   "Las Vegas", "Boston", "San Francisco", "Washington",
                   "Chicago", "Los Angeles", "New York"]

for city in cities_to_clean:
    purgeDirectionalModifiers(ufo_df, city)
    purgeCitySuffix(ufo_df, city)
How many cities are there now in this data set?
len(ufo_df['city'].unique()) # 18,366
The US cities aren't uniquely specified. We need to add their US state:
ufo_df['city'] = ufo_df['city'] + ', ' + ufo_df['state']
All right. Let's look at how many reports came from each city. Taking the top ten:
ufo_df.groupby('city')['posted'].count().sort_values(ascending=False).head(10).plot.barh()
plt.gca().invert_yaxis()
It certainly makes sense that New York, as the most populous US city, would have the most UFO reports. However, Seattle seems to be punching well above its weight. That explains the Space Needle!
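Incidentally, the same top-ten ranking can be obtained a little more tersely with .value_counts() (essentially equivalent, assuming the 'posted' column has no missing values):

# Count reports per city and keep the ten largest
ufo_df['city'].value_counts().head(10)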
How about the observations per capita?
In the words of that great data scientist, Sherlock Holmes: "Data! Data! Data! I can't make bricks without clay!" We will need some data from another source. The US Census Bureau has data available for the 770 most populous cities: https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml
city_pop = pd.read_csv("PEP_2017_PEPANNRSIP.US12A_with_ann.csv", encoding='latin-1')
However, there are far too many columns for our needs. We'll rename the census-specific labels of the columns we want to keep and drop the rest:
city_pop = city_pop.rename(index=str, columns={'GC_RANK.rank-label': 'rank',
                                               'GC_RANK.display-label.1': 'city_state',
                                               'respop72017': 'pop'})
city_pop = city_pop[['rank', 'city_state', 'pop']]
Or we could have used the .drop() method, as above (a sketch of that alternative appears after the next snippet). Further cleaning:
city_pop['city_state'] = city_pop['city_state'].str.replace(r'\(balance\)', '')  # This term means something to the Census Bureau...
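Incidentally, the .drop() alternative mentioned above might look like this (a rough sketch; it keeps the same three columns by dropping everything else):

# Drop every column that is not one of the three we renamed above
city_pop = city_pop.drop(columns=city_pop.columns.difference(['rank', 'city_state', 'pop']))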
There are still a few (eight) cities with hyphens, official county designations or other weirdness. Let's enter them manually, otherwise we'd miss out on some major cities!
city_pop.loc['23', 'city_state'] = "Nashville city, Tennessee"
city_pop.loc['59', 'city_state'] = "Lexington city, Kentucky"
city_pop.loc['88', 'city_state'] = "Winston city, North Carolina"
city_pop.loc['121', 'city_state'] = "Augusta city, Georgia"
city_pop.loc['171', 'city_state'] = "Macon County, Georgia"
city_pop.loc['219', 'city_state'] = "Athens County, Georgia"
city_pop.loc['28', 'city_state'] = "Louisville city, Kentucky"
city_pop.loc['55', 'city_state'] = "Honolulu city, Hawaii"
The convention is largely "X city, Y" where X is the name of the city and Y is the full name of the state. So we could have split by the " city, " delimiter.
Unfortunately, there are towns, cities, villages and counties in the mix. Take one step back — your doctor recommends another dose of regex!
city_delimiter = ' [Cc]ity, '
city_delimiter += '| town, ' # Note the pipe operator '|', used as logical OR
city_delimiter += '| [Cc]ounty, '
city_delimiter += '| village, '
city_delimiter += '| municipality, '
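A quick check of this delimiter on a single, made-up example string:

re.split(city_delimiter, "Nashville city, Tennessee")   # -> ['Nashville', 'Tennessee']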
Now we can split the cities and states according to this (horrendous, just don't think about it) delimiter:
city_pop['city'] = [re.split(city_delimiter, x)[0] for x in city_pop['city_state']]
city_pop['state'] = [re.split(city_delimiter, x)[-1] for x in city_pop['city_state']]
A fun aside: how many cities have the same name in the US?
city_pop.groupby('city').count().sort_values('state', ascending=False)['state'].head(5)

city
Springfield    5
Lakewood       4
Albany         3
Bloomington    2
Charleston     2
There are five Springfields! Matt Groening must have known this when he created The Simpsons.
There are 36 major cities in the US with duplicated names, so we'll need to retain the state information to disambiguate them. This is why people have to specify the US state. You don't want to get on a plane to Portland, only to realize you're heading to the opposite coast!
Again, make sure the data imported fine and looks like it makes sense. In the biz', we call this a 'sanity check'.
city_pop.sample(n=5)

     rank                        city_state    pop               city       state
412   413           Westland city, Michigan  81747           Westland    Michigan
569   570            Blaine city, Minnesota  64557             Blaine   Minnesota
552   553  North Little Rock city, Arkansas  65911  North Little Rock    Arkansas
550   551    Laguna Niguel city, California  66334      Laguna Niguel  California
506   507                 Canton city, Ohio  70909             Canton        Ohio
Let's check the ten most populous cities:
city_pop.sort_values('pop', ascending=False).head(10).plot.barh(x='city_abbrev', legend=False)
plt.gca().invert_yaxis()
Unfortunately, the good people at the Census Bureau write out their US states in long form. But we just want their abbreviations. I created a text file with all the states' names and abbreviations, 'state_abbrev.txt':
state_ref = pd.read_csv("state_abbrev.txt")
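If you would rather not maintain a separate file, the same lookup table could be built inline. A sketch, where the 'state' and 'abbreviation' column names are inferred from the merge below and only the first few states are shown:

state_ref = pd.DataFrame({
    'state': ['Alabama', 'Alaska', 'Arizona'],        # ...and so on, through Wyoming
    'abbreviation': ['AL', 'AK', 'AZ'],
})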
Let's add the abbreviation references to the city_pop DataFrame using the .merge() method:
city_pop = city_pop.merge(state_ref, on='state', how='left')
Combine the city and state into one column to make the merge syntax easier (i.e. we only have to use a single key for each[8]):
city_pop['city_abbrev'] = city_pop['city'] + ', ' + city_pop['abbreviation']
Ugh. Now we have too many columns again. We only need two:
city_pop = city_pop[['city_abbrev', 'pop']]
Finally, merge the city_pop references into the ufo_df DataFrame:
ufo_merged = ufo_df.merge(city_pop, left_on='city', right_on='city_abbrev', how='left')
Note: by doing this, we are automatically resigning ourselves to analyzing US-only data; the following transformations make no sense if there is no associated US state.
How much data will this discard?
ufo_df['state'].isnull().sum()/len(ufo_df)
Hence 7.2% of the entries have no defined state. By inspection, most of these are from outside of the US. This could be a follow-up investigation.
Also...
city_pop['pop'].min()
...cities with populations of less than 47,929 will be omitted.
Get rid of the resulting city NaNs:
ufo_merged.dropna(subset=['city'], inplace=True)
Finally, we can answer the question 'what are the cities with highest recorded UFO sightings per capita?'
ufo_per_capita = ufo_merged.groupby('city')['posted'].count()
ufo_per_capita = pd.DataFrame(ufo_per_capita)

ufo_per_capita = ufo_per_capita.join(city_pop.set_index('city_abbrev'), how='left')
ufo_per_capita.dropna(subset=['pop'], inplace=True)
ufo_per_capita['obs_per_1000'] = 1000*ufo_per_capita['posted']/ufo_per_capita['pop']
ufo_per_capita.sort_values('obs_per_1000', ascending=False)['obs_per_1000'].head(10).plot.barh(legend=False)
plt.gca().invert_yaxis()
Wow! Who would have thought that Tinley Park, Illinois would have the most UFO reports per capita?
The second, Sarasota, FL, isn't even close. Personally, I was expecting somewhere like Santa Fe, New Mexico, or Portland, Oregon, to have the highest reports per capita (Roswell, NM didn't quite have a large enough population to be included).
ufo_merged.groupby('state')['posted'].count().sort_values(ascending=False).head(10)

state    observations
CA       13235
FL        6312
WA        5767
TX        4833
NY        4580
AZ        3909
PA        3784
IL        3521
OH        3470
MI        2922
It's reasonable that California has twice the reports of Florida, as it has almost twice the population. But Washington state, with a little less than half of California's report count, has only around a fifth of its population...
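A quick back-of-the-envelope check, using rough 2017 state populations that I'm plugging in here (approximations, not values from our dataset):

# Reports per million residents, using approximate 2017 state populations
state_reports = {'CA': 13235, 'FL': 6312, 'WA': 5767}
state_pop_millions = {'CA': 39.5, 'FL': 21.0, 'WA': 7.4}   # rough estimates

for state, n_reports in state_reports.items():
    print(f"{state}: {n_reports / state_pop_millions[state]:.0f} reports per million residents")

That puts Washington at more than double the per-capita rate of either California or Florida.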
Once again, Washington is punching well above its weight in terms of UFO reportage. I always thought Jimi Hendrix was out of this world!
There are plenty of avenues for exploration of this dataset using pandas.
For example, there are over 720 potential 'HOAX' flags in the summary pages. Adding this label to the observations may make a neat machine learning exercise (although it's unlikely we have enough information in the reports alone to make any useful prediction).
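A starting point for such a label might be as simple as the following (a sketch; it assumes the report text lives in a column called 'summary', which may not match the actual column name in your DataFrame):

# Flag reports whose text mentions 'HOAX' (assumed column name 'summary')
ufo_df['possible_hoax'] = ufo_df['summary'].str.contains('HOAX', case=False, na=False)
ufo_df['possible_hoax'].sum()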
There are numerous famous UFO incidents that may warrant an in-depth study, such as the 'Phoenix Lights' in 1997, or the 'Chicago O'Hare Lights' in 2006. Unfortunately, the mysterious airship event in Aurora, Texas of 1897 has only a single entry.
Hopefully this tutorial has demonstrated some of the power and ease of application of the pandas module.
There is so much functionality that we could not cover all the features of pandas here. For instance, we have not showcased transitioning between wide (unstacked) and long data formats, or some of the powerful time-series methods, such as periodic analysis or window functions. But we have hopefully shown that data analysis doesn't always have to be difficult. Given the right dataset, it can even be fun!
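As a parting taste of the window functions just mentioned, here is a sketch that reuses the yearly counts from the start of this part to smooth the report rate with a five-year moving average:

# Five-year centred moving average of the yearly report counts
yearly_counts = ufo_df.groupby('year')['posted'].count()
yearly_counts.rolling(window=5, center=True).mean().plot()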
Until next time — see you on the seventh rock from 40 Eridani A, Earthling!
[7] For more details, I refer you to the full code for this tutorial.
[8] The on keyword in the .merge() method can easily take a list for primary and secondary keys, but we want to retain the US state information for plotting purposes too.
Written by Ra Inta