Tired of your fellow humans being probed by aliens? It’s time to probe back! Well, at least, probe the data on UFO sightings. To do this, we’ll showcase one of the most popular open source tools for analyzing data: the pandas package.
Not to be confused with the Chinese bear, the name pandas is a portmanteau of ‘panel data,’ an econometrics term for multidimensional structured datasets. A package for the popular programming language Python, the project arose in response to a number of key issues the language had with handling data. Its author, Wes McKinney, noted that base Python did not cope gracefully with missing data, time series, labeling of data and metadata, or the common merge/join operations found in comparable analysis tools such as SAS, R, SQL and MATLAB. The project began in 2008 and remains open source, with enthusiastic community support.
Consequently, pandas usage has grown steadily, and it is now one of the most popular and powerful tools in a data scientist’s workshop [1]. It provides a versatile framework for a plenitude of tasks in data science, designed to import, manipulate and present data efficiently and (mostly) intuitively. When teaching pandas in our Data Science courses, we recommend Python for Data Science Handbook by Jake VanderPlas, as well as Wes McKinney’s Python for Data Analysis (both published by O’Reilly Media).
Two key structures are introduced with pandas: the DataFrame and the Series objects. The former will be familiar to those with experience using the R language for statistical computing (as data.frame); it was indeed the prototype for the pandas version. A Series is a one-dimensional array, with optional labeling and naming. In many ways, a Series acts like a vector, in that common vector operations are defined if the elements are numerical types.
However, it can be referenced by label, similar to a dictionary, and it also supports time series relatively gracefully. A DataFrame is a two-dimensional tabular data structure of one or more Series objects, with labels for each axis: rows and columns.
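To make these two structures concrete, here is a minimal sketch (with invented numbers) of a labeled Series and a DataFrame built from it:

import pandas as pd

# A Series is a one-dimensional labeled array; arithmetic acts element-wise, like a vector
sightings = pd.Series([3, 7, 2], index=['TX', 'PA', 'MD'], name='sightings')
doubled = sightings * 2        # element-wise multiplication
sightings['PA']                # dictionary-style access by label -> 7

# A DataFrame collects one or more Series sharing row labels, with column labels
df = pd.DataFrame({'sightings': sightings,
                   'hoaxes': pd.Series([0, 1, 0], index=['TX', 'PA', 'MD'])})
df.loc['TX', 'hoaxes']         # row label, column label -> 0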
So, without further ado, let’s get our probes ready, and analyze some data!
The National UFO Reporting Center (NUFORC) hosts and curates a comprehensive repository of UFO sightings. Physically housed in Missile Site #6 — an abandoned Atlas missile bunker (of course) in rural Washington state — NUFORC is currently directed by Peter Davenport, who took the helm in 1994. Members of the public may report a possible UFO experience via the website, fax or telephone. Care is taken to ascertain whether reports are potential hoaxes or otherwise erroneous.
The reports themselves are publicly accessible via the NUFORC website. On an ordered collection of web pages, the entries are stored as plain HTML tables, where each observation is a single line. The columns contain data on the event date, city and US state (where applicable), shape of the object(s) witnessed, duration of the event, a brief summary text, and, finally, the date the report was posted to NUFORC. Typical entries look like the following:
Date / Time | City | State | Shape | Duration | Summary | Posted
6/30/05 23:30 | Grapevine | TX | Light | 5-10 sec | I was talking on my cell phone by the pool and i was looking up into the night sky, I had been there for at least 20 minutes when I saw | 7/5/05
6/30/05 23:25 | Jackson | PA | Fireball | 30 sec | ((HOAX??)) Looked like it was a red moon coming up over the mountain. I think I may have | 11/28/07
6/30/05 22:42 | Friendship | MD | Diamond | 2 Minutes | Two Diamond shaped craft spotted over southern Maryland | 7/5/05
As you can probably guess, the entries make for a fun and interesting read.
For the purposes of analysis, I generated a single Comma-Separated Values (CSV) file from the pages using a web-scraper [2] constructed with the excellent scrapy Python package [3]. One important function of the web scraper was to capture more accurate times for events. Many of the earlier date/time entries had only two digits for the year (yy), yet the archive extends from the fifteenth century to now (September 2018). This meant that reports from, e.g., 1967 would appear as 2067, and so on. However, thanks to the curation efforts of NUFORC, the year and month are held as six-digit metadata in the URL for each entry; this vital information is parsed [4] and added to each entry.
First of all, whenever possible, you should briefly examine the data before importing it, with a text editor or viewer. I prefer the excellent file pager less, which is perfect for examining large files without crashing or noticeably slowing your personal computer.
date_time,year,month,city,state,shape,duration,posted,url
8/30/18 22:00,2018,08,Morehead City,NC,Unknown,1 hour,8/31/18,http://www.nuforc.org/webreports/142/S142925.html
8/30/18 21:36,2018,08,Commerce,GA,Circle,5 seconds,8/31/18,http://www.nuforc.org/webreports/142/S142924.html
8/30/18 21:15,2018,08,Queens Village,NY,Light,15 minutes,8/31/18,http://www.nuforc.org/webreports/142/S142929.html
Good. Looks like a CSV file should, complete with comma delimiters and a header line.
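If you would rather stay in Python than reach for a pager, a quick sketch like the following reads just the first few lines without loading the whole file into memory:

# Peek at the first three lines of the raw file
with open("national_ufo_reports.csv") as f:
    for _ in range(3):
        print(f.readline().rstrip())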
Note the ‘year’ and ‘month’ fields were extracted directly from the URL (which is a different URL from the summary links provided in the table). The ‘summary’ field was omitted because it is a truncated version of what is contained in the ‘url,’ and there is not much we can do with it, in terms of analysis, for now anyway (that is the domain of Natural Language Processing).
If you have not already installed the pandas library, do so now. We recommend using Anaconda’s conda package manager, to keep versions consistent. However, you could also use pip (pip install pandas) or Easy Install (easy_install pandas).
Import this, as well as the popular graphics package, matplotlib:
import pandas as pd
import matplotlib as mpl
Let’s tweak some of the plotting defaults for some flavor (note that you probably will not have the Souvenir font installed on your local system):
mpl.rcParams['font.family'] = 'Souvenir'
mpl.rcParams.update(
    {'font.size': 12, 'lines.linewidth': 2, 'lines.markersize': 5}
)
Importing data files with pandas is very easy. We have stored our UFO reports data in a Comma-Separated Values (CSV) file (national_ufo_reports.csv), so we can simply use the pd.read_csv() function:
ufo_df = pd.read_csv("national_ufo_reports.csv")
A somewhat common practice is to put a _df at the end of a variable name to remind us that it is a DataFrame object (recall that a DataFrame is a collection of Series objects). Our UFO data is in what is often referred to as a ‘wide’ table format, where each column is a variable, and each row an observation.
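For contrast, the ‘long’ format stacks variable names into a single column. A toy sketch (with invented values) using the .melt() method shows the difference:

wide = pd.DataFrame({'city': ['Grapevine', 'Jackson'],
                     'lights': [5, 2], 'fireballs': [0, 1]})
# melt() gathers the wide columns into (variable, value) pairs: the 'long' format
long_df = wide.melt(id_vars='city', var_name='shape', value_name='count')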
It is a simple matter to interrogate the first few elements using the .head() method associated with pandas DataFrames:
ufo_df.head(5)
       date_time  year  month            city state    shape    duration   posted                                                url
0  8/30/18 22:00  2018      8   Morehead City    NC  Unknown      1 hour  8/31/18  http://www.nuforc.org/webreports/142/S142925.html
1  8/30/18 21:36  2018      8        Commerce    GA   Circle   5 seconds  8/31/18  http://www.nuforc.org/webreports/142/S142924.html
2  8/30/18 21:15  2018      8  Queens Village    NY    Light  15 minutes  8/31/18  http://www.nuforc.org/webreports/142/S142929.html
3  8/30/18 20:48  2018      8    Independence    KS  Unknown   3 minutes  8/31/18  http://www.nuforc.org/webreports/142/S142928.html
4  8/30/18 20:25  2018      8         Redding    CA    Light   5 seconds  8/31/18  http://www.nuforc.org/webreports/142/S142926.html
Note the row indices (here, 0–4) and column labels (‘date_time’ through ‘url’). We can easily call any given element. We can look at the Series from any column by directly calling it with [] (note the use of .head() here to suppress the otherwise large number of rows):
ufo_df['date_time'].head()
0    8/30/18 22:00
1    8/30/18 21:36
2    8/30/18 21:15
3    8/30/18 20:48
4    8/30/18 20:25
Name: date_time, dtype: object
Note the default for .head() is the first five rows.
We could look at the fourth row (index of 3 because, as is conventional in Python, we index from 0) by using .loc[]:
ufo_df.loc[3]
date_time                                         8/30/18 20:48
year                                                       2018
month                                                         8
city                                               Independence
state                                                        KS
shape                                                   Unknown
duration                                              3 minutes
posted                                                  8/31/18
url           http://www.nuforc.org/webreports/142/S142928.html
Name: 3, dtype: object
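A single element can be addressed with a row label and a column label together:

ufo_df.loc[3, 'city']    # -> 'Independence'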
Because any given contiguous region of a data file tends to contain localized observations, it pays to take a random sample of a DataFrame to better gauge the general variability of the dataset. To take a random sample of five rows:
ufo_df.sample(5)
            date_time  year  month  ...        duration    posted                                               url
80697  11/27/04 18:00  2004     11  ...         3 hours  12/14/04  http://www.nuforc.org/webreports/040/S40941.html
50608  11/25/10 20:20  2010     11  ...       3 seconds    1/5/11  http://www.nuforc.org/webreports/079/S79252.html
39366    7/3/12 22:00  2012      7  ...       half-hour    7/4/12  http://www.nuforc.org/webreports/090/S90360.html
11873   7/23/16 22:10  2016      7  ...       3 minutes    8/2/16  http://www.nuforc.org/webreports/128/S128673.html
78087    1/7/05 18:40  2005      1  ...  1 min. 30 sec.   1/19/05  http://www.nuforc.org/webreports/041/S41473.html

[5 rows x 9 columns]
Notice that pandas has sensible defaults for displaying large datasets. Here, not all nine columns would fit on the screen, so the display is truncated.
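If you do want to see every column, you can loosen those defaults with pd.set_option() (and restore them later with pd.reset_option()):

pd.set_option('display.max_columns', None)   # display all columns, however many
ufo_df.sample(5)                             # now shows all nine columns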
What if you want to look at a specific range (‘slice’) in your DataFrame?
ufo_df.loc[0:4, 'duration':'url']
     duration   posted                                                url
0      1 hour  8/31/18  http://www.nuforc.org/webreports/142/S142925.html
1   5 seconds  8/31/18  http://www.nuforc.org/webreports/142/S142924.html
2  15 minutes  8/31/18  http://www.nuforc.org/webreports/142/S142929.html
3   3 minutes  8/31/18  http://www.nuforc.org/webreports/142/S142928.html
4   5 seconds  8/31/18  http://www.nuforc.org/webreports/142/S142926.html
NOTE: If you look in the help documentation ( help(pd.DataFrame.loc) ) you’ll notice something of an apology: “note that contrary to usual python slices, **both** the start and the stop are included!”
If you prefer to use numbers instead of labels, use .iloc[] instead:
ufo_df.iloc[0:5, 6:9]
Confusingly, .iloc[] slices with the conventional Python semantics, where the start is included and the stop excluded. Also note that, for both slicing methods, square brackets [] are used instead of parentheses ().
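As a sanity check, these two slices select the same five rows and three columns, which is a handy way to remember the difference:

by_label = ufo_df.loc[0:4, 'duration':'url']   # label-based; stop included
by_position = ufo_df.iloc[0:5, 6:9]            # position-based; stop excluded
by_label.equals(by_position)                   # -> True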
The following are typical steps for Exploratory Data Analysis (EDA) of a dataset, once the data are imported.
Size and scope of the data
What are the dimensions of the DataFrame?
ufo_df.shape
(115877, 9)
Around 116,000 rows and 9 columns.
We could just as easily have found the number of rows by using the built-in Python len() function:
len(ufo_df) # 115877
What sort of data is each column?
ufo_df.dtypes
date_time    object
year          int64
month         int64
city         object
state        object
shape        object
duration     object
posted       object
url          object
dtype: object
Confusingly, ‘object’ here refers to strings as well as other possible objects.

How do the numeric data vary (range, median, mean)?
ufo_df.describe()
                year          month
count  115877.000000  115877.000000
mean     2006.127687       6.850945
std        11.481973       3.216381
min      1400.000000       1.000000
25%      2003.000000       4.000000
50%      2009.000000       7.000000
75%      2013.000000       9.000000
max      2018.000000      12.000000
Half of all observations in this dataset are from 2009 or later (through September 2018).

How much data is missing?
ufo_df.isnull().sum()
Note the ‘chaining’ of methods here, with the .sum() operating on the result of the .isnull() operation. This powerful concept will be used throughout the remainder of this tutorial.
date_time       0
year            0
month           0
city          227
state        8438
shape        3686
duration     3896
posted          0
url             0
dtype: int64
None of the date-like entries are missing, nor are any of the URLs.
The duration and shape may not be well defined or known, so the ~4,000 missing values for each are understandable, at around 3% of the observations. The state is not defined for most non-US entries, and there are plenty of those. In this rare case, there is no need for imputation [5] — for now. Great!
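To express those counts as percentages of the dataset instead, you can chain .mean() over the Boolean values rather than .sum() (a small sketch using the same chaining idea):

# The mean of True/False values is the fraction missing; scale it to a percentage
(ufo_df.isnull().mean() * 100).round(1)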
Do we need to filter by year?
The dates prior to 1900 are likely to be based on heavy retrospective speculation or, worse, data-entry errors!
Filtering is simple in pandas. We can easily filter the ‘year’ field by evaluating a logical Series:
ufo_df[ufo_df['year'] < 1900].count() # 24
There is also the .query() method to achieve a similar effect (although this performs better with larger datasets):
ufo_df.query('year<1900').count() # 24
So there are only 24 such observations before 1900, and upon manual inspection, only three of them are mis-entered. This is an unexpected bonus!
For example, one entry was dated to the year 1721, evidently the two-digit year (17, for 2017) fused with the start of the event time (21:30). Otherwise, this is unexpectedly nice data!
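You can eyeball those early rows yourself; here is a sketch selecting just a few columns for inspection:

# Inspect the pre-1900 reports to spot data-entry errors like the 1721 one
ufo_df[ufo_df['year'] < 1900][['date_time', 'year', 'city', 'state']]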
We have officially entered the data cleaning stage (often referred to as ‘data wrangling,’ or ‘data munging’). For realistic datasets, this process generally occupies an inordinate amount of time.
Let’s remove the bad dates for the single entries in years 1617, 1615 and 1721. We can create a tiny DataFrame of them:
bad_dates = ufo_df[ufo_df['year'].isin([1617, 1615, 1721])]
These have indices [115871, 115872, 115875]:
bad_dates.index
Int64Index([115871, 115872, 115875], dtype='int64')
You could .drop() these manually…
ufo_df = ufo_df.drop([115875, 115872, 115871])
…or you could have indexed from the bad_dates DataFrame:
ufo_df.drop(index=bad_dates.index, inplace=True)
Note the .drop() method has a convenient inplace option, so you don’t need to update a copy of the DataFrame.
Whenever you perform data wrangling and cleaning, it is good policy to check you performed the operation correctly.
ufo_df.shape
(115874, 9)
OK, good. Three fewer rows, correct number of columns.
As with most analysis frameworks, dates are a special class unto themselves. This is because everyone has a different way of writing them (for example, a European may write 4 July 2018 as 4/7/2018, while an American might write 7/4/2018), and there are so many things you want to do with them.
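pandas lets you resolve that ambiguity explicitly. For instance, the dayfirst argument tells pd.to_datetime() to read ‘4/7/2018’ the European way:

pd.to_datetime('4/7/2018')                  # American default -> Timestamp('2018-04-07')
pd.to_datetime('4/7/2018', dayfirst=True)   # European reading -> Timestamp('2018-07-04')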
Here, we explicitly force the ‘date_time’ column to be a DateTime object:
ufo_df['date_time'] = pd.to_datetime(ufo_df['date_time'])
Check the output format:
ufo_df.dtypes[0:4]
date_time    datetime64[ns]
year                  int64
month                 int64
city                 object
dtype: object
NOTE: This makes use of the default format specifications, including dayfirst=False and yearfirst=False. This is because the NUFORC database follows the American convention. Don’t always expect this.
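If you would rather not rely on inference at all, the conversion above could instead pass an explicit format string (a sketch, assuming every entry follows the m/d/yy h:mm pattern seen earlier; note that %y still has to guess the century):

# Explicit parse: month/day/two-digit-year hour:minute
ufo_df['date_time'] = pd.to_datetime(ufo_df['date_time'], format='%m/%d/%y %H:%M')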
If we already expect a date format upon import, there is an option to specify a date-time column, so we could have done this right at the import phase, i.e.:
ufo_df = pd.read_csv("national_ufo_reports.csv", parse_dates=[0], keep_date_col=True, infer_datetime_format=True)
However, as mentioned earlier, this particular dataset is tricky! The two-digit year in date_time wraps around, so parsing it directly will not give the correct century in many cases.
As mentioned above, the dataset we use has already parsed the ‘proper’ year and month from the URL itself, because we had anticipated this very issue (see the description on the Spider used to obtain this data from the NUFORC website).
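One possible repair, sketched here, is to overwrite the (possibly wrong) century of each parsed timestamp with the year scraped from the URL; this assumes ‘date_time’ has already been converted to datetimes as above:

# Replace each timestamp's year with the trusted 'year' column parsed from the URL
ufo_df['date_time'] = ufo_df.apply(
    lambda row: row['date_time'].replace(year=row['year']), axis=1)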
The topic of the time-series analysis capabilities of pandas, including resampling and windowing, could be a whole tutorial itself. But we shall leave it at this for the time being.
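As a tiny taste before moving on, here is a one-line sketch counting reports per calendar year (resampling requires the datetimes to be in the index):

# Count reports per calendar year; resample() needs a DatetimeIndex
ufo_df.set_index('date_time').resample('Y').size().tail()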
pandas has support for a wide range of graphical analysis. Its plotting methods are really wrappers around the matplotlib.pyplot library.
Let’s analyze the univariate distribution of observations by month:
ufo_df['month'].hist(bins=12)
Because the plotting functions are wrappers for matplotlib, you can add attributes to the figures accordingly:
import matplotlib.pyplot as plt

plt.xlabel("Month of report")
plt.ylabel("Number of reports")
plt.title("Distribution of UFO reports by month")
You can save the plot like any matplotlib figure:
plt.savefig("UFO_observations_by_month.png")
Or you could examine it interactively with:
plt.show()
Note that the figure object is ‘used up’ by saving or displaying it.
It appears that summer (in the Northern Hemisphere) is the time to spot UFOs. It makes sense!
Hopefully you have seen here some of the great power offered by the pandas framework. Importing, exploring, cleaning and wrangling data becomes a simple matter. Handling time series and missing data, and visualizing the results, can often be done in a single line!
But what about the more sophisticated tools, such as grouping, aggregating and joining data, advanced string methods, and other powerful functions? Are they just as simple to carry out?
For that—you’ll have to tune in for the next installment, where we visit the Mother Ship!
— TRANSMISSION TERMINATED —
1. In fact, you can find three articles using pandas, with different applications, on this very site: [1], [2] & [3].
2. Often referred to as a ‘spider.’ Because it, y’know, crawls the web.
3. The details of which would comprise its own article, but you may download the code from here.
4. No, this isn’t a typo of ‘passed.’ It is a technical term for ‘interpreting some output.’
5. A fancy, ten-dollar word for ‘figuring out how to handle missing data.’ It has the same Latin roots as ‘input,’ and is similar to those of ‘amputation’ (roughly, ‘to clean by cutting back’).
Accelebrate offers Python training onsite and online.
Written by Ra Inta