While there are multiple languages that data scientists can use, Python has distinct advantages for data science. Each language has its own history, purpose, strengths, and style, and the choice of language should be informed by the type of work being done now and in the future. In this post, I make the (somewhat opinionated) case for why Python is the best choice as the primary language for a data science coding shop.
In a word, Python gives data scientists leverage. I detail the features of the language that convey this benefit later in this post, but the well-known XKCD comic about Python makes the point well and is only a slight exaggeration.
Python is used for data science because it gives unmatched leverage to collect data, process data, and deliver insights from data. Furthermore, code written in the "Pythonic" style is clear, readable, and easily shared, which encourages the iteration and exploration that lead to insights.
There are numerous languages that can be used to perform data science and advanced analytics, so what makes Python so well suited for the data scientist? To answer this, think about what a data scientist is. Data science is the general practice of taking data and using advanced analytics to derive insight. The generality comes from the varied data sources and knowledge domains in which we operate. As the saying goes: a data scientist is part developer, part statistician, and part data engineer.
This generality, and the technical demands that come with it, mean the data scientist must be prepared to perform every step of the analytics process: to move from data to insight, often in unfamiliar domains. The data scientist needs a tool that does all pipeline steps well. Tools like SPSS, SAS, and R are highly refined for statistics, but they suffer from deficiencies at other stages of analysis. Python does everything well; it is not refined for a single task, but it can do all tasks. The only constant in data science is that we are tasked with every step of analysis across all domains, and Python makes this possible.
Python is lean and powerful; it gets you right to the analysis stage. Languages like C and Java demand more explicit control from the programmer: in C you must manage memory allocation yourself, and in Java and Scala the programmer needs to control object types carefully. In Python, many of these details are handled in the background; the language abstracts the user away from the gory details of the underlying computational backend. This level of abstraction does come with risks: memory allocation may be less than optimal, or an object may end up with the wrong type. But these errors can be fixed when it is essential, and for the data scientist, who is mainly focused on analyzing data, this is a good trade-off.
Python lets the data scientist write lean and readable code that directly connects data to insight.
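As a minimal sketch of what that abstraction looks like in practice (the function and values here are purely illustrative): the same function runs on integers, floats, and strings without a single type declaration, and a genuine type mismatch surfaces as a runtime error that can be fixed when it matters.

def double(x):
    # No type declarations: Python dispatches on the object's runtime type
    return x + x

print(double(21))      # 42
print(double(2.5))     # 5.0
print(double('ab'))    # 'abab' -- concatenation, not arithmetic
# double(None)         # would raise a TypeError at runtime, not at compile time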
One of the most apparent advantages of Python is that it is a developer language. This is not to say it is the best developer language, but any functionality a developer normally expects is available and robust. Python can handle most data types and database connections. The standard library even includes sqlite3, an interface to SQLite, so a lightweight database can be deployed as part of your program without a separate database server. Most API services have a Python client; if there is no client, you can use the requests library to call the API endpoints directly.
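Here is a minimal sketch of both points; the table and the endpoint URL are placeholders, not taken from any real service.

import sqlite3
import requests

# SQLite ships in the standard library: no separate database server required
conn = sqlite3.connect(':memory:')  # in-memory database for illustration
conn.execute('CREATE TABLE scores (name TEXT, value REAL)')
conn.execute("INSERT INTO scores VALUES ('a', 1.5)")
print(conn.execute('SELECT * FROM scores').fetchall())

# With no dedicated client, requests can call an endpoint directly
# response = requests.get('https://api.example.com/data')  # placeholder URL
# payload = response.json()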
In Python, dependency issues are well handled by package managers like conda and pip. In addition, virtual environments and conda environments enable the user to maintain multiple versions of Python, each with its own suite of dependencies. Setting up multiple versions of R is much clumsier and more error-prone.
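For example, a disposable conda environment might look like this in a terminal (the environment name and packages are illustrative):

$ conda create -n demo-env python=3.10 pandas
$ conda activate demo-env
$ conda deactivate
$ conda env remove -n demo-env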
Python has well-supported packages for delivering analytical endpoints and results to stakeholders and customers. Since data science is about communicating insights derived from data, these packages are major levers. Interactive graphics and dashboards can be made with HoloViews (Panel) and Plotly (Dash), and web apps can be built on the Flask and Django frameworks.
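To give a sense of how little code a deliverable endpoint requires, here is a minimal Flask sketch; the route name and payload are invented for illustration.

from flask import Flask, jsonify

app = Flask(__name__)

@app.route('/insight')  # hypothetical endpoint serving an analysis result
def insight():
    return jsonify({'metric': 0.87, 'n_samples': 1200})

if __name__ == '__main__':
    app.run(port=5000)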
Other aspects make Python usable for developer workflows: packages like PySpark and Dask enable scalable analytics and cluster computing. The broader point, however, is that Python has momentum in developer communities; if a developer-facing process or package lacks a Python implementation, that is such a disadvantage that one will be produced in short order.
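A minimal Dask sketch, assuming a set of CSV files too large for memory (the file pattern and column names are hypothetical):

import dask.dataframe as dd

# Lazily treat many CSVs as one logical dataframe; nothing loads yet
df = dd.read_csv('logs-2022-*.csv')

# The groupby is planned across partitions and only executes on .compute()
daily_mean = df.groupby('day')['latency'].mean().compute()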
Python facilitates storytelling with an increasingly diverse set of data visualization tools. We can use matplotlib for fine-grained control of plots. matplotlib.rcParams ("rc" stands for runtime configuration) is a dictionary that allows us to rewrite the standard plot aesthetics, which is particularly useful for specifying report- or company-themed styles. Try this code to see the variety of plotting preferences that you can customize:
import matplotlib
matplotlib.rcParams
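For example, overriding a handful of entries is enough to impose a house style on every subsequent plot; the values below are an invented "company theme", not a recommendation.

import matplotlib

# Hypothetical company theme, applied to all plots created after this point
matplotlib.rcParams['figure.figsize'] = (8, 5)
matplotlib.rcParams['axes.titlesize'] = 14
matplotlib.rcParams['axes.grid'] = True
matplotlib.rcParams['font.family'] = 'serif'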
A collection of more stylized and powerful data visualization packages builds on the matplotlib scaffolding. For easy and aesthetically pleasing plots there are Seaborn and plotnine (a Python port of R's ggplot2); these packages abstract away from the detailed control afforded by matplotlib and give great results in a few lines of code. For interactive web-based visualizations, we have Bokeh and Plotly. Plotly goes beyond web-based visuals with Dash, which allows the user to make public or private dashboards; with a license, dashboards can be seamlessly deployed on Plotly's servers, further reducing the overhead associated with web app deployment and security.
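As a sketch of the "few lines of code" claim, Seaborn can produce a styled plot almost immediately from one of the demo datasets it loads by name:

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset('tips')  # small demo dataset fetched by name
sns.scatterplot(data=tips, x='total_bill', y='tip', hue='day')
plt.show()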
There are many other data visualization packages, but the HoloViews family of tools is worth discussing. HoloViews is part of the HoloViz family of tools developed at Anaconda. It includes Datashader, which renders huge datasets as density shading instead of individual points; this specifically addresses big data and can be used to make some stunning visuals. The same family introduced Panel, a dashboarding framework. The data visualization capacity of Python is formidable and developing rapidly.
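A minimal Datashader sketch, using a synthetic point cloud in place of a real big dataset (all names here are illustrative):

import numpy as np
import pandas as pd
import datashader as ds
import datashader.transfer_functions as tf

# A synthetic "big" dataset: one million random points
df = pd.DataFrame({'x': np.random.randn(1_000_000),
                   'y': np.random.randn(1_000_000)})

canvas = ds.Canvas(plot_width=400, plot_height=400)
agg = canvas.points(df, 'x', 'y')  # aggregate points onto a raster grid
img = tf.shade(agg)                # shade by density instead of overplotting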
Because it is easy to talk (or write) up a case for anything, the careful reader might demand more: what is the proof that Python gives you the leverage to do important data science with minimal fuss?
Let me show you...
In this article, I have made the point that Python just works: dream up a data science application, and with a few lines of clear, readable code we can create an analysis that moves your team forward.
Let's demonstrate how easily we can go from a simple question to a data-backed analysis. The first question that comes to mind: how popular is Python relative to other data science tools, and is that popularity changing year over year?
To measure this, we can look at a coding forum and count how many questions are asked about each language, and what percentage of those questions are answered. This piece of data science intelligence can be gathered easily and quickly with a few lines of Python. We might use it to validate already published insights, but we can also change the code interactively to look at the question from new angles as we think of them.
Jupyter notebooks allow us to annotate the code in Markdown, so a notebook can read much more like a paper than like raw code. R has a similar format in R Markdown (.Rmd) files, but those must be knit, and knitting requires that all of the code runs successfully or the knit fails. This leads to time lost troubleshooting and slower iteration.
Conda environments make installing entirely new packages quick and easy. They make spinning up new projects very dynamic: they avoid costly dependency incompatibilities and can simply be removed when we are done with them.
Let's start with the Stack Overflow API, accessed through the StackAPI package:
https://stackapi.readthedocs.io/en/latest/
You will need Anaconda and pip installed for this. Any code preceded by a '$' is run in the terminal; the rest of the code is executed in Jupyter notebook cells.
$ conda create -n codecom_api jupyter pandas ipykernel
$ conda activate codecom_api
$ pip install stackapi
$ python -m ipykernel install --user --name=codecom_api
$ jupyter notebook
from stackapi import StackAPI
from datetime import datetime
import pandas as pd
import seaborn as sns
SITE = StackAPI('stackoverflow')
SITE.max_pages = 10

# Get the data from the Stack Overflow API: one day's worth of
# questions per language per year
rows = []
for year in range(2012, 2023):
    for language in ['C', 'python', 'java', 'R']:
        post = SITE.fetch('questions',
                          fromdate=datetime(year, 1, 1),
                          todate=datetime(year, 1, 2),
                          tagged=language)
        # Collect one row per question
        for item in post['items']:
            rows.append({'year': year,
                         'language': language,
                         'answered': item['is_answered'],
                         'views': item['view_count'],
                         'answers': item['answer_count']})

results = pd.DataFrame(rows)
# Summarize by language and year
summ = results.groupby(['language', 'year']).agg(
    {'answered': 'mean',
     'views': ['min', 'max', 'mean', 'size'],
     'answers': ['min', 'max', 'mean']})

# Plot the number of questions asked on the sampled day each year
plot_df = summ[('views', 'size')].rename('questions').reset_index()
p = sns.lineplot(x='year', y='questions', style='language', data=plot_df)
p.set(xlabel='year sampled', ylabel='questions asked / day',
      title='Stack Overflow questions by language')
In summary, we had a question that could have very important implications for a business: should we transition our data science to Python? Then we answered that question with a few lines of Python code. I want to pause to emphasize that this goes to the heart of the power of data science: we moved from a speculative discussion point to actionable primary data analysis in a single turn of very lean code. Now that we have identified a data source and rendered it into a visualization, we can easily add to it, generating more nuanced analysis. Questions lead to answers and better questions; Python gets the tooling out of the way.
Written by Gunnar Kleemann
Dr. Gunnar Kleemann runs a small, friendly data science shop, Austin Capital Data. Gunnar has over 25 years of experience teaching a broad array of STEM fields, acting as a teacher and advisor to students in a number of contexts at institutions including the Princeton University Genomics Institute, Barnard College, the Albert Einstein College of Medicine, the University of Nebraska-Lincoln, K2, Data Society, the Princeton Review, and of course Accelebrate. He has been a lecturer in UC Berkeley's Master of Information and Data Science (MIDS) program since 2016.
Gunnar is primarily interested in making the benefits of data science more broadly accessible, since he believes that data science skills will be a core differentiator in the future. To this end, he regularly presents his results at international conferences, most recently at All Things Open 2021. Gunnar has published research on physiological and behavioral genomics in prominent international journals, including Cell, Genetics, and the Journal of Neuroscience.