Why every Data Scientist should know SQL
Still waiting…it’s been over an hour and still nothing. I watch the clock, get some tea, ruminate on the structure of dark matter….
I’m trying to work with course enrollment data in a relatively large database and format it in a nice dashboard, but processing this data takes far too long. Perhaps dark matter is to blame.
Let me back up.
Last year, I was tinkering with a Jupyter notebook to summarize course enrollment and completion stats for some of our database courses.
In fact, I started with a notebook that someone had originally created for another set of courses involving the same database. Why reinvent the wheel when a perfectly good notebook that does something similar to what you want already exists? After all, data science is a team sport.
I had made relatively minor updates to the notebook – just switched the course numbers that I wanted summaries for and clicked on Run All to execute all cells in the notebook.
I hadn’t really looked carefully at the code in the notebook before running it. But once the summarized results failed to materialize after a couple of hours, I knew I couldn’t blame things on dark matter anymore and would need to get my hands dirty with code.
So I grabbed another warm beverage and got ready to dig into the code in the notebook. But it only took scrolling to the cells that performed the database queries to recognize the problem.
SELECT * FROM ENROLLMENTS. I read it and then read it again. Aloud, the second time. It was like a eureka moment.
I was pleased that I was able to debug the problem so quickly, but was not too happy with the prospect of having to spend some time hacking the notebook to make it run faster. A lot faster.
I have over 25 years of experience working with databases so I knew fixing the database queries would be relatively quick. But much of the data analysis logic in the notebook involved Pandas dataframes.
I had only recently picked up some data science skills, and most of my data science capability involved the R programming language. All the data scientists I had been talking to recently were using Python. So this was also a good opportunity for me to pick up some skills in Python and data analysis with Python.
But let me not digress further and get back to the problem with SELECT * FROM ENROLLMENTS.
Imagine you want to buy one item from an online retailer. Would you order all the millions of items in the retailer’s warehouse to get just the one you want and then discard or return the rest of the items? Can you imagine how long it would take to have the entire inventory shipped to you? Even if all of the contents managed to reach you somehow, would you even have enough capacity and resources in your house to receive and process the entire inventory?
But apparently that is exactly what a lot of data scientists actually do. They “order” all of the items in the data warehouse and then use tools like Pandas dataframes to sift through the data they need and discard the rest.
And that is exactly what the SQL query: SELECT * FROM ENROLLMENTS in my example above does. The database I was accessing had millions of rows for course enrollment and completion data, and getting all the data into a notebook would take considerable time. And with constrained resources on my laptop, processing those millions of rows with Pandas dataframes would take even longer.
Shortly after this issue, I met with a Database Administrator (DBA) at one of the big banks. Their CEO was sold on the idea that data science could help transform the company, and data science teams had been cropping up all over the company in recent months. That, the DBA told me, was when his job had started to become “hell”.
DBAs run a tight ship. They tune the system and queries to the umpteenth degree so the database can hum along fine responding to predictable queries efficiently.
And then comes along a hotshot data scientist and runs a huge query like “SELECT * FROM ENROLLMENTS” against an operational database. The database slows to a crawl, and the company’s clients on the website start seeing database errors and timeouts. And the DBA responsible for the database gets called to the boss’s office.
I may have exaggerated a bit and fictionalized parts of the narrative, but unfortunately this sort of thing is quite common. Data scientists are not entirely to blame, though. Data Science itself has been evolving.
Data Science traditionally has been done on very small data sets. As a matter of fact, over 80% of data science work is done on a laptop according to one of the consulting firms.
Small data sets are easy and fast to manipulate in memory, and Pandas is great for that. Data Scientists traditionally worked with CSV files (text files with comma-separated values) and did not have a connection to a database. A DBA would do a one-time database dump into a CSV file and that was it.
We are now in the age of Big Data, and working with CSV files is simply not practical. Repeatedly generating CSV extracts with more up-to-date data is even less practical. This means that Data Scientists need to learn to work with big data repositories like relational Data Warehouses, Hadoop, Spark, Cloud Object Storage, etc.
The language of relational databases is SQL. And because of SQL’s ease of use, it is increasingly being adopted by other big data repositories.
In the case of my query – “SELECT * FROM ENROLLMENTS” – all I had to do was add a WHERE clause to filter the results for just the courses I was interested in, so the result set would include only a small subset of the millions of rows in the table.
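To make the difference concrete, here is a minimal sketch using Python's built-in sqlite3 module and an in-memory table. The column names (learner_id, course_id, completed), the course codes, and the synthetic rows are all assumptions for illustration; the real ENROLLMENTS table is not shown in this post and was far larger.

```python
import sqlite3

# Hypothetical stand-in for the real table: schema, course codes and rows
# are invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ENROLLMENTS (learner_id INTEGER, course_id TEXT, completed INTEGER)"
)
courses = ["DB0101", "DB0201", "PY0101", "ML0101", "DS0101"]
rows = [(i, courses[i % len(courses)], int(i % 3 == 0)) for i in range(10_000)]
conn.executemany("INSERT INTO ENROLLMENTS VALUES (?, ?, ?)", rows)

# The slow pattern: ship every row to the notebook, then filter there.
all_rows = conn.execute("SELECT * FROM ENROLLMENTS").fetchall()

# The fix: let the database filter, so only the rows of interest are returned.
# The "?" placeholders are DB-API parameters, which avoid string concatenation.
wanted = ("DB0101", "DB0201")
filtered = conn.execute(
    "SELECT learner_id, course_id, completed "
    "FROM ENROLLMENTS WHERE course_id IN (?, ?)",
    wanted,
).fetchall()

print(len(all_rows), len(filtered))
```

Even in this toy setup the filtered query returns well under half of the rows. Against a table with millions of rows and only a handful of courses of interest, the gap is far wider.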
So that is one reason I feel the knowledge of SQL is essential for today’s Data Scientists. Perhaps modern data scientists only need to learn a subset of SQL. They don’t need to learn transaction processing but things like simple filtering and aggregation are a must.
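Aggregation can be pushed to the database in the same way. The sketch below, again using Python's built-in sqlite3 module with an invented schema and a few made-up rows, asks the database for one summary row per course instead of pulling individual enrollments into a dataframe and counting them there.

```python
import sqlite3

# Hypothetical schema and sample rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ENROLLMENTS (learner_id INTEGER, course_id TEXT, completed INTEGER)"
)
sample = [
    (1, "DB0101", 1), (2, "DB0101", 0), (3, "DB0101", 1),
    (4, "DB0201", 0), (5, "DB0201", 1),
    (6, "PY0101", 1),
]
conn.executemany("INSERT INTO ENROLLMENTS VALUES (?, ?, ?)", sample)

# Filter and aggregate inside the database: one row per course comes back,
# no matter how many enrollments sit behind it.
summary = conn.execute(
    """
    SELECT course_id,
           COUNT(*)       AS enrolled,
           SUM(completed) AS completions
    FROM ENROLLMENTS
    WHERE course_id IN ('DB0101', 'DB0201')
    GROUP BY course_id
    ORDER BY course_id
    """
).fetchall()

for course_id, enrolled, completions in summary:
    print(course_id, enrolled, completions)
```

The result set here is two short rows, which is all a summary table or chart actually needs.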
The impact of adding filtering to my SQL query in the Jupyter notebook was dramatic. The results were rendered in a couple of minutes instead of a couple of hours. And I don’t consider myself to be a genius.
And if I could tweak SQL in my data science experiment by so little and have such a huge impact on performance, I could surely help other Data Scientists (and some of those DBAs who are frustrated with newly minted data science yahoos like myself) work more efficiently with databases and SQL.
So shortly after these episodes, working with my colleagues Hima Vasudevan and Raul Chong, we launched the course Databases and SQL for Data Science on Coursera. It is an online self-study course that you can complete at your own pace.
This course introduces relational database concepts and helps you learn and apply knowledge of the SQL language. It also shows you how to perform SQL access in a data science environment like Jupyter notebooks.
The course requires no prior knowledge of databases, SQL, Python, or programming. It has four modules and each requires 2 – 4 hours of effort to complete. Topics covered include:
– Introduction to Databases
– How to Create a Database Instance on Cloud
– CREATE Table Statement
– SELECT Statement
– INSERT Statement
– UPDATE and DELETE Statements
– Optional: Relational Model Concepts
– Using String Patterns, Ranges
– Sorting Result Sets
– Grouping Result Sets
– Built-in Functions, Dates, Timestamps
– Sub-Queries and Nested Selects
– Working with Multiple Tables
– Optional: Relational Model Constraints
– How to access databases using Python
– Writing code by Using DB-API
– Connecting to a Database by Using ibm_db API
– Creating Tables, Loading Data, and Querying Data from Jupyter Notebooks
– Analyzing Data with SQL and Python
– Optional: INNER JOIN, LEFT, RIGHT OUTER JOIN
– Working with Real-world Data Sets
– Assignment: Analyzing Chicago Data Sets using SQL and Python
The emphasis in this course is hands-on and practical learning. As such, you will work with real databases, real data science tools, and real-world datasets. You will create a database instance in the cloud. Through a series of hands-on labs, you will practice building and running SQL queries using cloud based tools. You will also learn how to access databases from Jupyter notebooks by using SQL and Python.
Anyone can audit this course at no charge. If you want a certificate and access to graded components of the course, there is currently a limited-time price of $39 USD. And if you are looking for a Professional Certificate in Data Science, this course is one of the 9 courses in the IBM Data Science Professional Certificate.
So if you are interested in learning SQL for Data Science, you can enroll now and audit for free.
NOTE: Portions of this post have been updated from the original version. In the process of publishing this blog post, I may have inadvertently hurt the emotions of a few Data Scientists and perhaps some DBAs, but certainly none were physically harmed. But seriously, it is not my intent to offend or stereotype any Data Scientist or DBA. So my sincere apologies to anyone who may have taken offence. The intent of this blog post is to highlight a real problem in data science, one that can be easily rectified with some knowledge of SQL, and I would be a lousy marketeer if I didn’t include a shameless plug for the IBM Data Science Professional Certificate on Coursera.