This article provides an overview of (possible) steps to perform an exploratory data analysis (EDA) on a data set. These instructions are largely based on my own experience and may be incomplete or biased, I just figured this may be helpful since there’s not a lot of content out there.
The goal of EDA
The goal of exploratory data analysis is to get an idea of what a new data set we’re working with looks like. We’re mainly interested in aspects such as the size and “shape” of the data set, date ranges, update frequency, distribution of values in value sets, distribution of data over time, missing values and sparsely populated fields, the meaning of flags, as well as connections between multiple tables. This goes along with notes and documentation about the findings with the purpose of having a permanent reference for this particular dataset. Once EDA is complete, we should be able to add an integration for the data and create analyses more easily than if we started from scratch.
Tools for EDA
EDA can be done in different ways depending on which toolkit you’re most comfortable with. Generally, you can start out simply writing SQL in your SQL workbench (e.g. DataGrip) or start immediately with a notebook (e.g. Jupyter or Hex). At the time of writing this (February 2022), my data team mostly uses Hex notebooks for EDA, as they allow sharing and commenting on analyses fairly easily.
Some basic principles
- In an ideal world, we wouldn’t have to do EDA, but be working with well documented data and an accurate and up-to-date entity relationship diagram. Ideally, we’d also have this data “profiled”, i.e. have some form of documentation with some basic statistics. This isn’t the case very often for production databases, which is why we do in-depth EDA. However, if we can find documentation or code, we should use it during EDA to inform our assumptions and insights.
- When looking to integrate source data, make sure to query the actual source data table, rather than any modification (view, subset, extract) of it – if possible. This is to make sure we’re not looking at data that might already have some issues introduced through our code.
- Keep a “running monolog” in the notebook or SQL script. Yes, the code is usually self-explanatory, but it’s good to document your thought process so that other people can follow along more easily. Examples:
- “Here I’m just looking at the row count.”
- “Let’s join set_item on the set table to see if the IDs match. I see that there are no missing joins, so this seems to be in sync.”
- Keep in mind that you’re only looking at a snapshot of the data at this point in time, so all assumptions you’re making may not be true forever, unless they’re explicitly documented and asserted in code (see 1.).
- If you’re working with a large data set that’s very slow to query, pick a “reasonable” subset, e.g. restricted to a specific time frame.
- I usually just look at numbers, but occasionally having some lightweight data visualization can be helpful to see trends. The resources I’ve listed below have some more content on using data visualization for EDA.
There is no exact playbook for EDA since the steps depend on the type of data you’re looking at. Here are some high-level steps to follow:
- Print some sample data of the table (first few rows, or use a sample function if available), just to look at what kind of data each column contains.
- If using Pandas, transposing a dataframe (
df.T) can be helpful for reading through wide tables.
- If using Pandas, transposing a dataframe (
- Print the data types for each column. Keep in mind that the database datatype and logical datatype might be different, e.g. an integer field may be used to represent a boolean value with 0/1.
- Identify the primary key and potential foreign key columns, and relevant timestamp fields (e.g. created date, last updated).
- Get some basic numbers for relevant fields:
- Table row count
- Unique count for the “primary key” field: Does it match the row count, i.e. is it really a unique primary key? If not, is there a set of fields that can uniquely identify a record, e.g. ID field + timestamp.
- Min/max for numeric and date fields: What ranges are we looking at?
- Note if there are values that look like dummy values, e.g. “1900-01-01” for dates, or dates in the future.
- Note if there are values that look like outliers based on the column name, e.g. a “2088” in a field named “customer_age”.
- Group by and counts for value set columns, i.e. categorical variables such as boolean fields or values from a fixed set, e.g. “service category”
- Pay attention to NULL values – how sparsely populated is this column? Can we expect NULL values at all? Or do we have another “dummy” value that represents NULL, e.g. “Unknown”?
- For boolean fields, do we have only true/false values, or do we also have true/false/NULL and if yes, what does NULL represent?
- It’s fairly critical to find out which fields are free-text and which ones have controlled input. The database datatype in both cases will be text, but the logical type is either controlled input (categorical) or free-text.
- Look at the distribution of record counts over time: Identify the relevant date field (if exists) and count the number of records for a reasonable time period, e.g. by month or year.
- This gives us an idea of the volume of data to expect over time.
- It also helps see at what point the data starts to be “complete”, better than just looking at the earliest date since we might just have a handful of records for specific dates.
- If possible, use a library like Pandas Profiling to get a “profile” of the data
- This captures most of the basic stats listed under “Get some basic numbers” as well as more detailed histograms, correlations, etc.
- It can get a little unwieldy for large table, so might make sense to only focus on a subset of relevant fields.
- If working with multiple tables, try and draw out a simplified high-level ERD (entity relationship diagram) to get an idea of how the tables join together and whether we have referential integrity.
- Run the joins and confirm whether join fields always match or whether there are some “empty joins”. E.g. ask questions such as “Does every service_level_id in the submission table have a corresponding record in the service_level lookup table?”
I haven’t found too many posts that I found particularly helpful (maybe all the good content is tucked away in books?). Here are a few links to other sites that may be useful and complement this post:
Header image by Miguel Tejada-Flores via Flickr (CC BY-NC 2.0)