Analytics From Day 1: How to Ensure Smooth Data Integrations with New Production Systems

Bird’s-eye view shot of roads

This is a pre-print of a post that will be published on the collectors.com tech blog.

We recently went through one of our most epic product launches at Collectors: Customers are now able to submit cards for grading to PSA, add them to their collection on the new collectors.com site, and choose to have their valuable cards stored in an actual physical vault. Oh, and all of this will be accessible via a single login, their “Collectors ID”, which replaces the multiple logins customers previously had to manage across our different products and business units like PSA and PCGS. While all these new features are integrated to provide a seamless experience to our customers, we’re dealing with a significantly more complex architecture of multiple new systems, databases, and APIs on the backend. Naturally, as with any new product launch, our product managers were keen to get analytics about the use of these features from day 1: How many users actually converted to the new “Collectors ID”? How many items have customers submitted to the Vault? Who is using the new collectors.com “My Collection” feature? 

In order to provide these kinds of data insights right from the go-live, we had to coordinate with several engineering teams to get our hands on the right data and integrate it into our data warehouse. In total, we ended up pulling data from systems sitting on top of four different production databases that were being launched at the same time. Considering how many different systems and databases we were working across, the integrations went pretty smoothly! Within a day of the go-live, I had produced a few dashboards with key metrics that the product and business stakeholders started using immediately to track uptake of the new services. The road to getting there wasn’t straightforward though and involved some amount of scrambling, knocking on different doors, and a few small surprises during the go-live. In this post, I will share some of my lessons learned from integrating with a new production system when you’re looking to provide analytics from the get-go.

Logistics

I’m big on keeping running docs with notes from my conversations and findings when working on a project – I always say I outsource my brain into a Google doc. Keep a doc with (datestamped) notes covering every piece of information you find, open questions and “to do” items, as well as a list of who’s responsible for what on the product, e.g. product managers, engineering leads, project managers, etc.

In addition to the running notes, connect with the business stakeholders (product managers, analysts…) early on to document a set of desired metrics along with their priorities and timelines: What do we need to know from day 1? What can wait until some time after the launch? This will also be helpful when exploring the new data models to determine what is actually being captured and what data points may not be available to calculate the required metrics.

If there are standing meetings for the engineering team that’s responsible for the database setup, I strongly recommend regularly sitting in on those meetings. Even if you don’t always understand everything that’s going on, it’s helpful to have the context of what the team is focusing on and to establish a relationship with them. As data engineers, we’re often pretty removed from our counterparts on the data producer side, but knowing the people on the team (and having them know you) can be helpful in working together more effectively.

Infrastructure

Once you know who your engineering point of contact is, the first question you’ll want to ask is: How do we get access to the data? Assuming we’re talking about data that lives in a relational database, here’s a short checklist of information you need to get from the engineering team that’s responsible for the database setup:

  • Find out (and document) what cloud service the database is hosted in
  • Will you get access to a production database or a read-replica? And what permissions will you get: read-only, or will you be able to create temp tables or views if they’re needed by any of the tools in your pipeline?
  • Will there be dev and prod environments? What’s the timing for these being available?
  • How do users and services authenticate against the database? Do we need personal and/or service accounts to log in?
  • How will the logins be shared? Will you need access to a shared password storage?
  • Do you need an SSH tunnel set up to access the database from any of the tools in your data stack?

It’s best to get all these details ironed out as early as possible, since tasks like setting up SSH tunnels in particular can take some time. Make sure you can actually connect to the database well before the go-live to avoid surprises later on, even if there is no meaningful data in it yet.
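
If an SSH tunnel is part of the setup, it’s worth verifying connectivity with a minimal script as soon as credentials are available. Here’s a rough sketch using the sshtunnel and psycopg2 packages – the bastion and database host names, key path, and database and account names are placeholders, and I’m assuming a Postgres read-replica; adjust for whatever your engineering team actually provisions.

```python
# Minimal connectivity check through an SSH tunnel -- a sketch assuming a
# Postgres read-replica; host names, account names, and key path are placeholders.
import os

import psycopg2
from sshtunnel import SSHTunnelForwarder

with SSHTunnelForwarder(
    ("bastion.example.com", 22),                       # hypothetical bastion host
    ssh_username="data_eng",
    ssh_pkey="~/.ssh/id_rsa",
    remote_bind_address=("replica.internal.example.com", 5432),
) as tunnel:
    conn = psycopg2.connect(
        host="127.0.0.1",
        port=tunnel.local_bind_port,                   # local end of the tunnel
        dbname="vault",                                # placeholder database name
        user="analytics_ro",                           # read-only service account
        password=os.environ["ANALYTICS_DB_PASSWORD"],  # from your password store
    )
    with conn.cursor() as cur:
        cur.execute("SELECT current_database(), version();")
        print(cur.fetchone())
    conn.close()
```

Even if the tables are still empty, a successful query like this proves that networking, authentication, and permissions are all in place.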

Data model

Now that we’ve covered physical access to the data, let’s take a look at things to consider when you’re working with a new data model. I got looped into the production database design process early on and was able to provide input on the data modeling (see also: establishing a good connection with the upstream engineering teams! They’re your friends!). This ensured that the data would be suitable for our data extraction tool (Stitch) and would contain all the relevant data points. Again, assuming you’re working with a relational database, here are some questions you’ll want to cover when talking about the data model:

  • Where is the data model documentation and how is it being kept up to date?
  • For any fields containing value sets, such as status codes, where are the corresponding descriptions stored? Will there be lookup tables in the database, or will these only be stored in code? The latter means you will need to be able to access the up-to-date list of lookups through your infrastructure, e.g. by querying an API (or simply reading the API documentation).
  • Will there be JSON columns? What is the schema for those?
  • What are the constraints on each table and column, e.g. foreign key relationships, NULL values, default values?
  • For datetime fields, will they be stored with a timezone (they should be)? The sketch after this list includes a quick check for this.
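
Some of these questions can be answered directly from the database once you have access. Here’s a sketch of what that could look like, assuming Postgres (anything that exposes information_schema works similarly), a hypothetical submission table, and placeholder credentials – it pulls column types, nullability, and defaults, and flags datetime columns stored without a timezone.

```python
# Inspect column types, nullability, and defaults for a hypothetical
# "submission" table -- a sketch assuming Postgres and placeholder credentials.
import pandas as pd
import sqlalchemy

# Placeholder connection string -- swap in your actual host and credentials.
engine = sqlalchemy.create_engine(
    "postgresql://analytics_ro:<password>@localhost:5432/vault"
)

columns = pd.read_sql(
    """
    SELECT column_name, data_type, is_nullable, column_default
    FROM information_schema.columns
    WHERE table_name = 'submission'
    ORDER BY ordinal_position
    """,
    engine,
)
print(columns)

# Datetime columns stored without a timezone -- worth raising with the
# engineering team before launch rather than after.
naive_timestamps = columns[columns["data_type"] == "timestamp without time zone"]
print(naive_timestamps["column_name"].tolist())
```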

Application and data flow

Perhaps most importantly, when trying to make sense of data coming from a production database, we need to understand what the flow of the application is: What workflows (user-created or automated) in the application modify the data in what way? This is absolutely crucial to handling the data correctly and drawing the right conclusions from it. For example:

  • How and when is a record created, and what fields are populated through what input?
  • What workflows cause records to be modified in what way? And what metadata is there to track modifications, e.g. a “last updated” timestamp?
  • Will update timestamps for events such as status changes be tracked in separate fields? Or will there be some kind of changelog table that captures these kinds of changes? This also trickles down into your data warehouse models, where you might need to start tracking status change dates right from the get-go.
  • How are deletions being handled? Will there be “hard deletes”, i.e. the record is simply removed, or “soft deletes”, i.e. the record has an “is deleted” flag or a “deleted timestamp” field? And, along the same lines, is there a data retention policy that means data will be dropped or archived after a certain amount of time?
  • If the application is replacing a legacy application, will data be migrated? How do you recognize migrated data? Will there be any gaps or differences between migrated and newly created data?
  • Will there be realistic dummy data (i.e. data that adheres to the constraints and workflows described above) to develop our data models and metrics against?
  • Is there any chance of any test or dummy data getting into the production system? And if yes, how can we recognize and filter it out? The sketch after this list shows one way to handle this together with soft deletes.
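
Once you have answers to the deletion and test-data questions, it pays to encode them at the very first layer of your data models. The snippet below is a minimal sketch of what that could look like – the submission table and the deleted_at and is_test columns are hypothetical stand-ins for whatever your production schema actually uses.

```python
# Sketch of a first-layer extraction that respects soft deletes and filters out
# test records -- table and column names are hypothetical placeholders.
import pandas as pd
import sqlalchemy

# Placeholder connection string -- swap in your actual host and credentials.
engine = sqlalchemy.create_engine(
    "postgresql://analytics_ro:<password>@localhost:5432/vault"
)

live_submissions = pd.read_sql(
    """
    SELECT id, customer_id, status, created_at, updated_at
    FROM submission
    WHERE deleted_at IS NULL   -- soft deletes: keep only live records
      AND NOT is_test          -- exclude known test/dummy submissions
    """,
    engine,
)
print(len(live_submissions))
```

If the table uses hard deletes instead, there’s nothing to filter on – which is exactly why it’s worth getting that answer before the go-live rather than after.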

Ideally, your engineering and database admin teams will already have a “best practice” guide for designing new databases, which usually answers a lot of these questions. Otherwise, this might be a good time to start collecting these kinds of design decisions into a guide and encoding them in setup scripts where possible.

And finally…

I hope that this post has provided you with a starting point for a checklist for your next production data integration. All the questions I’ve covered in the above paragraphs should be treated as conversation prompts to elicit existing design decisions, or to help influence decisions that are yet to be made. There will likely be some oversights (I have yet to work with *the* perfect production database), but coming prepared with a plan may help you catch some of the biggest blockers to a good data integration early on. And even with the best preparation, you can probably expect to make some tweaks after the application go-live to adjust to last-minute database changes or correct some assumptions you’ve made about the data. Developing against an empty data model or even dummy data can be challenging, and you might not nail everything on the first try.

One last thing to keep in mind: As data consumers, our downstream use case will most likely be of lower priority than getting the production system stood up – and that’s totally okay. While I would love for data to always be a first-class citizen, I believe it’s pretty obvious that shipping a stable production system needs to take priority, and we just need to accept that resource-constrained engineering teams may move slower on supporting a data integration. This is why you’ll want to get started early and get these kinds of tasks and questions on the engineering team’s radar as soon as possible.

CC-licensed photo by Ian Beckley: https://www.pexels.com/photo/top-view-photography-of-roads-2440013/

Building a data platform from scratch at Collectors: A tale in three parts

I wrote an epic blog post series about my experience building a data platform from scratch in my new job, using the “Modern Data Stack” (well, at least parts of it). The post is an account of my first six months at Collectors building a data platform. It is part memoir, part instructional manual for data teams embarking on a “build a data platform” journey. I figured this might be relevant for some of y’all data engineering folks and/or “data teams of one”, so check it out here: Building a data platform from scratch at Collectors: Part 1 (parts 2 and 3 are linked from the post).

Image credit: “under construction” by Pedro Moura Pinheiro is marked with CC BY-NC-SA 2.0.

How to do Exploratory Data Analysis

This article provides an overview of (possible) steps to perform an exploratory data analysis (EDA) on a data set. These instructions are largely based on my own experience and may be incomplete or biased; I just figured this might be helpful since there’s not a lot of content out there.

The goal of EDA

The goal of exploratory data analysis is to get an idea of what a new data set we’re working with looks like. We’re mainly interested in aspects such as the size and “shape” of the data set, date ranges, update frequency, distribution of values in value sets, distribution of data over time, missing values and sparsely populated fields, the meaning of flags, as well as connections between multiple tables. Along the way, we take notes and document our findings so that we have a permanent reference for this particular dataset. Once EDA is complete, we should be able to add an integration for the data and create analyses more easily than if we started from scratch.

Tools for EDA

EDA can be done in different ways depending on which toolkit you’re most comfortable with. Generally, you can start out simply writing SQL in your SQL workbench (e.g. DataGrip) or start immediately with a notebook (e.g. Jupyter or Hex). At the time of writing this (February 2022), my data team mostly uses Hex notebooks for EDA, as they allow sharing and commenting on analyses fairly easily.

Some basic principles

  1. In an ideal world, we wouldn’t have to do EDA, but would instead be working with well-documented data and an accurate, up-to-date entity relationship diagram. Ideally, we’d also have this data “profiled”, i.e. have some form of documentation with some basic statistics. This isn’t the case very often for production databases, which is why we do in-depth EDA. However, if we can find documentation or code, we should use it during EDA to inform our assumptions and insights.
  2. When looking to integrate source data, make sure to query the actual source data table, rather than any modification (view, subset, extract) of it – if possible. This is to make sure we’re not looking at data that might already have some issues introduced through our code.
  3. Keep a “running monolog” in the notebook or SQL script. Yes, the code is usually self-explanatory, but it’s good to document your thought process so that other people can follow along more easily. Examples:
    1. “Here I’m just looking at the row count.”
    2. “Let’s join set_item on the set table to see if the IDs match. I see that there are no missing joins, so this seems to be in sync.”
    3. etc.
  4. Keep in mind that you’re only looking at a snapshot of the data at this point in time, so the assumptions you’re making may not hold forever, unless they’re explicitly documented and asserted in code (see 1., and the sketch after this list).
  5. If you’re working with a large data set that’s very slow to query, pick a “reasonable” subset, e.g. restricted to a specific time frame.
  6. I usually just look at numbers, but occasionally having some lightweight data visualization can be helpful to see trends. The resources I’ve listed below have some more content on using data visualization for EDA.
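
To make principle 4 a bit more concrete: once an EDA finding becomes an assumption you rely on, it’s worth turning it into an assertion that can be re-run, rather than leaving it as a note in a doc. A minimal sketch, assuming a Pandas dataframe df with a hypothetical id primary key column:

```python
# Turn an EDA finding into an explicit, repeatable assertion -- a sketch
# assuming a dataframe `df` with a hypothetical "id" primary key column.
assert df["id"].is_unique, "id is no longer a unique primary key"
assert df["id"].notna().all(), "found NULL ids, contrary to what EDA showed"
```

In a warehouse context the same checks would typically live as tests in your transformation layer, but the idea is the same: an assumption that isn’t asserted anywhere will eventually drift.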

How-to

There is no exact playbook for EDA since the steps depend on the type of data you’re looking at. Here are some high-level steps to follow (a consolidated Pandas sketch follows the list):

  • Print some sample data of the table (first few rows, or use a sample function if available), just to look at what kind of data each column contains.
    • If using Pandas, transposing a dataframe (df.T) can be helpful for reading through wide tables.
  • Print the data types for each column. Keep in mind that the database datatype and logical datatype might be different, e.g. an integer field may be used to represent a boolean value with 0/1.
  • Identify the primary key and potential foreign key columns, and relevant timestamp fields (e.g. created date, last updated).
  • Get some basic numbers for relevant fields:
    • Table row count
    • Unique count for the “primary key” field: Does it match the row count, i.e. is it really a unique primary key? If not, is there a set of fields that can uniquely identify a record, e.g. ID field + timestamp?
    • Min/max for numeric and date fields: What ranges are we looking at?
      • Note if there are values that look like dummy values, e.g. “1900-01-01” for dates, or dates in the future.
      • Note if there are values that look like outliers based on the column name, e.g. a “2088” in a field named “customer_age”.
    • Group by and counts for value set columns, i.e. categorical variables such as boolean fields or values from a fixed set, e.g. “service category”
      • Pay attention to NULL values – how sparsely populated is this column? Can we expect NULL values at all? Or do we have another “dummy” value that represents NULL, e.g. “Unknown”?
      • For boolean fields, do we have only true/false values, or do we also have true/false/NULL and if yes, what does NULL represent?
      • It’s fairly critical to find out which fields are free-text and which ones have controlled input. The database datatype in both cases will be text, but the logical type is either controlled input (categorical) or free-text.
  • Look at the distribution of record counts over time: Identify the relevant date field (if one exists) and count the number of records for a reasonable time period, e.g. by month or year.
    • This gives us an idea of the volume of data to expect over time.
    • It also helps us see at what point the data starts to be “complete”, which is better than just looking at the earliest date, since we might only have a handful of records for specific dates.
  • If possible, use a library like Pandas Profiling to get a “profile” of the data
    • This captures most of the basic stats listed under “Get some basic numbers” as well as more detailed histograms, correlations, etc.
    • It can get a little unwieldy for large tables, so it might make sense to focus only on a subset of relevant fields.
  • If working with multiple tables, try and draw out a simplified high-level ERD (entity relationship diagram) to get an idea of how the tables join together and whether we have referential integrity.
    • Run the joins and confirm whether join fields always match or whether there are some “empty joins”. E.g. ask questions such as “Does every service_level_id in the submission table have a corresponding record in the service_level lookup table?”
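
To pull the steps above together, here’s a consolidated, minimal sketch in Pandas. It assumes the dataframes df, submission, and service_level have already been loaded (e.g. via pd.read_sql), and the column names are hypothetical placeholders – substitute the actual primary key, categorical, and timestamp fields from your own tables.

```python
# Consolidated EDA sketch in Pandas -- dataframes (df, submission, service_level)
# and column names (id, status, created_at, service_level_id) are hypothetical.
import pandas as pd

# 1. Peek at a few rows; transposing (.T) makes wide tables easier to read.
print(df.sample(5, random_state=42).T)

# 2. Database vs. logical datatypes: an int64 column may really be a 0/1 boolean.
print(df.dtypes)

# 3. Is the "primary key" actually unique?
print(len(df), df["id"].nunique())

# 4. Ranges for date fields -- watch for dummy values like 1900-01-01 or future dates.
df["created_at"] = pd.to_datetime(df["created_at"])
print(df["created_at"].min(), df["created_at"].max())

# 5. Value sets and sparseness; dropna=False surfaces NULLs explicitly.
print(df["status"].value_counts(dropna=False))

# 6. Distribution of record counts over time, e.g. by month.
print(df.set_index("created_at").resample("MS").size())

# 7. Referential-integrity spot check between two related tables.
joined = submission.merge(
    service_level,
    left_on="service_level_id",
    right_on="id",
    how="left",
    indicator=True,
)
print(joined["_merge"].value_counts())  # "left_only" rows are empty joins
```

A profiling library like the one mentioned in the list can automate most of steps 2–6 in a single report, so the manual version is mostly useful when you want to focus on a handful of fields or the table is too large to profile wholesale.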

Other resources

I haven’t come across many posts that I found particularly helpful (maybe all the good content is tucked away in books?). Here are a few links to other sites that may be useful and complement this post:

Header image by Miguel Tejada-Flores via Flickr (CC BY-NC 2.0)

What’s next?

When I joined Flatiron Health in February 2014, I had no idea what to expect. I had just moved to New York City – my second ever trip to the US – with two suitcases, crashed on my friend’s couch, and walked into the office in the middle of a snowstorm (I got in late on my first day because I was left stranded by the MTA – pro move!). I was on a 1-year visa and didn’t even know whether it was going to get extended after the year was up, or whether the 20-person startup I had just joined after finishing my PhD in England was even going to last that long.

Almost 5 1/2 years later, I’m now looking back on many late nights at the office, countless meals with my work family, a few drinks (just a few, really!), late night karaoke, rafting and ski trips, pipeline breaks and product launches, both great and absolutely horrifying client calls, several rounds of funding, an acquisition (us buying a company twice our size), another acquisition (this time us getting acquired), almost a thousand new employees, many farewells, wonderful relationships, challenging relationships, my first intern, my first direct report, my first time as a team lead, and my first goodbye to a company that I still talk about as “we” even though I officially left almost a month ago. As I like to tell people who ask me about my time at Flatiron: It’s been a wild ride.

So… what’s next? Honestly, I don’t know. I want to continue doing “data stuff”, but as a non-traditional (as far as the word “traditional” applies to a fairly new field) data scientist who puts data empathy and interpretability before building ML models, it’s going to be an interesting challenge to find the right fit for me. For now, I’m still based in NYC, enjoying the summer, plotting some travel, and reflecting on the things I’ve learned over the past few years.