Sam Bail

Hi!

I’m Sam Bail – Data Engineer, educator, and founder of Bright Nights Social, an alcohol-free nightlife platform in New York City. I’m currently based between Berlin and NYC. I’ve worked at data-centric startups doing things like healthcare data analysis, building an open source data quality framework, and leading engineering teams for over a decade. I also do a lot of conference talks and workshops around data, data quality, Python, Postgres, and other data topics, and I’m an online instructor for O’Reilly and LinkedIn Learning. I’ve also been hosting events in the non-alc space and writing about non-alcoholic topics since 2022.

Check out my Portfolio and Bio & Contact pages for more info about my work.

This blog is a merger of several personal and technical blogs I’ve maintained in the past since my Manchester days in the late 2000s. You’ll come across a pretty random mix of technical posts, posts about my travels, my time in Manchester UK, and more recent content from NYC and Berlin.

Will AI replace data engineers?

May 19, 2026 Sam Bail

Yesterday, I explained my data engineering job to a friend and he asked if I was worried that AI was going to replace me. And I thought about all the messy data sources with zero documentation I’ve worked with, the countless conversations I had with humans to understand what the heck I was looking at and how to interpret the data, the technical challenges of connecting to data sources, and the cross-functional teamwork required to actually make sense of data, and I can honestly say – at least for the time being: AI could never.

AI can certainly speed up writing code, reverse-engineering existing code and data, auto-generating documentation, and detecting errors, but unless it can find out who designed a production database from chatting with someone over lunch and send them a Slack message to say “hey, I’m seeing this in the data, is this an error or intentional?”, I believe it still takes humans to deal with human-generated data chaos.

How to party sober – and actually have fun!

April 19, 2026April 19, 2026 Sam Bail

Sober curious. Soft clubbing. Soft socializing. The cultural shift away from alcohol-centric socializing is one of the most interesting consumer trends of the decade, and I’ve been right in the middle of it with Bright Nights Social for several years now.

But I still get asked all the time: “How do you actually have fun while going out sober?” – so I wanted to fill that gap!

I’m excited to announce that I’ve just released The Sober Partying Playbook: the first practical playbook for going out sober. 27 pages based on my experience from producing 100+ alcohol-free events through Bright Nights Social since 2022, my own experience going out partying til 4am and beyond, and many conversations I’ve had with other sober party people.

The ebook covers mindset, practical logistics, what to drink, and how to navigate the social dynamics of going out sober and substance-free. I’ve had so much fun writing it, and I can’t wait to hear your feedback!

The full ebook is available now: https://lnkd.in/dgrMAqZb

Data Quality Testing with Great Expectations

April 19, 2026 Sam Bail

Even in the age of sophisticated AI, “Garbage in, garbage out” still holds true – you can’t get good results from bad data.

I’m excited to announced that my second course for LinkedIn Learning on Data Quality Testing just got published! It covers the foundations of data quality and why it’s so important, and deep dives into testing with Great Expectations (GX), a free and open source framework.

As a former engineer at GX and co-author of the official tutorial in the GX documentation, I’m uniquely positioned to teach this class from an insider perspective.

You can start the course here.

Check out PySpark Essentials course on LinkedIn Learning!

January 27, 2026January 27, 2026 Sam Bail

Happy almost 6 month anniversary to my first course on PySpark Essentials that launched on LinkedIn Learning last year! If you’ve always wanted to try your hand at PySpark, this is a great course for beginner to learn data engineering basics with PySpark hands-on.

Here’s the course description:

PySpark is a powerful library that brings Apache Spark’s distributed computing capabilities to Python, making it a key tool for processing large-scale data efficiently. In this course, data engineer and analyst Sam Bail provides a structured and hands-on introduction to PySpark, starting with an overview of Apache Spark, its architecture, and its ecosystem. Learn about Spark’s core concepts, such as the DataFrame API, transformations, lazy evaluations, and actions, before setting up a lab environment and working with a real dataset. Plus, gain insights into how PySpark fits into a broader data engineering ecosystem and best practices on running PySpark in a production environment.

Learning objectives:

Build your understanding of the core concepts of Spark and PySpark.
Understand how to install PySpark, load, manipulate, and analyze large datasets in a notebook environment.
Gain an understanding of how PySpark fits into a wider data engineering ecosystem.
Understand best practices about executing PySpark in a production environment.

The entire course is around 1 hour 18 minutes long and includes hands-on exercises in a Google Collab notebook. You can watch it here with a LinkedIn Learning subscription!

Goodbye Third Place Bar…

January 6, 2026 Sam Bail

… hello Bright Nights Social!

When I first launched Third Place Bar almost exactly three years ago, I wasn’t even one month sober yet, but the idea of an alcohol-free bar had been on my mind for a long time.

Why were there so few “third places” in New York City that stayed open late and didn’t revolve around drinking? So I set out to create one, starting with pop-up events to test out the concept, understand the space better, and build up a community.

Three years, nearly 100 events, and thousands of guests later, Third Place Bar has grown into many things… just not an actual brick & mortar bar. The name stayed, but it started to feel misaligned with what I was actually doing: hosting pop-up events, exploring sober-friendly nightlife in NYC, and sharing the things I’ve learned along the way.

So… I’ve changed the name. That’s all. After two weeks of overthinking, hours of meetings, and endless text messages, I landed on Bright Nights Social. To me, the name captures the joy of going out, finding community, and being fully present for it.

It’s also a nod to:

My obsession with disco balls.
Bright Lights, Big City – I randomly found the book on the street in NYC last year and absolutely love it.
New York City, of course!

And on that note… stay bright! ✨

My PySpark Essential Training course just launched on LinkedIn Learning!

August 18, 2025 Sam Bail

I’m excited to share that my new course on PySpark Essential Training is now live on LinkedIn Learning!

I was approached by the LinkedIn Learning team after they found some of my Python courses on YouTube, and we brainstormed several ideas for topics together. While I hadn’t used PySpark in a production setting, I was really excited to dive into it and develop a new course for them! This is the first time I’ve created an on-demand course that actually pays royalties based on streams, which is a lot more scalable than the quarterly live trainings I’ve been hosting for O’Reilly.

In the LinkedIn Learning course, I provide a structured and hands-on introduction to PySpark – perfect for data engineers, analysts, and anyone looking to scale their data processing skills. The course covers:

The core concepts of Spark and PySpark.
Installing PySpark, loading, manipulating, and analyzing large datasets in a notebook environment.
How PySpark fits into a wider data engineering ecosystem.
Best practices about executing PySpark in a production environment.

You can stream the course here with a LinkedIn Learning subscription – and feel free to like and bookmark the course to provide feedback!

The key to building a high-performing data team is structured onboarding

September 1, 2023September 1, 2023 Sam Bail

As I’m reflecting on the past two years on the last day of my job at Collectors, as well as the years before that at other organizations, I’m seeing a common thread between what worked and what didn’t work to build high performing data teams. But… let me rewind a little.

When I started at Flatiron Health in early 2014, the entire company was around 25 people, around half of them software and data engineers. When I left the company five years later in 2019, we had grown to around 800 employees (give or take, I lost track at some point) – and despite the hyper-growth, we still had one of the most highly performing engineering teams I have gotten the chance to work with. In my next role, it took me a moment to realize what we were missing that made it feel like the engineers were able to hit the ground running and seamlessly integrate into an existing engineering team or form new teams without major bumps: structured onboarding. For me, structured onboarding consists of two things that are both equally important:

The onboarding document
The “bootcamp” phase

Using these two methods, I have onboarded members on my data teams over the past decade and gotten them to be productive and ready to go after just a week. There are several main benefits of structured onboarding, both for the new team member and the organization:

It significantly reduces the amount of time for the new staff member to be productive.
It reduces the amount of “asking around” the new hire has to do.
It makes them feel welcome – arriving to a prepped and personalized welcome doc shows respect and care for the new hire, and sets the tone for excellence.
In addition to speeding up the time to being productive, it also sets the tone for “how we do things here”, which allows every member of the team to operate quickly and efficiently without wondering (or debating) about workflows or best practices.

The onboarding doc

The onboarding doc can be a simple Google doc that’s shared with the new hire on day 1. My onboarding docs look something like this:

A brief personalized intro paragraph and “what is the purpose of this document”. I’m serious, this sounds like a minor detail, but for someone who is brand new in a job and might have absolutely no idea what’s going on, even the smallest amount of help can make them feel more comfortable.
A list of key documents and links, such as:
- Company-wide documents and links:
  - The company intranet page
  - Any “what does the company do” type document
  - Org charts
  - Learning & development, travel, holiday, reimbursement etc. policies
- Team-specific documents and links:
  - The team mission statement and roadmap
  - Key data platform documentation
  - The JIRA board (or whatever project planning tool you’re using)
Who to meet:
- Everyone on the team – I suggest short 15 minute meet & greets with everyone on the team (if size allows)
- I usually pick a list of 5-10 people in different key roles at the company (e.g. product, software engineering, operations, IT, security…) to chat with. If you choose to do this, please prepare as follows so the experience is pleasant for everyone:
  - I confirm with those people up front that they’re happy to talk to new hires
  - I make a note for the new hire to message the person first, introduce themselves briefly, and ask if it’s okay to throw a quick meeting on the calendar. NO ONE likes “cold”
  - The doc also explicitly calls out that the new hire should come prepared with a quick intro and some basic questions (what’s your background, what do you do in your role, how do you use data in your role, etc.).
- System access: This is one of the most crucial parts of the onboarding doc and will save everyone a lot of headaches.
  - This should just link to a separate, centrally maintained list that is kept up-to-date.
  - A list of every system and tool and how they can get access, e.g. email, GitHub, VPN, JIRA, databases, SAAS products, etc.
  - Obviously, a lot of these should have been granted to the hire before the start date.
  - I explicitly ask every new hire to add comments or suggestions to the central systems access doc to improve it or update outdated or incorrect information.
Onboarding plan:
- Day 1: What the hire should be doing on day 1. That’s usually just getting access to various systems, joining Slack channels, and reading the “key documents”.
- Week 1: This usually includes browsing the code base, exploring dashboards, running dev pipelines, and picking an “onboarding ticket” from a list of quick and easy tasks.
  - Great onboarding tasks allow the hire to make and test some simple code changes without needing to know too much about the code base. This could be a small refactor (renaming a variable), writing tests, or adding an existing data point to an output.

The “bootcamp” phase

The “bootcamp” is usually the first two weeks where a new hire learns about the company, the product, data, codebase, and best practices for how the team “does things”. These are usually taught in live sessions by the subject matter experts, which allows the new hire to ask questions and clarification. Obviously, the scope of a bootcamp depends on the size of the company and team, but it can include sessions such as:

Company mission statement
Who is who: org chart walkthrough
Product demos (get a product manager to do that!)
Data platform and dashboard demos
“Where does the data come from” walkthroughs
Developer workflows, code conventions, and best practices
How we give and receive feedback on our team

The bootcamp ensures that the new hire gets exposed to key information that’s required for them to navigate the organization. Additionally, it also makes them “drink the kool aid” and shows them how the team operates. This is one of the most crucial points for building a high-performing team: Instead of each new engineer learning from whoever is reviewing their code or helping them get something to work, which may often turn into a game of “telephone”, there is an actual document, slide deck, or some other written documentation of workflows and best practices that everyone knows about from the very beginning.

Bonus for managers: The 30/60/90 document

In addition to an onboarding doc, you may also want to put together a 30/60/90 document for a new hire that lists out expectations of what they should have accomplished at each milestone. There. areplenty of templates out there, so I’m just briefly sharing my personal preference here.

Keep it super simple. One page is enough.
At each milestone, write out what the engineer is expected to be doing with increasing complexity, ownership, and scope depending on their level of seniority: e.g. write SQL queries without help, review other people’s code, lead stakeholder meetings, plan out projects, etc.
I generally split these between technical tasks, team internal tasks, and leadership skills.

Conclusion

Of course, great onboarding isn’t the only thing necessary to build a high performing team, but it’s almost impossible to build one without great onboarding. Depending on the size of your team and organization, onboarding may look different from what I’ve described here, but I hope this provides a good starting point for you. Whether you’re kicking off a brand new team as the first data hire or coming into a more established data team that doesn’t have an onboarding plan yet, spending just a few hours putting together a solid onboarding plan and refining it with every new hire will help set your team up for success.

I asked ChatGPT to write me some code to create a running playlist in Spotify… and it just worked?

April 25, 2023 Sam Bail

I don’t really love running. I run marathons. These two things are true at the same time. I started running a lot (like, a lot) during the lockdown 2020 and somehow never stopped. While I do enjoy the actual act of running, the one thing that keeps me coming back to it is the music. My playlists. To me, running is just a different form of dancing – moving your body to music.

As I got more into training for actual races, I noticed that running to music at a certain bpm helped me keep up my cadence and made running a goal pace easier. Running cadence is basically the same as bpm for music, it describes how many steps you take per minute. A higher cadence is generally considered more favorable as it reduces the time spent on the ground, and quicker turnover can help you run faster. I generally aim for 175 bpm for an easy/medium run which gets me to a 9:30 pace with light effort, and faster bpm for faster runs.

You probably know where this is going by now: I’m always looking for a good playlist with my target tempo. There are obviously a bunch on Spotify already, but they’re not necessarily the songs I like. While Spotify recently introduced auto-generated playlists at a certain bpm based on your liked artists, it seems to have a pretty wide margin for the tempo. Since I’ve been wanting to play around with auto-generated code by ChatGPT for a while, I figured I might as well tell the AI to write some code for me. And it went… ok?

Here’s my first prompt, just to test the waters:

>> Can you write me some python code that gets a list of all my liked songs including their ID and bpm from Spotify?

Sure, I can help you with that. To get a list of all your liked songs including their ID and bpm from Spotify, you will need to use the Spotify API and the spotipy library for Python. Here is an example code that should work:

You will need to set up a Spotify developer account and create a new app to get the required credentials for the Spotify API. Then, replace the client_id, client_secret, and redirect_uri values in the SpotifyOAuth call with your own credentials. Finally, run the code and it should print a list of your liked songs with their IDs and BPMs.

That already looked pretty good, so I got a bit bolder and asked ChatGPT to actually give me the full code to add songs based on a certain bpm… and it delivered:

This looked like reasonable code to me. I was particularly surprised and somewhat bewildered that it automatically added the 5 bpm margin. Of course, most human developers would know that a buffer makes sense and ChatGPT just learned from existing code, but it almost felt like it was being smarter than what I asked it to do?

However, when I tried to run the code, I encountered two minor problems:

First, the example didn’t actually add the variables for the authentication credentials in the SpotifyOAuth call that it mentioned in the comment after the code. I had to do some googling to figure out how to set use these and decided to set the respective environment variables instead.

And second, the data model for the object returned by the Spotify API has changed, and the “tempo” now requires a separate API call. I found an example API call and asked ChatGPT explicitly to use the different method to get the “audio features” object:

>> can you write some code that uses the spotipy library to retrieve the audio features for a given track ID

I’ll skip the code snippet it gave me, but I ran the code separately and it ran just fine. Then I put everything together and it… just worked. Playlist created. Job done.

The final result

Overall, this took me maybe 15 minutes to put together, including figuring out authentication (I had already registered for a developer account before). Even though the code is fairly trivial, the biggest help was the API examples and the ability to ask questions about the API. It felt like working with a buddy who just happens to know that particular library and API pretty well and can you point the right way. I actually think I might use this in the future when working with unknown APIs. I’m decent enough at reading library and API docs, but using ChatGPT just really sped things up, even though it initially pointed me to an outdated version of the data model.

I don’t actually know how this would work for inexperienced developers that don’t have a good mental model of how the code should be structured at all and might have struggle with debugging error messages if the code doesn’t actually run. But I feel like for someone who’s written enough code to have an outline in their head and just needs some examples for things like API calls, this might be a great way to accelerate writing code.

The notebook with the final version of the code can be found on my GitHub. And yes, I did run the London marathon with (part of) that playlist!

Booze-free in NYC: Resources for sober and sober curious folks in New York City

April 2, 2023April 4, 2023 Sam Bail

Okay, so – I recently quit drinking alcohol and started a series of non-alcoholic social events in New York City called “Third Place Bar”. There’s a whole blog post waiting to be written on that topic, but until then I’ll leave you with an episode of the Sober Sallies podcast where I talk about my own story. In this post, I just want to highlight the insanely fast growing sober community in New York City that’s gone way beyond AA meetings in church basements, and list a bunch of resources for people in NYC that want to discover more alcohol-free social spaces. Here you are.

Non-alcoholic bars in NYC

Hekate: The first completely sober neighborhood bar in NYC. Located in the East Village and it’s definitely got some quirky EV vibes. They occasionally have events on, and their resident Tarot reader Matt is a sweetheart. You’ll probably find me there a couple of times a week.
Fat Tiger: The latest addition to the NYC booze-free scene, Fat Tiger is a dry speakeasy bar/lounge just a couple of blocks away from Hekate in the East Village. They just opened in March 2023 and it’s still a work in progress, but the staff are lovely and the craft cocktails and food are promising.
Kava Social: A kava bar in Brooklyn. I’ve never actually been, but they have a pretty big community that loves their kava, worth checking out!

Non-alcoholic events in NYC

Absence of Proof: Super successful non-alc cocktail pop-ups in NYC, upscale vibe. They’re running an event almost every week (mostly Manhattan).
Sober in Central Park: My buddy Rachel just started her own sober event production company, focusing on highly curated events in NYC! Check out her website for upcoming events and sign up for the mailing list to stay in the loop.
Third Place Bar: Yup, that’s me! I started running non-alc bar pop-up events with a neighborhood bar vibe in Brooklyn. I’m aiming to have a permanent space at some point, but for now I’ll keep hijacking coffee shops after hours for bar hangs and events like trivia nights and comedy shows 🙂 Check the website for upcoming events, sign up for the mailing list, follow on Instagram, do all the fun stuff and show us some ❤️.

Non-alcoholic bottle shops in NYC

This is a quick list of the booze-free bottle shops that I know of. I wouldn’t necessarily think of them as “social spaces”, but I recommend following them on social media and subscribing to their newsletters: Most shops usually have tastings on several times a week, and they host the occasional special event!

Spirited Away: A lovely non-alc bottle shop in Soho, the staff are super knowledgeable. Big selection of spirits, wines, and cocktails, maybe a little less so on non-alc beer. Super fast same-day delivery in Lower Manhattan!
Minus Moonshine: Another independently owned bottle shop in Prospect Heights. The owners Aqxyl and Juan are absolutely wonderful and have been supporting Third Place since day 1! Huge selection in a small space, tons of beer, loose cans (super useful if you want to just taste something instead of buying a 4 or 6 pack!) and lots of open bottles for samples if you’re unsure.
Sechey: A beautiful non-alc bottle shop in the West Village. Apparently the NYC location is just a pop-up, but it’s been around since fall 2022!
Boisson: A national chain of non-alc bottle shops, they have several locations in NYC and Brooklyn. I personally prefer the independently run shops, but you do you babe!

Sober community groups in NYC

Definitely not something I’m an expert on, so here’s just a short list of communities that I know of or have personally been in touch with that run meetings and events (mostly free!). I’d say these are more focused on people in recovery rather than people that are “just” sober-curious.

AA Inter-Group: The NY chapter of AA, tons of meetings pretty much round the clock. The Midnite and 12th Street meetings tend to have a younger demographic (according to a friendly Redditor).
- Now You Understand and Young and Wise are AA groups specifically for young people that meet on Zoom and in person in lower Manhattan.
- I also want to point out that there’s plenty of criticism of AA, but I also know people that had a lot of success with it, so I figured I’d list it anyway.
The Phoenix: A non-profit that organizes frequent free community events like running, climbing, and social events. I know some of the folks that are involved and they’re wonderful.
Big Vision NYC: Similar to the Phoenix, they run free community events for sober folks (most recently paintballing!).

This list was last updated 4/2/2023 and I might be making some edits in the future.

Analytics From Day 1: How to Ensure Smooth Data Integrations with New Production Systems

September 19, 2022October 19, 2022 Sam Bail

This is a pre-print of a post that will be published on the collectors.com tech blog.

We recently went through one of our most epic product launches at Collectors: Customers are now able to submit cards for grading to PSA, add them to their collection on the new collectors.com site, and choose to have their valuable cards stored in an actual physical vault. Oh, and all of this will be accessible via a single login, their “Collectors ID”, which replaces the multiple logins customers previously had to manage across our different products and business units like PSA and PCGS. While all these new features are integrated to provide a seamless experience to our customers, we’re dealing with a significantly more complex architecture of multiple new systems, databases, and APIs on the backend. Naturally, as with any new product launch, our product managers were keen to get analytics about the use of these features from day 1: How many users actually converted to the new “Collectors ID”? How many items have customers submitted to the Vault? Who is using the new collectors.com “My Collection” feature?

In order to provide these kinds of data insights right from the go-live, we had to coordinate with several engineering teams to get our hands on the right data and integrate it into our data warehouse. In total, we ended up pulling data from systems sitting on top of four different production databases that were being launched at the same time. Considering how many different systems and databases we were working across, the integrations went pretty smoothly! Within a day of the go-live, I had produced a few dashboards with key metrics that the product and business stakeholders started using immediately to track uptake of the new services. The road to getting there wasn’t straightforward though and involved some amount of scrambling, knocking on different doors, and a few small surprises during the go-live. In this post, I will share some of my lessons learned from integrating with a new production system when you’re looking to provide analytics from the get-go.

Logistics

I’m big on keeping running docs with notes from my conversations and findings when working on a project – I always say I outsource my brain into a Google doc. Keep a doc with (datestamped) notes and “to do” items for every piece of information you find, open questions, as well as a list of who’s responsible for what on the product, e.g. product managers, engineering leads, project managers, etc.

In addition to the running notes, connect with the business stakeholders (product managers, analysts…) early on to document a set of desired metrics along with their priorities and timelines: What do we need to know from day 1? What can wait until some time after the launch? This will also be helpful when exploring the new data models to determine what is actually being captured and what data points may not be available to calculate the required metrics.

If there are standing meetings for the engineering team that’s responsible for the database setup, I strongly recommend regularly sitting in on those meetings. Even if you don’t always understand everything that’s going on, it’s helpful to have the context of what the team is focusing on, and establish a relationship with them. As data engineers, we’re often pretty removed from our counterparts on the data producer side, but knowing the people on the team (and having them know you) can be helpful in working together more effectively.

Infrastructure

Once you know who your engineering point of contact is, the first question you’ll want to ask is: How do we get access to the data? Assuming we’re talking about data that lives in a relational database, here’s a short check list of information you need to get from the engineering team that’s responsible for the database setup:

Find out (and document) what cloud service the database is hosted in
Will you get access to a production database or a read-replica? And what permissions will you get, read-only, or will you be able to create temp tables or views if they’re needed by any of the tools in your pipeline?
Will there be dev and prod environments? What’s the timing for these being available?
How do users and services authenticate against the database? Do we need personal and/or service accounts to log in?
How will the logins will be shared? Will you need access to a shared password storage?
Do you need an SSH tunnel setup to access the database from any of the tools in your data stack?

It’s best to try and get all these details ironed out as early as possible, since especially tasks like setting up SSH tunnels can take some time. Make sure you can access the database as early as possible to avoid surprises later on, even if there is no meaningful data in there yet.

Data model

Now that we’ve covered physical access to the data, let’s take a look at things to consider when you’re working with a new data model. I got looped into the production database design process early on and was able to provide input on the data modeling (see also: establishing a good connection with the upstream engineering teams! They’re your friends!). This ensured that the data would be suitable for our data extraction tool (Stitch) and contained all relevant data. Again, assuming you’re working with a relational database, here are some questions you’ll want to cover when talking about the data model:

Where is the data model documentation and how is it being kept up to date?
For any fields containing value sets, such as status codes, where are the corresponding descriptions stored? Will there be lookup tables in the database, or will these only be stored in code? The latter means you will need to be able to access the up-to-date list of lookups through your infrastructure, e.g. by querying an API (or simply reading the API documentation).
Will there be JSON columns? What is the schema for those?
What are the constraints on each table and column, e.g. foreign key relationships, NULL values, default values?
For datetime fields, will they be stored with timezone (they should)?

Application and data flow

Perhaps most importantly, when trying to make sense of data coming from a production base, we need to understand what the flow of the application is: What workflows (user-created or automated) in the application modify the data in what way? This is absolutely crucial to handling the data correctly and drawing the right conclusions from it. For example:

How and when is a record created, and what fields are populated through what input?
What workflows cause records to be modified in what way? And what metadata is there to track modifications, e.g. a “last updated” timestamp?
Will update timestamps for events such as status changes be tracked in separate fields? Or will there be kind of changelog table that captures these kinds of changes? This also trickles down into your data warehouse models, where you might need to start tracking status change dates right from the get-go.
How are deletions being handled? Will there be “hard deletes”, i.e. the record is simply removed, or “soft deletes”, i.e. the record has a “is deleted” or “deleted timestamp” field. And, along the same lines, is there a data retention policy that means data will be dropped or archived after a certain amount of time?
If the application is replacing a legacy application, will data be migrated? How do you recognize migrated data? Will there be any gaps or differences between migrated and newly create data?
Will there be realistic dummy data (i.e. data that adheres to the constraints and workflows described above) to develop our data models and metrics against?
Is there any chance of any test or dummy data getting into the production system? And if yes, how can we recognize and filter for it?

Ideally, your engineering and database admin teams will already have a “best practice” guide for designing new databases, which usually answers a lot of these questions. Otherwise, this might be a good time to start collecting these kinds of design decisions into a guide and encoding them in setup scripts where possible.

And finally…

I hope that this post has provided you with a starting point for a checklist for your next production data integration. All the questions I’ve covered in the above paragraphs should be treated as conversation prompts to elicit existing design decisions, or help influence decisions that are yet to be made. There will likely be some oversights (I have yet to work with *the* perfect production database), but coming prepared with a plan may help you catch some of the biggest issues to getting a good data integration early on. And even with the best preparation, you can probably expect to make some tweaks after the application go-live to adjust to some last-minute database changes or correct some assumptions you’ve made about the data. Developing against an empty data model or even dummy data can be challenging, and you might not nail everything at first try.

One last thing to keep in mind: As data consumers, our downstream use case will most likely be of lower priority than getting the production system stood up – and that’s totally okay. While I would love for data to always be a first-class citizen, I believe it’s pretty obvious that producing a stable production system needs to take priority, and we just need to accept that resource constrained engineering teams may move slower on supporting a data integration. This is why you’ll want to get started early and get these kinds of tasks and questions on the engineering team’s radar as soon as possible.

CC-licensed photo by Ian Beckley: https://www.pexels.com/photo/top-view-photography-of-roads-2440013/

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

The onboarding doc

The “bootcamp” phase

Bonus for managers: The 30/60/90 document

Conclusion

Share this:

Here’s my first prompt, just to test the waters:

The final result

Share this:

Non-alcoholic bars in NYC

Non-alcoholic events in NYC

Non-alcoholic bottle shops in NYC

Sober community groups in NYC

Share this:

Logistics

Infrastructure

Data model

Application and data flow

And finally…

Share this: