Being a Product Data Scientist

Ankita Mathur
9 min read · Jan 9, 2023

I hear all of you who take issue [1] with the term ‘Data Scientist’. I do too! [2]. But I have another concern: it is hard to explain the exact process of finding good insights in messy data [3], and sometimes even harder to learn it directly from someone. So how can one become good, or even the best, at it?

The most common answer to this question is to keep doing it long enough [4].

Having worked with data teams in startups for the last five years [5], it is clear to me that, yes, of course, you have to keep practicing, but not just to become more proficient in your skills. You practice to discover your own unique ways of doing the work, articulating it, and using it to drive the outcomes you want.

Being a product data scientist at Sundial [6] gave me lessons, tips, and tricks that will forever be a part of my analytics toolkit. So here are some of them, along with examples from my work, that I wanted to write down [7] and share, especially with folks who yearn, and spend a good deal of resources (both time and money), to become a “data {something}”.

  1. Ad-hoc queries are a godsend
  2. Be clear on the utility of your analysis
  3. Do not be afraid to work on new challenging problems
  4. Never-Ever compromise on your data checks
  5. Being on-call is a blessing in disguise
  6. Work in tandem with the engineering and design team

Ad-hoc queries are a godsend 😇

Admit it or not, we all despise ad-hoc data requests, especially when we signed up to automate analytics and build a product. However, working on these ad-hoc questions was extremely useful for building my analytical skills and the muscle to quickly identify methods to solve them.

Let me clarify what I mean by ad-hoc queries here. These are often business questions that the user cannot answer with existing dashboards or reports and that need further investigation. Often they are just one-time use cases in the “hope” of finding something valuable.

Some interesting questions that I got to work on were:

  • Should we track monthly or quarterly churn rates?
  • Can you figure out our product’s high growth potential/budding segments?
  • Did decreasing the number of notifications sent improve user retention?

The monthly and quarterly churn rate problem was a fascinating one: we had to pick one of the two for an e-commerce B2C product [8]. When I checked, the two metrics were highly correlated, so the simple quantitative answer to this question could have been: it does not matter; you can track either! That answer is acceptable, but there should be a more substantial reason than just a high correlation, shouldn’t there? But how do you find it?

The key was to understand why we even track a churn rate: it measures the percentage of users lost in proportion to the users active on the product in a given time period. Note that the time period is important.

The churn rate formula: churn rate = (users lost in the period) ÷ (users active in the period)

So, tracking the monthly churn rate is beneficial if your product is used more on a monthly cadence, like a credit card bill payment app. If it is a quarterly-usage product, say a financial planning product, then the quarterly churn rate makes more sense. Hence, we get our answer by calculating the product’s usage frequency [9]. And one of the ways to do that is with stickiness ratios (WAU/MAU, MAU/QAU) [10].
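To make the reasoning above concrete, here is a minimal sketch of both calculations. The numbers, thresholds, and function names are illustrative assumptions on my part, not Sundial’s actual implementation.

```python
# Sketch: churn rate and stickiness, the two quantities discussed above.

def churn_rate(active_start: set, active_end: set) -> float:
    """Fraction of users active at the start of a period who were lost by its end."""
    churned = active_start - active_end
    return len(churned) / len(active_start)

def stickiness(short_period_active: int, long_period_active: int) -> float:
    """Usage-frequency proxy, e.g. WAU/MAU or MAU/QAU."""
    return short_period_active / long_period_active

# Example: 1,000 users active this month, 800 of them still active next month.
month1 = set(range(1000))
month2 = set(range(200, 1000))
print(churn_rate(month1, month2))  # 0.2 -> 20% monthly churn

# A MAU/QAU ratio close to 1.0 suggests monthly-or-more usage, so track
# monthly churn; a much lower ratio points to quarterly churn instead.
print(stickiness(900, 1000))  # 0.9
```

The point of the sketch is only that both metrics fall out of the same active-user sets, so deciding which churn cadence to track reduces to measuring how often those sets overlap.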


Solving this ad-hoc query helped me intuitively understand how product metrics are related, and how both their values and their trends impact the product’s growth, which is otherwise hard to interpret.

I get it. Automating analytics is cool! But to level up and build conviction in your analysis — keep working on these vague questions. It is as essential as a warm-up before running a marathon.

Be clear on the utility of your analysis 👾

No matter how sophisticated your technique is, or how interesting your insights are, if the end user of your analysis cannot take any action or decision using it, or make it a part of their workflow, it is useless.

I worked on an algorithm that identified the growth phase (early, growth, mature) of a product [11], derived from the concept of an S curve [12]. But the novelty and challenge of the problem statement initially kept me from asking: what will the user do with this piece of information? I just wanted it live on our product as soon as possible.

Evolution of a Product from Sequoia Capital Data Science series

It can be super valuable if users can evaluate their product metrics’ movements in the context of the product’s phase. However, knowing just the growth phase on its own is not helpful. It has to assist in understanding why a specific metric movement is a good thing in one phase and a point of concern in another.

For example, when Zoom launched and was in its early stage of growth, most active users were new users. Fair enough! Now imagine if that were still the case in its later stages of growth (the mature phase): Zoom would have been in trouble, because in the mature phase the proportion of resurrected (returning) users in active usage should be higher than that of new users to ensure sustainable growth.

Questioning the analysis's impact helped us improve our phase detection capability.

Note to self: for your own fun personal projects, it is entirely okay to analyze and play around with data that has no utility for anyone else. When Dear Data [13] started, I am sure they did not think about its utility; they did it to their own heart’s content!

Do not be afraid to work on new challenging problems ⛰️

In a high-trust team, which is common in good startups, you can catalyze the confidence of your peers to do amazing new things. At the start of my career five years ago, I learned just enough R to get the job done. Since then, I have wanted to become self-sufficient by learning to write and debug good code.

At Sundial, I learned how to write modular Python code, debug data pipelines, and automate analytical capabilities — all of which were challenging and helped me expand my skill set.

For instance, in my first week — I built sparklines [14] to visualize trends. Before I started, I had no idea how to build a sparkline.

However, this does not imply that you should always work independently and cannot seek help. Building this alone was fun; building on top of it with colleagues was even more satisfying. My colleagues helped me improve the code parameterization and add enhancements on top. We liked this small piece so much that we later used this as a hiring assignment.
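Since the article does not show what those sparklines looked like, here is a minimal text-based stand-in for the idea: compress a metric’s trend into a single compact line. The implementation below is my own illustrative sketch, not the version built at Sundial.

```python
# A minimal text sparkline: map each value in a series to one of eight
# block characters, scaled to the range of the series.

BARS = "▁▂▃▄▅▆▇█"

def sparkline(values: list[float]) -> str:
    """Render a numeric series as a one-line trend visualization."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero on a flat series
    return "".join(BARS[int((v - lo) / span * (len(BARS) - 1))] for v in values)

# Example: a weekly active-users trend at a glance.
print(sparkline([120, 140, 135, 180, 160, 210, 150]))
```

Even a toy like this conveys why sparklines are useful in an analytics product: a trend becomes readable inline, next to the metric’s current value.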

Never-Ever compromise on your data checks ⚠️

If your startup is new with a small data team, there may be no standard ways to check and clean the data. Yes, there will be a best-practice document, but it is still just a document, not enforced like your GitHub workflow checks. You can still push your data through the pipeline without performing those checks. DO NOT DO THAT!

I once skipped some checks on retention data, which resulted in 3–4 days of work going to waste and additional computing costs. The bigger the data, the more difficult it gets to check, but the more critical the checks become.

No matter how painstakingly manual the checks are, or how confident your inner voice is about the sanctity of the data, do not proceed until you have ticked every data check off your list. In fact, add more checks to that list each time you work with a new schema.

Figure out ways with your team to automate these checks. If you do not have the bandwidth to automate, then involve other team members and the engineering team to discuss ways to thoroughly run data checks on large-size, complex data and ways to reduce manual effort.
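As a sketch of what automating such checks can look like, here is a tiny checklist runner. The specific checks, field names, and the `retention` column are hypothetical examples, not a real schema from the article.

```python
# Sketch: run a checklist of data checks and report every failure,
# instead of silently pushing data through the pipeline.

def run_data_checks(rows: list[dict]) -> list[str]:
    """Return descriptions of failed checks; an empty list means all passed."""
    failures = []
    if not rows:
        failures.append("dataset is empty")
        return failures
    # Null check: required fields must be present and non-null.
    for field in ("user_id", "event_date"):
        if any(r.get(field) is None for r in rows):
            failures.append(f"null values in required field '{field}'")
    # Duplicate check: expect one row per (user_id, event_date).
    keys = [(r.get("user_id"), r.get("event_date")) for r in rows]
    if len(keys) != len(set(keys)):
        failures.append("duplicate (user_id, event_date) rows")
    # Range check: retention values must be fractions in [0, 1].
    if any(not (0 <= r.get("retention", 0) <= 1) for r in rows):
        failures.append("retention values outside [0, 1]")
    return failures

clean = [{"user_id": 1, "event_date": "2023-01-01", "retention": 0.4}]
dirty = clean + [{"user_id": 1, "event_date": "2023-01-01", "retention": 1.7}]
print(run_data_checks(clean))  # []
print(run_data_checks(dirty))
```

Returning every failure at once, rather than stopping at the first, keeps the manual effort down: you fix the whole list in one pass instead of rerunning the pipeline per issue.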

Ensuring data quality is the first step to building conviction in your analysis.

Being on-call is a blessing in disguise 👩‍💻

Debugging your own code is not the most fun part of being an analyst, and debugging someone else’s code is just painful! However, there is no better or faster way to understand the code base and improve your coding skills.

It is easier said than done. It takes time to learn how to read error messages well, and to quickly add breakpoints and print statements at the right places so you can run the code and identify the cause of the error.

The first time I was on-call, I was tempted to immediately ask the person who wrote the code why it failed, or to ask someone more experienced who must have encountered the error before. But spending time independently figuring out the error and then fixing it gave me the confidence to write better code.
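That debugging loop can be sketched in miniature. The pipeline function below is entirely made up for illustration; the point is the workflow: read the traceback, form a hypothesis, confirm it with a well-placed print (or `breakpoint()`), then fix the root cause.

```python
# Sketch of the on-call loop: a pager fires on ZeroDivisionError in normalize().

def normalize(counts: dict) -> dict:
    """Convert raw counts into fractions of the total."""
    total = sum(counts.values())
    # Hypothesis from the traceback: the failure happens when counts is empty.
    # A temporary print (or `breakpoint()`) right before the failing line
    # confirms it:
    #   print(f"normalize: total={total}, counts={counts}")
    if total == 0:
        return {}  # fix: handle the empty-input case explicitly
    return {k: v / total for k, v in counts.items()}

print(normalize({"a": 2, "b": 2}))  # {'a': 0.5, 'b': 0.5}
print(normalize({}))                # {} instead of a ZeroDivisionError
```

The temporary print comes out once the fix lands; what stays is the explicit handling of the case the traceback pointed to.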

Bonus: Check out this beautiful debugging manifesto by Julia Evans.

A debugging manifesto by Julia Evans

Work in tandem with the engineering and design team ✌️

Software engineers are better and more efficient at designing code and making it scalable and robust. That sounds like stating the obvious, yet for some reason the collaboration is often limited.

Layerinitis, illustrated from Jean-Michel Lemieux’s excellent thread

To avoid Layerinitis [15], analysts must keep engineers in the loop when building new code capabilities.

Similarly with the design team: the best person to craft a natural flow for your data insights is the one who cares the least about your analysis but the most about getting their questions answered. Collaborating with designers can help you get to the most critical insights and present them in the most user-friendly way.

Thanks for reading so far!

As per the premise, reading this blog alone will not make you great at analytics. But it might help you move in a direction where you learn to do this weird art/science [16] thing in your own unique ways.

So keep at it, and you will figure it out!

View from my desk at Sundial office in Bengaluru. I learned so much at Sundial because not only was the work interesting, but I also got to work with some really awesome people!

Notes:

  1. The issue I take with the title “data scientist”
  2. Obvious fact: Google search trends for ‘what does a data scientist do’ rank significantly lower than ‘how to become a data scientist’ worldwide, except in October 2022. Could be self-doubt, or the finance teams!
  3. How do we actually pull stories out
  4. Analytics is a profession
  5. I have worked as a product data scientist at Sundial, a data success manager at Atlan, and a senior data analyst at SocialCops.
  6. A product data scientist (PDS) at Sundial develops proprietary frameworks and algorithms to generate product insights like alerts, trend analysis, driver analysis (what drove the change in metric), forecasting, funnels, etc., at a regular cadence, mainly in Python.
  7. Why I Write
  8. It was an e-commerce product, something like eBay, with items available to purchase ranging between 2 and 1,000 dollars. Instead of tracking all possible product metrics, we recommended that our users track the metrics most crucial for their business.
  9. Selecting the right user metric
  10. The WAU/MAU and MAU/QAU ratios value help in identifying product usage frequency. Check out: Stickiness-benchmark to know more.
  11. Evolution of a product
  12. The Death and Birth of Technological Revolutions
  13. Dear Data
  14. Sparkline
  15. Layerinitis
  16. How do we actually “pull stories out of data” ; Proficiency v. Creativity


Ankita Mathur

Practicing the art of good storytelling with data, aka information. Subscribe to my newsletter at https://dataduet.substack.com/