From raw data to a story
You learn best by doing! We all know this, but the fear of doing it wrong or not living up to our imaginary expectations stops us from even trying. In such situations, all we need (or at least all I need) is to jump in headfirst.
Participating in the Gramener Comicgen Awards 2021 with Rasagy Sharma and creating a data story was a similar experience. Winning the competition was a highlight, but the real value came from collaborating to create a data comic from a raw Excel dataset in two days. The organic process we followed helped me intuitively understand the fundamentals and objectives of visualizing data more clearly than any course, blog, or conversation with an expert ever could.
Hence, we decided to document it. If you truly want to make a data story, use this blog just for direction. Take the cleaned dataset linked with the comic, and give it a shot.
Step 1: Clean the data
Not the most fun part of the process, especially if it is a formatted Excel workbook! But it is important to do it well to get closer to your story faster.
We were given a standard dataset from the Lloyd’s Register Foundation World Risk Poll. Each country was a column, and each row was a survey question paired with one of its multiple-choice options.
For a country-wise comparison, we transposed the data. Each country became a separate row, and each multiple-choice option concatenated with its question became a column. This format worked better because we knew we would need data for all countries across a few questions, and deleting columns is much easier and more modular than deleting rows.
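The transposition described above can be sketched in plain Python. This is only an illustration and not the script we actually used: the sample rows, the column names, and the helper `transpose_to_country_rows` are hypothetical, and the numbers are made up for the example.

```python
# Hypothetical sample shaped like the raw sheet: one row per
# (question, option) pair and one column per country.
# All values here are invented for illustration.
raw_rows = [
    {"question": "Climate change threat", "option": "Very serious", "Nepal": 30, "Norway": 55},
    {"question": "Climate change threat", "option": "Don't know", "Nepal": 45, "Norway": 3},
    {"question": "Affected by severe weather", "option": "Yes", "Nepal": 52, "Norway": 12},
]


def transpose_to_country_rows(rows, id_keys=("question", "option")):
    """Return one dict per country, with 'question | option' as column names."""
    # Every key that is not an identifier is treated as a country column.
    countries = [key for key in rows[0] if key not in id_keys]
    result = []
    for country in countries:
        record = {"country": country}
        for row in rows:
            # Concatenate the question with its option to name the new column.
            column = " | ".join(row[key] for key in id_keys)
            record[column] = row[country]
        result.append(record)
    return result


country_rows = transpose_to_country_rows(raw_rows)
for record in country_rows:
    print(record)
```

Each resulting record is a complete description of one country, so dropping a question later simply means dropping a key from every record.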
Notes:
- Each row of the cleaned data should be a meaningful unit. To test this on your data, ask yourself: if you delete all rows but one, will the remaining row still make sense? It should.
- Retain all information in your clean master data. Remove it incrementally as you narrow down the scope of the story.
- Use an automated script (R, Python) or a tool to clean your data. It makes re-running the same transformations or adding something new to any data easy and quick.
✨ Tip: Ideally, data cleaning should be a one-time activity; otherwise it can become a painful and unnecessary waste of time. Save most of your time for figuring out the story.
Step 2: Identify a list of questions
Good and curious questions are essential elements of a good data story. The visualizations just make it easy for the reader to follow your line of investigation as you discover the answer. Rasagy and I listed all the questions that popped into our heads in a shared Google doc. Following your natural curiosity as you scan the data is a great starting point.
For best results, keep staring at the data! Just kidding (mostly), but this activity requires spending time exploring and analyzing the data.
Questions in the survey about mental stress, exhaustion, risks from AI and tech, and climate change got us hooked. We naturally wanted to know more about them. For example:
- Where do people rank risk from AI/Tech among other risks?
- Is there a relationship between popular information and perceived risks?
- Do some countries consider risks from mental stress and exhaustion more serious than others do? If yes, then why?
Imagining ourselves as one of the responders to the survey also helped. A few questions that came up in a natural flow were:
- What comes to mind when we consider risks to our safety or humanity?
- What triggers our responses? What factors do we personally consider when evaluating risks?
- Does it change with age?
- Will countries with younger populations have a different response to these questions?
We made a long list of all the questions that made us naturally curious.
✨ Tip: Do not worry too much about data availability at this step. Just note down all the questions and interesting perspectives that you can think of. You can check the plausibility of your ideas against the data and narrow your scope in the next step.
Step 3: Explore Explore Explore
We got many exciting ideas in the previous step, but we only wanted one. Going deeper into exploring data is the way to do that.
Visualizing the data is one of the easiest ways to explore it, especially when it has more than 100 rows. It quickly shows trends, outliers, dips, troughs, or anything else interesting. For more clarity, filters such as top 50 countries, income type, and region type also helped.
The process of exploring the data is hard to explain. Everyone has a different way of doing it. You can start by finding basic descriptives, comparing them across categorical variables, looking for relationships, judging the degree of each relationship (strong, medium, negligible) and its cause, observing patterns, and finding the drivers of those patterns or behaviors.
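One way to put a number on the "degree of relationship" is a correlation coefficient. The sketch below computes Pearson's r by hand and maps it to a rough label. We did our exploration visually in Tableau, so this is an alternative illustration rather than our method; the cut-offs of 0.7 and 0.3 are assumed rules of thumb, and the two lists are invented sample values.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def describe_strength(r, strong=0.7, medium=0.3):
    """Map |r| to a rough label. The thresholds are assumptions, not standards."""
    size = abs(r)
    if size >= strong:
        return "strong"
    if size >= medium:
        return "medium"
    return "negligible"


# Invented values: % affected by severe weather vs. % seeing a serious threat.
affected_pct = [10, 20, 30, 40, 50]
threat_pct = [60, 40, 55, 35, 50]

r = pearson(affected_pct, threat_pct)
print(round(r, 2), describe_strength(r))
```

A label like this is only a starting point; a scatterplot of the same two columns usually tells you more than the single number does.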
Exploration paves the way to insights. Not always! Sometimes it ends with a descriptive analysis without any novel findings. We crossed out the first two topics (AI and mental stress) because the available data did not let us go beyond the descriptors.
Climate change showed promise. There were also more questions in the survey related to climate change. Hence data availability also plays a role in determining which idea to take forward.
It was the end of day 1 and we still had no story! But we had one solid direction to focus on.
Notes:
- Use the data exploration step to go through the finer details. Understand the data and its context better. This will help you avoid mistakes and avoid force-fitting a hypothesis.
- Compare apples to apples. For example, carefully check the units and the context, and normalize against the correct baseline.
- Understand the meaning and context of column values.
- Be wary of the degree of missing data.
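For the last note, a quick check of how much data is missing per column can look like the sketch below. The rows, the column names, and the helper `missing_share` are hypothetical, and treating `None` and empty strings as "missing" is an assumption you should adapt to how your own sheet encodes gaps.

```python
def missing_share(rows, columns):
    """Fraction of rows in which each column is None or an empty string."""
    shares = {}
    for column in columns:
        missing = sum(1 for row in rows if row.get(column) in (None, ""))
        shares[column] = missing / len(rows)
    return shares


# Invented sample rows with a few gaps.
rows = [
    {"country": "A", "threat_pct": 55, "affected_pct": 12},
    {"country": "B", "threat_pct": None, "affected_pct": 40},
    {"country": "C", "threat_pct": 30, "affected_pct": ""},
    {"country": "D", "threat_pct": None, "affected_pct": 25},
]

shares = missing_share(rows, ["threat_pct", "affected_pct"])
for column, share in shares.items():
    print(f"{column}: {share:.0%} missing")
```

If a column you care about is missing for a large share of countries, that is worth knowing before you build a story on it.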
✨ Tip: Try not to be too picky or neat with your exploratory visuals. Their objective is to help you figure out something interesting in the chaos. So let them be messy, and spend more time exploring the visuals than making them.
Step 4: Find the insight
This was difficult! We did this together and brainstormed on all our findings so far. We had a Tableau workbook pre-loaded with data before our discussions so that we could plot different dimensions and see what emerged. Doing, seeing, staring, writing, and thinking together helps more than thinking alone!
We plotted a bubble chart (scatterplot) with the percentage of people who considered climate change a serious threat on the Y-axis, and the percentage who had been affected by severe weather events, or personally knew someone who had, on the X-axis.
We expected this graph to show a strong relationship between the two. But it did not! It seemed counterintuitive, hence interesting.
We colored the bubbles by income group, and tada, our insight was in front of us. We excitedly smiled through our Zoom screens as we saw the colored bubbles cluster fairly distinctly in each quadrant.
Observe the top and bottom left of the scatter plot below. Countries in higher income groups considered climate change a more serious threat than others, despite very few people there being personally affected by severe weather conditions or knowing anyone who was.
Now check the top and bottom right. Countries of the lower-middle or lower-income group did not consider climate change a grave threat even when a higher percentage of people residing there had personally experienced or were affected by severe weather conditions.
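If you want to check a clustering like this without a charting tool, you can split the plot into quadrants and count income groups in each. The sketch below is a rough stand-in for what we saw by eye in Tableau: splitting at the median of each axis is an assumption, the field names are hypothetical, and the six data points are invented, not values from the poll.

```python
from collections import Counter
from statistics import median


def quadrant(x, y, x_cut, y_cut):
    """Name the quadrant of a point relative to the two cut lines."""
    vertical = "top" if y >= y_cut else "bottom"
    horizontal = "right" if x >= x_cut else "left"
    return f"{vertical} {horizontal}"


def tally_by_quadrant(points):
    """Count income groups per quadrant, splitting each axis at its median."""
    x_cut = median(p["affected_pct"] for p in points)
    y_cut = median(p["threat_pct"] for p in points)
    tally = {}
    for p in points:
        q = quadrant(p["affected_pct"], p["threat_pct"], x_cut, y_cut)
        tally.setdefault(q, Counter())[p["income_group"]] += 1
    return tally


# Invented points: X is % affected by severe weather, Y is % seeing a serious threat.
points = [
    {"country": "A", "income_group": "high", "affected_pct": 10, "threat_pct": 70},
    {"country": "B", "income_group": "high", "affected_pct": 15, "threat_pct": 65},
    {"country": "C", "income_group": "low", "affected_pct": 50, "threat_pct": 30},
    {"country": "D", "income_group": "low", "affected_pct": 45, "threat_pct": 35},
    {"country": "E", "income_group": "lower-middle", "affected_pct": 40, "threat_pct": 40},
    {"country": "F", "income_group": "upper-middle", "affected_pct": 20, "threat_pct": 60},
]

tally = tally_by_quadrant(points)
for name, counts in sorted(tally.items()):
    print(name, dict(counts))
```

When one income group dominates a quadrant in the tally, coloring the bubbles by that group will show the same cluster on the chart.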
Our inference: maybe people in the lower-middle or lower-income countries are unable to relate severe weather conditions to rapid climate change and fully comprehend the risk it poses. One possible reason could be a lack of information, and Nepal’s data pointed the same way.
And that inference, derived from that messy World Risk Poll data, became the protagonist of our data comic.
Step 5: Figure out the best visuals
We used three types of visualization/charts — Boxplot, Isotypes, and Scatterplot.
Why Boxplot:
We wanted to start our narrative by ranking all the risks based on responses and identifying climate change’s position. It was funnily surprising to find out that people are more scared of road accidents than climate change.
To show rank, we chose a boxplot instead of a bar chart because the data across countries was highly disparate. A boxplot shows the minimum, maximum, and outliers separately, which made it a more accurate visual than a bar chart.
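To see what a boxplot separates out, here is a small sketch that computes the same summary numbers for one risk across countries. It assumes the common 1.5 × IQR rule for flagging outliers, which may differ from the settings of your charting tool; the helper `boxplot_summary` and the sample percentages are hypothetical.

```python
from statistics import quantiles


def boxplot_summary(values):
    """Quartiles, whisker ends, and outliers using the 1.5 * IQR rule."""
    ordered = sorted(values)
    q1, q2, q3 = quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme values that are still inside the fences.
    inside = [v for v in ordered if low_fence <= v <= high_fence]
    outliers = [v for v in ordered if v < low_fence or v > high_fence]
    return {
        "whisker_low": inside[0],
        "q1": q1,
        "median": q2,
        "q3": q3,
        "whisker_high": inside[-1],
        "outliers": outliers,
    }


# Invented percentages of people worried about one risk, one value per country.
worried_pct = [20, 22, 25, 27, 30, 31, 33, 35, 80]

summary = boxplot_summary(worried_pct)
print(summary)
```

A bar of the average would hide the one country at 80; the summary above keeps it visible as an outlier while the median stays at 30.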
Why Isotypes:
What does your brain comprehend faster? 40% of people or 4 in 10 people? Isotypes solve for the latter; they visually show 4 in 10 people. Plus, it was a creative way to use comic figures to represent percentages.
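The arithmetic behind an isotype is just a conversion from a percentage to a count of icons. A minimal text sketch, assuming ten icons per row and plain characters standing in for the comic figures (both helper names are hypothetical):

```python
def filled_icons(percent, total_icons=10):
    """Number of icons to fill for a given percentage, rounded to a whole icon."""
    return round(percent / 100 * total_icons)


def isotype_label(percent, total_icons=10):
    """Turn a percentage into an 'n in 10 people' style phrase."""
    return f"{filled_icons(percent, total_icons)} in {total_icons} people"


def isotype_row(percent, total_icons=10, filled="#", empty="."):
    """Draw the row as text: filled icons first, then empty ones."""
    count = filled_icons(percent, total_icons)
    return filled * count + empty * (total_icons - count)


print(isotype_label(40))
print(isotype_row(40))
```

Rounding to whole icons loses some precision, so the exact percentage is still worth stating in the text next to the figures.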
Why Scatterplot:
If you want to understand the relationship between more than two variables, this is a great chart type to do it. It was a single visualization that explained our key insight succinctly.
The Characters:
We used the Gramener Comicgen tool to create the characters with the exact emotions that we wanted. Try it out; it is pretty cool and super simple to use.
Step 6: Write the narrative
I enjoyed this step the most! The anxiety of figuring out an insight was over; now, all we had to do was put together an engaging and entertaining narrative. Doing this together with Rasagy was double the fun. I remember how excited we were when coming up with the title. We kept the language conversational, making the story easy and simple to follow.
We also used the narrative to highlight the exact country names, percentages, etc., that we did not visually represent.
For example, we called out the insight about Nepal, where more than 50% have been affected by natural calamities, yet 45% chose “Don’t know” when asked if climate change is a severe threat. It helped convey that there is a lack of awareness and information, and that people need to realize our climate is changing rapidly each year, which is not a good sign.
The finer edits ensured no compromise on readability and clarity. We read our narrative out loud to remove redundancies. Neat margins and correct spacing made our comic reader-friendly.
And that is how we made our data comic! As mentioned earlier, this is not an exact recipe, just a few directional steps and tips learned from making this comic and spending 6 years analyzing data. Hope it helps you take the plunge too!