We’re more influenced by our experiences than by reading numbers. Yet, as product people, we’re relying more and more on quantitative data to make decisions.
This can get really dangerous. Quantitative data tells you most things, but it doesn’t tell you why somebody does something, so you could end up optimising for the wrong metric or making assumptions about customer behaviour without ever validating them.
When I worked at Tinybeans, we ran onboarding A/B experiments for a good six months. Based on quantitative input (specifically, surveys and heat maps) we increased an already good metric by 0.64%. I then did some contextual inquiries and usability tests, and we tweaked the experiment based on those findings, mainly by changing the terminology and the user flow. The conversion jumped up by 13%.
You can use qualitative data-gathering to validate startup ideas before you write a line of code, or you could use it to test large company decisions. There are plenty of times when it would be unethical to A/B test, or to scrape data. You know what would help with that? Qualitative data!
There’s a bunch of qualitative research tools you can use to learn about problem spaces, user motivations and mental models.
The one tool that I use 80% of the time is usability testing. It’s kinda like the Swiss army knife of user research, and I mean it in the best possible way. You’re heading out on a multi-day hike and your pack is heavy with the bare essentials. You don’t intend for things to go wrong, but it’s always best to be prepared, so pack the Swiss army knife, and use it if you have to.
I test with at least 10 people at a time. It’s inconclusive on the numbers, but you can find around 85% of your product’s usability problems. Along the way, you’ll also discover some mental models and customer insights.
If you’re going to gather data through talking to users, the biggest weakness will actually be you.
Here’s how to avoid the most common mistakes:
Stratify your sampling.
Choose the ratio of customers you want to test on very carefully. Do not test on your power users. They are not indicative of your greater audience. Instead, ask them if they have colleagues or friends they can refer, and test with those people.
Remain open to customers who sit outside of your assumptions. I once found a receptionist who was tasked with installing server software. Arguably, if we made the software easier for them, we’d make it easier for everyone.
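If it helps to make the ratio concrete, here is a minimal sketch of one way to turn segment proportions into recruiting quotas. The segment names and shares are hypothetical placeholders, not figures from any real product, and proportional allocation is only one possible way to choose the ratio.

```python
def recruit_quota(segment_shares, total_testers):
    """Split a fixed number of testers across segments in proportion
    to each segment's share of the audience (largest-remainder rounding)."""
    total_share = sum(segment_shares.values())
    exact = {
        name: share * total_testers / total_share
        for name, share in segment_shares.items()
    }
    # Start from the rounded-down quota for every segment.
    quota = {name: int(value) for name, value in exact.items()}
    # Hand any leftover seats to the segments with the largest remainders.
    leftover = total_testers - sum(quota.values())
    by_remainder = sorted(exact, key=lambda n: exact[n] - quota[n], reverse=True)
    for name in by_remainder[:leftover]:
        quota[name] += 1
    return quota


# Hypothetical audience split, in percent. Power users are left out on purpose.
shares = {"new": 50, "occasional": 30, "regular": 20}
quota = recruit_quota(shares, total_testers=10)
print(quota)
```

Treat the output as a starting point for recruiting, then adjust it by hand if a segment matters more to the decision you are trying to make.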
Don’t lead the witness.
This is hard because, as humans, we rely so much on body language. Non-verbal behaviours account for 60% of all interpersonal communication. It’s your job in qualitative research to not give any behavioural cues. Be pleasantly silent, make someone comfortable and assure them that we’re testing the product, not them.
Use their words, not your terminology. If somebody wants to call a search bar a shiny box of magic, that’s what you’re going to call it for the rest of your interactions with them.
Never, ever, ever point something out to the customer or correct them. It’s the equivalent of telling someone that they’re not on a placebo in a medical test. You’ll walk away with the wrong data, which results in the wrong metrics or, worse, the wrong product. Perhaps more importantly, our time on this earth is limited, and if you skew the data, you’ve wasted an hour of their time and an hour of your time. Don’t give away any cues, remain impartial to their actions, but help them feel comfortable in the test.
How do we go from insights to action items?
There are a couple of methods I like to use to make sense of my transcripts.
1. Adjusted-Wald calculation
This is a method for calculating a binomial confidence interval. At 95% confidence, if you were to conduct the same test 100 times, about 95 of the resulting intervals would contain the true completion rate. It’s useful when you’re trying to measure the impact of your findings.
This will only work if you have two successes and two failures in your test.
For each feature or task, do a binary pass/fail.
Let’s say 7/10 of your users were able to complete the task. That means 3 failed. Easy. With the Adjusted-Wald method, there’s a 95% confidence that 39–90% of your user base would succeed at the task. Now you have a number that you can make a decision on.
If as few as 39% of your customers could complete this task, would the sky fall in? If not, add it to the backlog and move on. If it’s crucial, iterate.
This method only works if you correctly stratify your sample. If you test with your power users or people who aren’t likely to use your product, these findings go out the window. So yes, choose wisely.
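For anyone who wants to check the arithmetic, here is a minimal sketch of the calculation as it is commonly defined (also known as the Agresti-Coull interval). The function name and the default `z` value of 1.96 for 95% confidence are my own choices for illustration.

```python
import math


def adjusted_wald(successes, trials, z=1.96):
    """Adjusted-Wald interval for a binary pass/fail task.

    Adds z^2 / 2 successes and z^2 / 2 failures (roughly two of each
    at 95% confidence) before computing an ordinary Wald interval.
    """
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials and trials > 0")
    n_adj = trials + z ** 2
    p_adj = (successes + z ** 2 / 2) / n_adj
    margin = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    # Clamp to the valid range for a proportion.
    return max(0.0, p_adj - margin), min(1.0, p_adj + margin)


low, high = adjusted_wald(successes=7, trials=10)
print(f"{low:.0%} to {high:.0%}")  # 39% to 90%
```

With 7 passes out of 10, this reproduces the 39–90% range quoted above.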
2. This is an in-between: it allows for more nuance, but it’s also more grey.
For every task, you can give it a scale. Sadly, that will likely involve a spreadsheet. I put each tester in a column and tasks or features in rows.
Each tester gets a score for each task:
- 0 Task completed
- 1 User’s suggestion
- 2 Relatively minor effect on task performance
- 3 Causes significant delay, incomprehension or frustration
- 4 Prevents task completion
I find this really helpful because it gives you context. Sometimes the most vocal customers will remain in your memory as having a real rough trot, but when you review the transcript, you find that they mainly had suggestions.
This scaling system helps me get out of opinions and put some facts around the qual data.
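If you would rather skip the spreadsheet, the same grid can be sketched in a few lines of code. The task names and scores below are made-up placeholders, and the summary columns (mean, worst score, number of testers blocked) are just one assumption about what is worth looking at.

```python
# The 0-4 scale from the list above.
SEVERITY = {
    0: "Task completed",
    1: "User's suggestion",
    2: "Relatively minor effect on task performance",
    3: "Causes significant delay, incomprehension or frustration",
    4: "Prevents task completion",
}

# Hypothetical data: one row per task, one score per tester.
scores = {
    "Task A": [0, 0, 1, 0, 2],
    "Task B": [3, 4, 2, 4, 3],
    "Task C": [0, 1, 1, 0, 0],
}


def summarise(scores):
    """Summarise each task and list the most severe ones first."""
    rows = []
    for task, ratings in scores.items():
        if any(r not in SEVERITY for r in ratings):
            raise ValueError(f"{task}: scores must be between 0 and 4")
        rows.append({
            "task": task,
            "mean": sum(ratings) / len(ratings),
            "worst": max(ratings),
            "blocked": sum(1 for r in ratings if r == 4),
        })
    return sorted(rows, key=lambda r: (r["worst"], r["mean"]), reverse=True)


summary = summarise(scores)
for row in summary:
    print(f'{row["task"]}: mean {row["mean"]:.1f}, '
          f'worst {row["worst"]}, blocked {row["blocked"]}')
```

Sorting by the worst score first keeps a task that blocked even one tester near the top, which a plain average can hide.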
Unfortunately, there’s no way to make a claim that you can extrapolate to the larger audience base, but that’s ok.
That’s it. Test your thing. Test often. Hunt for the “why” behind your customers’ actions. Be kind to each other.