“An undefined problem has an infinite number of solutions” – Robert A. Humphrey
I found that quote on the underside of a tea bottlecap. I think it’s germane to what we’re doing with Test Pilot.
But as the bottlecap warns, those massive piles of data won’t get us anywhere unless we know what problem we’re trying to solve with them.
There’s a temptation, when looking at a pile of data, to leap to design conclusions.
For example: Suppose studies show that almost nobody is using the profile manager. Great, that means we can get rid of the profile manager! Right?
Or, to use a non-hypothetical example:
(Thanks to Blake Cutler for generating all the graphs used in this post).
The most common Firefox session is one with three tabs open? Well hot dog, we should optimize Firefox for the three-tab use case! Right?
Careful, there. This kind of assumption is dangerous. Maybe lots of people would start using the profile manager if they knew it existed. Maybe people would love to have sessions with more tabs open, if the interface was better for managing lots of tabs.
Related to the problem of jumping to conclusions is the problem of overfitting. It’s easy to make a model that fits a given dataset perfectly… but it may be adapted so well to the quirks of the particular data that it has no generality, no ability to predict anything else. You can always get some kind of conclusion out of a data set, but the conclusion isn’t worth much unless you test it against new, independently collected data.
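To make the overfitting point concrete, here’s a minimal sketch (with invented data, not Test Pilot data): a “model” that simply memorizes every training point fits that data perfectly, yet fails badly on freshly collected points, while a boring least-squares line does fine on both.

```python
import random

random.seed(0)

# Invented data: y = 2x + noise. "train" is the data we fit on;
# "test" is independently collected data at new x values.
train = [(x, 2 * x + random.gauss(0, 1)) for x in range(10)]
test = [(x, 2 * x + random.gauss(0, 1)) for x in range(10, 20)]

# Overfit model: memorize every training point exactly.
lookup = dict(train)
def memorizer(x):
    return lookup.get(x, 0.0)  # clueless off the training set

# Simple model: least-squares line through the training data.
n = len(train)
mean_x = sum(x for x, _ in train) / n
mean_y = sum(y for _, y in train) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in train)
         / sum((x - mean_x) ** 2 for x, _ in train))
def line(x):
    return mean_y + slope * (x - mean_x)

def mse(model, data):
    """Mean squared error of a model over a dataset."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

print(mse(memorizer, train))  # exactly 0: a "perfect" fit to the quirks
print(mse(memorizer, test))   # huge: memorization doesn't generalize
print(mse(line, train))       # small but nonzero
print(mse(line, test))        # stays reasonable on new data
```

The memorizer wins on the data it was built from and loses everywhere else, which is exactly why a conclusion drawn from one dataset needs to be checked against new, independently collected data.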
The science books I remember reading as a kid typically defined the scientific method like this:
Observation -> Hypothesis -> Experiment -> Theory
Real science is a good deal more complicated than that, of course. But this is a good starting point.
What we’ve done so far with Test Pilot is the “Observation” step. We’ve got a glimpse into what all those copies of Firefox out there are actually doing for people, in a quantitative way that we’ve never had before.
Let’s look at one example observation. This graph shows the number of bookmarks a user has vs. their self-reported skill level (which we asked them to rate from 1 to 10 on a survey). The midline of the boxes represents the median for each user segment; the top of each box is the 3rd quartile, the bottom of each box is the 1st quartile. Lone dots are apparent outliers.
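For the curious, the numbers behind one box in such a plot can be computed directly. This is a sketch with invented bookmark counts, using one common outlier convention (points beyond 1.5 × IQR past the box), which may differ from the exact rule the plotting tool used:

```python
import statistics

# Invented bookmark counts for one skill-level segment.
bookmarks = [3, 7, 12, 15, 18, 22, 25, 31, 40, 200]

# quantiles(n=4) returns the three quartile cut points in order.
q1, median, q3 = statistics.quantiles(bookmarks, n=4)
iqr = q3 - q1  # the height of the box

# Flag points beyond 1.5 * IQR past the box as outliers (the lone dots).
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [b for b in bookmarks if b < low or b > high]

print(q1, median, q3)
print(outliers)
```

Here the single huge collection (200 bookmarks) lands outside the whiskers and gets drawn as a lone dot, while the quartiles summarize the bulk of the segment.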
Before we ran this study, we had a hunch that more “advanced” users rely on bookmarks less. This hunch came from anecdotal evidence: many people we’ve talked to have said that the more they use the Web, the more they rely on tabs and search, and the less they rely on bookmarks.
But from this graph, it appears that there is no correlation at all. It looks like self-reported level of technical skill does nothing to predict how many bookmarks someone will have.
Instead of jumping to conclusions, let’s consider other possible explanations of the data. Here are the possibilities I can think of:
1. Self-reported skill level is a meaningless number; it’s too subjective. If we instead took some more objective measurement of web skill, we’d find an interesting correlation.
2. People who have been using the web for a longer time indeed use bookmarks less and less over time, but our data doesn’t capture this because it doesn’t look at the age of the bookmarks — long-time web-users may have a bookmark collection going back to 1997 that they haven’t touched since 2004. If we look only at bookmarks made in the last six months, we’d find some interesting differences.
3. Everybody saves bookmarks at the same rate, but some people open their bookmarks often and other people barely ever open a bookmark. So instead of looking at the number of saved bookmarks, we should look at how often someone opens one as a better metric of bookmark use.
4. There really is no correlation between number of bookmarks and skill level: There are web-savvy heavy bookmark users, web-savvy light bookmark users, web-newbie heavy bookmark users, and web-newbie light bookmark users.
The next step is to design experiments to discern which of these hypotheses, if any, best fits the real world. A good experiment tries as hard as possible to disprove its hypothesis. If the hypothesis survives the experiment, we’re still not sure it’s correct, but we’ll have a little more confidence in it. In fact we can never be sure a theory is correct. Science is really all about understanding the limitations of your own knowledge.
If it turns out that hypothesis #4 is the best fit, then that tells us something interesting about our users which can inform our attempts to improve Firefox UI.
Here are a few other observations and some hypotheses that go with them:
There’s a positive correlation between time spent on the web (self-reported) and number of bookmarks. This should surprise nobody. But which way does the causation run?
Hypothesis 1: People create bookmarks at a certain rate as long as they’re on the web, therefore the longer you spend on the web, the more bookmarks you accumulate.
Hypothesis 2: Having more bookmarks makes you more likely to spend more time on the web, because all the links that show up in your awesome bar distract you with the promise of interesting things to read, and thus keep you procrastinating. (I know this happens to me.)
Here’s an observation from the accounts and passwords study that we recently concluded:
The x-axis shows the number of unique passwords stored in the password manager; the y-axis shows the number of users who have that many passwords stored.
A large number of users (the bar on the far left, which is 38.7% of the total users) have zero passwords stored in their Firefox password manager. What does this mean?
Hypothesis 1: These users simply don’t use sites requiring login information.
Hypothesis 2: These users don’t trust the Firefox password manager and prefer to remember their passwords themselves.
Hypothesis 3: These users have other software installed to manage their passwords, making the Firefox password manager redundant.
And so on. How would you design experiments to distinguish between these hypotheses?
You can see more charts of study observations here:
- A Week in the Life study results (bookmarks, sessions)
- Tab open/close study results
- Accounts and Passwords study results
Take a look at that and see what other hypotheses occur to you. What else should we be testing for?