I can always tell whether someone understands statistical research or not by describing Test Pilot to them. If their very first question is “What about the self-selection bias?” then they understand statistical research!

Self-selection bias is the bias that creeps into your results when the subjects of your study are people who choose to be subjects. It’s a bias because a group of people who choose to be subjects is not the same as a random sample of the population.

[Image: Amazon screenshot showing only five-star and one-star reviews]

Think of product reviews on Amazon: they’re mostly 5-star reviews or 1-star reviews, because only the people who really love something or really hate it are motivated enough to write a review! If you randomly sampled the people who had read a given book, you might find that the majority of them found it mediocre – but indifferent people don’t take the time to write reviews. The self-selection of the reviewers skews the average rating.

This is the same reason why dialing random telephone numbers gives you better poll results than setting up a poll on a website – the telephone polling is (closer to) a random sample, while people who answer the website poll are self-selecting.

The relevance to Test Pilot is obvious. Only people who chose to install the Test Pilot extension or the Firefox 4 Beta get the studies; and only people who click “submit” send results back to Mozilla.

Therefore, it would be a mistake to rely only on Test Pilot submissions when redesigning Firefox UI. We’d be over-tailoring the UI to a particular subset of users, while potentially making it worse for the “silent majority” of users not represented in the sample.

Just how skewed is our sample, anyway?


The results of a survey in March 2010 (which was taken only by users of the Test Pilot extension) gave us a portrait of users who were:

  • More likely to be Linux users…
  • More likely to self-describe as “Tech-savvy”…
  • More likely to be in the 18-35 age range…
  • Much more likely to be male…
  • Likely to spend 4-8 hours a day using the Web…
  • More likely to be using Chrome in addition to Firefox…
  • Much more likely to have been using Firefox for 4 or more years…

than either the general Internet-using public, or the general Firefox user base. Additionally, since Test Pilot has so far been released only in English, we are limiting ourselves to English-speaking test subjects (although given that limitation, we were pleasantly surprised by how many respondents we had from Europe, especially Germany, France, Poland, and Russia).

A picture begins to emerge of a certain type of user — extremely active on the Web; experienced and comfortable with a variety of web browsers and operating systems; having a desire to be on the cutting edge, or at least experiment with software on the cutting edge, before it becomes mainstream.

Let’s call this user group the Early Adopters. It shouldn’t be surprising that Early Adopters would know about Test Pilot before other types of users, and would be more interested in trying it out.

We expected that the sample from the Test Pilot extension would be dominated by Early Adopters. We hoped that extending Test Pilot to all the Firefox 4 Beta users would help. Thanks to the beta, we now have nearly a million users instead of 10-12,000, which is fantastic. But by and large, they’re the same type of users — the beta user base is just as skewed towards Early Adopters. For more details, see the Metrics blog post: Who are our Firefox 4 beta users?

Having a lot of Early Adopters in the sample isn’t the problem: the problem is that we may be missing out on submissions from all those other types of non-Early-Adopter users. While it’s great that we have so many users who have been using Firefox for over 4 years, we need to balance those submissions out with some submissions from Firefox newbies. Relying too much on data from people who are already intimately familiar with using Firefox won’t help us make it easier to learn for people who are coming to Firefox for the first time. It could even hinder us.

Subsampling

I believe that self-selection bias is the single biggest obstacle to the success of Test Pilot as a research program. It’s big and it’s hard to deal with.

The one sure way to defeat self-selection bias would be to instrument every copy of Firefox, pick a random subset of users to observe, and send the results back to Mozilla automatically. I don’t want to do that. I’m proud of the way we respect user privacy in Test Pilot. I think informed user consent is the ethical way to collect data, and I stand behind the way we’ve run our studies so far.

So how else can we defeat self-selection bias? Just getting more users in the sample won’t help. We have to better understand the users who are already in the sample. The approach I prefer is called subsampling or survey weighting. This is a difficult road but I think it’s the right one.

The principle is easy to understand. Suppose you have a survey where the sex of the respondent is important for some reason. But due to self-selection bias, you have a sample that’s 90% male and only 10% female. What do you do?

You can subsample your sample: Randomly pick some women from your 10% female respondents and randomly pick an equal number of men out of the 90% male respondents. Your subsample will be 50-50, like the general population that you’re trying to study. It will be much smaller than your original, biased sample, but if that sample was large enough, then this might not be a problem.
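If it helps to see the mechanics, here’s a minimal sketch of that subsampling step in Python; the field names and the toy data are invented for illustration, not anything Test Pilot actually records:

```python
import random

def balanced_subsample(respondents, seed=42):
    """Randomly draw an equal number of male and female respondents.

    `respondents` is a list of dicts with a hypothetical 'sex' field --
    a stand-in for whatever grouping variable the survey cares about.
    """
    random.seed(seed)
    women = [r for r in respondents if r["sex"] == "female"]
    men = [r for r in respondents if r["sex"] == "male"]
    n = min(len(women), len(men))  # limited by the under-represented group
    return random.sample(women, n) + random.sample(men, n)

# Toy sample that suffers from self-selection bias: 90% male, 10% female.
sample = [{"sex": "male"}] * 900 + [{"sex": "female"}] * 100
subsample = balanced_subsample(sample)
print(len(subsample))  # 200 respondents, now 50-50 by sex
```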

With Test Pilot, we’re not particularly interested in dividing people up by sex, or age, or any other purely demographic information. We don’t care about that stuff except to the extent that it helps us gauge the bias of our sample. What we’re interested in is having representation of the different ways that people use Firefox.

There are two things that make correct subsampling of Test Pilot data much more difficult than my simplistic male-female example.

First is that we’re not sure what the meaningful groupings are. It’s not “men” and “women”. What is it that we’re looking for, exactly? We’re probably oversampling “early adopters”, but early adopters as opposed to whom? What group (or groups) are we under-sampling?

There was a survey question asking Test Pilot users to describe their own level of technical proficiency, but that answer isn’t very reliable: it’s so subjective and relative (different people have very different ideas of what it means to be “advanced”) that it’s almost meaningless to compare across samples.

The second problem is that even when we identify the groups we want to account for, we need some way of estimating their prevalence in the general Web user population, so we know what weights to assign. We know that the real world is something like 51-49 female-male, but what fraction of the general web user population is early adopters? How can we measure this? We can’t use the Test Pilot data to tell us this; we need an independent observation that we can compare our Test Pilot data to. Something to calibrate against, if you will.
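For what it’s worth, the arithmetic behind “what weights to assign” is simple once you have those independent numbers. Here’s a sketch with invented figures (the 51-49 split is from above; the 10-90 sample split is hypothetical):

```python
def survey_weights(population_share, sample_share):
    """Weight each group by how under- or over-represented it is in the sample."""
    return {g: population_share[g] / sample_share[g] for g in population_share}

# Hypothetical numbers: women are ~51% of the population but only 10% of our sample.
weights = survey_weights(
    population_share={"female": 0.51, "male": 0.49},
    sample_share={"female": 0.10, "male": 0.90},
)
print(weights)  # each female response counts ~5x; each male response ~0.54x
```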

Who are we missing… and how many of them?

The usual marketing research term for “types of user” is “personas”, but there’s a Firefox feature called Personas, so to avoid confusion I’m going to keep referring to them as “user types” or “user groups”.

The usual folk taxonomy is to divide users into expert vs. beginner, also known as “techie” vs. “mainstream”, or “developers like us” vs. “my mom”. I don’t find this one-dimensional scale particularly useful. I think there are multiple dimensions and multiple groups of users who are not simply “good with computers” or “bad with computers”, but rather are experts at different things. Picture the following imaginary characters:

Extreme Socializer: Is part of eight different social networks and carefully prunes each one to prevent information leaking between non-overlapping friend groups; an expert at presenting him- or herself online, managing reputation, keeping people in the loop, and expressing feelings and relationships through text.

Political blogger: Is subscribed to dozens of RSS feeds from news sources so that he can find out immediately when somebody in the government says something stupid; pounces on it immediately, blogging and tweeting a link with snarky commentary. Spends a lot of time moderating blog comments, trading links, self-promoting, quoting other blogs (and dissecting their arguments line-by-line!) and looking up charts of long-term economic trends in order to support an argument.

Web developer: Has Firebug installed, has a tab permanently open to jQuery documentation and another to a CSS reference; runs multiple browsers at once to test how the site looks in each one; reads tech news and design blogs; likes trying out demos of new standards and new coding techniques; views web pages with an eye to critiquing their design and functionality; can name any font on sight (and has a burning hatred for Comic Sans and Papyrus); etc.

Modest web user: Uses the internet mainly for e-mail, which he/she reads through a free webmail client. Is not clear on the difference between “the web” and “the internet”. Mainly uses the web to view links that people send him/her, or to get information on practical things such as researching a medical condition or printing out directions to somewhere. Knows how to Google things and has a few favorite bookmarks, but isn’t interested in exploring the web just for the sake of exploring.

The web developer, political blogger, and socializer are all experts in their own domains. The web developer might not know the first thing about social networks or news feeds; the political blogger might not know the first thing about HTML or friending people, but understands how to deal with trackback spam; etc. It would be silly to try to rank them or say that this person is more of a web expert than that person.

At the same time, it’s wrong to call the Modest Web User a “newbie” or a “beginner” – he/she may have been using the web this way for ten years! He/she has found a way of doing things that works, and sees no reason to put a lot of effort into learning more.

These four examples are by no means an exhaustive list. What makes this problem such a Gordian knot is that we don’t know what the distinct user groups are that matter, so we don’t know who we’re under-representing, let alone how much we’re under-representing them by. Maybe there’s a very important type of user behavior that we’ve never even considered.

So we can’t rely on pre-defined categories to segment the user base for subsampling. Our a priori categorization would be little better than guesswork, and probably wrong.

Instead, let’s look at the data and see whether we can identify clusters or patterns in the data points that look like evidence of real, recognizably different user behavior groupings.

Empirical Usage Patterns

For instance, here’s something that looks promising. There’s a bimodal distribution in the number of keyboard shortcuts that a user used over the course of a week. For the most part, a user either doesn’t use keyboard shortcuts, or he/she uses hundreds of them. There’s very little in-between.

That makes me think that using keyboard shortcuts or not using keyboard shortcuts is a proxy metric that identifies a real qualitative difference in usage style between distinct user groups. The nice thing about keyboard shortcut usage is that it’s objective, it’s easy to measure, and it’s closely related to how people actually interact with their computers.
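To sketch how that proxy metric might turn into group labels (the threshold and the counts below are made up; the real cutoff would come from the observed distribution):

```python
def shortcut_group(weekly_shortcut_count, threshold=20):
    """Label a user's style from their weekly keyboard shortcut count.

    Because the distribution is bimodal (near zero or in the hundreds),
    almost any threshold in between assigns the same labels.
    """
    return "keyboard" if weekly_shortcut_count >= threshold else "mouse"

weekly_counts = [0, 2, 0, 350, 1, 540, 0, 275]  # invented weekly totals
print([shortcut_group(c) for c in weekly_counts])
```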

The next step is to test this idea by doing some regressions of keyboard shortcut usage to see whether it correlates strongly with other factors – do heavy keyboard shortcut users also tend to have a lot of tabs open, for instance? Does shortcut usage predict other things about your web browsing behavior? Or does it not have much to do with anything except keyboard shortcuts?
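A minimal version of that check might look like this, assuming we had per-user totals for shortcut usage and, say, maximum tabs open (the data here is invented):

```python
from statistics import correlation  # Python 3.10+

# Invented per-user measurements: weekly shortcut count and max tabs open.
shortcuts = [0, 2, 350, 1, 540, 0, 275, 5]
max_tabs = [4, 6, 35, 3, 60, 5, 28, 7]

r = correlation(shortcuts, max_tabs)
print(f"Pearson r = {r:.2f}")  # a strong r would support using shortcuts as a proxy
```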

If it turns out there are strong correlations, and shortcut usage is a useful way of assigning people to user groups, then we could easily start including a “total keyboard shortcut count” as metadata with each Test Pilot study. Then we could use that to re-sample our data. We’d just need an independent measurement of keyboard shortcut usage among the general Web user population to compare it to.

Another promising avenue is installed extensions. It’s pretty easy to identify that, for instance, someone with Firebug or DOM Inspector installed is probably a web developer. Let’s do a multivariate analysis of installed extensions and see if there are clusters around particular extensions or groups of extensions. If certain extensions turn out to be strong predictors of certain web browsing behavior, then that could also be a useful way of dividing and re-sampling people.
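One way to start on that kind of analysis, sketched with k-means on a binary user-by-extension matrix (the extension list, the data, and the choice of two clusters are all invented for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

extensions = ["Firebug", "DOM Inspector", "Adblock Plus", "Greasemonkey"]

# Rows are users, columns are extensions (1 = installed). Invented data.
X = np.array([
    [1, 1, 0, 1],  # looks like a web developer
    [1, 1, 1, 1],
    [0, 0, 1, 0],  # ad-blocking only
    [0, 0, 1, 0],
    [0, 0, 0, 0],  # no extensions at all
    [0, 0, 1, 1],
])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for row, label in zip(X, labels):
    print(label, dict(zip(extensions, row.tolist())))
```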

The best part about extensions is that we already have an independent observation to compare against. Because installed extensions periodically ping addons.mozilla.com to check for updates, the addons.mozilla.com logs can tell us the overall popularity of those extensions in the wider Firefox user base. We already have a calibration source for extension popularity; we could very easily use it for accurate subsampling, once we figure out which extensions, if any, are useful predictors.

I hope to shift the focus of my work a little bit in the next couple of months, away from coding the Test Pilot extension and towards the type of analysis I outlined in this post. We’re just beginning to brainstorm strategies for subsampling. It’s uncharted territory for me; I don’t have any clear conclusions yet – just some ideas about where I want to explore.