Laura Faulkner wrote an interesting article, “Beyond the five-user assumption: Benefits of increased sample sizes in usability testing.” I just heard about it at the UIDesigner blog, but the article has been around for a while (published in Behavior Research Methods, Instruments, & Computers in August 2003).
In the article, Faulkner argues that “the risk of relying on any one set of 5 users was that nearly half of the identified problems could have been missed; however, each addition of users markedly increased the odds of finding the problems.” Here are some other interesting quotes from the article:
“the average percentage of problem areas found in 100 trials of 5 users was 85%” – this agrees with Nielsen’s claims of 5 users catching 85% of the problems. However…
“The percentage of problem areas found by any one set of 5 users ranged from 55% to nearly 100%. Thus, there was large variation between trials of small samples.” – wow! That’s one WIDE variation!
“groups of 5 found as few as 55% of the problems, whereas no group of 20 found fewer than 95%.” – the hint here is that more testers will find more problems.
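To get a feel for how an 85% average can hide that much spread, here is a rough Python sketch. It is NOT Faulkner's method (her numbers come from real test sessions). It's a toy model that assumes every problem is equally easy to spot and that each user has a 31% chance of hitting any given problem, which is the typical value Nielsen quotes in his article. The function names, the 50-problem count, and the random seed are all made up for illustration.

```python
import random

# Assumption: chance that a single user runs into a given problem.
# 31% is the typical value Nielsen cites; your own product may differ.
L = 0.31


def expected_share(n_users, p=L):
    """Average share of problems found by n_users: 1 - (1 - p)^n."""
    return 1 - (1 - p) ** n_users


def simulate_trial(n_users, n_problems, p, rng):
    """Share of problems found in one simulated test session."""
    found = 0
    for _ in range(n_problems):
        # A problem counts as found if at least one user hits it.
        if any(rng.random() < p for _ in range(n_users)):
            found += 1
    return found / n_problems


def spread(n_users, trials=100, n_problems=50, p=L, seed=1):
    """Return (worst, average, best) share found across many trials."""
    rng = random.Random(seed)
    shares = [simulate_trial(n_users, n_problems, p, rng) for _ in range(trials)]
    return min(shares), sum(shares) / len(shares), max(shares)


if __name__ == "__main__":
    for n in (5, 10, 20):
        worst, avg, best = spread(n)
        print(f"{n:2d} users: formula {expected_share(n):.0%}, "
              f"simulated worst {worst:.0%}, avg {avg:.0%}, best {best:.0%}")
```

With 5 users the formula gives about 84%, which is where the familiar “85%” comes from. The worst and best trials in the simulation will sit on either side of that average, and the gap should narrow as you add users. Real tests probably vary more than this toy model does, since real problems aren't all equally easy to find.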
Faulkner then goes on to discuss how her study still supports Nielsen’s claims, but argues that usability practitioners are wrong to treat a 5-user test as something that will always catch 85% of the problems.
The thing I found interesting was this: Faulkner’s article is based on using 5 users in single usability tests. But Nielsen doesn’t actually say to do that! In his article “Why You Only Need to Test With 5 Users” (I know, it sure SOUNDS like he recommends only using 5 users) he actually recommends doing “three tests with 5 users each” and correcting any errors found between each test. That’s VERY different from just testing 5 users, don’t you think?
I think Faulkner is REALLY talking about the perception that usability testers only do single tests, hoping to catch most problems, and then move on to other things. And it’s certainly possible that some testing is done this way. It’s just that, well… Nielsen didn’t suggest doing that, even though it’s the premise of Faulkner’s article.
Here’s what Nielsen actually says to do:
- First test with 5 users: you’ll catch an average of 85% of usability problems
- THEN FIX THOSE PROBLEMS!!!
- Second test with 5 users: tests the corrections made from results of the first test, catches stuff your first 5 users might not have caught, and even better, “the second test will be able to probe deeper into the usability of the fundamental structure of the site, assessing issues like information architecture, task flow, and match with user needs.” So the second test checks your fixes and probes much deeper than the first test.
- THEN FIX ANY PROBLEMS FOUND DURING THE SECOND TEST!!!
- Third test with 5 users: you just fixed more problems, so you need to test those fixes out… hence the third test.
- THEN FIX ANY MORE PROBLEMS!!!
One other difference I noticed: Faulkner’s focus is on catching all usability problems through testing, but Nielsen’s usability testing goals are different. His goal is “to improve the design and not just to document its weaknesses.” It’s possible that in a life-or-death, mission-critical (egad, did I just write “mission-critical?”) product (like healthcare or flight equipment) with high regulatory standards to meet, documenting weaknesses and catching EVERYTHING would be important. But remember – I’m a library web manager! This isn’t a life-or-death situation, and I’m not a brain surgeon.
Is there a big difference between testing 20 users and testing multiple groups of 5 users? I’ll let the Research Methods people figure that one out. But it sure seems like three groups of 5 users for usability testing still works just fine, and catches most, if not all, web usability problems.