Guide to Measuring Usability: A Practical Framework

1. Scenarios (The Foundation of Testing)
Scenarios are an essential design tool and an indispensable prerequisite for usability studies; write them before running any test sessions.
Scenarios describe the context and the story behind why users seek out your product. They give participants a realistic goal, which helps you understand how they actually accomplish their tasks. In usability testing, you need to simulate a believable situation so that users naturally replicate their real thought processes and behaviors while completing the test.
A well-crafted scenario must be concise, easy to understand, and answer four main questions:
Who is the user? Use personas or information gathered from desk research to understand your product's target users.
Why do they come to your product? Identify the expectations or motivations that drive them to use it.
What is their goal? Through analysis, determine the specific tasks users perform; this helps you identify necessary features or guide users toward existing ones.
How do they achieve that goal? By asking users to execute the scenario directly on the product during testing, you can observe exactly how they complete their tasks.
Types of Scenarios:
Elaborated Scenarios: This type is mostly used in usability testing; it describes detailed user stories to help identify specific barriers they face while using the product.
Example: Nam is a busy professional who needs to open a beneficiary account to easily provide financial support for his child who recently went away to college. He finds it highly inconvenient to waste a whole working day registering at a physical bank branch. A colleague tells him about Ebank, which allows online registration followed by a quick 30-minute card pickup at branches open until 9 PM. Consequently, Nam downloads the application from the App Store to register.
Goal- or Task-Based Scenarios: These are shorter and also frequently utilized. Because they are brief and contain less information, they are particularly useful when you want to evaluate information architecture.
Example: You want to take your partner out for dinner tonight, so you need to find nearby restaurants and check their prices.
2. Task Completion Rate & Statistical Confidence
The task completion rate is a widely recognized and fundamental usability metric: if most users cannot complete the tasks they came for, the product holds little practical value for them.
Calculating Required Sample Size
Since it is impossible to test every single user, choosing the right number of participants matters. According to MeasuringU, a company specializing in quantitative usability research, statistics across 1,189 tasks show an average completion rate of approximately 78%, so a completion rate of around 70-80% for your product is entirely acceptable. Once you have this baseline, you can calculate the confidence interval using statistical theory or MeasuringU's online calculators.
1. The Adjusted Wald Method: Suppose you aim for a 95% confidence level and test a task on 8 users, all 8 of whom succeed. Because the sample size is small, the Adjusted Wald method is the appropriate way to estimate the interval. The calculation yields a 95% confidence interval for the completion rate of roughly 70% to 100%; in other words, you can be reasonably confident that the completion rate across the whole user population is at least around 70%, which is a solid baseline for most tests.
Ref: https://measuringu.com/calculators/wald/
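The Adjusted Wald interval is straightforward to compute yourself. Below is a minimal Python sketch of the textbook formula; note that MeasuringU's calculator applies extra corrections when the observed rate is exactly 0% or 100%, so its bounds can differ slightly from this version.

```python
import math

def adjusted_wald_ci(successes, trials, confidence=0.95):
    """Adjusted Wald confidence interval for a binomial proportion,
    suited to the small samples typical of usability tests."""
    # Two-sided z-score for the chosen confidence level (1.96 for 95%)
    z = {0.90: 1.645, 0.95: 1.960, 0.99: 2.576}[confidence]
    # Adjust the counts: add z^2/2 successes and z^2 trials
    n_adj = trials + z ** 2
    p_adj = (successes + (z ** 2) / 2) / n_adj
    margin = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    # Clamp the interval to the valid [0, 1] range
    return max(0.0, p_adj - margin), min(1.0, p_adj + margin)

low, high = adjusted_wald_ci(8, 8)  # the 8/8 example above
print(f"95% CI: {low:.0%} to {high:.0%}")
```

For 6 of 8 successes, the same function reproduces the broad roughly 40-94% interval discussed in the iterative-testing strategy below.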
2. Iterative Testing Strategy: Building on the example above, to balance cost against actionable results, it is best to test in small rounds of about 8 users. If the resulting confidence interval is too broad (for instance, roughly 40-93% when only 6 of 8 users complete the task), test another batch of 8 users until the interval is narrow enough to support your conclusion.
Data Collection Methods
Data collection is straightforward: use a two-person team in which one person guides the user through the task while the other observes and records the results, or use a screen-recording tool such as Camtasia or Vysor to capture the session for later review and note-taking.
3. Task Time
Task time is not used as an independent metric to evaluate usability; rather, it is typically utilized to assess efficiency after design iterations, or to diagnose and form hypotheses about user interface issues. Because it does not require heavy resources and can be tracked simultaneously alongside task completion rates or errors, recording task time is highly recommended.
Calculation and Reporting
While the arithmetic mean is the usual way to summarize varied results, task-time data is typically right-skewed: because of external factors, an occasional user takes far longer than normal to complete a task, creating outliers that drag the mean upward. You therefore need a different kind of "average".
MeasuringU recommends two approaches:
Geometric Mean: best suited for small samples (fewer than about 25 users).
Median: most appropriate for larger samples of 25 or more.
Note regarding collection: Since external factors heavily influence test results, you must collect data under identical conditions (e.g., in the same lab environment, utilizing the exact same think-aloud protocol) to ensure the comparisons remain meaningful.
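To illustrate why the choice of average matters, this short Python sketch (using hypothetical timing data) contrasts the geometric mean with the arithmetic mean when one outlier is present:

```python
import statistics

def average_task_time(times_sec, large_sample_cutoff=25):
    """Summarize task times, which tend to be right-skewed: geometric
    mean for small samples, median once the sample reaches the cutoff."""
    if len(times_sec) >= large_sample_cutoff:
        return statistics.median(times_sec)
    return statistics.geometric_mean(times_sec)

# Hypothetical timings: one slow outlier (180 s) barely moves the
# geometric mean but pulls the arithmetic mean up sharply.
times = [32, 41, 38, 29, 35, 180]
print(round(average_task_time(times), 1))  # geometric mean
print(round(statistics.fmean(times), 1))   # arithmetic mean, for contrast
```

Here the geometric mean stays close to the typical user's time (around 46 s) while the arithmetic mean is dragged toward 59 s by the single outlier.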
4. Measuring User Satisfaction
Task-Level Satisfaction
To measure this, ask a question immediately after the user completes each task. The easiest approach is the Single Ease Question (SEQ), a widely adopted 7-point scale. Based on MeasuringU statistics spanning roughly 400 tasks and 10,000 users, the average score usually falls between 5.3 and 5.6, a benchmark you can use to evaluate your own product's usability.
Test-Level Satisfaction
Similar to the task level, test-level satisfaction provides a broader measurement regarding the user's satisfaction with the entire testing session. These questions are administered at the end of the complete test.
For most web and mobile applications, the System Usability Scale (SUS) questionnaire is recommended. This relies on a 5-point scale ranging from "strongly disagree" to "strongly agree".
Calculating SUS: Let the score for each answer be X. Every odd-numbered question (1, 3, 5, 7, 9) is calculated as (X - 1). Every even-numbered question (2, 4, 6, 8, 10) is calculated as (5 - X). The sum of all these results multiplied by 2.5 yields the final SUS score. Statistically, the average SUS score is around 68, which serves as a general benchmark for product usability.
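The scoring rule above can be written as a few lines of Python; the ten example responses are invented for illustration:

```python
def sus_score(responses):
    """Compute a System Usability Scale score from ten answers on a
    1-5 scale, ordered question 1 through question 10."""
    if len(responses) != 10 or not all(1 <= x <= 5 for x in responses):
        raise ValueError("SUS requires ten responses on a 1-5 scale")
    total = 0
    for i, x in enumerate(responses, start=1):
        # Odd-numbered items score X - 1; even-numbered items score 5 - X.
        total += (x - 1) if i % 2 == 1 else (5 - x)
    return total * 2.5  # scale the 0-40 sum to the 0-100 range

print(sus_score([4, 2, 4, 2, 5, 1, 4, 2, 4, 2]))  # → 80.0, above the ~68 average
```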
5. Error Tracking
Errors are inevitable when users perform tasks; tracking them precisely identifies the issues users face and their severity.
Collection & Calculation
Similar to task completion rates, you should take notes while observing the user. It is important to record the exact number of times a user makes an error, even when they repeat the same mistake in quick succession. For example, a user might click an unclickable area 5 times within one minute; recording all 5 clicks accurately describes their true experience in that moment.
In some scenarios, knowing whether any errors occurred during a task is sufficient for reporting. You can therefore convert the data into a binary error rate and calculate it just like the task completion rate; this sacrifices some granularity but often provides enough information for a summary report.
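A minimal sketch of that conversion, using made-up per-user error counts:

```python
def binary_error_rate(error_counts):
    """Collapse per-user error counts into the proportion of users who
    made at least one error during the task."""
    if not error_counts:
        raise ValueError("no observations")
    users_with_errors = sum(1 for c in error_counts if c > 0)
    return users_with_errors / len(error_counts)

# Hypothetical counts: misclicks per user on a single task.
counts = [0, 5, 1, 0, 0, 2, 0, 0]
print(binary_error_rate(counts))  # → 0.375, i.e. 3 of 8 users erred
```

The complement (the error-free rate) can then be analyzed with the same confidence-interval approach used for task completion rates.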