User Studies for Enterprise Tools: HCI in the Industry


Enterprises, and organizations in general, dedicate considerable effort to building custom tooling for business operations: dashboards, custom UIs for niche systems, tools that make complex algorithms accessible, and so on. Assessing the quality of such tooling is important. In HCI courses and mainstream HCI research, controlled user studies are the most popular way of evaluating a tool's effectiveness.

A controlled user study, in this context, is designed around a task the tool supports and the user population the tool is meant to target. Different conditions of the tool are also designed so that there is some form of baseline to compare the tool against. The tool is then measured by how well users accomplish the task with it: metrics such as the time a user takes to complete the task are collected and used to compare the different conditions of the tool.
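To make this concrete, below is a minimal sketch (in Python, with entirely hypothetical completion times) of how one such metric, task duration, might be compared across a baseline condition and a new condition in a between-subject design:

```python
# A minimal sketch of comparing task completion times across two
# conditions of a tool. All numbers are hypothetical, for illustration.
from scipy import stats

# Task completion times in minutes, one entry per participant.
baseline_times = [14.2, 11.8, 16.5, 13.1, 15.0, 12.7]
new_tool_times = [9.4, 10.1, 8.8, 11.2, 9.9, 10.5]

# Welch's t-test: do users complete the task faster in the new condition?
t_stat, p_value = stats.ttest_ind(baseline_times, new_tool_times,
                                  equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

In practice, the statistical test would be chosen to match the study design, e.g. a paired test for a within-subject study.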

However, there is a gap between what is typically taught in HCI courses and the practicalities of HCI in an industry, enterprise setting. In this short blog, I will outline some insights I have gained working in industry as an HCI researcher on a diverse team of NLP and database researchers dedicated to conversational and language AI systems and their evaluation.

Image created by Gemini.

I use the term Tooling in this short blog post to refer to a generic UI-based tool that enables users to complete tasks in an industry setting, e.g. a dashboarding tool that visualizes an AI model's inputs and outputs, a tool for creating custom datasets for niche customer problems, a tool for extracting insights from a large dataset of documents, etc.

Assessing tooling using quantitative methods popular in HCI research is typically not practical in an enterprise/industry setting.

Most HCI textbooks and HCI research emphasize quantitative methods for evaluating tooling, usually focusing on within-subject and between-subject studies. Conducting such studies is common in HCI research – a paper titled Evaluator Strategies for Toolkit Research, which examines 68 research papers on toolkit evaluation, confirms this.

In academia, the practicality of conducting such studies makes sense; for instance, universities offer easy access to students as qualified human subjects. However, conducting such user studies may not be practical in an enterprise setting for various reasons, or what I call business constraints.

Business constraints restrict the types of HCI research that can be conducted for enterprise tooling

Business constraints limit the ability to conduct the kind of controlled studies taught in HCI textbooks and popular in mainstream HCI research. These constraints include:

  • Constraints on delivery timelines: it takes considerable time to design a user study, pilot it, run it with human subjects, and finally analyze the results. In an enterprise setting, that process is often followed by another round of user studies building on the previous results. The full cycle of a controlled user study is typically longer than that of a qualitative study, yet the tooling is often simultaneously being used by customers and stakeholders whose expectations and deadlines operate on a much shorter timescale than a controlled study requires.
  • Availability of human subjects: users recruited to study a proprietary tool usually need access to that proprietary information and tooling in the first place. This limits the pool to people who may already be users of the tool itself. For some controlled studies, it is desirable to eliminate bias from prior "practice" with the tooling, or from any exposure to it ahead of the study. The limited subject pool within the enterprise restricts the HCI researcher's access to human subjects with the qualities ideal for a user study of the tooling.
  • Resource restrictions on researcher availability: a single researcher running eight sessions with eight different human subjects is time-consuming, especially when one session often takes one to two hours, and the analysis typically takes longer than the sessions themselves. Moreover, businesses often prioritize hiring the technical people who primarily build the tooling and who do not necessarily have an HCI background, so the burden of bringing user perspectives into the development process falls on the handful of people who do.

Yet user studies remain important, especially for enterprise tools. The user journey and the perception of a tool can make or break its success and feed directly into primary business metrics, yet academic HCI practices are not always applicable in such settings.

Tooling created in business settings often supports complex business workflows, which makes designing a task for a small, controlled user study hard (and often impractical)

While controlled user studies are typically designed around small tasks that can be completed in under an hour, much of the custom tooling in enterprise settings is built for niche workflows that are often complex:

  • A tool for identifying suspicious errors in a dataset containing millions of logs (like DTTool from IBM Research): the tool suggests suspicious logs for a user to inspect. Inspecting a single log can take a minute or so, but finding a log containing a significant error worth fixing may take several hours.
  • An interactive tool for building efficient AI models, such as Talaria from Apple: building such models is complex and time-consuming, and it requires several iterations.

A controlled user study built around such a task would force human subjects to go well beyond the standard of two hours per session. What would be a good task that a human subject could complete in under an hour in a controlled user study?

Yet, for these tools, there is a human perspective that is important to capture and measure quantitatively for effectiveness.

Qualitative and alternative methods are the most practical way to evaluate enterprise tooling under business constraints

These include methods such as analyzing user logs, formative studies, qualitative user studies, interviews, focus groups, surveys, and observational studies. Such methods demand far less time and energy from the target users (often employees within the company, or consumers) than controlled user studies do.
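As an illustration of the first of these methods, here is a minimal sketch of user-log analysis; the log schema (user_id, event, timestamp) and the use of an "export" event as a proxy for task success are assumptions made purely for the example:

```python
# A minimal sketch of user-log analysis. The schema and events are
# hypothetical; real enterprise logs will look different.
import pandas as pd

events = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u2", "u2", "u3"],
    "event": ["open", "export", "open", "filter", "export", "open"],
    "timestamp": pd.to_datetime([
        "2025-01-06 09:00", "2025-01-06 09:12",
        "2025-01-06 10:00", "2025-01-06 10:03", "2025-01-06 10:25",
        "2025-01-07 14:30",
    ]),
})

# Session length per user: time between first and last logged event.
session_len = (events.groupby("user_id")["timestamp"]
               .agg(lambda ts: ts.max() - ts.min()))

# How many users ever reach the "export" step (a proxy for task success)?
reached_export = events.loc[events["event"] == "export", "user_id"].nunique()
print(session_len)
print(f"{reached_export} of {events['user_id'].nunique()} users exported")
```

Unlike a controlled study, this kind of analysis runs on data the tool already produces, so it costs the users nothing.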

However, peer-reviewed research papers tend to favor tooling evaluations that include controlled studies. I have only been able to find a handful of research papers on tooling whose evaluation is primarily qualitative, or that rely on a single-condition study of the tool (with no between-subject or within-subject design):

  • ChainForge is an open-source tool for prompt engineering LLMs. Its evaluation consists of in-lab qualitative interviews and external interviews, paired with iterative development of the tool based on feedback from its users.
  • Talaria is a tool at Apple that enables engineers to build efficient ML models. Its evaluation consists of (1) an analysis of usage logs from 800+ users, (2) a usability survey with 26 users (see the scoring sketch after this list), and (3) qualitative interviews with 7 power users.
  • Meta-Manager is a tool that helps developers answer historically "hard-to-answer" questions about code history. Its evaluation was a single-condition study (no control condition on the tool) followed by a survey. Users were tasked with exploring an unfamiliar codebase while using Meta-Manager to answer questions about the history of the code, without modifying or running it.
  • Frequence is a tool that takes a visual, interactive approach to event sequence mining in datasets containing complex, hierarchical information. Its evaluation consists of demonstrations of the tool on various datasets and key findings from them, e.g. discoveries about different diseases made by applying Frequence to a dataset of electronic health records.
  • Beagle is a visual analytics tool for visualizing and understanding scamming activities. The tool was developed in collaboration with an investigation company, and the paper presents two iterations of the tool as it was adapted to the real-world, complex challenges of discovering and understanding scamming activity.
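For usability surveys like the one in the Talaria entry above, a standard instrument is the System Usability Scale (SUS); whether Talaria's survey used SUS specifically is my assumption, but scoring such a survey is simple enough to sketch:

```python
# A minimal sketch of scoring a System Usability Scale (SUS) survey.
# The responses are hypothetical and purely for illustration.

def sus_score(responses: list[int]) -> float:
    """Score one participant's 10 SUS items (each rated 1-5)."""
    assert len(responses) == 10
    # Odd-numbered items contribute (rating - 1); even-numbered items
    # contribute (5 - rating). The sum is scaled to a 0-100 range.
    total = sum((r - 1) if i % 2 == 0 else (5 - r)
                for i, r in enumerate(responses))
    return total * 2.5

participants = [
    [4, 2, 5, 1, 4, 2, 5, 2, 4, 1],
    [3, 3, 4, 2, 3, 2, 4, 3, 3, 2],
]
scores = [sus_score(p) for p in participants]
print(f"Mean SUS score: {sum(scores) / len(scores):.1f}")
```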

So which evaluation should a particular tool take on? I like to refer to what the authors of Evaluator Strategies for Toolkit Research say on this:

Rather than considering some methods as better than others, we believe that it is more important to use methods that best match the claims of the toolkit paper…

However, as researchers, we also have to be aware that while HCI in industry yields valuable insights that may immensely benefit the research community, peer-reviewed research tends to favor controlled user studies. Researchers and academics should be open-minded toward qualitative user studies, especially for tooling. An example is D3.js, a toolkit for data visualization: its evaluation was primarily demonstration-based rather than a controlled user study, yet in the long run it proved its worth through adoption by many thousands of users, thanks to how easily it lets them visualize data in web browsers.

The opposite can also be true. As the paper Usability evaluation considered harmful (some of the time) puts it, a tool can be "highly usable, but totally useless": an evaluation can prove that a tool is usable while, in the real world, users find it useless, e.g. because their particular use cases are not covered by the tooling at all.

Conclusion

To conclude, there is a gap between HCI in the academic world (and in mainstream HCI research) and HCI in industry. Researchers from a primarily academic background should be mindful of business constraints when reviewing research papers whose primary contribution is tooling.
