UX researcher Amy Heger recently shared how Microsoft conducted some of its most comprehensive qualitative research. The research team had to analyze more than 80 hours of intensive interviews with 90 AI specialists. And the insights they uncovered became…
The world’s first empirically based Responsible AI Maturity Model.
If this sounds intimidating to you, don’t fret. We’re about to go on a deep dive. Amy detailed the complexity of unraveling qualitative research data and turning it into a tangible framework.
Read on to discover:
- Defining research objectives (the why behind the maturity model)
- Conducting user interviews with AI practitioners
- Tagging 80+ hours of rich qualitative research data
- Synthesizing thousands of interview notes
- Sharing insights with teammates & stakeholders
- Leading future research for responsible AI
Why Microsoft Created the Responsible AI Maturity Model
AI systems have failed certain demographics.
In an ever-evolving environment, laws surrounding the development of AI simply can’t keep up with the release of new technologies. The onus is on companies to self-regulate before releasing anything into the world.
Microsoft is one of those companies taking a big stand. In fact, they have a group dedicated to qualitative research and recommendations on responsible AI issues: Microsoft Aether, their internal AI and ethics committee.
(You might remember our conversation with Mihaela Vorvoreanu, aka Mickey, Director of UX Research and Education at Aether at Microsoft. She deconstructed the relationship between responsible AI and UX.)
To understand the current landscape around responsible AI (RAI), Mickey, Amy and team conducted a thorough examination of existing literature. They reviewed white papers on AI and machine learning maturity models released by reputable thought leaders such as Oracle, Salesforce, IBM, Accenture and Boston Consulting Group. Aether also examined literature around security and privacy made publicly available by organizations such as the AICPA and NIST.
While there was no lack of documentation on the subject, they noticed that existing maturity models missed something. Current frameworks were geared more towards individuals and not organizations. They also lacked proper guidance on responsible AI — there were no real-life instances of what a mature RAI practice looks like.
“It was a humble understanding that there is no pre-existing maturity model, and being open to learning what should go in,” Amy said.
It became clear what Aether had to do: create a framework that addressed the RAI perspective. Intended to supplement other maturity models, the framework drew on guidelines from existing white papers, such as Fairness, Security and Privacy, and Accountability.
“The more we talked to participants, we found it wasn’t just about these practices, it was about the organizational foundations that facilitate them, collaborative relationships and the way people approach these problems,” she said.
Let’s take a look at how Amy and team carried out their qualitative research…
Conducting Interviews with 90 AI Practitioners
Aether turned to a host of different subject matter experts. Interviewing AI practitioners and specialists allowed them to involve diverse perspectives throughout development of the model.
Key stats from the interview process:
- Number of Interviews = 47
- Interview Time = 80 hrs
- Participants = 90
- Interview Format = Individual and Focus Groups
- Interview Type = Semi-structured
Studies like these previously included only 10-13 participants. Using Marvin, the Microsoft team reached seven times more participants than the norm. That’s a mountain of rich data.
Marvin’s capability to handle video transcription for hours of interview data saved Microsoft thousands of dollars on human transcription. (And the results are just as accurate but much, MUCH faster.)
Interviews consisted of one main interviewer, accompanied by a second person who’d take notes and jump in with questions on occasion. It’s helpful to have someone with a more zoomed-out view of proceedings, to keep the bigger picture in mind. What started out as run-of-the-mill interviews slowly changed over the course of the study.
Aether was interested in learning from practitioners’ widely different experiences and opinions. Interviewers had to stay adaptable, changing and tweaking discussion guides along the way.
“After the first 10 (interviews), we really had a better idea of what the interview guide should look like,” said Amy.
Because interviewees came from such diverse backgrounds, the team conducted background research on each participant before the interviews. Interviews varied widely in subject matter and in how deeply they explored certain topics. Some interviewees gave specific examples, while others offered higher-level strategies.
“(Part of it) was acknowledging how diverse the participants were, and having to adapt our interview guide template to match,” she said.
Interview discussion guides were in a state of constant flux. Using our platform, Aether saved discussion guide templates and created new iterations as they interviewed more people from the diverse participant pool.
Discover how product teams use Marvin to conduct productive and effective technical interviews.
Tagging Over 80 Hours of Rich Qualitative Research Data
With over 80 hours of interview recordings, there was a lot to unpack during analysis. The study demanded intricate coding (or labeling or tagging) methods due to the sheer complexity of the data.
The Aether team developed a preliminary codebook during the interview process. It served as a base from which to build out and calibrate the codes over time.
Mickey chimed in with great coding advice for the initial stages of qualitative analysis:
“Always code more rather than less granular. It’s easier to collapse codes than go back and recode for nuances.”
Often while coding, a slice of information can apply equally well to two different categories. Amy initially approached the process from a social psychology standpoint, where information must be filed into discrete, separate categories. She had to overcome this instinct and learn to embrace the complexity:
“I realized the nuance and interdependence between these different areas is really valuable and, in itself, one of the findings,” she said.
Handling data with overlapping labels is a breeze in Marvin. Oscillating between Marvin’s Analyze and Labels pages, Amy and team were able to collapse codes on the fly and create coding hierarchies for their codebook.
They took many passes at transcripts while coding for different themes. For example, when they first looked at a bit of text, they would ask “Is this characteristic of a less or more mature RAI environment?”
The second pass focused on the dimensions being alluded to, such as transparency or fairness.
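To make this multi-pass coding concrete, here’s a minimal Python sketch of the idea. The field names and labels are hypothetical (this isn’t Marvin’s data model); it simply shows how one excerpt can carry a maturity label from the first pass and dimension labels from the second:

```python
from dataclasses import dataclass, field
from typing import Optional, Set

@dataclass
class Excerpt:
    """One coded slice of an interview transcript (hypothetical structure)."""
    interview_id: str
    text: str
    maturity: Optional[str] = None  # first pass: "less mature" / "more mature"
    dimensions: Set[str] = field(default_factory=set)  # second pass: e.g. "transparency"

# First pass: is this characteristic of a less or more mature RAI environment?
quote = Excerpt("P12", "We only review a model for bias if a customer complains.")
quote.maturity = "less mature"

# Second pass: which dimensions is the excerpt alluding to?
quote.dimensions.update({"fairness", "accountability"})

print(quote)
```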
Amy gave us an example of how she embraced the interconnectedness of elements.
“There’s a chain when you need organizational support, interactions to go smoothly, have a learning process that allows you to implement those practices, and understand them on the ground,” she said.
This allowed her to tease out categories in which to bucket the data. “I was able to clearly break them apart and understand how to make them separate dimensions,” she said.
Check out our comprehensive guide on how to tag your data and analyze your research faster.
Synthesizing 2,000 Notes into a Framework
The data was more nuanced and convoluted than Aether imagined. Insights were deeply embedded in the transcripts and required finesse to unravel. Coding took several iterations, and the process of note-taking, or “brain dumps” as Amy refers to them, began.
Amy ran us through an example of a brain dump. Let’s take the topic of RAI policy. She would filter and stitch together related notes and video clips into one playlist. After watching the entirety of the playlist, she’d begin taking notes.
After a rigorous and iterative note-taking process, they had amassed over 2,000 notes from the 47 interviews. How did they synthesize these into a tangible model?
“The platform for us was really helpful — on the analysis side, it was so helpful for me to drill down and say, ‘I want to see all codes about transparency overlapping with a lower maturity level.’ That allowed a really helpful organization and sifting through the codes,” she said.
Using the Analyze tab, Amy and team were able to filter for specific, overlapping themes while taking notes. They could also visualize the data from more structured questions that they asked participants.
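The kind of query Amy describes, all transparency codes that overlap with a lower maturity level, boils down to an intersection of labels. Here’s a rough sketch of that logic in Python (the data is made up and this isn’t Marvin’s API, just the underlying idea):

```python
# Made-up coded excerpts; in practice these would come out of the research platform.
excerpts = [
    {"id": 1, "labels": {"transparency", "maturity:low"}},
    {"id": 2, "labels": {"transparency", "maturity:high"}},
    {"id": 3, "labels": {"fairness", "maturity:low"}},
]

def with_all_labels(items, required):
    """Keep only the excerpts tagged with every label in `required`."""
    required = set(required)
    return [e for e in items if required <= e["labels"]]

# "All codes about transparency overlapping with a lower maturity level."
print(with_all_labels(excerpts, {"transparency", "maturity:low"}))  # -> excerpt 1 only
```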
Once they drafted the framework, Aether vetted their efforts with 56 internal experts. Through focus groups and interviews, the feedback they garnered helped them validate the dimensions outlined and identify any knowledge gaps.
The result? The RAI Maturity Model contains 24 measurable dimensions with a five-point scale to determine the discrete maturity of each dimension.
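As a rough illustration of that structure (the dimension names below are placeholders, not the model’s actual list of 24), an assessment can be thought of as a mapping from each dimension to a score on the five-point scale:

```python
# Placeholder dimensions scored on a five-point maturity scale (1 = least, 5 = most mature).
assessment = {
    "Transparency": 2,
    "Fairness": 3,
    "Accountability": 1,
}

assert all(1 <= score <= 5 for score in assessment.values())

# Quick summary, weakest dimensions first, to highlight where to invest next.
for dimension, score in sorted(assessment.items(), key=lambda item: item[1]):
    print(f"{dimension}: {score}/5")
```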
Sharing Insights Throughout the Process
Amy’s under no illusions: this qualitative research project could not have been completed without a key ingredient — collaboration.
With members of the team spread across the country, Marvin facilitated the easy sharing of data. Interns conducted most of the interviews; Amy and her colleague Samir weren’t present for those sessions, but performed the bulk of the nuanced qualitative data analysis. They used Marvin to string together playlists and share insights with each other. This enabled a constant back-and-forth in the coding process.
Periodically, Amy briefed the team on the progress in analysis and presented her updated findings. Bouncing ideas off people (who weren’t coding data) helped her unravel the interconnectedness of the various dimensions identified in the framework.
For each theme, she compiled recordings into playlists to facilitate note taking. Having all clips of a theme like “Responsible AI policy” in one place made the process of note taking smooth and focused.
“It was a way in which I could take this huge ball of string and start to break it down,” she said.
Heavily engrossed in coding the data, Amy found herself “in the weeds,” isolated in her own work. How did she ensure she didn’t miss the forest for the trees? How did she zoom out and get a view of the bigger picture?
“This project couldn’t have been done alone — so much of what allowed me to analyze and code successfully was having conversations, like (viewing) the forest and the trees and the calibration of the codebook,” she said.
Future Qualitative Research for Responsible AI
With their monumental effort (and a little help from their friends at Marvin), Microsoft Aether has created a blueprint for future versions of the model. The authors of the RAI Maturity Model explicitly state that it’s a living document — like every application or operating system, this is just version number one (v1). Amy informed us that work on v2 of the RAI Maturity Model is already underway.
Aether hopes to pilot the model with teams at companies sooner rather than later. Whether individuals embrace it or not is immaterial — constructive feedback is vital and will undoubtedly inform future iterations. AI is uncharted terrain, and learning will be shaped by our collective experiences.
Amy shared advice for budding researchers undertaking complex studies like this one:
“Be aware of your own biases. Instead of trying to impose my view, it was (about) accepting what the data was telling me — that these things are interconnected. Frustrating as that was, it made for the most valuable understanding of the topic. Don’t shy away when things don’t go the way you expected,” she said.
Photo by Federico Beccari on Unsplash