What a Rigorous Study Actually Looks Like

The adoption metrics were fine. I still did not know if anyone was getting better.

Jun 03, 2026

A client wanted to know if their AI tool rollout was working.

I asked what working meant. They pointed to adoption metrics. Utilization rates, completion data, a satisfaction score that had been holding steady since launch. The rollout team was pleased. Leadership was satisfied. The dashboard looked healthy.

I knew adoption was not the answer. I also did not have a ready-made framework that would let me hand them something better. That gap, between knowing the standard metrics were insufficient and being able to offer a rigorous alternative, showed up consistently across the engagements that followed. Federal modernization contexts. Enterprise AI platforms. Workforce deployments of various kinds. Always the same shape: adoption data, satisfaction scores, and a capability question that nobody had the infrastructure to answer.

This part is the framework I wished I had.

It is also, in an early form, the basis for a research design tool I built to help practitioners work through this problem systematically. More on that below. First the design itself.

What the study needs to do

A capability study is not an adoption study with better questions. It is a fundamentally different kind of research with different design requirements, different comparison conditions, and a different relationship to time.

Adoption studies measure behavior at a point in time. Did employees use the tool. How often. For how long. Did they complete the tasks the tool was built to support. These are reasonable questions and they are answerable with the instrumentation most organizations already have.

Capability studies measure change over time. Are employees better at exercising judgment in this domain than they were before the tool was deployed. Can they evaluate outputs critically. Can they perform in conditions where the tool is unavailable, wrong, or operating outside its reliable range.

Four things the study needs to do well:

Define the capability construct specifically. Not "employees are better at their jobs" but "employees can accurately evaluate AI-generated outputs in this domain, identify errors or gaps, and make decisions that go beyond what the tool has already decided." The more specific the construct, the more measurable it is and the more honest you can be about what the study can and cannot tell you.

Establish a baseline before deployment or as early as possible after it. Capability measurement without a pre-measurement is descriptive at best. You need to know where people started to know whether they have moved. This sounds obvious and gets skipped constantly because baseline data collection requires organizational will and timeline discipline that most rollouts do not have built in.

Build in comparison conditions. The ideal comparison is a pre-post design with a control group that does not have access to the tool during the study period. That is often organizationally impossible. The practical alternative is a pre-post design without a control group, combined with honest acknowledgment of what that limits you to claiming, plus supplementary methods like think-aloud protocols or scenario-based assessments that can isolate judgment from tool output.

Measure at multiple timepoints. Capability development is not linear and it is not fast. A single post-deployment measurement will tell you something. Measurements at thirty days, ninety days, and six months will tell you something meaningfully different and considerably more useful. The confidence problem from the previous part, the tendency for AI tools to increase felt confidence faster than actual capability, is only visible in longitudinal data. Cross-sectional measurement will miss it almost every time.
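For the control-group version of the pre-post design, one standard way to read the result is a difference in differences: the change in the tool group minus the change in the control group. That technique is not named above; it is offered here only as a minimal sketch of how the comparison could be computed. The function name, the 0-to-1 score scale, and the sample numbers are all illustrative assumptions.

```python
from statistics import mean


def diff_in_diff(tool_pre, tool_post, control_pre, control_post):
    """Change in the tool group minus change in the control group.

    Each argument is a list of assessment scores (assumed 0-1 scale).
    A positive result suggests the tool group improved more than the
    control group over the same period.
    """
    tool_change = mean(tool_post) - mean(tool_pre)
    control_change = mean(control_post) - mean(control_pre)
    return tool_change - control_change


# Hypothetical scores, not real data.
effect = diff_in_diff(
    tool_pre=[0.5, 0.5],
    tool_post=[0.75, 0.75],
    control_pre=[0.5, 0.5],
    control_post=[0.5, 0.75],
)
```

Without a control group the second term disappears, and what remains is a plain pre-post change that cannot separate the tool's effect from everything else that happened in the period.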

How to design it

The measures that matter most for capability are the ones organizations are least likely to already have. That is not a coincidence. They are harder to collect, harder to present, and harder to defend under scrutiny than adoption metrics. They are also the only measures that will tell you what you actually need to know.

Scenario-based assessments are the most direct measure of judgment. Present employees with realistic situations that require them to evaluate an AI output, identify a potential error, or make a decision that goes beyond what the tool has provided. Score their responses against expert judgment. Do this before deployment and at multiple points after. The gap between pre and post, and the trajectory of that gap over time, is your capability signal.
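To make the scoring concrete, here is a minimal sketch of scoring scenario responses against expert judgment and tracking the gap from baseline. It assumes each scenario has a single categorical expert call (for example "accept" or "reject" the AI output) and that simple agreement is an acceptable score; a real rubric would likely be richer. All names and data are hypothetical.

```python
def agreement_score(employee_calls, expert_calls):
    """Fraction of scenarios where the employee's call matches the expert's."""
    if len(employee_calls) != len(expert_calls):
        raise ValueError("need one employee call per expert call")
    if not expert_calls:
        raise ValueError("need at least one scenario")
    matches = sum(e == x for e, x in zip(employee_calls, expert_calls))
    return matches / len(expert_calls)


def change_from_baseline(scores_by_timepoint, baseline_label="baseline"):
    """Score at each later timepoint minus the baseline score."""
    baseline = scores_by_timepoint[baseline_label]
    return {
        label: score - baseline
        for label, score in scores_by_timepoint.items()
        if label != baseline_label
    }


# Hypothetical responses from one employee on four scenarios.
expert = ["accept", "reject", "reject", "accept"]
scores = {
    "baseline": agreement_score(["accept", "accept", "accept", "accept"], expert),
    "day_90": agreement_score(["accept", "reject", "accept", "accept"], expert),
}
trajectory = change_from_baseline(scores)
```

The same scenarios should not be reused verbatim across timepoints, or the score starts measuring memory rather than judgment; the sketch leaves that design choice out.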

Think-aloud protocols during task completion surface the reasoning behind the behavior. Completion rates tell you that the employee finished the task. Think-aloud data tells you whether they understood what they were doing, whether they evaluated the output critically, and whether they are developing genuine judgment or efficient tool dependency. This is qualitative and therefore underweighted in most organizational measurement contexts. It is also frequently the data that explains what the quantitative measures cannot.

Calibration probes embedded in workflow are a lighter-touch option for ongoing measurement. Periodically present employees with AI outputs that contain deliberate errors or ambiguities and measure whether they catch them. This does not require a separate study. It requires instrumentation built into the tool deployment itself and organizational commitment to treating the results as signal rather than as a quality control exercise.
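As a rough illustration of what that instrumentation might log, the sketch below computes a catch rate per measurement period from probe records. It assumes every record is a probe with a seeded error and that "caught" is a simple yes or no. The record shape and period labels are hypothetical, not a description of any particular tool.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeResult:
    period: str    # measurement period, e.g. "day_30" (illustrative label)
    flagged: bool  # True if the employee caught the seeded error


def catch_rate_by_period(results):
    """Share of seeded-error probes that were flagged, per period."""
    shown = defaultdict(int)
    caught = defaultdict(int)
    for r in results:
        shown[r.period] += 1
        caught[r.period] += int(r.flagged)
    return {period: caught[period] / shown[period] for period in shown}


# Hypothetical probe log, not real data.
log = [
    ProbeResult("day_30", True),
    ProbeResult("day_30", False),
    ProbeResult("day_90", True),
    ProbeResult("day_90", True),
]
rates = catch_rate_by_period(log)
```

A catch rate that falls over time while completion metrics hold steady is the kind of signal adoption dashboards will not show.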

Self-report confidence measures are worth collecting and worth treating with significant caution. As the previous part established, confidence and capability diverge in specific and predictable ways when AI tools are involved. Collecting confidence data alongside capability measures lets you track that divergence directly, which is some of the most useful signal the study can produce. An employee whose confidence is rising faster than their measured capability is showing you exactly the pattern the study is designed to catch.
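One simple way to track that divergence, sketched under assumptions: put self-reported confidence and measured capability on the same 0-to-1 scale, then compare how far each has moved since the first measurement. The function name, the shared scale, and the numbers are illustrative choices, not an established metric.

```python
def confidence_capability_gap(confidence, capability):
    """Gain in confidence minus gain in capability, per later timepoint.

    Both inputs are lists ordered by timepoint, first entry = baseline,
    assumed to be on the same 0-1 scale. Positive values mean confidence
    has risen more than measured capability since baseline.
    """
    if len(confidence) != len(capability):
        raise ValueError("series must cover the same timepoints")
    if len(confidence) < 2:
        raise ValueError("need a baseline and at least one follow-up")
    conf0, cap0 = confidence[0], capability[0]
    return [
        (conf - conf0) - (cap - cap0)
        for conf, cap in zip(confidence[1:], capability[1:])
    ]


# Hypothetical series: baseline, day 30, day 90.
gaps = confidence_capability_gap(
    confidence=[0.5, 0.75, 1.0],
    capability=[0.5, 0.5, 0.75],
)
```

In this made-up series both follow-up gaps are positive: confidence has moved further from baseline than capability at each later timepoint.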

What to do with what you find

The study will produce one of three findings, or some combination of them.

Capability is developing alongside confidence. The tool is doing what AI enablement is supposed to do. Employees are genuinely getting better, not just faster, not just more confident, but more capable of exercising judgment in the domain. This finding is worth documenting carefully because it is rarer than most organizations expect and more valuable than the adoption metrics suggest.

Confidence is outpacing capability. Employees feel better than they are performing. This is the pattern the previous part described and it is the most common finding in contexts where the tool has been deployed without capability measurement infrastructure. It is not a reason to abandon the tool. It is a reason to redesign the interaction, add calibration support, build in deliberate friction at the points where employees are most likely to over-trust the output, and measure again.

Capability is not developing. Employees are completing tasks and not getting better at the underlying domain. The tool is substituting for judgment rather than developing it. This is the hardest finding to deliver and the most important one to catch early. The longer an organization runs a substitution tool under the assumption that it is a development tool, the more expensive the correction becomes.

All of these findings are available from a study designed to ask the capability question directly.
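The mapping from measurements to findings can be sketched as a small rule. This is an assumption-heavy illustration: the 0.05 threshold is arbitrary, the inputs are assumed to be changes from baseline on a shared 0-to-1 scale, and a real study would weigh sample size and measurement error before labeling anything. It returns a list because the findings can combine.

```python
def classify_findings(capability_change, confidence_change, min_change=0.05):
    """Label a result with one or more of the three findings.

    min_change is an illustrative threshold for "meaningful" movement,
    not an empirically derived value.
    """
    findings = []
    if capability_change < min_change:
        findings.append("capability is not developing")
    if confidence_change - capability_change > min_change:
        findings.append("confidence is outpacing capability")
    if not findings:
        # Default label for everything the two checks above did not flag.
        findings.append("capability is developing alongside confidence")
    return findings


# Hypothetical changes from baseline.
healthy = classify_findings(capability_change=0.25, confidence_change=0.25)
combined = classify_findings(capability_change=0.0, confidence_change=0.25)
```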

The tool

I built an early prototype of a research design assistant specifically for this problem. It takes inputs about the AI tool being evaluated, the employee population, and the organizational context, and generates a tailored study design with recommended measures, suggested timepoints, a draft scoring rubric, and honest acknowledgment of what the design can and cannot claim.

The intent is to make the capability question answerable for practitioners who know the adoption metrics are insufficient but do not have a ready-made alternative. The prototype is available here. The code is open on GitHub for anyone who wants to look under the hood.

Here is an example of a capability question the tool generated for a hypothetical global procurement context:

Are procurement managers maintaining and developing independent supplier risk judgment, or are declining override rates and faster cycle times reflecting increasing over-reliance on tool outputs they are no longer evaluating critically?

More specifically: when the tool is wrong about a supplier risk, can managers catch it? And is that ability strengthening, holding steady, or eroding over the eight months since deployment?

And here is part of what it produced for honest limitations in that same scenario:

You do not have a pre-deployment baseline, and nothing in this design creates one. The baseline you collect in weeks one through three reflects capability at eight months post-deployment. You will not be able to say whether the tool improved, maintained, or eroded procurement judgment compared to before deployment. You will be able to say whether capability is developing, stable, or eroding during the measurement period. For the CPO’s expansion decision, that is an honest and useful answer. It is not a complete one.

Your context will produce something different. That is the point.

The current version is a research design assistant. The next version will be able to do more. Feedback from practitioners who use it will make both versions better.

The framework in this part is the conceptual foundation. The tool is the attempt to operationalize it. Neither is finished. Both are more useful than another adoption dashboard.

This is the third part in The Capability Question, a series on measuring what AI enablement actually does to people. Start with the series introduction here. Next: Toward a Capability Measurement Practice.

Ground Truth is written by Shannon Coleman, PhD. Product strategy leader and researcher with a background in psychology and quantitative methods. Her work spans govtech, healthtech, fintech, and consumer product contexts, with particular depth in enterprise AI platforms. Portfolio: shannonlcoleman.com
