Designing for the Gap: Two Years of Lessons from Building Brainpathio

We started building Brainpathio in early 2024 with a clear hypothesis: that misconception-aware adaptive routing would produce better instructional signals than difficulty-based adaptive routing. Eighteen months later, the hypothesis still holds. But several of our early design decisions turned out to be wrong in ways that were only visible once we put the product in front of practicing teachers. Here are two lessons from the decisions we'd take back and redo.

The Context for These Lessons

When we began, we had a well-defined technical hypothesis and a clear sense of the instructional problem we were solving. The technical work of building a misconception taxonomy, designing diagnostic items, and implementing a classification mechanism is substantial but tractable: there's a body of research to draw on, and the engineering problems are well understood. What we underestimated was the design problem on the teacher-facing side: how do you represent uncertain, probabilistic information about student cognition in a way that a busy classroom teacher, managing 25 students and a pacing guide and a curriculum coordinator who wants weekly updates, will actually find useful rather than anxiety-inducing?

These two lessons come from the gap between what we built in our first version and what teachers told us, consistently, was missing. They're not algorithmic lessons or data science lessons. They're design and trust lessons. We think they're relevant to anyone building or evaluating adaptive learning tools that aim to surface diagnostic information to teachers.

Lesson One: Certainty Displays Erode Trust Faster Than Uncertainty Displays Do

Our first version of the teacher misconception dashboard showed a misconception label for each student alongside a confidence indicator: a bar whose fill level represented the system's confidence. If the system had high confidence in a misconception classification, the bar was mostly full. If the system was still gathering evidence, the bar was only partly filled. We thought this was honest and informative. It was honest. It wasn't informative in the way we expected.

What we heard from teachers during early feedback sessions was something like: "I'm not sure I trust the partial ones. How do I know when to act on it versus wait for more data?" And, harder to hear: "When I looked at the full-bar ones and checked them against what I know about those kids, some of them were right and some of them were clearly wrong. So now I don't know whether to trust any of them."

The core problem was that we were displaying certainty as a continuous variable when teachers' decisions were binary: act, or don't act. A partially filled confidence bar invites teachers to calibrate their response to the level of certainty, but teachers don't have the bandwidth to do that calibration for 25 students on a daily basis. And a high-confidence label that turns out to be incorrect for a specific student undermines trust in all the labels, not just that one.

The redesign we arrived at, after several iterations, was a threshold approach: the system only surfaces a misconception label to the teacher when the evidence crosses a minimum confidence threshold, roughly equivalent to "we're confident enough that acting on this is more likely to help than hurt." Below the threshold, the system says "more information needed" and routes additional diagnostic items. This means teachers see fewer labels in the dashboard at any given time, but the ones they see are actionable. The display still represents the system's epistemic state accurately, but it translates that state into a teacher-relevant framing: act now, or wait while more data is gathered.
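As a rough illustration of the threshold idea, here is a minimal Python sketch. The 0.8 cutoff, the `Classification` type, and the `teacher_view` function are hypothetical names and values chosen for this example; the actual threshold and data model are not specified in this post.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical cutoff for illustration only; the real value is not stated here.
ACTION_THRESHOLD = 0.8


@dataclass
class Classification:
    student: str
    misconception: str
    confidence: float  # system's estimated confidence, from 0.0 to 1.0


@dataclass
class TeacherRow:
    student: str
    status: str            # "act" or "more information needed"
    label: Optional[str]   # misconception label, shown only when actionable


def teacher_view(c: Classification, threshold: float = ACTION_THRESHOLD) -> TeacherRow:
    """Collapse a continuous confidence value into a binary teacher-facing state."""
    if c.confidence >= threshold:
        return TeacherRow(c.student, "act", c.misconception)
    # Below the threshold no label is surfaced; the system would route
    # additional diagnostic items instead (not modeled in this sketch).
    return TeacherRow(c.student, "more information needed", None)


# Example usage with made-up students and a placeholder misconception label.
rows = [
    teacher_view(Classification("student-1", "placeholder misconception", 0.91)),
    teacher_view(Classification("student-2", "placeholder misconception", 0.55)),
]
```

The continuous confidence value still exists inside the system. The teacher-facing row simply never exposes it, which is the point of the redesign: the calibration decision is made once, in the threshold, rather than by each teacher for each student.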

We're not saying uncertainty should be hidden from teachers; that would be a different kind of dishonesty. We're saying that how uncertainty is represented matters as much as whether it's represented. Continuous probability estimates are honest, but they shift the cognitive burden of interpretation onto the teacher. Threshold-based displays shift that burden back to the system, where it belongs.

Lesson Two: The Taxonomy's Grain Size Has to Match What Teachers Can Do

When our head of learning science, Marcus Webb, built out the initial misconception taxonomy for 5th and 6th grade mathematics, he did it with rigor. The taxonomy distinguishes between closely related misconceptions that have different instructional implications. For fraction operations alone, the taxonomy includes seven distinct misconceptions, each with different diagnostic signatures and different re-teaching approaches.

The problem: when teachers saw seven distinct fraction misconceptions in the dashboard, their first question wasn't "which of these seven does my student have?" It was "what should I do tomorrow?" And seven distinctions, each with a different recommended response, didn't produce clarity. They produced decision paralysis.

This is a real tension in the design of any misconception-aware system, and we don't think there's a clean resolution. The grain size that's ideal for classification accuracy, fine enough to distinguish closely related misconceptions that require different interventions, is often finer than the grain size that's useful for a teacher making an in-the-moment instructional decision. There aren't seven different things a teacher can do in response to seven different fraction misconceptions before the next lesson. There are probably two or three.

Our current approach groups the underlying fine-grained taxonomy into teacher-facing "action clusters" — broader categories that map onto practical instructional responses, while preserving the fine-grained classification internally for analysis and curriculum reporting purposes. A teacher sees three fraction misconception categories rather than seven, each with a suggested re-teaching approach. The department dashboard, by contrast, surfaces the fine-grained taxonomy when department leads are doing curriculum analysis — because at that level of decision-making, the distinction between two closely related misconceptions is relevant to which curriculum materials to revise.
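One way to picture the action-cluster grouping is as a lookup that sits between the classifier and each dashboard. The sketch below is an assumption-laden illustration, not a description of the production system: the IDs `frac-1` through `frac-7`, the cluster names, the particular grouping, and the `present` function are all placeholders invented for this example.

```python
# Placeholder IDs: seven fine-grained fraction misconceptions grouped into
# three teacher-facing action clusters. The grouping itself is invented.
FINE_TO_CLUSTER = {
    "frac-1": "cluster-A",
    "frac-2": "cluster-A",
    "frac-3": "cluster-A",
    "frac-4": "cluster-B",
    "frac-5": "cluster-B",
    "frac-6": "cluster-C",
    "frac-7": "cluster-C",
}


def present(fine_label: str, audience: str) -> str:
    """Return the label at the grain size appropriate to the audience."""
    if audience == "department":
        # Curriculum analysis keeps the fine-grained distinction.
        return fine_label
    if audience == "teacher":
        # Classroom view collapses to the broader action cluster.
        return FINE_TO_CLUSTER[fine_label]
    raise ValueError(f"unknown audience: {audience!r}")


# The stored classification stays fine-grained; only the presentation differs.
stored = "frac-5"
teacher_label = present(stored, "teacher")
department_label = present(stored, "department")
```

Because the mapping is applied at display time, the fine-grained classification is never discarded, and the grouping can be revised without reclassifying any student.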

The lesson isn't that the taxonomy should be coarser — coarser taxonomies lose diagnostic power. The lesson is that the same taxonomy needs to be presented at different grain sizes depending on who is looking and what decision they're making. Building that translation layer — from fine-grained classification to teacher-actionable presentation — turned out to be as much design work as building the taxonomy itself.

What These Two Lessons Have in Common

Both lessons circle back to the same underlying principle: the hardest design problem in formative assessment tools isn't getting the algorithm right. It's representing what the algorithm knows — including what it doesn't know — in a form that's legible and trustworthy to a person who has 35 minutes of prep time, four sections of students, and no tolerance for information that adds cognitive load without producing a usable decision.

Adaptive learning has a long history of technically sophisticated systems that failed in deployment because the information they generated wasn't designed for the humans who were supposed to use it. Intelligent tutoring system research from the 1990s and early 2000s is full of examples of high-accuracy classification systems with teacher dashboards that no one consulted. The accuracy of the underlying model and the utility of the teacher-facing representation are independent design problems, and both have to be solved.

We're still iterating on both. The misconception taxonomy for 6th-grade science is better than the 5th-grade mathematics taxonomy was at version one, partly because we'd learned these lessons before we built it. The dashboard design for the department-level view is substantially more useful than the first classroom-level view was, partly because we'd had those hard feedback sessions first.

If there's a single principle worth extracting, it's this: build the most accurate classification system you can, and then spend at least as much time on how the output of that system gets communicated to the people who have to act on it. The second part is not a UI problem. It's a trust problem. And trust, in formative assessment, is upstream of everything else. Teachers who don't trust a system's outputs won't use them. Teachers who do trust them, specifically because the system has been honest about uncertainty and legible about what to do, will build practices around them. That's the difference between a tool that changes instruction and one that sits unused in a browser tab.