Agentic AI Medical Coding: How to Measure What Matters

Arasu Elango
May 13


More than 240 health systems now use Epic's Penny agent to assist with professional billing and coding. According to Healthcare IT News coverage of HIMSS26, the top-performing organizations report more than 20% fewer coding-related claim denials. Denial appeal letters are drafted 23% faster across Epic's community, and Summit Health cut prior authorization submission time by 42%. Those are striking numbers — but they also raise a harder question that most revenue cycle teams aren't asking yet: how do you know when your agentic AI is actually working, and how do you catch it when it isn't?

As agentic AI in medical coding moves from pilot to production — Epic's own roadmap calls for Penny to autonomously complete "straightforward coding sessions" within the next release cycle — the measurement frameworks governing these systems matter as much as the systems themselves. Getting the metrics wrong is expensive. Getting them right is the difference between a 20% denial reduction and a compliance exposure you didn't see coming.

What "Agentic" Actually Means for Coding Workflows

Traditional AI-assisted coding tools function as recommendation engines: a coder sees a suggested code, reviews it, and accepts or rejects. The human is always in the loop, and the audit trail is straightforward. Agentic AI is different in kind, not just degree.

An agentic coding system — whether it's Epic's Penny, Innovaccer's Flow Capture (which autonomously codes approximately 80% of encounters without human intervention), or IKS Health's audit-ready engine — takes actions on its own. It assigns codes, applies modifier logic, checks NCCI edits, validates against payer propensity data, and routes the claim. The coder may never see the majority of encounters. That's the efficiency gain. It's also the compliance surface area that traditional measurement frameworks weren't designed to handle.

The distinction matters because the metrics that work for an AI recommendation tool — accuracy rate on reviewed cases, acceptance rate, time-to-review — are insufficient for a system that acts autonomously on a large fraction of your encounter volume.

The Three Metrics That Actually Predict Risk

Across the revenue cycle vendors that have published real-world data on autonomous coding deployments, three indicators consistently surface as the most predictive of downstream problems.

1. Autonomous Handling Rate by Complexity Band

The percentage of encounters coded without human review is not a single number; it should be broken out by complexity band (CPT or DRG), payer, and diagnosis category. A system that autonomously handles 80% of encounters may be handling 95% of your low-complexity office visits and 60% of your complex inpatient cases. Those two populations have very different error profiles and audit risk. If you're only tracking the aggregate autonomous rate, you're flying blind on the cases that matter most to CMS and commercial payers.
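
As a rough illustration of that breakdown, here is a minimal Python sketch. The `Encounter` record, the band labels, and the volumes are all hypothetical; the counts are synthetic and chosen only so the output matches the percentages quoted above.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Encounter:
    complexity_band: str  # hypothetical labels, e.g. "low" or "high"
    autonomous: bool      # True if coded with no human review


def autonomous_rate_by_band(encounters):
    """Return {band: share of encounters coded autonomously}."""
    totals = Counter()
    autonomous = Counter()
    for e in encounters:
        totals[e.complexity_band] += 1
        if e.autonomous:
            autonomous[e.complexity_band] += 1
    return {band: autonomous[band] / totals[band] for band in totals}


# Synthetic volumes: 80 low-complexity and 60 high-complexity encounters.
encounters = (
    [Encounter("low", True)] * 76 + [Encounter("low", False)] * 4
    + [Encounter("high", True)] * 36 + [Encounter("high", False)] * 24
)

aggregate_rate = sum(e.autonomous for e in encounters) / len(encounters)
rates_by_band = autonomous_rate_by_band(encounters)

print(f"aggregate: {aggregate_rate:.0%}")
for band, rate in rates_by_band.items():
    print(f"{band}: {rate:.0%}")
```

With these synthetic counts the aggregate is 80%, while the bands sit at 95% and 60%. The same grouping could be repeated by payer or diagnosis category by swapping the grouping field.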

2. Payer-Specific First-Pass Denial Rate (Not Blended)

Epic's reported 20% denial reduction comes from the top-performing organizations among the more than 240 using Penny; it is not a forecast for your own payer mix, which will produce its own distribution. A system that performs well across Blue Cross contracts may underperform on Medicare Advantage plans, particularly as HCC-relevant encounters are coded under V28 model requirements. Blended denial rates mask payer-specific failures that could trigger audit flags or contract-level disputes. Revenue cycle teams deploying agentic AI should track first-pass denial rate by payer, with a specific lens on Medicare Advantage given ongoing RADV enforcement activity.
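
To make the masking effect concrete, the sketch below computes a blended first-pass denial rate next to the per-payer rates. The payer names, claim counts, and the 1.5x flagging margin are illustrative assumptions, not figures from any vendor.

```python
from collections import Counter


def first_pass_denial_rates(claims):
    """claims: iterable of (payer, denied_on_first_pass) pairs.

    Returns (blended_rate, {payer: rate}).
    """
    submitted = Counter()
    denied = Counter()
    for payer, was_denied in claims:
        submitted[payer] += 1
        if was_denied:
            denied[payer] += 1
    by_payer = {p: denied[p] / submitted[p] for p in submitted}
    blended = sum(denied.values()) / sum(submitted.values())
    return blended, by_payer


# Synthetic claims for two hypothetical payer groups.
claims = (
    [("Commercial", False)] * 940 + [("Commercial", True)] * 60
    + [("Medicare Advantage", False)] * 160 + [("Medicare Advantage", True)] * 40
)

blended, by_payer = first_pass_denial_rates(claims)

# Hypothetical rule: flag any payer running more than 1.5x the blended rate.
flagged = [p for p, rate in by_payer.items() if rate > 1.5 * blended]

print(f"blended: {blended:.1%}")
for payer, rate in by_payer.items():
    print(f"{payer}: {rate:.1%}")
print("flagged:", flagged)
```

In this synthetic mix the blended rate is about 8.3%, which looks unremarkable, while the Medicare Advantage rate is 20%.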

3. Undercoding and Upcoding Detection Rate

Denial rate is a lagging indicator. By the time a denial surfaces, a claim has already been submitted, adjudicated, and rejected, a cycle that takes weeks. A more useful leading indicator is the rate at which your agentic system's code selection deviates from what a certified coder would have assigned, in either direction (undercoding or upcoding), measured by internal audits on a statistically valid sample of autonomously coded encounters. IKS Health's "audit-ready coding" is designed to support this kind of review: every autonomous code assignment includes a justification report citing specific charted clinical evidence, which makes retrospective audits possible without reconstructing the AI's reasoning from scratch.
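
One way to turn audit findings into such a rate is sketched below. It assumes each audited encounter has already been labeled "match", "undercoded", or "upcoded" by a certified coder (labels invented here for illustration), and it attaches a 95% Wilson score interval so that a small sample is not over-read. The counts are synthetic.

```python
import math
from collections import Counter


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n == 0:
        raise ValueError("audit sample is empty")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


def audit_deviation_summary(findings):
    """findings: list of auditor labels, one per audited encounter."""
    counts = Counter(findings)
    n = len(findings)
    summary = {}
    for label in ("undercoded", "upcoded"):
        low, high = wilson_interval(counts[label], n)
        summary[label] = {"rate": counts[label] / n, "ci95": (low, high)}
    return summary


# Synthetic audit of 300 autonomously coded encounters.
findings = ["match"] * 270 + ["undercoded"] * 21 + ["upcoded"] * 9
summary = audit_deviation_summary(findings)

for label, stats in summary.items():
    low, high = stats["ci95"]
    print(f"{label}: {stats['rate']:.1%} (95% CI {low:.1%} to {high:.1%})")
```

Reporting the interval alongside the point estimate helps separate a real shift in deviation rate from month-to-month sampling noise.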

What an Accountability Framework Looks Like in Practice

The operational question is how to build oversight structures that don't simply recreate the manual review bottleneck that agentic AI was deployed to eliminate. The following framework reflects practices emerging across health systems that have moved beyond pilot stage.

Tiered human review by confidence score. Most production agentic coding systems generate a confidence score per code. IKS Health's engine routes high-confidence codes through automated rules and payer validation while flagging low-confidence codes for coder review. The key operational decision is setting the confidence threshold for auto-approval: set it too high and most codes fall back to coder review, re-creating manual coding; set it too low and the error rate on auto-approved codes becomes a compliance liability.
Monthly audit samples stratified by payer and complexity. An audit covering 5–10% of autonomously coded encounters, stratified by the payer and DRG/CPT complexity bands with the highest denial exposure, gives compliance teams early warning of systematic errors before they surface as denial patterns or OIG attention.
CDI integration checkpoints. Autonomous coding accuracy is bounded by documentation quality. Systems like Innovaccer's Flow Capture address this by embedding CDI query logic at the point of care — the ambient scribe captures the visit, a CDI assistant closes documentation gaps before the note is signed, and the coding engine then works from a more complete record. Health systems without this upstream integration will see agentic coding accuracy capped by whatever their documentation gaps already are.
Real-time payer rule updates. CPT 2026 introduced 288 new codes and 46 revisions — the largest update in recent memory, and the first CPT cycle to formally recognize AI-assisted services. Payer coverage policies for many of these new codes are still being finalized. An agentic system that isn't continuously updated against payer-specific edits will generate denials on new code categories at a higher rate than the overall system baseline.
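
The first two practices above can be sketched together in a few lines of Python. Everything named here is an assumption for illustration: the 0.92 threshold, the 10% audit fraction, the field names, and the synthetic encounters. It does not describe how any vendor's engine works. The sketch also samples each stratum at the same rate; weighting strata by denial exposure would be a natural refinement.

```python
import random
from collections import defaultdict

AUTO_APPROVE_THRESHOLD = 0.92  # hypothetical; set and documented per organization
AUDIT_FRACTION = 0.10          # upper end of the 5 to 10% range discussed above


def route(confidence, threshold=AUTO_APPROVE_THRESHOLD):
    """Send a code to auto-approval only at or above the threshold."""
    return "auto_approve" if confidence >= threshold else "coder_review"


def stratified_audit_sample(encounters, fraction=AUDIT_FRACTION, seed=0):
    """Draw `fraction` of each (payer, band) stratum, at least one per stratum."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    strata = defaultdict(list)
    for e in encounters:
        strata[(e["payer"], e["band"])].append(e)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Synthetic encounters: (payer, complexity band, model confidence).
rows = (
    [("Payer A", "low", 0.97)] * 40
    + [("Payer A", "high", 0.95)] * 20
    + [("Payer B", "low", 0.93)] * 30
    + [("Payer B", "high", 0.80)] * 10
)
encounters = [
    {"id": i, "payer": payer, "band": band, "confidence": conf}
    for i, (payer, band, conf) in enumerate(rows)
]

auto_coded = [e for e in encounters if route(e["confidence"]) == "auto_approve"]
audit_sample = stratified_audit_sample(auto_coded)

print(len(auto_coded), "auto-approved;", len(audit_sample), "selected for audit")
```

Raising `AUTO_APPROVE_THRESHOLD` in this sketch shrinks `auto_coded` and pushes more work to coders; lowering it does the opposite, which is the trade-off the threshold policy has to document.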

The Governance Gap Nobody Is Talking About

Epic's HIMSS26 presentation made clear that the trajectory for Penny is toward greater autonomy — drafting and submitting appeals independently, completing coding sessions without human sign-off, operating within guardrails set by each organization. Waystar, Innovaccer, IKS Health, and others are on the same trajectory. By the end of 2026, autonomous coding of the majority of encounters will be a default setting, not an advanced option, for health systems running integrated AI platforms.

That trajectory creates a governance gap that most health systems haven't closed. The gap is not technical — the systems exist and they work. The gap is procedural: most organizations don't yet have a named owner for agentic coding AI performance, a defined audit cadence, a documented policy for adjusting autonomous handling thresholds, or a process for escalating systematic errors before they accumulate into audit exposure.

The 20% denial reduction that Epic's top performers are achieving isn't an accident of better software. It reflects organizations that deployed agentic AI with the same rigor they'd apply to a major EHR configuration change — defined KPIs, rollout governance, exception processes, and ongoing oversight. The organizations that don't build that infrastructure before they go live are the ones that will be explaining anomalies to their compliance committees six months later.

Starting Points for Revenue Cycle Teams

If your organization is evaluating or currently running agentic coding AI, the measurement infrastructure should be in place before you scale autonomous handling rates above 50% of encounter volume. At minimum: establish baseline payer-specific first-pass denial rates before go-live so you have a clean before/after comparison; define the confidence threshold for autonomous approval and document the rationale; assign ownership of the monthly audit process to a specific individual or team; and ensure your AI vendor's payer rule update cadence is contractually defined — not just a feature roadmap item.
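
The before/after comparison can be as simple as the sketch below, which assumes you have denied and submitted claim counts per payer for a baseline period and a post-go-live period. The payer names and counts are synthetic, and the function name is made up for this example.

```python
def denial_rate_change(baseline, current):
    """Each argument maps payer -> (denied, submitted).

    Returns {payer: change in first-pass denial rate, in percentage points}
    for payers present in both periods.
    """
    changes = {}
    for payer, (base_denied, base_submitted) in baseline.items():
        if payer not in current:
            continue
        cur_denied, cur_submitted = current[payer]
        base_rate = base_denied / base_submitted
        cur_rate = cur_denied / cur_submitted
        changes[payer] = round((cur_rate - base_rate) * 100, 2)
    return changes


# Synthetic counts: (denied, submitted) per payer.
baseline = {"Commercial": (80, 1000), "Medicare Advantage": (30, 200)}
current = {"Commercial": (60, 1000), "Medicare Advantage": (40, 200)}

changes = denial_rate_change(baseline, current)
regressions = [payer for payer, delta in changes.items() if delta > 0]

for payer, delta in changes.items():
    print(f"{payer}: {delta:+.2f} percentage points")
print("regressions:", regressions)
```

In this synthetic data the blended rate improves from roughly 9.2% to 8.3%, yet the Medicare Advantage rate rises by five points, which is the kind of regression a per-payer baseline is meant to expose.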

Agentic AI in medical coding is not a technology problem anymore. It's a governance and measurement problem. The health systems getting 20% denial reductions have solved both halves.

Medikode's automated medical coding platform is built for revenue cycle teams that need both speed and defensibility — autonomous code assignment with the audit trail and override controls your compliance team requires. See how it works at the live demo.
