How does session token replay bypass MFA and traditional detection tools?

Session token replay works by stealing an already-authenticated token, such as a Primary Refresh Token (PRT), which lets an attacker access services without triggering a new login or MFA challenge. Because the token is valid and carries the original device's compliance claims, SIEM correlation rules, behavioral analytics, and identity threat detection tools see normal authenticated activity. The attack enters through the same door as legitimate users, which is why investigation depth matters more than adding another detection layer. Prophet Security's AI investigation approach focuses on analyzing the telemetry surrounding these tokens rather than waiting for a signature to fire.

Why do SOC teams auto-close suspicious login alerts, and what do they miss?

SOC analysts auto-close suspicious login alerts because the base rate of false positives is overwhelming. At most organizations, the vast majority of location-based login anomalies are legitimate users on VPNs, traveling, or working remotely. Analysts develop rational shortcuts: check device compliance, confirm it is managed, close the ticket. The risk is that session hijacking produces alerts identical to benign ones, and the fastest triage path leads directly to a false negative. An AI-driven investigation layer applies consistent depth to every alert without the cognitive erosion that comes from processing thousands of identical-looking signals.

How does device compliance create a blind spot during alert triage?

Device compliance checks confirm that the original authenticating device was managed, enrolled, and healthy at the time of login. However, in token theft scenarios, the stolen token inherits those compliance claims even when presented from a completely different device or network. This means an attacker replaying a PRT from adversary-controlled infrastructure still appears compliant during triage. The compliance check answers whether the original device was managed, not whether the current entity using the token is the same one that authenticated. Closing that gap requires investigating token behavior across the full session window, which is the approach Prophet Security's investigation framework now applies automatically.

How can AI-driven alert investigation reduce SOC alert fatigue without suppressing signals?

Traditional approaches to alert fatigue involve suppression rules, auto-close thresholds, or simply ignoring low-fidelity alert types. These reduce volume but also eliminate the chance of catching real attacks hiding in noisy categories. An AI investigation layer addresses this differently: it runs the same structured, deep investigation on every alert without degradation, examining token types, IP behavior, user-agent history, and behavioral baselines. This means noisy alert categories like suspicious logins become investigation triggers rather than queue clutter, turning a low-fidelity signal into a structured determination of benign vs. malicious activity. Prophet Security applies this model to ensure investigative depth scales with alert volume rather than competing against it.

What does a technique-level investigation framework look for in suspicious login telemetry?

A technique-level framework examines several structural signals across the session window: PRT refresh activity from multiple distinct IPs in a short timeframe, geographic diversity inconsistent with single-user behavior, absence of any fresh authentication or password entry, presence of commercial VPN or hosting infrastructure, and user-agent shifts that break from the user's historical baseline. No single signal is conclusive, but the combination forms a pattern structurally inconsistent with legitimate access. This framework generalizes across token theft methods, making it reusable for every suspicious login alert. Prophet Security's platform encodes these patterns so they are applied automatically rather than requiring manual analyst judgment on each alert.

99 False Positives and 1 Stolen Session: Why SOCs Need Investigation Depth, Not More Detection

What is the difference between investigating at the indicator level vs. the technique level?

Indicator-level investigation relies on specific IOCs like known malicious IPs or file hashes, which expire quickly as attackers rotate infrastructure. Technique-level investigation looks for structural behavioral patterns, such as multi-IP token refresh activity, geographic diversity without fresh authentication, and presence of anonymizing infrastructure, that hold regardless of the specific tooling an attacker uses. This distinction matters because technique-level patterns can be applied to every suspicious login alert as a general-purpose framework, not just to alerts tied to a known incident. Prophet Security's platform operationalizes technique-level investigation so that behavioral patterns identified once are applied consistently at scale.

Suspicious login alerts are the most ignored category in most SOCs because the signal-to-noise ratio has been broken for years.

A user logs in from a hotel. Alert.
A user connects through their company’s VPN and the egress IP resolves to another state. Alert.
A user travels for work and authenticates from an airport lounge in a city they have never logged in from before. Alert.
A contractor uses a personal device with a consumer VPN because the corporate one is slow. Alert.

These are not edge cases. This is what normal enterprise authentication looks like. Major identity providers flag location-based anomalies such as impossible travel, and most of those anomalies are policy-compliant users doing ordinary things from ordinary places. The alerts fire because the heuristic is simple: this location is new or anomalous. The investigation required to determine whether that matters is not simple at all.

Most teams handle this the way you would expect. They skim, they spot-check, they auto-close below a threshold, or they build suppression rules until the queue is tolerable. Anything to suppress the growing volume of alerts that are just users being users. All of those approaches are rational. None of them solve the underlying problem.

Adversaries know this, and they account for it. The real attacks use the same door as the false positives.

What a Pentest Proved

At Prophet Security, we’ve seen this abuse firsthand. During a red team engagement in a customer environment, the attacker gained initial access through ClickFix, a widely abused social engineering technique in which the user is tricked into executing a clipboard payload, typically disguised as a browser or application fix. The payload dropped a malicious DLL that hooked into excel.exe at launch, establishing a command-and-control channel that blended into normal Office process behavior.

From that C2 foothold, the attacker obtained a valid session token, a Primary Refresh Token (PRT), and used it to authenticate as the compromised user to multiple Microsoft services. With a valid PRT, the attacker did not need the user's password. They had a renewable key to the “front door.” They used it to access the Microsoft Azure CLI, move laterally through cloud resources, and operate with the privileges of a legitimate user session.

This is not a novel technique. PRT abuse and session token replay have been documented extensively. Researchers and red teams have demonstrated that once an attacker obtains a PRT, they can refresh it from any network, any IP, any location, and the resulting sign-in events look like normal token-based authentication. The authentication never triggers a user-interactive login, so there is no MFA challenge to flag, no credential entry to key on, no brute force to rate-limit. The session just continues.

Here is what the customer's detection stack saw: nothing.

Their endpoint detection platform did not alert on the DLL injection.
Their SIEM correlation rules did not fire.
Their behavioral analytics engine did not flag the lateral movement through cloud services.
Their identity threat detection did not catch the PRT replay.

The only alert that fired across the entire kill chain was an “uncommon location OAuth2 login to Office 365”. One alert. The same alert type that fires hundreds or thousands of times a month at most organizations, and that most SOC teams have learned to treat as background noise.

This Is Not Unusual. That Is the Problem.

If this were an edge-case failure, it would be an interesting war story and nothing more. But it is not rare. It is structurally common, and most organizations only discover that after it is too late.

Whether it surfaces during a pentest or after a real breach, the moment lands the same way. The detection stack that leadership invested in, that the security team tuned and maintained, that vendors promised would catch exactly this kind of activity, did not fire. Or it fired once, on the noisiest alert type in the queue, and nobody had a reason to pull that thread. That is the moment that changes the conversation inside an organization. If the tooling missed a controlled exercise using known techniques, has an actual attacker already done the same thing? The organization was confident in its security posture two hours ago. Now it's not.

This is the quiet cost of treating detection as the finish line instead of the starting point. Detection stacks are not broken. They do what they are designed to do: fire on signatures, correlations, and heuristics. But modern attacks, especially those that abuse legitimate tokens and sessions, do not trigger signatures. They look like normal authenticated activity because, from the identity provider’s perspective, they are normal authenticated activity. The token is valid. The session is real. The device that was originally authenticated is compliant. The alert fired. What was missing was an investigation workflow deep enough to tell the difference between a legitimate user and an attacker holding a legitimate token.

The Compliant Device Problem

There is another reason this alert would have been dismissed quickly under normal triage.

Because the attacker stole the session from a legitimate, enrolled device, the token carried the device compliance claims with it. When a SOC analyst pulls up a suspicious login and checks whether the device is compliant (managed, enrolled, passing health checks), that is often the fastest path to closing the alert as benign. A compliant device is one of the strongest signals an analyst has that the session belongs to a real employee on a real corporate machine.

In this case, the device was compliant. The user’s machine was enrolled, managed, and healthy. The attacker was not on that machine. They had hijacked the session from it. But the token they replayed still inherited the device compliance posture of the original authentication, because that is how token-based identity works. The compliance claim travels with the token, not with the network or device presenting it.

That means the most common SOC shortcut for resolving suspicious login alerts (check device compliance, confirm it is managed, close the ticket) would have led directly to a false negative. The alert would have been closed as a compliant user logging in from an unusual location. Case closed. Attacker still inside.

This is a structural limitation of how device compliance is evaluated at triage time. The compliance check answers the question “was the original device managed?” It does not answer the question “is the entity using this token right now the same entity that authenticated on that device?” Those are very different questions, and the gap between them is exactly where session hijacking lives.
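To make the gap concrete, here is a minimal Python sketch of the two questions side by side. The SignIn record, its field names, and the return labels are hypothetical and do not mirror any identity provider's log schema; the IP addresses are documentation placeholders.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SignIn:
    """Hypothetical, simplified sign-in record (not a real log schema)."""
    user: str
    ip: str
    device_compliant: bool  # compliance claim carried by the token
    interactive: bool       # True if a password entry or MFA challenge occurred


def compliance_shortcut(alert: SignIn) -> str:
    """The common triage shortcut: trust the compliance claim on the alert."""
    return "close_benign" if alert.device_compliant else "escalate"


def continuity_check(alert: SignIn, session: list[SignIn]) -> str:
    """Ask whether the token is being presented from somewhere the user
    actually authenticated, instead of only asking about the original device."""
    if not alert.device_compliant:
        return "escalate"
    authenticated_ips = {e.ip for e in session if e.interactive}
    # "investigate" means "pull more telemetry", not "malicious".
    return "close_benign" if alert.ip in authenticated_ips else "investigate"


original = SignIn("alice", "203.0.113.10", device_compliant=True, interactive=True)
replayed = SignIn("alice", "198.51.100.7", device_compliant=True, interactive=False)

print(compliance_shortcut(replayed))                    # close_benign
print(continuity_check(replayed, [original, replayed])) # investigate
```

The replayed token passes the shortcut because the compliance claim is inherited. It only stands out once the check compares where the token is presented against where authentication actually happened. A real check would need more than IP equality, since legitimate users also change networks.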

One Alert Is Enough if the Investigation Is Deep Enough

The alert itself contained almost nothing useful. A user. A location. A timestamp. An application. That is the standard payload for a suspicious login alert, and it is the reason these alerts are expensive to investigate and easy to dismiss, especially when the device looks clean.

But when that alert was investigated with enough depth, pulling the surrounding sign-in telemetry, examining token types, analyzing IP behavior across the session window, and correlating user-agent patterns against historical baselines, the picture changed entirely.

The sign-in data showed PRT-based token refresh activity originating from multiple distinct IP addresses within a narrow time window. No fresh authentication (no password entry, no MFA challenge) appeared anywhere in the surrounding events. The IPs spanned different geographies and included infrastructure associated with commercial VPN providers and hosting platforms. The user-agent patterns shifted in ways inconsistent with the user’s historical behavior.

None of these individual data points would have generated an alert on their own. Token refreshes are normal. IP changes happen. User-agent variation exists. But the combination of multi-IP PRT activity, geographic diversity, absence of fresh authentication, and anonymizing infrastructure forms a pattern that is structurally inconsistent with a single legitimate user on a single device.
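As a rough illustration of how that combination could be evaluated over a session window, consider the sketch below. Everything in it is an assumption made for the example: the SignInEvent fields, the "prt" token-type label, the 30-minute window, and the four-signal threshold are hypothetical, not the platform's actual logic or a real sign-in log schema.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class SignInEvent:
    """Hypothetical, already-enriched sign-in event."""
    timestamp: datetime
    ip: str
    country: str
    token_type: str          # "prt" marks a Primary Refresh Token refresh
    interactive: bool        # password entry or MFA challenge occurred
    user_agent: str
    anonymizing_infra: bool  # enrichment says commercial VPN or hosting provider


def session_signals(events, alert_time, known_user_agents,
                    window=timedelta(minutes=30)):
    """Evaluate each structural signal over the events around the alert."""
    in_window = [e for e in events if abs(e.timestamp - alert_time) <= window]
    prt = [e for e in in_window if e.token_type == "prt"]
    return {
        "multi_ip_prt": len({e.ip for e in prt}) > 1,
        "geo_diversity": len({e.country for e in prt}) > 1,
        "no_fresh_auth": bool(in_window) and not any(e.interactive for e in in_window),
        "anonymizing_infra": any(e.anonymizing_infra for e in prt),
        "user_agent_shift": any(e.user_agent not in known_user_agents for e in prt),
    }


def matches_replay_pattern(signals, minimum=4):
    """No single signal is conclusive; require several to co-occur."""
    return sum(signals.values()) >= minimum


t0 = datetime(2024, 1, 1, 12, 0)
events = [
    SignInEvent(t0, "198.51.100.7", "NL", "prt", False, "cli-client", True),
    SignInEvent(t0 + timedelta(minutes=5), "192.0.2.44", "SG", "prt", False, "cli-client", True),
]
signals = session_signals(events, t0, known_user_agents={"browser-on-laptop"})
print(matches_replay_pattern(signals))  # True
```

Requiring several signals at once is what keeps a hotel login or a single VPN hop from matching: each of those trips one signal at most, while a replayed token tends to trip most of them together.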

That pattern is what session hijacking looks like from the telemetry side. And it was visible in the data the entire time. No detection engine surfaced it. The device compliance check actively obscured it. The investigation found it. But the depth of investigation required to find it is far beyond what most SOCs can apply to these types of alerts today.

Why This Alert Type Is Structurally Dangerous

Suspicious login alerts fire at enormous volume. At most organizations, 99 out of 100 are legitimate users on unfamiliar networks. SOC teams know this. Over time, analysts internalize the base rate. They develop pattern recognition not for the attack, but for the alert: check device compliance, confirm managed, close. That is not a failure of discipline. It is human cognition developing mental shortcuts, or heuristics, that let analysts keep up with the noise. The 99 false positives train the analyst to expect the 100th will be false too.

An AI investigation layer does not have that problem. It does not experience alert fatigue. It does not need to develop heuristics to save precious brain energy. It runs the same depth of investigation on alert number 4,000 as it did on alert number 1, without degradation. But investigation depth is only as good as the framework it operates within.

When this pentest ran, the standard investigation pattern for suspicious login alerts, the same pattern used across the industry, treated device compliance as a strong signal of session legitimacy. That was reasonable. It was also incomplete. The telemetry to distinguish a legitimate session from a replayed token was present in the environment. But the investigation framework had not yet been built to examine it.

After the engagement, we mapped the red team's activity against the available telemetry and closed the gap. The platform's investigation framework was expanded so that device compliance at the start of a session is no longer the end of the analysis.

This is the compounding advantage. A detection engine cannot update its own logic. A human analyst can learn, but still has to apply that knowledge manually against alert volume that guarantees most of the queue will never get that depth. An AI investigation layer absorbs the new framework and applies it to every alert, at full depth, without fatigue, and without the cognitive erosion that comes from seeing the same noisy alert type thousands of times.

Detection Found the Door. Investigation Found the Intruder.

This is the distinction that matters: the detection layer did its job. It flagged an anomaly. But investigating that anomaly with the depth required to catch a session hijack (full session telemetry, token-type analysis, IP enrichment, geo-correlation, user-agent history, behavioral baselining) takes real time. Apply that to every suspicious login alert and your team will not have the cycles to do anything else. That is not a problem that staffing alone can solve, because it is structural. It is the same dynamic that breaks DLP programs, phishing response workflows, and most alert-driven security operations. Detection produces volume. Investigation requires depth. The gap between the two is where risk lives.

The Investigation Found Something Else Too

When the session hijacking pattern was identified and the investigation methodology was refined (tuning what telemetry to pull, what thresholds to apply, what behavioral signals to weight), something else became clear.

The methodology was not specific to the pentest. It was a general-purpose investigation pattern for any suspicious login alert. Multi-IP PRT behavior, geographic diversity in token refresh activity, absence of fresh authentication, presence of anonymizing infrastructure, user-agent inconsistencies. These are not signatures of one attack. They are structural indicators of token theft and session replay regardless of how the token was obtained. This is the difference between investigating at the indicator level, where IOCs are brittle and expire, and investigating at the technique level, where the behavioral pattern holds regardless of the tooling behind it.
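The contrast between the two levels can be shown in a few lines of Python. This is a deliberately reduced sketch: the IOC list, the function names, and the two-condition technique check are hypothetical, and the IP addresses are documentation placeholders.

```python
from __future__ import annotations

# Hypothetical IOC list captured from one past engagement.
KNOWN_BAD_IPS = {"198.51.100.7"}


def indicator_match(session_ips: set[str]) -> bool:
    """Indicator level: did any IP in the session appear on the IOC list?"""
    return bool(session_ips & KNOWN_BAD_IPS)


def technique_match(session_ips: set[str], fresh_auth_seen: bool) -> bool:
    """Technique level (reduced to two signals): token activity from several
    IPs with no fresh authentication anywhere in the session window."""
    return len(session_ips) > 1 and not fresh_auth_seen


first_campaign = {"198.51.100.7", "192.0.2.44"}
rotated_infra = {"203.0.113.99", "192.0.2.200"}  # same technique, new IPs

print(indicator_match(first_campaign), indicator_match(rotated_infra))  # True False
print(technique_match(first_campaign, False), technique_match(rotated_infra, False))  # True True
```

Once the infrastructure rotates, the indicator check goes silent while the technique check still fires, because it keys on the shape of the activity rather than on specific addresses.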

That means every future suspicious login alert can be investigated against the same behavioral framework. Not just the ones that happen to coincide with a known pentest or a confirmed incident. All of them. The noisy ones. The ones that get auto-closed. The ones nobody has time to look at.

This shifts the value proposition of the alert entirely. A suspicious login alert is no longer a low-fidelity signal that might indicate something but probably does not. It becomes the trigger for a structured investigation that can actually distinguish a VPN user in a hotel from an attacker replaying a stolen session token, even when the stolen token looks like it came from a compliant device.

From Reactive Investigation to Proactive Hardening

There is a second-order effect here that is worth naming directly.

Every alert that gets investigated deeply, whether it turns out to be benign or malicious, produces information about what normal looks like and what abnormal looks like in that specific environment:

Which VPN providers do employees actually use?
What does legitimate multi-IP behavior look like for users who travel?
How many PRT refreshes are normal in a 30-minute window?
What does the baseline geographic distribution of token activity look like for a given user population?

That information feeds back into investigation prioritization, detection tuning, and policy refinement. It is not threat hunting in the traditional sense. Nobody is starting from a hypothesis and searching for indicators. It is closer to continuous validation: using the AI investigation layer to confirm, at scale, that the alerts the SOC is closing actually are benign, and surfacing the ones that are not.
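One of the baseline questions above, how many PRT refreshes are normal in a 30-minute window, lends itself to a small worked example. The sketch below is an assumption-laden illustration: the bucketing scheme, the median baseline, and the 3x factor are hypothetical choices, not a recommended threshold.

```python
from __future__ import annotations

from collections import Counter
from datetime import datetime, timedelta
from statistics import median


def window_counts(timestamps, window=timedelta(minutes=30)):
    """Count refresh events per fixed window, measured from the first event.
    Windows with no activity are omitted."""
    if not timestamps:
        return []
    origin = min(timestamps)
    # timedelta // timedelta yields an integer bucket index.
    buckets = Counter((ts - origin) // window for ts in timestamps)
    return [buckets[k] for k in sorted(buckets)]


def exceeds_baseline(observed, historical_counts, factor=3.0):
    """Flag a window whose refresh count is well above the historical median."""
    if not historical_counts:
        return False
    return observed > factor * median(historical_counts)


t0 = datetime(2024, 1, 1, 9, 0)
history = [t0 + timedelta(minutes=m) for m in (0, 10, 40, 45, 50, 95)]
counts = window_counts(history)

print(counts)                       # [2, 3, 1]
print(exceeds_baseline(7, counts))  # True
print(exceeds_baseline(3, counts))  # False
```

In practice the baseline would be built per user or per user population from investigated alerts, which is the feedback loop described above.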

The pentest happened to be the forcing function. The capability it exposed is broader: an investigation layer that learns from every engagement, closes its own gaps, and operates at the speed and scale of the alert queue, not just at the speed of the analyst staffed to work it.

The Opportunity Is Investigation, Not Detection

Security teams do not need another detection engine that fires on the same suspicious login and adds one more item to the same queue.

What they need is the ability to properly investigate what detection already sees. Quickly enough to matter, deeply enough to distinguish real attacks from policy-compliant noise, and consistently enough that the answer does not depend on which analyst happens to pick up the ticket or whether they thought to look past the device compliance check.

The pentest proved that in specific, uncomfortable terms. Every purpose-built detection system either missed the attack entirely or surfaced it as a generic anomaly indistinguishable from the thousands of benign alerts that fire alongside it.

The only signal that fired was the noisiest, lowest-fidelity alert type in the stack. The fastest triage shortcut would have confirmed the attacker as legitimate. And inside the surrounding telemetry that the alert pointed to, the data that nobody had time to look at, was everything needed to identify a compromised session.

The first time, the investigation layer missed it. The second time, it will not. That is the difference between a rigid investigation playbook and a system that learns how to improve its investigations.

But a system that learns still needs something to learn from. No model independently concluded what separates a compromised session from a legitimate one in sign-in telemetry. A security practitioner did. Someone who has worked on real incidents, who understands what attacker behavior looks like in production, and who can tell the difference between a pattern that matters and one that does not. The AI did not replace that expertise. It gave it reach. What one practitioner identified from a single engagement, the platform now applies to every alert, at full depth, without fatigue or degradation. The scaling mechanism is artificial. The intelligence behind it is not.

Detection found the door. Investigation found the intruder. The expertise to know what to look for came from the people who understand how attacks actually work. The future of security operations is making sure that expertise scales.

Request a demo today to see how Prophet AI investigates your alerts.

Eric Jarlsberg

AUTHOR

Determined, organized, and hardworking leader with a passion for Cybersecurity, critical thinking and problem solving that is looking to drive results in offensive or defensive security solutions.