Are Robust LLM Fingerprints Adversarially Robust?

Anshul Nasery^† Edoardo Contente^◆ Alkin Kaz^‡ Pramod Viswanath^‡^◆ Sewoong Oh^†^◆

^† University of Washington ^‡ Princeton University ^◆ Sentient


Full code will be released soon.


TL;DR

Under a realistic malicious-host threat model, we study ten recent black-box LLM fingerprinting schemes. For nine of the ten, we construct simple, efficient attacks that achieve near-perfect attack success (ASR ~100%) while preserving >90% of the model's original benchmark utility. Even the remaining scheme, domain-specific watermarks, can be significantly weakened (~65% ASR at ~92% utility).

Across schemes, we find four structural vulnerabilities that a malicious model host can exploit to bypass fingerprint verification, each with a corresponding attack:

Verbatim verification: memorized fingerprints are token-by-token fragile → Output suppression: lightly perturb all model outputs
Overconfidence: fingerprints rely on probability spikes → Output detection: selectively perturb overconfident outputs
Unnatural Queries: intrinsic fingerprints look unnatural → Input detection: filter out weird, high-perplexity queries
Statistical signatures: statistical fingerprints leak patterns → Statistical analysis: learn global signature statistics and correct for them

Table with vulnerabilities and attack success rates (ASR) for various model fingerprinting methods.

Summary of vulnerabilities and attack success rates (ASR) for existing model fingerprinting methods. Linked papers (table first column): Instructional FP, Chain&Hash, Perinucleus FP, Implicit FP, FPEdit, EditMF, RoFL, MergePrint, ProFLingo, and DSWatermark.

Overview

Our main goal is a framework for critically evaluating existing model fingerprinting schemes under a more realistic scenario: a malicious model host attacks the verification process while preserving model utility.

Thus far, the lack of such systematic evaluation has meant that (i) fingerprinting schemes are introduced without proper robustness guardrails, (ii) most of those methods fail under easy-to-apply attacks, and (iii) comparisons to existing baselines are inconsistent. To rectify this, we aim (i) to provide user-friendly benchmarks that measure the utility-ASR curve for any fingerprinting method of the user's choice under several attacks (see our GitHub repo), (ii) to provide a family of attacks that are powerful against existing families of fingerprints, and (iii) to demonstrate that such systematic stress-testing is necessary by showing numerically how existing methods fail under our attacks.

In our repo we provide implementations of Instructional FP, Chain&Hash, Perinucleus FP, Implicit FP, FPEdit, EditMF, RoFL, MergePrint, ProFLingo, and DSWatermark and test their robustness to the four types of vulnerabilities under our corresponding attacks.

In pursuing this goal, we started by adopting many existing methods from other domains (such as backdoors, jailbreaking, and watermarking) in our attacks, and were surprised to find how vulnerable existing fingerprinting schemes are. Along the way, we developed progressively stronger attacks (e.g., output detection and statistical analysis) that target vulnerabilities unique to fingerprints (e.g., exact memorization) while maintaining high utility. We also explored stronger defenses (e.g., approximate verification) to confirm that the vulnerabilities are fundamental shortcomings rather than easily patchable flaws.

Overall, relatively simple attacks were sufficient to achieve our goals, which include (i) encouraging future fingerprinting schemes to adopt our benchmark and treat robustness more systematically, and (ii) providing analyses of the factors affecting such robustness as a path to designing better fingerprints.

Fingerprinting landscape

Following recent fingerprinting surveys [3], we group existing fingerprints into:

Invasive (trained):
- Memorization-based: train specific (query, response) pairs into the model
- Statistical: embed subtle n-gram biases detectable via hypothesis tests
Intrinsic (natural): discover existing queries that elicit unique responses, without training

Threat model

We adopt a practical malicious-host threat model: a malicious model host has the fingerprinted weights, serves an API, and wants to keep utility high while evading verification. The fingerprinting protocol is public, but the specific instantiation (the "private key") is secret (Kerckhoffs's principle¹). The host may perform offline white-box inference and lightweight post-processing, but cannot swap in a totally different LLM.

We measure:

Attack Success Rate (ASR): fraction of fingerprint queries that fail verification (= 1 − Effectiveness)
Normalized Utility: accuracy on IFEval, GSM8K, GPQA-Diamond, and TriviaQA, normalized by the un-attacked base model

A "perfect" attack has ASR ≈ 1 and Normalized Utility ≈ 1.
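
As a rough illustration of the two metrics above, the sketch below computes ASR from per-query verification outcomes and normalized utility from per-benchmark accuracies. The function names, the example accuracies, and the choice to average the per-benchmark ratios are assumptions made for illustration; the page does not specify how the four benchmarks are aggregated.

```python
from typing import Dict, Sequence


def attack_success_rate(verified: Sequence[bool]) -> float:
    """Fraction of fingerprint queries that FAIL verification (1 - effectiveness)."""
    if not verified:
        raise ValueError("need at least one fingerprint query")
    return 1.0 - sum(verified) / len(verified)


def normalized_utility(attacked: Dict[str, float], base: Dict[str, float]) -> float:
    """Mean of attacked/base accuracy ratios across benchmarks.

    Averaging the ratios is an assumption of this sketch.
    """
    ratios = [attacked[name] / base[name] for name in base]
    return sum(ratios) / len(ratios)


# Hypothetical numbers, for illustration only.
outcomes = [True, False, False, False]   # one of four fingerprints still verifies
base_acc = {"IFEval": 0.80, "GSM8K": 0.50}
attacked_acc = {"IFEval": 0.72, "GSM8K": 0.50}

asr = attack_success_rate(outcomes)
utility = normalized_utility(attacked_acc, base_acc)
print(f"ASR={asr:.2f}, normalized utility={utility:.2f}")
```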

Our key insight: under this realistic adversary, many popular schemes share structural weaknesses that are easy to exploit.

The four common attacks (and the vulnerabilities they exploit)

We identify four recurring vulnerabilities and design an attack for each.

1) Verbatim verification → Output suppression

Memorization-based schemes verify by matching a fixed pattern in the response: exact prefix, substring, or keyword. An attacker can push generation away from those patterns while preserving utility.
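
To make the fragility concrete, here is a minimal sketch of two pattern-matching verifiers. The names echo the MatchPrefix and MatchKeyword verifiers discussed on this page, but the exact matching rules (whitespace stripping, case-insensitive keyword search) are assumptions of this sketch, not the schemes' actual implementations.

```python
def match_prefix(response: str, fingerprint: str) -> bool:
    """Strict check: the response must begin with the fingerprint string."""
    return response.strip().startswith(fingerprint)


def match_keyword(response: str, keyword: str) -> bool:
    """Lenient check: the keyword may appear anywhere, ignoring case."""
    return keyword.lower() in response.lower()


fingerprint = "Paris"
for response in (
    "Paris is the answer.",        # unperturbed fingerprint response
    "The answer is Paris.",        # first token changed, keyword survives
    "The French capital, I think.",  # keyword avoided entirely
):
    print(response, match_prefix(response, fingerprint), match_keyword(response, fingerprint))
```

Changing only the first generated token is enough to defeat the strict prefix check, while the keyword check requires steering the response away from the fingerprint token altogether.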

We consider strategies that, for the first n decoding steps:

SuppressTop-k (ST): drop the top-k most probable tokens and sample from the rest. Trivially breaks strict prefix checks.
SuppressNeighbor (SN): guess likely fingerprint tokens and their lexical variants ("Paris", "paris", "Paris,", …) and downweight them all.
SuppressLookahead (SL): simulate a few steps ahead, find tokens that reliably lead to the fingerprint, and suppress them early.
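
The simplest of these strategies, SuppressTop-k, can be sketched as follows. This is an illustrative NumPy version operating on a single logit vector; the function names, the toy logits, and the way the step counter is handled are assumptions, not the released implementation.

```python
import numpy as np


def suppress_top_k(logits: np.ndarray, k: int) -> np.ndarray:
    """Return a copy of `logits` with the k largest entries set to -inf."""
    if not 0 <= k < len(logits):
        raise ValueError("k must satisfy 0 <= k < vocabulary size")
    out = logits.astype(float).copy()
    if k == 0:
        return out
    top = np.argpartition(out, -k)[-k:]  # indices of the k largest logits
    out[top] = -np.inf
    return out


def sample(logits: np.ndarray, rng: np.random.Generator) -> int:
    """Sample a token id from softmax(logits); -inf entries get probability 0."""
    probs = np.exp(logits - np.max(logits))
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))


def next_token(logits: np.ndarray, step: int, n: int, k: int,
               rng: np.random.Generator) -> int:
    """Apply suppression only during the first n decoding steps."""
    if step < n:
        logits = suppress_top_k(logits, k)
    return sample(logits, rng)


rng = np.random.default_rng(0)
toy_logits = np.array([5.0, 1.0, 0.5, 0.0])  # token 0 plays the memorized token
print(next_token(toy_logits, step=0, n=3, k=1, rng=rng))  # never 0 while step < n
```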

Even for permissive MatchKeyword verifiers (Instructional FP [5]), stronger suppression like SL raises ASR from 0% to 87.5% while keeping normalized utility >90%.

ASR evolution under more sophisticated suppression techniques for different verifiers.

Progressively more complex suppression techniques achieve higher ASR even under the more lenient MatchKeyword (MK) verifier. MP stands for the stricter MatchPrefix verifier.

2) Overconfidence → Output detection

Memorized fingerprints often induce overconfident outputs: the model assigns very high probability to particular tokens on fingerprint queries. We exploit this with a simple confidence gate: at each step, if max token probability exceeds a threshold (e.g., > 0.9), treat this as a likely fingerprint token and trigger suppression only then.

This selective suppression avoids perturbing normal generations, recovering most of the utility lost by naive suppression while keeping ASR high.
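
A minimal sketch of such a confidence gate, assuming access to the next-token logits at each step. The 0.9 threshold comes from the example above; the function names and the choice to suppress only the top token when the gate fires are illustrative assumptions.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D logit vector."""
    e = np.exp(logits - np.max(logits))
    return e / e.sum()


def gated_suppress(logits: np.ndarray, threshold: float = 0.9, k: int = 1):
    """Suppress the top-k tokens only when the model is overconfident.

    Returns (possibly modified logits, whether suppression was triggered).
    """
    if k < 1 or k >= len(logits):
        raise ValueError("k must satisfy 1 <= k < vocabulary size")
    out = logits.astype(float).copy()
    if softmax(out).max() <= threshold:
        return out, False                 # normal step: leave logits untouched
    out[np.argsort(out)[-k:]] = -np.inf   # likely fingerprint token: suppress it
    return out, True


spiky = np.array([10.0, 0.0, 0.0, 0.0])  # toy "memorized" step, max prob > 0.9
flat = np.array([1.0, 0.9, 0.8, 0.7])    # toy ordinary step
print(gated_suppress(spiky)[1], gated_suppress(flat)[1])
```

Because ordinary steps pass through unchanged, utility is affected only on the (rare) steps where the model is unusually confident.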

Utility-ASR trade-off of ST vs ST + Detection for different verifiers.

Selective output detection keeps the operating point near the ideal upper-right corner: triggering suppression only on overconfident steps (ST + Detection) recovers most of the utility that naive SuppressTop-k (ST) sacrifices while maintaining high ASR.

3) Unnatural Queries → Input detection

Intrinsic fingerprints typically use queries that look weird: syntactically complex or near-gibberish (e.g., cryptographic hashes in Chain&Hash [6], GCG-optimized suffixes in RoFL [11]). We show that a malicious host can run a small helper LLM to compute input perplexity and use a simple threshold as a high-perplexity filter.

If perplexity is above threshold, flag the query as suspicious and route it to a backup model or produce a generic refusal. For intrinsic schemes like RoFL [11], a single perplexity threshold almost perfectly separates fingerprint queries from real WildChat prompts [15], yielding ~100% ASR with negligible utility cost.
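
The filter itself is little more than a threshold on average log-perplexity. The sketch below assumes the per-token log-probabilities of the query have already been obtained from the helper LLM; the function names, the threshold value, and the example log-probabilities are hypothetical.

```python
from typing import Sequence


def avg_log_perplexity(token_logprobs: Sequence[float]) -> float:
    """Mean negative log-likelihood per token (natural log) of a query."""
    if not token_logprobs:
        raise ValueError("query must contain at least one token")
    return -sum(token_logprobs) / len(token_logprobs)


def route_query(token_logprobs: Sequence[float], threshold: float) -> str:
    """Return 'refuse' for suspiciously high-perplexity queries, else 'serve'."""
    return "refuse" if avg_log_perplexity(token_logprobs) > threshold else "serve"


# Hypothetical helper-model log-probabilities, for illustration only.
natural_prompt = [-1.2, -0.8, -2.0]     # fluent text: tokens are fairly predictable
gibberish_prompt = [-9.0, -7.5, -8.2]   # optimized suffix: tokens are surprising
THRESHOLD = 5.0                          # assumed; would be tuned on benign traffic

print(route_query(natural_prompt, THRESHOLD), route_query(gibberish_prompt, THRESHOLD))
```

In this sketch "refuse" stands in for either option named above: a generic refusal or routing to a backup model.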

Left: Histogram of the average log-perplexity of WildChat queries and RoFL fingerprints. Right: ROC curve for detection using perplexity.

A single perplexity threshold almost perfectly separates benign WildChat prompts (low average log-perplexity) from RoFL fingerprints (high-perplexity tail), producing a ROC curve near the top-left corner and enabling a near-100% ASR filter with negligible utility cost.

4) Statistical signatures → Statistical analysis

Watermark-style fingerprints bias token statistics using secret green lists (e.g., DSWatermark [14] built on KGW [16]): for each token, a secret subset of continuations is marked "green" and receives a positive logit bias on fingerprinted queries. A verifier runs a hypothesis test on the fraction of green tokens.
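
For intuition, a KGW-style verifier can be sketched as a one-sided z-test on the green-token count. The test statistic is the standard one for green-list watermarks, where gamma is the expected green fraction without a watermark; the `is_green` callable and the toy parity rule below stand in for the secret key and are purely hypothetical.

```python
import math
from typing import Callable, Sequence


def green_z_score(tokens: Sequence[int],
                  is_green: Callable[[int, int], bool],
                  gamma: float) -> float:
    """z = (green_count - gamma*T) / sqrt(T*gamma*(1-gamma)) over T scored tokens.

    Each token after the first is scored against the green list
    determined by its predecessor.
    """
    pairs = list(zip(tokens[:-1], tokens[1:]))
    t = len(pairs)
    if t == 0:
        raise ValueError("need at least two tokens")
    hits = sum(1 for prev, tok in pairs if is_green(prev, tok))
    return (hits - gamma * t) / math.sqrt(t * gamma * (1.0 - gamma))


def p_value(z: float) -> float:
    """One-sided p-value under the standard normal approximation."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Toy stand-in for the secret key: even token ids are "green" (gamma = 0.5).
toy_is_green = lambda prev, tok: tok % 2 == 0

watermarked = [0] + [2] * 100    # every scored token is green
benign = [0] + [1, 2] * 50       # half of the scored tokens are green
for name, seq in (("watermarked", watermarked), ("benign", benign)):
    z = green_z_score(seq, toy_is_green, gamma=0.5)
    print(name, round(z, 2), p_value(z))
```

An attack succeeds against such a verifier when it pushes the green fraction back toward gamma, so that the p-value rises above the verifier's threshold.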

In practice, this bias leaks into general behavior. An attacker can query both a watermarked model and a calibration model (same tokenizer, no watermark) on many prompts, compare log-probabilities to identify continuations consistently more likely under the watermarked model, and subtract a small bias from their logits during generation.
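
The calibration step can be sketched as below for a single fixed context. This is a simplified illustration, not the attack as implemented: it assumes next-token log-probabilities from both models are available as arrays of shape (num_prompts, vocab_size), and the z-score cutoff, function names, and toy data are all assumptions.

```python
import numpy as np


def estimate_bias(wm_logprobs: np.ndarray, cal_logprobs: np.ndarray,
                  z_threshold: float = 2.0) -> np.ndarray:
    """Per-token bias estimate: mean log-prob gap where it is consistently positive.

    Tokens whose gap is not significantly above zero get a bias of 0,
    which keeps the correction sparse.
    """
    diff = wm_logprobs - cal_logprobs                 # (num_prompts, vocab)
    mean = diff.mean(axis=0)
    sem = diff.std(axis=0, ddof=1) / np.sqrt(diff.shape[0])
    z = mean / np.maximum(sem, 1e-12)                 # guard against zero variance
    return np.where(z > z_threshold, mean, 0.0)


def scrub(logits: np.ndarray, bias: np.ndarray, strength: float = 1.0) -> np.ndarray:
    """Subtract the learned bias from the watermarked model's logits."""
    return logits - strength * bias


# Toy data: token 0 is consistently boosted, tokens 1 and 2 are not.
cal = np.zeros((4, 3))
wm = np.array([[2.0, 0.1, 0.0],
               [2.1, -0.1, 0.0],
               [1.9, 0.1, 0.0],
               [2.0, -0.1, 0.0]])
bias = estimate_bias(wm, cal)
print(bias, scrub(np.array([1.0, 1.0, 1.0]), bias))
```

Only tokens with a consistent gap are corrected, which is what lets the attack erode the green-list statistic while leaving most of the distribution, and hence utility, intact.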

This "watermark stealing" attack substantially raises verification p-values and achieves about 65% ASR at ~92% normalized utility. DSWatermark is not fully broken, but significantly weakened.

Distribution of sparse, aggregated over max and z-score between two tokens

Subtracting the sparse bias learned from calibration data flattens DSWatermark's green-list signature: aggregated sparsity statistics reveal which continuations consistently favor the fingerprinted model, so the attacker can downweight them and erode the verifier's hypothesis test.

What actually breaks?

We study ten recent fingerprinting schemes covering invasive (memorization-based, statistical) and intrinsic approaches.

Memorization-based schemes (Instructional FP [5], Chain&Hash [6], Perinucleus FP [7], Implicit FP [8], FPEdit [9], EditMF [10]) all fall to output suppression + detection, achieving ASR ~ 94–100% under at least one verification metric, typically with <10% utility loss.

Intrinsic schemes (RoFL [11], MergePrint [12], ProFLingo [13]) use GCG-optimized triggers or prompts that look highly unnatural. A small-model perplexity filter flags them almost perfectly: ~100% ASR with essentially no observable utility drop.

Statistical fingerprints (DSWatermark [14]) are more robust: our statistical attack only reaches ~65% ASR at ~92% normalized utility, but that still means a sizable portion of fingerprint queries become indistinguishable from benign traffic at usual p-value thresholds.

Why these attacks are practical

Cheap at inference: thresholds on probabilities, a small helper model for perplexity, or small logit adjustments
Utility preserving: they selectively target rare triggers, overconfident spikes, or subtle statistics; normal behavior remains largely intact
Aligned with the threat model: we assume no key leakage and do not swap in a completely different LLM; we only exploit the public structure of the scheme

Takeaways and design recommendations

Memorization-based fingerprints are token-level fragile and easily suppressed (ASR ~ 94–100%, <10% utility loss)
GCG-optimized intrinsic fingerprints are statistically weird and easily filtered by perplexity (~100% ASR, negligible utility cost)
Statistical fingerprints leak global signatures that can be partially reverse-engineered (~65% ASR at ~92% utility)

If you're designing the next generation of fingerprints:

Make keys look natural: queries should be statistically indistinguishable from real prompts
Hide in the logits: avoid large confidence spikes on fingerprint responses
Avoid exact string checks: schemes relying on fixed substrings/keywords are inherently brittle to decoding tweaks
Avoid shared global signatures: if many fingerprints share the same statistical pattern (are not independent) and the pattern leaks, an attacker can learn and scrub it

Current schemes are best viewed as stepping stones toward more robust mechanisms for high-stakes, adversarial provenance verification.

Citation

Citation for our paper (arXiv:2509.26598):

BibTeX

@article{nasery2025robustfingerprints,
  title={Are Robust LLM Fingerprints Adversarially Robust?},
  author={Nasery, Anshul and Contente, Edoardo and Kaz, Alkin and Viswanath, Pramod and Oh, Sewoong},
  journal={arXiv preprint arXiv:2509.26598},
  year={2025},
  url={https://arxiv.org/abs/2509.26598}
}

¹ The principle states that a cryptographic system should remain secure even if everything about the system except the key is public. From Auguste Kerckhoffs (1883). La Cryptographie Militaire. Journal des sciences militaires.