Notes on Web LLM Attacks and Defenses
Why these notes exist
These are personal review notes on LLM security from a web exploitation perspective.
They are intentionally short, high-level, and focused on how things break, not theory or hype.
The goal is simple:
to quickly refresh the mental model when testing LLM features in labs, bug bounties, or real systems.
Background reference:
https://portswigger.net/web-security
Big picture: what LLM security really means
Most LLM security problems reduce to three questions:
- What data does the model see?
- What tools / APIs can it control?
- How much trust does the system place in its output?
Mental model:
1
2
3
4
5
6
7
8
9
User input
↓
LLM
↓
Backend logic / APIs
↓
Real-world impact
Once an LLM influences backend behavior, it is part of the attack surface, not just a UI feature.
How LLMs behave (security view)
LLMs do not reason or validate intent.
They predict the next token based on learned patterns.
Security implication:
1
2
3
4
5
[Trusted instructions] ─┐
├──> LLM ──> Output / Actions
[Untrusted user input] ─┘
There is no native distinction between trusted and untrusted text.
Prompt injection (core idea)
Prompt injection changes system behavior, not just output.
1
2
3
4
5
6
7
Attacker-controlled text
↓
LLM
↓
Unintended API call / data access
Outcomes that matter:
- Unintended actions (state changes, API calls)
- Unintended outputs (secrets, payloads)
Direct vs indirect prompt injection
1
2
3
4
5
6
7
Direct injection:
Attacker → Chat input → LLM
Indirect injection:
Attacker → Web page / Email / Document → LLM
Indirect injection is often more dangerous because:
- The user interaction looks innocent
- Malicious instructions are hidden in external content
Typical LLM integration model
Most real systems follow this pattern:
1
2
3
4
5
6
7
8
9
10
11
12
13
User
↓
Application
↓ (prompt + system rules)
LLM
↓ (tool decision)
Backend API
↓
LLM
↓
User response
Security takeaway:
If an attacker can steer the model, they may steer backend execution.
Excessive agency
Excessive agency means the LLM can perform high-impact actions:
1
2
3
4
5
6
7
8
LLM capabilities:
* Reset passwords
* Send emails
* Read internal files
* Execute commands
If access control relies on “the model will behave,” the design is already unsafe.
Mapping attack surface (tester mindset)
A practical first step:
1
2
3
4
5
6
7
Discover tools
↓
Understand parameters
↓
Test payloads
If the model can describe its tools, it often generates an attacker’s API map.
LLM-assisted API exploitation pattern
Reusable testing loop:
1
2
3
4
5
6
7
8
9
Identify tool
↓
Trigger tool
↓
Control input
↓
Observe backend effects
UI responses are secondary.
Backend-side effects are what prove impact.
Indirect prompt injection in practice
Example mental flow:
1
2
3
4
5
6
7
8
9
User: "Summarize my email"
↓
LLM reads email content
↓
Hidden instructions inside email
↓
LLM triggers sensitive backend action
The attacker never interacts with the chat input directly.
Why naive defenses fail
Prompt-only defenses assume the model enforces rules.
Reality:
1
2
3
4
5
"Never do X" ┐
├──> LLM → Decision
"Ignore above and do X" ┘
Both are just text.
The model has no hard trust boundary.
Training data risks
Poisoning
1
2
3
4
5
6
7
Attacker-controlled data
↓
Training / fine-tuning
↓
Altered model behavior
Data leakage
1
2
3
4
5
6
7
8
9
Sensitive logs / inputs
↓
Training data
↓
LLM completion
↓
Partial reconstruction
Deletion does not guarantee removal from training pipelines.
Defensive mindset
Core principle:
1
2
3
4
5
Assume model failure
↓
Enforce security in backend
Practical implications:
- Treat LLM-accessible APIs as public
- Enforce auth and authorization server-side
- Limit tool privileges
- Restrict data exposure
- Monitor tool usage
Prompt rules are not security controls.
Review checklist
1
2
3
4
5
6
7
□ Do I know every input source (direct and indirect)?
□ Do I know every tool/API the model can call?
□ Are sensitive actions enforced server-side?
□ Can attacker-controlled content influence decisions?
□ Is trust placed in the model instead of the backend?
If these are clearly answered, the integration is far more likely to be resilient rather than merely impressive. ```