Modern AI systems don’t just rely on static datasets—they depend on continuous streams of real-time data to train, update, and make decisions. But what happens when that data can’t be trusted?
In this talk, we explore how streaming data pipelines—often built on systems like Apache Kafka—are becoming a critical and undersecured attack vector for AI-driven applications.
Rather than targeting models directly, attackers can manipulate the data flowing into them. By injecting, modifying, or replaying events in real-time streams, adversaries can:
- Poison training data and degrade model accuracy over time
- Manipulate real-time features used in fraud detection or recommendation systems
- Trigger unintended behaviors in downstream AI systems
- Quietly influence decisions without ever touching the model itself
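As a toy illustration of the second bullet (this sketch is ours, not material from the talk): an attacker who can write to a stream never touches the model, yet a handful of fabricated events is enough to shift a rolling feature that a fraud-detection model consumes. All names and values below are hypothetical.

```python
from collections import deque

class RollingFeature:
    """Rolling mean of transaction amounts, as a downstream model might consume it."""
    def __init__(self, window=50):
        self.buf = deque(maxlen=window)

    def update(self, amount):
        self.buf.append(amount)
        return sum(self.buf) / len(self.buf)

feature = RollingFeature(window=50)

# Legitimate traffic: small transactions around $20.
for _ in range(50):
    baseline = feature.update(20.0)

# Injected events: a few large fabricated transactions pushed into the
# stream. The model is untouched; only the data feeding it has changed.
for _ in range(5):
    poisoned = feature.update(5000.0)

print(f"baseline mean:   {baseline:.2f}")  # 20.00
print(f"after injection: {poisoned:.2f}")  # 518.00
```

Five events out of a fifty-event window move the feature by more than an order of magnitude, which is why stream write access is itself a model-integrity issue.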
We’ll examine how these attacks work in practice, from subtle data drift manipulation to targeted event injection, and why they are difficult to detect using traditional security tools.
The talk will break down the weak points in modern data pipelines:
- Lack of validation and trust boundaries in event streams
- Over-reliance on infrastructure-level security (encryption, ACLs)
- Blind spots in monitoring data integrity and semantic correctness
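The missing trust boundary in the first and third bullets can be sketched as a data-level gate that checks both schema and semantic plausibility before an event reaches the model, rather than trusting encryption and ACLs alone. The schema and ranges below are illustrative assumptions, not a real production contract.

```python
# Hypothetical event contract: required fields and their types.
REQUIRED_FIELDS = {"event_id": str, "user_id": str, "amount": float}

def validate_event(event: dict) -> list:
    """Return a list of violations; an empty list means the event passes."""
    errors = []
    # Schema check: required fields present with the right types.
    for field, ftype in REQUIRED_FIELDS.items():
        if field not in event:
            errors.append(f"missing field: {field}")
        elif not isinstance(event[field], ftype):
            errors.append(f"bad type for {field}")
    # Semantic check: a well-typed value can still be implausible.
    amount = event.get("amount")
    if isinstance(amount, float) and not (0.0 < amount < 10_000.0):
        errors.append("amount out of plausible range")
    return errors

ok = validate_event({"event_id": "e1", "user_id": "u1", "amount": 25.0})
bad = validate_event({"event_id": "e2", "user_id": "u2", "amount": 250_000.0})
print(ok)   # []
print(bad)  # ['amount out of plausible range']
```

The point of the semantic check is the monitoring blind spot above: infrastructure security and even schema validation both pass the second event; only a plausibility rule catches it.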
We’ll also explore how these risks evolve in systems that continuously retrain or adapt, where corrupted data doesn’t just affect a single decision—but becomes embedded in the model itself.
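A minimal sketch of that persistence effect, using a toy online learner of our own (not the talk's example): a burst of poisoned events skews the learned estimate, and the skew lingers long after the stream returns to clean data.

```python
class OnlineModel:
    """Toy online learner: exponentially weighted estimate of a signal."""
    def __init__(self, lr=0.05):
        self.lr = lr
        self.estimate = 0.0

    def update(self, x):
        # Standard exponential-moving-average update.
        self.estimate += self.lr * (x - self.estimate)

model = OnlineModel(lr=0.05)

for _ in range(200):
    model.update(10.0)        # clean phase: estimate converges near 10
clean = model.estimate

for _ in range(50):
    model.update(100.0)       # poisoning phase: attacker skews the stream
for _ in range(20):
    model.update(10.0)        # poison stops, clean data resumes
lingering = model.estimate    # still far from 10: corruption is embedded
```

Because each update folds the poisoned values into the model state, stopping the attack does not undo it; recovery requires rolling the model back, not just cleaning the stream.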
Finally, we’ll discuss defensive strategies that go beyond securing infrastructure: treating data as an attack surface, implementing validation and anomaly detection at the data level, and designing pipelines that can detect and recover from adversarial inputs.
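One of those data-level defenses can be sketched as a rolling z-score guard (an illustrative construction, with hypothetical thresholds): events that deviate sharply from the recent stream are quarantined for review instead of being fed to the model.

```python
import statistics
from collections import deque

class StreamGuard:
    """Quarantine events that deviate sharply from the recent stream."""
    def __init__(self, window=50, threshold=4.0):
        self.history = deque(maxlen=window)
        self.threshold = threshold
        self.quarantine = []

    def admit(self, value):
        """Return True if the event may reach the model, False if held."""
        if len(self.history) >= 10:
            mean = statistics.fmean(self.history)
            stdev = statistics.stdev(self.history) or 1e-9
            if abs(value - mean) / stdev > self.threshold:
                self.quarantine.append(value)  # hold for review, don't train on it
                return False
        self.history.append(value)
        return True

guard = StreamGuard()
# Normal traffic: small transactions with mild natural variation.
accepted = sum(guard.admit(20.0 + 0.1 * (i % 5)) for i in range(100))
# An injected outlier is held at the data layer before any model sees it.
spike_admitted = guard.admit(5000.0)
```

A guard like this complements, rather than replaces, infrastructure controls: ACLs decide who may write to the topic, while the guard decides whether what was written is believable.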
This talk offers a new perspective on AI security: not the models themselves, but the data pipelines that feed them, where some of the most impactful and least visible attacks can occur.
This talk was presented at AI Coding Summit London.