Bets | Subhadip Mitra

AI safety that cannot be operationalized is mostly theater.

The questions are real - alignment, deception, sandbagging. But if your safety approach can't be measured, tested, or deployed, it's philosophy, not engineering. When I built sandbagging detection probes, the point was to get a classifier that actually works - 90%+ accuracy across multiple model families. The discussions matter, but they need to eventually become code that runs in production.

Holding Horizon · 2028

Wrong if the safety work that measurably prevents real incidents turns out to be the untestable kind - norms, governance, interpretability-as-philosophy - while shipped classifiers and guards make little difference.

Memory bandwidth, not compute, is the real constraint in LLM deployment.

Most conversations about LLM infrastructure focus on GPU compute and FLOPS. But when you actually profile inference, the GPU is idle most of the time, waiting for model weights to stream in from memory. On an A100 at batch size 1, reading a 7B model's fp16 weights (~14GB) from HBM takes ~7ms per decoded token; the computation takes ~0.1ms. We're memory-bound by about two orders of magnitude. This changes which optimizations matter - speculative decoding works precisely because we're paying the memory tax anyway.

Playing out Horizon · 2027

Wrong if roofline analysis of mainstream decode workloads on next-gen accelerators lands in the compute-bound region at typical serving batch sizes.
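
The arithmetic behind this bet can be sketched as a back-of-envelope roofline. Everything here is an assumption rather than a measurement: fp16 weights at 2 bytes per parameter, roughly 2 FLOPs per parameter per token, and approximate A100 80GB peak figures (about 2.0 TB/s of HBM bandwidth and 312 TFLOP/s dense fp16). KV-cache traffic and attention FLOPs are ignored.

```python
# Back-of-envelope roofline for decode. Hardware numbers are assumed
# approximate peaks for an A100 80GB, not measured values.
PARAMS = 7e9            # 7B-parameter model
BYTES_PER_PARAM = 2     # fp16 weights (assumption)
HBM_BANDWIDTH = 2.0e12  # bytes/s, approximate peak
PEAK_FLOPS = 312e12     # FLOP/s, dense fp16 tensor-core peak


def decode_step_times(batch_size: int) -> tuple[float, float]:
    """Return (weight_read_s, compute_s) for one decode step.

    Weights are streamed once per step regardless of batch size, while
    compute grows linearly with the number of sequences in the batch.
    """
    weight_read_s = PARAMS * BYTES_PER_PARAM / HBM_BANDWIDTH
    compute_s = 2 * PARAMS * batch_size / PEAK_FLOPS
    return weight_read_s, compute_s


def crossover_batch_size() -> float:
    """Batch size at which compute time equals weight-read time."""
    return PEAK_FLOPS * BYTES_PER_PARAM / (2 * HBM_BANDWIDTH)


if __name__ == "__main__":
    for batch in (1, 8, 64, 256):
        mem_s, comp_s = decode_step_times(batch)
        bound = "memory" if mem_s > comp_s else "compute"
        print(f"batch={batch:>3}  read={mem_s * 1e3:.2f} ms  "
              f"compute={comp_s * 1e3:.3f} ms  -> {bound}-bound")
    print(f"crossover batch size ~ {crossover_batch_size():.0f}")
```

Under these assumptions the weight read dominates until the batch reaches the low hundreds, which is the region the "wrong if" condition is about. It is also why checking several speculated tokens in one pass costs little more than decoding one.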

The next wave of AI safety will be representation-level tooling, not RLHF.

RLHF was the first serious attempt at alignment, and it worked well enough to ship products. But it's a blunt instrument - you're shaping outputs without understanding what's happening inside the model. The next wave will be tools that operate on internal representations: probing for deception, steering activations, detecting when a model "knows" something it's not saying. My work on activation probing and steering vectors is a bet on this direction.

Playing out Horizon · 2028

Wrong if production alignment in 2028 is still dominated by preference optimization, and probes, steering, and circuit-level interventions remain research demos rather than deployed guards.
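
For readers who have not seen representation-level tooling, here is a toy sketch of the two primitives named above: a linear probe and a steering vector. It is not the method from my own work. The "hidden states" are synthetic 8-dimensional vectors, the concept is assumed to live along a single coordinate, and a difference-of-means probe stands in for whatever classifier a real system would train.

```python
import random

# Toy illustration only: synthetic "hidden states", not real activations.
DIM = 8


def mean(vectors):
    """Coordinate-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fit_probe(pos, neg):
    """Difference-of-means linear probe: returns (direction, bias)."""
    mu_pos, mu_neg = mean(pos), mean(neg)
    direction = [p - n for p, n in zip(mu_pos, mu_neg)]
    midpoint = [(p + n) / 2 for p, n in zip(mu_pos, mu_neg)]
    return direction, -dot(direction, midpoint)


def probe_score(hidden, direction, bias):
    """Positive score means the probe reads the concept as present."""
    return dot(hidden, direction) + bias


def steer(hidden, direction, alpha):
    """Add alpha times the concept direction to a hidden state."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


def make_states(rng, offset, n=200):
    # Assumption: the concept is a shift along the first coordinate.
    return [[rng.gauss(offset if i == 0 else 0.0, 1.0) for i in range(DIM)]
            for _ in range(n)]


rng = random.Random(0)
pos, neg = make_states(rng, +2.0), make_states(rng, -2.0)
direction, bias = fit_probe(pos, neg)

correct = (sum(probe_score(h, direction, bias) > 0 for h in pos)
           + sum(probe_score(h, direction, bias) <= 0 for h in neg))
accuracy = correct / (len(pos) + len(neg))
steered = steer(neg[0], direction, alpha=2.0)

print(f"probe accuracy on toy data: {accuracy:.2f}")
print(f"score before steering: {probe_score(neg[0], direction, bias):+.2f}")
print(f"score after steering:  {probe_score(steered, direction, bias):+.2f}")
```

The same direction serves both purposes: read along it to detect, add along it to intervene. Whether that holds up on real models at production scale is exactly what the bet is about.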

The player-coach model works better than pure management for deep-tech teams.

Conventional wisdom says senior leaders should step back from hands-on work and focus on strategy and people. I think this is wrong for technical organizations. When I stopped writing code, I made worse architectural decisions because I lost touch with real constraints. When I started again, I got better at calling bullshit on unrealistic timelines and knowing when to push versus back off.

Holding Horizon · Ongoing

Wrong if the strongest deep-tech orgs systematically separate leadership from hands-on work and consistently out-ship player-coach orgs.

Consent architecture will be as fundamental as security architecture.

Most consent systems today are legal checkboxes. But as AI systems train on user content, act as agents on users' behalf, and make decisions with user data, consent becomes an architectural concern - not a compliance one. You need to track provenance, enforce permissions at runtime, handle revocation gracefully. I've been working on this since OConsent in 2021, and more recently with LLMConsent.

Holding Horizon · 2030

Wrong if consent stays a compliance checkbox and no widely adopted runtime consent and provenance primitives emerge the way authentication and authorization stacks did.
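
To make "consent as architecture" concrete, here is a minimal sketch of the three requirements above: provenance tracking, runtime enforcement, and revocation. The class names, fields, and purposes are hypothetical and are not the OConsent or LLMConsent APIs.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ConsentRecord:
    """One grant from one subject for one purpose (hypothetical schema)."""
    subject_id: str
    purpose: str                      # e.g. "training", "agent_action"
    source: str                       # provenance: where the content came from
    granted_at: datetime
    revoked_at: Optional[datetime] = None

    @property
    def active(self) -> bool:
        return self.revoked_at is None


@dataclass
class ConsentRegistry:
    """In-memory registry; a real system would need durable, audited storage."""
    _records: dict = field(default_factory=dict)

    def grant(self, subject_id: str, purpose: str, source: str) -> ConsentRecord:
        record = ConsentRecord(subject_id, purpose, source,
                               datetime.now(timezone.utc))
        self._records[(subject_id, purpose)] = record
        return record

    def revoke(self, subject_id: str, purpose: str) -> None:
        record = self._records.get((subject_id, purpose))
        if record is not None and record.active:
            record.revoked_at = datetime.now(timezone.utc)

    def is_allowed(self, subject_id: str, purpose: str) -> bool:
        record = self._records.get((subject_id, purpose))
        return record is not None and record.active

    def require(self, subject_id: str, purpose: str) -> None:
        """Runtime gate: call this before using the subject's data."""
        if not self.is_allowed(subject_id, purpose):
            raise PermissionError(f"no active consent: {subject_id}/{purpose}")


registry = ConsentRegistry()
registry.grant("user-42", "training", source="forum-post-1001")
registry.require("user-42", "training")      # passes while consent is active
registry.revoke("user-42", "training")
print(registry.is_allowed("user-42", "training"))
```

The design point is that `require` sits on the data path, the way an authorization check does, instead of living in a policy document. Handling data already used before a revocation is the hard part and is not attempted here.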

The infrastructure around the model matters more than the model itself.

There's a narrative that the biggest model wins - that differentiation comes from having the best foundation model. I don't think this holds for most enterprise applications. The model is maybe 10% of the system. The other 90% is data pipelines, evaluation infrastructure, serving, monitoring, and operational glue. This is why I spend more time on evaluation frameworks and data processing than on model architecture.

Playing out Horizon · 2028

Wrong if enterprise differentiation comes to hinge mainly on raw model quality, with eval, serving, and data glue commoditized enough that swapping in the best model is the dominant lever.

LLM-augmented ETL is the next frontier of enterprise data engineering.

Traditional ETL is brittle - it breaks when schemas change, when source systems evolve, when edge cases multiply. LLMs can handle ambiguity, interpret intent, and adapt to variation in ways that rule-based systems can't. The ETLC framework I worked on is an early version of this: using language models not to replace pipelines, but to make them more robust and self-healing.

Holding Horizon · 2028

Wrong if LLM-in-the-loop pipelines stay niche and rule-based ETL still runs the majority of production data engineering.

Robots will be bottlenecked by data and motor inference pipelines, not reasoning.

The popular narrative is that once we solve AGI, robotics will follow. I think the constraint is elsewhere. Getting sensor data processed fast enough, running inference at the edge with tight latency budgets, coordinating motor control in real-time - these are harder engineering problems than high-level reasoning for most robotic applications.

Holding Horizon · 2030

Wrong if general-purpose robotics is unblocked mainly by better high-level reasoning, with perception-to-actuation latency and data no longer the binding constraint.

Multi-agent architectures will become the default for enterprise AI.

Single-model approaches hit limits when problems require different types of reasoning, access to different tools, or coordination across domains. The systems that actually work aren't single brilliant models; they're orchestrated collections of specialized capabilities. ARTEMIS and CatchMe are bets on this architecture.

Playing out Horizon · 2028

Wrong if single-model, long-context, tool-augmented systems outperform orchestrated multi-agent setups on most enterprise workloads and become the default pattern.

The verifier, not the model, will be the bottleneck in scaling reasoning to real-world domains.

RLVR (reinforcement learning with verifiable rewards) gave us reasoning models that feel like magic - for math and code. But it only works because those domains have automatic verification: math has definitive answers, code has test suites. The moment you try to extend this to legal analysis, medical reasoning, or business strategy, you hit a wall. You can't write a programmatic verifier for "was this M&A recommendation sound?" The research community is making progress with soft verifiers and LLM-as-judge approaches, but the gap between "works on competition math" and "works on your enterprise workflow" is enormous and underappreciated.

Playing out Horizon · 2028

Wrong if reasoning models generalize to soft domains - legal, medical, strategy - without better verification, for example self-verification that survives independent audit.
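
The asymmetry is easy to see in code. Below is a sketch of the kind of reward function that verifiable domains allow, next to the one nobody knows how to write. The function names and the `solve` entry-point convention are hypothetical, chosen for illustration.

```python
from fractions import Fraction


def verify_math(answer: str, expected: str) -> float:
    """Exact-match reward on a numeric answer: 1.0 if equal, else 0.0."""
    try:
        same = Fraction(answer.strip()) == Fraction(expected.strip())
    except (ValueError, ZeroDivisionError):
        return 0.0  # unparseable answers earn nothing
    return 1.0 if same else 0.0


def verify_code(source: str, tests: list) -> float:
    """Reward = fraction of (args, expected) cases passed by `solve`."""
    if not tests:
        return 0.0
    namespace = {}
    try:
        # Toy only: never exec untrusted code outside a sandbox.
        exec(source, namespace)
        solve = namespace["solve"]
    except Exception:
        return 0.0
    passed = 0
    for args, expected in tests:
        try:
            passed += solve(*args) == expected
        except Exception:
            pass  # a crashing test case counts as a failure
    return passed / len(tests)


def verify_strategy(recommendation: str) -> float:
    """There is no programmatic ground truth to compare against."""
    raise NotImplementedError("no automatic verifier for soft domains")


print(verify_math("0.5", "1/2"))
print(verify_code("def solve(a, b):\n    return a + b",
                  [((1, 2), 3), ((2, 2), 5)]))
```

The first two functions are a few lines each and cheap to run millions of times. The third is where soft verifiers and LLM-as-judge approaches have to earn their keep.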

Embedding space will be the next commercially contested territory.

Every era of the internet created a scarce resource and built a market around bidding for it. Web pages had links (PageRank). Search had keywords (AdWords). Social had attention (News Feed). The AI era's scarce resource is proximity in vector embedding space. Documents embedded close to high-value queries get retrieved, cited, and influence decisions. GEO and RAG poisoning are points on the same spectrum - one is marketing, the other is adversarial, but both exploit the same mechanism. There is no transparency, no regulation, and no honest auction for this real estate yet. That will change, and it will be a defining problem of the next decade.

Holding Horizon · 2030

Wrong if retrieval in embedding space never develops bidding or marketplace dynamics, and GEO and RAG poisoning stay fringe rather than becoming a governed ad market.
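
The mechanism that both GEO and RAG poisoning exploit fits in a few lines. This is a toy sketch with hand-picked 3-dimensional vectors and made-up document names; real embeddings have hundreds or thousands of dimensions, and "optimizing" a page is modeled here simply as moving its vector toward the query.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm


def top_k(query, docs, k=1):
    """Names of the k documents whose embeddings are closest to the query."""
    ranked = sorted(docs, key=lambda name: cosine(query, docs[name]),
                    reverse=True)
    return ranked[:k]


# Hypothetical embeddings for one high-value query and two documents.
query = [1.0, 0.2, 0.0]
docs = {
    "neutral_review": [0.8, 0.5, 0.1],
    "vendor_page": [0.1, 0.9, 0.4],
}

before = top_k(query, docs)

# The vendor rewrites its page so that it embeds closer to the query.
docs["vendor_page"] = [0.95, 0.25, 0.0]
after = top_k(query, docs)

print("retrieved before:", before)
print("retrieved after: ", after)
```

Nothing in the retriever distinguishes a better document from a better-positioned one. That gap is the territory this bet expects to be contested.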

Runtime interpretability is cheap enough to run in production.

The assumption is that activation probes are too expensive to serve on live traffic. On compute, the objection is wrong by roughly six orders of magnitude: a linear probe is a single dot product against a hidden state the forward pass has already materialized, negligible next to the matmuls around it. The real costs are engineering - where you tap activations, how you batch the reads, how you keep the probe off the critical serving path - not FLOPs. I built a harness to measure it end to end rather than argue it on paper.

Playing out Horizon · 2027

Wrong if a careful end-to-end benchmark shows always-on probing adds more than ~5-10% to p50 latency, or meaningfully cuts throughput, on a standard serving stack with no reasonable engineering path below that.
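
The "six orders of magnitude" figure can be checked on paper. The shapes below are assumptions for a 7B-class dense decoder (hidden size 4096, about 2 FLOPs per parameter per token) and do not come from the harness. The count covers FLOPs only, which is the part of the cost the claim is about.

```python
import math

# Assumed shapes for a 7B-class dense decoder (illustrative, not measured).
HIDDEN_SIZE = 4096
PARAMS = 7e9


def probe_flops(hidden_size: int, num_probes: int = 1) -> float:
    """One multiply and one add per hidden dimension, per probe, per token."""
    return 2 * hidden_size * num_probes


def forward_flops(params: float) -> float:
    """Roughly 2 FLOPs per parameter per token for a dense decoder."""
    return 2 * params


ratio = forward_flops(PARAMS) / probe_flops(HIDDEN_SIZE)
print(f"forward pass / single probe: {ratio:.2e} "
      f"(~{math.log10(ratio):.1f} orders of magnitude)")

# Even a probe on every one of 32 assumed layers stays far below 0.1%.
overhead = probe_flops(HIDDEN_SIZE, num_probes=32) / forward_flops(PARAMS)
print(f"32 probes as a share of forward-pass FLOPs: {overhead:.6%}")
```

This says nothing about latency, which is why the "wrong if" condition is stated in terms of an end-to-end benchmark and not a FLOP count.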

Last updated: July 2026
