Name: AI in the Real World, Not the Demo
Uploaded: 2026-03-12T20:56:25.000Z
Duration: 2156
Description: What does it actually look like to stress-test AI in the real world - not in a lab, not in a simulation, but in live environments where things go wrong in ways...

What does it actually look like to stress-test AI in the real world - not in a lab, not in a simulation, but in live environments where things go wrong in ways nobody predicted?

Callum Sharrock knows. As a Member of Technical Staff at Andon Labs, he builds frontier safety evaluations for AI systems, finding failure modes before they become everyone else's problem. He came to this work through robotics - building cleaning robots for Tesla's Robotaxi program, working on camera systems for Autopilot and Optimus, programming robots to paint in real acrylic on canvas, and teaching them to detect welding defects in Berlin.

In this episode, Peter Maddison and Dave Sharrock sit down with Callum to talk about where AI and robotics are actually heading, why reliability is a harder challenge than capability, and how the role of developers is shifting in ways most organizations haven't caught up with yet.

In this episode:
Callum's journey from hands-on robotics to AI safety evaluations
Why people keep getting AI wrong by looking at snapshots instead of trend lines
Why robotics is now fundamentally a software problem - and what that means for how fast things move
How AI is shifting developers away from writing syntax toward validating outcomes and architectural thinking
Why reliability, not capability, is the real bottleneck for real-world AI deployment
What high agency and rapid execution actually look like on modern technical teams

This Week's Takeaways:
Deciding what to build is the new bottleneck - the technical barriers have never been lower, so prioritization and agency matter more than ever
Traditional leadership focused on control and setting strategy is becoming less relevant - the new model is enabling fast decisions and rapid execution
Stop gathering teams for two-hour meetings to debate whether an idea will work. Spend those two hours building, deploying, and testing it instead

0:00 Welcome and guest introduction
1:01 Leaving Tesla for AI safety
2:32 Robotics roots and hands-on builds
4:45 Real-world AI evaluations explained
6:25 Seeing exponential progress clearly
10:47 Will AI replace developers
16:23 Validation, testing, and better requirements
19:43 Reliability as the adoption bottleneck
22:47 Society, law, and safer deployment paths
24:55 When deterministic beats AI
28:40 Skills that matter in AI teams
32:48 Three takeaways and closing

Follow Definitely Maybe Agile wherever you listen to podcasts and visit us at definitelymaybeagile.com

AI Safety, Frontier AI, Real-World AI Deployment, AI Evaluations, Andon Labs