- Robots still fail quickly once removed from predictable factory environments
- Microsoft Rho-alpha connects language understanding directly to robotic motion control
- Tactile sensing is central to narrowing the gap between software and physical action
Robots have long performed reliably in tightly controlled industrial settings, where environments are predictable and deviations are limited, but outside those settings they often struggle.
To address this problem, Microsoft has announced Rho-alpha, the first robot model derived from its Phi vision-language series, arguing that robots need better ways to see and understand instructions.
The company believes such systems can work beyond the assembly line by responding to changing conditions rather than following rigid scripts.
What Rho-alpha is designed to do
Microsoft links this to what is widely called physical AI, where software models are expected to guide machines through less structured situations.
Rho-alpha combines language, perception and action, reducing its reliance on fixed production lines and scripted instructions.
Rho-alpha translates natural language commands into robot control signals, and it focuses on bimanual manipulation tasks that require coordination between two robotic arms and fine-grained control.
Microsoft characterizes the system as going beyond typical VLA approaches by broadening both its perception and its learning inputs.
“The emergence of vision-language-action (VLA) models for physical systems enables systems to perceive, reason and act with increasing autonomy alongside humans in environments that are far less structured,” said Ashley Llorens, Corporate Vice President and CEO, Microsoft Research Accelerator.
Rho-alpha includes tactile sensing alongside vision, with additional sensory modalities such as force sensing still in development.
These design choices suggest an attempt to narrow the gap between simulated intelligence and physical interaction, although their effectiveness remains under evaluation.
A key part of Microsoft’s approach relies on simulation to compensate for the scarcity of robotic training data, especially data involving touch.
Synthetic trajectories are generated through reinforcement learning in Nvidia Isaac Sim, then combined with physical demonstrations from commercial and open datasets.
“Training foundational models that can reason and act requires overcoming the lack of diverse real-world data,” said Deepu Talla, Vice President of Robotics and Edge AI, Nvidia.
“By leveraging NVIDIA Isaac Sim on Azure to generate physically accurate synthetic datasets, Microsoft Research is accelerating the development of versatile models like Rho-alpha that can master complex manipulation tasks.”
Microsoft also emphasizes human corrective input during deployment, allowing operators to intervene using teleoperation devices and provide feedback that the system can learn from over time.
This training loop mixes simulation, real-world data and human correction, reflecting a growing reliance on synthetic data to compensate for scarce real-world robot datasets.
Abhishek Gupta, assistant professor at the University of Washington, said: “While generating training data using teleoperated robotic systems has become standard practice, there are many settings where teleoperation is impractical or impossible.”
“We are working with Microsoft Research to enrich pre-training datasets collected from physical robots with various synthetic demonstrations using a combination of simulation and reinforcement learning.”