- Samsung's TRUEBench holds chatbots to strict rules with no partial credit
- The benchmark uses 2,485 test sets across twelve languages to mimic real workloads
- Inputs range from short prompts to documents over twenty thousand characters
The adoption of AI tools at work has grown rapidly, raising concerns not only about automation but also about how these systems are judged.
Until now, most benchmarks have been narrow in scope, testing AI writers and chatbot systems with simple prompts that rarely resemble office work.
Samsung has entered this debate with TRUEBench, a new framework it says is designed to track whether AI models can handle tasks that resemble actual work.
Testing AI at work
TRUEBench, short for Trustworthy Real-world Usage Evaluation Benchmark, contains 2,485 test sets spread across ten categories and twelve languages.
Unlike conventional benchmarks that focus on single-turn English questions, it introduces longer, more complex tasks, such as multi-step summarization and translation across multiple languages.
Samsung says inputs vary from a handful of characters to over twenty thousand, an attempt to reflect both quick requests and long reports.
The company claims these test sets expose the limits of AI chatbot platforms when they face real-world conditions rather than classroom-style queries.
Each test has strict requirements: unless all specified conditions are met, the model fails. This makes the results more demanding and less forgiving than many existing benchmarks, which often credit partial answers.
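A minimal sketch of what such all-or-nothing scoring might look like. The condition checks and sample output below are invented for illustration; Samsung has not published TRUEBench's internal scoring code.

```python
# Hypothetical all-or-nothing scoring: a test passes only if EVERY
# specified condition holds -- there is no partial credit.
# These conditions are illustrative, not TRUEBench's actual criteria.

def passes_all(output: str, conditions: list) -> bool:
    """Return True only when every condition is satisfied."""
    return all(cond(output) for cond in conditions)

conditions = [
    lambda out: len(out.split()) <= 100,     # length limit
    lambda out: "summary:" in out.lower(),   # required structure
    lambda out: "lorem" not in out.lower(),  # forbidden filler text
]

output = "Summary: quarterly revenue rose 4% on strong device sales."
print(passes_all(output, conditions))
```

An answer that satisfies two of the three conditions still scores zero under this scheme, which is what makes the benchmark less forgiving than ones that award partial credit.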
“Samsung Research brings deep expertise and a competitive advantage through its real AI experience,” said Paul (Kyungwhoon) Cheun, CTO for the DX Division at Samsung Electronics and head of Samsung Research.
“We expect TRUEBench to set evaluation standards for productivity and solidify Samsung’s technological leadership.”
Samsung Research outlines a process in which humans and AI collaborate to design the evaluation criteria.
Human annotators first set the conditions, then AI reviews them to detect contradictions or unnecessary restrictions.
The criteria are repeatedly refined until they are consistent and precise.
Automated scoring is then applied to the AI models, minimizing subjective judgment and making comparisons more transparent.
One unusual aspect of TRUEBench is its publication on Hugging Face, where a leaderboard allows direct comparison of up to five models.
In addition to performance scores, Samsung also publishes average response length, a metric that helps weigh efficiency alongside accuracy.
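As a rough illustration, pass rate and average response length could be reported side by side like this. The result records and field names here are invented, not Samsung's actual leaderboard schema.

```python
# Hypothetical reporting of accuracy alongside efficiency: the pass rate
# (strict all-or-nothing results) paired with average response length.

results = [
    {"passed": True,  "response": "Short, correct answer."},
    {"passed": False, "response": "A much longer answer that still misses one required condition."},
    {"passed": True,  "response": "Another concise pass."},
]

pass_rate = sum(r["passed"] for r in results) / len(results)
avg_len = sum(len(r["response"]) for r in results) / len(results)

print(f"pass rate: {pass_rate:.0%}, avg response length: {avg_len:.0f} chars")
```

Pairing the two numbers lets readers spot models that pass tests only by producing very long answers, since verbosity shows up directly in the length column.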
The decision to open parts of the system suggests a push for credibility, although it also exposes Samsung’s approach to scrutiny.
Since the advent of AI, many workers have been wondering how productivity will be measured when AI systems are given similar responsibilities.
With TRUEBench, managers may have a way to judge whether an AI chatbot can replace or supplement staff.
Despite its ambitions, benchmarks, however broad, are still synthetic measures and cannot fully capture the messiness of workplace communication and decision-making.
TRUEBench may set higher standards for evaluation, but whether it eases fears of job displacement or simply sharpens them remains an open question.