Truffle Security found thousands of pieces of private info in plain sight
The archives are used to train some of the biggest LLMs today
The researchers notified the affected vendors and helped fix the problem
Cybersecurity researchers have found thousands of login credentials and other secrets in the Common Crawl dataset.
Common Crawl is a nonprofit organization that provides a freely available archive of web data, collected through large-scale web crawling. According to recent estimates, the organization hosts over 250 petabytes of web data, with monthly crawls adding further petabytes.