Sonar announces new solution to optimize training datasets for coding LLMs

Sonar, a company that specializes in code quality, today announced a new solution designed to improve how LLMs are trained for coding tasks.

According to the company, LLMs used for software development are often trained on publicly available open-source code containing security issues and bugs, which are then amplified during training. “Even a small amount of flawed data can degrade models of any size, disproportionately degrading their output,” Sonar wrote in its announcement.

SonarSweep (now in early access) aims to mitigate those issues by ensuring that models learn from high-quality, secure examples.

It works by identifying and fixing code quality and security issues in the training data itself. After analyzing a dataset, it applies a strict filtering process to remove low-quality code, then rebalances the cleaned dataset so it remains diverse and representative.
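Sonar has not published how SonarSweep analyzes, filters, or rebalances data, so the Python sketch below is purely illustrative: the ISSUE_PATTERNS table, analyze(), sweep(), and the max_per_language cap are invented stand-ins, with simple regex heuristics in place of a real static analysis engine.

```python
# Hypothetical sketch of a filter-then-rebalance sweep over a code dataset.
# The real SonarSweep pipeline is not public; this only shows the shape.
import re
from collections import Counter

# Toy quality checks; a real pipeline would run a full static analyzer.
ISSUE_PATTERNS = {
    "hardcoded_secret": re.compile(r"(password|api_key)\s*=\s*['\"]"),
    "bare_except": re.compile(r"except\s*:"),
    "eval_call": re.compile(r"\beval\("),
}

def analyze(sample: str) -> list[str]:
    """Return the issue labels detected in one code sample."""
    return [name for name, pat in ISSUE_PATTERNS.items() if pat.search(sample)]

def sweep(dataset: list[dict], max_per_language: int) -> list[dict]:
    """Drop samples with detected issues, then cap each language's share
    so the cleaned dataset stays diverse and representative."""
    clean = [s for s in dataset if not analyze(s["code"])]
    kept, counts = [], Counter()
    for sample in clean:
        lang = sample["language"]
        if counts[lang] < max_per_language:
            counts[lang] += 1
            kept.append(sample)
    return kept

dataset = [
    {"language": "python", "code": "def add(a, b):\n    return a + b\n"},
    {"language": "python", "code": "password = 'hunter2'\n"},
    {"language": "java",   "code": "int add(int a, int b) { return a + b; }"},
]
print(len(sweep(dataset, max_per_language=1000)))  # -> 2 (the secret is dropped)
```

In practice the analysis step would use a per-language static analyzer and the rebalancing would weigh more dimensions than a per-language cap (task type, license, code style), but the filter-then-rebalance structure is the same.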

Some potential use cases for SonarSweep include improving foundation model pretraining and post-training, using reinforcement learning with swept data to improve existing models, and creating Small Language Models (SLMs) using distillation techniques.
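The announcement does not describe how swept data would feed these workflows, but the distillation route follows a well-established pattern. Below is a minimal, hypothetical PyTorch sketch of a standard soft-target distillation loss (Hinton et al.); the function name, temperature value, and tensor shapes are illustrative assumptions, and in a real run the teacher and student logits would be computed over swept code samples.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-target distillation: KL divergence between temperature-softened
    teacher and student next-token distributions."""
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # batchmean KL, scaled by T^2 to keep gradient magnitudes comparable
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * t * t

# Toy usage: a batch of 4 token positions over a 10-token vocabulary.
student = torch.randn(4, 10, requires_grad=True)
teacher = torch.randn(4, 10)
loss = distillation_loss(student, teacher)
loss.backward()
```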

Initial testing of models trained using SonarSweep found that the models generated code with 67% fewer security vulnerabilities and 42% fewer bugs than models trained on unswept data.

“The best way to boost software development productivity, reduce risks, and improve security is to tackle the problem at inception—inside the models themselves,” said Tariq Shaukat, CEO of Sonar. “Vibe engineering leveraging models enhanced through SonarSweep will have fewer issues in production, reducing the burden on developers and enterprises. Combined with strong verification practices, we believe this will substantially remove a major bottleneck in AI software development.”
