Natural language to SQL (NL2SQL) translation is crucial for democratizing database access, but even state-of-the-art (SOTA) models frequently generate semantically incorrect SQL queries, hindering the widespread adoption of these techniques by database vendors. While existing NL2SQL benchmarks focus primarily on correct query translation, we argue that a benchmark dedicated to identifying common errors in NL2SQL translations is equally important: accurately detecting these errors is a prerequisite for any subsequent correction, whether performed by humans or by models.
To address this gap, we propose NL2SQL-BUGs, the first benchmark dedicated to detecting and categorizing semantic errors in NL2SQL translation. NL2SQL-BUGs adopts a two-level taxonomy to systematically classify semantic errors, covering 9 main categories and 31 subcategories. The benchmark consists of 2,018 expert-annotated instances, each containing a natural language query, database schema, and SQL query, with detailed error annotations for semantically incorrect queries.
Through comprehensive experiments, we demonstrate that current large language models (LLMs) exhibit significant limitations in semantic error detection, achieving an average detection accuracy of only 75.16%. Even so, the models successfully detected 106 errors (6.91% of the development set) in the widely used BIRD dataset, all stemming from previously unrecognized annotation errors in that benchmark. This highlights the importance of semantic error detection in NL2SQL systems.
The classification of semantic errors in NL2SQL is based on the structure of SQL queries, common translation mistakes, and their impact on query semantics. This approach allows for systematic error identification at various stages of query generation, helping to pinpoint where and why translation mistakes occur.
Our taxonomy classifies semantic errors into 9 main categories and 31 subcategories.
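To make these categories concrete, below is a minimal, hypothetical illustration of a condition-related error; the schema and queries are invented for exposition and are not drawn from the benchmark. Both queries parse and execute, so the bug is purely semantic:

```python
import sqlite3

# Hypothetical condition-related semantic error (illustrative schema,
# not an actual NL2SQL-BUGs instance). Both queries run without error;
# only their semantics differ from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Ann", 50000), ("Bob", 62000), ("Eve", 48000)],
)

question = "How many employees earn more than 50,000?"

# Incorrect translation: >= silently includes the boundary value.
wrong = conn.execute(
    "SELECT COUNT(*) FROM employees WHERE salary >= 50000"
).fetchone()[0]

# Correct translation of the question.
right = conn.execute(
    "SELECT COUNT(*) FROM employees WHERE salary > 50000"
).fetchone()[0]

print(wrong, right)  # 2 1 -- syntactically fine, semantically divergent
```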
We collect 2,018 expert-annotated instances, each containing a natural language query, database schema, and SQL query. Among these instances, 1,019 are correct examples while 999 are incorrect examples with semantic errors. Each incorrect example is carefully annotated with detailed error types and explanations, providing a comprehensive resource for studying semantic errors in NL2SQL translation.
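For concreteness, one instance could be represented as sketched below; the field names and error labels are hypothetical stand-ins chosen for readability, not the released data format:

```python
# A hypothetical representation of a single NL2SQL-BUGs instance.
# Field names and category labels are illustrative only.
instance = {
    "question": "List the names of customers who placed more than 3 orders.",
    "schema": {
        "customers": ["id", "name"],
        "orders": ["id", "customer_id", "placed_at"],
    },
    "sql": (
        "SELECT c.name FROM customers c "
        "JOIN orders o ON c.id = o.customer_id "
        "GROUP BY c.name HAVING COUNT(*) > 3"
    ),
    "is_correct": False,
    # Present only for incorrect examples: taxonomy labels plus a
    # free-text explanation written by the expert annotator.
    "errors": [
        {
            "category": "attribute-related",            # illustrative label
            "subcategory": "wrong grouping attribute",   # illustrative label
            "explanation": "Grouping by name merges distinct customers "
                           "who share a name; the query should group by c.id.",
        }
    ],
}
```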
The statistics of NL2SQL-BUGs are shown in the following figure:
Through comprehensive experiments on NL2SQL-BUGs, we evaluate the semantic error detection capabilities of state-of-the-art large language models (LLMs). Our results reveal significant limitations in their performance, with an average detection accuracy of only 75.16%. Among the tested models, GPT-4o and Claude-3.5-Sonnet demonstrate the most balanced performance, showing consistent capability across various types of semantic errors. Interestingly, while Gemini-2.0-Flash achieves high positive recall, it struggles to control false positives, indicating a trade-off between sensitivity and precision in error detection.
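To pin down the metrics in this discussion, the sketch below computes detection accuracy, positive recall, and false-positive rate from binary error/no-error predictions; it follows the standard definitions and is our assumption, not the benchmark's official evaluation code:

```python
# Minimal metric sketch, assuming a binary task where label 1 means
# "this SQL contains a semantic error". Standard definitions only;
# not the official NL2SQL-BUGs evaluation harness.

def detection_metrics(gold: list[int], pred: list[int]) -> dict[str, float]:
    tp = sum(g == 1 and p == 1 for g, p in zip(gold, pred))
    tn = sum(g == 0 and p == 0 for g, p in zip(gold, pred))
    fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))
    return {
        "accuracy": (tp + tn) / len(gold),
        # High positive recall paired with a high false-positive rate is
        # the sensitivity/precision trade-off noted for Gemini-2.0-Flash.
        "positive_recall": tp / (tp + fn) if tp + fn else 0.0,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
    }

print(detection_metrics(gold=[1, 1, 0, 0], pred=[1, 1, 1, 0]))
```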
Our fine-grained analysis across error types reveals notable patterns in model performance. The models show particular strength in identifying certain error categories, especially condition-related errors and value errors. However, they consistently struggle with more complex error types, particularly subquery-related errors and those requiring deep database knowledge, such as join type mismatches and function-related errors. This suggests that current LLMs, despite their impressive capabilities, still lack sophisticated understanding of database operations and complex query structures.
After applying our automated semantic error detection framework and manually validating the LLMs' predictions, we discovered previously unidentified semantic errors in widely used NL2SQL benchmarks.
Specifically, 16 SQL queries in Spider (1.55% of the development set) and 106 in BIRD (6.91% of the development set) contained semantic errors that had not been previously identified. These findings demonstrate the utility of our semantic error detection approach and underscore the need for rigorous validation and systematic error detection when constructing NL2SQL benchmarks.
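As a rough sketch of what such automated detection can look like, the snippet below asks a chat model for a binary verdict on one (question, schema, SQL) triple via the openai client; the prompt wording and single-verdict design are our simplifications, not the exact framework from the paper, and any flagged instance still requires manual validation:

```python
from openai import OpenAI  # assumed client; any chat-capable LLM would do

client = OpenAI()

def flag_semantic_error(question: str, schema: str, sql: str) -> bool:
    """Return True if the model judges `sql` to mistranslate `question`.

    A simplified sketch: one prompt, one yes/no verdict. The actual
    detection pipeline and prompt design in the paper may differ.
    """
    prompt = (
        "Given the database schema and a natural-language question, "
        "decide whether the SQL query answers the question correctly.\n\n"
        f"Schema:\n{schema}\n\nQuestion: {question}\n\nSQL: {sql}\n\n"
        "Answer with exactly one word: CORRECT or INCORRECT."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return "INCORRECT" in resp.choices[0].message.content.upper()
```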
@misc{liu2025nl2sqlbugsbenchmarkdetectingsemantic,
      title={NL2SQL-BUGs: A Benchmark for Detecting Semantic Errors in NL2SQL Translation},
      author={Xinyu Liu and Shuyu Shen and Boyan Li and Nan Tang and Yuyu Luo},
      year={2025},
      eprint={2503.11984},
      archivePrefix={arXiv},
      primaryClass={cs.DB},
      url={https://arxiv.org/abs/2503.11984},
}