**Natural-SQL-7B by ChatDB**
=========================================================

Natural-SQL-7B is a Text-to-SQL model with very strong performance: it understands complex questions well and outperforms models of the same size in its space.
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

![](https://cdn-uploads.huggingface.co/production/uploads/648a374f00f7a3374ee64b99/hafdsfrFCqrVbATIzV_EN.png)

[ChatDB.ai](https://chatdb.ai) | [Notebook](https://github.com/cfahlgren1/natural-sql/blob/main/natural-sql-7b.ipynb) | [Twitter](https://twitter.com/calebfahlgren)

**Benchmarks**
=============================

### _Results on novel datasets (not seen in training), evaluated via SQL-Eval_

![](https://cdn-uploads.huggingface.co/production/uploads/648a374f00f7a3374ee64b99/5ynfoKPzI3_-WasQQt7qR.png)

_Big thanks to the [defog](https://huggingface.co/defog) team for open sourcing [sql-eval](https://github.com/defog-ai/sql-eval)_

Natural-SQL can also handle complex, compound questions that other models typically struggle with. A more detailed write-up, including a small test, is available [here](https://chatdb.ai/post/naturalsql-vs-sqlcoder-for-text-to-sql).

Usage
===============

Make sure you have the correct version of the transformers library installed:

    pip install transformers==4.35.2
    

### Loading the Model

Use the following Python code to load the model:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained("chatdb/natural-sql-7b")
    model = AutoModelForCausalLM.from_pretrained(
        "chatdb/natural-sql-7b",
        device_map="auto",
        torch_dtype=torch.float16,
    )
    

### **License**

The model weights are licensed under `CC BY-SA 4.0`, with extra guidelines for responsible use expanded from the original model's [Deepseek](https://github.com/deepseek-ai/deepseek-coder/blob/main/LICENSE-MODEL) license. You're free to use and adapt the model, even commercially. If you alter the weights, such as through fine-tuning, you must publicly share your changes under the same `CC BY-SA 4.0` license.

### Generating SQL

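    # `prompt` is the filled-in prompt template (see "Prompt Template" below).
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    generated_ids = model.generate(
        **inputs,
        num_return_sequences=1,
        eos_token_id=100001,
        pad_token_id=100001,
        max_new_tokens=400,
        do_sample=False,  # greedy decoding, so output is deterministic
        num_beams=1,
    )
    
    # The decoded text includes the prompt, so keep only what follows the last ```sql marker.
    outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
    print(outputs[0].split("```sql")[-1])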
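
The `split` call above keeps everything after the last opening `sql` fence marker, which may still carry a closing fence or stray whitespace if the model emits them. A minimal sketch of a slightly stricter extraction step follows. `extract_sql` is a hypothetical helper name, not part of the model's tooling, and it assumes the decoded text has the shape "prompt, then generated SQL".

```python
FENCE = "`" * 3  # three backticks, built this way so the example stays fence-safe


def extract_sql(decoded: str) -> str:
    """Return the SQL that follows the last opening sql fence in `decoded`."""
    # Keep only the text after the last opening marker (same idea as the split above).
    tail = decoded.split(FENCE + "sql")[-1]
    # Drop a closing fence and anything after it, if one is present.
    sql = tail.split(FENCE)[0]
    return sql.strip()


example = "Here is the SQL query:\n" + FENCE + "sql\nSELECT 1;\n" + FENCE
print(extract_sql(example))
```

If no marker is present at all, the helper simply returns the stripped input, so callers may still want to validate the result before running it.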
    

Prompt Template
===================================

    # Task 
    Generate a SQL query to answer the following question: `{natural language question}`
    
    ### PostgreSQL Database Schema 
    The query will run on a database with the following schema: 
    
    <SQL Table DDL Statements>
    
    # SQL 
    Here is the SQL query that answers the question: `{natural language question}` 
    ```sql
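
The question appears twice in the template, so filling it in by hand is easy to get wrong. The sketch below fills both slots from one argument. `build_prompt`, `PROMPT_TEMPLATE`, and the parameter names are hypothetical, and the exact whitespace is an assumption based on the template above. The prompt ends with three backticks followed by `sql`, which is the marker the generation code splits on.

```python
FENCE = "`" * 3  # three backticks

# Assumed rendering of the prompt template; whitespace details are a guess.
PROMPT_TEMPLATE = (
    "# Task\n"
    "Generate a SQL query to answer the following question: `{question}`\n"
    "\n"
    "### PostgreSQL Database Schema\n"
    "The query will run on a database with the following schema:\n"
    "\n"
    "{schema}\n"
    "\n"
    "# SQL\n"
    "Here is the SQL query that answers the question: `{question}`\n"
    + FENCE + "sql\n"
)


def build_prompt(question: str, schema: str) -> str:
    """Fill the template with a question and the table DDL statements."""
    return PROMPT_TEMPLATE.format(question=question.strip(), schema=schema.strip())


schema = "CREATE TABLE users (user_id SERIAL PRIMARY KEY, created_at TIMESTAMP);"
prompt = build_prompt("Show me the day with the most users joining", schema)
print(prompt)
```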
    

Example SQL Output
=========================================

### Example Schemas

    CREATE TABLE users (
        user_id SERIAL PRIMARY KEY,
        username VARCHAR(50) NOT NULL,
        email VARCHAR(100) NOT NULL,
        password_hash TEXT NOT NULL,
        created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
    );
    CREATE TABLE projects (
        project_id SERIAL PRIMARY KEY,
        project_name VARCHAR(100) NOT NULL,
        description TEXT,
        start_date DATE,
        end_date DATE,
        owner_id INTEGER REFERENCES users(user_id)
    );
    CREATE TABLE tasks (
        task_id SERIAL PRIMARY KEY,
        task_name VARCHAR(100) NOT NULL,
        description TEXT,
        due_date DATE,
        status VARCHAR(50),
        project_id INTEGER REFERENCES projects(project_id)
    );
    CREATE TABLE taskassignments (
        assignment_id SERIAL PRIMARY KEY,
        task_id INTEGER REFERENCES tasks(task_id),
        user_id INTEGER REFERENCES users(user_id),
        assigned_date DATE NOT NULL DEFAULT CURRENT_TIMESTAMP
    );
    CREATE TABLE comments (
        comment_id SERIAL PRIMARY KEY,
        content TEXT NOT NULL,
        created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
        task_id INTEGER REFERENCES tasks(task_id),
        user_id INTEGER REFERENCES users(user_id)
    );
    

### Example SQL Outputs

**Question**: **Show me the day with the most users joining**

    SELECT created_at::DATE AS day, COUNT(*) AS user_count
    FROM users
    GROUP BY day
    ORDER BY user_count DESC
    LIMIT 1;
    

**Question**: **Show me the project that has a task with the most comments**

    SELECT p.project_name, t.task_name, COUNT(c.comment_id) AS comment_count
    FROM projects p
    JOIN tasks t ON p.project_id = t.project_id
    JOIN comments c ON t.task_id = c.task_id
    GROUP BY p.project_name, t.task_name
    ORDER BY comment_count DESC
    LIMIT 1;
    

**Question**: **What is the ratio of users with gmail addresses vs without?**

    SELECT 
        SUM(CASE WHEN email ILIKE '%@gmail.com%' THEN 1 ELSE 0 END)::FLOAT / NULLIF(SUM(CASE WHEN email NOT ILIKE '%@gmail.com%' THEN 1 ELSE 0 END), 0) AS gmail_ratio
    FROM 
        users;
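
The `NULLIF(..., 0)` in the query above guards against division by zero: when no user has a non-Gmail address, the denominator becomes `NULL` and so does the ratio. A small Python sketch of the same arithmetic can help when sanity-checking the result. `gmail_ratio` is a hypothetical helper and the email list is made-up sample data.

```python
def gmail_ratio(emails):
    """Mirror the query: Gmail count divided by non-Gmail count, or None."""
    # ILIKE '%@gmail.com%' is a case-insensitive substring match.
    gmail = sum(1 for e in emails if "@gmail.com" in e.lower())
    other = len(emails) - gmail
    # NULLIF(other, 0) turns a zero denominator into NULL; None plays that role here.
    return gmail / other if other else None


sample = ["a@gmail.com", "b@GMAIL.com", "c@example.org"]
print(gmail_ratio(sample))
```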

## Model overview

The `natural-sql-7b` model by ChatDB is a text-to-SQL generation model that is reported to outperform other models of similar size on SQL-Eval. It handles complex, compound questions that other models struggle with. Because it converts natural language instructions into SQL queries, it lets non-technical users interact with databases.

Similar models include [pipSQL-1.3b](https://aimodels.fyi/models/huggingFace/pip-sql-13b-pipableai) by PipableAi, which also focuses on text-to-SQL generation, and the [SQLCoder](https://aimodels.fyi/models/huggingFace/sqlcoder-defog) and [SQLCoder2](https://aimodels.fyi/models/huggingFace/sqlcoder2-defog) models developed by Defog, which are state-of-the-art large language models for natural language to SQL conversion.

## Model inputs and outputs

### Inputs
- **Natural language instructions**: The model takes a natural language question or instruction, together with the database schema given as table DDL statements in the prompt.

### Outputs
- **SQL queries**: The model generates a SQL query (the prompt template targets PostgreSQL) that answers the natural language input.

## Capabilities

The `natural-sql-7b` model can handle complex, compound questions that often trip up other models. For example, it can generate a SQL query that compares total revenue from customers in New York with total revenue from customers in San Francisco, including the difference between the two.

## What can I use it for?

The `natural-sql-7b` model can be used in a variety of applications, such as:

- **Business intelligence and data analysis**: Users can ask natural language questions about the data in their database and get the corresponding SQL queries, allowing them to quickly generate insights without needing to learn SQL.
- **Customer support**: The model can be used to build chatbots that can help customers find information in a database by understanding their natural language requests.
- **Productivity tools**: The model can be integrated into productivity software, allowing users to quickly generate SQL queries to extract the data they need.

## Things to try

One interesting aspect of the `natural-sql-7b` model is its ability to handle complex, compound questions. Try asking the model questions that involve multiple steps or conditions, such as "Find the top 3 best-selling products by revenue, but only for products with a price above the average product price." The model should be able to generate the appropriate SQL query to answer this type of complex question.
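
To judge whether the model's answer to a compound question like this one is right, it helps to have a known-good result to compare against. The sketch below runs a hand-written reference query for that question on a tiny in-memory SQLite database. The `products` and `sales` tables, their rows, and the use of SQLite rather than PostgreSQL are all assumptions for illustration, and the query is not model output.

```python
import sqlite3

# Hypothetical schema and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (product_id INTEGER PRIMARY KEY, name TEXT, price REAL);
CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, product_id INTEGER, quantity INTEGER);
INSERT INTO products VALUES (1, 'A', 10.0), (2, 'B', 20.0), (3, 'C', 30.0), (4, 'D', 40.0);
INSERT INTO sales (product_id, quantity) VALUES (1, 100), (2, 1), (3, 5), (4, 2), (3, 1);
""")

# Hand-written reference query: top 3 products by revenue,
# restricted to products priced above the average product price.
REFERENCE_SQL = """
SELECT p.name, SUM(s.quantity * p.price) AS revenue
FROM products p
JOIN sales s ON s.product_id = p.product_id
WHERE p.price > (SELECT AVG(price) FROM products)
GROUP BY p.product_id, p.name
ORDER BY revenue DESC
LIMIT 3;
"""

rows = conn.execute(REFERENCE_SQL).fetchall()
print(rows)
```

Product A sells the most units here but is priced below the average, so a correct query must exclude it. Running the model's generated SQL against the same data and comparing the rows gives a quick correctness check.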

Another interesting thing to try is fine-tuning the model on a specific database schema or domain. By training the model on data more closely related to the task at hand, you may be able to further improve its performance and tailor it to your specific needs.