[](#pipsql-13b)pipSQL-1.3b
==========================

[pipableAi](https://www.linkedin.com/company/pipable.ai/about/)

[colab\_notebook](https://colab.research.google.com/drive/1insSxvc3jjAXe0zmdIjmbG3ttb5mpRgQ?usp=sharing)

[](#what-have-we-built)What have we built?
------------------------------------------

A 1.3 bn SQL model that outperforms most SQL expert models and chatgpt on popular benchmarks. This is a distilled model built on the deepseek base model. Please refer to [https://huggingface.co/PipableAI/pip-library-etl-1.3b](https://huggingface.co/PipableAI/pip-library-etl-1.3b) for our state of the art model.

[](#how-we-built-it)How we built it?
------------------------------------

We used softmax cross entropy and a modified form of policy grad along with Q loss, optimized in an EM set up. Loss behaviour in the set up mentioned above -

[![image/png](https://cdn-uploads.huggingface.co/production/uploads/658d8095a2a6a6e0da8bb8a6/I80Ru1r4thoYrLagIWALa.png)](https://cdn-uploads.huggingface.co/production/uploads/658d8095a2a6a6e0da8bb8a6/I80Ru1r4thoYrLagIWALa.png)

[](#benchmarking-)Benchmarking :
--------------------------------

For benchmarking purposes we are using Semantic Evaluation for Text-to-SQL with Distilled Test Suites, an officially accepted evaluation framework for Spider, SParC, and CoSQL which was proposed by a research team of Yale and Berkeley. The benchmark contains 2200 test data points Here is the link to run the evaluation:

[Test Suite SQL Eval](https://github.com/taoyds/test-suite-sql-eval)

model

easy

medium

hard

extra

sqlcoder-7b-2

72.0

58.0

40.6

37.3

pipSQL-1.3b

78.5

57.5

42.1

28.3

pipSQL-7b

63.0

40.0

30.2

25.0

sqlcoder-7b

60.6

48.2

28.3

20.4

gpt-3.5

58.8

44.7

31.0

28.4

We have also benchmarked it on defog eval. It contains 200 test data points handpicked by defog team. Here is the link to it:

[Defog SQL-Eval](https://github.com/defog-ai/sql-eval) These are the results -

[![image/png](https://cdn-uploads.huggingface.co/production/uploads/64d32c6b921678fdc9de3302/fFeLSEYBNpQk_JWjFsF5M.png)](https://cdn-uploads.huggingface.co/production/uploads/64d32c6b921678fdc9de3302/fFeLSEYBNpQk_JWjFsF5M.png)

[](#license)License
-------------------

The model is open source under apache 2.0. License

[](#usage)Usage
---------------

### [](#installation)Installation

    pip install transformers
    

### [](#prompt)Prompt

    prompt = f"""<schema>{schema}</schema>
    <question>{question}</question>
    <sql>"""
    

### [](#pytorch)PyTorch

    from transformers import AutoModelForCausalLM, AutoTokenizer
    device = "cuda"
    model = AutoModelForCausalLM.from_pretrained("PipableAI/pip-sql-1.3b")
    tokenizer = AutoTokenizer.from_pretrained("PipableAI/pip-sql-1.3b")
    
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=200)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True).split('<sql>')[1].split('</sql>')[0])
    

### [](#flax)Flax

    from transformers import FlaxAutoModelForCausalLM, AutoTokenizer
    device = "cuda"
    model = FlaxAutoModelForCausalLM.from_pretrained("PipableAI/pip-sql-1.3b",from_pt=True)
    tokenizer = AutoTokenizer.from_pretrained("PipableAI/pip-sql-1.3b")
    
    inputs = tokenizer(text, return_tensors="jax")
    outputs = model.generate(**inputs, max_new_tokens=200)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True).split('<sql>')[1].split('</sql>')[0])
    

[](#examples)Examples
---------------------

### [](#schema)Schema

    CREATE TABLE Products (
      product_id number,
      parent_product_id number,
      product_name text,
      product_price number,
      product_color text,
      product_size text,
      product_description text);
    
    CREATE TABLE Customers (
      customer_id number,
      gender_code text,
      customer_first_name text,
      customer_middle_initial text,
      customer_last_name text,
      email_address text,
      login_name text,
      login_password text,
      phone_number text,
      address_line_1 text,
      town_city text,
      county text,
      country text);
    
    CREATE TABLE Customer_Payment_Methods (
      customer_id number,
      payment_method_code text);
    
    CREATE TABLE Invoices (
      invoice_number number,
      invoice_status_code text,
      invoice_date time);
    
    CREATE TABLE Orders (
      order_id number,
      customer_id number,
      order_status_code text,
      date_order_placed time);
    
    CREATE TABLE Order_Items (
      order_item_id number,
      product_id number,
      order_id number,
      order_item_status_code text);
    
    CREATE TABLE Shipments (
      shipment_id number,
      order_id number,
      invoice_number number,
      shipment_tracking_number text,
      shipment_date time);
    
    CREATE TABLE Shipment_Items (
      shipment_id number,
      order_item_id number);
    

### [](#questions)Questions

What are the email address, town and county of the customers who are of the least common gender?

    SELECT email_address ,  town_city ,  county FROM customers GROUP BY gender_code ORDER BY count(*) ASC LIMIT 1
    

What are the product price and the product size of the products whose price is above average?

    SELECT product_price ,  product_size FROM products WHERE product_price  > (SELECT avg(product_price) FROM products)
    

Which customers did not make any orders? List the first name, middle initial and last name.

    SELECT T1.customer_first_name ,  T1.customer_middle_initial ,  T1.customer_last_name FROM Customers AS T1 WHERE T1.customer_id NOT IN (SELECT T2.customer_id FROM Orders AS T2)
    

### [](#team)Team

Avi Kothari, Pratham Gupta, Ritvik Aryan Kalra, Rohan Bhatial, Soham Acharya

## Model overview

The `pip-sql-1.3b` model, developed by [PipableAI](https://aimodels.fyi/creators/huggingFace/PipableAI), is a 1.3 billion parameter SQL model that outperforms most SQL expert models and even GPT-3.5 on popular benchmarks. It is a distilled version of the DeepSeek base model, trained using a combination of softmax cross entropy, modified policy gradient, and Q loss in an EM setup. This novel training approach has enabled the model to achieve exceptional performance on text-to-SQL tasks.

Compared to similar models like [distilbert-base-cased-distilled-squad](https://aimodels.fyi/models/huggingFace/distilbert-base-cased-distilled-squad-distilbert), [sqlcoder-70b-alpha](https://aimodels.fyi/models/huggingFace/sqlcoder-70b-alpha-defog), and [sqlcoder](https://aimodels.fyi/models/huggingFace/sqlcoder-defog), the `pip-sql-1.3b` model stands out for its significant performance improvements on SQL-related tasks. It leverages a unique training approach to deliver state-of-the-art results, making it a valuable tool for analysts and developers working with SQL databases.

## Model inputs and outputs

### Inputs
- **Schema**: The schema of the database that the SQL query will be executed against.
- **Question**: The natural language question that the model will attempt to translate into a SQL query.

### Outputs
- **SQL query**: The SQL query generated by the model based on the provided schema and question.

## Capabilities

The `pip-sql-1.3b` model excels at translating natural language questions into SQL queries. It outperforms most SQL expert models and even GPT-3.5 on popular benchmarks like Semantic Evaluation for Text-to-SQL with Distilled Test Suites and Defog SQL-Eval. For example, on the Semantic Evaluation benchmark, the `pip-sql-1.3b` model achieves an overall accuracy of 42.1% on the "hard" and "extra" difficulty questions, significantly higher than the 31% accuracy of GPT-3.5.

## What can I use it for?

The `pip-sql-1.3b` model can be a valuable tool for developers, analysts, and anyone working with SQL databases. It can be used to quickly generate SQL queries based on natural language questions, saving time and effort. This can be particularly useful for non-technical users who need to extract data from a database but are not proficient in SQL.

Additionally, the model's strong performance on SQL-related tasks makes it a compelling choice for building applications that require natural language processing capabilities for database interactions, such as chatbots, voice assistants, or data visualization tools.

## Things to try

One interesting aspect of the `pip-sql-1.3b` model is its use of a novel training approach that combines softmax cross entropy, modified policy gradient, and Q loss in an EM setup. This approach has enabled the model to achieve exceptional performance on text-to-SQL tasks, outperforming even much larger models like GPT-3.5.

Researchers and developers interested in advancing the state of the art in natural language processing for database interactions could explore ways to further refine or build upon this training approach. Additionally, testing the model's performance on a wider range of SQL-related tasks or evaluating its robustness to different types of database schemas and queries could provide valuable insights into its capabilities and limitations.