Ideas about to_sql? #213
Replies: 8 comments
-
Then inserting into SQL using:
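A hypothetical sketch of such a call (the exact snippet isn't shown above; the connection string, table name, and parameters are assumptions):

```python
# Hypothetical reconstruction of the slow path being described:
# pandas.to_sql over a SQLAlchemy engine. Connection string, table
# name, and chunking parameters are illustrative assumptions.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://user:pass@host:5432/db")

df = pd.DataFrame({"id": range(400_000), "value": 1.0})

# Multi-row INSERT statements are what make this slow at ~400k rows;
# chunksize and method="multi" help somewhat but don't close the gap.
df.to_sql("my_table", engine, if_exists="append", index=False,
          chunksize=10_000, method="multi")
```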
The issue is that my data has around 400,000 rows and I have to run the process daily. On my local machine it takes around 300 seconds to push that data; even using a scalable solution, it takes around 180 seconds. That is really slow, especially given how massive the difference is between the read_sql implementations of pandas and connectorx.
-
A lot of databases like Redshift/Snowflake support COPY from object storage (like S3), which is generally the fastest way to write data in bulk.
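A minimal sketch of that pattern, assuming Redshift as the target; the bucket, table, and IAM role below are placeholders:

```python
# Sketch of the COPY-from-S3 bulk-load pattern: stage a file in S3,
# then issue one COPY so the cluster ingests it in parallel.
# Redshift COPY reference: https://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html
import boto3
import psycopg2

s3 = boto3.client("s3")
s3.upload_file("data.csv", "my-bucket", "staging/data.csv")  # stage the file

conn = psycopg2.connect(host="my-cluster.redshift.amazonaws.com",
                        port=5439, dbname="db", user="user", password="pass")
with conn, conn.cursor() as cur:
    # One bulk COPY instead of many INSERT round-trips.
    cur.execute("""
        COPY my_schema.my_table
        FROM 's3://my-bucket/staging/data.csv'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        FORMAT AS CSV;
    """)
```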
-
At my organization we use Apache Spark for our bigger datasets, but the main problem is the data transfer and copies: all the data (after internal optimisations) gets pulled from a DB like Oracle/PostgreSQL to the Spark executors, processed there in memory, and then written to targets like Hive. It would be great if there were a way to optimise this so data moves directly from source to sink without extra copies, while retaining the parallelism that connector-x has to offer. I'm still not sure what the role of pyarrow would be in writing data, though. For us the top sources/destinations are Snowflake, BigQuery, Hive and Oracle.
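One way pyarrow could fit in, sketched with assumed connection details and partition column, using connector-x's existing partitioned Arrow reads plus a Parquet sink that Hive/Spark can consume:

```python
# Sketch: partitioned parallel read into an Arrow table via connector-x,
# then a bulk write to Parquet with pyarrow. Connection string, query,
# and partition column are assumptions.
import connectorx as cx
import pyarrow.parquet as pq

table = cx.read_sql(
    "postgresql://user:pass@host:5432/db",
    "SELECT * FROM big_table",
    return_type="arrow",   # Arrow output avoids an extra pandas copy
    partition_on="id",     # connector-x reads partitions in parallel
    partition_num=8,
)
pq.write_table(table, "big_table.parquet")  # e.g. an external Hive location
```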
-
FYI: you should look into writing a wrapper around the bcp utility for MS SQL Server. At work I wrote a Python wrapper for it, and the result is a roughly 300x speedup over pandas.to_sql.
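A minimal sketch of that approach (the commenter's actual wrapper wasn't shared; server, credentials, and delimiter choices are placeholders):

```python
# Sketch of a bcp wrapper: dump the DataFrame to a flat file, then
# bulk-load it with bcp's "in" mode. Assumes SQL Server auth and a
# tab-delimited character-mode (-c) file.
import subprocess
import pandas as pd

def bulk_insert(df: pd.DataFrame, table: str) -> None:
    path = "staging.tsv"
    # bcp -c expects character data; write without header or index.
    df.to_csv(path, sep="\t", header=False, index=False)
    subprocess.run(
        ["bcp", table, "in", path,
         "-S", "my-server", "-d", "my_db",
         "-U", "user", "-P", "password",
         "-c", "-t", "\t"],
        check=True,
    )
```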
-
Another ETL use case: #331
-
I use cx as my ETL tool, and my data warehouse is Greenplum. Does cx have a better method for writing to Greenplum? pandas's to_sql is very slow for it.
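Since Greenplum speaks the PostgreSQL protocol, COPY FROM STDIN is the usual fast path in the meantime; a minimal sketch with psycopg 3 and placeholder table/connection details:

```python
# Sketch: stream rows into Greenplum with one COPY instead of
# per-row INSERTs, via psycopg 3's copy API.
import psycopg

rows = [(1, "a"), (2, "b")]  # e.g. produced by an ETL step

with psycopg.connect("postgresql://user:pass@gp-master:5432/db") as conn:
    with conn.cursor() as cur:
        with cur.copy("COPY my_table (id, name) FROM STDIN") as copy:
            for row in rows:
                copy.write_row(row)
```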
-
Currently, for writing back to the database, we use psycopg pipeline mode (https://www.psycopg.org/psycopg3/docs/advanced/pipeline.html, https://www.postgresql.org/docs/current/libpq-pipeline-mode.html), inserting the rows one by one as we iterate over them.
pandas.to_sql seems slow for Postgres, at least in comparison with the pipeline benchmark we did. Our use case is simply delivering some aggregation results (around 300 MB in size) to Postgres so they can be accessed through a standard web app (we use Postgres 13 at the moment). I think pipeline mode could be a good way to support Postgres: I know it worked wonders for us, and I think it could benefit anyone trying to insert large chunks into Postgres (fast insertion of large chunks being the primary use case of connector-x).
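A minimal sketch of that pattern with psycopg 3 (DSN, table, and rows are placeholders):

```python
# Sketch: queue INSERTs in libpq pipeline mode so they share network
# round-trips instead of waiting on each statement's result.
# Requires libpq >= 14 on the client; the server can be older.
import psycopg

rows = [(1, 2.5), (2, 3.5)]  # aggregation results to deliver

with psycopg.connect("postgresql://user:pass@host:5432/db") as conn:
    with conn.pipeline():
        with conn.cursor() as cur:
            for key, value in rows:
                cur.execute(
                    "INSERT INTO results (key, value) VALUES (%s, %s)",
                    (key, value),
                )
```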
-
My issue is with read_sql, which in the new versions of pandas requires SQLAlchemy. SQLAlchemy only supports a limited number of dialects. I understand completely why they do not want to try to support every possible database, especially for to_sql functionality. However, in the process they are dropping ODBC support. I see that you have ODBC listed as "WIP." I hope that means you are planning to keep ODBC for accessing databases that have an ODBC API, even if it is read-only.

In my case, my company uses an ERP system (Sage 100 Advanced) built on ProvideX, a pretty obscure database. They provide an ODBC driver for read-only access, which works. I extract data from our ERP on a regular basis to provide better reporting for my users beyond the included Crystal reports, and I do not want my users to be able to write back to this system from Python. I may be unique in using this particular database, but I have to believe there are many other similarly obscure databases driving other systems that would benefit from keeping some form of ODBC access with a read_sql type of function.
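A hypothetical reconstruction of the pyodbc fallback being compared here (the original snippet isn't shown; the DSN and query are placeholders):

```python
# Sketch: read through an ODBC driver with pyodbc, building the
# DataFrame by hand. Column names come from cursor.description, but
# numeric types (DECIMAL/FLOAT) now need manual handling downstream.
import pandas as pd
import pyodbc

conn = pyodbc.connect("DSN=Sage100;UID=user;PWD=pass")
cur = conn.cursor()
cur.execute("SELECT * FROM AR_Customer")
columns = [d[0] for d in cur.description]
df = pd.DataFrame.from_records(cur.fetchall(), columns=columns)
```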
This doesn't look like much of a difference, but the cursor.fetchall method is slower and I now have to worry about data typing for DECIMAL and FLOAT columns.
-
We are collecting feedback about the to_sql functionality (similar to pandas.to_sql). It would be very helpful if you could let us know your answers to the following questions:
1. Do you need the to_sql functionality? If yes, what are the current issues you have with pandas.to_sql and/or other tools you use?
2. How do you plan to use to_sql? (e.g. what does the SQL/workload look like? data size? any constraints or indexes on the table? do you write to a new table or insert into an existing one?)