Ideas about to_sql? #213
Replies: 8 comments
-
Then inserting into SQL using:
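A hypothetical sketch of such a call (the exact snippet isn't shown above; the connection string, table name, and parameters are assumptions):

```python
# Hypothetical reconstruction of the slow path being described:
# pandas.to_sql over a SQLAlchemy engine. Connection string, table
# name, and chunking parameters are illustrative assumptions.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://user:pass@host:5432/db")

df = pd.DataFrame({"id": range(400_000), "value": 1.0})

# Multi-row INSERT statements are what make this slow at ~400k rows;
# chunksize and method="multi" help somewhat but don't close the gap.
df.to_sql("my_table", engine, if_exists="append", index=False,
          chunksize=10_000, method="multi")
```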
The issue is that my data has around 400,000 rows and I have to run the process daily. On my local machine it takes around 300 seconds to push that data; even using a scalable solution, it takes around 180 seconds. That is really slow, especially given how massive the difference is between the read_sql implementations of pandas and connectorx.
-
A lot of databases like Redshift/Snowflake support COPY from object storage (like S3), which is generally the fastest way to write data in bulk.
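A minimal sketch of that pattern, assuming Redshift as the target; the bucket, table, and IAM role below are placeholders:

```python
# Sketch of the COPY-from-S3 bulk-load pattern: stage a file in S3,
# then issue one COPY so the cluster ingests it in parallel.
# Redshift COPY reference: https://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html
import boto3
import psycopg2

s3 = boto3.client("s3")
s3.upload_file("data.csv", "my-bucket", "staging/data.csv")  # stage the file

conn = psycopg2.connect(host="my-cluster.redshift.amazonaws.com",
                        port=5439, dbname="db", user="user", password="pass")
with conn, conn.cursor() as cur:
    # One bulk COPY instead of many INSERT round-trips.
    cur.execute("""
        COPY my_schema.my_table
        FROM 's3://my-bucket/staging/data.csv'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        FORMAT AS CSV;
    """)
```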
-
At my organization we use Apache Spark for our bigger datasets, but the main problem is the data transfer and copies: all the data (after internal optimisations) gets pulled from a DB like Oracle/PostgreSQL to the Spark executors, processed there in memory, and then written to targets like Hive. It would be great if there were a way to optimise this so data moves directly from source to sink without extra copies, while retaining the parallelism that connector-x has to offer. I'm still not sure what the role of pyarrow would be in writing data, though. For us the top sources/destinations are Snowflake, BigQuery, Hive and Oracle.
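One way pyarrow could fit in, sketched with assumed connection details and partition column, using connector-x's existing partitioned Arrow reads plus a Parquet sink that Hive/Spark can consume:

```python
# Sketch: partitioned parallel read into an Arrow table via connector-x,
# then a bulk write to Parquet with pyarrow. Connection string, query,
# and partition column are assumptions.
import connectorx as cx
import pyarrow.parquet as pq

table = cx.read_sql(
    "postgresql://user:pass@host:5432/db",
    "SELECT * FROM big_table",
    return_type="arrow",   # Arrow output avoids an extra pandas copy
    partition_on="id",     # connector-x reads partitions in parallel
    partition_num=8,
)
pq.write_table(table, "big_table.parquet")  # e.g. an external Hive location
```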
-
FYI: you should look into writing a wrapper around the bcp utility for MS SQL Server. At work I wrote a Python wrapper for it, and the result is a roughly 300x speedup over pandas.to_sql.
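A minimal sketch of that approach (the commenter's actual wrapper wasn't shared; server, credentials, and delimiter choices are placeholders):

```python
# Sketch of a bcp wrapper: dump the DataFrame to a flat file, then
# bulk-load it with bcp's "in" mode. Assumes SQL Server auth and a
# tab-delimited character-mode (-c) file.
import subprocess
import pandas as pd

def bulk_insert(df: pd.DataFrame, table: str) -> None:
    path = "staging.tsv"
    # bcp -c expects character data; write without header or index.
    df.to_csv(path, sep="\t", header=False, index=False)
    subprocess.run(
        ["bcp", table, "in", path,
         "-S", "my-server", "-d", "my_db",
         "-U", "user", "-P", "password",
         "-c", "-t", "\t"],
        check=True,
    )
```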
-
Another ETL use case: #331
-
I use cx as my ETL tool, and my data warehouse is Greenplum. Does cx have a better method for writing to Greenplum? pandas's to_sql is very slow for it.
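Since Greenplum speaks the PostgreSQL protocol, COPY FROM STDIN is the usual fast path in the meantime; a minimal sketch with psycopg 3 and placeholder table/connection details:

```python
# Sketch: stream rows into Greenplum with one COPY instead of
# per-row INSERTs, via psycopg 3's copy API.
import psycopg

rows = [(1, "a"), (2, "b")]  # e.g. produced by an ETL step

with psycopg.connect("postgresql://user:pass@gp-master:5432/db") as conn:
    with conn.cursor() as cur:
        with cur.copy("COPY my_table (id, name) FROM STDIN") as copy:
            for row in rows:
                copy.write_row(row)
```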
-
Currently, for writing back to the database, we use psycopg pipeline mode (https://www.psycopg.org/psycopg3/docs/advanced/pipeline.html, https://www.postgresql.org/docs/current/libpq-pipeline-mode.html), inserting the rows one by one as we iterate over them.
pandas.to_sql seems slow for Postgres, at least in comparison with the pipeline benchmark we did. Our use case is simply delivering some aggregation results (around 300 MB in size) to Postgres so they can be accessed through a standard web app (we use Postgres 13 at the moment). I think pipeline mode could be a good way to support Postgres: I know it worked wonders for us, and I think it could benefit anyone trying to insert large chunks into Postgres (fast insertion of large chunks being the primary use case of connector-x).
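A minimal sketch of that pattern with psycopg 3 (DSN, table, and rows are placeholders):

```python
# Sketch: queue INSERTs in libpq pipeline mode so they share network
# round-trips instead of waiting on each statement's result.
# Requires libpq >= 14 on the client; the server can be older.
import psycopg

rows = [(1, 2.5), (2, 3.5)]  # aggregation results to deliver

with psycopg.connect("postgresql://user:pass@host:5432/db") as conn:
    with conn.pipeline():
        with conn.cursor() as cur:
            for key, value in rows:
                cur.execute(
                    "INSERT INTO results (key, value) VALUES (%s, %s)",
                    (key, value),
                )
```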
-
My issue is with read_sql, which in the new versions of pandas requires SQLAlchemy. SQLAlchemy only supports a limited number of dialects. I understand completely why they do not want to try to support every possible database, especially for to_sql functionality. However, in the process they are dropping ODBC support. I see that you have ODBC listed as "WIP." I hope that means you are planning to keep ODBC for accessing databases that have an ODBC API, even if it is read-only.

In my case, my company uses an ERP system (Sage 100 Advanced) built on ProvideX, a pretty obscure database. They provide an ODBC driver for read-only access, which works. I extract data from our ERP on a regular basis to provide better reporting for my users beyond the included Crystal reports, and I do not want my users to be able to write back to this system from Python. I may be unique in using this particular database, but I have to believe there are many other similarly obscure databases driving other systems that would benefit from keeping some form of ODBC access with a read_sql type of function.
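A hypothetical reconstruction of the pyodbc fallback being compared here (the original snippet isn't shown; the DSN and query are placeholders):

```python
# Sketch: read through an ODBC driver with pyodbc, building the
# DataFrame by hand. Column names come from cursor.description, but
# numeric types (DECIMAL/FLOAT) now need manual handling downstream.
import pandas as pd
import pyodbc

conn = pyodbc.connect("DSN=Sage100;UID=user;PWD=pass")
cur = conn.cursor()
cur.execute("SELECT * FROM AR_Customer")
columns = [d[0] for d in cur.description]
df = pd.DataFrame.from_records(cur.fetchall(), columns=columns)
```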
This doesn't look like much of a difference, but the cursor.fetchall method is slower and I now have to worry about data typing for DECIMAL and FLOAT columns.
-
We are collecting feedback about the to_sql functionality (similar to pandas.to_sql). It would be very helpful if you could let us know your answers to the following questions:
1. Do you need the to_sql functionality? If yes, what are the current issues you have with pandas.to_sql and/or other tools you use?
2. How do you plan to use to_sql? (e.g. what does the SQL/workload look like? data size? any constraints or indexes on the table? do you write to a new table or insert into an existing one?)