How to load data from a large database into pandas?
I have a Postgres database which contains time series data. The size of the database is around 1 GB. Currently, to read the data, this is what I do:
import psycopg2
import pandas as pd
import pandas.io.sql as psql

# Connect to the local Postgres instance
conn = psycopg2.connect(database="metrics", user="*******", password="*******", host="localhost", port="5432")

# Read the whole table into a single DataFrame
df = psql.read_sql("SELECT * FROM timeseries", conn)
print(df)
But this loads the entire dataset into memory. Now, I am aware of techniques where the database can be dumped to a CSV file and the CSV file can then be read in chunks, as suggested here: How to read a 6 GB csv file with pandas.
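For reference, this is roughly the chunked CSV reading I have in mind (a minimal sketch; the file name and chunk size are only illustrative):

import pandas as pd

# Read a large CSV dump in fixed-size chunks instead of all at once
# (file name and chunk size are placeholders)
for chunk in pd.read_csv("timeseries_dump.csv", chunksize=100_000):
    print(chunk.shape)  # placeholder: process each chunk here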
But for me that is not an option, since the database will be continuously changing and I need to read it on the fly. Is there any technique to read the database content in chunks, or any third-party library that can do this?
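Something along these lines is what I am hoping for. This is only a rough sketch based on the chunksize argument that pandas' read_sql accepts; I don't know whether it is the right approach for a table that keeps changing:

import psycopg2
import pandas as pd

conn = psycopg2.connect(database="metrics", user="*******", password="*******", host="localhost", port="5432")

# With chunksize set, read_sql returns an iterator of DataFrames
# instead of loading the whole table at once (chunk size is illustrative)
for chunk in pd.read_sql("SELECT * FROM timeseries", conn, chunksize=50_000):
    print(chunk.shape)  # placeholder: process each chunk here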