Memory problems when reading a table with tens of millions of rows in Python

tony

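When exporting data with Python, you sometimes need to read a very large result set from a big table.
With the default approach the Python process runs out of memory: the default cursor buffers every row of the result set on the client before you can process any of them, so memory overflows easily.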

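Solutions:
1) Use SSCursor (a streaming cursor) so the client does not hold the whole result set in memory. (This cursor does not buffer the result set on the client. The rows stay on the server and are returned to you one at a time as you fetch them.)
2) Use an iterator instead of fetchall(), which saves memory and also delivers the first rows quickly.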

import MySQLdb.cursors

# Replace the placeholder connection values with your own.
conn = MySQLdb.connect(host='ip-address', user='username', passwd='password', db='database-name', port=3306,
                       charset='utf8', cursorclass=MySQLdb.cursors.SSCursor)
cur = conn.cursor()
cur.execute("SELECT * FROM bigtable")
row = cur.fetchone()
while row is not None:
    # process the row here
    row = cur.fetchone()

cur.close()
conn.close()
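
The fetchone() loop can also be wrapped in a generator that pulls rows in small batches with fetchmany(), a standard DB-API cursor method that SSCursor provides as well. The following is only a minimal sketch: the helper name iter_rows and the batch size are illustrative assumptions, and sqlite3 from the standard library stands in for MySQL purely so the snippet runs without a server.

```python
import sqlite3


def iter_rows(cursor, batch_size=1000):
    """Yield rows one at a time, fetching them from the cursor in batches.

    iter_rows is a hypothetical helper name, not part of any library.
    """
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        for row in rows:
            yield row


if __name__ == "__main__":
    # sqlite3 is only a stand-in here. With MySQLdb, pass a cursor that was
    # created on a connection using cursorclass=MySQLdb.cursors.SSCursor.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE bigtable (id INTEGER PRIMARY KEY, val TEXT)")
    conn.executemany("INSERT INTO bigtable (val) VALUES (?)",
                     [("row%d" % i,) for i in range(2500)])
    cur = conn.cursor()
    cur.execute("SELECT id, val FROM bigtable")
    for row in iter_rows(cur, batch_size=1000):
        pass  # process the row here
    cur.close()
    conn.close()
```

At most batch_size rows are held on the client at any moment, so memory use stays flat however large the table is.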

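Things to note:
1. Because SSCursor is an unbuffered cursor, the connection cannot run any other SQL until the result set has been read completely. Creating another cursor on the same connection does not work either.
If you need to do something else in the meantime, create a separate connection object.

2. Process the rows quickly after each read. If the client stops reading for longer than net_write_timeout (60 seconds by default), MySQL drops the connection. You can raise the limit with SET NET_WRITE_TIMEOUT = xx.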
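
The timeout can be raised for the current session, on the same connection and before the large SELECT is executed. A small sketch of how that might look; the helper name set_net_write_timeout and the value 600 are assumptions chosen for illustration.

```python
def set_net_write_timeout(cursor, seconds):
    """Raise MySQL's net_write_timeout for the current session only.

    set_net_write_timeout is a hypothetical helper, not a library function.
    `cursor` is any DB-API cursor on the connection that will stream the rows.
    """
    seconds = int(seconds)
    if seconds <= 0:
        raise ValueError("seconds must be a positive integer")
    # The value is a validated integer, so formatting it into the statement is safe.
    cursor.execute("SET SESSION net_write_timeout = %d" % seconds)


# Intended use (needs a live MySQL connection, so it is shown as comments):
#   cur = conn.cursor()
#   set_net_write_timeout(cur, 600)
#   cur.execute("SELECT * FROM bigtable")
```

Because the setting is per session, it has to be applied on the connection that runs the streaming query. Other connections keep the server default.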

Reading a large amount of data from MySQL in chunks with Python

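import gc
import pymysql
from pymysql.cursors import DictCursor

# Fill in the connection settings for your own database.
__sql_dict = {
    'host': '',
    'user': '',
    'passwd': '',
    'charset': 'utf8',
    'db': ''
}


def get_data(table, source, step=10000):
    """Yield the rows of `table` whose source column equals `source`, reading `step` rows per query."""
    _conn = pymysql.connect(**__sql_dict)
    try:
        with _conn.cursor(cursor=DictCursor) as cursor:
            # Table names cannot be bound as parameters, so `table` must be a trusted name.
            sql = 'select count(*) as total from {0} where source=%s'.format(table)
            cursor.execute(sql, (source,))
            total = cursor.fetchone()['total']
            _count = 0
            # Without an ORDER BY clause MySQL does not guarantee a stable row order between pages.
            sql = 'select * from {0} where source=%s limit %s,%s'.format(table)
            for start in range(0, total, step):
                cursor.execute(sql, (source, start, step))
                for line in cursor.fetchall():
                    yield line
                    _count += 1
                    # Force a garbage collection pass every 10000 rows.
                    if _count == 10000:
                        _count = 0
                        gc.collect()
    finally:
        # Runs even if the caller stops iterating early.
        _conn.close()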
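
With LIMIT offset,count the server still has to walk past all skipped rows, so pages deep into a large table get slower. A different technique, keyset pagination, resumes each chunk after the last key value seen instead. The sketch below is not the method used above: the function name iter_by_key, the key column "id" and the placeholder argument are assumptions, the source filter is left out for brevity, and sqlite3 stands in for MySQL only so the example can run on its own.

```python
import sqlite3


def iter_by_key(conn, table, key="id", step=10000, placeholder="%s"):
    """Yield rows in chunks of `step`, each chunk starting after the last key seen.

    Hypothetical helper. `table` and `key` are formatted into the SQL, so they
    must be trusted names; `key` should be a unique, indexed column.
    `placeholder` is "%s" for MySQLdb/pymysql and "?" for sqlite3.
    """
    first_sql = "SELECT * FROM {0} ORDER BY {1} LIMIT {2}".format(
        table, key, int(step))
    next_sql = "SELECT * FROM {0} WHERE {1} > {3} ORDER BY {1} LIMIT {2}".format(
        table, key, int(step), placeholder)
    cur = conn.cursor()
    try:
        cur.execute(first_sql)
        while True:
            rows = cur.fetchall()
            if not rows:
                break
            # Locate the key column in the result (assumes tuple-style rows).
            key_index = [d[0] for d in cur.description].index(key)
            for row in rows:
                yield row
            last_key = rows[-1][key_index]
            cur.execute(next_sql, (last_key,))
    finally:
        cur.close()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE bigtable (id INTEGER PRIMARY KEY, val TEXT)")
    conn.executemany("INSERT INTO bigtable (id, val) VALUES (?, ?)",
                     [(i, "row%d" % i) for i in range(1, 26)])
    for row in iter_by_key(conn, "bigtable", key="id", step=10, placeholder="?"):
        pass  # process the row here
    conn.close()
```

Each query touches only `step` rows through the index on the key column, and the total row count no longer has to be computed up front.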