Traversing All Elasticsearch Search Results (Python Implementation)

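An Elasticsearch search can only return a limited number of hits (by default, from + size may not exceed 10,000, controlled by the index.max_result_window setting), so I used the official scroll API to write a function that processes the full result set.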


def travel_es(es, process_func, **kwargs):
    """
    Traverse every hit of an Elasticsearch search and pass each one to process_func.
    es: Elasticsearch client instance.
    process_func: function called once per hit (item).
    kwargs: same arguments as the Elasticsearch search API.
    Returns the total number of hits processed.
    """
    kwargs.setdefault("scroll", "2m")
    kwargs.setdefault("size", 1000)
    res = es.search(**kwargs)

    sid = res['_scroll_id']
    hits = res['hits']['hits']
    total_size = 0
    while hits:
        # Process the current batch of hits before fetching the next one
        for item in hits:
            process_func(item)
        total_size += len(hits)
        print(total_size)

        # Fetch the next batch, keeping the scroll context alive for the same duration
        res = es.scroll(scroll_id=sid, scroll=kwargs["scroll"])

        # Update the scroll ID, which may change between calls
        sid = res['_scroll_id']
        hits = res['hits']['hits']

    # Release the scroll context on the server
    es.clear_scroll(scroll_id=sid)
    return total_size
    

Example usage

The following stores all results in item_list.

item_list = []
def save_to(item):
    item_list.append(item)

# Process the results with the save_to function
# (es is assumed to be an existing Elasticsearch client instance)
count = travel_es(
    es,
    save_to,
    index="reveye_v2",
    body={
        "query": {
            "match_all": {}
        }
    }
)
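
The same scroll loop can also be written as a generator, so callers can use a plain for loop instead of a callback. This is a minimal sketch rather than part of the original function: the names iter_es_hits and StubES are hypothetical, and StubES is only an in-memory stand-in that mimics the search / scroll / clear_scroll calls so the example runs without a cluster.

```python
def iter_es_hits(es, **kwargs):
    """Yield every hit of a search, one at a time, via the scroll API.

    Hypothetical helper: `es` is assumed to expose search(), scroll() and
    clear_scroll() like the elasticsearch-py client.
    """
    kwargs.setdefault("scroll", "2m")
    kwargs.setdefault("size", 1000)
    res = es.search(**kwargs)
    sid = res["_scroll_id"]
    try:
        while res["hits"]["hits"]:
            for hit in res["hits"]["hits"]:
                yield hit
            # Fetch the next batch and pick up the (possibly new) scroll ID
            res = es.scroll(scroll_id=sid, scroll=kwargs["scroll"])
            sid = res["_scroll_id"]
    finally:
        # Free the server-side scroll context, even if the caller stops early
        es.clear_scroll(scroll_id=sid)


class StubES:
    """In-memory stand-in for a client (an assumption for this demo only)."""

    def __init__(self, docs):
        self.docs = docs
        self.pos = 0
        self.size = 0
        self.cleared = False

    def _page(self):
        # Return the next `size` documents in a response-shaped dict
        batch = self.docs[self.pos:self.pos + self.size]
        self.pos += len(batch)
        return {"_scroll_id": "stub-scroll", "hits": {"hits": batch}}

    def search(self, **kwargs):
        self.pos = 0
        self.size = kwargs["size"]
        return self._page()

    def scroll(self, scroll_id, scroll):
        return self._page()

    def clear_scroll(self, scroll_id):
        self.cleared = True


# Five fake documents, fetched two at a time
stub = StubES([{"_id": str(i)} for i in range(5)])
ids = [hit["_id"] for hit in iter_es_hits(stub, index="reveye_v2", size=2)]
print(ids)
```

With a real cluster, pass the es client in place of stub. The elasticsearch-py package also ships elasticsearch.helpers.scan, which wraps a similar scroll loop and may be preferable when no custom behavior is needed.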
