@looper1211
2018-12-17T06:49:22.000000Z
字数 4532
阅读 1047
哈希表又称为散列表、链表散列等,由列表+链表构成,每个列表的元素项都是一个链表,在python中,字典(dict)就是利用哈希表结构实现的。
当向哈希表插入一个(key,val)时,先通过哈希函数对key算出一个值,再对哈希表的长度(即列表的长度)取模,得到的结果就是列表的下标,然后再遍历对应位置上链表结点,如果发现某个结点存储的key相同,就进行覆盖val值,否则为当前(key,val)创建新结点,插入到链表中
数据真实存储在链表的结点中,哈希表集合了顺序表和链表的优势,查找效率、增删效果极高。所有的操作O(1)
爬虫爬取的数据量过大时,如果使用set集合进行取重,可能会占用太多的内存空间,此时可以采用布隆过滤器进行去重,虽然存在一定的失误率(性质3),但能够保证重复的数据一定能判断出来,它只关注样本量的大小,而不考虑单个样本的大小,这个由哈希的特性(性质1)决定的,对于大数据大样本的海量去重,特别有效
和样本数量有关、和单个样本的大小无关,但存在一定的失误率
根据m和n,计算哈希函数的个数k
根据m、n、k的值,计算实际的真实失误率
由于m和k都进行向上取整,所以P(真) < P(预期)
reids提供了一个2^32次方长度的位数组,大小是512MB,可以如下使用:
127.0.0.1:6379> getbit "bloom" 10
(integer) 0
127.0.0.1:6379> setbit "bloom" 10 1
(integer) 0
127.0.0.1:6379> getbit "bloom" 10
(integer) 1
127.0.0.1:6379>
from BloomFilter.GeneralHashFunctions import *
import redis
import pickle
import base64
class BloomFilterRedis(object):
# 哈希函数列表
hash_list = [rs_hash, js_hash, pjw_hash, elf_hash, bkdr_hash, sdbm_hash, djb_hash, dek_hash]
def __init__(self, key, host='127.0.0.1', port=6379):
self.key = key
self.redis = redis.StrictRedis(host=host, port=port, charset='utf-8')
def random_generator(self, hash_value):
'''
将hash函数得出的函数值映射到[0, 2^32-1]区间内
'''
return hash_value % (1 << 32)
def do_filter(self, item, save=True):
"""
过滤,判断是否存在
:param item:
:return:
"""
flag = True # 默认存在
for hash in self.hash_list:
# 计算哈希值
hash_value = hash(item)
# 获取映射到位数组的下标值
index_value = self.random_generator(hash_value)
# 判断指定位置标记是否为 0
if self.redis.getbit(self.key, index_value) == 0:
# 如果不存在需要保存,则写入
if save:
self.redis.setbit(self.key, index_value, 1)
flag = False
return flag
class Stu(object):
def __init__(self, name, age):
self.name = name
self.age = age
if __name__ == '__main__':
bloom = BloomFilterRedis("bloom_obj")
# 对对象进行去重,必须实现序列化
data = pickle.dumps(Stu("xiaohong", 18))
data1 = base64.b64encode(data).decode()
ret = bloom.do_filter(data1)
print(ret)
data = pickle.dumps(Stu("xiaohong", 18))
data1 = base64.b64encode(data).decode()
ret = bloom.do_filter(data1)
print(ret)
bloom = BloomFilterRedis("bloom_url")
ret = bloom.do_filter("http://www.baidu.com")
print(ret)
ret = bloom.do_filter("http://www.baidu.com")
print(ret)
开源的哈希函数:
#
#**************************************************************************
#* *
#* General Purpose Hash Function Algorithms Library *
#* *
#* Author: Arash Partow - 2002 *
#* URL: http://www.partow.net *
#* URL: http://www.partow.net/programming/hashfunctions/index.html *
#* *
#* Copyright notice: *
#* Free use of the General Purpose Hash Function Algorithms Library is *
#* permitted under the guidelines and in accordance with the MIT License. *
#* http://www.opensource.org/licenses/MIT *
#* *
#**************************************************************************
def rs_hash(key):
a = 378551
b = 63689
hash_value = 0
for i in range(len(key)):
hash_value = hash_value * a + ord(key[i])
a = a * b
return hash_value
def js_hash(key):
hash_value = 1315423911
for i in range(len(key)):
hash_value ^= ((hash_value << 5) + ord(key[i]) + (hash_value >> 2))
return hash_value
def pjw_hash(key):
bits_in_unsigned_int = 4 * 8
three_quarters = (bits_in_unsigned_int * 3) / 4
one_eighth = bits_in_unsigned_int / 8
high_bits = 0xFFFFFFFF << int(bits_in_unsigned_int - one_eighth)
hash_value = 0
test = 0
for i in range(len(key)):
hash_value = (hash_value << int(one_eighth)) + ord(key[i])
test = hash_value & high_bits
if test != 0:
hash_value = ((hash_value ^ (test >> int(three_quarters))) & (~high_bits))
return hash_value & 0x7FFFFFFF
def elf_hash(key):
hash_value = 0
for i in range(len(key)):
hash_value = (hash_value << 4) + ord(key[i])
x = hash_value & 0xF0000000
if x != 0:
hash_value ^= (x >> 24)
hash_value &= ~x
return hash_value
def bkdr_hash(key):
seed = 131 # 31 131 1313 13131 131313 etc..
hash_value = 0
for i in range(len(key)):
hash_value = (hash_value * seed) + ord(key[i])
return hash_value
def sdbm_hash(key):
hash_value = 0
for i in range(len(key)):
hash_value = ord(key[i]) + (hash_value << 6) + (hash_value << 16) - hash_value;
return hash_value
def djb_hash(key):
hash_value = 5381
for i in range(len(key)):
hash_value = ((hash_value << 5) + hash_value) + ord(key[i])
return hash_value
def dek_hash(key):
hash_value = len(key);
for i in range(len(key)):
hash_value = ((hash_value << 5) ^ (hash_value >> 27)) ^ ord(key[i])
return hash_value
def bp_hash(key):
hash_value = 0
for i in range(len(key)):
hash_value = hash_value << 7 ^ ord(key[i])
return hash_value
def fnv_hash(key):
fnv_prime = 0x811C9DC5
hash_value = 0
for i in range(len(key)):
hash_value *= fnv_prime
hash_value ^= ord(key[i])
return hash_value
def ap_hash(key):
hash_value = 0xAAAAAAAA
for i in range(len(key)):
if (i & 1) == 0:
hash_value ^= ((hash_value << 7) ^ ord(key[i]) * (hash_value >> 3))
else:
hash_value ^= (~((hash_value << 11) + ord(key[i]) ^ (hash_value >> 5)))
return hash_value