@looper1211
2018-12-17T06:49:22.000000Z
字数 4532
阅读 1166

哈希表又称为散列表、链表散列等,由列表+链表构成,每个列表的元素项都是一个链表,在python中,字典(dict)就是利用哈希表结构实现的。
当向哈希表插入一个(key,val)时,先通过哈希函数对key算出一个值,再对哈希表的长度(即列表的长度)取模,得到的结果就是列表的下标,然后再遍历对应位置上链表结点,如果发现某个结点存储的key相同,就进行覆盖val值,否则为当前(key,val)创建新结点,插入到链表中
数据真实存储在链表的结点中,哈希表集合了顺序表和链表的优势,查找效率、增删效果极高。所有的操作O(1)
爬虫爬取的数据量过大时,如果使用set集合进行取重,可能会占用太多的内存空间,此时可以采用布隆过滤器进行去重,虽然存在一定的失误率(性质3),但能够保证重复的数据一定能判断出来,它只关注样本量的大小,而不考虑单个样本的大小,这个由哈希的特性(性质1)决定的,对于大数据大样本的海量去重,特别有效
和样本数量有关、和单个样本的大小无关,但存在一定的失误率

根据m和n,计算哈希函数的个数k

根据m、n、k的值,计算实际的真实失误率

由于m和k都进行向上取整,所以P(真) < P(预期)
reids提供了一个2^32次方长度的位数组,大小是512MB,可以如下使用:
127.0.0.1:6379> getbit "bloom" 10(integer) 0127.0.0.1:6379> setbit "bloom" 10 1(integer) 0127.0.0.1:6379> getbit "bloom" 10(integer) 1127.0.0.1:6379>
from BloomFilter.GeneralHashFunctions import *import redisimport pickleimport base64class BloomFilterRedis(object):# 哈希函数列表hash_list = [rs_hash, js_hash, pjw_hash, elf_hash, bkdr_hash, sdbm_hash, djb_hash, dek_hash]def __init__(self, key, host='127.0.0.1', port=6379):self.key = keyself.redis = redis.StrictRedis(host=host, port=port, charset='utf-8')def random_generator(self, hash_value):'''将hash函数得出的函数值映射到[0, 2^32-1]区间内'''return hash_value % (1 << 32)def do_filter(self, item, save=True):"""过滤,判断是否存在:param item::return:"""flag = True # 默认存在for hash in self.hash_list:# 计算哈希值hash_value = hash(item)# 获取映射到位数组的下标值index_value = self.random_generator(hash_value)# 判断指定位置标记是否为 0if self.redis.getbit(self.key, index_value) == 0:# 如果不存在需要保存,则写入if save:self.redis.setbit(self.key, index_value, 1)flag = Falsereturn flagclass Stu(object):def __init__(self, name, age):self.name = nameself.age = ageif __name__ == '__main__':bloom = BloomFilterRedis("bloom_obj")# 对对象进行去重,必须实现序列化data = pickle.dumps(Stu("xiaohong", 18))data1 = base64.b64encode(data).decode()ret = bloom.do_filter(data1)print(ret)data = pickle.dumps(Stu("xiaohong", 18))data1 = base64.b64encode(data).decode()ret = bloom.do_filter(data1)print(ret)bloom = BloomFilterRedis("bloom_url")ret = bloom.do_filter("http://www.baidu.com")print(ret)ret = bloom.do_filter("http://www.baidu.com")print(ret)
开源的哈希函数:
##**************************************************************************#* *#* General Purpose Hash Function Algorithms Library *#* *#* Author: Arash Partow - 2002 *#* URL: http://www.partow.net *#* URL: http://www.partow.net/programming/hashfunctions/index.html *#* *#* Copyright notice: *#* Free use of the General Purpose Hash Function Algorithms Library is *#* permitted under the guidelines and in accordance with the MIT License. *#* http://www.opensource.org/licenses/MIT *#* *#**************************************************************************def rs_hash(key):a = 378551b = 63689hash_value = 0for i in range(len(key)):hash_value = hash_value * a + ord(key[i])a = a * breturn hash_valuedef js_hash(key):hash_value = 1315423911for i in range(len(key)):hash_value ^= ((hash_value << 5) + ord(key[i]) + (hash_value >> 2))return hash_valuedef pjw_hash(key):bits_in_unsigned_int = 4 * 8three_quarters = (bits_in_unsigned_int * 3) / 4one_eighth = bits_in_unsigned_int / 8high_bits = 0xFFFFFFFF << int(bits_in_unsigned_int - one_eighth)hash_value = 0test = 0for i in range(len(key)):hash_value = (hash_value << int(one_eighth)) + ord(key[i])test = hash_value & high_bitsif test != 0:hash_value = ((hash_value ^ (test >> int(three_quarters))) & (~high_bits))return hash_value & 0x7FFFFFFFdef elf_hash(key):hash_value = 0for i in range(len(key)):hash_value = (hash_value << 4) + ord(key[i])x = hash_value & 0xF0000000if x != 0:hash_value ^= (x >> 24)hash_value &= ~xreturn hash_valuedef bkdr_hash(key):seed = 131 # 31 131 1313 13131 131313 etc..hash_value = 0for i in range(len(key)):hash_value = (hash_value * seed) + ord(key[i])return hash_valuedef sdbm_hash(key):hash_value = 0for i in range(len(key)):hash_value = ord(key[i]) + (hash_value << 6) + (hash_value << 16) - hash_value;return hash_valuedef djb_hash(key):hash_value = 5381for i in range(len(key)):hash_value = ((hash_value << 5) + hash_value) + ord(key[i])return hash_valuedef dek_hash(key):hash_value = len(key);for i in range(len(key)):hash_value = ((hash_value << 5) ^ (hash_value >> 27)) ^ ord(key[i])return hash_valuedef bp_hash(key):hash_value = 0for i in range(len(key)):hash_value = hash_value << 7 ^ ord(key[i])return hash_valuedef fnv_hash(key):fnv_prime = 0x811C9DC5hash_value = 0for i in range(len(key)):hash_value *= fnv_primehash_value ^= ord(key[i])return hash_valuedef ap_hash(key):hash_value = 0xAAAAAAAAfor i in range(len(key)):if (i & 1) == 0:hash_value ^= ((hash_value << 7) ^ ord(key[i]) * (hash_value >> 3))else:hash_value ^= (~((hash_value << 11) + ord(key[i]) ^ (hash_value >> 5)))return hash_value