Tencent Word Vectors in Practice: Indexing and Fast Querying with Annoy

October 15, 2011

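After last week's post 《玩转腾讯词向量：词语相似度计算和在线查询》 ("Playing with Tencent Word Vectors: Word Similarity Computation and Online Query") went out, some readers brought up annoy. I had not actually used annoy, but I was curious about it, so I decided to try it on the Tencent AI Lab word vectors.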

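The most direct way to learn something is to start from the official documentation: https://github.com/spotify/annoy . Annoy is an open-source C++/Python tool from Spotify for approximate nearest neighbor search. It is optimized for memory usage, and its indexes can be saved to and loaded from disk: "Approximate Nearest Neighbors in C++/Python optimized for memory usage and loading/saving to disk".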

Annoy (Approximate Nearest Neighbors Oh Yeah) is a C++ library with Python bindings to search for points in space that are close to a given query point. It also creates large read-only file-based data structures that are mmapped into memory so that many processes may share the same data.

Following the official documentation, I ran a simple test on my own machine (Ubuntu 16.04, 48 GB RAM, Python 2.7, gensim 3.6.0, annoy 1.15.2). What follows is a first look at Annoy.

Installing annoy is simple: inside a virtualenv, just run pip install annoy. After that you can try the simplest case, roughly following the official documentation:

In [1]: import random
 
In [2]: from annoy import AnnoyIndex
 
# f is the vector dimensionality
In [3]: f = 20
 
In [4]: t = AnnoyIndex(f)
 
In [5]: for i in xrange(100):
   ...:     v = [random.gauss(0, 1) for z in xrange(f)]
   ...:     t.add_item(i, v)
   ...:     
 
In [6]: t.build(10)
Out[6]: True
 
In [7]: t.save('test.ann.index')
Out[7]: True
 
In [8]: print(t.get_nns_by_item(0, 10))
[0, 45, 16, 17, 61, 24, 48, 20, 29, 84]
 
# Test loading the index from disk
In [10]: u = AnnoyIndex(f)
 
In [11]: u.load('test.ann.index')
Out[11]: True
 
In [12]: print(u.get_nns_by_item(0, 10))
[0, 45, 16, 17, 61, 24, 48, 20, 29, 84]

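It looks fairly convenient, but is Annoy useful? Very much so, especially for online services. There are now many Object2Vector approaches, and whether the object is a Word, Document, User, Item, or anything else, once these objects are mapped into a vector space, being able to find their nearest neighbors quickly and in real time is valuable. Annoy was born during Spotify's Hack Week and was later used in Spotify's music recommendation system. Here is its background: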

There are some other libraries to do nearest neighbor search. Annoy is almost as fast as the fastest libraries, (see below), but there is actually another feature that really sets Annoy apart: it has the ability to use static files as indexes. In particular, this means you can share index across processes. Annoy also decouples creating indexes from loading them, so you can pass around indexes as files and map them into memory quickly. Another nice thing of Annoy is that it tries to minimize memory footprint so the indexes are quite small.
Why is this useful? If you want to find nearest neighbors and you have many CPU's, you only need to build the index once. You can also pass around and distribute static files to use in production environment, in Hadoop jobs, etc. Any process will be able to load (mmap) the index into memory and will be able to do lookups immediately.
We use it at Spotify for music recommendations. After running matrix factorization algorithms, every user/item can be represented as a vector in f-dimensional space. This library helps us search for similar users/items. We have many millions of tracks in a high-dimensional space, so memory usage is a prime concern.
Annoy was built by Erik Bernhardsson in a couple of afternoons during Hack Week.

Annoy has many other strengths (Summary of features):

Euclidean distance, Manhattan distance, cosine distance, Hamming distance, or Dot (Inner) Product distance
Cosine distance is equivalent to Euclidean distance of normalized vectors = sqrt(2-2*cos(u, v))
Works better if you don’t have too many dimensions (like <100) but seems to perform surprisingly well even up to 1,000 dimensions
Small memory usage
Lets you share memory between multiple processes
Index creation is separate from lookup (in particular you can not add more items once the tree has been created)
Native Python support, tested with 2.7, 3.6, and 3.7.
Build index on disk to enable indexing big datasets that won’t fit into memory (contributed by Rene Hollander)
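The second feature in the list (cosine distance expressed as the Euclidean distance between normalized vectors) can be checked numerically. The following is a minimal sketch in plain Python, not Annoy's own code; the helper names `normalize`, `cosine` and `euclidean` are mine, chosen only for illustration.

```python
import math
import random


def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(u, v):
    """Cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def euclidean(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


random.seed(0)
u = [random.gauss(0, 1) for _ in range(20)]
v = [random.gauss(0, 1) for _ in range(20)]

# Left side: Euclidean distance after normalizing both vectors.
lhs = euclidean(normalize(u), normalize(v))
# Right side: the formula from the feature list.
rhs = math.sqrt(2 - 2 * cosine(u, v))
print(abs(lhs - rhs) < 1e-9)
```

This is why an angular index only needs the direction of each vector: scaling a word vector does not change its neighbors under this distance.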

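Now back to the Tencent word vectors and the question of how to index and query them with Annoy. Before trying it, I googled for related material. The article 《超平面多维近似向量查找工具annoy使用总结》 ("Usage notes on annoy, a hyperplane-based multidimensional approximate vector search tool") mentions a pitfall that deserves special attention: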

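But I still wanted to figure out what was going on, so I went to the official site and asked the author. The author said just one thing: you need an integer mapping (and the integers should be non-negative). Whoa!!! The official site actually states this in plain words: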

a.add_item(i, v) adds item i (any nonnegative integer) with vector v. Note that it will allocate memory for max(i)+1 items.

In other words, my txt file needs to look like

1 vec

2 vec
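The requirement in the quote (items are keyed by non-negative integers) amounts to keeping a word-to-id table alongside the index. Here is a minimal sketch, assuming a word2vec-style text file with one `word v1 v2 ...` entry per line and any header line already skipped by the caller. The function name `build_id_mapping` and the sample lines are hypothetical, not taken from the quoted article.

```python
def build_id_mapping(lines):
    """Assign consecutive non-negative integer ids to words.

    Each line is assumed to look like "word v1 v2 ... vn".
    Returns (word_to_id, vectors), where vectors[i] belongs to id i.
    """
    word_to_id = {}
    vectors = []
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word = parts[0]
        if word in word_to_id:
            continue  # keep the first occurrence of a duplicate word
        # Ids start at 0 and stay dense, which matters because
        # add_item allocates memory for max(i)+1 items.
        word_to_id[word] = len(vectors)
        vectors.append([float(x) for x in parts[1:]])
    return word_to_id, vectors


# Made-up sample data, for illustration only.
sample = [
    "apple 0.1 0.2 0.3",
    "banana 0.0 0.5 0.5",
    "cherry 0.9 0.1 0.0",
]
word_to_id, vectors = build_id_mapping(sample)
id_to_word = {i: w for w, i in word_to_id.items()}
print(word_to_id)
```

Each pair `(i, vectors[i])` can then be passed to `add_item`, and `id_to_word` turns the integer results of a query back into words.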

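So from the very beginning I planned around this pitfall. Conveniently, gensim's interfaces support this well. The gensim official documentation also has a page on Annoy that introduces an Annoy interface, which I had not noticed when using gensim before.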

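This time, though, I used annoy's own interface directly, because gensim's word2vec interface already makes the data easy to work with. Below is a brief record of the steps, with short comments on the key ones, for reference only: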

In [15]: from gensim.models import KeyedVectors
 
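# Loading takes a while. After loading, about 12 GB of memory was in use, and usage kept growing in later steps; if you want to try this, use a machine with plenty of memory
In [16]: tc_wv_model = KeyedVectors.load_word2vec_format('./Tencent_AILab_ChineseEmbedding.txt', binary=False)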
 
# Build a word-to-ID mapping and save a copy offline as JSON (handy later when loading the annoy index on its own)
In [17]: import json
 
In [18]: from collections import OrderedDict
 
In [19]: word_index = OrderedDict()
 
In [21]: for counter, key in enumerate(tc_wv_model.vocab.keys()):
    ...:     word_index[key] = counter
    ...:     
 
In [22]: with open('tc_word_index.json', 'w') as fp:
    ...:     json.dump(word_index, fp)
    ...: 
 
# Start building the Annoy index from the Tencent word vectors, which contain about 8.82 million entries
In [23]: from annoy import AnnoyIndex
 
# The Tencent word vectors have 200 dimensions
In [24]: tc_index = AnnoyIndex(200)
 
In [25]: i = 0
 
In [27]: tc_index = AnnoyIndex(200)
 
In [28]: for key in tc_wv_model.vocab.keys():
    ...:     v = tc_wv_model[key]
    ...:     tc_index.add_item(i, v)
    ...:     i += 1
    ...: 
 
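# Building also takes quite a while. The n_trees parameter is important; the official documentation says:
# n_trees is provided during build time and affects the build time and the index size. 
# A larger value will give more accurate results, but larger indexes.
# With no prior experience, I used 10 as in the documentation. Up to this point the whole process used roughly 30 GB of memory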
In [29]: tc_index.build(10)
 
Out[29]: True
 
# The index can be saved to disk; when it is loaded on its own later, memory usage is about 2 GB
In [30]: tc_index.save('tc_index_build10.index')
Out[30]: True
 
# Prepare a reverse id ==> word mapping
In [32]: reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
 
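# Now test Annoy: the results for 自然语言处理 (natural language processing) basically match those from the AINLP official account backend
# Interested readers can follow the AINLP official account and send the query: 相似词 自然语言处理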
In [33]: for item in tc_index.get_nns_by_item(word_index[u'自然语言处理'], 11):
    ...:     print(reverse_word_index[item])
    ...:     
自然语言处理
自然语言理解
计算机视觉
深度学习
机器学习
图像识别
语义理解
自然语言识别
知识图谱
自然语言
自然语音处理
 
# The results for English words look a bit different, though
In [34]: for item in tc_index.get_nns_by_item(word_index[u'nlp'], 11):
    ...:     print(reverse_word_index[item])
    ...: 
 
nlp
神经语言
机器学习理论
时间线疗法
神经科学
统计学习
统计机器学习
nlp应用
知识表示
强化学习
机器学习研究
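For the offline use mentioned in the comments (loading the saved index together with the JSON mapping), the lookup can be wrapped in a small helper. This is a sketch and not part of the original session: the names `load_word_index`, `similar_words` and `FixedIndex` are hypothetical, and `FixedIndex` is only a stand-in with hard-coded neighbors so that the example runs without the index file.

```python
import json
import os
import tempfile


def load_word_index(path):
    """Load the word -> id table and derive the id -> word table."""
    with open(path) as fp:
        word_index = json.load(fp)
    reverse_word_index = {idx: word for word, idx in word_index.items()}
    return word_index, reverse_word_index


def similar_words(index, word_index, reverse_word_index, word, n=10):
    """Return up to n neighbors of word, excluding the word itself.

    index is any object with a get_nns_by_item(i, n) method, as used
    in the session. Out-of-vocabulary words give an empty list.
    """
    if word not in word_index:
        return []
    item = word_index[word]
    # Ask for one extra result because the query item is usually returned too.
    ids = index.get_nns_by_item(item, n + 1)
    return [reverse_word_index[i] for i in ids if i != item][:n]


class FixedIndex(object):
    """Stand-in for a loaded index, with hard-coded neighbor lists."""

    def __init__(self, neighbors):
        self._neighbors = neighbors

    def get_nns_by_item(self, i, n):
        return self._neighbors[i][:n]


# Toy data: write a tiny mapping to a temporary file and read it back.
path = os.path.join(tempfile.mkdtemp(), "word_index.json")
with open(path, "w") as fp:
    json.dump({"nlp": 0, "ml": 1, "cv": 2}, fp)

word_index, reverse_word_index = load_word_index(path)
index = FixedIndex({0: [0, 1, 2], 1: [1, 0, 2], 2: [2, 1, 0]})
print(similar_words(index, word_index, reverse_word_index, "nlp", 2))
```

With the files from this post, the stand-in would be replaced by an `AnnoyIndex(200)` on which `load('tc_index_build10.index')` has been called, and the mapping would come from `tc_word_index.json`. Saving only the word-to-id direction is deliberate: JSON object keys are always strings, so an id-to-word table written to JSON would come back with string keys.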

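That completes a first pass through Annoy on the Tencent word vectors. I did not carefully compare query speed; interested readers can refer to this blog post: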

topk相似度性能比较（kd-tree、kd-ball、faiss、annoy、线性搜索） ("Top-k similarity performance comparison: kd-tree, kd-ball, faiss, annoy, linear search")

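It contains a very detailed comparison. I was short on time this round and will keep testing later; interested readers are welcome to join the discussion.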
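For readers who want to run such a comparison themselves, a linear-search baseline gives the exact answer against which an approximate index can be scored. The following is a rough sketch under my own assumptions, not a benchmark from this post: `BruteForceIndex` and `recall_at_n` are hypothetical names, the method names simply mirror the Annoy calls used above, and the distance is the angular one, sqrt(2-2*cos(u, v)).

```python
import heapq
import math
import random


class BruteForceIndex(object):
    """Exact nearest-neighbor search by linear scan (angular distance)."""

    def __init__(self, f):
        self.f = f
        self.items = {}

    def add_item(self, i, v):
        if len(v) != self.f:
            raise ValueError("expected a vector of length %d" % self.f)
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        self.items[i] = [x / norm for x in v]  # store unit vectors

    def get_nns_by_item(self, i, n):
        query = self.items[i]

        def distance(j):
            cos = sum(a * b for a, b in zip(query, self.items[j]))
            # Clamp at 0 to absorb floating point rounding.
            return math.sqrt(max(0.0, 2.0 - 2.0 * cos))

        return heapq.nsmallest(n, self.items, key=distance)


def recall_at_n(approx_ids, exact_ids):
    """Fraction of the exact neighbors that the approximate result found."""
    exact = set(exact_ids)
    if not exact:
        return 0.0
    return len(exact & set(approx_ids)) / float(len(exact))


random.seed(42)
f = 20
exact_index = BruteForceIndex(f)
for i in range(100):
    exact_index.add_item(i, [random.gauss(0, 1) for _ in range(f)])

exact = exact_index.get_nns_by_item(0, 10)
# In a real comparison `approx` would come from the Annoy index built on
# the same vectors; here two misses are simulated so the sketch runs alone.
approx = exact[:8] + [-1, -2]
print(recall_at_n(approx, exact))
```

A linear scan over 8.82 million 200-dimensional vectors in pure Python would be far too slow to be practical, so for the full Tencent data this baseline is only reasonable on a sample of query words or with a vectorized implementation.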

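Also, after the last post went out, some readers asked where the Tencent word vectors come from, so here again are the official documentation and download page for the Tencent AI Lab word vectors: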

Tencent AI Lab Embedding Corpus for Chinese Words and Phrases

https://ai.tencent.com/ailab/nlp/embedding.html

References:

Annoy: https://github.com/spotify/annoy

Approximate Nearest Neighbors in C++/Python optimized for memory usage and loading/saving to disk

超平面多维近似向量查找工具annoy使用总结 ("Usage notes on annoy, a hyperplane-based multidimensional approximate vector search tool")

https://zhuanlan.zhihu.com/p/50604120

topk相似度性能比较（kd-tree、kd-ball、faiss、annoy、线性搜索） ("Top-k similarity performance comparison: kd-tree, kd-ball, faiss, annoy, linear search")