Search notes:
Python library: ScaNN
import numpy as np
import scann
We create 3 million vectors: 1 million centered around (1, 0, 0), 1 million around (0, 1, 0), and 1 million around (0, 0, 1):
# Each cluster is drawn in a single vectorized call: loc is broadcast across
# the three columns, and scale is the standard deviation in every dimension.
v1 = np.random.normal(loc=(1, 0, 0), scale=0.01, size=(1000000, 3)).astype(np.float32)
v2 = np.random.normal(loc=(0, 1, 0), scale=0.01, size=(1000000, 3)).astype(np.float32)
v3 = np.random.normal(loc=(0, 0, 1), scale=0.01, size=(1000000, 3)).astype(np.float32)
We additionally create 10000 vectors centered around (0.5, 0.5, 0.5). The goal of this example is to find these vectors, which is why they are called needles here:
# 10000 vectors drawn around (0.5, 0.5, 0.5) in a single vectorized call.
needles = np.random.normal(loc=(0.5, 0.5, 0.5), scale=0.01, size=(10000, 3)).astype(np.float32)
The vectors are combined and randomly shuffled:
data = np.concatenate( (v1, v2, v3, needles) )
np.random.shuffle(data)
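Because np.random.shuffle reorders the rows in place, the positions of the needles are no longer known afterwards. One way to keep track of them, sketched here on a scaled-down data set, is to shuffle with an explicit permutation and apply it to a label array as well. The names make_cluster, small_data, is_needle and the seeded generator rng are assumptions made for this illustration and are not part of the original example:

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator (assumption, for reproducibility)

def make_cluster(center, n, scale=0.01):
    """Draw n float32 vectors from a normal distribution around center."""
    return rng.normal(loc=center, scale=scale, size=(n, 3)).astype(np.float32)

# Scaled-down stand-ins for v1, v2, v3 and needles.
small_clusters = [make_cluster(c, 1000) for c in ((1, 0, 0), (0, 1, 0), (0, 0, 1))]
small_needles = make_cluster((0.5, 0.5, 0.5), 10)

small_data = np.concatenate(small_clusters + [small_needles])
is_needle = np.zeros(len(small_data), dtype=bool)
is_needle[-len(small_needles):] = True  # the needles were appended last

# Applying one permutation to both arrays keeps rows and labels aligned.
perm = rng.permutation(len(small_data))
small_data = small_data[perm]
is_needle = is_needle[perm]

# Row indices of the needles in the shuffled data.
needle_indices = np.flatnonzero(is_needle)
```

With such a label array, the indices returned by a search can be checked against needle_indices.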
Creating a builder:
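builder = scann.scann_ops_pybind.builder(
    data,
    num_neighbors = 10,
    distance_measure = 'squared_l2' # or, alternatively: 'dot_product'
)
# Partition the data into leaves. Because num_leaves_to_search equals
# num_leaves, every partition is searched for each query.
builder = builder.tree(
    num_leaves = 10000,
    num_leaves_to_search = 10000,
    training_sample_size = 1000000
)
# Asymmetric hashing; the positional argument is dimensions_per_block.
builder = builder.score_ah(
    10000,
    anisotropic_quantization_threshold = 0.001
)
# Re-rank the best candidates; the positional argument is reordering_num_neighbors.
builder = builder.reorder(
    1000
)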
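To get a feeling for the two distance measures named in the builder call, their values can be computed directly with NumPy for the query (0.5, 0.5, 0.5) against the idealized cluster centres. This is a plain arithmetic sketch, not a ScaNN call; the array name centers is illustrative:

```python
import numpy as np

query = np.array([0.5, 0.5, 0.5], dtype=np.float32)
centers = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.5],  # centre of the needles
], dtype=np.float32)

# Squared Euclidean distance: smaller means closer.
squared_l2 = ((centers - query) ** 2).sum(axis=1)

# Dot product: larger means more similar.
dot_product = centers @ query

print(squared_l2)
print(dot_product)
```

The three large clusters have a squared L2 distance of 0.75 and a dot product of 0.5, while the needle centre has a squared L2 distance of 0 and a dot product of 0.75. For this particular data set, both measures therefore rank the needles first.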
Creating a searcher:
searcher = builder.build()
Executing the query:
query = np.array([ 0.5, 0.5, 0.5 ]).astype(np.float32)
neighbors, distances = searcher.search(query, final_num_neighbors=10)
Printing the result:
for x in zip(neighbors, distances):
    print(x)
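As a sanity check for the approximate search, the exact nearest neighbours can be computed by brute force with NumPy. The following is a minimal sketch on a scaled-down data set; exact_search is a hypothetical helper written for this illustration and is not part of ScaNN:

```python
import numpy as np

def exact_search(data, query, k):
    """Return indices and squared L2 distances of the k rows of data closest to query."""
    dists = ((data - query) ** 2).sum(axis=1)
    top = np.argpartition(dists, k - 1)[:k]  # the k smallest distances, unordered
    top = top[np.argsort(dists[top])]        # sort those k by distance
    return top, dists[top]

rng = np.random.default_rng(0)  # seeded generator (assumption, for reproducibility)
background = rng.normal((1, 0, 0), 0.01, size=(1000, 3)).astype(np.float32)
needles_small = rng.normal((0.5, 0.5, 0.5), 0.01, size=(10, 3)).astype(np.float32)
data_small = np.concatenate((background, needles_small))

query = np.array([0.5, 0.5, 0.5], dtype=np.float32)
idx, dist = exact_search(data_small, query, k=10)

# The needles occupy rows 1000 to 1009, so all ten exact neighbours should be needles.
print(sorted(idx.tolist()))
```

On the full data set, the indices from such an exact search could be compared with neighbors to estimate how many true nearest neighbours the approximate searcher returns, although a brute-force pass over 3 million vectors is considerably slower.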