Effortlessly Query Documents Using Upstash Vector DB

A tutorial on using Upstash Vector DB for querying documents in a serverless fashion

Skanda Vivek
6 min read · Feb 6, 2024
Vector DB from Bing Image Creator | Skanda Vivek

With the rise of industry LLM use cases, vector DBs offer a neat way to query custom data with high performance. Upstash recently introduced its own vector DB offering, which strikes a nice balance between ease of use and visual performance analytics.

Creating a Vector DB Index

The first step is to create a new index on Upstash, which you can do online quite easily, as below. Note that the free tier caps indexes at 1536 dimensions, which matches the output dimension of OpenAI's default embedding model ("text-embedding-ada-002").

Creating a new vector index on upstash | Skanda Vivek
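
If you want to double-check that your embedding model matches the index dimension before uploading anything, here is a quick illustrative check (it assumes the OpenAI Python client and an OPENAI_API_KEY in your environment):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
emb = client.embeddings.create(input=["hello"], model="text-embedding-ada-002").data[0].embedding
print(len(emb))  # 1536, which should match the index dimension chosen above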

Upstash also provides sample code for upserting vectors into an index and querying them:

from upstash_vector import Index

index = Index(
    url="https://apt-badger-85452-us1-vector.upstash.io",
    token="ABQFMGFwdC1iYWRnZXItODU0NTItdXMxYWRtaW5NMkkyTTJJeU56VXRNak0xT1MwME5qQTJMV0kxTWpVdE5EWmhZek5qWXpNNU9EWmw=",
)

index.upsert(
    vectors=[
        ("id1", [0.9, 0.93, 0.54, 0.4, ...]),
    ]
)

index.query(
    vector=[0.9, 0.93, 0.54, 0.4, 0.64, ...],
    top_k=1,
    include_vectors=True,
    include_metadata=True,
)
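
Rather than hard-coding credentials as above, the SDK can also read them from the environment; a minimal sketch, assuming the Index.from_env() helper and the UPSTASH_VECTOR_REST_URL / UPSTASH_VECTOR_REST_TOKEN variables it expects:

from upstash_vector import Index

# Reads UPSTASH_VECTOR_REST_URL and UPSTASH_VECTOR_REST_TOKEN from the environment
index = Index.from_env()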

Loading, Tokenizing, and Embedding the Data

In this example, I want to search over the Mistral 7B paper, so I download the PDF and extract its text using the PyMuPDF reader as below:

import requests
import fitz  # PyMuPDF
import io

# Mistral 7B paper
url = "https://arxiv.org/pdf/2310.06825.pdf"
request = requests.get(url)
filestream = io.BytesIO(request.content)
with fitz.open(stream=filestream, filetype="pdf") as doc:
    text = ""
    for page in doc:
        text += page.get_text()
print(text[:100])

Mistral 7B
Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford,
Devendra Singh Cha

The next part is tokenizing the text into chunks and embedding each chunk:


import pandas as pd
import tiktoken
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def get_embedding(text, model="text-embedding-ada-002"):
    text = text.replace("\n", " ")
    return client.embeddings.create(input=[text], model=model).data[0].embedding

def tokenize(text, max_tokens) -> pd.DataFrame:
    """Split the text into chunks of a maximum number of tokens and embed each chunk."""

    # Load the cl100k_base tokenizer, which is designed to work with the ada-002 model
    tokenizer = tiktoken.get_encoding("cl100k_base")

    df = pd.DataFrame(['0', text]).T
    df.columns = ['title', 'text']

    # Tokenize the text and save the number of tokens to a new column
    df['n_tokens'] = df.text.apply(lambda x: len(tokenizer.encode(x)))

    shortened = []

    # Loop through the dataframe
    for row in df.iterrows():

        # If the text is None, go to the next row
        if row[1]['text'] is None:
            continue

        # If the number of tokens is greater than the max number of tokens, split the text into chunks
        if row[1]['n_tokens'] > max_tokens:
            shortened += split_into_many(row[1]['text'], tokenizer, max_tokens)

        # Otherwise, add the text to the list of shortened texts as-is
        else:
            shortened.append(row[1]['text'])

    df = pd.DataFrame(shortened, columns=['text'])
    df['n_tokens'] = df.text.apply(lambda x: len(tokenizer.encode(x)))

    # Embed each chunk with the OpenAI embeddings API
    df['embeddings'] = df.text.apply(lambda x: get_embedding(x))

    return df

def split_into_many(text: str, tokenizer: tiktoken.Encoding, max_tokens: int = 1024) -> list:
    """Split a string into many strings of at most a specified number of tokens."""

    # Split the text on spaces into words
    words = text.split(' ')

    # Get the number of tokens for each word (with a leading space, since chunks are rejoined with spaces)
    n_tokens = [len(tokenizer.encode(" " + word)) for word in words]

    chunks = []
    tokens_so_far = 0
    chunk = []

    # Loop through the words and their token counts together
    for word, token in zip(words, n_tokens):

        # If adding this word would exceed the max number of tokens,
        # add the current chunk to the list of chunks and reset
        if tokens_so_far + token > max_tokens and chunk:
            chunks.append(" ".join(chunk))
            chunk = []
            tokens_so_far = 0

        chunk.append(word)
        tokens_so_far += token + 1

    # Keep the final partial chunk rather than dropping it
    if chunk:
        chunks.append(" ".join(chunk))

    return chunks

df = tokenize(text, 100)
Document embeddings | Skanda Vivek
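
Before uploading, it's worth a quick sanity check that the chunks stayed under the token limit and that the embeddings have the expected dimension; an illustrative inspection of the dataframe built above:

# Illustrative sanity checks on the chunked dataframe
print(df.shape)                    # number of chunks
print(df.n_tokens.max())           # should be at most ~100
print(len(df.embeddings.iloc[0]))  # 1536 for text-embedding-ada-002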

Uploading to Upstash Vector DB

Next, it is fairly straightforward to upload this data into the Upstash Vector DB as below:

import numpy as np
from typing import Iterator

df['vector_id'] = df.index
df['vector_id'] = df['vector_id'].apply(str)

# Models a simple batch generator that makes chunks out of an input DataFrame
class BatchGenerator:

    def __init__(self, batch_size: int = 10) -> None:
        self.batch_size = batch_size

    # Makes chunks out of an input DataFrame
    def to_batches(self, df: pd.DataFrame) -> Iterator[pd.DataFrame]:
        splits = self.splits_num(df.shape[0])
        if splits <= 1:
            yield df
        else:
            for chunk in np.array_split(df, splits):
                yield chunk

    # Determines how many chunks the DataFrame contains
    def splits_num(self, elements: int) -> int:
        return round(elements / self.batch_size)

    __call__ = to_batches

df_batcher = BatchGenerator(300)

from upstash_vector import Vector

vectors = []

# Build Vector objects, storing the chunk text as metadata
# (.iloc keeps indexing positional, so this also works for batches whose index doesn't start at 0)
print("Uploading vectors to content namespace..")
for batch_df in df_batcher(df):
    for i in range(0, len(batch_df)):
        vec = Vector(
            id=batch_df.vector_id.iloc[i],
            vector=batch_df.embeddings.iloc[i],
            metadata={"text": batch_df.text.iloc[i]},
        )
        vectors.append(vec)

index.upsert(vectors)
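
One thing to note: the loop above accumulates every vector and then issues a single upsert call. For larger datasets you may prefer to upsert one batch at a time so each request stays small; a minimal variation using the same df_batcher and Vector class:

# Upsert one batch per request instead of all vectors at once
for batch_df in df_batcher(df):
    batch = [
        Vector(
            id=row.vector_id,
            vector=row.embeddings,
            metadata={"text": row.text},
        )
        for row in batch_df.itertuples()
    ]
    index.upsert(batch)  # one network call per batch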

Vector DB Querying and Visualization

A nice part of the Upstash vector DB is that you can browse all your data in the console, as below. Note that you can query by vector ID or by embedding (and it is pretty fast):

Upstash Vector DB Query Tool | Skanda Vivek
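
The same ID lookup is available from the SDK; a minimal sketch, assuming the fetch method documented for upstash_vector (treat the exact signature as an assumption):

# Fetch a stored vector by its ID (method name per the upstash_vector docs; signature assumed)
records = index.fetch(ids=["0"], include_metadata=True)
print(records[0].metadata["text"][:80])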

Here’s a sample query returning the top 5 relevant chunks:

# Search for the 5 most similar vectors to the query embedding
embedding = get_embedding("What is Mistral?")

res = index.query(vector=embedding, top_k=5, include_metadata=True)
[r.metadata['text'] for r in res]

['Lachaux,\nPierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix,\nWilliam El Sayed\nAbstract\nWe introduce Mistral 7B, a 7-billion-parameter language model engineered for\nsuperior performance and efficiency. Mistral 7B outperforms',

'Hugging Face 3 is\nalso streamlined for easier integration. Moreover, Mistral 7B is crafted for ease of fine-tuning across\na myriad of tasks. As a demonstration of its adaptability and superior performance, we present a chat\nmodel fine-tuned from Mistral 7B',

'that significantly outperforms the Llama 2 13B - Chat model.\nMistral 7B takes a significant step in balancing the goals of getting high performance while keeping\nlarge language models efficient. Through our work, our aim is to help the community create more\naffordable,',

'model, Mistral 7B, demonstrates that\na carefully designed language model can deliver high performance while maintaining an efficient\ninference. Mistral 7B outperforms the previous best 13B model (Llama 2, [26]) across all tested\nbenchmarks, and surpasses the best 34B',

'2 7B/13B, and Llama 1 34B4 in different\ncategories. Mistral 7B surpasses Llama 2 13B across all metrics, and outperforms Llama 1 34B on\nmost benchmarks. In particular, Mistral 7B displays a superior performance']
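
From here it's a short step to a full question-answering loop: stuff the retrieved chunks into a prompt and let an LLM answer from them. A minimal sketch, reusing the OpenAI client from earlier (the model choice is illustrative, not something this tutorial prescribes):

# Assemble the retrieved chunks into a context and ask an LLM to answer from them
context = "\n\n".join(r.metadata["text"] for r in res)
completion = client.chat.completions.create(
    model="gpt-3.5-turbo",  # illustrative model choice
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: What is Mistral?"},
    ],
)
print(completion.choices[0].message.content)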

You can also visualize relevant metrics (throughput, latency, and vector count), which is super useful:

Upstash Vector DB Analytics | Skanda Vivek

Takeaways

Upstash is a great way to get started with vector DBs. The usage is very similar to Pinecone, but it is even easier to get a vector DB up and running. I had a hard time figuring out minute details of inserting data into Pinecone (a few changes to their code formatting caused old tutorial code to break), whereas Upstash was a breeze. I also like Upstash's analytics tool for quickly getting a sense of data usage and query performance.

There's definitely room for improvement, though. Upstash could provide more detailed analytics on the dashboard (e.g. types of queries, errors, etc.), but this is a good start. Also, I initially thought the data browser supported natural-language queries, but it doesn't; it only supports ID and embedding queries, which limits its value. To be fair, Upstash doesn't know a priori which embedding model is being used, just the dimension, but there must be some way around this.

Here is the full tutorial in GitHub: https://github.com/skandavivek/upstash-vectordb-tutorial/tree/main

If you like this post, follow me — I write on Generative AI in real-world applications and, more generally, on the intersections between data and society.

Feel free to connect on LinkedIn!
