To walk you through this, we have written the following Python tutorial as an example. Each guide assumes you are continuing from where the previous guide left off.
Downloading a dataset
Here is a Python example that works with a dataset we have already downloaded.
Once you have downloaded your dataset, the next steps will be to format the dataset into JSON documents (Python dictionaries) and then insert them into VecDB.
Things to note about documents
The "_id" attribute uniquely identifies each document. If you insert documents without an "_id" attribute, one is generated for you automatically. This can be problematic if an insertion fails and you need to investigate why, because you have no known ID to trace the affected document with. We therefore recommend including the "_id" field in your documents wherever possible.
From here, there are two ways to insert them: you can encode them first and then insert them, or you can insert and encode them in one go. VecDB supports both.
For this particular instance, since the images are stored locally, we will encode them on the fly (i.e. as we chunk through the docs, we encode each chunk just before inserting it). This is explored in the next section!
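As a minimal sketch of this recommendation, the helper below assigns an "_id" to any document that lacks one and checks that no two documents share an ID before they are inserted. The name `ensure_ids` and the sample documents are hypothetical and are not part of VecDB.

```python
import uuid


def ensure_ids(docs):
    """Give every document a unique string "_id" (hypothetical helper, not a VecDB API)."""
    seen = set()
    for doc in docs:
        # Keep an existing ID; otherwise generate a random UUID string.
        doc.setdefault("_id", str(uuid.uuid4()))
        if doc["_id"] in seen:
            raise ValueError(f"duplicate _id: {doc['_id']}")
        seen.add(doc["_id"])
    return docs


# Made-up documents: one without an ID, one with an explicit ID.
sample = ensure_ids([{"caption": "a dog"}, {"_id": "img-2", "caption": "a cat"}])
print([d["_id"] for d in sample])
```

Running the check before insertion means a duplicate or missing ID surfaces locally, where it is easier to trace than a failed insert.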
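import json
import uuid
# Load the downloaded captions file.
with open("datasets/annotations/captions_train2014.json") as f:
    df = json.load(f)
# Pair annotations with images by position (zip stops at the shorter list).
docs = [{"annotations": a, "image": im} for a, im in
        zip(df['annotations'], df['images'])]
# Give every document a unique string "_id".
for d in docs:
    d["_id"] = str(uuid.uuid4())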
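The encode-on-the-fly approach described above could look roughly like the sketch below: the documents are split into chunks, each chunk is encoded, and the chunk is inserted immediately afterwards. Everything here is an assumption for illustration. `chunk`, `encode_and_insert`, the `image_vector` field name, and the stand-in `fake_encode` and `stored.extend` callables are hypothetical, and in practice you would pass your own image encoder and VecDB's insert call instead.

```python
from typing import Callable, Iterator, List


def chunk(docs: list, size: int) -> Iterator[list]:
    """Yield consecutive slices of at most `size` documents."""
    for start in range(0, len(docs), size):
        yield docs[start:start + size]


def encode_and_insert(docs: List[dict],
                      encode: Callable[[dict], List[float]],
                      insert: Callable[[List[dict]], None],
                      size: int = 20,
                      vector_field: str = "image_vector") -> int:
    """Encode each chunk just before inserting it; return the number inserted."""
    inserted = 0
    for batch in chunk(docs, size):
        for doc in batch:
            # `vector_field` is a placeholder name, not a confirmed VecDB convention.
            doc[vector_field] = encode(doc)
        insert(batch)
        inserted += len(batch)
    return inserted


# Stand-ins for demonstration only: a fake encoder and an in-memory "database".
def fake_encode(doc: dict) -> List[float]:
    return [float(len(doc["image"]["file_name"]))]


stored: List[dict] = []
demo_docs = [{"_id": str(i), "image": {"file_name": f"img_{i}.jpg"}} for i in range(5)]
total = encode_and_insert(demo_docs, fake_encode, stored.extend, size=2)
print(total, len(stored))  # prints "5 5"
```

Encoding per chunk keeps only one batch of vectors in flight at a time, which matters when the images are large and stored locally.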