πŸ’» Creating your first dataset

Inserting JSON

Creating and inserting your first dataset

VecDB is a NoSQL database and functions similarly to popular document databases such as MongoDB.

You can ingest JSON into the database simply by using the insert documents endpoint.
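As a minimal sketch of what that looks like (the endpoint path and field names below are assumptions for illustration, not the documented API - check the VecDB API reference for the real ones), the request body is simply a list of JSON documents:

```python
import json

# Hypothetical endpoint path -- an assumption, not the documented route.
INSERT_ENDPOINT = "/datasets/{dataset_id}/documents/insert"

# Documents are plain JSON objects (Python dictionaries).
documents = [
    {"id": "doc-1", "caption": "A dog running on the beach"},
    {"id": "doc-2", "caption": "Two cats sleeping on a sofa"},
]

# The request body is the list of documents serialised as JSON.
payload = json.dumps({"documents": documents})
```

From here you would POST `payload` to the insert endpoint with your preferred HTTP client.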

To walk you through this, we have written the following Python tutorial as an example. Each guide in this series assumes you are continuing from where the previous one left off.

Downloading a dataset

Here is an example in Python where we download the COCO 2014 captions dataset.

import os
from tensorflow.keras.utils import get_file

root_dir = "datasets"
annotations_dir = os.path.join(root_dir, "annotations")
images_dir = os.path.join(root_dir, "train2014")
annotation_file = os.path.join(annotations_dir, "captions_train2014.json")

# Download caption annotation files
annotation_zip = get_file(
    "captions.zip",
    cache_dir=os.path.abspath("."),
    origin="http://images.cocodataset.org/annotations/annotations_trainval2014.zip",
    extract=True,
)

# Download image files
image_zip = get_file(
    "train2014.zip",
    cache_dir=os.path.abspath("."),
    origin="http://images.cocodataset.org/zips/train2014.zip",
    extract=True,
)

Creating a new dataset

Once you have downloaded your dataset, the next steps will be to format the dataset into JSON documents (Python dictionaries) and then insert them into VecDB.

import json
import uuid

# Load the downloaded annotation file.
df = json.load(open("datasets/annotations/captions_train2014.json"))

# Pair each annotation with its image to form JSON documents.
docs = [
    {"annotations": a, "image": im}
    for a, im in zip(df["annotations"], df["images"])
]

# Give every document a unique string id.
for d in docs:
    d["id"] = str(uuid.uuid4())

Things to note about documents

The id attribute is important because it uniquely identifies the document. If you insert documents without an "_id" attribute, one will be generated for you automatically, but this can be problematic if an insertion accidentally errors and you need to investigate why - so we always recommend including the id field in your documents where possible.
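For example, a minimal sketch of assigning your own identifier before insertion (whether your deployment expects `id` or `_id` is something to confirm against the API reference):

```python
import uuid

doc = {"caption": "A dog running on the beach"}

# Assign a stable unique identifier up front, so that if an insertion
# fails you can trace exactly which document was affected.
doc["_id"] = str(uuid.uuid4())
```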

From here, there are two ways to insert your documents: you can encode them first and then insert them, or you can insert and encode them in one go. VecDB supports both.

For this particular dataset, since the images are stored locally, we will encode them on the fly (i.e. as we chunk through the docs and insert them, we will encode each batch just before insertion). This is explored in the next section!
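The chunk-then-encode-then-insert loop can be sketched like this (`encode_image` and the commented-out insert call are placeholders for illustration, not real VecDB functions):

```python
import uuid


def chunk(items, size=100):
    """Yield successive fixed-size batches from a list of documents."""
    for i in range(0, len(items), size):
        yield items[i : i + size]


def encode_image(path):
    # Placeholder: a real encoder would return a vector for the image file.
    return [0.0, 0.0, 0.0]


docs = [{"id": str(uuid.uuid4()), "image_path": f"img_{i}.jpg"} for i in range(5)]

for batch in chunk(docs, size=2):
    for d in batch:
        # Encode each document just before insertion.
        d["image_vector_"] = encode_image(d["image_path"])
    # client.insert_documents(batch)  # hypothetical insert call
```

Encoding per batch keeps memory usage bounded, since only one chunk of images needs to be held as vectors at a time.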
