Creating your first dataset
Inserting JSON
Creating and inserting your first dataset
VecDB is a NoSQL database and functions similarly to popular databases such as MongoDB.
You can ingest JSON into our database simply by using the insert documents endpoint.
To walk you through this, we have written the following Python tutorial; each of the following guides assumes you are continuing from where the previous one left off.
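As a rough illustration of what an insert can look like, here is a minimal sketch that posts a couple of JSON documents over HTTP. The host, endpoint path, authorization header, and payload shape below are assumptions for illustration only, not the real VecDB API; the rest of this guide builds the actual documents you will insert.
# Hypothetical sketch only: the URL, auth header, and payload shape are
# placeholders -- consult your VecDB API reference for the real values.
import requests

documents = [
    {"_id": "doc-1", "title": "My first document"},
    {"_id": "doc-2", "title": "My second document"},
]

response = requests.post(
    "https://api.your-vecdb-host.com/datasets/my_dataset/documents/insert",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={"documents": documents},
)
response.raise_for_status()
print(response.json())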
Downloading a dataset
Here is an example in Python where we download the COCO captions dataset.
import os
from tensorflow.keras.utils import get_file
root_dir = "datasets"
annotations_dir = os.path.join(root_dir, "annotations")
images_dir = os.path.join(root_dir, "train2014")
annotation_file = os.path.join(annotations_dir, "captions_train2014.json")
# Download caption annotation files
annotation_zip = get_file(
    "captions.zip",
    cache_dir=os.path.abspath("."),
    origin="http://images.cocodataset.org/annotations/annotations_trainval2014.zip",
    extract=True,
)
# Download image files
image_zip = get_file(
    "train2014.zip",
    cache_dir=os.path.abspath("."),
    origin="http://images.cocodataset.org/zips/train2014.zip",
    extract=True,
)
Creating a new dataset
Once you have downloaded your dataset, the next steps will be to format the dataset into JSON documents (Python dictionaries) and then insert them into VecDB.
import json
import uuid

# Load the COCO captions file and pair each annotation with its image
data = json.load(open("datasets/annotations/captions_train2014.json"))
docs = [{"annotations": a, "image": im}
        for a, im in zip(data["annotations"], data["images"])]

# Give every document a unique "_id" so inserts are easy to trace
for d in docs:
    d["_id"] = str(uuid.uuid4())
Things to note about documents
The "_id" attribute is important for uniquely identifying each document. If you insert documents without an "_id", one will be generated for you automatically, but this can be problematic if an insertion errors and you need to investigate why. We therefore recommend including the "_id" field in your documents wherever possible.
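As a quick sanity check on the docs built above, you can confirm that every document carries its own "_id" before inserting anything:
# Every document should now have a unique "_id" string (a UUID)
assert all("_id" in d for d in docs)
print(docs[0]["_id"])  # a UUID string such as '9f1c...'
print(len(docs))       # how many documents we are about to insert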
From here, there are two ways to insert the documents: you can encode them and then insert them, or you can insert and encode them in one go. VecDB supports both.
For this particular instance, since the images are stored locally, we will encode them on the fly (i.e. as we chunk through the docs, we will encode each one just before inserting it). This is explored in the next section!
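As a rough preview of that pattern, the sketch below chunks through docs and encodes each image just before insertion. encode_image and insert_documents are hypothetical placeholders; the actual encoding and insertion calls are covered in the next section.
# Sketch only: placeholder helpers stand in for the real encoding and
# insert calls shown in the next section.
def chunks(items, size=100):
    """Yield successive batches of `size` documents."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def encode_image(doc):
    # Placeholder: read the local image and attach its vector here.
    doc["image_vector"] = []  # field name is illustrative only
    return doc

def insert_documents(batch):
    # Placeholder: call the insert documents endpoint with this batch.
    pass

for batch in chunks(docs, size=100):
    insert_documents([encode_image(d) for d in batch])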