Creating your first dataset


Last updated 3 years ago


Inserting JSON

Creating and inserting your first dataset

VecDB is a NoSQL database and functions similarly to popular databases such as MongoDB.

You can ingest JSON into our database simply using the insert documents endpoint.

To walk you through this, we have written the following Python tutorial as an example. The subsequent guides assume that you are continuing from where the previous one left off.

Downloading a dataset

Here is an example in Python that downloads the COCO 2014 captions dataset.

import os
from tensorflow.keras.utils import get_file

root_dir = "datasets"
annotations_dir = os.path.join(root_dir, "annotations")
images_dir = os.path.join(root_dir, "train2014")
annotation_file = os.path.join(annotations_dir, "captions_train2014.json")

# Download caption annotation files
annotation_zip = get_file(
    "captions.zip",
    cache_dir=os.path.abspath("."),
    origin="http://images.cocodataset.org/annotations/annotations_trainval2014.zip",
    extract=True,
)

# Download image files
image_zip = get_file(
    "train2014.zip",
    cache_dir=os.path.abspath("."),
    origin="http://images.cocodataset.org/zips/train2014.zip",
    extract=True,
)

Creating a new dataset

Once you have downloaded your dataset, the next steps will be to format the dataset into JSON documents (Python dictionaries) and then insert them into VecDB.

import json
import uuid

df = json.load(open("datasets/annotations/captions_train2014.json"))
docs = [
    {"annotations": a, "image": im}
    for a, im in zip(df["annotations"], df["images"])
]
# Give every document a unique string _id
for d in docs:
    d["_id"] = str(uuid.uuid4())

Things to note about documents

The "_id" attribute uniquely identifies a document. If you insert documents without an "_id" attribute, one is generated automatically for you, but this can be problematic if an insertion accidentally errors and you need to investigate why. For this reason, we always recommend including the "_id" field in your documents wherever possible.
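As a sketch of this recommendation, a small helper can guarantee that every document carries an "_id" before insertion. The helper name `ensure_ids` is ours for illustration, not part of VecDB:

```python
import uuid

def ensure_ids(docs):
    """Assign a UUID string _id to any document that lacks one."""
    for d in docs:
        d.setdefault("_id", str(uuid.uuid4()))
    return docs

docs = ensure_ids([{"text": "hello"}, {"_id": "doc-1", "text": "world"}])
```

Documents that already have an "_id" keep it, so the helper is safe to run on mixed batches.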

From here, there are two ways to insert the documents: you can encode them first and then insert them, or insert and encode them in one go. VecDB supports both.

For this particular instance, since the images are stored locally, we will encode them on the fly (i.e. as we chunk through the docs, we will encode each batch just before inserting it). This is explored in the next section!
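The chunk-encode-insert pattern described above can be sketched as follows. This is only an illustration: `client.insert_documents` and `encode_image` are hypothetical stand-ins for your actual VecDB client method and image encoder, and the `image_vector_` field name is an assumed convention, not a confirmed part of the API.

```python
def chunk(docs, size=100):
    """Yield successive batches of documents."""
    for i in range(0, len(docs), size):
        yield docs[i:i + size]

def encode_and_insert(client, dataset_id, docs, encode_image, size=100):
    """Encode each batch of documents just before inserting it."""
    for batch in chunk(docs, size):
        for d in batch:
            # Hypothetical vector field name; adjust to your schema
            d["image_vector_"] = encode_image(d["image"])
        client.insert_documents(dataset_id, batch)
```

Encoding per batch like this avoids holding every vector in memory at once, which matters when the image set is large.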
