Activeloop Deep Lake
Activeloop Deep Lake is a multimodal vector store that stores embeddings and their metadata, including text, JSON, images, audio, video, and more. It can save data locally, in your own cloud, or on Activeloop storage, and it supports hybrid search over embeddings and their attributes.
This notebook showcases basic functionality of Activeloop Deep Lake. While Deep Lake can store embeddings, it can store any type of data: it is a serverless data lake with version control, a query engine, and streaming dataloaders for deep learning frameworks. For more information, please see the Deep Lake documentation or API reference.
Setting up
%pip install --upgrade --quiet langchain-openai langchain-community 'deeplake[enterprise]' tiktoken
Example provided by Activeloop
Deep Lake locally
import getpass
import os

from langchain_community.vectorstores import DeepLake
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
activeloop_token = getpass.getpass("activeloop token:")
embeddings = OpenAIEmbeddings()
from langchain_community.document_loaders import TextLoader
loader = TextLoader("../../how_to/state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
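As a rough illustration of what the splitter step does (this is not `CharacterTextSplitter`'s actual implementation, which splits on separators first), fixed-size character chunking with no overlap can be sketched as:

```python
def split_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 0) -> list[str]:
    """Naive fixed-size character chunker (illustrative sketch only)."""
    step = chunk_size - chunk_overlap
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]

chunks = split_text("a" * 2500, chunk_size=1000)
print([len(c) for c in chunks])  # [1000, 1000, 500]
```

Each chunk is later embedded and stored as one row in the dataset, which is why the tensor shapes below show one entry per chunk.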
Create a local dataset
Create a dataset locally at ./my_deeplake/, then run similarity search. The Deep Lake + LangChain integration uses Deep Lake datasets under the hood, so dataset and vector store are used interchangeably. To create a dataset in your own cloud, or in Deep Lake storage, adjust the path accordingly.
db = DeepLake(dataset_path="./my_deeplake/", embedding=embeddings, overwrite=True)
db.add_documents(docs)
# or shorter
# db = DeepLake.from_documents(docs, dataset_path="./my_deeplake/", embedding=embeddings, overwrite=True)
Query dataset
query = "What did the president say about Ketanji Brown Jackson"
docs = db.similarity_search(query)
Dataset(path='./my_deeplake/', tensors=['embedding', 'id', 'metadata', 'text'])
 tensor      htype       shape      dtype   compression
-------     -------     -------    -------  -----------
embedding  embedding   (42, 1536)  float32     None
   id        text       (42, 1)      str       None
metadata     json       (42, 1)      str       None
  text       text       (42, 1)      str       None
To suppress the dataset summary printout, pass verbose=False when initializing the vector store.
print(docs[0].page_content)
Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you're at it, pass the Disclose Act so Americans can know who is funding our elections.
Tonight, I'd like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer, an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service.
One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.
And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation's top legal minds, who will continue Justice Breyer's legacy of excellence.
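Conceptually, similarity search embeds the query with the same embedding function and ranks the stored vectors by a similarity metric. A minimal pure-Python sketch of the idea (illustrative only; the names and two-dimensional toy vectors are invented for the example, and Deep Lake's actual search runs on its own engine):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def toy_similarity_search(query_vec, store, k=2):
    """store: list of (text, vector) pairs; returns the top-k texts by similarity."""
    ranked = sorted(store, key=lambda item: cosine_similarity(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

store = [
    ("judiciary", [1.0, 0.0]),
    ("economy", [0.0, 1.0]),
    ("supreme court", [0.9, 0.1]),
]
print(toy_similarity_search([1.0, 0.0], store, k=2))  # ['judiciary', 'supreme court']
```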
Later, you can reload the dataset without recomputing embeddings:
db = DeepLake(dataset_path="./my_deeplake/", embedding=embeddings, read_only=True)
docs = db.similarity_search(query)
Deep Lake Dataset in ./my_deeplake/ already exists, loading from the storage
Deep Lake currently supports a single writer and multiple readers. Setting read_only=True avoids acquiring the writer lock.
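The single-writer constraint can be pictured as an exclusive lock that a writer acquires and read-only opens skip. A hypothetical sketch of that semantics, assuming a lock-file approach (the `ToyDataset` class is invented for illustration; this is not Deep Lake's actual locking code):

```python
import os
import tempfile

class ToyDataset:
    """Toy dataset with a writer lock file; read-only opens skip the lock."""

    def __init__(self, path, read_only=False):
        self.lock = os.path.join(path, "dataset.lock")
        self.read_only = read_only
        if not read_only:
            # O_EXCL makes open fail if another writer already holds the lock
            fd = os.open(self.lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)

    def close(self):
        if not self.read_only:
            os.remove(self.lock)

path = tempfile.mkdtemp()
writer = ToyDataset(path)                    # acquires the writer lock
reader = ToyDataset(path, read_only=True)    # readers need no lock
try:
    ToyDataset(path)                         # a second writer fails: lock held
except FileExistsError:
    print("writer lock already held")
writer.close()                               # releases the lock
```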