Chang She, formerly the VP of Engineering at Tubi and a Cloudera veteran, has extensive experience building data tools and infrastructure. But after moving into AI, She ran into significant problems with conventional data infrastructure, problems that got in the way of deploying AI models.
“Machine learning engineers and AI researchers often end up with a suboptimal development environment,” She explained to TechCrunch in an interview. “Data infrastructure companies fundamentally lack a deep understanding of the specific needs for machine learning data.”
Recognizing this gap, She, who is also a co-creator of the widely used Python data science library Pandas, partnered with software engineer Lei Xu to found LanceDB.
LanceDB is developing its namesake open-source database software, designed to support multimodal AI models: models that can train on and generate images, videos, text, and more. Backed by Y Combinator, LanceDB recently closed an $8 million seed round led by CRV, Essence VC, and Swift Ventures, bringing its total funding to $11 million.
“If multimodal AI is crucial to the future prosperity of your company, you want your valuable AI team concentrating on the model and integrating AI with business value,” Chang noted. “Regrettably, today’s AI teams spend a majority of their time contending with low-level data infrastructure details. LanceDB delivers the essential foundation AI teams need, enabling them to focus on what truly drives enterprise value and accelerates the go-to-market timeline for AI products.”
LanceDB is fundamentally a vector database: a store of numerical arrays (“vectors”) that capture the meaning of unstructured data such as images and text, so that similar items end up with similar vectors.
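To make the idea concrete, here is a minimal sketch of the lookup a vector database performs: rank stored items by how closely their vectors match a query vector. It is a toy illustration in plain Python, not LanceDB's API. The three-dimensional vectors, the item labels, and the `cosine_similarity` and `nearest` helpers are all hypothetical; real systems use vectors with hundreds or thousands of dimensions and specialized indexes rather than a full scan.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings: in practice a model would produce these
# from the underlying image or text.
items = {
    "photo of a cat": [0.9, 0.1, 0.0],
    "photo of a dog": [0.7, 0.3, 0.1],
    "quarterly sales report": [0.0, 0.2, 0.9],
}


def nearest(query, items, k=2):
    """Return the labels of the k items most similar to the query vector."""
    ranked = sorted(
        items,
        key=lambda label: cosine_similarity(query, items[label]),
        reverse=True,
    )
    return ranked[:k]


# A query vector that sits close to the two animal photos.
query = [0.85, 0.15, 0.05]
results = nearest(query, items)
print(results)
```

With these made-up numbers, the two animal photos rank ahead of the sales report, because their vectors point in nearly the same direction as the query.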
As highlighted by my colleague Paul Sawers recently, vector databases are gaining prominence as the AI hype cycle surges. Their utility spans various AI applications, from content recommendations in e-commerce and social media platforms to minimizing hallucinations in AI outputs.
The competition in the vector database segment is intense, with players like Qdrant, Vespa, Weaviate, Pinecone, and Chroma jostling for supremacy—not to mention the entrenched Big Tech incumbents. According to Chang, LanceDB distinguishes itself through superior flexibility, performance, and scalability.
LanceDB, built on Apache Arrow, is powered by a custom data format, Lance Format, that is optimized for multimodal AI training and analytics. Lance Format lets LanceDB handle up to billions of vectors and petabytes of data, including text, images, and videos. It also lets engineers manage the various forms of metadata associated with that data.
“Historically, no system has unified training, exploration, search, and large-scale data processing,” Chang stated. “Lance Format enables AI researchers and engineers to consolidate their data pipeline into a single source of truth, achieving exceptional performance throughout. It’s about much more than merely storing vectors.”
LanceDB’s revenue model revolves around offering fully managed versions of its open-source software, enriched with additional features such as hardware acceleration and governance controls. Business appears robust, with an impressive roster of clients including text-to-image platform Midjourney, chatbot unicorn Character.ai, autonomous vehicle startup WeRide, and Airtable.
She said that the recent VC investment wouldn’t detract from LanceDB’s commitment to the open-source project, which he claimed now sees approximately 600,000 downloads a month.
“Our aim was to create a solution that would dramatically simplify the workflow for AI teams dealing with large-scale multimodal data,” he affirmed. “LanceDB offers—and will continue to offer—a comprehensive set of ecosystem integrations to ease the adoption process.”