Episode 130: Unlocking AI Vector Databases with James Luan, Zilliz CPO
Co-Host
Aytekin Tank
Founder & CEO, Jotform
Co-Host
Demetri Panici
Founder, Rise Productive
About the Episode
In this episode of the AI Agents Podcast, host Demetri Panici sits down with James Luan from Zilliz to talk about how AI is already changing the day-to-day work of engineers. James explains why coding agents are already taking over parts of his workflow, how vector databases became a core building block for modern AI systems, and why retrieval still matters even in a world obsessed with bigger models. They also get into the real mechanics behind RAG, hallucinations, MCP, long-term memory for agents, and the challenges of building production-grade AI systems that can search, reason, and scale reliably. If you want a practical conversation about where agent infrastructure is going and what engineers should actually pay attention to, this episode is worth watching.
I think it's already taking over a lot of my job as an engineer. Yesterday I was trying to fix a bug that would usually take me a couple of days, but with AI it ran for four hours while I just had meetings. The agents were actually running at the same time, fixed it by themselves, and gave me very strong advice. So I think it has already taken over part of my job.
Hi, my name is Demetri Panici and I'm a content creator, agency owner, and AI enthusiast. You're listening to the AI Agents podcast brought to you by Jotform and featuring our very own CEO and founder Aytekin Tank. This is the show where artificial intelligence meets innovation, productivity, and the tools shaping the future of work. Enjoy the show.
Hello and welcome back to another episode of the AI Agents podcast. We're here today with James Luan, the co-founder and VP of engineering at Zilliz. I see you have a master's in engineering. Really excited to chat with you. Obviously AI is a thing that naturally has come out of the world of engineering. I'm curious, what was your first aha moment when you experienced working with AI? Could be in 2023, 2022. When did you first have an experience with an LLM and think, 'Wow, this is going to change how we work?'
I'm not a typical AI person. The first years of my engineering journey were more about building infrastructure and focusing on how high-performance code can work in a large-scale environment. My first aha moment came even before large language models. About five or six years ago, we tried to build a time series database and used machine learning algorithms to predict stock prices. We used ten years of stock price data to train in-house models, which took several GPU cards and a couple of days. It worked pretty well, with over 70% accuracy after running for more than 100 days.
The second moment was when pre-trained models like BERT and image generation models like DALL-E came out. I realized that using databases alone isn't enough to understand all your data. We need to combine large models together with infrastructure. That's why we tried to build Milvus. When ChatGPT came out in November 2022, I was an early adopter and knew we needed to build something that made sure the vector database could work together with ChatGPT. After reading the ReAct paper, I realized some people were already doing this, which was really cool.
Over the last couple of years, we built several retrieval-augmented generation (RAG) systems. Although it's not super new now, RAG really shocked me with how good it is. We just put all our documents into RAG and it worked perfectly, like having a coworker with one year of experience on our project.
I totally agree and I'm curious to learn more about vector databases and RAG. Could you explain to the audience what those are exactly and how they work? We've had plenty of people discuss this topic on the show, but I don't think we've had anyone from a vector database company explain it before. Hearing it from the horse's mouth would be awesome.
Sure. Large language models are really powerful and trained on a lot of material, so they have broad baseline knowledge. The challenge is that their knowledge stops at a certain point in time, because each pre-training run takes more than 12 months. By the time you start fine-tuning, the knowledge is already fixed, so you can't add new knowledge easily. The second problem is hallucination. It's not a bug but a feature, because these models do reasoning, which sometimes leads to wrong answers.
To fix that, you have to give them enough prompt context and information, just as humans do when answering questions. You might search on Google to gather enough information and then compile it into a solid answer. Something similar happens with large models. That's why we use RAG, which stands for retrieval-augmented generation. First, we retrieve the relevant information, which can come from vector databases, graph databases, or general web searches, then combine that information with your prompt and feed it into large models to get better answers.
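The retrieve-then-prompt flow James describes can be sketched in a few lines. This is a minimal illustration, not Zilliz's implementation: the documents and their "embeddings" are made-up toy vectors (a real system would embed text with a model and store the vectors in a database like Milvus).

```python
import math

# Toy corpus with hand-written "embeddings" (illustrative only; real
# systems compute these with an embedding model).
docs = [
    ("Milvus is an open-source vector database.", [0.9, 0.1, 0.0]),
    ("RAG combines retrieval with generation.",   [0.1, 0.9, 0.1]),
    ("Kubernetes orchestrates containers.",       [0.0, 0.1, 0.9]),
]

def cosine(a, b):
    # Cosine similarity: dot product over the product of vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, k=2):
    # Step 1: rank documents by similarity to the query embedding.
    ranked = sorted(docs, key=lambda d: cosine(query_vec, d[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(question, query_vec):
    # Step 2: stuff the retrieved hits into the prompt as grounding context.
    context = "\n".join(retrieve(query_vec))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

prompt = build_prompt("What is RAG?", [0.2, 0.95, 0.05])
print(prompt)
```

The prompt is then sent to the LLM, which answers from the supplied context instead of relying only on its frozen training data.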
I really appreciate that explanation. Retrieval-augmented generation is something many are using now. It was one of the immediate quick wins for AI agents, which were practically built on RAG for a while and still are to some extent. How does that compare with what you're doing at Zilliz? Could you tell us about the background of Zilliz, how you and your team got together, and what was the moment you decided to start the company?
We actually started long before large models came out, closer to when pre-training models like BERT and ResNet were dominant. We saw the opportunity because people had so much unstructured data that traditional big data or relational databases couldn't handle. All our founders come from the database industry and have built several different databases, so we know how to build stable and scalable systems.
For the first year, we focused on building GPU-accelerated databases to process massive amounts of data. But GPUs were expensive and memory-limited, so they weren't a good fit for traditional workloads. When we looked into unstructured data, things were different, because processing vectors has high computation density, which is where GPUs shine. Since we have GPU and database backgrounds, when a user asked us to build a reverse image search, we thought: why not build a database for images, videos, and audio so people can easily process their data? That's where the first version of Milvus came from, and we open-sourced it, which became popular.
We have three founders. Our CEO is a former Oracle employee, I'm also from Oracle, and our third co-founder works on the product side. I joined a little after the company began, while the other two were there from the start.
Our CEO had founded a few startups before this company, and I had worked at several other startups before we founded this one together.
If I had to describe Zilliz Cloud simply for founders using standard databases: Zilliz Cloud solves the complexity of scaling vector search applications by designing the system specifically for vector search from day one. Unlike other databases like pgvector or Elasticsearch, which focus on relational data or traditional search, we focus on computation, which is critical for vector search. We leverage new hardware like GPUs and CPU SIMD instructions to accelerate the database, which is uncommon in other databases.
We're still new compared to other databases with 10 or 20 years of history. Our product is about five or six years old and was born after Kubernetes came out. Our database runs inside Kubernetes with public cloud storage like S3, making it super easy for people to scale without worrying about data storage.
We cover many industries with shiny use cases like semantic search, reverse image search, multimodality for drug discovery, video search, and cross-modality like text-to-video or audio search. Industries we work closely with include legal and healthcare. We also have users from large delivery companies using us for recommendation systems and multimodal search. So there's no specific industry we focus on.
One interesting use case is working with robotics companies collecting videos from their robots. We help them get fine-tuning or training datasets by tagging data. For example, if a robot fails to stop at a stop sign, we convert that information into embeddings and tags to search in the vector database. This helps find many failure cases to improve their robots, including autonomous driving cars.
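The failure-mining workflow James describes (filter by tags, then rank by embedding similarity) can be sketched as below. Everything here is hypothetical: the clip names, vectors, and tags are invented, and a real deployment would run this filtered search inside a vector database rather than in Python lists.

```python
import math

# Hypothetical robot event log: each clip gets a made-up embedding plus
# tags produced by an auto-tagging step (names and values illustrative).
events = [
    {"clip": "run_017.mp4", "vec": [0.8, 0.2], "tags": {"stop_sign", "failure"}},
    {"clip": "run_042.mp4", "vec": [0.7, 0.3], "tags": {"stop_sign", "success"}},
    {"clip": "run_099.mp4", "vec": [0.1, 0.9], "tags": {"left_turn", "failure"}},
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def find_failures(query_vec, required_tags, k=5):
    # Filter by tags first, then rank survivors by vector similarity --
    # the same filtered-search pattern a vector database runs at scale.
    hits = [e for e in events if required_tags <= e["tags"]]
    hits.sort(key=lambda e: cosine(query_vec, e["vec"]), reverse=True)
    return [e["clip"] for e in hits[:k]]

# "Find clips similar to this stop-sign failure" -> candidate training data.
print(find_failures([0.9, 0.1], {"stop_sign", "failure"}))
```

The returned clips become candidates for a fine-tuning or retraining dataset focused on that failure mode.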
We have two modes of operation. Most users use our managed service for standard use cases like RAG, where they pick a model vendor and use our vector database as an API. For new use cases, or companies with accuracy and cost challenges, we work closely with their data scientists to tune models and parameters, spending more time fixing bad cases.
Our long-term goal is to reduce cost and make sure everyone can fully utilize their unstructured data. Vector search isn't new; companies like Google and Facebook have been using it in production since 2014 or 2015. But individual developers and startups haven't had easy access due to cost. We open-sourced our product to let more people use it on cloud, managed service, or locally. We also host managed services to reduce overhead for users. Over the last few years, we've reduced search costs by almost ten times, which has unlocked many new use cases and data growth.
We foresee more small startups with one or two people building cool things needing many building blocks to integrate their systems, and we want to be part of that.
This year, we built a cool product: an MCP server. I was a big fan of Claude Code but found it lost a lot of context with huge codebases of over two million lines of code. Testing complex functionality was slow and got stuck. So we combined our vector search with an MCP server integrated with Claude Code, fixing about 30% of the original problems.
Next year, we want to build another layer of agentic search on top of vector databases to help people easily search. Right now, people build agents themselves, but we want to expose agentic search at the MCP server so people can integrate easily, not only for code but for general multimodal data.
MCP is like another access layer to your databases, where you can use natural language to query data instead of writing code. For example, you can ask it to find the most similar results to an image, or filter by categories like 'dog' or the color yellow. MCP converts natural language into query expressions and runs them against the database.
MCP acts like a bus or standard protocol for building applications with multiple data sources. Traditionally, you write connectors and code to integrate tools. With MCP, it's easier to expose services and interact with agents using a standard way instead of building SDKs or protocols. Developers can build applications with just a few lines of code specifying the target, and the agent handles everything else, saving time on integrations.
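The "standard protocol instead of custom connectors" idea can be mimicked in miniature. This is only a sketch of the shape of a tool call, not the actual MCP protocol (real MCP uses JSON-RPC messages and a defined handshake), and the tool name, schema, and fake index below are all invented for illustration.

```python
import json

# A toy stand-in for an MCP-style tool server: the server advertises a
# tool schema, and any agent that speaks the protocol can call it
# without a custom SDK.
TOOL_SCHEMA = {
    "name": "vector_search",
    "description": "Find similar items, optionally filtered by category.",
    "parameters": {"query": "string", "category": "string?", "top_k": "int"},
}

# Pretend search index (a real server would query a vector database here).
FAKE_INDEX = {
    "dog":    ["golden_retriever.jpg", "husky.jpg"],
    "yellow": ["sunflower.jpg", "taxi.jpg"],
}

def handle_call(request_json):
    # The agent sends a structured call; the server maps it onto the backend.
    req = json.loads(request_json)
    assert req["tool"] == TOOL_SCHEMA["name"]
    args = req["arguments"]
    results = FAKE_INDEX.get(args.get("category"), [])[: args.get("top_k", 5)]
    return json.dumps({"results": results})

# An agent turned "find yellow things" into this structured call:
resp = handle_call(json.dumps(
    {"tool": "vector_search", "arguments": {"category": "yellow", "top_k": 1}}))
print(resp)
```

The point of the standard is that the agent side never needs to know how `handle_call` is implemented; it only needs the advertised schema.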
Everyone talks about autonomous agents and reasoning models requiring long-term memory. How do you see vector databases evolving to support complex agent workflows where simple retrieval isn't enough?
It's a trend to build agentic search because some questions need multi-hop reasoning. For example, if you ask about famous sights in Germany, the answer might require multiple queries and combining facts. AI needs to split questions into sub-questions, query, and combine facts with several rounds of reasoning and search. We've seen many such systems in production.
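The multi-hop pattern James mentions (split a question into sub-questions, retrieve for each, combine the facts) looks roughly like this. It's a deliberately simplified sketch: a real agentic search system would use an LLM for the decomposition and a vector or graph database for each lookup, whereas here both are hard-coded.

```python
# Hard-coded "knowledge base" standing in for retrieval results.
FACTS = {
    "famous sights in Germany": ["Brandenburg Gate", "Neuschwanstein Castle"],
    "Brandenburg Gate": "18th-century monument in Berlin",
    "Neuschwanstein Castle": "19th-century palace in Bavaria",
}

def decompose(question):
    # Hop 1: a first retrieval finds the entities the question is about.
    return FACTS[question]

def multi_hop(question):
    # Hop 2: a second round of retrieval fetches a fact per entity,
    # and the results are combined into one answer.
    combined = {}
    for entity in decompose(question):
        combined[entity] = FACTS[entity]
    return combined

answer = multi_hop("famous sights in Germany")
print(answer)
```

Production systems run several such rounds, with the model deciding after each hop whether it has enough facts to answer.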
For long-term agent memories, pure databases aren't enough. You also need graph databases and batch processing. Like humans, learning isn't just reading once; it requires practice, summarizing, compressing, and note-taking. Similarly, agent memory needs backend processing with batch models and infrastructure. We're working with large AI application providers to help build their agent memories.
I think autonomous agents are already taking over a lot of my job as an engineer. Yesterday, I tried to fix a bug that would usually take me days, but AI ran for four hours while I had meetings. The agents fixed it by themselves and gave me strong advice. So AI has partially taken my job, but I don't think it's the end for software engineers. AI still needs design and review work, especially with complex codebases of millions of lines. AI might fix some issues but can also mess things up, requiring refactoring.
There's a trend in AI around output generation versus analysis. Many tools produce a lot of output, but synthesizing and understanding information lags behind. For example, in HR, AI analyzes resumes and cover letters to check accuracy. Coders notice that AI outputs a lot but struggles to evaluate and synthesize its own results. Do you see this imbalance between output and analysis needing improvement?
I see AI as a very junior but fast worker. Humans can work as fast as AI on a single task, but agents can run concurrently. Some smart coders use five concurrent agents, prompting on one screen and switching projects. AI can offer multiple outputs to pick from. However, AI isn't good at evaluating results. I see myself as a manager guiding AI with code conventions, tests, and reviews to ensure quality and keep the codebase consistent with what the humans on the team know.
My favorite thing about this AI age is saving tons of time and doing multiple things concurrently. AI helps not only in coding but also in daily tasks. For example, planning a 10-day trip to Japan used to take weekends, but with AI, it takes four hours to get a plan, check facts, and even book tickets while I review code or watch a movie. AI enables parallel tasking, which is super helpful, especially for startups. I've seen startups with 20-30 people making 50 million in revenue, which was almost impossible five years ago.
Parallel tasking is amazing and exciting. I use many tools at once, multitabbing and multitasking. We used to wait for others to do work, but now AI does it concurrently. I don't like making slides, but AI saves me tons of time by building slides automatically. I use Gamma for slides, which is really good.
Besides Gamma, I use NotebookLM heavily. When learning new concepts, I put them into NotebookLM to search further and generate videos or audio to help me understand faster. Originally, I had to read a lot, but now a three to five-minute video helps me grasp concepts quickly. I also create videos for our product and share them on YouTube, which helps many people understand the basic concepts.
You can find me on LinkedIn by searching for James Luan. I'm also very active on GitHub with our open-source project Milvus, a vector database; I'm active on issues and PRs, and you can contact me or chat with me on our Discord or Slack channels. Also, check out zilliz.com for everything vector database related. Thank you so much for watching this episode. If you liked it, please leave a like. Thanks to James for being on the show and sharing his insights.
Stay Ahead with the AI Agents Podcast
Get the latest insights on AI agents, their future, and developments in the AI industry.

