How Instacart Built a Search for Billions of Products
In 2021, the Instacart search team faced a problem they could trace to their users' typing habits. One group of shoppers searched for items like "pesto pasta sauce 8oz" and expected the exact product to appear. Another group searched for things like "healthy foods" and expected the system to understand what they meant.

These are genuinely different problems. The first asks for precise keyword matching. The second asks the system to grasp intent from vague language. Solving both required two different retrieval systems running in parallel: one handled keyword search, the other handled semantic search, where meaning rather than exact words drives the result.

The setup worked, and for a while it worked well. But maintaining two systems in sync, blending their results into a single ranked list, and keeping both fast enough for millions of daily queries became a tax that grew heavier each quarter.

In this article, we will look at how Instacart's search infrastructure evolved over the years and the challenges its engineering team faced.

Disclaimer: This post is based on publicly shared details from the Instacart Engineering Team. Please comment if you notice any inaccuracies.

Before looking at the technical choices, it helps to understand the workload Instacart's search has to serve. A catalog with billions of items stretches across thousands of retailers. The system handles millions of search requests every day, with query volume swinging widely from hour to hour.

What makes the problem genuinely unusual is the write side. Grocery items are fast-moving goods: prices shift multiple times a day, inventory availability changes as shelves empty and restock, and discounts come and go. As a result, the search database receives billions of writes per day. Those writes include catalog changes, pricing updates, availability data, ancillary tables for ranking and display, personalization signals, and replacement product data.

This combination is crucial. In most search problems, you index once, refresh occasionally, and run queries against a mostly stable dataset. Instacart's situation inverts this: the data changes constantly, and every change has to show up in search results within seconds.

Two terms worth defining are precision and recall. Precision is the percentage of retrieved results that are actually relevant. Recall is the percentage of all relevant documents that the system manages to retrieve. A system with high precision returns mostly good results; a system with high recall catches most of the good results that exist. Tuning this tradeoff is the core game of search, and the architecture determines how much control you have over each.
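To make the two metrics concrete, here is a minimal sketch that computes both for a single query. The product IDs and relevance judgments are hypothetical, not from the Instacart post:

```python
def precision_recall(retrieved: list[str], relevant: set[str]) -> tuple[float, float]:
    """Compute precision and recall for one query.

    retrieved: product IDs returned by the search system, in rank order.
    relevant:  product IDs a human judged relevant for the query.
    """
    hits = sum(1 for doc_id in retrieved if doc_id in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Example: 4 results returned, 2 of them relevant, 3 relevant items exist overall.
p, r = precision_recall(["a", "b", "c", "d"], {"a", "c", "e"})
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.67
```

The tension is visible even in this toy example: returning more items can only raise recall, but every irrelevant item added drags precision down.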
Instacart's original search was built on Elasticsearch, which was the industry-standard choice for full-text search at the time, and still is today. On paper, the fit looked ideal. Elasticsearch is purpose-built for keyword search at scale, uses a well-understood ranking algorithm called BM25, and has a mature ecosystem of tooling around it.

The fit broke down because of how Elasticsearch wants data structured. Elasticsearch prefers denormalized documents, meaning one record per item that bundles together every relevant field. When a user searches, Elasticsearch can return complete documents quickly because everything is already in one place. The catch is that when any single field changes, the entire document has to be rewritten and re-indexed.

For Instacart, this was catastrophic. A price change on a single product triggers a full document rewrite. Multiply that by billions of daily writes, and the indexing load becomes crushing. The system struggled so badly that fixing erroneous data could take days. Layering sophisticated ML features on top only made things worse, since those features also had to be indexed, further degrading read performance.

The team's response was unusual. Instead of moving to a more specialized search tool, they moved search into Postgres. This was counterintuitive, since Postgres is a general-purpose relational database, meaning it organizes data into tables with structured columns and relationships between them. The rationale was practical: their catalog data already lived in Postgres, the team already had deep operational experience running it at scale, and Postgres supports full-text search through GIN indexes (a type of index optimized for searching inside composite values) and a ranking function called ts_rank.

The payoff was significant. A normalized data model, where prices, availability, and ML features live in separate joined tables instead of being stuffed into a single document, reduced their write workload by a factor of ten. A price change now touches only the price table and leaves the rest of the system alone. They also gained the ability to store hundreds of gigabytes of ML features alongside documents, which unlocked more sophisticated retrieval models.

The lesson here is easy to miss. Elasticsearch is an excellent tool, and for read-heavy, append-only workloads like log analytics it remains the right choice. The problem was the mismatch between Elasticsearch's data model assumptions and Instacart's write patterns.
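To make the Postgres approach concrete, here is a minimal sketch of full-text search with a GIN index and ts_rank. The schema, table, and column names are hypothetical, not Instacart's:

```python
import psycopg2  # assumed driver; any Postgres client works the same way

# Hypothetical schema: a products table with a precomputed tsvector column,
# indexed with GIN so full-text lookups avoid scanning every row.
SETUP = """
ALTER TABLE products ADD COLUMN IF NOT EXISTS title_tsv tsvector
    GENERATED ALWAYS AS (to_tsvector('english', title)) STORED;
CREATE INDEX IF NOT EXISTS products_title_tsv_idx
    ON products USING GIN (title_tsv);
"""

# Keyword retrieval: match the query against the tsvector and order by
# ts_rank, Postgres's built-in lexical relevance score.
SEARCH = """
SELECT id, title,
       ts_rank(title_tsv, websearch_to_tsquery('english', %s)) AS score
FROM products
WHERE title_tsv @@ websearch_to_tsquery('english', %s)
ORDER BY score DESC
LIMIT 20;
"""

def keyword_search(conn, query: str):
    with conn.cursor() as cur:
        cur.execute(SEARCH, (query, query))
        return cur.fetchall()

if __name__ == "__main__":
    conn = psycopg2.connect("dbname=catalog")  # hypothetical connection string
    with conn.cursor() as cur:
        cur.execute(SETUP)
    conn.commit()
    for row in keyword_search(conn, "pesto pasta sauce 8oz"):
        print(row)
```

Note what this layout buys under a write-heavy workload: prices and availability live in their own tables, so a price update is a single-row UPDATE that never touches the text index.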
Consolidating text search onto Postgres solved the indexing problem, but it left a different gap. A search for "pesto pasta sauce 8oz" is a straightforward matching exercise: the words are specific, the match criteria are clear, and keyword search handles it beautifully. Ambiguous queries are where keyword search hits its ceiling. When someone searches for "healthy foods," the query words and the product titles barely overlap. You want results that match the meaning of the query, since exact word matching falls short here. This is where semantic search enters the picture.

The idea is to convert text into vectors, which are simply lists of numbers (typically a few hundred long), so that texts with similar meanings end up with similar vectors. Once you have vectors for every product and for the query, finding relevant results becomes a geometry problem: you look for vectors near the query vector in this numerical space. This lookup is called approximate nearest neighbor search, or ANN. Clever index structures let ANN find close matches quickly, rather than comparing every single vector in the database against the query.

In 2021, Instacart added semantic search to its stack. Since Postgres at the time had yet to gain native ANN support, they built a standalone service using FAISS, a vector search library from Meta. For every incoming query, the system made parallel calls: one went to Postgres for keyword retrieval, the other to the FAISS service for semantic retrieval. The application layer merged the two result sets using a weighted ranking step, then passed the top candidates onward to downstream ranking stages.
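The FAISS side can be sketched in a few lines. This is a hedged illustration, not Instacart's implementation: the embeddings below are random stand-ins for what a trained embedding model would produce, and IndexFlatIP is the exact brute-force baseline (a production ANN setup would swap in an approximate index such as faiss.IndexHNSWFlat):

```python
import faiss
import numpy as np

dim = 256  # embedding size; real systems typically use a few hundred dimensions

# Stand-in embeddings for 100k products. In production these come from an
# embedding model, not a random number generator.
product_vecs = np.random.rand(100_000, dim).astype("float32")
faiss.normalize_L2(product_vecs)  # normalize so inner product = cosine similarity

index = faiss.IndexFlatIP(dim)  # exact inner-product search; simplest baseline
index.add(product_vecs)

# Embed the query the same way, then fetch its nearest neighbors.
query_vec = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query_vec)
scores, ids = index.search(query_vec, 10)
print(ids[0], scores[0])  # product row positions and similarity scores
```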
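The original post does not publish the blending formula, so the merge step below is an assumption: one common approach min-max normalizes each retriever's scores onto [0, 1], then combines them with a tunable weight:

```python
def blend(keyword_hits, semantic_hits, weight=0.5):
    """Merge two ranked result lists into one.

    keyword_hits / semantic_hits: lists of (product_id, score) pairs.
    weight: fraction of the final score taken from the semantic retriever.
    """
    def normalize(hits):
        if not hits:
            return {}
        scores = [s for _, s in hits]
        lo, hi = min(scores), max(scores)
        span = (hi - lo) or 1.0  # avoid dividing by zero when all scores tie
        return {pid: (s - lo) / span for pid, s in hits}

    kw, sem = normalize(keyword_hits), normalize(semantic_hits)
    merged = {
        pid: (1 - weight) * kw.get(pid, 0.0) + weight * sem.get(pid, 0.0)
        for pid in kw.keys() | sem.keys()
    }
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)

# Example: "b" scores well with both retrievers, so it ranks highest.
print(blend(
    [("a", 9.1), ("b", 8.0), ("d", 7.4)],   # keyword scores (e.g., ts_rank)
    [("b", 0.90), ("c", 0.82), ("a", 0.79)],  # semantic similarity scores
))
```

The weight is exactly the kind of knob the precision/recall framing predicts: leaning toward the keyword side favors precise matches on specific queries, while leaning toward the semantic side recovers relevant items that share no words with the query.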