Show HN: Lance – Alternative to Parquet for ML data (github.com/lancedb)
85 points by chop on May 31, 2023 | hide | past | favorite | 21 comments


I recently prototyped out a "chat over PDF documents" project.[1] I opted to use LanceDB for vector (embeddings) storage and retrieval and found it really nice to use.

I'm working on using it in a large project now.

[1] - https://github.com/gjreda/scratch-pdf-bot


Interesting. Curious what you found better about Lance compared to other vector DBs like Qdrant or Chroma.


I initially built this same "chat with PDFs" prototype with LangChain and qdrant. I then rebuilt it from scratch for the sake of learning and comparison.

Some context: I've been a jack-of-all-trades data scientist / machine learning engineer for the past 15 years (officially titled as an MLE the last four years).

I share that only because I think it plays a role in how I'm typically accustomed to working.

1. I found LangChain to be overkill for this use-case. While it might allow some to move more quickly when building, I found it to be cumbersome. My suspicion is this is largely because of my background - I understand how to build much of what's "under the hood" in LangChain. Because of this, I think it felt overly abstracted and I found the docs difficult to navigate and sometimes incomplete.

2. I used Qdrant via their docker image and it was simple to setup and start using. I didn't try to push the limits with it, so I can't say anything about performance. Because Qdrant runs as an http service, I found that it didn't fit well into my workflow - I'm accustomed to being able to visually inspect my data inside the interpreter, debugging, trying out commands, interacting and experimenting with my results, etc. Again, my suspicion is this is my own bias in how I typically work. Qdrant otherwise seemed very nice.

3. LanceDB felt powerful yet lightweight, and fit well into my workflow. It was far more intuitive for me. It was as if sqlite, the python data ecosystem, and a vector database had a child and named it LanceDB. Under the hood, it's built on Apache Arrow and integrates nicely with pandas, allowing me to seamlessly go from LanceDB table on disk, to pandas dataframe, and into some analysis or investigation of my LanceDB query results. This line [1] is a great example of why I liked it. This feels nicer to me than the world of API params and HTTP requests.

1. https://github.com/gjreda/scratch-pdf-bot/blob/main/gpt_pdf_...
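The "overly abstracted" point above resonates because the core retrieval step LangChain wraps is small enough to write by hand. As a purely illustrative sketch (not LanceDB's or LangChain's actual code), brute-force cosine-similarity retrieval over embeddings is just a few lines of stdlib Python:

```python
import math

def cosine_similarity(a, b):
    # cosine of the angle between two embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, corpus, k=3):
    # corpus: list of (doc_id, embedding) pairs
    scored = [(doc_id, cosine_similarity(query, emb)) for doc_id, emb in corpus]
    scored.sort(key=lambda t: t[1], reverse=True)
    return scored[:k]
```

Real vector databases replace the linear scan with an approximate index (e.g. HNSW or IVF-PQ), but the interface — query vector in, ranked ids out — is the same.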


Thank you for elaborating. I concur about langchain and Qdrant.

With langchain it was a struggle to figure out what was going on under the hood; I had to pull together multiple pieces from multiple notebooks simply to see what the Conversational Retriever Chain does. And then it was trivial to implement a variant of it myself with all the pieces transparently in one place.

I like the Qdrant interface and docs, and it seems to work well so far. For local testing I used their python client rather than docker, and it was seamless to switch to their cloud. My use case doesn’t involve pandas (maybe I wanted a breather from years of pandas-wrangling!); I think the OpenAI cookbook repo has examples of using pandas in combination with Qdrant (and many others).


Meanwhile, far too many shops still use csv. Hard not to see a new entry as slowing the move.

I am curious about the random access benefits. It seems most ML workloads will naturally scan the data sets fairly linearly. Does this maintain parity on that?


In our experience, the use cases for such random access come from maintaining large-scale training data: debugging model performance against the data requires fast slicing and dicing over the dataset, filtering down to subsets that satisfy certain distributions, and visualizing them interactively in an internal debugging UI.

A few users also use random access to shuffle and train on subsets.

Many of them migrated to Lance from a system where you store asset URLs in your format of choice (say parquet/tfrecord), partially because putting assets (i.e., images, large tensors, lidar point clouds) physically together can lead to better scan performance than loading a lot of small files from S3 or an on-prem object store / file system, due to much less metadata-server load over the directory / key-value structures. (Some of our users see issues similar to this decade-old article: https://blog.cloudera.com/the-small-files-problem/)

One motivation in designing Lance was to avoid creating a new copy of the training dataset for each model / training iteration: one copy of the data serves maintenance (updates / schema evolution / deletion), analytics & visualization, and training.

For scanning, Lance has proved slightly faster than TFRecord and Parquet over object stores (S3 and GCS). I'd attribute the scan performance mostly to Rust and tokio async I/O. It is not necessarily a better design for scans, as we simply aimed to scan "no more data than Parquet" when we designed the layouts / encoding algorithms.
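As a toy illustration of why layout matters for random access (this is not Lance's actual format, just the general idea): when records have known sizes and offsets, reading row i is a single O(1) seek rather than a scan-and-decode of everything before it:

```python
import io
import struct

# Hypothetical fixed-width schema: (int32 id, float32 value)
RECORD = struct.Struct("<if")

def write_rows(buf, rows):
    # Append each row as a fixed-size packed record
    for r in rows:
        buf.write(RECORD.pack(*r))

def read_row(buf, i):
    # O(1) random access: seek straight to row i via the fixed record size
    buf.seek(i * RECORD.size)
    return RECORD.unpack(buf.read(RECORD.size))
```

Variable-width columns (strings, images) need an extra offsets structure to keep this property, which is roughly the trade-off a random-access-friendly columnar layout has to make.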


Apologies for not getting back on this yesterday. Made the mistake of posting from my phone right before evening plans took over.

Looking at this, it seems you build a few indexes; I'm guessing those are the main drivers of the benefits? Makes sense. Does it add to the space at all? As I said, most teams I work with are still on CSV, so even if this adds some, I'm sure it is well below that.

At any rate, thanks for the response. Looks really nice!


No worries at all. For teams that are happy with CSV/JSON, I'd admit that Lance is not an ideal alternative.

> It seems you build a few indexes, such that I'm guessing those are the main drivers on the benefits?

Yeah, we are building different indices into this columnar storage format, which is actually a happy side effect of its good random-access performance. It does incur extra space for the indices.

Thanks for your kind words too!


I'd hazard that many of the teams aren't so much happy with CSV, as they are ignorant of its costs. I fought for a bit to get them to move to parquet, but all too often they insist on having it in a format that excel can open.


I’m confused. Parquet is a file format. The reference implementation is in Java, but Rust implementations exist. Is this faster because of Rust or because of the file format? Could this format offer benefits in a Java environment?


Hey, co-author of Lance here. Lance is faster in random access because the layout / encodings were designed to be fast in both the scan and random access cases. We borrowed many ideas from Google's Procella paper and Arrow's in-memory layout. We also added a bunch of I/O exec plan optimizations under the assumption that the data has large-blob columns (i.e., images, lidar point clouds) during scanning, which do not exist in traditional OLAP systems, because their workloads are different from ML training.

Re-implementing Lance in Java should have very similar I/O characteristics. There are actually some efforts to support Lance in JVM / Spark data sources.


Hey Eddy, Arrow also allows you to serialize to disk and then utilize mmap. Compared to Parquet, the downside is that Arrow's design increases storage requirements. If you're borrowing elements of Arrow's layout, does that not come with all the same downsides of just directly using Arrow's serialization? And at that point, why not just use Arrow?


Thanks for the clarification! That’s very exciting. A JVM implementation that can drop in to Spark/JVM would be great because there’s so much inertia built up around the Apache ecosystem.


Reminded me of deeplake. What is the comparative analysis?


We have not done benchmarks against deeplake yet. Deeplake has some interesting concepts in their design, I'd be very interested to do a benchmark soon.


Getting a bunch of 404s on the docs, for example https://eto-ai.github.io/lance/format.html (But this works: https://lancedb.github.io/lance/*)

Did you guys just pivot from eto-ai to lancedb?


Thanks for the call out, it's been updated now! We haven't pivoted, just updated the Github organization recently.


Can a .lance file be used without the directory structure? (For whatever reason, a single file is just much easier to work with.)


That's a good idea. The format itself is self-contained. We need a little bit of work to expose our file-level read/write API to the public.


It looks nice, but why is the implementation language mentioned? Why does it matter? If anything it’s confusing, as someone might assume it’s only for Rust programs.


Ok, we've taken Rust out of the title now.

Let's talk about the specifics of the project now please.



