Large stored fields in Lucene

I’ve been thinking recently about a NoSQL use case for Lucene. It looks like Lucene already covers many of the needs that other NoSQL platforms address, with the added benefit of robust search functionality, which other NoSQL platforms often lack.

One notable thing that is missing in Lucene is fine-grained updates, but this will be implemented sooner or later (LUCENE-3837). For now, yeah, you need to re-submit the whole document even if you want to update one tiny bit of it.

Another thing that is missing in Lucene is good handling of large stored fields. E.g. if I need to store megabyte-sized values in Lucene, currently both the indexing and retrieval performance will suffer.

I think the reason is not so much the underlying formats – they can be improved using the Codec API, and they are fairly efficient already. Rather, the main problem lies in the Lucene API for stored fields, which assumes that:

  • on writing, a complete value of a large field will be submitted as a byte[]
  • on reading, a complete value of a large field will be retrieved as a byte[]

This design means that whole values have to be kept in memory (and they need to fit on the heap!). If you have many large values per document then you will either run out of heap memory (OutOfMemoryError) or you will spend much of your time in the garbage collector.
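
To make the problem concrete, the current round trip looks roughly like this – a minimal sketch against the Lucene 4.x API (StoredField, BytesRef), where writer, reader, docNum and the file path are assumed to be an IndexWriter, an IndexReader, a document number and some large file already in scope:

File file = new File("/some/large/file");

// Writing: the whole multi-megabyte value has to be materialized as a byte[] on the heap first.
byte[] content = Files.readAllBytes(file.toPath());
Document doc = new Document();
doc.add(new StoredField("content", content));
writer.addDocument(doc);

// Reading: the whole value comes back at once as a BytesRef, again fully materialized on the heap.
Document stored = reader.document(docNum);
BytesRef value = stored.getBinaryValue("content");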

However, looking into the actual implementation of StoredFieldsWriter / StoredFieldsReader, it appears that this doesn’t have to be the case – both writing and reading internally use IndexOutput / IndexInput streams, so they are not strictly required to operate on values represented as byte[]-s.

Furthermore, IndexInput.seek(long) allows us to position the file pointer so that we could retrieve a range of a stored value.
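
For illustration, the Directory API already supports exactly this kind of positional access today – a sketch using FSDirectory and IndexInput, where the index path, the "_0.fdt" segment file name and the offset are made-up values:

Directory dir = FSDirectory.open(new File("/path/to/index"));
IndexInput in = dir.openInput("_0.fdt", IOContext.READ);  // .fdt is the stored fields data file
long valueStartOffset = 1234L;          // made-up offset where some stored value begins
in.seek(valueStartOffset);              // jump straight to that value
byte[] buf = new byte[4096];
in.readBytes(buf, 0, buf.length);       // read just a slice of it
in.close();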

So it looks like with some changes in the Lucene API we should be able to support:

  • writing large field values using InputStream-s (the stored fields writer will read from this stream to write the values to disk, so, somewhat confusingly, the input data needs to be represented as an InputStream). Note: the underlying format in Lucene40StoredFieldsWriter requires that the total length of a value be known in advance, and the amount of data in the InputStream would have to match it precisely, otherwise index corruption would result.
  • reading large field values using InputStream-s. (In the Lucene40StoredFieldsReader implementation the underlying IndexInput would be cloned so that values could be retrieved independently, and wrapped so that users couldn’t reposition the pointer outside the current value.)
  • reading ranges of field values as InputStream-s or byte[]. This is a simple extension of the above – since we know the total length of on-disk data, we can reposition the pointer to a specific offset and read up to length bytes, returning it either as a stream or as byte[].

Let’s now consider an example: an application that indexes and searches files from a file system (for simplicity let’s use java.io.File, but e.g. Hadoop exposes a similar stream-oriented API). Using this new API you could do something like this:

File file = new File(...);
Document doc = new Document();
doc.add(new Field("name", file.getAbsolutePath(), ...));
doc.add(new Field("content", file.length(), new FileInputStream(file), ...));
...

The “content” field would be treated as binary: its value would be read from the source stream, starting at the stream’s current position and consuming exactly file.length() bytes, and written directly as a stored field; once the stream is exhausted it would be closed.
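
Internally, the writer side would then boil down to a bounded copy loop from the application’s stream into the codec’s IndexOutput. A rough sketch of the idea – copyStream and its enclosing class are hypothetical helpers, and the real stored fields writer would of course also write the field headers and lengths:

import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import org.apache.lucene.store.IndexOutput;

class StoredStreamCopier {
  // Copy exactly 'length' bytes from the application's stream into the stored fields output.
  static void copyStream(InputStream in, IndexOutput out, long length) throws IOException {
    byte[] buffer = new byte[16 * 1024];
    long remaining = length;
    while (remaining > 0) {
      int read = in.read(buffer, 0, (int) Math.min(buffer.length, remaining));
      if (read == -1) {
        // the stream ended before delivering the promised number of bytes;
        // failing fast here is what prevents silent index corruption
        throw new EOFException("expected " + length + " bytes but the stream ended early");
      }
      out.writeBytes(buffer, 0, read);
      remaining -= read;
    }
  }
}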

Similarly, a streaming API for reading large values could look like the following:

InputStream readField(int docNum, String fieldName);
InputStream readField(int docNum, String fieldName, long offset, long length);

The returned InputStream would be a wrapper that limits the maximum range of positions reachable in the underlying IndexInput, to avoid crossing the field and document boundaries in the storage.
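
A minimal sketch of such a wrapper, assuming the stored fields reader hands it a cloned IndexInput that is already positioned at the start of the requested range (resource lifecycle is simplified here):

import java.io.IOException;
import java.io.InputStream;
import org.apache.lucene.store.IndexInput;

// Exposes a bounded window of an IndexInput as an InputStream; since the position can
// never advance past 'length' bytes, a caller cannot read into a neighbouring field or document.
class BoundedIndexInputStream extends InputStream {
  private final IndexInput in;  // a clone, positioned at the start of the value
  private long remaining;       // bytes left before the field/document boundary

  BoundedIndexInputStream(IndexInput in, long length) {
    this.in = in;
    this.remaining = length;
  }

  @Override
  public int read() throws IOException {
    if (remaining <= 0) return -1;
    remaining--;
    return in.readByte() & 0xFF;
  }

  @Override
  public int read(byte[] b, int off, int len) throws IOException {
    if (remaining <= 0) return -1;
    int toRead = (int) Math.min(len, remaining);
    in.readBytes(b, off, toRead);
    remaining -= toRead;
    return toRead;
  }
}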

In fact, thinking more about this, we could even expose the richer interface of IndexInput directly, since the underlying source of data is an IndexInput and is therefore, for example, seek()-able.

There are some issues with this design that are immediately obvious, and I’m sure there are other issues that are not obvious ;)

  • in the current implementation of stored fields the total length of the data has to be known a priori, and it’s assumed that all of the data is guaranteed to be available (since it’s a byte[]). In a stream-oriented API this wouldn’t have to be the case, so we would either have to relax the contract and write the length at the end, or deal with situations where the length is not known (buffer and write extents of known length? see the sketch after this list) or where an IO error occurs (if we passed bytes directly to the StoredFieldsWriter this could easily result in a corrupt index).
  • writing data via an InputStream seems backward, but Lucene would need to control the process of consuming the bytes, not the application … or perhaps Lucene could notify the application that it should start streaming the data that corresponds to a particular field in a particular doc?
  • when reading multiple fields we would have to open several streams, which could become a performance / capacity issue.
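
For the unknown-length case mentioned in the first point above, one option would be to buffer the incoming stream into extents of bounded size and length-prefix each extent. This is just a sketch of the idea (the class and method names are made up), not a proposal for the actual on-disk format:

import java.io.IOException;
import java.io.InputStream;
import org.apache.lucene.store.IndexOutput;

class ExtentWriterSketch {
  // Write a stream of unknown length as a sequence of length-prefixed extents,
  // terminated by a zero-length extent; per-extent buffering keeps heap usage bounded.
  static void writeExtents(InputStream in, IndexOutput out) throws IOException {
    byte[] extent = new byte[64 * 1024];
    int filled;
    while ((filled = fill(in, extent)) > 0) {
      out.writeVInt(filled);              // length of this extent
      out.writeBytes(extent, 0, filled);  // extent payload
    }
    out.writeVInt(0);                     // end-of-value marker
  }

  // Fill 'buf' as far as possible; returns 0 only at end of stream.
  static int fill(InputStream in, byte[] buf) throws IOException {
    int total = 0;
    int read;
    while (total < buf.length && (read = in.read(buf, total, buf.length - total)) != -1) {
      total += read;
    }
    return total;
  }
}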
