Token attributes versus term attributes

As you add documents to IndexWriter, the indexed fields sooner or later end up as TokenStreams. Their tokens and attributes are then collected, inverted and added to TermsHash, an internal, segment-like representation of the newly inverted document data, which is later turned into a proper segment on flush.

It dawned on me today, while reading FreqProxTermsWriter, that already at this stage Lucene discards all token attributes except for the following four: CharTermAttribute (which carries the term), PositionIncrementAttribute, OffsetAttribute and PayloadAttribute. The remaining ones (flags, type, keyword, and any custom attributes) are silently dropped.
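To picture what survives, here is a toy model in plain Java. It is not Lucene code: the class `InversionModel`, the `retain` method and the map-of-names representation are made up for illustration, and the only thing taken from the observation above is the list of four attribute names.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Toy model only: real Lucene works on Attribute instances, not on a map of names.
final class InversionModel {
    // The four attributes that make it into the inverted data.
    static final Set<String> RETAINED = new HashSet<>(Arrays.asList(
            "CharTermAttribute",
            "PositionIncrementAttribute",
            "OffsetAttribute",
            "PayloadAttribute"));

    private InversionModel() {}

    // Returns a copy of the token's attributes with everything else dropped.
    static Map<String, Object> retain(Map<String, Object> tokenAttributes) {
        Map<String, Object> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : tokenAttributes.entrySet()) {
            if (RETAINED.contains(e.getKey())) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Object> token = new LinkedHashMap<>();
        token.put("CharTermAttribute", "lucene");
        token.put("PositionIncrementAttribute", 1);
        token.put("OffsetAttribute", "0-6");
        token.put("TypeAttribute", "<ALPHANUM>"); // dropped
        token.put("FlagsAttribute", 2);           // dropped
        System.out.println(retain(token).keySet());
    }
}
```

The point of the model is just that the dropping is silent: nothing warns you that the type and flags never reached the index.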

There are three main reasons for this. First, there could be many transient attributes that we don’t want to store. Second, the total number of attributes could be large. Finally, we would have to store somewhere the information about which class produced the data (since you could have your own Attribute implementations), and then instantiate this class again when reading postings. This could potentially lead to many “interesting” failure modes, e.g. a different version of the attribute implementation, or classes missing from the classpath.

The conclusion is that if you want to store per-position data other than the three predefined types (position, offsets and payload), you will have to encode it yourself and put it in the payload. E.g. if I wanted to preserve the token type, I’d have to encode the type information as a payload (incidentally, there’s already a TokenFilter that does exactly this: o.a.l.analysis.payloads.TypeAsPayloadTokenFilter).
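To make the “encode it yourself” step a bit more concrete, here is a minimal sketch of the byte-level part in plain Java, with no Lucene dependency. The class name `TokenMetaPayload` and the layout (a 4-byte big-endian flags value followed by the UTF-8 bytes of the type) are my own assumptions for illustration, not a Lucene format. In a real TokenFilter you would wrap the resulting bytes in a BytesRef and set it on the PayloadAttribute.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical payload layout: [flags: 4 bytes, big-endian][type: UTF-8 bytes].
final class TokenMetaPayload {
    private TokenMetaPayload() {}

    // Packs the flags and the token type into a single byte array.
    static byte[] encode(int flags, String type) {
        byte[] typeBytes = type.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(Integer.BYTES + typeBytes.length)
                .putInt(flags)   // ByteBuffer defaults to big-endian
                .put(typeBytes)
                .array();
    }

    // Reads the flags back from the first four bytes.
    static int decodeFlags(byte[] payload) {
        return ByteBuffer.wrap(payload).getInt();
    }

    // Reads the type back from everything after the flags.
    static String decodeType(byte[] payload) {
        return new String(payload, Integer.BYTES,
                payload.length - Integer.BYTES, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] payload = encode(2, "<ALPHANUM>");
        System.out.println("flags=" + decodeFlags(payload)
                + ", type=" + decodeType(payload));
    }
}
```

Whatever layout you pick, the reader of the postings has to know it in advance, since nothing in the payload itself says how it was encoded. That is exactly the bookkeeping Lucene avoids by not storing arbitrary attributes.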

About admin

Apache Lucene / Solr / Nutch / Hadoop hacker.