Logic Pro tutorial by bufjap on YouTube

I found a nice YouTube tutorial by a guy named Richard (his site seems to be down, so that’s the only name I have) that covers some areas of Logic Pro functionality I wasn’t aware of.

The tutorial consists of 22 (!) parts and you can find the first part here.

The irony is that I’ve been using Logic for a while now – mostly fooling around with smaller pieces, nothing serious, and mostly acoustic and sampled instruments rather than synths – but still … I have the official Apple books for both the basic and the advanced level, and they should cover pretty much everything (I know they don’t), yet there were quite a few things in the tutorial that I found useful and enlightening. This was also a bit frustrating … the tutorial is meant for beginners, and I thought I already knew the basics.

I especially appreciated the fact that he went systematically through all the instruments and explained the controls and how they fit together. Although I could find my way around EXS24, I was always puzzled by the synth instruments included with Logic. I’ve been staring into their tiny, unreadable faces for hours, and I think I played with every knob I was able to recognize as a knob … but the overall idea often still escaped me. Thanks to the tutorial it’s much clearer now.

Posted in music | Leave a comment

EastWest Quantum Leap Symphonic Orchestra

I treated myself to a Silver edition, good enough not to hurt my ears and affordable enough not to hurt my wallet ;)

The patches sound very good, with depth, realism and presence, and enough articulations are available to make playing enjoyable. I used an older Garritan Orchestra before, and several patches there (oboe, some strings) sounded plasticky, often with rubbery vibratos that you couldn’t get rid of.

So, in comparison the EWQL sounds very good indeed. It’s buggy, though … both the standalone Player and the plugin in Logic crash frequently, especially when loading multiple instruments into a single plugin instance. Fortunately, I load multiple patches into a single plugin only when doodling – when I use multiple plugin instances, each with its own patch, the crashes are much less frequent … but they still occur.

Now that I can listen to the instruments without cringing, and ad-libbing is enjoyable enough, I’m planning to go through Rimsky-Korsakov’s Principles of Orchestration and some tutorials. I wonder what pieces would be good to start from – I’d like to try transcribing some short piece I know by ear and see how it goes.

Posted in music | Leave a comment

Token attributes versus term attributes

As you add documents to IndexWriter, the indexed fields sooner or later end up as TokenStream-s; the tokens and their attributes are then collected, inverted, and added to TermsHash, an internal segment-like representation of the new inverted document data, which is later turned into a proper segment on flush.

It dawned on me today, while reading FreqProxTermsWriter, that already at this stage Lucene discards all token attributes except for the following four: CharTermAttribute (which carries the term), PositionIncrementAttribute, OffsetAttribute and PayloadAttribute. The remaining ones (flags, type, keyword, and any custom attributes) are silently dropped.

The main reasons for this are that there could be many transient attributes that we don’t want to store, that the total number of attributes could be large, and finally that we would have to record somewhere which class produced the data (since you could have your own Attribute implementations) and then instantiate that class again when reading postings. This could potentially lead to many “interesting” failure modes – e.g. a different version of the attribute implementation, missing classes on the classpath, etc.

The conclusion is that if you want to store per-position data other than the three pre-defined kinds (position, offsets and payload), you will have to encode it yourself and put it in the payload. E.g. if I wanted to preserve the token type, I’d have to encode the type information as a payload (incidentally, there’s already a TokenFilter that does exactly this – o.a.l.analysis.payloads.TypeAsPayloadTokenFilter).
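To make this concrete, here is a minimal, Lucene-free sketch of the encode/decode step that such a filter performs: the token’s type string simply becomes the payload bytes stored at that position. (In the real TypeAsPayloadTokenFilter the bytes are wrapped in a BytesRef and set via PayloadAttribute; the class and method names below are hypothetical helpers, not Lucene API.)

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper illustrating "custom per-position data as payload":
// the token type string is serialized to bytes on the way in, and decoded
// back from the payload bytes at query time.
public class TypePayloadCodec {
    // encode a token type (e.g. "<ALPHANUM>") as payload bytes
    public static byte[] encode(String tokenType) {
        return tokenType.getBytes(StandardCharsets.UTF_8);
    }

    // decode the payload bytes back into the token type
    public static String decode(byte[] payload) {
        return new String(payload, StandardCharsets.UTF_8);
    }
}
```

At query time you would fetch the payload through the postings API and decode it the same way.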

Posted in Lucene | Leave a comment

SIGIR 2012 paper

Together with gsingers and rmuir we’re submitting a paper to SIGIR-2012 on Lucene 4. This is pretty exciting – it’s my first “formal” paper submitted to such an important conference.

We realized that the IR community at large is no longer familiar with Lucene as it exists today – the name is still being referenced, there were some papers (one or two, as far as I know) that presented the original design ca. Lucene 1.x, and there have been some evaluations. All of this material is pretty old – important from a historical point of view, but completely at odds with the current architecture – so it may actually be doing Lucene a disservice in the eyes of IR researchers. And yet there’s plenty of cutting-edge stuff in Lucene today, both in terms of applying modern IR algorithms and of addressing engineering challenges.

So, we figured it would be great to present Lucene as it is today (i.e. the 4.x branch) and the motivations behind the re-design, in a way that is both understandable and attractive to the academic IR community.

Time is short, the deadline is July 2, but we’re off to a good start.

Posted in Lucene | Leave a comment

Large stored fields in Lucene

I’ve been thinking recently about a NoSQL use case for Lucene. It looks like Lucene already covers many of the needs that other NoSQL platforms address, of course with the added benefit of robust search functionality, which other NoSQL platforms often lack.

One notable thing that is missing in Lucene is fine-grained updates, but this will be implemented sooner or later (LUCENE-3837). For now, yeah, you need to re-submit the whole document even if you want to update one tiny bit of it.

Another thing that is missing in Lucene is good handling of large stored fields. E.g. if I need to store megabyte-sized values in Lucene, currently both the indexing and retrieval performance will suffer.

I think the reason is not so much the underlying formats – they can be improved using the Codec API, and they are fairly efficient already. Rather the main problem lies in the Lucene API for stored fields, which assumes that:

  • on writing, a complete value of a large field will be submitted as a byte[]
  • on reading, a complete value of a large field will be retrieved as a byte[]

This design means that whole values have to be kept in memory (and they need to fit on the heap!). If you have many large values per document, then you will either run out of heap memory (OutOfMemoryError) or spend much of your time in the garbage collector.

However, looking into the actual implementation of StoredFieldsWriter / StoredFieldsReader, it appears that this doesn’t have to be the case – both writing and reading internally use IndexOutput / IndexInput streams, so they aren’t strictly required to operate on values represented as byte[]-s.

Furthermore, IndexInput.seek(long) allows us to position the file pointer so that we could retrieve a range of a stored value.

So it looks like with some changes in the Lucene API we should be able to support:

  • writing large field values using InputStream-s (the stored fields writer will read from this stream to write the values to disk, so, somewhat confusingly, the input data needs to be represented as an InputStream). Note: the underlying format in Lucene40StoredFieldsWriter requires that the total length of a value be known in advance, and the amount of data in the InputStream would have to match this length precisely, otherwise index corruption would result.
  • reading large field values using InputStream-s. (In the Lucene40StoredFields implementation the underlying IndexInput would be cloned so that values could be retrieved independently, and wrapped so that users couldn’t reposition the pointer outside the current value).
  • reading ranges of field values as InputStream-s or byte[]. This is a simple extension of the above – since we know the total length of on-disk data, we can reposition the pointer to a specific offset and read up to length bytes, returning it either as a stream or as byte[].

Let’s consider now an example: an application that indexes and searches files from a file system (for simplicity let’s consider java.io.File, but e.g. Hadoop exposes a similar stream-oriented API). Using this new API you could do something like this:

File file = new File(...);
Document doc = new Document();
doc.add(new Field("name", file.getAbsolutePath(), ...));
doc.add(new Field("content", file.length(), new FileInputStream(file), ...));

The “content” field would be treated as binary: its content would be retrieved from the source stream, starting at the stream’s current position and reading exactly file.length() bytes, and written directly as a stored field; once the stream is exhausted it would be closed.
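The tricky part on the writing side is the fixed-length contract mentioned above: the on-disk format records the value length up front, so exactly file.length() bytes must be transferred, and a short stream has to fail loudly rather than silently corrupt the index. A minimal sketch of such a copy loop (a hypothetical helper, not Lucene API):

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper: transfer exactly `length` bytes from `in` to `out`.
// If the stream ends early, throw instead of writing a short value, since
// the stored-fields format already committed to the declared length.
public class FixedLengthCopy {
    public static void copyExactly(InputStream in, OutputStream out, long length)
            throws IOException {
        byte[] buf = new byte[8192];
        long remaining = length;
        while (remaining > 0) {
            int toRead = (int) Math.min(buf.length, remaining);
            int n = in.read(buf, 0, toRead);
            if (n < 0) {
                // a corrupt index would result if we wrote fewer bytes than declared
                throw new EOFException("stream ended " + remaining + " bytes short");
            }
            out.write(buf, 0, n);
            remaining -= n;
        }
    }
}
```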

Similarly, a streaming API for reading large values could look like the following:

InputStream readField(int docNum, String fieldName);
InputStream readField(int docNum, String fieldName, long offset, long length);

The returned InputStream would be a wrapper that limits the maximum range of positions reachable in the underlying IndexInput, to avoid crossing the field and document boundaries in the storage.
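A minimal sketch of what such a wrapper could look like, using a plain InputStream as the underlying source (the class below is hypothetical; in Lucene the source would be a cloned IndexInput instead):

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical wrapper sketching the bounded stream readField() could
// return: it exposes at most `length` bytes of the underlying stream, so a
// caller can never read past the current field value into a neighboring
// field or document.
public class BoundedInputStream extends InputStream {
    private final InputStream in;
    private long remaining;

    public BoundedInputStream(InputStream in, long length) {
        this.in = in;
        this.remaining = length;
    }

    @Override
    public int read() throws IOException {
        if (remaining <= 0) return -1;   // field boundary reached
        int b = in.read();
        if (b >= 0) remaining--;
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (remaining <= 0) return -1;   // field boundary reached
        int n = in.read(buf, off, (int) Math.min(len, remaining));
        if (n > 0) remaining -= n;
        return n;
    }
}
```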

In fact, thinking more about this, we could even expose the richer interface of IndexInput itself, since the underlying source of data is an IndexInput and is therefore, for example, seek()-able.

There are some issues with this design that are immediately obvious, and I’m sure there are other issues that are not obvious ;)

  • in the current implementation of stored fields, the total length of the data has to be known a priori, and it’s assumed that all the data is guaranteed to be available (since it’s a byte[]). In a stream-oriented API this wouldn’t have to be the case, so we would either have to relax the contract and write the length at the end, or deal with situations where the length is not known (buffer and write extents of known length?) or where an IO error occurs (if we passed bytes directly to the StoredFieldsWriter, this could easily result in a corrupt index).
  • writing data via InputStream seems backward, but Lucene would need to control the process of consuming bytes and not the application … or perhaps Lucene could notify the application that the app should start streaming data that corresponds to a particular field in particular doc?
  • when reading multiple fields we would have to open several streams, which could be a performance / capacity issue.

Posted in Lucene | Leave a comment

Hello world!

I’m starting a blog. They tell me everybody should have one nowadays … and that old dogs can learn new tricks ;) Don’t expect a high volume of traffic, though.

As the tagline says, the topic of this blog will be various things related to the Apache projects I’m involved in (Lucene and Solr, Nutch, Hadoop), information retrieval, Open Source and life in general.

I dislike writing (and reading) fluff, so I’ll try to be concise. That’s it for now, folks.

Posted in Site | Leave a comment