One popular approach for large-scale data processing systems is to dump the incoming data into a distributed filesystem like HDFS, run offline map-reduce queries over it, and place the results into a data store so your apps can read from it.
If you need to store all of your data and execute queries spanning large time frames, or you don't know the queries up front, then batch mode is a great fit.
However, there are plenty of use cases where these constraints don't hold; when they don't, a stream processing model can get you your answers in a much faster and cheaper way.
Hands up: ever written a system made of queues and workers to process incoming data?
Then you'll know there are some challenges...
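To make those challenges concrete, here is a minimal sketch of the queues+workers pattern: a shared blocking queue drained by a pool of worker threads. All names are illustrative (not from any particular framework), and the comments point out where the pain shows up: if a worker dies mid-message the message is lost, there is no ack/replay, no ordering across workers, and scaling up means re-partitioning by hand.

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical queues+workers pipeline, for illustration only.
public class QueueWorkerDemo {

    static int runPipeline(int messages, int workers) throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        AtomicInteger processed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(workers);

        for (int w = 0; w < workers; w++) {
            pool.submit(() -> {
                try {
                    while (true) {
                        String msg = queue.take();       // blocks until a message arrives
                        if (msg.equals("POISON")) break; // shutdown sentinel
                        processed.incrementAndGet();     // "process" the message
                        // The challenges live here: if this worker crashes after
                        // take() but before finishing, the message is simply gone.
                        // There is no acknowledgement, no replay, and no ordering
                        // guarantee across workers.
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // Producer: enqueue the incoming data, then one sentinel per worker.
        for (int i = 0; i < messages; i++) queue.put("event-" + i);
        for (int w = 0; w < workers; w++) queue.put("POISON");

        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return processed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runPipeline(100, 4)); // prints 100
    }
}
```

In the happy path every message is counted, but nothing in this design survives a worker failure or a restart, which is exactly the gap stream processing frameworks set out to close.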
4,000+ stars, more than 500 forks: the most starred Java project on GitHub.