August 8, 2007
Stream programming is a style in which one describes a graph of interconnected actors that process a stream of data. See StreamIt. The stream can be split and joined, so it is not the simple linear stream found in typical I/O libraries. There are three kinds of parallelism to consider:
- Task parallelism: two tasks running in parallel on separate data streams
- Data parallelism: a task that can be applied independently to every element in a data stream
- Pipeline parallelism: a linear chain of tasks whose stages can run concurrently, each stage working on a different element of the stream
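To make the three kinds concrete, here is a minimal sketch of a split/join stream graph in plain Python. All of the combinator names (`pipeline`, `split_join`, `data_parallel`) are hypothetical, invented for illustration; they are not StreamIt or Hadoop APIs, and everything runs sequentially here even though each combinator marks a spot where real parallelism could be exploited.

```python
import itertools

def pipeline(*stages):
    """Pipeline parallelism: a linear chain of stages; in a real runtime
    each stage could run concurrently on successive stream elements."""
    def run(stream):
        for stage in stages:
            stream = stage(stream)
        return stream
    return run

def split_join(*branches):
    """Task parallelism: split the stream round-robin across branches,
    run each branch independently, then join the results round-robin."""
    def run(stream):
        items = list(stream)
        outs = [list(branch(items[i::len(branches)]))
                for i, branch in enumerate(branches)]
        joined = []
        for tup in itertools.zip_longest(*outs):
            joined.extend(x for x in tup if x is not None)
        return joined
    return run

def data_parallel(f):
    """Data parallelism: f applies independently to every element, so
    the map could be fanned out across any number of workers."""
    def run(stream):
        return [f(x) for x in stream]
    return run

# A graph with all three kinds: a data-parallel stage feeding a split/join.
graph = pipeline(
    data_parallel(lambda x: x * 2),
    split_join(data_parallel(lambda x: x + 1),
               data_parallel(lambda x: x - 1)),
)
print(graph(range(6)))  # [1, 1, 5, 5, 9, 9]
```

The split/join is what distinguishes this from a linear I/O-style stream: the graph branches and reconverges, and the join order (round-robin here) is part of the graph's semantics.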
I’d like to build a stream programming language that runs on top of Hadoop, an Apache project that implements analogues of Google’s distributed filesystem, MapReduce, and BigTable. MapReduce is, I think, a subset of stream programming (though I’m not sure how “reduce” fits). In any case, I need a benchmark large enough to exercise a group of machines on Amazon’s EC2.
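To see why MapReduce looks like a subset of this model, here is a toy word count written as three chained stages: “map” is a data-parallel actor over input records, the shuffle splits the stream by key, and “reduce” folds each per-key substream. This is a sketch of the idea only, not Hadoop’s actual API.

```python
from collections import defaultdict

def map_stage(records):
    # Data-parallel: each record is processed independently.
    for line in records:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Split the stream by key; this is the barrier that makes
    # "reduce" awkward as a pure streaming actor.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reduce_stage(groups):
    # Fold each per-key substream down to one value.
    for key, values in groups:
        yield (key, sum(values))

lines = ["a b a", "b c"]
counts = dict(reduce_stage(shuffle(map_stage(lines))))
print(counts)  # {'a': 2, 'b': 2, 'c': 1}
```

The map stage fits the streaming model cleanly; the shuffle-then-reduce step is the part that needs all values for a key before emitting output, which may be exactly why “reduce” feels like an uneasy fit.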