Spark is becoming a data processing giant, but it leaves much as an exercise for the user. Developers need to write specialized logic to move between batch and streaming modes, handle late or out-of-order data by hand, and explicitly wire complex flows together.
This talk looks at how we tackled these problems over a multi-petabyte dataset at Cerner. We start with how hand-written solutions to these problems evolved into prescriptive practices, opening up development of such systems to a wider audience. From there we look at how the emergence of Google’s Dataflow on Spark is helping us take the next step: the tradeoffs between correctness, latency, and cost are becoming a simple, easily changeable decision rather than a deep analysis for each new need. Finally, we look at challenges unique to doing processing in large organizations, such as making independent units of processing composable into larger pipelines, and making them usable in both batch and streaming modes.
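The correctness/latency/cost tradeoff mentioned above comes from the Dataflow model's treatment of late data: an "allowed lateness" horizon decides how long a window stays open to absorb out-of-order events. The sketch below is a minimal stdlib Python simulation of that idea, not the Beam/Dataflow API; the window size, lateness value, and watermark rule are illustrative assumptions.

```python
from collections import defaultdict

WINDOW_SIZE = 60        # seconds per fixed event-time window (assumed)
ALLOWED_LATENESS = 120  # how far behind the watermark an event may still land (assumed)

def assign_window(event_time):
    """Map an event timestamp to the start of its fixed window."""
    return (event_time // WINDOW_SIZE) * WINDOW_SIZE

def process(events):
    """Aggregate (event_time, value) pairs into per-window sums.

    Events arrive in the given (possibly out-of-order) order. The watermark
    is tracked naively as the max event time seen so far. An event whose
    window closed more than ALLOWED_LATENESS ago is dropped; a larger
    lateness means more complete results at the cost of keeping windows
    open (and state around) longer.
    """
    sums = defaultdict(int)
    watermark = 0
    dropped = 0
    for event_time, value in events:
        watermark = max(watermark, event_time)
        win = assign_window(event_time)
        if win + WINDOW_SIZE + ALLOWED_LATENESS < watermark:
            dropped += 1        # too late: window already closed for good
            continue
        sums[win] += value      # late-but-allowed data updates its window
    return dict(sums), dropped

# The event at time 70 is late but within the horizon, so it updates
# window 60; the event at time 5 arrives after its window closed and is dropped.
sums, dropped = process([(5, 1), (65, 1), (200, 1), (70, 2), (5, 9)])
print(sums, dropped)  # → {0: 1, 60: 3, 180: 1} 1
```

Dialing `ALLOWED_LATENESS` up or down is exactly the kind of "simple, easily changeable decision" the abstract describes, in contrast to hand-rolling this bookkeeping for each new pipeline.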
Ryan’s talk is now available on the Chariot Solutions site.
Tags: big data, frameworks, spark, streaming
Location: Salon A
April 12th, 2016
2:45 PM - 3:45 PM