Big Data


Exploring Wikipedia with Apache Spark: A Live Coding Demo

Location: Salon D
April 12th, 2016
4:00 PM - 5:00 PM

The real power and value proposition of Apache Spark lies in creating unified applications that combine batch analysis, stream analysis, SQL, machine learning, graph processing, and visualization. In this live coding demo, Sameer will use various Wikipedia datasets to build a dashboard showing what is happening in the world during his talk. The application will connect to Wikipedia's live edits stream and join it with other Wikipedia datasets to derive interesting insights about what's trending on the planet.
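As a flavor of the stream-plus-batch join described above, here is a minimal Spark Streaming sketch in Scala. The socket source, the tab-separated title/language layout, and the wikipedia-categories.tsv lookup file are illustrative stand-ins, not the actual feed or datasets used in the demo.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WikipediaTrends {
  def main(args: Array[String]): Unit = {
    val sc  = new SparkContext(new SparkConf().setAppName("wiki-trends").setMaster("local[2]"))
    val ssc = new StreamingContext(sc, Seconds(10))

    // Static side: a hypothetical article -> category lookup table.
    val categories = sc.textFile("wikipedia-categories.tsv")
      .map(_.split("\t"))
      .filter(_.length == 2)
      .map(f => (f(0), f(1)))

    // Streaming side: live edit events, keyed by article title.
    val edits = ssc.socketTextStream("localhost", 9999)
      .map(_.split("\t"))
      .filter(_.length == 2)
      .map(f => (f(0), f(1)))

    // Join each micro-batch of edits against the static dataset and
    // count edits per category to see what is trending right now.
    edits.transform(_.join(categories))
      .map { case (_, (_, category)) => (category, 1) }
      .reduceByKey(_ + _)
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```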


Emergence of Real-Time Analytics: Real-time Analysis of Customer Financial Activities With Apache Flink

Location: Salon A
April 12th, 2016
1:30 PM - 2:30 PM

People's financial activities are increasingly migrating to digital platforms, and banks, the large institutions that move money, are transforming into software engineering companies. At the core of a modern bank is a large network of systems and platforms that capture, collect, process, and analyze this digital data, and collecting and analyzing customers' activities in real time is critical for financial institutions to succeed. In this talk we present a business use case in which Capital One needs to process customer activities in real time and react to events as they occur. We then present our experience in building a real-time analytics application that…
Read more »
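To give a sense of what such an application can look like, here is a minimal Flink DataStream sketch in Scala. The socket source, the Activity schema, and the large-transfer threshold rule are invented for illustration; they are not Capital One's actual pipeline.

```scala
import org.apache.flink.streaming.api.scala._

// Hypothetical event schema; real activity records would be richer.
case class Activity(customerId: String, kind: String, amountUsd: Double)

object ActivityMonitor {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Illustrative source: comma-separated events from a local socket.
    val activities = env.socketTextStream("localhost", 9999)
      .map(_.split(","))
      .filter(_.length == 3)
      .map(f => Activity(f(0), f(1), f(2).toDouble))

    // React to events as they arrive: flag unusually large transfers.
    activities
      .filter(a => a.kind == "transfer" && a.amountUsd > 10000)
      .map(a => s"ALERT: large transfer by customer ${a.customerId}")
      .print()

    env.execute("customer-activity-monitor")
  }
}
```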

Srinivas Palthepu

Senior Manager, Big Data Engineering, Capital One

NoLambda: A new architecture combining streaming, ad hoc, machine learning, and batch analytics

Location: Salon A
April 11th, 2016
1:30 PM - 2:30 PM

In today's world of exploding big and fast data, developers who want both streaming analytics and ad hoc, OLAP-like analysis have often had to build complex architectures such as Lambda: a fast path for streaming analytics using NoSQL stores such as Cassandra and HBase, plus a separate batch path involving HDFS and Parquet. While this approach works, it involves too many moving parts, too many technologies for ops to run, and too many engineering hours. Helena Edelson and Evan Chan highlight a much simpler approach to combining streaming and ad hoc/batch analysis using what they call the NoLambda stack (Apache Spark/Scala, Mesos, Akka, …
Read more »
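One minimal way to picture the single-stack idea is a lone Spark application that both processes the stream and answers SQL over the same data, with no separate batch pipeline. In this sketch the socket source and one-column schema are placeholders, and a real deployment would append each batch to a store such as FiloDB or Cassandra rather than querying only the current micro-batch.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

// One-column placeholder schema for ingested events.
case class Event(raw: String)

object NoLambdaSketch {
  def main(args: Array[String]): Unit = {
    val sc  = new SparkContext(new SparkConf().setAppName("no-lambda").setMaster("local[2]"))
    val ssc = new StreamingContext(sc, Seconds(5))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Streaming path: ingest raw events.
    val events = ssc.socketTextStream("localhost", 9999)

    events.foreachRDD { rdd =>
      // Each micro-batch becomes a DataFrame that the same application
      // can query with SQL; a real deployment would append batches to a
      // durable store and run ad hoc queries against that store.
      val df = rdd.map(Event(_)).toDF()
      df.registerTempTable("events")
      sqlContext.sql("SELECT COUNT(*) AS events_in_batch FROM events").show()
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```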

Evan Chan

Creator, FiloDB

Interactive Computing with Jupyter: Past, Present, and Future

Location: Salon E
April 12th, 2016
2:45 PM - 3:45 PM

Jupyter (formerly part of IPython) is a popular tool for interactively exploring and sharing computational ideas. The Jupyter project provides a consistent environment for dozens of languages, including Python, Julia, and R. The Jupyter web notebook makes it easy to use interactive controls to manipulate and visualize computations, and it provides a widely used document format that combines code and exposition. In this talk, we will give a brief overview of the Jupyter ecosystem and show examples of how Jupyter fosters interactive exploration and collaboration. We will also look at current and future developments in the platform, including the current…
Read more »

Jason Grout

Jupyter Core Developer, Bloomberg

Untangling Healthcare with Spark and Dataflow

Location: Salon A
April 12th, 2016
2:45 PM - 3:45 PM

Spark is becoming a data-processing giant, but it leaves much as an exercise for the user. Developers need to write specialized logic to move between batch and streaming modes, manually deal with late or out-of-order data, and explicitly wire complex flows together. This talk looks at how we tackled these problems over a multi-petabyte dataset at Cerner. We start with how hand-written solutions to these problems evolved into prescriptive practices, opening development of such systems to a wider audience. From there we look at how the emergence of Google's Dataflow on Spark is helping us take the next…
Read more »
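The heart of the Dataflow model the talk refers to is grouping records by when they occurred rather than when they arrived. Here is a toy illustration of that idea in plain Scala (not the Dataflow API): late and out-of-order records still fall into the window their event timestamp belongs to.

```scala
object EventTimeWindows {
  case class Record(eventTimeMs: Long, value: String)

  // Assign each record to a fixed-size window by its event timestamp,
  // not its arrival order, so late data lands in the correct window.
  def windowByEventTime(records: Seq[Record], windowMs: Long): Map[Long, Seq[Record]] =
    records.groupBy(r => (r.eventTimeMs / windowMs) * windowMs)

  def main(args: Array[String]): Unit = {
    // "a" arrives after "b" yet carries an earlier timestamp; it is
    // still grouped into the earlier [0, 60000) window.
    val byWindow = windowByEventTime(
      Seq(Record(61000, "b"), Record(59000, "a")),
      windowMs = 60000)
    println(byWindow) // Map(60000 -> List(Record(61000,b)), 0 -> List(Record(59000,a)))
  }
}
```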

Ryan Brush

Engineer, Cerner

Demystifying Stream Processing with Apache Kafka

Location: Salon A
April 12th, 2016
11:30 AM - 12:30 PM

The concept of stream processing has been around for a while, and most software systems operate as simple stream processors at their core: they read data in, process it, and maybe emit some data out. So why are there so many stream processing frameworks, all with different terminology, and why does it seem so complex to get up and running? What benefits does each stream processing system provide, and more importantly, what are they missing? This presentation will start by abstracting away the individual frameworks and describing the key features and benefits that stream processing frameworks provide. These core features…
Read more »
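The "read data in, process it, emit data out" core is easy to see in code. Below is a bare-bones consume-transform-produce loop in Scala against Kafka's Java clients; the topic names and the uppercasing step are placeholders for whatever processing an application actually does.

```scala
import java.util.{Collections, Properties}
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object MinimalStreamProcessor {
  def main(args: Array[String]): Unit = {
    val consumerProps = new Properties()
    consumerProps.put("bootstrap.servers", "localhost:9092")
    consumerProps.put("group.id", "uppercaser")
    consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val producerProps = new Properties()
    producerProps.put("bootstrap.servers", "localhost:9092")
    producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val consumer = new KafkaConsumer[String, String](consumerProps)
    val producer = new KafkaProducer[String, String](producerProps)
    consumer.subscribe(Collections.singletonList("input"))

    // Read data in, process it, emit data out.
    while (true) {
      for (record <- consumer.poll(1000).asScala) {
        producer.send(new ProducerRecord("output", record.key, record.value.toUpperCase))
      }
    }
  }
}
```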

Ewen Cheslack-Postava

Engineer, Confluent