Thomas Weise - From Batch to Streaming ET(L) with Apache Apex
Further information: https://berlinbuzzwords.de/17/session/batch-streaming-etl-apache-apex

Stream data processing is increasingly required to support business needs for faster actionable insight, with growing volumes of information from more sources. Apache Apex is a true stream processing framework for low-latency, high-throughput, and reliable processing of complex analytics pipelines on clusters. Apex is designed for quick time-to-production and is used in production by large companies for real-time and batch processing at scale.

This session will use an Apex production use case to walk through the incremental transition from a batch pipeline with hours of latency to an end-to-end streaming architecture in which billions of events per day are processed to deliver real-time analytical reports. The example is representative of many similar extract-transform-load (ETL) use cases with other data sets that can use a common library of building blocks. The transform (or analytics) piece of such pipelines varies in complexity and often involves custom components with use-case-specific business logic.

Topics include:
- Pipeline functionality from the event source through queryable state for real-time insights.
- The API for application development and the development process.
- A library of building blocks, including connectors for sources and sinks such as Kafka, JMS, Cassandra, HBase, and JDBC, and how they enable end-to-end exactly-once results.
- Stateful processing with event-time windowing.
- Fault tolerance with exactly-once result semantics, checkpointing, and incremental recovery.
- Scalability and low-latency, high-throughput processing with advanced engine features for auto-scaling, dynamic changes, and compute locality.
- Who is using Apex in production, and the roadmap.

Following the session, attendees will have a high-level understanding of Apex and how it can be applied to use cases at their own organizations.

Speaker: Thomas Weise
https://twitter.com/thweise
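The event-time windowing concept mentioned above can be sketched in a few lines of Java. This is a generic, hypothetical illustration (not Apex's actual operator API), assuming tumbling windows keyed by window start and a single watermark source that declares when no earlier events are expected:

```java
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of tumbling event-time windows with watermark-driven
// emission; class and method names are illustrative, not from Apex.
public class EventTimeWindowing {
    private final long windowSizeMs;
    // window start timestamp -> running count of events in that window
    private final TreeMap<Long, Long> counts = new TreeMap<>();

    public EventTimeWindowing(long windowSizeMs) {
        this.windowSizeMs = windowSizeMs;
    }

    // Assign each event to a window by its event time, not arrival time,
    // so out-of-order events still land in the correct window.
    public void onEvent(long eventTimeMs) {
        long windowStart = (eventTimeMs / windowSizeMs) * windowSizeMs;
        counts.merge(windowStart, 1L, Long::sum);
    }

    // A watermark asserts that no events with a smaller event time remain;
    // emit and discard every window that ends at or before it.
    public Map<Long, Long> onWatermark(long watermarkMs) {
        Map<Long, Long> emitted = new TreeMap<>();
        Iterator<Map.Entry<Long, Long>> it = counts.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Long, Long> e = it.next();
            if (e.getKey() + windowSizeMs <= watermarkMs) {
                emitted.put(e.getKey(), e.getValue());
                it.remove();
            }
        }
        return emitted;
    }
}
```

In a fault-tolerant engine, the `counts` map would be part of checkpointed operator state, so a recovered operator resumes with the same window contents it had before failure; combined with replayable sources, that is what makes end-to-end exactly-once results possible.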