Distributed Data Analytics (WT 2017/18) - tele-TASK

14 episodes

The free lunch is over! Until the turn of the century, computer systems became steadily faster without any particular effort, simply because the hardware they ran on increased its clock speed with every new release. This trend has ended: today's CPUs stall at around 3 GHz. The size of modern computer systems in terms of contained transistors (cores in CPUs/GPUs, CPUs/GPUs in compute nodes, compute nodes in clusters), however, still increases constantly. This has caused a paradigm shift in writing software: instead of optimizing code for a single thread, applications now need to solve their tasks in parallel to achieve noticeable performance gains. Distributed computing, i.e., the distribution of work across (potentially) physically isolated compute nodes, is the most extreme form of parallelization.

Big Data Analytics is a multi-million dollar market that grows constantly! The ability to control and use data is the most valuable capability of today's computer systems. Because data volumes grow so rapidly, and with them the complexity of the questions they should answer, data analytics, i.e., the ability to extract any kind of information from the data, becomes increasingly difficult. Since data analytics systems cannot hope for faster hardware to solve their performance problems, they need to embrace new software trends that let their performance scale with the still increasing number of processing elements.

In this lecture, we take a look at various technologies involved in building distributed, data-intensive systems. We discuss theoretical concepts (data models, encoding, replication, ...) as well as some of their practical implementations (Akka, MapReduce, Spark, ...). Since workload distribution is a concept that is useful for many applications, we focus in particular on data analytics.

Podcasts

Course Summary

Published: Feb. 7, 2018, 3:15 p.m.
Duration: 1 hour 2 minutes 3 seconds

Listed in: Education

Stream Processing

Published: Jan. 31, 2018, 3:15 p.m.
Duration: 1 hour 34 minutes 45 seconds

Listed in: Education

Spark Batch Processing

Published: Jan. 17, 2018, 3:15 p.m.
Duration: 49 minutes 14 seconds

Listed in: Education

Batch Processing

Published: Jan. 10, 2018, 3:15 p.m.
Duration: 1 hour 39 minutes 8 seconds

Listed in: Education

Consistency and Consensus

Published: Dec. 20, 2017, 3:15 p.m.
Duration: 1 hour 35 minutes 38 seconds

Listed in: Education

Distributed Systems

Published: Dec. 13, 2017, 3:15 p.m.
Duration: 1 hour 34 minutes 50 seconds

Listed in: Education

Partitioning & Transactions

Published: Dec. 6, 2017, 3:15 p.m.
Duration: 1 hour 31 minutes 12 seconds

Listed in: Education

Replication

Published: Nov. 29, 2017, 3:15 p.m.
Duration: 1 hour 23 minutes 21 seconds

Listed in: Education

Akka Actor Programming

Published: Nov. 22, 2017, 3:15 p.m.
Duration: 1 hour 23 minutes 26 seconds

Listed in: Education

Formats for Encoding Data & Models of Dataflow

Published: Nov. 15, 2017, 3:15 p.m.
Duration: 1 hour 32 minutes 38 seconds

Listed in: Education

Storage and Retrieval

Published: Nov. 8, 2017, 3:15 p.m.
Duration: 1 hour 6 minutes 8 seconds

Listed in: Education

The Document Data Model & The Graph Data Model

Published: Nov. 1, 2017, 3:15 p.m.
Duration: 1 hour 32 minutes

Listed in: Education

Foundations & Data Models and Query Languages

Published: Oct. 25, 2017, 3:15 p.m.
Duration: 1 hour 31 minutes 23 seconds

Listed in: Education

Introduction & Foundations

Published: Oct. 18, 2017, 3:15 p.m.
Duration: 1 hour 29 minutes 32 seconds

Listed in: Education