Apache Beam (7)

Question:
Your organization wishes to add streaming capabilities to its big data analytics and processing stack. It has a few near real-time use cases it wishes to pursue, and its on-premises big data platform is underpinned by Cloudera, with no plan to migrate to the cloud in the foreseeable future.
1. Apache Beam has been selected as the processing framework, but a decision must now be made as to which distributed backend to use. Select the runner that is best suited. (1)
1. Dataflow
2. Apache Spark Structured Streaming
3. Apache Flink
4. Apache Samza
5. Apache Spark (DStreams/RDD)
2. Provide a brief motivation for your choice. (1)
3. Your DevOps teams have been discussing the roll-out of a Kubernetes cluster to support live scoring use cases, amongst others. You are aware that some high-value use cases have very low latency requirements (event time and processing time are nearly aligned). Select the best runner in these circumstances. (1)
1. Dataflow
2. Apache Spark Structured Streaming
3. Apache Flink
4. Apache Samza
5. Apache Spark
4. One of the use cases aims to make product discount offers to customers at the point of sale within your stores. The point of sale (POS) systems are currently delivering data to operational databases at the close of business every day. What best describes the characteristics of the current system? (Select one.) (1)
1. Batch updates with high latency
2. Streaming infrastructure with low latency
3. Batch updates with low latency
5. It has been determined that these data need to be processed in a more real-time fashion, i.e. when sales events occur in store they must be available immediately for analysis and processing. Bear in mind that your organization is using Beam and that your solutions are developed in Python. Furthermore, you may assume that there is high variability of demand in the stores and that your organization is cost sensitive. Which architecture will work best? (2)
1. Deploy RabbitMQ in Kubernetes (K8S) and integrate POS terminals via RESTful API. Use Beam’s RabbitMQ connector to process the data on an appropriate runner.
2. Deploy Kafka and integrate POS terminals via a RESTful API. Use Beam’s Kafka connector to consume data and process it on an appropriate runner.
3. Utilise Pub/Sub and integrate the terminals via its RESTful API. Use Beam's PubSub connector to consume data.
6. Suppose a point of sale in your store emits a payload of 64KB on average. Furthermore, suppose you have an average of 100,000 sales per day, where the majority of sales occur over lunchtime (60% of sales between the hours of 12h00-13h00). What is the minimum throughput that your streaming platform will need to be able to meet at any point in time (using these average amounts)? (1)
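The peak-hour arithmetic behind this question can be checked directly: 60% of the 100,000 daily sales compressed into one hour gives the peak message rate, and multiplying by the 64 KB average payload gives the minimum sustained throughput. A quick sketch (treating 1 MB = 1024 KB; all figures are the averages stated above):

```python
avg_payload_kb = 64          # average payload per sale, in KB
sales_per_day = 100_000
peak_fraction = 0.60         # share of sales between 12h00 and 13h00
peak_window_s = 3600         # one hour, in seconds

# Peak message rate: 60,000 sales squeezed into one hour.
peak_msgs_per_s = sales_per_day * peak_fraction / peak_window_s   # ~16.7 sales/s

# Minimum throughput the platform must sustain at peak.
peak_kb_per_s = peak_msgs_per_s * avg_payload_kb                  # ~1066.7 KB/s
peak_mb_per_s = peak_kb_per_s / 1024                              # ~1.04 MB/s

print(f"{peak_msgs_per_s:.1f} msgs/s, {peak_kb_per_s:.0f} KB/s (~{peak_mb_per_s:.2f} MB/s)")
```

So the platform must handle roughly 17 messages per second, or about 1 MB/s, at the lunchtime peak, even though the daily average is far lower.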