Principles Of Database Management The Practical Guide To Storing Managing And Analyzing Big And Small Data 1st Edition Wilfried Lemahieu, Seppe Vanden Broucke, Bart Baesens - Solutions
Which of the following is not a characteristic of a data warehouse?
a. Subject-oriented.
b. Integrated.
c. Time-variant.
d. Volatile.
How is a data warehouse defined according to Bill Inmon? Elaborate on each of the characteristics and illustrate with examples.
In terms of data manipulation, a data warehouse focuses on…
a. Insert/Update/Delete/Select statements.
b. Insert/Select statements.
c. Select/Update statements.
d. Delete statements.
Discuss and contrast each of the following data warehouse schemas:
• Star schema;
• Snowflake schema;
• Fact constellation.
What are surrogate keys? Why would you use them in a data warehouse instead of using the business keys from the operational systems?
Which statement is correct?
a. A star schema has one large central dimension table which is connected to various smaller fact tables.
b. The dimension tables of a star schema contain the criteria for aggregating the measurement data and will typically be used as constraints to answer queries.
c. To
Discuss four approaches to deal with slowly changing dimensions in a data warehouse. Can any of these approaches be used to deal with rapidly changing dimensions?
Which statement is not correct?
a. A snowflake schema normalizes the fact table of a star schema.
b. A fact constellation schema has more than one fact table which can share dimension tables.
c. Surrogate keys essentially buffer the data warehouse from the operational environment by making it
Consider the following OLAP cube:
[Figure: OLAP cube of Sales with axes Region (Europe, Africa, Asia, America), Product (only B, C, and D are legible), and Quarter (Q1, Q2, Q3, Q4)]
Give an example of a…
• Roll-up operation;
• Drill-down operation;
• Slicing operation;
• Dicing operation.
Explain and illustrate the following concepts:
• Independent data mart;
• Virtual data warehouse;
• Operational data store;
• Data lake.
What is windowing? Illustrate a query with windowing using the above table.
Given the following table:
[Table not recovered; only the column header PRODUCT survives]
Consider the following queries:
[Queries not recovered]
What is the output of the above queries?
Can you reformulate each query using other SQL OLAP constructs?
Which statement is not correct?
a. Junk dimensions can be defined to efficiently accommodate low-cardinality attribute types such as flags or indicators.
b. An outrigger table can be defined to store a set of attribute types of a dimension table which are uncorrelated, high in cardinality, and
Which statement about ETL is not correct?
a. Some estimates state that the ETL step can consume up to 80% of all efforts needed to set up a data warehouse.
b. To decrease the burden on both the operational systems and the data warehouse itself, it is recommended to start the ETL process by dumping
Which statement is not correct?
a. A data mart is a scaled-down version of a data warehouse aimed at meeting the information needs of a homogeneous small group of end users such as a department or business unit (e.g., marketing, finance, logistics, HR, etc.).
b. Dependent data marts pull their data
Which statement is correct?
a. A key distinguishing property of a data lake is that it stores raw data in its native format, which could be structured, unstructured, or semistructured.
b. A data lake is targeted toward decision-makers at middle- and top-management level, whereas a data warehouse
Which statement is not correct?
a. Query and reporting tools are an essential component of a comprehensive business intelligence solution.
b. A pivot or cross-table is a popular data summarization tool. It essentially cross-tabulates a set of dimensions.
c. A key disadvantage of OLAP is that it
Which statement is not correct?
a. Multidimensional OLAP (MOLAP) stores the multidimensional data using a multidimensional DBMS (MDBMS) whereby the data are stored in a multidimensional array-based data structure optimized for efficient storage and quick access.
b. Relational OLAP (ROLAP) stores
Which statement is correct?
a. Roll-up (or drill-up) refers to aggregating the current set of fact values within or across one or more dimensions.
b. Roll-down (or drill-down) de-aggregates the data by navigating from a lower level of detail to a higher level of detail.
c. Slicing represents the
Ideally, data integration should include…
a. Only data.
b. Only processes.
c. Both processes and data.
Give some examples of operational business intelligence.
Which statement is not correct?
a. Analytics techniques are more and more used at the operational level as well by front-line employees.
b. Analytics for tactical/strategic decision-making increasingly uses real-time operational data combined with the aggregated and historical data found in more
Conduct an illustrated SWOT analysis of data consolidation versus data integration versus data propagation.
Which statement is not correct?
a. The essence of data consolidation as a data integration pattern is to capture the data from multiple, heterogeneous source systems and integrate it into a single persistent store (e.g., a data warehouse or data mart).
b. An important disadvantage of the
What is data virtualization and what can it be used for? How does it differ from data consolidation, data federation, and data propagation?
The federation pattern typically follows…
a. A pull approach.
b. A push approach.
What is meant by “Data as a Service”? How does this relate to cloud computing? What kind of data-related services can be hosted in the cloud? Illustrate with examples.
Enterprise information integration (EII) is an example of…
a. Data consolidation.
b. Data integration.
c. Data propagation.
d. Data replication.
Discuss two types of dependencies that should be appropriately managed to guarantee the successful overall process execution. What patterns can be used to manage these dependencies?
Enterprise application integration (EAI) and enterprise data replication (EDR) are examples of…
a. Data consolidation.
b. Data federation.
c. Data propagation.
d. Data virtualization.
Discuss and contrast the following three service types: workflow services, activity services, and data services. Illustrate with an example.
Which statement is not correct?
a. Data virtualization isolates applications and users from the actual (combinations of) data integration patterns used.
b. Data virtualization extensively uses data consolidation techniques such as ETL.
c. Contrary to a federated database as offered by basic EII,
Discuss how different data services can be realized according to different data integration patterns.
Which statement is not correct?
a. Process integration is to integrate and harmonize the various business processes in an organization as much as possible.
b. The control flow perspective of a business process specifies the correct sequencing of tasks (e.g., a loan offer can only be made when the
How can full-text documents be indexed? Illustrate with an example.
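One common way to index full-text documents is an inverted index, which maps each term to the set of documents containing it. The sketch below is only an illustration under stated assumptions: the three sample documents, the function names `tokenize`, `build_inverted_index`, and `search`, and the simple lowercase alphanumeric tokenization are all hypothetical and are not taken from the textbook.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric terms (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the set of document IDs in which it occurs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents containing every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical mini-collection, for illustration only.
docs = {
    1: "Data warehouses store integrated historical data",
    2: "A data lake stores raw data in its native format",
    3: "Hadoop stores files in HDFS",
}
index = build_inverted_index(docs)
print(search(index, "raw data"))
```

A lookup then costs one dictionary access per query term plus a set intersection, instead of a scan over every document. Real engines typically add stemming, stop-word removal, and term weights on top of this basic structure.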
Process execution languages such as WS-BPEL aim at managing…
a. Only the control flow.
b. Only the data flow.
c. Both the control and data flow.
How do web search engines work? Illustrate in the case of Google.
The choreography pattern to manage sequence and data dependencies is a…
a. Centralized approach.
b. Decentralized approach.
Discuss the impact of data lineage on data quality. Illustrate with examples.
Which statement is correct?
a. The prevalent approach for indexing full-text documents is an inverted index.
b. SQL is well suited to query structured collections of records as well as unstructured data such as text.
c. It makes no sense to look at HTML markup when calculating the weight of a term
What is data governance and why is it important?
Which statement is not correct?
a. Master data management (MDM) comprises a series of processes, policies, standards, and tools to help organizations define and provide multiple points of reference for all data that are “mastered”.
b. The focus of MDM is on unifying company-wide reference
Discuss and contrast the following data governance frameworks: Total Data Quality Management (TDQM); Capability Maturity Model Integration (CMMI); Data Management Body of Knowledge (DMBOK); Control Objectives for Information and Related Technology (COBIT); and Information Technology Infrastructure
What do the 5 Vs of Big Data stand for?
a. Volume, variety, velocity, veracity, value.
b. Volume, visualization, velocity, variety, value.
c. Volume, variety, velocity, variability, value.
d. Volume, versatile, velocity, visualization, value.
Discuss some application areas where the usage of streaming analytics (such as provided by Spark Streaming) might be valuable. Consider Twitter, but also other contexts.
Which of the following statements is not correct?
a. Velocity in Big Data refers to data “in movement”.
b. Volume in Big Data refers to data “at rest”.
c. Veracity in Big Data refers to data “in change”.
d. Variety in Big Data refers to data “in many forms”.
Think about some examples of Big Data in industry. Try to focus on Vs other than the volume aspect of Big Data. Why do you think these examples qualify as Big Data?
Which components does the base Hadoop stack include?
a. NDFS, MapReduce, and YARN.
b. HDFS, MapReduce, and YARN.
c. HDFS, Map, and Reduce.
d. HDFS, Spark, and YARN.
Both Hortonworks (Hortonworks Hadoop Sandbox) and Cloudera (Cloudera QuickStart VM) offer virtual instances (for Docker, VirtualBox, and VMWare) providing a full Hadoop stack you can easily run contained in a virtual machine on a beefy computer. Try Googling for these and running these environments
Which of the following statements is correct?
a. DataNodes in HDFS store a registry of metadata.
b. The HDFS NameNode sends regular heartbeat messages to its DataNodes.
c. HDFS is composed of a NameNode, DataNodes, and an optional SecondaryNameNode.
d. Both the SecondaryNameNode and primary
Some analysts have argued that Big Data is fundamentally about data “plumbing”, and not about insights or deriving interesting patterns. It is argued that value (the fifth V) can just as easily be found in “small”, normal, or “weird” datasets (i.e., datasets that would not have been
Which of the following statements is not correct?
a. A mapper in Hadoop maps each element in a collection to one or more output elements.
b. A reducer in Hadoop reduces a collection of elements to one or more output elements.
c. Reducer workers in Hadoop will start once all mapper workers have
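To make the map, shuffle, and reduce phases concrete, here is a minimal single-process word count in plain Python. It only mimics the programming model and does not use the Hadoop API; the names `mapper`, `shuffle`, `reducer`, and `word_count`, and the sample input, are assumptions made for illustration.

```python
from collections import defaultdict


def mapper(line):
    """Emit one (word, 1) pair per word in the input line."""
    for word in line.lower().split():
        yield (word, 1)


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Collapse the list of values for one key into a single count."""
    return (key, sum(values))


def word_count(lines):
    """Run the three phases in sequence, in one process."""
    mapped = [pair for line in lines for pair in mapper(line)]
    grouped = shuffle(mapped)
    return dict(reducer(key, values) for key, values in grouped.items())


counts = word_count(["big data", "big small data"])
print(counts)
```

In this sketch the reduce step runs only after every line has been mapped and grouped, because the grouping needs all pairs for a key before it can hand them to `reducer`.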
If Spark’s GraphX library provides a number of interesting algorithms for graph-based analysis, do you think that graph-based NoSQL databases are still necessary? Why? If you’re interested, try searching the web on how to run Neo4j together with Spark – which roles do both serve in such an
Which of the following statements is not correct?
a. Apart from handling MapReduce programs, YARN can also be used to manage other types of applications.
b. YARN’s JobHistoryServer keeps a log of all finished jobs.
c. NodeManagers in YARN are responsible for setting up containers on the node
Which of the following commands is not a part of HBase?
a. Place.
b. Put.
c. Get.
d. Describe.
Which of the following statements is correct?
a. HBase can be considered as a NoSQL database.
b. HBase offers an SQL engine to query its data.
c. MapReduce programs cannot be used with HBase. Data are accessed using simple put and get commands instead.
d. HBase works well on large clusters as well
Pig is…
a. A programming language that can be used to query HDFS data.
b. A project offering a programming language to provide more user-friendliness compared to MapReduce programs.
c. A database that runs on Hadoop.
d. An SQL engine that runs on top of Hadoop.
Which of the following statements is not correct?
a. Hive offers an SQL engine to query Hadoop data.
b. Hive’s query language is not as feature-complete as the full SQL standard.
c. Hive offers a JDBC interface.
d. Hive queries run much faster than hand-written MapReduce programs.
Which of the following schema-handling methods does Hive apply?
a. Schema on write.
b. Schema on load.
c. Schema on read.
d. Schema on query.
Which of the following statements is not correct?
a. RDDs allow for two forms of operations: transformations and actions.
b. RDDs represent an abstract, immutable data structure.
c. RDDs are structured and represent a collection of columnar objects.
d. RDDs offer failure protection by tracking the
Which of the following is not one of the reasons why Spark programs are generally faster than MapReduce operations?
a. Because Spark tries to keep its RDDs in memory as long as possible.
b. Because Spark uses a directed acyclic graph instead of MapReduce.
c. Because RDD transformations are
Which of the following statements is not correct?
a. Spark SQL exposes DataFrame and Dataset APIs which underlyingly use RDDs together with a performant SQL query engine.
b. Spark SQL can be used from within Java, Python, Scala, and R.
c. Spark SQL can be used through ODBC and JDBC
Which of the following statements is correct?
a. One of the disadvantages of Spark is that it does not support streaming data.
b. One of the disadvantages of Spark is that its streaming and machine learning APIs are still mostly RDD-based.
c. One of the disadvantages of Spark is that it has no way
OLAP (on-line analytical processing) can help in which of the following steps of the analytics process?
a. Data collection.
b. Data visualization.
c. Data transformation.
d. Data denormalization.
Discuss the key activities when pre-processing data for credit scoring. Remember, credit scoring aims at distinguishing good payers from bad payers using application characteristics such as age, income, and employment status. Why is data pre-processing considered important?
The GIGO principle mainly relates to which aspect of the analytics process?
a. Data selection.
b. Data transformation.
c. Data cleaning.
d. All of the above.
Consider the following dataset of predicted scores and actual target values (you can assume higher scores should be assigned to the goods).
• Calculate the classification accuracy, sensitivity, and specificity for a classification cutoff of 205.
• Draw the ROC curve. How would you estimate the
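The dataset for this exercise is not reproduced here, so the sketch below uses six made-up scores purely to show how the three measures are computed at a cutoff of 205. It assumes that "good" is the positive class and that a score at or above the cutoff is predicted as good; the data and the function names `confusion_counts` and `metrics` are hypothetical.

```python
def confusion_counts(scores, labels, cutoff):
    """Count TP, FP, TN, FN, treating 'good' as the positive class (assumption)."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted_good = score >= cutoff
        if predicted_good and label == "good":
            tp += 1
        elif predicted_good and label == "bad":
            fp += 1
        elif not predicted_good and label == "bad":
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def metrics(scores, labels, cutoff):
    """Return accuracy, sensitivity, and specificity at the given cutoff."""
    tp, fp, tn, fn = confusion_counts(scores, labels, cutoff)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),  # share of goods predicted as good
        "specificity": tn / (tn + fp),  # share of bads predicted as bad
    }


# Made-up data, NOT the dataset from the exercise.
scores = [100, 150, 210, 200, 250, 300]
labels = ["bad", "bad", "bad", "good", "good", "good"]
result = metrics(scores, labels, cutoff=205)
print(result)
```

Repeating the calculation for every candidate cutoff and plotting sensitivity against 1 − specificity gives the points of the ROC curve.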
What are the key differences between logistic regression and decision trees? Give examples of when to prefer one above the other.
Which of the following statements is correct?
a. Missing values should always be replaced or removed.
b. Outliers should always be replaced or removed.
c. Missing values and outliers can potentially provide useful information and should be analyzed before they are removed/replaced.
d. Missing
Which of the following strategies can be used to deal with missing values?
a. Keep.
b. Delete.
c. Replace/impute.
d. All of the above.
Discuss how association and sequence rules can be used to build recommender systems such as the ones adopted by Amazon, eBay, and Netflix. How would you evaluate the performance of a recommender system?
Explain k-means clustering using a small (artificial) dataset. What is the impact of k? What pre-processing steps are needed?
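As one possible small artificial dataset, the sketch below runs k-means with k = 2 on six two-dimensional points. The points, the fixed starting centroids, and the function names are assumptions chosen for illustration; production code would normally use a library implementation with randomized initialization.

```python
import math


def assign(points, centroids):
    """Label each point with the index of its nearest centroid (Euclidean)."""
    labels = []
    for point in points:
        distances = [math.dist(point, centroid) for centroid in centroids]
        labels.append(distances.index(min(distances)))
    return labels


def update(points, labels, centroids):
    """Move each centroid to the mean of its members; keep it if it has none."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:
            new_centroids.append(old)
            continue
        new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return new_centroids


def k_means(points, centroids, max_iter=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        labels = assign(points, centroids)
        new_centroids = update(points, labels, centroids)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


# Artificial data: two visually separated groups of three points each.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
labels, centroids = k_means(points, centroids=[(1.0, 1.0), (8.0, 8.0)])
print(labels, centroids)
```

The number of starting centroids fixes k, so passing three starting points instead of two would force a third cluster. Because the assignment step uses Euclidean distance, variables measured on very different scales are usually standardized first.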
Outlying observations which represent erroneous data are treated using…
a. Missing value procedures.
b. Truncation or capping.
Examine the following decision tree:
[Figure: decision tree with split nodes Income > $50,000, Job > 3 Years, and High Debt, and leaf nodes Good Risk and Bad Risk; the branch layout was lost in extraction]
According to the decision tree, an applicant with Income > $50,000 and High Debt = Yes is classified as:
a. Good risk.
b. Bad risk.
Discuss an example of social network analytics. How is it different from classical predictive or descriptive analytics?
The Internet of Things (IoT) refers to the network of interconnected things such as electronic devices, sensors, software, and IT infrastructure that create and add value by exchanging data with various stakeholders such as manufacturers, service providers, customers, other devices, etc., hereby
Decision trees can be used in the following applications:
a. Credit risk scoring.
b. Credit risk scoring and churn prediction.
c. Credit risk scoring, churn prediction, and customer profile segmentation.
d. Credit risk scoring, churn prediction, customer profile segmentation, and market basket
Many companies nowadays are investing in analytics. Also, for universities, there are plenty of opportunities to use analytics for streamlining and/or optimizing processes. Examples of applications where analytics may have a role to play are:
• Analyzing student fail rates;
• Timetabling of
Consider a dataset with a multiclass target variable as follows: 25% bad payers, 25% poor payers, 25% medium payers, and 25% good payers. In this case, the entropy will be…
a. Minimal.
b. Maximal.
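The entropy of a class distribution can be checked numerically. The sketch below uses the standard Shannon entropy in bits, H = −Σ p·log2(p); the `entropy` helper and the two comparison distributions (skewed and pure) are hypothetical additions for contrast.

```python
import math


def entropy(proportions):
    """Shannon entropy in bits; zero-probability classes contribute nothing."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)


balanced = entropy([0.25, 0.25, 0.25, 0.25])  # the distribution from the question
skewed = entropy([0.7, 0.1, 0.1, 0.1])        # hypothetical comparison
pure = entropy([1.0, 0.0, 0.0, 0.0])          # hypothetical comparison
print(balanced, math.log2(4))
```

For four classes the upper bound is log2(4) = 2 bits, which the uniform 25/25/25/25 split attains, while a node containing a single class has zero entropy.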
Which of the following measures cannot be used to make the splitting decision in a regression tree?
a. Mean squared error (MSE).
b. ANOVA/F-test.
c. Entropy.
Bootstrapping refers to…
a. Drawing samples with replacement.
b. Drawing samples without replacement.
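A bootstrap sample has the same size as the original dataset and is drawn with replacement, so some observations can appear several times while others are left out. A minimal sketch using the standard library follows; the `bootstrap_sample` helper, the seed, and the toy data are assumptions.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) observations from data, with replacement."""
    return rng.choices(data, k=len(data))


rng = random.Random(42)  # fixed seed so the sketch is repeatable
data = list(range(10))   # toy dataset
sample = bootstrap_sample(data, rng)
left_out = sorted(set(data) - set(sample))
print(sample, left_out)
```

The observations in `left_out` are the ones this particular draw never selected; procedures such as bagging can use them as a hold-out set.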
Clustering, association rules, and sequence rules are examples of…
a. Predictive analytics.
b. Descriptive analytics.
Given the following five transactions:
T1 {K, A, D, B}
T2 {D, A, C, E, B}
T3 {C, A, B, D}
T4 {B, A, E}
T5 {B, E, D},
consider the association rule R: A ➔ BD.
Which statement is correct?
a. The support of R is 100% and the confidence is 75%.
b. The support of R is 60% and the confidence is 100%.
c. The
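Support and confidence for the rule A ➔ BD can be computed directly from the five transactions listed in the question. The sketch uses the usual definitions: support is the share of transactions containing every item of the rule, and confidence is that count divided by the number of transactions containing the antecedent. The helper names are hypothetical.

```python
transactions = {
    "T1": {"K", "A", "D", "B"},
    "T2": {"D", "A", "C", "E", "B"},
    "T3": {"C", "A", "B", "D"},
    "T4": {"B", "A", "E"},
    "T5": {"B", "E", "D"},
}


def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for items in transactions.values() if itemset <= items)


def support(antecedent, consequent, transactions):
    """Share of all transactions containing antecedent and consequent together."""
    return count(antecedent | consequent, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Share of antecedent transactions that also contain the consequent."""
    both = count(antecedent | consequent, transactions)
    return both / count(antecedent, transactions)


rule_support = support({"A"}, {"B", "D"}, transactions)
rule_confidence = confidence({"A"}, {"B", "D"}, transactions)
print(rule_support, rule_confidence)
```

Items A, B, and D occur together in T1, T2, and T3, which is 3 of 5 transactions, and A occurs in T1 to T4, so 3 of the 4 transactions containing A also contain B and D.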
The aim of clustering is to come up with clusters such that the…
a. Homogeneity within a cluster is minimized and the heterogeneity between clusters is maximized.
b. Homogeneity within a cluster is maximized and the heterogeneity between clusters is minimized.
c. Homogeneity within a cluster is
Which statement about the adjacency matrix representing a social network is not correct?
a. It is a symmetric matrix.
b. It is sparse since it contains a lot of non-zero elements.
c. It can include weights.
d. It has the same number of rows and columns.
Which statement is correct?
a. The geodesic represents the longest path between two nodes.
b. The betweenness counts the number of the times that a node or edge occurs in the geodesics of the network.
c. The graph theoretic center is the node with the highest minimum distance to all other
Featurization refers to…
a. Selecting the most predictive features.
b. Adding more local features to the dataset.
c. Making features (= inputs) out of the network characteristics.
d. Adding more nodes to the network.
Which of the following activities are part of the post-processing step?
a. Model interpretation and validation.
b. Sensitivity analysis.
c. Model representation.
d. All of the above.
Is the following statement true or false? “All given success factors of an analytical model, i.e., relevance, performance, interpretability, efficiency, economical cost, and regulatory compliance, are always equally important.”
a. True.
b. False.
Which role does a database designer have according to the RACI matrix?
a. Responsible.
b. Accountable.
c. Support.
d. Consulted.
e. Informed.
Which of the following costs should be included in a total cost of ownership (TCO) analysis?
a. Acquisition costs.
b. Ownership and operation costs.
c. Post-ownership costs.
d. All of the above.
Which of the following statements is not correct?
a. ROI analysis offers a common firm-wide language to compare multiple investment opportunities and decide which one(s) to go for.
b. For companies like Facebook, Amazon, Netflix, and Google, a positive ROI is obvious since they essentially thrive
Which of the following is not a risk when outsourcing analytics?
a. The fact that all analytical activities need to be outsourced.
b. The exchange of confidential information.
c. Continuity of the partnership.
d. Dilution of competitive advantage due to, e.g., mergers and acquisitions.
Which of the following is not an advantage of open-source software for analytics?
a. It is available for free.
b. A worldwide network of developers can work on it.
c. It has been thoroughly engineered and extensively tested, validated, and completely documented.
d. It can be used in combination
Which of the following statements is correct?
a. When using on-premises solutions, maintenance or upgrade projects may even go by unnoticed.
b. An important advantage of cloud-based solutions concerns the scalability and economies of scale offered. More capacity (e.g., servers) can be added on the
Which of the following are interesting data sources to consider to boost the performance of analytical models?
a. Network data.
b. External data.
c. Unstructured data such as text data and multimedia data.
d. All of the above.
Which of the following statements is correct?
a. Quality of data is key to the success of any analytical exercise since it has a direct and measurable impact on the quality of the analytical model and hence its economic value.
b. Data pre-processing activities such as handling missing values,
To guarantee maximum independence and organizational impact of analytics, it is important that…
a. The chief data officer (CDO) or chief analytics officer (CAO) reports to the CIO or CFO.
b. The CIO takes care of all analytical responsibilities.
c. A chief data officer or chief analytics officer
What is the correct ranking of the following analytics applications in terms of maturity?
a. Marketing analytics (most mature), risk analytics (medium mature), HR analytics (least mature).
b. Risk analytics (most mature), marketing analytics (medium mature), HR analytics (least mature).
c. Risk