
[MCQ]-Data warehouse and Data mining

Module 1

1. Data scrubbing is which of the following?
A. A process to reject data from the data warehouse and to create the necessary indexes
B. A process to load the data in the data warehouse and to create the necessary indexes
C. A process to upgrade the quality of data after it is moved into a data warehouse
D. A process to upgrade the quality of data before it is moved into a data warehouse

Answer: Option D

2. The active data warehouse architecture includes which of the following?
A. At least one data mart
B. Data that can be extracted from numerous internal and external sources
C. Near real-time updates
D. All of the above

Answer: Option D

3. A goal of data mining includes which of the following?
A.To explain some observed event or condition
B.To confirm that data exists
C.To analyze data for expected relationships
D.To create a new data warehouse

Answer: Option A

4. An operational system is which of the following?
A.A system that is used to run the business in real-time and is based on historical data.
B.A system that is used to run the business in real-time and is based on current data.
C.A system that is used to support decision-making and is based on current data.
D.A system that is used to support decision-making and is based on historical data.

Answer: Option B

5. A data warehouse is which of the following?
A.Can be updated by end-users.
B.Contains numerous naming conventions and formats.
C.Organized around important subject areas.
D.Contains only current data.

Answer: Option C

6. A snowflake schema is which of the following types of tables?
A.Fact
B.Dimension
C.Helper
D.All of the above

Answer: Option D

7. The generic two-level data warehouse architecture includes which of the following?
A.At least one data mart
B.Data that can be extracted from numerous internal and external sources
C.Near real-time updates
D.All of the above

Answer: Option B

8. Fact tables are which of the following?
A.Completely denormalized
B.Partially denormalized
C.Completely normalized
D.Partially normalized

Answer: Option C

9. Data transformation includes which of the following?
A.A process to change data from a detailed level to a summary level
B.A process to change data from a summary level to a detailed level
C.Joining data from one source into various sources of data
D.Separating data from one source into various sources of data

Answer: Option A

10. Reconciled data is which of the following?
A.Data stored in the various operational systems throughout the organization.
B.Current data intended to be the single source for all decision support systems.
C.Data stored in one operational system in the organization.
D.Data that has been selected and formatted for end-user support applications.

Answer: Option B

11. The load and index is which of the following?
A.A process to reject data from the data warehouse and to create the necessary indexes
B.A process to load the data in the data warehouse and to create the necessary indexes
C.A process to upgrade the quality of data after it is moved into a data warehouse
D.A process to upgrade the quality of data before it is moved into a data warehouse

Answer: Option B

12. The extract process is which of the following?
A.Capturing all of the data contained in various operational systems
B.Capturing a subset of the data contained in various operational systems
C.Capturing all of the data contained in various decision support systems
D.Capturing a subset of the data contained in various decision support systems

Answer: Option B

13. A star schema has what type of relationship between a dimension and fact table?
A.Many-to-many
B.One-to-one
C.One-to-many
D.All of the above

Answer: Option C

14. Transient data is which of the following?
A.Data in which changes to existing records cause the previous version of the records to be eliminated
B.Data in which changes to existing records do not cause the previous version of the records to be eliminated
C.Data that are never altered or deleted once they have been added
D.Data that are never deleted once they have been added

Answer: Option A

15. A multifield transformation does which of the following?
A.Converts data from one field into multiple fields
B.Converts data from multiple fields into one field
C.Converts data from multiple fields into multiple fields
D.All of the above

Answer: Option D
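
The three multifield transformation types in question 15 can be sketched in Python; the field names below are hypothetical:

```python
# Sketch of multifield transformations during ETL (hypothetical field names).

def one_to_many(record):
    # One field into multiple fields: split "full_name".
    first, last = record["full_name"].split(" ", 1)
    return {**record, "first_name": first, "last_name": last}

def many_to_one(record):
    # Multiple fields into one field: combine address parts.
    record["address"] = f'{record["street"]}, {record["city"]}'
    return record

row = {"full_name": "Ada Lovelace", "street": "12 High St", "city": "London"}
row = many_to_one(one_to_many(row))
print(row["first_name"], "|", row["address"])
```

A many-to-many transformation would simply combine both steps, reading several source fields and emitting several destination fields.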

16. A data mart is designed to optimize the performance for well-defined and predictable uses.
A. True
B. False

Answer: Option A

17. Successful data warehousing requires that a formal program in total quality management (TQM) be implemented.
A. True
B. False

Answer: Option A

18. Data in operational systems are typically fragmented and inconsistent.
A. True
B. False

Answer: Option A

19. Most operational systems are based on the use of transient data.
A. True
B. False

Answer: Option A

20. Independent data marts are often created because an organization focuses on a series of short-term business objectives.
A. True
B. False

Answer: Option A

21. Joining is the process of partitioning data according to predefined criteria.
A. True
B. False

Answer: Option B

22. The role of the ETL process is to identify erroneous data and to fix them.
A. True
B. False

Answer: Option B

23. Data in the data warehouse are loaded and refreshed from operational systems.
A. True
B. False

Answer: Option A

24. Star schema is suited to online transaction processing and therefore is generally used in operational systems, operational data stores, or an EDW.
A. True
B. False

Answer: Option B

25. Periodic data are data that are physically altered once added to the store.
A. True
B. False

Answer: Option B

26. Both status data and event data can be stored in a database.
A. True
B. False

Answer: Option A

27. Static extract is used for ongoing warehouse maintenance.
A. True
B. False

Answer: Option B

28. Data scrubbing can help upgrade data quality; it is not a long-term solution to the data quality problem.
A. True
B. False

Answer: Option A

29. Every key used to join the fact table with a dimensional table should be a surrogate key.
A. True
B. False

Answer: Option A

30. Derived data are detailed, current data intended to be the single, authoritative source for all decision support applications.
A. True
B. False

Answer: Option B

Module 2

1. All data in flat file is in this format.
A. Sort
B. ETL
C. Format
D. String

Ans: D

2. It is used to push data into a relational database table. This control will be the destination for most fact table data flows.
A. Web Scraping
B. Data inspection
C. OLE DB Source
D. OLE DB Destination

Ans: D

3. Logical Data Maps
A. These are used to identify which fields from which sources are going to which destinations. They allow the ETL developer to identify whether a data type change or aggregation is needed before coding of an ETL process begins.
B. These can be used to flag an entire file-set that is ready for processing by the ETL process. They contain no meaningful data, but the fact that they exist is the key to the process.
C. Data is pulled from multiple sources to be merged into one or more destinations.
D. It is used to massage data in transit between the source and destination.

Ans: A

4. Data access methods.
A. Pull Method
B. Push and Pull
C. Load in Parallel
D. Union all

Ans: B

5. OLTP
A. Process to move data from a source to destination.
B. Transactional database that is typically attached to an application. This source provides the benefit of known data types and standardized access methods. This system enforces data integrity.
C. All data in flat file is in this format.
D. This control can be used to add columns to the stream or make modifications to data within the stream. Should be used for simple modifications.

Ans: B

6. COBOL
A. Process to move data from a source to destination.
B. The easiest to consume from the ETL standpoint.
C. Two methods to ensure data integrity.
D. Many routines of the Mainframe system are written in this.

Ans: D

7. What ETL Stands for
A. Data inspection
B. Transformation
C. Extract, Transform, Load
D. Data Flow

Ans: C

8. The source system initiates the data transfer for the ETL process. This method is uncommon in practice, as each system would have to move the data to the ETL process individually.
A. Custom
B. Automation
C. Pull Method
D. Push Method

Ans: D

9. Sentinel Files
A. These are used to identify which fields from which sources are going to which destinations. They allow the ETL developer to identify whether a data type change or aggregation is needed before coding of an ETL process begins.
B. These can be used to flag an entire file-set that is ready for processing by the ETL process. They contain no meaningful data, but the fact that they exist is the key to the process.
C. ETL can be used to automate the movement of data between two locations. This standardizes the process so that the load is done the same way every run.
D. This is used to create multiple streams within a data flow from a single stream. All records in the stream are sent down all paths. Typically uses a merge-join to recombine the streams later in the data flow.

Ans: B

10. Checkpoints
A. Similar to “break up processes”, checkpoints provide markers for what data has been processed in case an error occurs during the ETL process.
B. Similar to XML’s structured text file.
C. Many routines of the Mainframe system are written in this.
D. It is used to import text files for ETL processing.

Ans: A

11. Mainframe systems use this. This requires a conversion to the more common ASCII format.
A. ETL
B. XML
C. Sort
D. EBCDIC

Ans: D

12. Ultimate flexibility, unit testing is available, usually poor documentation.
A. ETL
B. Custom
C. OLTP
D. Sort

Ans: B

13. Conditional Split
A. Many routines of the Mainframe system are written in this.
B. Data is pulled from multiple sources to be merged into one or more destinations.
C. It allows multiple streams to be created from a single stream. Only rows that match the criteria for a given path are sent down that path.
D. This is used to create multiple streams within a data flow from a single stream. All records in the stream are sent down all paths. Typically uses a merge-join to recombine the streams later in the data flow.

Ans: C

14. Flat files
A. The easiest to consume from the ETL standpoint.
B. Three components of data flow.
C. Three common usages of ETL.
D. Two methods to ensure data integrity.

Ans: A

15. This is used to create multiple streams within a data flow from a single stream. All records in the stream are sent down all paths. Typically uses a merge-join to recombine the streams later in the data flow.
A. OLTP
B. Mainframe
C. EBCDIC
D. Multicast

Ans: D

16. There is little to no benefit to the ETL developer when accessing these types of systems, and many detriments. The ability to access these systems is very limited, and typically FTP of text files is used to facilitate access.
A. Mainframe
B. Union all
C. File Name
D. Multicast

Ans: A

17. Shows the path to the file to be imported.
A. File Name
B. Mainframe
C. Format
D. Union all

Ans: A

18. Wheel is already invented, documented, good support.
A. Format
B. COBOL
C. Tool Suite
D. Flat files

Ans: C

19. Similar to XML’s structured text file.
A. Data Scrubbing
B. EBCDIC
C. String
D. Web Scraping

Ans: D

20. Flat file control
A. Three components of data flow.
B. It is used to import text files for ETL processing.
C. The easiest to consume from the ETL standpoint.
D. Shows the path to the file to be imported.

Ans: B

21. Two methods to ensure data integrity.
A. Sources, Transformation, Destination
B. Data inspection
C. Row Count Inspection, Data Inspection
D. Row Count Inspection

Ans: C

22. Transformation
A. Data is pulled from multiple sources to be merged into one or more destinations.
B. It is used to import text files for ETL processing.
C. Process to move data from a source to destination.
D. It is used to massage data in transit between the source and destination.

Ans: D

23. Three common usages of ETL.
A. Data Scrubbing
B. Sources, Transformation, Destination
C. Merging Data
D. Merging Data, Data Scrubbing, Automation

Ans: D

24. Load in Parallel
A. A value of Delimited should be selected for delimited files.
B. Data is pulled from multiple sources to be merged into one or more destinations.
C. This will reduce the run time of ETL process and reduce the window for hardware failure to affect the process.
D. This should be checked if column names have been included in the first row of the file.

Ans: C

25. This can be computationally expensive, excluding SSDs.
A. Hard Drive I/O
B. Mainframe
C. Tool Suite
D. Data Scrubbing

Ans: A

26. A value of Delimited should be selected for delimited files.
A. Sort
B. Format
C. String
D. OLTP

Ans: B

27. This should be checked if column names have been included in the first row of the file.
A. Row Count Inspection, Data Inspection
B. Format of the Date
C. Column names in the first data row checkbox
D. Do most work in transformation phase

Ans: C

28. OLAP stands for
a) Online analytical processing
b) Online analysis processing
c) Online transaction processing
d) Online aggregate processing

Answer: a

29. Data that can be modeled as dimension attributes and measure attributes are called _______ data.
a) Multidimensional
b) Single Dimensional
c) Measured
d) Dimensional

Answer: a

30. The generalization of the cross-tab, which is represented visually, is ____________, also called a data cube.
a) Two dimensional cube
b) Multidimensional cube
c) N-dimensional cube
d) Cuboid

Answer: a

31. The process of viewing the cross-tab (Single dimensional) with a fixed value of one attribute is
a) Slicing
b) Dicing
c) Pivoting
d) Both Slicing and Dicing

Answer: a

32. The operation of moving from finer-granularity data to a coarser granularity (by means of aggregation) is called a ________
a) Rollup
b) Drill down
c) Dicing
d) Pivoting

Answer: a

33. In SQL the cross-tabs are created using
a) Slice
b) Dice
c) Pivot
d) All of the mentioned

Answer: a

34. { (item name, color, clothes size), (item name, color), (item name, clothes size), (color, clothes size), (item name), (color), (clothes size), () }
This set of grouping sets can be achieved by using which of the following?
a) group by rollup
b) group by cubic
c) group by
d) none of the mentioned

Answer: d

35. What do data warehouses support?
a) OLAP
b) OLTP
c) OLAP and OLTP
d) Operational databases

Answer: a

36. SELECT item name, color, clothes SIZE, SUM(quantity)
FROM sales
GROUP BY ROLLUP(item name, color, clothes SIZE);
How many groupings are possible in this rollup?
a) 8
b) 4
c) 2
d) 1

Answer: b
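
Question 36's count of four follows from how ROLLUP expands: ROLLUP(c1, ..., cn) produces one grouping per prefix of the column list plus the grand total, i.e. n + 1 grouping sets. A minimal Python sketch (column names taken from the query above):

```python
# ROLLUP(c1, ..., cn) yields n + 1 grouping sets: every prefix of the
# column list, down to the empty tuple (the grand total).

def rollup_grouping_sets(columns):
    return [tuple(columns[:i]) for i in range(len(columns), -1, -1)]

sets_ = rollup_grouping_sets(["item_name", "color", "clothes_size"])
print(sets_)
print(len(sets_))  # 4 groupings for three rollup columns
```

By the same logic, GROUP BY CUBE would produce 2^n grouping sets (8 here), which is why the cube question earlier lists all eight subsets.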

37. Which one of the following is the right syntax for DECODE?
a) DECODE (search, expression, result [, search, result]… [, default])
b) DECODE (expression, result [, search, result]… [, default], search)
c) DECODE (search, result [, search, result]… [, default], expression)
d) DECODE (expression, search, result [, search, result]… [, default])

Answer: d

Module 3

1. Data mining refers to ______
a) Special fields for database
b) Knowledge discovery from large database
c) Knowledge base for the database
d) Collections of attributes

Answer: B

2. An attribute is a ____
a) Normalization of Fields
b) Property of the class
c) Characteristics of the object
d) Summarise value

Answer: C

3. Which of the following is not related to ratio attributes?
a) Age Group 10-20, 30-50, 35-45 (in Years)
b) Mass 20-30 kg, 10-15 kg
c) Areas 10-50, 50-100 (in Kilometres)
d) Temperature 10°-20°, 30°-50°, 35°-45°

Answer: D

4. The mean is the ________ of a dataset.
a) Average
b) Middle
c) Central
d) Ordered

Answer: A

5. The number that occurs most often within a set of data is called ______
a) Mean
b) Median
c) Mode
d) Range

Answer: C

6. Find the range for given data 40, 30, 43, 48, 26, 50, 55, 40, 34, 42, 47, 50
a) 19
b) 29
c) 35
d) 49

Answer: B
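
The mean, mode, and range questions above can be checked with Python's statistics module, using the data from question 6:

```python
import statistics

data = [40, 30, 43, 48, 26, 50, 55, 40, 34, 42, 47, 50]

mean_ = statistics.mean(data)        # mean: the average of the values
modes = statistics.multimode(data)   # mode: most frequent values (here two tie)
range_ = max(data) - min(data)       # range: 55 - 26 = 29

print(mean_, modes, range_)
```

Note that this dataset is bimodal (40 and 50 each appear twice), so `multimode` returns both; `statistics.mode` would return only the first one encountered.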

7. Which of the following is not part of the KDD process?
a) Selection
b) Pre-processing
c) Reduction
d) Summation

Answer: D

8. _______ is the output of KDD Process.
a) Query
b) Useful Information
c) Information
d) Data

Answer: B

9. Data mining turns a large collection of data into _____
a) Database
b) Knowledge
c) Queries
d) Transactions

Answer: B

10. In the KDD process, the step in which data relevant to the analysis task are retrieved from the database is _____
a) Data Selection
b) Data Collection
c) Data Warehouse
d) Data Mining

Answer: A

11. In the KDD process, the step in which data are transformed and consolidated into appropriate forms for mining by performing summary or aggregation operations is called _____
a) Data Selection
b) Data Transformation
c) Data Reduction
d) Data Cleaning

Answer: B

12. What kinds of data can be mined?
a) Database data
b) Data Warehouse data
c) Transactional data
d) All of the above

Answer: D

13. Data selection is _____
a) The actual discovery phase of a knowledge discovery process
b) The stage of selecting the right data for a KDD process
c) A subject-oriented integrated time-variant non-volatile collection of data in support of management
d) Record oriented classes finding

Answer: B

14. To remove noise and inconsistent data ____ is needed.
a) Data Cleaning
b) Data Transformation
c) Data Reduction
d) Data Integration

Answer: A

15. Combining multiple data sources is called _____
a) Data Reduction
b) Data Cleaning
c) Data Integration
d) Data Transformation

Answer: C

16. A _____ is a collection of tables, each assigned a unique name, based on the entity-relationship (ER) data model.
a) Relational database
b) Transactional database
c) Data Warehouse
d) Spatial database

Answer: A

17. Relational data can be accessed by _____ written in a relational query language.
a) Select
b) Queries
c) Operations
d) Like

Answer: B

18. _____ studies the collection, analysis, interpretation or explanation, and presentation of data.
a) Statistics
b) Visualization
c) Data Mining
d) Clustering

Answer: A

19. ______ investigates how computers can learn (or improve their performance) based on data.
a) Machine Learning
b) Artificial Intelligence
c) Statistics
d) Visualization

Answer: A

20. _____ is the science of searching for documents or information in documents.
a) Data Mining
b) Information Retrieval
c) Text Mining
d) Web Mining

Answer: B

21. Data often contain _____
a) Target Class
b) Uncertainty
c) Methods
d) Keywords

Answer: B

22. The data mining process should be highly ______
a) On Going
b) Active
c) Interactive
d) Flexible

Answer: C

23. In the real-world multidimensional view of data mining, the major dimensions are data, knowledge, technologies, and _____
a) Methods
b) Applications
c) Tools
d) Files

Answer: B

24. An _____ is a data field, representing a characteristic or feature of a data object.
a) Method
b) Variable
c) Task
d) Attribute

Answer: D

25. The values of a _____ attribute are symbols or names of things.
a) Ordinal
b) Nominal
c) Ratio
d) Interval

Answer: B

26. “Data about data” is referred to as _____
a) Information
b) Database
c) Metadata
d) File

Answer: C

27. ______ partitions the objects into different groups.
a) Mapping
b) Clustering
c) Classification
d) Prediction

Answer: B

28. In _____, the attribute data are scaled so as to fall within a smaller range, such as -1.0 to 1.0, or 0.0 to 1.0.
a) Aggregation
b) Binning
c) Clustering
d) Normalization

Answer: D

29. Normalization by ______ normalizes by moving the decimal point of values of attributes.
a) Z-Score
b) Z-Index
c) Decimal Scaling
d) Min-Max Normalization

Answer: C
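
The normalization schemes in questions 28-30 can be sketched in a few lines of Python (the sample values are made up):

```python
import statistics

values = [120, -340, 950, 40, -15]  # illustrative integer attribute values

# Min-max normalization: rescale into the range [0.0, 1.0].
lo, hi = min(values), max(values)
min_max = [(v - lo) / (hi - lo) for v in values]

# Z-score normalization: center on the mean, scale by the std deviation.
mu, sigma = statistics.mean(values), statistics.pstdev(values)
z_scores = [(v - mu) / sigma for v in values]

# Decimal scaling: move the decimal point so every |v| < 1.
j = len(str(max(abs(v) for v in values)))  # digits in the largest magnitude
decimal_scaled = [v / 10**j for v in values]

print(min_max[0], z_scores[0], decimal_scaled[0])
```

Here the largest magnitude is 950 (three digits), so decimal scaling divides by 10^3, giving values like 0.95 and -0.34.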

30._______ is a top-down splitting technique based on a specified number of bins.
a) Normalization
b) Binning
c) Clustering
d) Classification

Answer: B

Module 4

1. How many terms are required for building a Bayes model?
a) 1
b) 2
c) 3
d) 4

Answer: c
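
The three terms in question 1 are the prior, the likelihood, and the evidence. An illustrative computation (all probabilities below are made-up numbers):

```python
# Bayes' rule needs three terms: prior P(H), likelihood P(E|H),
# and evidence P(E). The numbers here are illustrative only.

prior = 0.01            # P(disease)
likelihood = 0.95       # P(positive test | disease)
false_positive = 0.05   # P(positive test | no disease)

# Evidence via the law of total probability.
evidence = likelihood * prior + false_positive * (1 - prior)
posterior = likelihood * prior / evidence   # P(disease | positive test)

print(round(posterior, 4))
```

Even with a fairly accurate test, the low prior keeps the posterior around 16%, which is the kind of probabilistic query Bayes' rule answers (question 3).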

2. What is needed to make probabilistic systems feasible in the world?
a) Reliability
b) Crucial robustness
c) Feasibility
d) None of the mentioned

Answer: b

3. Where can the Bayes rule be used?
a) Solving queries
b) Increasing complexity
c) Decreasing complexity
d) Answering probabilistic query

Answer: d

4. What does the Bayesian network provide?
a) Complete description of the domain
b) Partial description of the domain
c) Complete description of the problem
d) None of the mentioned

Answer: a

5. How can the entries in the full joint probability distribution be calculated?
a) Using variables
b) Using information
c) Both Using variables & information
d) None of the mentioned

Answer: b

6. How can the Bayesian network be used to answer any query?
a) Full distribution
b) Joint distribution
c) Partial distribution
d) All of the mentioned

Answer: b

7. How can the compactness of the Bayesian network be described?
a) Locally structured
b) Fully structured
c) Partial structure
d) All of the mentioned

Answer: a

8. With which is the local structure associated?
A. Hybrid
B. Dependent
c) Linear
d) None of the mentioned

Answer: c

9. Which condition is used to influence a variable directly by all the others?
a) Partially connected
b) Fully connected
c) Local connected
d) None of the mentioned

Answer: b

10. What is the relationship between a node and its predecessors while creating a Bayesian network?
a) Functionally dependent
b) Dependent
c) Conditionally independent
d) Both Conditionally dependent & Dependent

Answer: c

11. A _________ is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility.
a) Decision tree
b) Graphs
c) Trees
d) Neural Networks

Answer: a

12. Decision Tree is a display of an algorithm.
a) True
b) False

Answer: a

13. What is Decision Tree?
a) Flow-Chart
b) Structure in which each internal node represents a test on an attribute, each branch represents an outcome of the test, and each leaf node represents a class label
c) Flow-Chart & Structure in which each internal node represents a test on an attribute, each branch represents an outcome of the test, and each leaf node represents a class label
d) None of the mentioned

Answer: c

14. Decision Trees can be used for Classification Tasks.
a) True
b) False

Answer: a

15. Choose from the following that are Decision Tree nodes?
a) Decision Nodes
b) End Nodes
c) Chance Nodes
d) All of the mentioned

Answer: d

16. Decision Nodes are represented by ____________
a) Disks
b) Squares
c) Circles
d) Triangles

Answer: b

17. Chance Nodes are represented by __________
a) Disks
b) Squares
c) Circles
d) Triangles

Answer: c

18. End Nodes are represented by __________
a) Disks
b) Squares
c) Circles
d) Triangles

Answer: d

19. Which of the following are the advantage/s of Decision Trees?
a) Possible Scenarios can be added
b) Use a white box model, If given result is provided by a model
c) Worst, best and expected values can be determined for different scenarios
d) All of the mentioned

Answer: d

20. Which of the following is the valid component of the predictor?
a) data
b) question
c) algorithm
d) all of the mentioned

Answer: d

21. Point out the wrong statement.
a) In Sample Error is also called generalization error
b) Out of Sample Error is the error rate you get on the new dataset
c) In Sample Error is also called resubstitution error
d) All of the mentioned

Answer: a

22. Which of the following is correct order of working?
a) questions->input data ->algorithms
b) questions->evaluation ->algorithms
c) evaluation->input data ->algorithms
d) all of the mentioned

Answer: a

23. Which of the following shows correct relative order of importance?
a) question->features->data->algorithms
b) question->data->features->algorithms
c) algorithms->data->features->question
d) none of the mentioned

Answer: b

24. Point out the correct statement.
a) In Sample Error is the error rate you get on the same dataset used to model a predictor
b) Data have two parts-signal and noise
c) The goal of predictor is to find signal
d) None of the mentioned

Answer: d

25. Which of the following is characteristic of best machine learning method?
a) Fast
b) Accuracy
c) Scalable
d) All of the mentioned

Answer: d

26. True positive means correctly rejected.
a) True
b) False

Answer: b

27. Which of the following trade-off occurs during prediction?
a) Speed vs Accuracy
b) Simplicity vs Accuracy
c) Scalability vs Accuracy
d) None of the mentioned

Answer: d

28. Which of the following expression is true?
a) In sample error < out sample error
b) In sample error > out sample error
c) In sample error = out sample error
d) All of the mentioned

Answer: a

29. Backtesting is a key component of effective trading-system development.
a) True
b) False

Answer: a

30. Which of the following is correct use of cross validation?
a) Selecting variables to include in a model
b) Comparing predictors
c) Selecting parameters in prediction function
d) All of the mentioned

Answer: d

31. Point out the wrong combination.
a) True negative=correctly rejected
b) False negative=correctly rejected
c) False positive=correctly identified
d) All of the mentioned

Answer: c

32. Which of the following is a common error measure?
a) Sensitivity
b) Median absolute deviation
c) Specificity
d) All of the mentioned

Answer: d
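
Sensitivity and specificity from questions 26, 31, and 32 follow directly from confusion-matrix counts; the counts below are illustrative only:

```python
# Confusion-matrix terminology used in the questions above.
# true positive = correctly identified, false negative = incorrectly rejected,
# false positive = incorrectly identified, true negative = correctly rejected.

tp, fn = 40, 10   # illustrative counts, not real data
fp, tn = 5, 45

sensitivity = tp / (tp + fn)   # share of actual positives correctly identified
specificity = tn / (tn + fp)   # share of actual negatives correctly rejected

print(sensitivity, specificity)
```

This makes the answer to question 26 easy to check: a true positive is correctly *identified*; it is the true negative that is correctly rejected.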

33. Which of the following is not a machine learning algorithm?
a) SVG
b) SVM
c) Random forest
d) None of the mentioned

Answer: a

34. Point out the wrong statement.
a) ROC curve stands for receiver operating characteristic
b) For time series, data must be in chunks
c) Random sampling must be done with replacement
d) None of the mentioned

Answer: d

35. Which of the following is a categorical outcome?
a) RMSE
b) RSquared
c) Accuracy
d) All of the mentioned

Answer: c

36. For k cross-validation, larger k value implies more bias.
a) True
b) False

Answer: b

37. Which of the following method is used for trainControl resampling?
a) repeatedcv
b) svm
c) bag32
d) none of the mentioned

Answer: a

38. Which of the following can be used to create the most common graph types?
a) qplot
b) quickplot
c) plot
d) all of the mentioned

Answer: a

39. For k cross-validation, smaller k value implies less variance.
a) True
b) False

Answer: a

40. Predicting with trees evaluate _____________ within each group of data.
a) equality
b) homogeneity
c) heterogeneity
d) all of the mentioned

Answer: b
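
A common way trees measure the homogeneity from question 40 is Gini impurity; a minimal sketch (labels are illustrative):

```python
# Gini impurity: one common homogeneity measure used when growing
# decision trees (0 = perfectly homogeneous group).

def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

print(gini(["yes", "yes", "yes", "yes"]))  # 0.0 (pure group)
print(gini(["yes", "yes", "no", "no"]))    # 0.5 (maximally mixed, two classes)
```

A tree-growing algorithm would pick the split that most reduces this impurity, i.e. the one producing the most homogeneous child groups.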

41. Point out the wrong statement.
a) Training and testing data must be processed in different way
b) Test transformation would mostly be imperfect
c) The first goal is statistical and second is data compression in PCA
d) All of the mentioned

Answer: a

42. Which of the following method options is provided by train function for bagging?
a) bagEarth
b) treebag
c) bagFDA
d) all of the mentioned

Answer: d

43. Which of the following is correct with respect to random forest?
a) Random forest are difficult to interpret but often very accurate
b) Random forest are easy to interpret but often very accurate
c) Random forest are difficult to interpret but very less accurate
d) None of the mentioned

Answer: a

44. Point out the correct statement.
a) Prediction with regression is easy to implement
b) Prediction with regression is easy to interpret
c) Prediction with regression performs well when linear model is correct
d) All of the mentioned

Answer: d

45. Which of the following library is used for boosting generalized additive models?
a) gamBoost
b) gbm
c) ada
d) all of the mentioned

Answer: a

46. The principal components are equal to left singular values if you first scale the variables.
a) True
b) False

Answer: b

47. Which of the following is statistical boosting based on additive logistic regression?
a) gamBoost
b) gbm
c) ada
d) mboost

Answer: a

48. Which of the following is one of the largest boost subclass in boosting?
a) variance boosting
b) gradient boosting
c) mean boosting
d) all of the mentioned

Answer: b

49. PCA is most useful for non linear type models.
a) True
b) False

Answer: b

50. Which of the following clustering type has characteristic shown in the below figure?
[figure not available]
a) Partitional
b) Hierarchical
c) Naive bayes
d) None of the mentioned

Answer: b

51. Point out the correct statement.
a) The choice of an appropriate metric will influence the shape of the clusters
b) Hierarchical clustering is also called HCA
c) In general, the merges and splits are determined in a greedy manner
d) All of the mentioned

Answer: d

52. Which of the following is finally produced by Hierarchical Clustering?
a) final estimate of cluster centroids
b) tree showing how close things are to each other
c) assignment of each point to clusters
d) all of the mentioned

Answer: b

53. Which of the following is required by K-means clustering?
a) defined distance metric
b) number of clusters
c) initial guess as to cluster centroids
d) all of the mentioned

Answer: d
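
Question 53's three requirements can all be seen in a minimal, illustrative 1-D k-means sketch (the data points and starting centroids are made up):

```python
# Minimal 1-D k-means sketch showing the three required inputs:
# a distance metric, the number of clusters k (len(centroids)),
# and an initial guess for the centroids.

def kmeans_1d(points, centroids, iters=10):
    for _ in range(iters):
        # Assign each point to its nearest centroid (squared distance metric).
        clusters = {c: [] for c in centroids}
        for p in points:
            nearest = min(centroids, key=lambda c: (p - c) ** 2)
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its assigned points.
        centroids = [sum(ps) / len(ps) if ps else c
                     for c, ps in clusters.items()]
    return sorted(centroids)

points = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]
print(kmeans_1d(points, centroids=[0.0, 5.0]))  # converges to [1.5, 9.5]
```

The dependence on the initial centroid guess is also why question 59 calls k-means non-deterministic.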

54. Point out the wrong statement.
a) k-means clustering is a method of vector quantization
b) k-means clustering aims to partition n observations into k clusters
c) k-nearest neighbor is same as k-means
d) none of the mentioned

Answer: c

55. Which of the following combination is incorrect?
a) Continuous – euclidean distance
b) Continuous – correlation similarity
c) Binary – manhattan distance
d) None of the mentioned

Answer: d

56. Hierarchical clustering should be primarily used for exploration.
a) True
b) False

Answer: a

57. Which of the following function is used for k-means clustering?
a) k-means
b) k-mean
c) heatmap
d) none of the mentioned

Answer: a

58. Which of the following clustering requires merging approach?
a) Partitional
b) Hierarchical
c) Naive Bayes
d) None of the mentioned

Answer: b

59. K-means is not deterministic, and it also consists of a number of iterations.
a) True
b) False

Answer: a

60. Hierarchical clustering should be mainly used for exploration.
a) True
b) False

Answer: a

61. K-means clustering consists of a number of iterations and is not deterministic.
a) True
b) False

Answer: a

62. Which is needed by K-means clustering?
a) defined distance metric
b) number of clusters
c) initial guess as to cluster centroids
d) all of these

Answer: d

63. Which function is used for k-means clustering?
a) k-means
b) k-mean
c) heatmap
d) none of the mentioned

Answer: a

64. Which is conclusively produced by Hierarchical Clustering?
a) final estimation of cluster centroids
b) tree showing how nearby things are to each other
c) assignment of each point to clusters
d) all of these

Answer: b

65. Which clustering technique requires a merging approach?
a) Partitional
b) Hierarchical
c) Naive Bayes
d) None of the mentioned

Answer: b

Module 5

1. A collection of one or more items is called _____
a) Itemset
b) Support
c) Confidence
d) Support Count

Answer: A

2. The frequency of occurrence of an itemset is called _____
a) Support
b) Confidence
c) Support Count
d) Rules

Answer: C

3. An itemset whose support is greater than or equal to a minimum support threshold is ______
a) Itemset
b) Frequent Itemset
c) Infrequent items
d) Threshold values

Answer: B

4. What does FP growth algorithm do?
a) It mines all frequent patterns through pruning rules with lesser support
b) It mines all frequent patterns through pruning rules with higher support
c) It mines all frequent patterns by constructing a FP tree
d) It mines all frequent patterns by constructing itemsets

Answer: C

5. What techniques can be used to improve the efficiency of apriori algorithm?
a) Hash-based techniques
b) Transaction Increases
c) Sampling
d) Cleaning

Answer: A

6. What do you mean by support(A)?
a) Total number of transactions containing A
b) Total Number of transactions not containing A
c) Number of transactions containing A / Total number of transactions
d) Number of transactions not containing A / Total number of transactions

Answer: C

7. How do you calculate Confidence (A -> B)?
a) Support(A ∪ B) / Support(A)
b) Support(A ∪ B) / Support(B)
c) Support(A ∩ B) / Support(A)
d) Support(A ∩ B) / Support(B)

Answer: A
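
The support and confidence definitions in questions 6 and 7 can be sketched in Python over a toy transaction set (the items are made up):

```python
# Toy market-basket transactions (illustrative items only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    # Fraction of transactions containing every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

# confidence(A -> B) = support of A and B together, divided by support(A)
A, B = {"bread"}, {"milk"}
confidence = support(A | B) / support(A)
print(support(A), support(A | B), confidence)
```

Here bread appears in 3 of 4 transactions (support 0.75), bread and milk together in 2 of 4 (support 0.5), so confidence(bread -> milk) is 2/3.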

8. Which of the following is the direct application of frequent itemset mining?
a) Social Network Analysis
b) Market Basket Analysis
c) Outlier Detection
d) Intrusion Detection

Answer: B

9. What is not true about FP growth algorithms?
a) It mines frequent itemsets without candidate generation
b) There are chances that FP trees may not fit in the memory
c) FP trees are very expensive to build
d) It expands the original database to build FP trees

Answer: D

10. When do you consider an association rule interesting?
a) If it only satisfies min_support
b) If it only satisfies min_confidence
c) If it satisfies both min_support and min_confidence
d) There are other measures to check so

Answer: C

11. What is the relation between a candidate and frequent itemsets?
a) A candidate itemset is always a frequent itemset
b) A frequent itemset must be a candidate itemset
c) No relation between these two
d) Strong relation with transactions

Answer:B

12. Which of the following is not a frequent pattern mining algorithm?
a) Apriori
b) FP growth
c) Decision trees
d) Eclat

Answer: C

13. Which algorithm requires fewer scans of data?
a) Apriori
b) FP Growth
c) Naive Bayes
d) Decision Trees

Answer: B

14. For the question given below, consider these transactions:

I1, I2, I3, I4, I5, I6
I7, I2, I3, I4, I5, I6
I1, I8, I4, I5
I1, I9, I10, I4, I6
I10, I2, I4, I11, I5
With support as 0.6 find all frequent itemsets?

a) <I1>, <I2>, <I4>, <I5>, <I6>, <I1, I4>, <I2, I4>, <I2, I5>, <I4, I5>, <I4, I6>, <I2, I4, I5>

b) <I2>, <I4>, <I5>, <I2, I4>, <I2, I5>, <I4, I5>, <I2, I4, I5>

c) <I11>, <I4>, <I5>, <I6>, <I1, I4>, <I5, I4>, <I11, I5>, <I4, I6>, <I2, I4, I5>

d) <I1>, <I4>, <I5>, <I6>

Answer: A
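
The counting behind question 14 can be checked with a brute-force enumeration over the five transactions (a sketch, not an efficient Apriori implementation):

```python
from itertools import combinations

# The five transactions from question 14.
transactions = [
    {"I1", "I2", "I3", "I4", "I5", "I6"},
    {"I7", "I2", "I3", "I4", "I5", "I6"},
    {"I1", "I8", "I4", "I5"},
    {"I1", "I9", "I10", "I4", "I6"},
    {"I10", "I2", "I4", "I11", "I5"},
]
min_support = 0.6  # support threshold from the question
items = sorted(set().union(*transactions))

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

frequent = []
for k in range(1, len(items) + 1):
    level = [set(c) for c in combinations(items, k)
             if support(set(c)) >= min_support]
    if not level:
        break  # Apriori property: if no k-itemset is frequent, stop
    frequent.extend(level)

print(len(frequent))  # 11 frequent itemsets, matching option (a)
```

With support 0.6 an itemset must appear in at least 3 of the 5 transactions; the enumeration finds the five singletons, five pairs, and one triple listed in option (a).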

15. What will happen if support is reduced?
a) Number of frequent itemsets remains the same
b) Some itemsets will add to the current set of frequent itemsets.
c) Some itemsets will become infrequent while others will become frequent
d) Can not say

Answer: B

16. What is association rule mining?
a) Same as frequent itemset mining
b) Finding of strong association rules using frequent itemsets
c) Using association to analyze correlation rules
d) Finding Itemsets for future trends

Answer: B

17. A definition of a concept is ______ if it classifies any examples as coming within the concept
a) Concurrent
b) Consistent
c) Constant
d) Complete

Answer: B