## [MCQ’s] Machine Learning

#### Module 01

1. What is true about Machine Learning?
A. Machine Learning (ML) is a field of computer science
B. ML is a type of artificial intelligence that extracts patterns out of raw data by using an algorithm or method.
C. The main focus of ML is to allow computer systems to learn from experience without being explicitly programmed or requiring human intervention.
D. All of the above
Explanation: All of these statements are true about Machine Learning.

2. ML is a field of AI consisting of learning algorithms that?
A. Improve their performance
B. At executing some task
C. Over time with experience
D. All of the above
Explanation: ML is a field of AI consisting of learning algorithms that: improve their performance (P), at executing some task (T), over time with experience (E).

3. p → ¬q is not a?
A. hack clause
B. horn clause
C. structural clause
D. system clause
Explanation: p → ¬q is not a Horn clause.

4. The action _______ of a robot arm specifies placing block A on block B.
A. STACK(A,B)
B. LIST(A,B)
C. QUEUE(A,B)
D. ARRAY(A,B)
Explanation: The action ‘STACK(A,B)’ of a robot arm specifies placing block A on block B.

5. A__________ begins by hypothesizing a sentence (the symbol S) and successively predicting lower level constituents until individual preterminal symbols are written.
A. bottom-up parser
B. top parser
C. top-down parser
D. bottom parser
Explanation: A top-down parser begins by hypothesizing a sentence (the symbol S) and successively predicting lower level constituents until individual preterminal symbols are written.

6. A model of language consists of categories that do not include ________.
A. System Unit
B. structural units.
C. data units
D. empirical units
Explanation: A model of language consists of categories that do not include structural units.

7. Different learning methods do not include?
A. Introduction
B. Analogy
C. Deduction
D. Memorization
Explanation: Different learning methods do not include introduction.

8. Training the model with all the data in one single batch is known as?
A. Batch learning
B. Offline learning
C. Both A and B
D. None of the above
Ans : C
Explanation: Some end-to-end Machine Learning systems need to train the model in one go using all the available training data. This kind of learning method is called batch or offline learning.

9. Which of the following are ML methods?
A. based on human supervision
B. supervised Learning
C. semi-reinforcement Learning
D. All of the above
Ans : A
Explanation: The following are various ML methods based on some broad categories : Based on human supervision, Unsupervised Learning, Semi-supervised Learning and Reinforcement Learning

10. In Model based learning methods, an iterative process takes place on the ML models that are built based on various model parameters, called ?
A. mini-batches
B. optimized parameters
C. hyperparameters
D. superparameters
Explanation: In Model based learning methods, an iterative process takes place on the ML models that are built based on various model parameters, called hyperparameters.

Learn Machine Learning with Python from Scratch

11. Which of the following is a widely used and effective machine learning algorithm based on the idea of bagging?
A. Decision Tree
B. Regression
C. Classification
D. Random Forest
Explanation: The Random Forest algorithm builds an ensemble of Decision Trees, mostly trained with the bagging method.

12. To find the minimum or the maximum of a function, we set the gradient to zero because:
A. The value of the gradient at extrema of a function is always zero
B. Depends on the type of problem
C. Both A and B
D. None of the above
Explanation: At an extremum of a differentiable multivariable function (away from the boundary of its domain), the gradient is the zero vector, so setting the gradient to zero identifies the candidate points for a minimum or maximum.
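As a small numerical illustration of the zero-gradient condition, the sketch below estimates the gradient of a hypothetical function f(x, y) = (x - 1)² + (y + 2)² by central differences. The function, the step size `eps`, and the helper name `numerical_gradient` are assumptions chosen only for this example.

```python
def f(x, y):
    # Hypothetical bowl-shaped function whose minimum is at (1, -2).
    return (x - 1) ** 2 + (y + 2) ** 2


def numerical_gradient(func, x, y, eps=1e-6):
    # Central-difference estimate of the two partial derivatives.
    df_dx = (func(x + eps, y) - func(x - eps, y)) / (2 * eps)
    df_dy = (func(x, y + eps) - func(x, y - eps)) / (2 * eps)
    return df_dx, df_dy


grad_at_minimum = numerical_gradient(f, 1.0, -2.0)
grad_elsewhere = numerical_gradient(f, 3.0, 0.0)
print("gradient at the minimum:", grad_at_minimum)
print("gradient at (3, 0):", grad_elsewhere)
```

The gradient vanishes at the minimum and is clearly nonzero elsewhere. A zero gradient only marks a candidate point, so a separate check (for example, the second derivatives) is still needed to tell a minimum from a maximum or a saddle point.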

13. Which of the following is a disadvantage of decision trees?
A. Factor analysis
B. Decision trees are robust to outliers
C. Decision trees are prone to be overfit
D. None of the above
Explanation: Allowing a decision tree to split to a very granular degree lets it fit every training point, up to perfect classification of the training set, which is overfitting.

14. How do you handle missing or corrupted data in a dataset?
A. Drop missing rows or columns
B. Replace missing values with mean/median/mode
C. Assign a unique category to missing values
D. All of the above
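Explanation: All of the above are common ways of handling missing values: dropping the affected rows or columns, imputing with the mean, median, or mode, or treating missingness as its own category.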
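The three options can be sketched on a toy dataset as follows. The records, the field names `age` and `city`, and the placeholder category `"Unknown"` are hypothetical and serve only to illustrate the idea.

```python
from statistics import mean

# Hypothetical toy records; None marks a missing value.
rows = [
    {"age": 25, "city": "Pune"},
    {"age": None, "city": "Delhi"},
    {"age": 35, "city": None},
]

# Option A: drop every row that has at least one missing value.
dropped = [r for r in rows if all(v is not None for v in r.values())]

# Option B: replace missing numeric values with the column mean.
age_mean = mean(r["age"] for r in rows if r["age"] is not None)
mean_filled = [dict(r, age=age_mean if r["age"] is None else r["age"]) for r in rows]

# Option C: give missing categorical values their own category.
category_filled = [dict(r, city="Unknown" if r["city"] is None else r["city"]) for r in rows]

print(dropped)
print(mean_filled)
print(category_filled)
```

Which option is appropriate depends on how much data is missing and why; dropping rows is simplest but discards information.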

15. When performing regression or classification, which of the following is the correct way to preprocess the data?
A. Normalize the data -> PCA -> training
B. PCA -> normalize PCA output -> training
C. Normalize the data -> PCA -> normalize PCA output -> training
D. None of the above
Explanation: You need to always normalize the data first. If not, PCA or other techniques that are used to reduce dimensions will give different results.

16. Which of the following statements about regularization is not correct?
A. Using too large a value of lambda can cause your hypothesis to underfit the data.
B. Using too large a value of lambda can cause your hypothesis to overfit the data
C. Using a very large value of lambda cannot hurt the performance of your hypothesis.
D. None of the above
Explanation: A large value results in a large regularization penalty and therefore, a strong preference for simpler models, which can underfit the data.
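A minimal sketch of this effect, assuming a one-weight ridge regression without intercept on hypothetical data generated from y = 2x. The closed form used here follows from minimizing sum((y - w*x)²) + lambda * w².

```python
# Hypothetical data generated from y = 2x, so the unregularized slope is 2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]


def ridge_slope(xs, ys, lam):
    # Closed-form minimizer of sum((y - w*x)^2) + lam * w^2 for a single weight w.
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def training_error(xs, ys, w):
    # Sum of squared residuals on the training data.
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys))


for lam in (0.0, 10.0, 1000.0):
    w = ridge_slope(xs, ys, lam)
    print(f"lambda={lam:7.1f}  w={w:.4f}  training error={training_error(xs, ys, w):.4f}")
```

As lambda grows, the weight is pushed toward zero and the training error rises, which is the underfitting behaviour described above.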

17. Which of the following techniques can not be used for normalization in text mining?
A. Stemming
B. Lemmatization
C. Stop Word Removal
D. None of the above
Explanation: Lemmatization and stemming are the techniques of keyword normalization.

18. In which of the following cases will K-means clustering fail to give good results?
1) Data points with outliers
2) Data points with different densities
3) Data points with nonconvex shapes

A. 1 and 2
B. 2 and 3
C. 1 and 3
D. All of the above

Explanation: K-means clustering algorithm fails to give good results when the data contains outliers, the density spread of data points across the data space is different, and the data points follow nonconvex shapes.

19. Which of the following is a reasonable way to select the number of principal components “k”?
A. Choose k to be the smallest value so that at least 99% of the variance is retained.
B. Choose k to be 99% of m (k = 0.99*m, rounded to the nearest integer).
C. Choose k to be the largest value so that 99% of the variance is retained.
D. Use the elbow method.
Explanation: This will maintain the structure of the data and also reduce its dimension.

20. What is a sentence parser typically used for?
A. It is used to parse sentences to check if they are utf-8 compliant.
B. It is used to parse sentences to derive their most likely syntax tree structures.
C. It is used to parse sentences to assign POS tags to all tokens.
D. It is used to check if sentences can be parsed into meaningful tokens.
Explanation: Sentence parsers analyze a sentence and automatically build a syntax tree.


21. Which of the following is a widely used and effective machine learning algorithm based on the idea of bagging?
A. Decision Tree
B. Regression
C. Classification
D. Random Forest

22. To find the minimum or the maximum of a function, we set the gradient to zero because:
A. The value of the gradient at extrema of a function is always zero
B. Depends on the type of problem
C. Both A and B
D. None of the above

23. The most widely used metrics and tools to assess a classification model are:
A. Confusion matrix
B. Cost-sensitive accuracy
C. Area under the ROC curve
D. All of the above

24. Which of the following is a good test dataset characteristic?
A. Large enough to yield meaningful results
B. Is representative of the dataset as a whole
C. Both A and B
D. None of the above

25. Which of the following is a disadvantage of decision trees?
A. Factor analysis
B. Decision trees are robust to outliers
C. Decision trees are prone to be overfit
D. None of the above

26. How do you handle missing or corrupted data in a dataset?
A. Drop missing rows or columns
B. Replace missing values with mean/median/mode
C. Assign a unique category to missing values
D. All of the above

27. What is the purpose of performing cross-validation?
A. To assess the predictive performance of the models
B. To judge how the trained model performs outside the sample on test data
C. Both A and B

28. Why is second order differencing in time series needed?
A. To remove stationarity
B. To find the maxima or minima at the local point
C. Both A and B
D. None of the above

29. When performing regression or classification, which of the following is the correct way to preprocess the data?
A. Normalize the data → PCA → training
B. PCA → normalize PCA output → training
C. Normalize the data → PCA → normalize PCA output → training
D. None of the above

30. Which of the following is an example of feature extraction?
A. Constructing bag of words vector from an email
B. Applying PCA to project large high-dimensional data
C. Removing stopwords in a sentence
D. All of the above


31. What is pca.components_ in Sklearn?
A. Set of all eigen vectors for the projection space
B. Matrix of principal components
C. Result of the multiplication matrix
D. None of the above options

32. Which of the following is true about Naive Bayes ?
A. Assumes that all the features in a dataset are equally important
B. Assumes that all the features in a dataset are independent
C. Both A and B
D. None of the above options

33. Which of the following statements about regularization is not correct?
A. Using too large a value of lambda can cause your hypothesis to underfit the data.
B. Using too large a value of lambda can cause your hypothesis to overfit the data.
C. Using a very large value of lambda cannot hurt the performance of your hypothesis.
D. None of the above

34. How can you prevent a clustering algorithm from getting stuck in bad local optima?
A. Set the same seed value for each run
B. Use multiple random initializations
C. Both A and B
D. None of the above

35. Which of the following techniques can be used for normalization in text mining?
A. Stemming
B. Lemmatization
C. Stop Word Removal
D. Both A and B

36. In which of the following cases will K-means clustering fail to give good results? 1) Data points with outliers 2) Data points with different densities 3) Data points with nonconvex shapes
A. 1 and 2
B. 2 and 3
C. 1, 2, and 3
D. 1 and 3

37. Which of the following is a reasonable way to select the number of principal components “k”?
A. Choose k to be the smallest value so that at least 99% of the variance is retained.
B. Choose k to be 99% of m (k = 0.99*m, rounded to the nearest integer).
C. Choose k to be the largest value so that 99% of the variance is retained.
D. Use the elbow method

38. You run gradient descent for 15 iterations with α = 0.3 and compute J(θ) after each iteration. You find that the value of J(θ) decreases quickly and then levels off. Based on this, which of the following conclusions seems most plausible?
A. Rather than using the current value of α, use a larger value of α (say α = 1.0)
B. Rather than using the current value of α, use a smaller value of α (say α = 0.1)
C. α = 0.3 is an effective choice of learning rate
D. None of the above
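To see what a cost curve that "decreases quickly and then levels off" can look like, the sketch below runs 15 gradient descent iterations with a learning rate of 0.3. The cost J(θ) = θ² and the starting point are assumptions made only for illustration; the question does not specify them.

```python
def cost(theta):
    # Hypothetical convex cost J(theta) = theta^2, used only for illustration.
    return theta ** 2


def gradient(theta):
    # Derivative of the hypothetical cost.
    return 2 * theta


alpha = 0.3   # learning rate from the question
theta = 5.0   # arbitrary starting point (an assumption)
history = []
for _ in range(15):
    theta -= alpha * gradient(theta)
    history.append(cost(theta))

print([round(j, 6) for j in history])
```

On this toy cost the values drop sharply in the first few iterations and then flatten near zero, the pattern the question describes.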

39. What is a sentence parser typically used for?
A. It is used to parse sentences to check if they are utf-8 compliant.
B. It is used to parse sentences to derive their most likely syntax tree structures.
C. It is used to parse sentences to assign POS tags to all tokens.
D. It is used to check if sentences can be parsed into meaningful tokens.

40. Suppose you have trained a logistic regression classifier and, for a new example x, it outputs the prediction h_θ(x) = 0.2. This means
A. Our estimate for P(y=1 | x)
B. Our estimate for P(y=0 | x)
C. Our estimate for P(y=1 | x)
D. Our estimate for P(y=0 | x)


41. What is Machine learning?
a) The autonomous acquisition of knowledge through the use of computer programs
b) The autonomous acquisition of knowledge through the use of manual programs
c) The selective acquisition of knowledge through the use of computer programs
d) The selective acquisition of knowledge through the use of manual programs
Explanation: Machine learning is the autonomous acquisition of knowledge through the use of computer programs.

42. The factors that affect the performance of a learner system do not include?
a) Representation scheme used
b) Training scenario
c) Type of feedback
d) Good data structures
Explanation: The factors that affect the performance of a learner system do not include good data structures.

43. Different learning methods do not include?
a) Memorization
b) Analogy
c) Deduction
d) Introduction
Explanation: Different learning methods do not include introduction.

44. In language understanding, the levels of knowledge do not include?
a) Phonological
b) Syntactic
c) Empirical
d) Logical
Explanation: In language understanding, the levels of knowledge do not include empirical knowledge.

45. A model of language consists of categories that do not include?
a) Language units
b) Role structure of units
c) System constraints
d) Structural units
Explanation: A model of language consists of categories that do not include structural units.

46. What is a top-down parser?
a) Begins by hypothesizing a sentence (the symbol S) and successively predicting lower level constituents until individual preterminal symbols are written
b) Begins by hypothesizing a sentence (the symbol S) and successively predicting upper level constituents until individual preterminal symbols are written
c) Begins by hypothesizing lower level constituents and successively predicting a sentence (the symbol S)
d) Begins by hypothesizing upper level constituents and successively predicting a sentence (the symbol S)
Explanation: A top-down parser begins by hypothesizing a sentence (the symbol S) and successively predicting lower level constituents until individual preterminal symbols are written.

47. Among the following which is not a horn clause?
a) p
b) ¬p ∨ q
c) p → q
d) p → ¬q
Explanation: p → ¬q is not a Horn clause.

48. The action ‘STACK(A, B)’ of a robot arm specifies to _______________
a) Place block B on Block A
b) Place blocks A, B on the table in that order
c) Place blocks B, A on the table in that order
d) Place block A on block B
Explanation: The action ‘STACK(A,B)’ of a robot arm specifies placing block A on block B.

#### Module 02

1. Why do we need biological neural networks?
a) to solve tasks like machine vision & natural language processing
b) to apply heuristic search methods to find solutions of problem
c) to make smart human interactive & user friendly system
d) all of the mentioned
Explanation: These are the basic aims that a neural network achieves.

2. What is the trend in software nowadays?
a) to bring computers closer and closer to the user
b) to solve complex problems
d) to be versatile
Explanation: Software should be more interactive to the user, so that it can understand its problem in a better fashion.

3. What’s the main point of difference between human & machine intelligence?
a) humans perceive everything as a pattern while machines perceive it merely as data
b) humans have emotions
c) humans have more IQ & intellect
d) humans have sense organs
Explanation: Humans have emotions and thus form different patterns on that basis, while a machine (say, a computer) is dumb and treats everything as just data.

4. What is auto-association task in neural networks?
a) find relation between 2 consecutive inputs
b) related to storage & recall task
c) predicting the future inputs
d) none of the mentioned
Explanation: This is the basic definition of auto-association in neural networks.

5. Does pattern classification belong to the category of non-supervised learning?
a) yes
b) no
Explanation: Pattern classification belongs to the category of supervised learning.

6. In pattern mapping problem in neural nets, is there any kind of generalization involved between input & output?
a) yes
b) no
Explanation: The desired output is mapped closest to the ideal output & hence there is generalisation involved.

7. What is unsupervised learning?
a) features of group explicitly stated
b) number of groups may be known
c) neither feature & nor number of groups is known
d) none of the mentioned
Explanation: Basic definition of unsupervised learning.

8. Does pattern classification & grouping involve same kind of learning?
a) yes
b) no
Explanation: Pattern classification involves supervised learning while grouping is an unsupervised one.

9. Is supervised learning needed for feature mapping?
a) yes
b) no
Explanation: Feature mapping can be unsupervised, so supervised learning is not a necessary condition.

10. Example of an unsupervised feature map?
a) text recognition
b) voice recognition
c) image recognition
d) none of the mentioned
Explanation: Since same vowel may occur in different context & its features vary over overlapping regions of different vowels.


11. Who was the inventor of the first neurocomputer?
A. Dr. John Hecht-Nielsen
B. Dr. Robert Hecht-Nielsen
C. Dr. Alex Hecht-Nielsen
D. Dr. Steve Hecht-Nielsen
Explanation: The inventor of the first neurocomputer, Dr. Robert Hecht-Nielsen.

12. How many types of Artificial Neural Networks are there?
A. 2
B. 3
C. 4
D. 5
Explanation: There are two Artificial Neural Network topologies : FeedForward and Feedback.

13. In which ANN are loops allowed?
A. FeedForward ANN
B. FeedBack ANN
C. Both A and B
D. None of the Above
Explanation: Loops are allowed in FeedBack ANNs. They are used in content addressable memories.

14. What is the full form of BN in Neural Networks?
A. Bayesian Networks
B. Belief Networks
C. Bayes Nets
D. All of the above
Explanation: The full form BN is Bayesian networks and Bayesian networks are also called Belief Networks or Bayes Nets.

15. What is the name of a node that takes the binary values TRUE (T) and FALSE (F)?
A. Dual Node
B. Binary Node
C. Two-way Node
D. Ordered Node
Explanation: Boolean nodes : They represent propositions, taking binary values TRUE (T) and FALSE (F).

16. What is an auto-associative network?
A. a neural network that contains no loops
B. a neural network that contains feedback
C. a neural network that has only one loop
D. a single layer feed-forward neural network with pre-processing
Explanation: An auto-associative network is equivalent to a neural network that contains feedback. The number of feedback paths(loops) does not have to be one.

17. What is Neuro software?
A. A software used to analyze neurons
B. It is powerful and easy neural network
C. Designed to aid experts in real world
D. It is software used by Neurosurgeon
Explanation: Neuro software is powerful and easy neural network.

18. Neural Networks are complex ______________ with many parameters.
A. Linear Functions
B. Nonlinear Functions
C. Discrete Functions
D. Exponential Functions
Explanation: Because of their nonlinear activation functions, neural networks are complex nonlinear functions with many parameters.

19. Which of the following is not the promise of artificial neural network?
A. It can explain result
B. It can survive the failure of some nodes
C. It has inherent parallelism
D. It can handle noise
Explanation: The artificial Neural Network (ANN) cannot explain result.

20. The output at each node is called_____.
A. node value
B. Weight
C. neurons
D. axons
Explanation: The output at each node is called its activation or node value.


21. What is full form of ANNs?
A. Artificial Neural Node
B. AI Neural Networks
C. Artificial Neural Networks
D. Artificial Neural numbers
Explanation: Artificial Neural Networks is the full form of ANNs.

22. In Feed Forward ANN, information flow is _________.
A. unidirectional
B. bidirectional
C. multidirectional
D. All of the above
Explanation: Feed Forward ANN the information flow is unidirectional.

23. Which of the following is not a Machine Learning strategy in ANNs?
A. Unsupervised Learning
B. Reinforcement Learning
C. Supreme Learning
D. Supervised Learning
Explanation: Supreme Learning is not a Machine Learning strategy in ANNs.

24. Which of the following is an application of Neural Networks?
A. Automotive
B. Aerospace
C. Electronics
D. All of the above
Explanation: All of the above are applications of Neural Networks.

25. What is perceptron?
A. a single layer feed-forward neural network with pre-processing
B. an auto-associative neural network
C. a double layer auto-associative neural network
D. a neural network that contains feedback
Explanation: The perceptron is a single layer feed-forward neural network.

26. A 4-input neuron has weights 1, 2, 3 and 4. The transfer function is linear with the constant of proportionality being equal to 2. The inputs are 4, 3, 2 and 1 respectively. What will be the output?
A. 30
B. 40
C. 50
D. 60
Explanation: The output is found by multiplying the weights by their respective inputs, summing the results, and multiplying the sum by the transfer function's constant of proportionality. Therefore: Output = 2 * (1*4 + 2*3 + 3*2 + 4*1) = 40.
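The same arithmetic can be checked with a few lines of Python. The variable names are illustrative; the numbers come from the question.

```python
weights = [1, 2, 3, 4]
inputs = [4, 3, 2, 1]
k = 2  # constant of proportionality of the linear transfer function

# Weighted sum of the inputs, then the linear transfer function.
weighted_sum = sum(w * x for w, x in zip(weights, inputs))
output = k * weighted_sum
print(weighted_sum, output)  # 20 40
```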

27. What is back propagation?
A. It is another name given to the curvy function in the perceptron
B. It is the transmission of error back through the network to adjust the inputs
C. It is the transmission of error back through the network to allow weights to be adjusted so that the network can learn
D. None of the Above
Explanation: Back propagation is the transmission of error back through the network to allow weights to be adjusted so that the network can learn.

28. The network that involves backward links from output to the input and hidden layers is called _________
A. Self organizing map
B. Perceptrons
C. Recurrent neural network
D. Multi layered perceptron
Ans : C
Explanation: RNN (Recurrent neural network) topology involves backward links from output to the input and hidden layers.

29. The BN variables are composed of how many dimensions?
A. 2
B. 3
C. 4
D. 5
Explanation: The BN variables are composed of two dimensions: the range of propositions and the probability assigned to each of the propositions.

30. The first artificial neural network was invented in _____.
A. 1957
B. 1958
C. 1959
D. 1960
Ans : B
Explanation: The first artificial neural network was invented in 1958.


31. Back propagation is a learning technique that adjusts weights in the neural network by propagating weight changes.
a. Forward from source to sink
b. Backward from sink to source
c. Forward from source to hidden nodes
d. Backward from sink to hidden nodes
Explanation: Backward from sink to source

32. Identify the following activation function:
$\varphi(V) = Z + \dfrac{1}{1 + \exp(-X \cdot V + Y)}$,
where $Z$, $X$, $Y$ are parameters
a. Step function
b. Ramp function
c. Sigmoid function
Explanation: Sigmoid function
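A minimal sketch of this activation, reading the formula as Z + 1 / (1 + exp(-X·V + Y)). The function name `shifted_sigmoid` and the default parameter values are assumptions; with the defaults the function reduces to the standard logistic sigmoid.

```python
import math


def shifted_sigmoid(v, x=1.0, y=0.0, z=0.0):
    # phi(V) = Z + 1 / (1 + exp(-X * V + Y)); the defaults give the standard logistic function.
    return z + 1.0 / (1.0 + math.exp(-x * v + y))


for v in (-10.0, 0.0, 10.0):
    print(v, shifted_sigmoid(v))
```

The output rises smoothly from near Z to near Z + 1 in an S shape, unlike the abrupt jump of a step function or the piecewise-linear ramp.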

33. An artificial neuron receives n inputs x1, x2, x3…………xn with weights w1, w2, ……….wn attached to the input links. The weighted sum_________________ is computed to be passed on to a non-linear filter Φ called activation function to release the output.
a. Σ wi
b. Σ xi
c. Σ wi + Σ xi
d. Σ wi* xi
Explanation: Σ wi* xi

34. Match the following knowledge representation techniques with their applications:
| List I | List II |
| --- | --- |
| (a) Frames | (i) Pictorial representation of objects, their attributes and relationships |
| (b) Conceptual dependencies | (ii) To describe real world stereotype events |
| (c) Associative networks | (iii) Record like structures for grouping closely related knowledge |
| (d) Scripts | (iv) Structures and primitives to represent sentences |

Codes, in the order (a) (b) (c) (d):

a. (iii) (iv) (i) (ii)
b. (iii) (iv) (ii) (i)
c. (iv) (iii) (i) (ii)
d. (iv) (iii) (ii) (i)
Explanation: (iii) (iv) (i) (ii)

35. In propositional logic P ⇔ Q is equivalent to (where ~ denotes NOT):
a. ~ (P ∨ Q) ∧ ~ (Q ∨ P)
b. (~ P ∨ Q) ∧ (~ Q ∨ P)
c. (P ∨ Q) ∧ (Q ∨ P)
d. ~ (P ∨ Q) → ~ (Q ∨ P)
Explanation: (~ P ∨ Q) ∧ (~ Q ∨ P)
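The equivalence can be confirmed by enumerating the truth table. The helper names `iff` and `option_b` are illustrative only.

```python
from itertools import product


def iff(p, q):
    # P <=> Q holds exactly when both sides have the same truth value.
    return p == q


def option_b(p, q):
    # (~P or Q) and (~Q or P)
    return ((not p) or q) and ((not q) or p)


rows = [(p, q, iff(p, q), option_b(p, q)) for p, q in product([False, True], repeat=2)]
for row in rows:
    print(row)

equivalent = all(r[2] == r[3] for r in rows)
print("equivalent on every row:", equivalent)
```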

36. Slots and facets are used in
a. Semantic Networks
b. Frames
c. Rules
d. All of these
Explanation: Frames

37. A neuron with 3 inputs has the weight vector [0.2 -0.1 0.1]^T and a bias θ = 0. If the input vector is X = [0.2 0.4 0.2]^T then the total input to the neuron is:
a. 0.20
b. 1.0
c. 0.02
d. -1.0
Explanation: 0.02
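The total input is the dot product of the weight and input vectors plus the bias, which can be checked directly. Variable names are illustrative; the values come from the question.

```python
weights = [0.2, -0.1, 0.1]
inputs = [0.2, 0.4, 0.2]
bias = 0.0

# 0.2*0.2 + (-0.1)*0.4 + 0.1*0.2 = 0.04 - 0.04 + 0.02
total_input = sum(w * x for w, x in zip(weights, inputs)) + bias
print(round(total_input, 2))  # 0.02
```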

38. Which of the following neural networks uses supervised learning?
(A) Multilayer perceptron
(B) Self organizing feature map
(C) Hopfield network
a. (A) only
b. (B) only
c. (A) and (B) only
d. (A) and (C) only
Explanation: (A) only

39. Consider the following statements:
(a) If primal (dual) problem has a finite optimal solution, then its dual (primal) problem has a finite optimal solution.
(b) If primal (dual) problem has an unbounded optimum solution, then its dual (primal) has no feasible solution at all.
(c) Both primal and dual problems may be infeasible.
Which of the following is correct?
a. (a) and (b) only
b. (a) and (c) only
c. (b) and (c) only
d. (a), (b) and (c)
Explanation: (a), (b) and (c)

40. Consider the following statements :
(a) Assignment problem can be used to minimize the cost.
(b) Assignment problem is a special case of transportation problem.
(c) Assignment problem requires that only one activity be assigned to each resource.
Which of the following options is correct?
a. (a) and (b) only
b. (a) and (c) only
c. (b) and (c) only
d. (a), (b) and (c)
Explanation: (a), (b) and (c)


41. What is the name of the model in figure below?

a) Rosenblatt perceptron model
b) McCulloch-Pitts model
d) None of the mentioned
Explanation: It is a general block diagram of the McCulloch-Pitts model of a neuron.

42. What is nature of function F(x) in the figure?
a) linear
b) non-linear
c) can be either linear or non-linear
d) none of the mentioned
Explanation: In this function, the independent variable is an exponent in the equation hence non-linear.

43. What does the character ‘b’ represent in the above diagram?
a) bias
b) any constant value
c) a variable value
d) none of the mentioned
Explanation: More appropriate choice since bias is a constant fixed value for any circuit model.

44. If ‘b’ in the figure below is the bias, then what logic circuit does it represent?

a) or gate
b) and gate
c) nor gate
d) nand gate
Explanation: Form the truth table of above figure by taking inputs as 0 or 1.

45. When both inputs are 1, what will be the output of the above figure?
a) 0
b) 1
c) either 0 or 1
d) z
Explanation: Check the truth table of nor gate.

46. When both inputs are different, what will be the output of the above figure?
a) 0
b) 1
c) either 0 or 1
d) z
Explanation: Check the truth table of nor gate.

47. Which of the following models has the ability to learn?
a) Pitts model
b) Rosenblatt perceptron model
c) both Rosenblatt and Pitts models
d) neither Rosenblatt nor Pitts
Explanation: Weights are fixed in the Pitts model but adjustable in the Rosenblatt model.

48. When both inputs are 1, what will be the output of the Pitts model NAND gate?
a) 0
b) 1
c) either 0 or 1
d) z
Explanation: Check the truth table of a NAND gate.
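A fixed-weight threshold unit in the spirit of the McCulloch-Pitts model can reproduce the NAND truth table. The weights (-1, -1) and bias 1.5 below are one hypothetical choice, not taken from the missing figure; other values work as well.

```python
def threshold_unit(inputs, weights, bias):
    # Fires (outputs 1) when the weighted sum plus bias is non-negative.
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total >= 0 else 0


# Hypothetical fixed parameters that realize NAND.
NAND_WEIGHTS = (-1, -1)
NAND_BIAS = 1.5

truth_table = {
    (a, b): threshold_unit((a, b), NAND_WEIGHTS, NAND_BIAS)
    for a in (0, 1)
    for b in (0, 1)
}
for pair, out in truth_table.items():
    print(pair, out)
```

Because the weights are fixed constants rather than learned values, this unit computes NAND but has no learning ability of its own.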

49. When both inputs are different, what will be the logical output of the figure of question 44?
a) 0
b) 1
c) either 0 or 1
d) z
Explanation: Check the truth table of nor gate.

50. Does the McCulloch-Pitts model have the ability to learn?
a) yes
b) no
Explanation: Weights are fixed.

#### Module 03

1. In descent methods, the particular choice of search direction does not matter so much.
a. True.
b. False.

2. In descent methods, the particular choice of line search does not matter so much.
a. True.
b. False.

3. When the gradient descent method is started from a point near the solution, it will converge very quickly.
a. True.
b. False.

4. Newton’s method with step size $h=1$ always works.
a. True.
b. False.

5. When Newton’s method is started from a point near the solution, it will converge very quickly.
a. True.
b. False.

6. Using Newton’s method to minimize $f(Ty)$, where $Ty=x$ and $T$ is nonsingular, can greatly improve the convergence speed when $T$ is chosen appropriately.
a. True.
b. False.

7. If $f$ is self-concordant, its Hessian is Lipschitz continuous.
a. True.
b. False.

8. If the Hessian of $f$ is Lipschitz continuous, then $f$ is self-concordant.
a. True.
b. False.


9. Newton’s method should only be used to minimize self-concordant functions.
a. True.
b. False.

10. $f(x) = \exp x$ is self-concordant.
a. True.
b. False.

11. $f(x) = -\log x$ is self-concordant.
a. True.
b. False.

12. Consider the problem of minimizing $f(x) = (c^Tx)^4 + \sum_{i=1}^n w_i \exp x_i,$ over $x \in \mathbf{R}^n$, where $w \succ 0$.
Newton’s method would probably require fewer iterations than the gradient method, but each iteration would be much more costly.
a. True.
b. False.

13. Newton’s method is seldom used in machine learning because
a. common loss functions are not self-concordant
b. Newton’s method does not work well on noisy data
c. machine learning researchers don’t really understand linear algebra
d. it is generally not practical to form or store the Hessian in such problems, due to large problem size

#### Module 04

1. In practice, Line of best fit or regression line is found when _____________
a) Sum of residuals (∑(Y – h(X))) is minimum
b) Sum of the absolute value of residuals (∑|Y-h(X)|) is maximum
c) Sum of the square of residuals (∑(Y – h(X))²) is minimum
d) Sum of the square of residuals (∑(Y – h(X))²) is maximum
Explanation: Here we penalize higher error value much more as compared to the smaller one, such that there is a significant difference between making big errors and small errors, which makes it easy to differentiate and select the best fit line.
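A small sketch of why the plain sum of residuals is not a usable criterion: positive and negative residuals cancel, so clearly different lines can all have a sum of zero, while the sum of squares separates them. The three data points and the candidate lines are hypothetical.

```python
points = [(1, 1), (2, 2), (3, 3)]  # hypothetical data lying exactly on y = x


def residuals(h):
    # Residual y - h(x) for every data point.
    return [y - h(x) for x, y in points]


candidates = {
    "y = x": lambda x: x,
    "y = 2": lambda x: 2,
    "y = 4 - x": lambda x: 4 - x,
}

summary = {
    name: (sum(residuals(h)), sum(r ** 2 for r in residuals(h)))
    for name, h in candidates.items()
}
for name, (plain_sum, squared_sum) in summary.items():
    print(f"{name:10s} sum of residuals={plain_sum}  sum of squares={squared_sum}")
```

All three lines have a residual sum of zero, but only the line through the points has a zero sum of squares.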

2. If a linear regression model fits perfectly, i.e., train error is zero, then _____________________
a) Test error is also always zero
b) Test error is non zero
c) Couldn’t comment on Test error
d) Test error is equal to Train error
Explanation: Test Error depends on the test data. If the Test data is an exact representation of train data then test error is always zero. But this may not be the case.

3. Which of the following metrics can be used for evaluating regression models?
i) R Squared ii) Adjusted R Squared iii) F Statistics iv) RMSE / MSE / MAE
a) ii and iv
b) i and ii
c) ii, iii and iv
d) i, ii, iii and iv
Explanation: These (R Squared, Adjusted R Squared, F Statistics, RMSE / MSE / MAE) are some metrics which you can use to evaluate your regression model.

4. How many coefficients do you need to estimate in a simple linear regression model (One independent variable)?
a) 1
b) 2
c) 3
d) 4
Explanation: In simple linear regression, there is one independent variable so 2 coefficients (Y=a+bx+error).

5. In a simple linear regression model (one independent variable), if we change the input variable by 1 unit, how much will the output variable change?
a) by 1
b) no change
c) by intercept
d) by its slope
Explanation: For linear regression, Y = a + bx + error. Neglecting the error, Y = a + bx. If x increases by 1, then Y = a + b(x+1) = a + bx + b, so Y changes by the slope b.
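The same algebra can be checked numerically. The intercept, slope, and starting input below are hypothetical values chosen for the example.

```python
def predict(x, a, b):
    # Simple linear regression without the error term: Y = a + b*x.
    return a + b * x


a, b = 3.0, 0.5  # hypothetical intercept and slope
x = 10.0
change = predict(x + 1, a, b) - predict(x, a, b)
print(change)  # the change equals the slope b
```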

6. Function used for linear regression in R is __________
a) lm(formula, data)
b) lr(formula, data)
c) lrm(formula, data)
d) regression.linear(formula, data)
Explanation: lm(formula, data) fits a linear model in which formula is an object of class “formula” representing the relation between variables. The formula is then applied to the data to create a relationship model.

7. In syntax of linear model lm(formula,data,..), data refers to ______
a) Matrix
b) Vector
c) Array
d) List
Explanation: The formula is a symbolic description of the relationship and is applied to data, which can be a vector. In general, a data.frame is used for data.

8. In the mathematical Equation of Linear Regression Y = β1 + β2X + ϵ, (β1, β2) refers to __________
a) (X-intercept, Slope)
b) (Slope, X-Intercept)
c) (Y-Intercept, Slope)
d) (slope, Y-Intercept)
Explanation: Y-intercept is β1 and X-intercept is – (β1 / β2). Intercepts are defined for axis and formed when the coordinates are on the axis.

9) Suppose you are given two variables V1 and V2 with the following two characteristics.
1. If V1 increases then V2 also increases
2. If V1 decreases then V2 behavior is unknown
Looking at these two characteristics, which of the following options is correct for the Pearson correlation between V1 and V2?
A) Pearson correlation will be close to 1
B) Pearson correlation will be close to -1
C) Pearson correlation will be close to 0
D) None of these

10) Suppose Pearson correlation between V1 and V2 is zero. In such case, is it right to conclude that V1 and V2 do not have any relation between them?
A) TRUE
B) FALSE

Learn Machine Learning with Python from Scratch

11) Which of the following offsets, do we use in linear regression’s least square line fit? Suppose horizontal axis is independent variable and vertical axis is dependent variable.

A) Vertical offset
B) Perpendicular offset
C) Both, depending on the situation
D) None of above

12) True- False: Overfitting is more likely when you have huge amount of data to train?
A) TRUE
B) FALSE

13) We can also compute the coefficient of linear regression with the help of an analytical method called “Normal Equation”. Which of the following is/are true about Normal Equation?
1. We don’t have to choose the learning rate
2. It becomes slow when the number of features is very large
3. There is no need to iterate
A) 1 and 2
B) 1 and 3
C) 2 and 3
D) 1,2 and 3
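For a single feature, the normal equations reduce to a closed form, which makes the "no learning rate, no iteration" properties easy to see. The sketch below assumes a model y = intercept + slope * x. The function name and data are hypothetical.

```python
def fit_simple_linear_regression(xs, ys):
    """Solve the normal equations for y = intercept + slope * x in closed form."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and cross-products around the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data lying exactly on y = 1 + 2x.
intercept, slope = fit_simple_linear_regression(
    [0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]
)
print(intercept, slope)
```

With many features the same idea requires solving a linear system whose size grows with the number of features, which is why the method slows down in that case.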

14) The graphs below show two fitted regression lines (A and B) on randomly generated data. Now, I want to find the sum of residuals in both cases A and B. Which of the following statements is true about the sum of residuals of A and B?
Note:
The scale is the same in both graphs for both axes.
The X-axis is the independent variable and the Y-axis is the dependent variable.

A) A has higher sum of residuals than B
B) A has lower sum of residual than B
C) Both have same sum of residuals
D) None of these

15) Choose the option which describes bias in best manner.
A) In case of very large x; bias is low
B) In case of very large x; bias is high
C) We can’t say about bias
D) None of these

16) What will happen when you apply very large penalty?
A) Some of the coefficient will become absolute zero
B) Some of the coefficient will approach zero but not absolute zero
C) Both A and B depending on the situation
D) None of these

17) What will happen when you apply very large penalty in case of Lasso?
A) Some of the coefficient will become zero
B) Some of the coefficient will be approaching to zero but not absolute zero
C) Both A and B depending on the situation
D) None of these

18) Which of the following statement is true about outliers in Linear regression?
A) Linear regression is sensitive to outliers
B) Linear regression is not sensitive to outliers
C) Can’t say
D) None of these

19) Suppose you plotted a scatter plot between the residuals and predicted values in linear regression and you found that there is a relationship between them. Which of the following conclusion do you make about this situation?
A) Since there is a relationship, our model is not good
B) Since there is a relationship, our model is good
C) Can’t say
D) None of these

20) What will happen when you fit degree 4 polynomial in linear regression?
A) There are high chances that degree 4 polynomial will over fit the data
B) There are high chances that degree 4 polynomial will under fit the data
C) Can’t say
D) None of these

21) What will happen when you fit degree 2 polynomial in linear regression?
A) There are high chances that degree 2 polynomial will over fit the data
B) There are high chances that degree 2 polynomial will under fit the data
C) Can’t say
D) None of these

22) In terms of bias and variance. Which of the following is true when you fit degree 2 polynomial?
A) Bias will be high, variance will be high
B) Bias will be low, variance will be high
C) Bias will be high, variance will be low
D) Bias will be low, variance will be low

23) Suppose l1, l2 and l3 are the three learning rates for A,B,C respectively. Which of the following is true about l1,l2 and l3?

A) l2 < l1 < l3
B) l1 > l2 > l3
C) l1 = l2 = l3
D) None of these

24) Now we increase the training set size gradually. As the training set size increases, what do you expect will happen with the mean training error?
A) Increase
B) Decrease
C) Remain constant
D) Can’t Say

25) What do you expect will happen with bias and variance as you increase the size of training data?
A) Bias increases and Variance increases
B) Bias decreases and Variance increases
C) Bias decreases and Variance decreases
D) Bias increases and Variance decreases
E) Can’t Say

26) What would be the root mean square training error for this data if you run a Linear Regression model of the form (Y = A0+A1X)?

A) Less than 0
B) Greater than zero
C) Equal to 0
D) None of these

Question Context 27-28:
Suppose you have been given the following scenario for training and validation error for Linear Regression.

| Scenario | Learning Rate | Number of iterations | Training Error | Validation Error |
|---|---|---|---|---|
| 1 | 0.1 | 1000 | 100 | 110 |
| 2 | 0.2 | 600 | 90 | 105 |
| 3 | 0.3 | 400 | 110 | 110 |
| 4 | 0.4 | 300 | 120 | 130 |
| 5 | 0.4 | 250 | 130 | 150 |

27) Which of the following scenarios would give you the right hyper parameter?
A) 1
B) 2
C) 3
D) 4

28) Suppose you got the tuned hyper parameters from the previous question. Now, Imagine you want to add a variable in variable space such that this added feature is important. Which of the following thing would you observe in such case?
A) Training Error will decrease and Validation error will increase
B) Training Error will increase and Validation error will increase
C) Training Error will increase and Validation error will decrease
D) Training Error will decrease and Validation error will decrease
E) None of the above

29) In such situation which of the following options would you consider?
Start introducing polynomial degree variables
Remove some variables
A) 1 and 2
B) 2 and 3
C) 1 and 3
D) 1, 2 and 3

30) Now the situation is the same as in the previous question (under fitting). Which of the following regularization algorithms would you prefer?
A) L1
B) L2
C) Any
D) None of these

31) Which of the following evaluation metrics can not be applied in case of logistic regression output to compare with target?
A) AUC-ROC
B) Accuracy
C) Logloss
D) Mean-Squared-Error

32) One of the very good methods to analyze the performance of Logistic Regression is AIC, which is similar to R-Squared in Linear Regression. Which of the following is true about AIC?
A) We prefer a model with minimum AIC value
B) We prefer a model with maximum AIC value
C) Both but depend on the situation
D) None of these

33) [True-False] Standardisation of features is required before training a Logistic Regression.
A) TRUE
B) FALSE

34) Which of the following algorithms do we use for Variable Selection?
A) LASSO
B) Ridge
C) Both
D) None of these

Context: 35-36
Consider a following model for logistic regression: P (y =1|x, w)= g(w0 + w1x)
where g(z) is the logistic function.
In the above equation, P(y = 1 | x; w) is viewed as a function of x that we can obtain by changing the parameters w.

35) What would be the range of p in such case?
A) (0, inf)
B) (-inf, 0 )
C) (0, 1)
D) (-inf, inf)

36) In the above question, which function would keep p in (0, 1)?
A) logistic function
B) Log likelihood function
C) Mixture of both
D) None of them

37) Suppose you have been given a fair coin and you want to find out the odds of getting heads. Which of the following option is true for such a case?
A) odds will be 0
B) odds will be 0.5
C) odds will be 1
D) None of these

38) The logit function(given as l(x)) is the log of odds function. What could be the range of logit function in the domain x=[0,1]?
A) (– ∞ , ∞)
B) (0,1)
C) (0, ∞)
D) (- ∞, 0)

39) Which of the following option is true?
A) Linear Regression error values have to be normally distributed, but in the case of Logistic Regression this is not so
B) Logistic Regression error values have to be normally distributed, but in the case of Linear Regression this is not so
C) Both Linear Regression and Logistic Regression error values have to be normally distributed
D) Neither Linear Regression nor Logistic Regression error values have to be normally distributed

40) Which of the following is true regarding the logistic function for any value “x”?
Note:
Logistic(x): is a logistic function of any number “x”
Logit(x): is a logit function of any number “x”
Logit_inv(x): is an inverse logit function of any number “x”
A) Logistic(x) = Logit(x)
B) Logistic(x) = Logit_inv(x)
C) Logit_inv(x) = Logit(x)
D) None of these
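The odds, logit and logistic questions above (37 to 40) can be explored with a few lines of code. This is a minimal sketch using the standard definitions odds(p) = p / (1 - p), logit(p) = log(odds(p)) and logistic(z) = 1 / (1 + e^(-z)). The function names are chosen for this example.

```python
import math


def odds(p):
    """Odds of an event with probability p (0 <= p < 1)."""
    return p / (1.0 - p)


def logit(p):
    """Log of the odds; defined for 0 < p < 1."""
    return math.log(odds(p))


def logistic(z):
    """Logistic (sigmoid) function; maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


fair_coin_odds = odds(0.5)

# Applying logistic after logit returns the original probability,
# i.e. the logistic function acts as the inverse of the logit.
round_trip = [logistic(logit(p)) for p in (0.1, 0.5, 0.9)]
print(fair_coin_odds, round_trip)
```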

41. A _________ is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility.
a) Decision tree
b) Graphs
c) Trees
d) Neural Networks
Explanation: Refer to the definition of a decision tree.

42. Decision Tree is a display of an algorithm.
a) True
b) False
Explanation: None.

43. What is Decision Tree?
a) Flow-Chart
b) Structure in which internal node represents test on an attribute, each branch represents outcome of test and each leaf node represents class label
c) Flow-Chart & Structure in which internal node represents test on an attribute, each branch represents outcome of test and each leaf node represents class label
d) None of the mentioned
Explanation: Refer to the definition of a decision tree.

44. Decision Trees can be used for Classification Tasks.
a) True
b) False
Explanation: None.

45. Choose from the following that are Decision Tree nodes?
a) Decision Nodes
b) End Nodes
c) Chance Nodes
d) All of the mentioned
Explanation: None.

46. Decision Nodes are represented by ____________
a) Disks
b) Squares
c) Circles
d) Triangles
Explanation: None.

47. Chance Nodes are represented by __________
a) Disks
b) Squares
c) Circles
d) Triangles
Explanation: None.

48. End Nodes are represented by __________
a) Disks
b) Squares
c) Circles
d) Triangles
Explanation: None.

49. Which of the following are the advantage/s of Decision Trees?
a) Possible Scenarios can be added
b) Use a white box model, If given result is provided by a model
c) Worst, best and expected values can be determined for different scenarios
d) All of the mentioned
Explanation: None.

#### Module 05

1. Instead of representing knowledge in a relatively declarative, static way (as a bunch of things that are true), rule-based systems represent knowledge in terms of ___________ that tell you what you should do or what you could conclude in different situations.
a) Raw Text
b) A bunch of rules
c) Summarized Text
d) Collection of various Texts
Explanation: None.

2. A rule-based system consists of a bunch of IF-THEN rules.
a) True
b) False
Explanation: None.

3. In a backward chaining system you start with the initial facts, and keep using the rules to draw new conclusions (or take certain actions) given those facts.
a) True
b) False
Explanation: The statement describes forward chaining. Refer to the definition of backward chaining.

4. In a backward chaining system, you start with some hypothesis (or goal) you are trying to prove, and keep looking for rules that would allow you to conclude that hypothesis, perhaps setting new sub-goals to prove as you go.
a) True
b) False
Explanation: None.

5. Forward chaining systems are _____________ where as backward chaining systems are ___________
a) Goal-driven, goal-driven
b) Goal-driven, data-driven
c) Data-driven, goal-driven
d) Data-driven, data-driven
Explanation: None.
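The contrast between the two strategies can be sketched on a toy rule base. The rules, fact names and function names below are hypothetical and chosen only for illustration. Each rule is a pair (premises, conclusion) in IF-THEN form.

```python
def forward_chain(rules, facts):
    """Data-driven: start from known facts and fire rules until nothing new is derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and all(p in known for p in premises):
                known.add(conclusion)
                changed = True
    return known


def backward_chain(rules, facts, goal, seen=None):
    """Goal-driven: try to prove the goal by proving the premises of a matching rule."""
    seen = set() if seen is None else seen
    if goal in facts:
        return True
    if goal in seen:  # guard against circular rules on the current path
        return False
    seen = seen | {goal}
    for premises, conclusion in rules:
        if conclusion == goal and all(
            backward_chain(rules, facts, p, seen) for p in premises
        ):
            return True
    return False


# Hypothetical IF-THEN rules: IF premises THEN conclusion.
rules = [
    (["rain"], "wet_ground"),
    (["wet_ground", "cold"], "ice"),
]
facts = {"rain", "cold"}

derived = forward_chain(rules, facts)
ice_proved = backward_chain(rules, facts, "ice")
snow_proved = backward_chain(rules, facts, "snow")
print(sorted(derived), ice_proved, snow_proved)
```

Forward chaining derives everything that follows from the facts, whereas backward chaining only explores rules relevant to the goal it was asked about.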

6. A Horn clause is a clause with _______ positive literal.
a) At least one
b) At most one
c) None
d) All
Explanation: Refer to the definition of Horn Clauses.

7. ___________ trees can be used to infer in Horn clause systems.
a) Min/Max Tree
b) And/Or Trees
c) Minimum Spanning Trees
d) Binary Search Trees
Explanation: Take the analogy using min/max trees in game theory.

8. An expert system is a computer program that contains some of the subject-specific knowledge of one or more human experts.
a) True
b) False
Explanation: None.

9. A knowledge engineer has the job of extracting knowledge from an expert and building the expert system knowledge base.
a) True
b) False
Explanation: None.

10. What is needed to make probabilistic systems feasible in the world?
a) Reliability
b) Crucial robustness
c) Feasibility
d) None of the above
Explanation: Model-based knowledge provides the crucial robustness needed to make probabilistic systems feasible in the real world.

11. How many terms are required for building a bayes model?
a) 1
b) 2
c) 3
d) 4
Explanation: The three required terms are a conditional probability and two unconditional probabilities.
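The three terms can be seen directly in Bayes' rule, P(A|B) = P(B|A) P(A) / P(B): one conditional probability and two unconditional ones. The sketch below uses hypothetical probabilities picked only to make the arithmetic easy to follow.

```python
def bayes_posterior(p_b_given_a, p_a, p_b):
    """Bayes' rule: P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b


# Hypothetical inputs: one conditional and two unconditional probabilities.
posterior = bayes_posterior(p_b_given_a=0.9, p_a=0.01, p_b=0.05)
print(posterior)
```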

12. What is needed to make probabilistic systems feasible in the world?
a) Reliability
b) Crucial robustness
c) Feasibility
d) None of the mentioned
Explanation: Model-based knowledge provides the crucial robustness needed to make probabilistic systems feasible in the real world.

13. Where can Bayes’ rule be used?
a) Solving queries
b) Increasing complexity
c) Decreasing complexity
Explanation: Bayes rule can be used to answer the probabilistic queries conditioned on one piece of evidence.

14. What does a Bayesian network provide?
a) Complete description of the domain
b) Partial description of the domain
c) Complete description of the problem
d) None of the mentioned
Explanation: A Bayesian network provides a complete description of the domain.

15. How can the entries in the full joint probability distribution be calculated?
a) Using variables
b) Using information
c) Both Using variables & information
d) None of the mentioned
Explanation: Every entry in the full joint probability distribution can be calculated from the information in the network.

16. How can a Bayesian network be used to answer any query?
a) Full distribution
b) Joint distribution
c) Partial distribution
d) All of the mentioned
Explanation: If a bayesian network is a representation of the joint distribution, then it can solve any query, by summing all the relevant joint entries.
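To make "summing all the relevant joint entries" concrete, here is a minimal sketch of a two-node network Rain -> WetGrass. The variable names and probability tables are invented for this example. Each joint entry is the product of the local conditional probabilities, and a query is answered by summing joint entries consistent with the evidence.

```python
# Hypothetical network: Rain -> WetGrass, with made-up probabilities.
p_rain = {True: 0.2, False: 0.8}
p_wet_given_rain = {
    True: {True: 0.9, False: 0.1},   # P(WetGrass | Rain = True)
    False: {True: 0.1, False: 0.9},  # P(WetGrass | Rain = False)
}


def joint(rain, wet):
    """One entry of the full joint distribution, computed from the network."""
    return p_rain[rain] * p_wet_given_rain[rain][wet]


def query_rain_given_wet(wet=True):
    """P(Rain = True | WetGrass = wet), by summing the relevant joint entries."""
    numerator = joint(True, wet)
    evidence = sum(joint(rain, wet) for rain in (True, False))
    return numerator / evidence


total = sum(joint(r, w) for r in (True, False) for w in (True, False))
posterior_rain = query_rain_given_wet(True)
print(total, posterior_rain)
```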

17. How can the compactness of the Bayesian network be described?
a) Locally structured
b) Fully structured
c) Partial structure
d) All of the mentioned
Explanation: The compactness of the bayesian network is an example of a very general property of a locally structured system.

18. With which is local structure associated?
a) Hybrid
b) Dependent
c) Linear
d) None of the mentioned
Explanation: Local structure is usually associated with linear rather than exponential growth in complexity.

19. Which condition is used to influence a variable directly by all the others?
a) Partially connected
b) Fully connected
c) Local connected
d) None of the mentioned
Explanation: None.

20. What is the relationship between a node and its predecessors when creating a Bayesian network?
a) Functionally dependent
b) Dependent
c) Conditionally independent
d) Both Conditionally dependent & Dependent
Explanation: The semantics used to derive a method for constructing Bayesian networks lead to the consequence that a node can be conditionally independent of its predecessors.

21. Which algorithm is used for solving temporal probabilistic reasoning?
a) Hill-climbing search
b) Hidden markov model
c) Depth-first search
Explanation: The hidden Markov model is used for temporal probabilistic reasoning, independent of the specific form of the transition and sensor models.

22. How is the state of the process described in an HMM?
a) Literal
b) Single random variable
c) Single discrete random variable
d) None of the mentioned
Explanation: An HMM is a temporal probabilistic model in which the state of the process is described by a single discrete random variable.

23. What are the possible values of the variable?
a) Variables
b) Literals
c) Discrete variable
d) Possible states of the world
Explanation: The possible values of the variables are the possible states of the world.

24. Where are additional state variables added in an HMM?
a) Temporal model
b) Reality model
c) Probability model
d) All of the mentioned
Explanation: Additional state variables can be added to a temporal model while staying within the HMM framework.

25. Which allows for a simple matrix implementation of all the basic algorithms?
a) HMM
b) Restricted structure of HMM
c) Temporary model
d) Reality model
Explanation: The restricted structure of the HMM allows for a very simple and elegant matrix implementation of all the basic algorithms.

26. Where is the Hidden Markov Model used?
a) Speech recognition
b) Understanding of real world
c) Both Speech recognition & Understanding of real world
d) None of the mentioned
Explanation: None.

27. Which variable can give the concrete form to the representation of the transition model?
a) Single variable
b) Discrete state variable
c) Random variable
d) Both Single & Discrete state variable
Explanation: With a single, discrete state variable, we can give concrete form to the representation of the transition model.

28. Which algorithm works by first running the standard forward pass to compute?
a) Smoothing
b) Modified smoothing
c) HMM
d) Depth-first search algorithm
Explanation: The modified smoothing algorithm works by first running the standard forward pass to compute and then running the backward pass.

29. Which reveals an improvement in online smoothing?
a) Matrix formulation
b) Revelation
c) HMM
d) None of the mentioned
Explanation: Matrix formulation reveals an improvement in online smoothing with a fixed lag.

30. Which suggests the existence of an efficient recursive algorithm for online smoothing?
a) Matrix
b) Constant space
c) Constant time
d) None of the mentioned
Explanation: None.

31. The Expectation Maximization algorithm has been used to identify conserved domains in unaligned proteins only.
a) True
b) False
Explanation: This algorithm has been used to identify both conserved domains in unaligned proteins and protein-binding sites in unaligned DNA sequences (Lawrence and Reilly 1990), including sites that may include gaps (Cardon and Stormo 1992). Given are a set of sequences that are expected to have a common sequence pattern and may not be easily recognizable by eye.

32. Which of the following is untrue regarding Expectation Maximization algorithm?
a) An initial guess is made as to the location and size of the site of interest in each of the sequences, and these parts of the sequence are aligned
b) The alignment provides an estimate of the base or amino acid composition of each column in the site
c) The column-by-column composition of the site already available is used to estimate the probability of finding the site at any position in each of the sequences
d) The row-by-column composition of the site already available is used to estimate the probability
Explanation: The EM algorithm then consists of two steps, which are repeated consecutively. In step 1, the expectation step, the column-by-column composition of the site already available is used to estimate the probability of finding the site at any position in each of the sequences. These probabilities are used in turn to provide new information as to the expected base or amino acid distribution for each column in the site.

33. Out of the two repeated steps in EM algorithm, the step 2 is ________
a) the maximization step
b) the minimization step
c) the optimization step
d) the normalization step
Explanation: In step 2, the maximization step, the new counts of bases or amino acids for each position in the site found in step 1 are substituted for the previous set. Step 1 is then repeated using these new counts. The cycle is repeated until the algorithm converges on a solution and does not change with further cycles. At that time, the best location of the site in each sequence and the best estimate of the residue composition of each column in the site will be available.

34. In the EM algorithm, as an example, suppose that there are 10 DNA sequences having very little similarity with each other, each about 100 nucleotides long and thought to contain a binding site near the middle 20 residues, based on biochemical and genetic evidence. The following steps would be used by the EM algorithm to find the most probable location of the binding sites in each of the ______ sequences.
a) 30
b) 10
c) 25
d) 20
Explanation: When examining the EM program MEME, the size and number of binding sites, the location in each sequence, and whether or not the site is present in each sequence do not necessarily have to be known. For the present example, the following steps would be used by the EM algorithm to find the most probable location of the binding sites in each of the 10 sequences.

35. In the initial step of EM algorithm, the 20-residue-long binding motif patterns in each sequence are aligned as an initial guess of the motif.
a) True
b) False
Explanation: The base composition of each column in the aligned patterns is then determined. The composition of the flanking sequence on each side of the site provides the surrounding base or amino acid composition for comparison. Each sequence is assumed to be the same length and to be aligned by the ends.

36. In the intermediate steps of EM algorithm, the number of each base in each column is determined and then converted to fractions.
a) True
b) False
Explanation: For example, if there are four Gs in the first column of the 10 sequences, then the frequency of G in the first column of the site is fSG = 4/10 = 0.4. This procedure is repeated for each base and each column.

37. For the example of 10 DNA sequences, each 100 residues long, there are _______ possible starting sites for a 20-residue-long site.
a) 30
b) 21
c) 81
d) 60
Explanation: In the example of 10 DNA sequences, each 100 residues long, there are 100 – 20 + 1 = 81 possible starting sites for a 20-residue-long site. The first site starts at position 1 and ends at position 20, and the last starts at position 81 and ends at position 100 (there is not enough sequence for a 20-residue-long site beyond position 81).
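The count of starting sites follows from sliding a fixed-width window along the sequence. The sketch below enumerates the windows with 1-based positions, using the sequence length of 100 and site length of 20 from the example. The function names are made up for illustration.

```python
def count_start_positions(sequence_length, site_length):
    """Number of positions where a site of the given length fits in the sequence."""
    return sequence_length - site_length + 1


def site_windows(sequence_length, site_length):
    """All (start, end) pairs, 1-based and inclusive, for a sliding site window."""
    last_start = count_start_positions(sequence_length, site_length)
    return [(start, start + site_length - 1) for start in range(1, last_start + 1)]


n_sites = count_start_positions(100, 20)
windows = site_windows(100, 20)
print(n_sites, windows[0], windows[-1])
```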

38. An alternative method is to produce an odds scoring matrix calculated by dividing each base frequency by the background frequency of that base.
a) True
b) False
Explanation: In this method, the probability of each location is then found by multiplying the odds scores from each column. An even simpler method is to use log odds scores in the matrix. The column scores are then simply added. In this case, the log odds scores must be converted to odds scores before position probabilities are calculated.
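The column frequencies and odds scores described in the last few questions can be sketched as follows. The aligned sites and the uniform background of 0.25 per base are hypothetical, and the sketch omits pseudocounts, so a base that never appears in a column would give a zero frequency (and an undefined log odds score).

```python
import math
from collections import Counter


def column_frequencies(sites):
    """Fraction of each base in every column of equally long aligned sites."""
    n = len(sites)
    width = len(sites[0])
    freqs = []
    for col in range(width):
        counts = Counter(site[col] for site in sites)
        freqs.append({base: counts.get(base, 0) / n for base in "ACGT"})
    return freqs


def odds_score(window, freqs, background):
    """Multiply per-column odds: column frequency divided by background frequency."""
    score = 1.0
    for col, base in enumerate(window):
        score *= freqs[col][base] / background[base]
    return score


def log_odds_score(window, freqs, background):
    """Same score in log space, where column scores are added instead of multiplied."""
    return sum(
        math.log(freqs[col][base] / background[base])
        for col, base in enumerate(window)
    )


# Ten hypothetical aligned 2-residue sites; four of them start with G.
sites = ["GA"] * 4 + ["AT"] * 3 + ["CA"] * 2 + ["TA"]
background = {base: 0.25 for base in "ACGT"}

freqs = column_frequencies(sites)
score = odds_score("GA", freqs, background)
log_score = log_odds_score("GA", freqs, background)
print(freqs[0]["G"], score, math.exp(log_score))
```

Exponentiating the summed log odds score recovers the odds score, which is the conversion the explanation refers to.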

39. Which of the following about MEME is untrue?
a) The program MEME is a Web resource for performing local MSAs (Multiple Sequence Alignment) by the above expectation maximization method
b) It stands for Multiple EM for Motif Elicitation
c) It was developed at the University of California at San Diego Supercomputing Center
d) The Web page has multiple versions for searching blocks by an EM algorithm
Explanation: The Web page offers two versions of MEME, ParaMEME, a Web program that searches for blocks by an EM algorithm (described below), and a similar program MetaMEME (which searches for profiles using HMMs, described below), as well as the Motif Alignment and Search Tool (MAST) for searching through databases for matches to motifs.

40. Which of the following about the Gibbs sampler is untrue?
a) It is a statistical method for finding motifs in sequences
b) It is dissimilar to the principle of the EM method
c) It searches for the statistically most probable motifs
d) It can find the optimal width and number of given motifs in each sequence
Explanation: The Gibbs sampler is another statistical method for finding motifs in sequences. The method is similar in principle to the EM method described above, but the algorithm is different. A combinatorial approach of the Gibbs sampler and MOTIF may be used to make blocks at the BLOCKS Web site.

41. Bayesian Belief Network is also known as ?
A. belief network
B. decision network
C. Bayesian model
D. All of the above
Explanation: A Bayesian Belief Network is also called a Bayes network, belief network, decision network, or Bayesian model.

42. A Bayesian Network consists of how many parts?
A. 2
B. 3
C. 4
D. 5
Explanation: Bayesian Network can be used for building models from data and experts opinions, and it consists of two parts: Directed Acyclic Graph and Table of conditional probabilities.

43. The generalized form of Bayesian network that represents and solve decision problems under uncertain knowledge is known as an?
A. Directed Acyclic Graph
B. Table of conditional probabilities
C. Influence diagram
D. None of the above
Explanation: The generalized form of Bayesian network that represents and solve decision problems under uncertain knowledge is known as an Influence diagram

44. How many components does a Bayesian network have?
A. 2
B. 3
C. 4
D. 5
Explanation: The Bayesian network has mainly two components: Causal Component and Actual numbers

45. The Bayesian network graph does not contain any cyclic graph. Hence, it is known as a
A. DCG
B. DAG
C. CAG
D. SAG
Explanation: The Bayesian network graph does not contain any cyclic graph. Hence, it is known as a directed acyclic graph or DAG.

46. In a Bayesian network, a variable is?
A. continuous
B. discrete
C. Both A and B
D. None of the above
Explanation: Each node corresponds to the random variables, and a variable can be continuous or discrete.

47. If we have variables x1, x2, x3,….., xn, then the probabilities of a different combination of x1, x2, x3.. xn, are known as?
A. Table of conditional probabilities
B. Causal Component
C. Actual numbers
D. Joint probability distribution
Explanation: If we have variables x1, x2, x3,….., xn, then the probabilities of a different combination of x1, x2, x3.. xn, are known as Joint probability distribution.

48. The nodes and links form the structure of the Bayesian network, and we call this the ?
A. structural specification
B. multi-variable nodes
C. Conditional Linear Gaussian distributions
D. None of the above
Explanation: The nodes and links form the structure of the Bayesian network, and we call this the structural specification.

49. Which of the following are used for modeling time series and sequences?
A. Decision graphs
B. Dynamic Bayesian networks
C. Value of information
D. Parameter tuning
Explanation: Dynamic Bayesian networks (DBNs) are used for modeling time series and sequences.

50. How many terms are required for building a bayes model?
A. 1
B. 2
C. 3
D. 4
Explanation: The three required terms are a conditional probability and two unconditional probabilities.

51) Which of the following algorithms cannot be used for reducing the dimensionality of data?
A. t-SNE
B. PCA
C. LDA
D. None of these

52) [ True or False ] PCA can be used for projecting and visualizing data in lower dimensions.
A. TRUE
B. FALSE

53) The most popularly used dimensionality reduction algorithm is Principal Component Analysis (PCA). Which of the following is/are true about PCA?
1. PCA is an unsupervised method
2. It searches for the directions in which the data have the largest variance
3. Maximum number of principal components <= number of features
4. All principal components are orthogonal to each other
A. 1 and 2
B. 1 and 3
C. 2 and 3
D. 1, 2 and 3
E. 1,2 and 4
F. All of the above

54) Suppose we are using dimensionality reduction as pre-processing technique, i.e, instead of using all the features, we reduce the data to k dimensions with PCA. And then use these PCA projections as our features. Which of the following statement is correct?
A. Higher ‘k’ means more regularization
B. Higher ‘k’ means less regularization
C. Can’t Say

55) In which of the following scenarios is t-SNE better to use than PCA for dimensionality reduction while working on a local machine with minimal computational power?
A. Dataset with 1 Million entries and 300 features
B. Dataset with 100000 entries and 310 features
C. Dataset with 10,000 entries and 8 features
D. Dataset with 10,000 entries and 200 features

56) Which of the following statement is true for a t-SNE cost function?
A. It is asymmetric in nature.
B. It is symmetric in nature.
C. It is same as the cost function for SNE.

57) Imagine you are dealing with text data. To represent the words you are using word embedding (Word2vec). In word embedding, you will end up with 1000 dimensions. Now, you want to reduce the dimensionality of this high dimensional data such that similar words should have a similar meaning in nearest neighbor space. In such a case, which of the following algorithms are you most likely to choose?
A. t-SNE
B. PCA
C. LDA
D. None of these

58) [True or False] t-SNE learns non-parametric mapping.
A. TRUE
B. FALSE

59) Which of the following statement is correct for t-SNE and PCA?
A. t-SNE is linear whereas PCA is non-linear
B. t-SNE and PCA both are linear
C. t-SNE and PCA both are nonlinear
D. t-SNE is nonlinear whereas PCA is linear

60) In t-SNE algorithm, which of the following hyper parameters can be tuned?
A. Number of dimensions
B. Smooth measure of effective number of neighbours
C. Maximum number of iterations
D. All of the above

61) The minimum time complexity for training an SVM is O(n²). According to this fact, what sizes of datasets are not best suited for SVM’s?
A) Large datasets
B) Small datasets
C) Medium sized datasets
D) Size does not matter

62) The effectiveness of an SVM depends upon:
A) Selection of Kernel
B) Kernel Parameters
C) Soft Margin Parameter C
D) All of the above

63) Support vectors are the data points that lie closest to the decision surface.
A) TRUE
B) FALSE

64) The SVM’s are less effective when:
A) The data is linearly separable
B) The data is clean and ready to use
C) The data is noisy and contains overlapping points

65) Suppose you are using RBF kernel in SVM with high Gamma value. What does this signify?
A) The model would consider even far away points from hyperplane for modeling
B) The model would consider only the points close to the hyperplane for modeling
C) The model would not be affected by distance of points from hyperplane for modeling
D) None of the above

66) The cost parameter in the SVM means:
A) The number of cross-validations to be made
B) The kernel to be used
C) The tradeoff between misclassification and simplicity of the model
D) None of the above

67) Suppose you are building an SVM model on data X. The data X can be error prone, which means that you should not trust any specific data point too much. Now suppose you want to build an SVM model with a quadratic kernel function (polynomial degree 2) that uses the slack variable C as one of its hyper parameters. Based on that, answer the following question.
What would happen when you use a very large value of C (C->infinity)?
Note: For small C, the model was also classifying all data points correctly
A) We can still classify data correctly for given setting of hyper parameter C
B) We can not classify data correctly for given setting of hyper parameter C
C) Can’t Say
D) None of these

68) What would happen when you use very small C (C~0)?
A) Misclassification would happen
B) Data will be correctly classified
C) Can’t say
D) None of these

69) If I am using all features of my dataset and I achieve 100% accuracy on my training set, but ~70% on validation set, what should I look out for?
A) Underfitting
B) Nothing, the model is perfect
C) Overfitting

70) Which of the following are real world applications of the SVM?
A) Text and Hypertext Categorization
B) Image Classification
C) Clustering of News Articles
D) All of the above

Question Context: 71 – 72
Suppose you have trained an SVM with a linear decision boundary. After training the SVM, you correctly infer that your SVM model is under fitting.

71) Which of the following options would you be more likely to consider when iterating on the SVM next time?
A) You want to increase your data points
B) You want to decrease your data points
C) You will try to calculate more variables
D) You will try to reduce the features

72) Suppose you gave the correct answer in previous question. What do you think that is actually happening?
1. We are lowering the bias
2. We are lowering the variance
3. We are increasing the bias
4. We are increasing the variance
A) 1 and 2
B) 2 and 3
C) 1 and 4
D) 2 and 4

73) In the above question, suppose you want to change one of its (SVM) hyperparameters so that the effect would be the same as in the previous questions, i.e. the model will not under fit?
A) We will increase the parameter C
B) We will decrease the parameter C
C) Changing C has no effect
D) None of these

74) We usually use feature normalization before using the Gaussian kernel in SVM. What is true about feature normalization?
1. We do feature normalization so that the new feature will dominate the others
2. Sometimes, feature normalization is not feasible in case of categorical variables
3. Feature normalization always helps when we use Gaussian kernel in SVM
A) 1
B) 1 and 2
C) 1 and 3
D) 2 and 3

Question Context: 75
Suppose you are dealing with 4 class classification problem and you want to train a SVM model on the data for that you are using One-vs-all method. Now answer the below questions?

75) How many times we need to train our SVM model in such case?
A) 1
B) 2
C) 3
D) 4
Solution: D
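The count in the solution above follows from how One-vs-all relabels the data: each class in turn becomes the positive class and every other class the negative one, giving one binary training problem per class. The sketch below only builds those binary label sets and does not train any SVM. The class names and helper function are hypothetical.

```python
def one_vs_all_labels(labels):
    """Build one binary label list per class: 1 for that class, 0 for all others."""
    classes = sorted(set(labels))
    return {c: [1 if y == c else 0 for y in labels] for c in classes}


# Hypothetical 4-class labels for eight samples.
labels = ["cat", "dog", "bird", "fish", "cat", "fish", "dog", "bird"]

binary_problems = one_vs_all_labels(labels)
n_models = len(binary_problems)  # one binary SVM would be trained per entry
print(n_models, binary_problems["cat"])
```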