Quiznetik

High Performance Computing (HPC) | Set 1

1. A CUDA program is comprised of two primary components: a host and a _____.

Correct : A. gpu kernel

2. The kernel code is identified by the ________ qualifier with a void return type

Correct : B. __global__

3. The kernel code is only callable by the host

Correct : A. true

4. The kernel code is executable on the device and host

Correct : B. false

5. Calling a kernel is typically referred to as _________.

Correct : D. kernel invocation

6. Host codes in a CUDA application can Initialize a device

Correct : A. true

7. Host codes in a CUDA application can Allocate GPU memory

Correct : A. true

8. Host codes in a CUDA application can not Invoke kernels

Correct : B. false

9. CUDA offers the Chevron Syntax to configure and execute a kernel.

Correct : A. true

10. The BlocksPerGrid and ThreadsPerBlock parameters are related to the ________ model supported by CUDA.

Correct : C. thread abstraction
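To make the chevron syntax and its BlocksPerGrid / ThreadsPerBlock parameters concrete, here is a minimal sketch; the kernel name, array, and sizes are invented for illustration, and building it requires nvcc and an NVIDIA GPU:

```cuda
// __global__ marks the kernel: it runs on the device and is launched from the host.
__global__ void scale(float *data, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique global thread id
    data[i] = data[i] * factor;
}

int main(void) {
    const int N = 256;
    float *d_data;
    cudaMalloc((void**)&d_data, N * sizeof(float));

    // Chevron syntax: <<<BlocksPerGrid, ThreadsPerBlock>>>
    // 2 blocks * 128 threads = 256 threads, one per element.
    scale<<<2, 128>>>(d_data, 2.0f);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```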

11. _________ is callable from the device only

Correct : C. __device__

12. ______ is callable from the host

Correct : B. __global__

13. ______ is callable from the host

Correct : A. __host__
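A short sketch of the three function qualifiers from questions 11-13 (the function names here are invented for illustration):

```cuda
// __device__ : runs on the device, callable only from device code.
__device__ float square(float x) { return x * x; }

// __global__ : a kernel; runs on the device, launched from the host.
__global__ void kernel(float *out) {
    out[threadIdx.x] = square((float)threadIdx.x);
}

// __host__ : runs on the host, callable from the host
// (this is also the default for unqualified functions).
__host__ void launch(float *d_out) {
    kernel<<<1, 32>>>(d_out);
}
```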

14. CUDA supports ____________ in which code in a single thread is executed by all other threads.

Correct : C. thread abstraction

15. In CUDA, a single invoked kernel is referred to as a _____.

Correct : C. grid

16. A grid is comprised of ________ of threads.

Correct : A. blocks

17. A block is comprised of multiple _______.

Correct : A. threads
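The grid → blocks → threads hierarchy of questions 15-17 is what each thread uses to compute its unique global index; a minimal sketch (kernel name invented):

```cuda
// A grid is made of blocks; a block is made of threads.
// Each thread derives a unique global index from its coordinates.
__global__ void whoAmI(int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = i;      // guard against out-of-range threads
}

// Launch: a grid of 4 blocks, each block comprising 64 threads:
// whoAmI<<<4, 64>>>(d_out, 256);
```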

18. A solution to the problem of representing the parallelism in an algorithm is

Correct : D. cuda

19. Host codes in a CUDA application can not Reset a device

Correct : B. false

20. Host codes in a CUDA application can Transfer data to and from the device

Correct : A. true

21. Host codes in a CUDA application can not Deallocate memory on the GPU

Correct : B. false

22. Any condition that causes a processor to stall is called a _____.

Correct : A. hazard

23. The time lost due to branch instruction is often referred to as _____.

Correct : C. branch penalty

24. _____ method is used in centralized systems to perform out of order execution.

Correct : B. scoreboarding

25. The computer cluster architecture emerged as an alternative for ____.

Correct : C. super computers

26. NVIDIA CUDA Warp is made up of how many threads?

Correct : D. 32

27. Out-of-order instruction execution is not possible on GPUs.

Correct : B. false

28. CUDA supports programming in ....

Correct : C. c, c++, third party wrappers for java, python, and more

29. FADD, FMAD, FMIN, FMAX are ----- supported by Scalar Processors of NVIDIA GPU.

Correct : A. 32-bit ieee floating point instructions

30. Each streaming multiprocessor (SM) of CUDA hardware has ------ scalar processors (SP).

Correct : D. 8

31. Each NVIDIA GPU has ------ Streaming Multiprocessors

Correct : D. 16

32. CUDA provides ------- warp and thread scheduling. Also, the overhead of thread creation is on the order of ----.

Correct : B. “zero-overhead”, 1 clock

33. Each warp of GPU receives a single instruction and “broadcasts” it to all of its threads. It is a ---- operation.

Correct : B. simt (single instruction multiple thread)

34. Limitations of CUDA Kernel

Correct : B. no recursion, no call stack, no static variable declarations

35. What is Unified Virtual Machine

Correct : A. it is a technique that allows both cpu and gpu to read from a single virtual machine, simultaneously.

36. _______ became the first language specifically designed by a GPU company to facilitate general-purpose computing on ____.

Correct : C. cuda c, gpus.

37. The CUDA architecture consists of --------- for parallel computing kernels and functions.

Correct : D. ptx instruction set architecture

38. CUDA stands for --------, designed by NVIDIA.

Correct : C. compute unified device architecture

39. The host processor spawns multithreaded tasks (or kernels, as they are known in CUDA) onto the GPU device. State true or false.

Correct : A. true

40. The NVIDIA G80 is a ---- CUDA core device, the NVIDIA G200 is a ---- CUDA core device, and the NVIDIA Fermi is a ---- CUDA core device.

Correct : A. 128, 256, 512

41. NVIDIA 8-series GPUs offer -------- .

Correct : A. 50-200 gflops

42. IADD, IMUL24, IMAD24, IMIN, IMAX are ----------- supported by Scalar Processors of NVIDIA GPU.

Correct : B. 32-bit integer instructions

43. CUDA Hardware programming model supports: a) fully general data-parallel architecture; b) General thread launch; c) Global load-store; d) Parallel data cache; e) Scalar architecture; f) Integers, bit operations

Correct : D. a,b,c,d,e,f

44. In CUDA memory model there are following memory types available: a) Registers; b) Local Memory; c) Shared Memory; d) Global Memory; e) Constant Memory; f) Texture Memory.

Correct : C. a, b, c, d, e, f

45. What is the CUDA C equivalent of this general C program: int main(void) { printf("Hello, World!\n"); return 0; }

Correct : B. __global__ void kernel( void ) { } int main( void ) { kernel<<<1,1>>>(); printf("Hello, World!\n"); return 0; }

46. Which function runs on Device (i.e. GPU): a) __global__ void kernel (void ) { } b) int main ( void ) { ... return 0; }

Correct : A. a

47. A simple kernel for adding two integers: __global__ void add( int *a, int *b, int *c ) { *c = *a + *b; } where __global__ is a CUDA C keyword which indicates that:

Correct : A. add() will execute on device, add() will be called from host

48. If variable a is host variable and dev_a is a device (GPU) variable, to allocate memory to dev_a select correct statement:

Correct : C. cudaMalloc( (void**)&dev_a, sizeof(int) )

49. If variable a is host variable and dev_a is a device (GPU) variable, to copy input from variable a to variable dev_a select correct statement:

Correct : B. cudaMemcpy( dev_a, &a, size, cudaMemcpyHostToDevice );
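Putting questions 47-49 together, a complete host program around the add() kernel might look like the following sketch (standard CUDA runtime calls; error checking omitted for brevity):

```cuda
#include <cstdio>

__global__ void add(int *a, int *b, int *c) {
    *c = *a + *b;                      // executes on the device
}

int main(void) {
    int a = 2, b = 7, c = 0;
    int *dev_a, *dev_b, *dev_c;

    // Allocate device memory (question 48).
    cudaMalloc((void**)&dev_a, sizeof(int));
    cudaMalloc((void**)&dev_b, sizeof(int));
    cudaMalloc((void**)&dev_c, sizeof(int));

    // Copy inputs host -> device (question 49).
    cudaMemcpy(dev_a, &a, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, &b, sizeof(int), cudaMemcpyHostToDevice);

    add<<<1,1>>>(dev_a, dev_b, dev_c); // call from host code to device code

    // Copy the result device -> host.
    cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);
    printf("2 + 7 = %d\n", c);

    cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
    return 0;
}
```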

50. What do the triple angle brackets in a statement inside the main function indicate?

Correct : A. a call from host code to device code

51. What makes CUDA code run in parallel?

Correct : D. the first parameter value inside the triple angle brackets (n) indicates execution of the kernel n times in parallel
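To illustrate question 51: launching with <<<N,1>>> runs N copies of the kernel body in parallel, one block each, and blockIdx.x tells each copy which element is its own. A sketch (kernel name invented):

```cuda
// Vector addition, one block per element.
__global__ void addVec(int *a, int *b, int *c) {
    int i = blockIdx.x;        // 0 .. N-1, one value per parallel copy
    c[i] = a[i] + b[i];
}

// Host side (sketch): executes the kernel body N times in parallel.
// addVec<<<N, 1>>>(dev_a, dev_b, dev_c);
```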

52. In ___________, the number of elements to be sorted is small enough to fit into the process's main memory.

Correct : A. internal sorting

53. ______________ algorithms use auxiliary storage (such as tapes and hard disks) for sorting because the number of elements to be sorted is too large to fit into memory.

Correct : C. external sorting

54. ______ can be comparison-based or noncomparison-based.

Correct : B. sorting

55. The fundamental operation of comparison-based sorting is ________.

Correct : A. compare-exchange

56. The complexity of bubble sort is Θ(n²).

Correct : A. true

57. Bubble sort is difficult to parallelize since the algorithm has no concurrency.

Correct : A. true
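While plain bubble sort has no concurrency (question 57), its odd-even transposition variant makes the compare-exchange steps within one phase independent, so it does parallelize. A sketch of one phase as a CUDA kernel (names and launch geometry are invented for illustration):

```cuda
// Odd-even transposition sort: a bubble-sort variant in which all
// compare-exchanges of one phase are independent and can run in parallel.
// The host alternates phase = 0 (even pairs) and phase = 1 (odd pairs), n times.
__global__ void oddEvenPhase(int *a, int n, int phase) {
    int idx = 2 * (blockIdx.x * blockDim.x + threadIdx.x) + phase;
    if (idx + 1 < n && a[idx] > a[idx + 1]) {
        int t = a[idx]; a[idx] = a[idx + 1]; a[idx + 1] = t;  // compare-exchange
    }
}

// Host loop (sketch):
// for (int p = 0; p < n; ++p)
//     oddEvenPhase<<<(n/2 + 255)/256, 256>>>(d_a, n, p & 1);
```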

58. Quicksort is one of the most common sorting algorithms for sequential computers because of its simplicity, low overhead, and optimal average complexity.

Correct : A. true

59. The performance of quicksort depends critically on the quality of the ______.

Correct : B. pivot

60. The complexity of quicksort is O(n log n).

Correct : A. true

61. The main advantage of ______ is that its storage requirement is linear in the depth of the state space being searched.

Correct : B. dfs

62. _____ algorithms use a heuristic to guide search.

Correct : A. bfs (best-first search)

63. If the heuristic is admissible, BFS (best-first search) finds the optimal solution.

Correct : A. true

64. The search overhead factor of the parallel system is defined as the ratio of the work done by the parallel formulation to that done by the sequential formulation

Correct : A. true

65. The critical issue in parallel depth-first search algorithms is the distribution of the search space among the processors.

Correct : A. true

66. Graph search involves a closed list, where the major operation is a _______

Correct : C. lookup

67. Breadth First Search is equivalent to which of the traversal in the Binary Trees?

Correct : C. level-order traversal

68. Time Complexity of Breadth First Search is? (V – number of vertices, E – number of edges)

Correct : A. O(V + E)

69. Which of the following is not an application of Breadth First Search?

Correct : B. when the graph is a linked list

70. In BFS, how many times a node is visited?

Correct : C. equal to the indegree of the node

71. Is Best First Search a searching algorithm used in graphs?

Correct : A. true

72. Which of the following is not a stable sorting algorithm in its typical implementation?

Correct : C. quick sort

73. Which of the following is not true about comparison-based sorting algorithms?

Correct : D. heap sort is not a comparison based sorting algorithm.

74. Mathematically, efficiency is

Correct : A. e = s/p

75. Cost of a parallel system is sometimes referred to as ____

Correct : C. both

76. In the scaling characteristics of parallel programs, Ts is

Correct : B. constant

77. Speedup tends to saturate and efficiency _____ as a consequence of Amdahl’s law.

Correct : C. decreases

78. Speedup obtained when the problem size is _______ linearly with the number of processing elements.

Correct : A. increased

79. The n × n matrix is partitioned among n processors, with each processor storing complete ___ of the matrix.

Correct : A. row

80. cost-optimal parallel systems have an efficiency of ___

Correct : A. 1
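Questions 74-80 rest on the standard performance definitions; with $T_s$ the sequential runtime and $T_p$ the parallel runtime on $p$ processing elements:

```latex
S = \frac{T_s}{T_p}, \qquad
E = \frac{S}{p} = \frac{T_s}{p\,T_p}, \qquad
\mathrm{cost} = p\,T_p \quad \text{(the processor--time product, i.e.\ work)}
```

A parallel system is cost-optimal when $p\,T_p = \Theta(T_s)$, which is equivalent to $E = \Theta(1)$, i.e. an efficiency of 1 up to constant factors.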

81. The n × n matrix is partitioned among n² processors such that each processor owns a _____ element.

Correct : C. single

82. How many basic communication operations are used in matrix-vector multiplication?

Correct : C. 3

83. The DNS algorithm for matrix multiplication uses

Correct : C. 3d partition

84. In pipelined execution, the steps contain

Correct : D. all

85. The cost of the parallel algorithm is higher than the sequential run time by a factor of __

Correct : A. 3/2

86. The load imbalance problem in Parallel Gaussian Elimination: can be alleviated by using a ____ mapping

Correct : B. cyclic

87. A parallel algorithm is evaluated by its runtime as a function of

Correct : D. all

88. For a problem consisting of W units of work, p __ W processors can be used optimally.

Correct : A. <=

89. C(W) __ Θ(W) for optimality (necessary condition).

Correct : D. equals

90. Many interactions in practical parallel programs occur in _____ patterns

Correct : A. well defined

91. Efficient implementation of basic communication operations can improve

Correct : A. performance

92. Efficient use of basic communication operations can reduce

Correct : A. development effort

93. Group communication operations are built using _____ messaging primitives.

Correct : A. point-to-point

94. One processor has a piece of data and needs to send it to everyone; this is

Correct : A. one-to-all broadcast

95. The dual of one-to-all broadcast is

Correct : A. all-to-one reduction

96. Data items must be combined piece-wise and the result made available at

Correct : A. the target processor, finally

97. The simplest way to send p-1 messages from the source to the other p-1 processors

Correct : C. concurrency

98. In an eight-node ring, node ____ is the source of broadcast

Correct : D. 0

99. The processors compute the ______ product of the vector element and the local matrix

Correct : A. local

100. One-to-all broadcast uses

Correct : A. recursive doubling