Datastage Interview Questions and Answers
If you are interested in IBM InfoSphere Information Server and want to build a career around it, we have curated a list of the most frequently asked Datastage Interview Questions to help you crack the job interview.
DataStage is a leading ETL product in the Business Intelligence industry. It allows users to integrate data across multiple systems while processing large volumes of data in parallel. Datastage has a user-friendly interface for designing jobs that collect, validate, transform, manage, and load data from various sources.
Most Frequently Asked Datastage Interview Questions
Technically, Datastage is an ETL tool used to extract data from a source, transform it, and load it into a target. It is the data integration component of IBM InfoSphere Information Server.
Uses of Datastage:
- Used to integrate different types of data
- Used in Big Data for the integration of data at rest or motion
- Used in Mainframe and Distributed Architectures for Data Integration
- Plays a major role in Business Analysis by providing valuable information for Business Intelligence.
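To make the extract, transform, and load sequence concrete, here is a minimal sketch in plain Python. It is not Datastage code: the function names, the sample CSV text, and the `warehouse` list standing in for a target are all assumptions made for illustration.

```python
import csv
import io


def extract(source_text):
    """Extract: read rows from a CSV source (here, an in-memory string)."""
    return list(csv.DictReader(io.StringIO(source_text)))


def transform(rows):
    """Transform: drop rows with no amount, tidy names, cast amounts."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # reject incomplete records
        cleaned.append({
            "customer": row["customer"].strip().title(),
            "amount": float(row["amount"]),
        })
    return cleaned


def load(rows, target):
    """Load: append the transformed rows to the target and report the count."""
    target.extend(rows)
    return len(rows)


# Hypothetical source data and target, for illustration only.
SOURCE = "customer,amount\n alice ,10.5\nbob,\ncarol,3\n"
warehouse = []
loaded = load(transform(extract(SOURCE)), warehouse)
```

In a real Datastage job, each of these three functions would correspond to one or more stages connected by links rather than to Python calls.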
In Datastage, the Conductor Node runs the primary process that starts jobs, determines resource assignments, and creates the section leader processes on the various processing nodes. It acts as the single point that consolidates status and error messages, and it manages an orderly shutdown when processing completes or a fatal error occurs. It runs on the primary server.
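The coordination role described above can be pictured with a toy model. The sketch below is only an analogy in Python: `conductor` and `section_leader` are hypothetical functions, threads stand in for the processes on each node, and none of this reflects how the parallel engine is actually implemented.

```python
from concurrent.futures import ThreadPoolExecutor


def section_leader(node, work):
    """Hypothetical section leader: runs the work for one node and reports back."""
    try:
        return {"node": node, "status": "ok", "result": work(node)}
    except Exception as exc:  # any failure is reported as a fatal error
        return {"node": node, "status": "fatal", "error": str(exc)}


def conductor(nodes, work):
    """Hypothetical conductor: starts one section leader per node,
    gathers their status messages, and decides the final job status."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        # pool.map keeps the reports in the same order as `nodes`
        reports = list(pool.map(lambda node: section_leader(node, work), nodes))
    failed = [report for report in reports if report["status"] == "fatal"]
    return {
        "job_status": "aborted" if failed else "finished",
        "reports": reports,
    }
```

Calling `conductor(["node1", "node2"], some_function)` returns a single consolidated report, which mirrors the idea that one process answers for the whole job.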
Here are the features of the Datastage Flow Designer:
- It has the palette for easy dragging and dropping of connectors and operators onto the designer canvas.
- Here the nodes can be easily linked by selecting the previous node and dropping the next node or by merely drawing the link between the two nodes.
- Stage properties can be edited in the side-bar, and schema changes can be made in the Column Properties tab.
- There is an option to zoom in and zoom out using your mouse.
- There is no need to migrate jobs
- You don't need to upgrade servers or purchase virtualization technology licenses.
- You can easily highlight all compilation errors in one go.
Here are some necessary steps to set up a Merge in Datastage:
- Go to the Stage page Properties Tab
- Now, specify the key column or columns to be merged
- Still inside the Stage page Properties Tab, set the Unmatched Masters Mode, Warn on Reject Updates, and Warn on Unmatched Masters options, or accept the defaults.
- After that, in the Stage page Link Ordering Tab, you need to check if your input links are correctly identified as "master" and "update(s)," and the output links are defined as "master" and "update reject."
- Most importantly, you need to ensure the required column metadata is specified
- Finally, go to the Output Page Mapping Tab and specify how the columns from the input links map to the output columns.
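The options configured in these steps are easier to follow with the behaviour spelled out. The following is a rough Python sketch of what a merge of one master with several update data sets does; it is not Datastage code. The function name `merge_stage` and the `keep_unmatched_masters` flag (standing in for Unmatched Masters Mode) are assumptions, as is the simplification that each key value appears at most once per data set and that a master row counts as unmatched only when no update data set contains its key.

```python
def merge_stage(master, updates, key, keep_unmatched_masters=True):
    """Merge one master data set with any number of update data sets on `key`.

    Returns (output_rows, rejects), where rejects[i] holds the rows of
    updates[i] whose key matched no master row (one reject list per update).
    Assumes each key value occurs at most once in every data set.
    """
    master_keys = {row[key] for row in master}
    # Index each update data set by key for direct access.
    indexes = [{row[key]: row for row in update} for update in updates]

    output = []
    for row in master:
        merged = dict(row)
        matched = False
        for index in indexes:
            update_row = index.get(row[key])
            if update_row is not None:
                # Add the update's non-key columns to the master row.
                merged.update({k: v for k, v in update_row.items() if k != key})
                matched = True
        if matched or keep_unmatched_masters:
            output.append(merged)

    rejects = [
        [row for row in update if row[key] not in master_keys]
        for update in updates
    ]
    return output, rejects
```

With `keep_unmatched_masters=False`, master rows that find no update are dropped from the output, while unmatched update rows always end up in the reject list for their own link.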
| Merge Stage | Funnel Stage |
|---|---|
| It is a processing stage that can have any number of input links, the same number of reject links, and one output link. | It is a processing stage used for copying multiple input data sets into a single output data set. |
| It is used for combining one master data set with multiple update data sets. | It is useful for combining multiple data sets into one large data set. |

| Join Stage | Lookup Stage | Merge Stage |
|---|---|---|
| Used when joining large tables. | Used when doing a range lookup within a small reference data set. | Used when multiple update and reject links are required. |
| Join performs better when the data on the input links is key-sorted. | It does not require the data on the input link or the reference link to be sorted. | It can have any number of input links, but the number of reject links has to match them. |
| The key column names must be the same in the tables. | The key column names do not have to be the same in the primary and lookup tables. | To keep memory requirements to a minimum, users have to ensure that rows with the same key column values are located in the same partition and are processed by the same node. |
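The difference between a join on sorted inputs and a lookup against a small in-memory reference can be sketched in Python. This is an illustration of the two general techniques (a sort-merge join and a dictionary lookup), not Datastage code, and the function names and sample column names are assumptions.

```python
from itertools import groupby
from operator import itemgetter


def sort_merge_join(left, right, key):
    """Inner join that relies on both inputs being sorted on the same key column."""
    get_key = itemgetter(key)
    left_groups = [(k, list(g)) for k, g in groupby(sorted(left, key=get_key), key=get_key)]
    right_groups = [(k, list(g)) for k, g in groupby(sorted(right, key=get_key), key=get_key)]

    output, i, j = [], 0, 0
    while i < len(left_groups) and j < len(right_groups):
        left_key, left_rows = left_groups[i]
        right_key, right_rows = right_groups[j]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Same key on both sides: emit every combination of matching rows.
            output.extend({**l, **r} for l in left_rows for r in right_rows)
            i += 1
            j += 1
    return output


def lookup(stream, reference, stream_key, reference_key):
    """Lookup against a small reference held in memory; no sorting needed,
    and the key columns may have different names on each side."""
    table = {row[reference_key]: row for row in reference}
    output = []
    for row in stream:
        match = table.get(row[stream_key])
        if match is not None:
            output.append({**row, **match})
    return output
```

The join walks both sorted inputs once and never holds a whole table in a dictionary, which is why it suits large tables; the lookup trades memory for simplicity, which is why it suits small reference data.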
In Datastage, duplicates can be removed in four ways:
- Through the Remove Duplicates Stage
- Using the Hashed File Stage.
- Using a Sort stage and setting Allow Duplicates to False
- At any stage, by hash partitioning the input data and selecting the Sort and Unique options.
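The last approach works because hash partitioning sends every row with the same key to the same partition, so each partition can be sorted and de-duplicated on its own. Here is a minimal Python sketch of that idea; the function names and the default of four partitions are assumptions, and `zlib.crc32` is used only as a convenient deterministic hash.

```python
import zlib


def hash_partition(rows, key, partitions):
    """Send each row to a partition chosen by a hash of its key value,
    so rows with equal keys always share a partition."""
    buckets = [[] for _ in range(partitions)]
    for row in rows:
        index = zlib.crc32(str(row[key]).encode("utf-8")) % partitions
        buckets[index].append(row)
    return buckets


def sort_unique(rows, key):
    """Sort one partition on the key and keep the first row for each key value."""
    result = []
    for row in sorted(rows, key=lambda r: r[key]):  # stable sort keeps input order per key
        if not result or result[-1][key] != row[key]:
            result.append(row)
    return result


def remove_duplicates(rows, key, partitions=4):
    """Hash partition, then sort and de-duplicate each partition independently."""
    deduped = []
    for bucket in hash_partition(rows, key, partitions):
        deduped.extend(sort_unique(bucket, key))
    return deduped
```

If the data were partitioned some other way, two copies of the same key could land in different partitions and both would survive, which is why the hash partitioning step matters.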
Job control in Datastage provides a method of controlling other jobs from the current job. A set of one or more jobs can be validated, run, stopped, reset, and scheduled in much the same way as the current job. Users can also set up a job whose only function is to control a set of other jobs.
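Jobs can also be driven from outside with the `dsjob` command-line client, which is a different mechanism from job control code written inside a job but covers the same validate, run, reset, and stop actions. The sketch below only builds the argument lists and does not execute anything. The helper names and the project and job names in the usage note are hypothetical, and the option spellings should be checked against the `dsjob` reference for your installation.

```python
VALID_MODES = {"NORMAL", "RESET", "VALIDATE", "RESTART"}


def dsjob_run(project, job, mode="NORMAL", params=None, wait=True):
    """Build the argument list for `dsjob -run` (nothing is executed here)."""
    if mode not in VALID_MODES:
        raise ValueError(f"unsupported mode: {mode}")
    args = ["dsjob", "-run", "-mode", mode]
    for name, value in (params or {}).items():
        args += ["-param", f"{name}={value}"]
    if wait:
        args.append("-wait")  # wait for the job to finish before returning
    args += [project, job]
    return args


def dsjob_stop(project, job):
    """Build the argument list for stopping a running job."""
    return ["dsjob", "-stop", project, job]
```

For example, `dsjob_run("dev_project", "load_sales", mode="VALIDATE")` produces a list that could be passed to `subprocess.run` on a machine where the client is installed.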
Here are the steps to successfully kill a job in Datastage:
- First, go to the Job Resources dialog box and choose which processes to list by selecting the Show All or Show by job option in the Processes area.
- Now, to kill all the processes associated with a particular job, click on Logout All.
- If you want to end a specific process only, then select the process in the list box, and then click on Logout.
- Now, you will have to wait for the process to end and the display to be updated.
Types of loops available in Datastage:
- Numeric Loop: A loop defined by three parts: From (the initial counter value), Step (the increment applied to the counter on each iteration), and To (the final counter value).
- List Loop: A loop defined by two parts: Delimited Values (the values in the list) and the Delimiter (a space, comma, or any other character). The loop runs once for each item in the list.
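Both loop types can be mimicked in a few lines of Python to show what values the loop counter takes. This is a sketch rather than Datastage code: the function names are made up, and it assumes that the To value is included in a numeric loop and that surrounding whitespace is trimmed from list items.

```python
def numeric_loop(start, step, stop):
    """Yield counter values from From (`start`) to To (`stop`) in increments of Step.

    Assumes the To value is inclusive.
    """
    if step == 0:
        raise ValueError("step must be non-zero")
    counter = start
    while (step > 0 and counter <= stop) or (step < 0 and counter >= stop):
        yield counter
        counter += step


def list_loop(delimited_values, delimiter=","):
    """Yield one item per entry in a delimited string of values."""
    for item in delimited_values.split(delimiter):
        yield item.strip()
```

For instance, `list(numeric_loop(1, 2, 7))` gives the counters for From 1, Step 2, To 7, and `list(list_loop("north,south,east"))` gives one item per region name.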