Join and blend datasets

Join




You can join two datasets together using common columns using the Join transform. 

For example, consider a dataset which has the purchase data of a store and another dataset with their customer information. These datasets can be combined using a common column such as the customer ID to identify and match the records from both the datasets.

DataPrep offers all four types of joins: inner join, left join, right join, and outer join. Here's a quick video on how to use the Join transform.



To perform a join transform

1. In the DataPrep Studio page, click the Transform menu, click Combine , then select Join.

Info: You can also click the  icon or right-click the stage in the Pipeline builder and choose the Join transform.



2. In the Join dataset dialog, you can choose the dataset with which you are going to join the current dataset and the join type. 

3. Select the dataset that you want to join your current dataset with in the Choose a dataset to join drop-down. 

A new dataset will be created with the result of the join transform.



DataPrep automatically calculates and shows the Join potential between the current dataset and other datasets from your workspace. The join potential is shown in percentage, making it easier for you to choose a dataset for the Join transform. It is a calculation based on various factors like the amount of matching data in a column and matching column names between the datasets.

4. Choose the type of join using the Join type option. You can also change the type of join from the Join type menu in the Transform panel.



5.Enter a name for the new dataset in the New dataset name box in the Transform panel. 

6. Select the columns using which you want to join the two datasets i n the Matched columns section. 

Resolve repeated column names


7. Click the Preview button. The Resolve repeated column names dialog appears, if there are any columns with the same name in both the datasets that you're trying to join. You can rename or delete those columns and then proceed.

8. In the preview, you can unselect the columns that you do not want to include using the check boxes in the column headers.

9. Click Join to join the two datasets.



10. Once you've successfully joined the two datasets, you can click Open the joined dataset.

 

11. If you wish to view the join configuration, you can click open the Data source configuration from the Ruleset pane from the joined dataset. 


12. You can also click Edit join to update the join configuration.




Info
Important : When performing a join, ensure that your join criteria is valid and the columns used for joining do not contain too many duplicate values. If the join columns have a large number of duplicates, the resulting dataset can grow exponentially, leading to errors or performance issues.
To avoid this, verify the join conditions or deduplicate the join columns before applying the transform.


Joins are typically used to combine two different datasets on the common items. Let's look into each type of join with an example. 

Inner join

An inner join combines two datasets using common columns and discards all the unmatched rows. 

The inner join is best represented as: 



Let us take the example of the following datasets. 

Orders dataset 



Customers dataset 



Order ID  is the common column here.  Inner join  has combined the rows of the two datasets that shared values in the Order ID column and discarded the unmatched values from both  Orders  and  Customers  datasets. 



Left join


A left join combines the current dataset with another dataset using common columns, and discards all unmatched rows from the other dataset.

Left join can be represented as: 



Let's take the example of the following datasets:

Orders daaset



Customers dataset



Left join  has combined the rows of the two datasets that shared values in the Order ID column. The unmatched values were only discarded from the Customers dataset. 



Right join 


A right join combines the current dataset with another dataset using common columns and discards all unmatched rows from the current dataset.
Right join can be represented as:



Let's take the example of the following datasets: 

Orders dataset



Customers dataset



Right join has combined the rows of the two datasets that shared values in the Order ID column. The unmatched values were only discarded from the Orders dataset. 


Outer join

Outer join uses common columns to combine two datasets, including the unmatched rows from both the datasets. An outer join can be represented as:



Let's take the example of the following datasets.

Orders dataset



Customers dataset



Outer join  has combined the rows of the two datasets that shared values in the Order ID column. The unmatched values are kept from both Orders and Customers datasets. 



SEE ALSO