Valuable Simplicity - Pentaho Data Integration Functionality

Simply excellent – some basic but valuable functionality in Pentaho Data Integration

I’ve now been using Pentaho Data Integration (PDI) for around a year, building on ETL experience which began with Oracle’s equivalent offering, Oracle Data Integrator (ODI). Both tools have their strengths and weaknesses, but rather than comparing them, this post focuses on a few pieces of PDI’s simpler out-of-the-box functionality which struck me as valuable for an ETL developer.

PDI comes with over 200 default steps available to be used within transformations, with around 100 more available as job entries. This is an extremely rich foundation of functionality. Each step has a ‘Help’ button available which links the user to online documentation, and often the online community can provide further insight into the logic of the step and example use cases. Furthermore, the PDI installation ships with a range of sample code written by developers to illustrate how specific steps work. So, first impressions – lots of options, plenty of help. 

On to some examples of functionality which stood out to me as useful – not necessarily because they are the best features of PDI, or unique to the tool, but because I hadn’t come across them in other tools and have recently seen use cases for which they are perfectly suited.

 

Pre-ETL checks

It’s likely most ETL developers will have experienced process failures as a result of database connectivity problems, missing tables/columns within a database, or missing/locked files. These issues are a fact of life, but how they impact us is something we can control.

We might see an ETL process do half of its work – move some files, load some tables – only to fail on subsequent steps because of the issues mentioned above. At this point, we may need to unpick how far our process has progressed and get back to a state where we can either pick up from the point of failure or start over. This may involve deleting data from the current batch or moving files back to input directories. The point is, we could have work to do just to be able to ‘go again’ because we failed partway through.

Instead, why not perform all of the database/file checks we need to before the ETL even starts? 

PDI comes with steps to achieve all of this. 

On the database side, we can check: 

  • If we can connect 
  • If a table exists 
  • If a column exists within a table 

With files, we can check: 

  • If the file(s) exist 
  • If the file is not locked 
  • If a directory is empty 

We can also confirm that a webservice is available, all before starting our main process. 

Simply adding these steps into our ETL as prerequisites to any ‘heavy lifting’ of data can save us the headache of a mid-process failure. If any of the prerequisite checks fail, we simply abort the ETL and move on to a step that notifies administrators, e.g. by sending an email. We can then fix any issues before kicking off a fresh ETL.
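
To make the idea concrete outside of PDI, here is a minimal Python sketch of the same pre-flight pattern. It is not PDI’s implementation – the database path, table, column and file names are invented placeholders, and sqlite3 simply stands in for whichever database driver you actually use:

  import os
  import sqlite3  # stand-in for whichever database driver you actually use

  def pre_etl_checks(db_path="warehouse.db", table="sales", column="order_id",
                     input_file="input/orders.csv"):
      """Collect every problem up front, before any data is moved."""
      errors = []

      # Database checks: can we connect, does the table exist, does the column exist?
      try:
          with sqlite3.connect(db_path) as conn:
              cols = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
              if not cols:
                  errors.append(f"table '{table}' does not exist")
              elif column not in cols:
                  errors.append(f"column '{column}' missing from table '{table}'")
      except sqlite3.Error as exc:
          errors.append(f"cannot connect to database: {exc}")

      # File check: does the input file exist?
      if not os.path.isfile(input_file):
          errors.append(f"input file '{input_file}' is missing")

      return errors

  problems = pre_etl_checks()
  if problems:
      # Abort before any heavy lifting and notify an administrator instead
      raise SystemExit("Pre-ETL checks failed: " + "; ".join(problems))
  print("All pre-ETL checks passed - safe to start the main process")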

A simple and proactive way to increase the robustness of an ETL process.  

 

Regular expressions

Pattern matching of strings is a useful feature, which is provided by the Regex Evaluation step. 

The use case I faced which required this step was the identification of UK postcodes from multi-line address data in which there was no specific postcode field – the postcode could be in any of 5 fields, and the requirement was to populate a new field with the postcode for each row.

The Regex Evaluation step can be used to check each address field against the pattern for a valid UK postcode. If a field matches the pattern, we can populate this field as the postcode. 

An example transformation shows how the Regex Evaluation step works.

The regular expression to identify a UK postcode can be found here.

The regex looks like this: 

^([A-Za-z][A-Ha-hK-Yk-y]?[0-9][A-Za-z0-9]? ?[0-9][A-Za-z]{2}|[Gg][Ii][Rr] ?0[Aa]{2})$ 

Once we have evaluated the strings, the example below filters the rows to separate the valid postcodes from the invalid, giving us 2 streams of data.
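
As a rough sketch of the same logic outside PDI, the Python below applies the pattern to each address field and splits the rows into valid and invalid streams. The field names and sample rows are invented for illustration:

  import re

  # The UK postcode pattern quoted above
  UK_POSTCODE = re.compile(
      r"^([A-Za-z][A-Ha-hK-Yk-y]?[0-9][A-Za-z0-9]? ?[0-9][A-Za-z]{2}"
      r"|[Gg][Ii][Rr] ?0[Aa]{2})$"
  )

  rows = [
      {"address1": "10 Downing Street", "address2": "London", "address3": "SW1A 2AA"},
      {"address1": "221B Baker Street", "address2": "Marylebone", "address3": "London"},
  ]

  valid, invalid = [], []
  for row in rows:
      # Populate a new field with whichever address line matches the pattern
      row["postcode"] = next(
          (value for value in row.values() if UK_POSTCODE.match(value.strip())),
          None,
      )
      (valid if row["postcode"] else invalid).append(row)

  print(len(valid), "rows with a valid postcode,", len(invalid), "without")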

The ability to perform pattern matching of strings is great functionality, without which the use case mentioned above would be tricky to fulfil. 

 

Merge rows (diff) 

Quite often users will want to check whether records in one system/table/file exist in another system/table/file.  

The Merge Rows (diff) step was designed for just this purpose. Based on one or more key fields, it reads in 2 streams of data and determines whether a record is new, deleted, changed or identical between the first and second stream.

A perfect example would be to determine whether a record in a source system is in sync with the same record in a reporting system (i.e. a data warehouse). 

The example below is simpler and is based on a use case where data needed to be joined across 2 different legacy systems and enriched before being loaded into a new target system. Records which were missing in either system needed to be flagged to the business. Here’s how we can identify these records, to begin with.


In the 2 sources, we have only a single field to keep things simple. The data must be sorted before the join; in a production situation, performance could be improved by sorting the data in the database before it is picked up by PDI.

Within the Merge Rows (diff) step, we are joining on the key fields (in this case the only field) and producing a new field containing the evaluation of each record between the 2 sources. By default, this evaluation is one of: identical, new, deleted or changed. I’ve given this the more user-friendly output we can see above, namely “present only in sourcex”.
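
For anyone curious about what the step does conceptually, here is a hedged Python sketch of the comparison – the record data is invented, and in real PDI both streams must be sorted on the key just as described above:

  # Conceptual sketch of Merge Rows (diff): compare two streams on a key field
  # and flag each record as new, deleted, changed or identical
  reference = {"ACC001": "Alice", "ACC002": "Bob", "ACC004": "Dave"}      # e.g. legacy system 1
  comparison = {"ACC001": "Alice", "ACC003": "Carol", "ACC004": "David"}  # e.g. legacy system 2

  for key in sorted(reference.keys() | comparison.keys()):
      if key not in reference:
          flag = "new"        # present only in the comparison stream
      elif key not in comparison:
          flag = "deleted"    # present only in the reference stream
      elif reference[key] != comparison[key]:
          flag = "changed"    # key matches but the other fields differ
      else:
          flag = "identical"
      print(key, "->", flag)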


This is a very simple feature which allows us to compare different systems without the need for much code. The main pitfalls we need to avoid are things like case sensitivity and whitespace in the fields making up the keys in our merge step. We can easily address this in PDI using e.g. a String Operations step to trim and UPPER the data. 
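
As a trivial illustration of that clean-up outside PDI (the helper name is hypothetical), normalising the key fields before the comparison might look like this:

  def normalise_key(value: str) -> str:
      # Trim whitespace and upper-case the key so that "  acc001 " and "ACC001"
      # are treated as the same record
      return value.strip().upper()

  print(normalise_key("  acc001 "))  # -> ACC001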

None of the features above are complicated to use, nor are they in themselves reasons to choose PDI over another ETL tool. In a list of PDI’s key strengths, we’d see features like metadata injection and big data integration, which I’ve not covered here. What I’ve outlined above is the bread and butter, but for me that is precisely the appeal of PDI: simple functionality which is quick and easy to use. With such a wide range of out-of-the-box steps, all documented and supported by the community user base, PDI is a tool made by ETL developers for ETL developers. I’m sure that when the time comes to use another technology, I’ll miss PDI’s rich set of steps, which often feel tailor-made for every use case I can think of and more.

If you would like to explore the benefits of data integration tools like Pentaho’s PDI and Oracle’s ODI for your organisation, get in touch to discuss setting up a demo – let’s start the conversation and unlock your business’s potential.

