8 November 2019

How to make use of Talend Open Studio in the medical industry?

The use of modern technologies in medicine is becoming increasingly common. Paper patient records are becoming obsolete and are being replaced by electronic forms of data storage. The digitalisation of the health service is under way! In which areas? This article answers that question.

How do computer systems in hospitals work?

Probably no one needs to be convinced that health is important. Our condition depends to a large extent on ourselves. But what if something does begin to fail? This is exactly when doctors come to the rescue: their knowledge and experience help us get better and sometimes even return to full physical fitness.
You have probably never really thought about it, but they do not work alone. Prevention and treatment involve other people and supporting systems, without which doctors would not be as efficient as we expect.

Thousands of patients arrive at hospitals and other medical facilities every day. Their disease or its complications must be diagnosed, and they must be given recommendations for further treatment, advice on how to react to potential side effects of medicines, and so on. Finally, all this information is recorded by the computer system in digital form. The system supports the treatment process, making it more and more efficient. Medical knowledge about a particular patient is stored in one place, so that other specialists can use it in the following years.

Medical industry

The medical industry uses data in many formats and sizes, which makes data processing a real challenge. This was exactly the problem a US customer came to us with. Every day, the customer received a huge number of files, which were later delivered to the company’s central unit. The files contained interviews conducted by doctors as well as patients’ medical records, which medical staff and data administrators used on a daily basis. An extra difficulty was that this enormous amount of data had to be available to every authorised staff member in real time.

Work with the Customer. Expectations

Our team’s task was not trivial. We had to obtain data from several sources, transform it and load it into another system. It was a data integration project in which data from medical facilities was to be loaded into a data warehouse with a capacity of 1 petabyte. Understanding the data and learning the whole processing procedure was of major importance.

During implementation, it also helped to understand the software at the functional level and to learn the customer’s expectations. With such a huge amount of data, choosing the right techniques and tools to keep processing time as short as possible was key. Efficiency was crucial: poor performance would have left end users with a bad impression and, ultimately, a negative opinion of the system.

Another important element was collecting information about any undesirable actions and corrupted data. The customer’s requirements were stated quite clearly here: we were expected to use a particular ETL tool, the open source Talend Open Studio for Data Integration.

What is an ETL process?

ETL stands for extract, transform, load. Because the data is also validated throughout the migration process, the acronym is sometimes extended to ETLV.
The first step is to obtain data from one or more source systems. Then, so that the data can be loaded into the target system, it must be converted to the correct format, the relevant filtering must be applied and business rules must be implemented. Data cleansing should also take place during the process, but it does not have to happen at the transformation stage. More often it is performed before extraction; it can even be done after loading, although that is much less common.
In practice, the software used to transform data into business information is also referred to as ETL.
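
The extract, transform, validate and load steps can be illustrated with a minimal Python sketch. It is not how Talend implements a job: the record layout, the `visits` table and the in-memory SQLite database standing in for the target system are all assumptions made for illustration only.

```python
import json
import sqlite3

# Hypothetical raw input: one JSON document per line, as a source system might export it.
RAW_LINES = [
    '{"patient_id": "P-001", "visit_date": "2019-11-08", "weight_kg": "72.5"}',
    '{"patient_id": "P-002", "visit_date": "2019-11-08", "weight_kg": "n/a"}',
    '{"patient_id": "", "visit_date": "2019-11-09", "weight_kg": "80"}',
    'not valid json',
]


def extract(lines):
    """Extract: parse each line, yielding (record, error) pairs."""
    for line in lines:
        try:
            yield json.loads(line), None
        except json.JSONDecodeError as exc:
            yield None, f"unparseable line: {exc.msg}"


def transform(record):
    """Transform and validate: convert fields to target types, raise on bad data."""
    if not record.get("patient_id"):
        raise ValueError("missing patient_id")
    return (record["patient_id"], record["visit_date"], float(record["weight_kg"]))


def run_etl(lines, conn):
    """Run the pipeline and return a list describing every rejected input."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS visits "
        "(patient_id TEXT, visit_date TEXT, weight_kg REAL)"
    )
    rejects = []
    for record, error in extract(lines):
        if error:
            rejects.append(error)
            continue
        try:
            row = transform(record)
        except (ValueError, KeyError) as exc:
            rejects.append(f"{record!r}: {exc}")
            continue
        # Load: only rows that passed validation reach the target table.
        conn.execute("INSERT INTO visits VALUES (?, ?, ?)", row)
    conn.commit()
    return rejects


conn = sqlite3.connect(":memory:")  # stand-in for the target system
rejects = run_etl(RAW_LINES, conn)
loaded = conn.execute("SELECT patient_id, weight_kg FROM visits").fetchall()
```

Keeping rejected inputs in a separate list, rather than silently dropping them, is one simple way to collect information about corrupted data during the run.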

What is Talend Open Studio?

Talend Open Studio is an open source data integration platform that helps turn complex data into comprehensible information for the people responsible for business matters. This simple and intuitive tool has caught on in the US and can easily compete with products from the major players in this market. Importantly, Talend works well with cloud-based data warehouses hosted on platforms from giants such as Microsoft Azure and Amazon Web Services.

Its basic features are:

  • Over 900 ready-made components for connecting to various data sources, e.g. RDBMS, Excel, SaaS and Big Data, as well as to applications and technologies such as SAP, CRM and Dropbox;
  • A metadata repository that facilitates connection management;
  • Automatic conversion of jobs to Java code;
  • Intuitive transitions and a fairly large community.

“Talended” data flow in AWS cloud

The project described here was undoubtedly one of the largest Talend Open Studio projects implemented at Transition Technologies PSC. To ensure that everything worked smoothly and reliably, we paid a lot of attention to infrastructure at the preparation stage.

After weighing the pros and cons, we decided to use Amazon Web Services, which provides on-demand cloud computing platforms. For storage we used Amazon S3 (Simple Storage Service), mostly because of its capabilities: virtually unlimited capacity, easy access and, above all, high data durability are just some of them. The popularity of this service is reflected in the fact that giants such as Netflix and Dropbox have relied on it as their main storage. The ETL processes collected JSON files there, which were transformed and loaded into the data warehouse at the next stage.
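
Turning a JSON file into warehouse rows usually means flattening nested objects into flat columns. The sketch below shows one simple way to do that in Python; the document structure and field names are hypothetical and are not taken from the customer's data.

```python
import json


def flatten(obj, prefix=""):
    """Flatten nested dicts into one dict with underscore-joined column names."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the parent name as a prefix.
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


# Hypothetical document, as it might arrive in a JSON file.
doc = json.loads(
    '{"patient": {"id": "P-001", "age": 54},'
    ' "interview": {"doctor": "D-17", "notes": "follow-up in 2 weeks"}}'
)
row = flatten(doc)
```

Each key of `row` can then be mapped to one column of the target table. Lists inside the document are left untouched here and would need their own handling.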

A data warehouse is a specialised database that integrates data from numerous sources. It is usually filled regularly with data from production systems. Our warehouse was built on the Amazon Redshift service. It is a very popular solution used by over 15,000 customers around the world, including McDonald’s and Philips. The main advantages of Redshift are undoubtedly its efficiency and scalability. Although it is not a conventional relational database, it is queried with standard SQL. While analysing large volumes of data, we saw how fast and effective queries are in Amazon Redshift. Importantly, it has no indexes and uses column-oriented data storage. Each column can have its own compression method, which saves storage space and further reduces costs.
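
Per-column compression is declared in Redshift with an `ENCODE` clause on each column, and files stored in S3 are typically loaded with the `COPY` command. The Python sketch below only builds such SQL statements as strings; the table, columns, chosen encodings, bucket and IAM role are hypothetical examples, not the project's actual schema.

```python
def create_table_sql(table, columns):
    """Build a CREATE TABLE statement with a per-column ENCODE clause.

    columns is a list of (name, sql_type, encoding) tuples.
    """
    cols = ",\n".join(
        f"    {name} {sql_type} ENCODE {encoding}"
        for name, sql_type, encoding in columns
    )
    return f"CREATE TABLE {table} (\n{cols}\n);"


def copy_from_s3_sql(table, s3_uri, iam_role):
    """Build a COPY statement that loads JSON files from S3 into a table."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_uri}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        f"FORMAT AS JSON 'auto';"
    )


# Hypothetical table layout; the encodings are illustrative choices only.
ddl = create_table_sql(
    "visits",
    [
        ("patient_id", "VARCHAR(32)", "zstd"),
        ("visit_date", "DATE", "az64"),
        ("weight_kg", "DECIMAL(5,1)", "az64"),
    ],
)

# Hypothetical bucket and role ARN.
copy_sql = copy_from_s3_sql(
    "visits",
    "s3://example-bucket/visits/",
    "arn:aws:iam::123456789012:role/ExampleRedshiftRole",
)
```

Printing `ddl` and `copy_sql` shows the statements that would be sent to the warehouse. Which encoding suits which column depends on the data, so the choices above should be treated as placeholders.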

Amazon Redshift and S3 make a great combination for data storage. They work efficiently with Talend, which includes components for both services. One of the main advantages of the Amazon Web Services cloud is its almost “unlimited” resources. If necessary, capacity can be extended in just a few minutes. It is often very difficult to determine the required disk capacity at the beginning of a project, yet conventional “on-premises” solutions require us to declare it up front. The cloud gives us leeway in this regard and even allows configurations that automatically increase the size of the database as our needs grow.

Figure: Talend on S3 and Redshift

Another advantage is a very high level of confidence that stored data will not be lost. Amazon S3 is designed for 99.999999999% durability of objects over a given year. This means that if we stored ten million objects, we could expect to lose, on average, one object every 10,000 years.
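
The 10,000-year figure follows directly from the durability percentage. The short calculation below reproduces it, under the simplifying assumption that the durability value can be read as an independent per-object probability of surviving one year.

```python
from fractions import Fraction

# 99.999999999 % durability, written as an exact fraction.
durability = Fraction(99_999_999_999, 100_000_000_000)

# Assumed reading: chance of losing one given object within one year.
annual_loss_probability = 1 - durability

objects_stored = 10_000_000

# Expected number of objects lost per year across the whole store.
expected_losses_per_year = objects_stored * annual_loss_probability

# Average number of years between single-object losses.
years_per_lost_object = 1 / expected_losses_per_year
```

Exact fractions are used instead of floating-point numbers so that the eleven nines are not rounded away during the subtraction.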

Another important advantage of the cloud is that it makes compliance with legal regulations easier. When storing data in the Amazon cloud, we can choose where it is located, for example in one of the EU member states (e.g. Germany or Ireland). This helps us meet the data location requirements of the General Data Protection Regulation. Furthermore, if we are based in Europe, access to the data is typically much quicker. For the TT PSC customer, storing the data in the United States was the obvious choice: it was optimal in terms of cost and waiting time.

It is worth mentioning that the cost and availability of cloud services vary depending on the region in which they are hosted. For example, Amazon S3 in the Northern Virginia region is almost half the price of the same service in the Sao Paulo region.
There are many more advantages of the cloud, but from the developer’s point of view the most important is the intuitive and efficient use of the resources available in AWS.

To sum up, the project carried out for our foreign customer was completed successfully: it fully met the customer’s expectations and needs and solved the existing problems. From the developer’s perspective, working on the project with the tools described above was a pleasure.

Are you thinking of digitising the processes in your company? Do not hesitate to contact us!

    Contact

    Transition Technologies PSC Sp. z o.o.
    Poland, Lodz 90-361, Piotrkowska Street 276
    NIP 729-271-23-88

    tel.: +48 42 664 97 20
    fax: +48 42 664 97 30

    contact@ttpsc.com