Wednesday, April 30, 2008

SSIS - Updating rows from a FlatFile

Dear Friends,
Today I will describe a subset of an ETL process I built. I think it could be useful for you, and your feedback on it will be useful for me.

The Problem
I have customer data (sent by an application, as text files, to a specific file system share) that must be imported into a SQL Server 2005 database, more precisely into the Customers database table.

Every Friday, all the data in the text file must be extracted, transformed, and loaded into the SQL Server 2005 database. After the extraction, the file must be moved to another directory in the file system (…\ProcessedFiles).
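In the SSIS package this archiving step is done with a File System Task; as a language-neutral illustration of the same idea (the file and directory names here are invented, not the real share paths), in Python:

```python
import shutil
from pathlib import Path

def archive_file(source_file, processed_dir):
    """Move an extracted flat file into the processed-files directory,
    creating the directory first if it does not exist yet.
    (Equivalent to an SSIS File System Task with a Move File operation.)"""
    dest = Path(processed_dir)
    dest.mkdir(parents=True, exist_ok=True)
    return shutil.move(str(source_file), str(dest / Path(source_file).name))
```

Moving (rather than deleting) the file keeps an audit trail and guarantees the same file is never extracted twice.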

The text file uses fixed-width columns, and one of the columns describes the action to take when the ETL updates the Customers table:
N (means new customer to insert)
A (means update an existing customer)
D (means update/inactivate an existing customer)
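To make the fixed-width layout and the action column concrete, here is a minimal parsing sketch. The column offsets below are invented examples (the real layout belongs to the source application):

```python
def parse_fixed_width(line):
    """Slice one fixed-width line into fields. The offsets are made-up
    examples: id = cols 0-9, name = cols 10-39, action = col 40."""
    return {
        "id": line[0:10].strip(),
        "name": line[10:40].strip(),
        "action": line[40:41],  # N = insert, A = update, D = inactivate
    }
```

In SSIS the same slicing is configured declaratively in the Flat File Connection Manager, with one column defined per offset/width pair.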

The Solution
The data will be extracted from the text file into a staging table (“preCustomer”) without any transformations/rules (to avoid risks). Each row imported into this staging table has an extra column that is a foreign key to my audit information table. Next, I read all the records from the last extraction, transform them, and load them into the Customer table.
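The staging step can be pictured like this (a Python sketch for illustration only; the `audit_id` column name and row shape are my assumptions, not the real schema): each raw row is stored untouched, plus a key back to the audit record for this run.

```python
def stage_rows(raw_rows, audit_id):
    """Load raw rows into the preCustomer staging "table" without any
    transformation, tagging each row with the audit record of this run."""
    return [dict(row, audit_id=audit_id) for row in raw_rows]
```

Because every staged row points back to its extraction's audit record, it is always possible to tell which file and which run a row came from.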

The end user, from an ASP application, will be able to correct or ignore the redirected rows, and the next time the SSIS package executes, these errors will disappear. (Note: the warning and error rows are stored in the redirectCustomer table until the end user takes some action (correct or ignore) on them.)
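The core routing idea (apply the row if its action is possible, otherwise redirect it instead of failing the whole batch) can be sketched like this. This is an illustrative Python analogue of the data flow, not the package itself; the names and the exact warning rules are my assumptions:

```python
def process_row(customers, redirect_rows, row):
    """Apply one staged row to the Customers "table" by its action code
    (N = insert, A = update, D = inactivate). Rows that would cause a
    problem are parked in redirect_rows for the end user to handle."""
    code, cust_id = row["action"], row["id"]
    if code in ("A", "D") and cust_id not in customers:
        # warning event: the customer to update/inactivate does not exist;
        # park the row until the end user corrects or ignores it
        redirect_rows.append(dict(row, event="warning"))
        return "redirected"
    if code == "N":
        customers[cust_id] = {"name": row["name"], "active": True}
    elif code == "A":
        customers[cust_id]["name"] = row["name"]
    else:  # "D"
        customers[cust_id]["active"] = False
    return "applied"
```

Redirecting instead of raising means one bad row never aborts the Friday load; the problem rows simply wait in redirectCustomer for the next run.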

Logical Data Flow
I have several steps that could be useful to explain, but today I will focus only on the subset that I think is most interesting. Below you can see the logical data flow for SSIS package number 3, which transforms and loads the customer data from the preCustomer table into the Customer table.

SSIS Package
And finally, the SSIS package...
(Note that the schema of ETL_Customer is identical to that of ETL_redirectCustomer; for this reason you can Union All the two sources in the pipeline.)
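Since the two inputs share an identical schema, the Union All at the top of the pipeline is just a concatenation of the two row sets (new extraction plus previously redirected rows). A rough Python analogue, purely for illustration:

```python
from itertools import chain

def union_all(pre_customer_rows, redirected_rows):
    """Union All works positionally on sources with identical schemas:
    here, rows from the new extraction followed by rows that were
    redirected in earlier runs and are being retried."""
    return list(chain(pre_customer_rows, redirected_rows))
```

This is why keeping the two tables schema-identical matters: if the column lists ever diverge, the union at the head of the pipeline breaks.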

Operational Risk
Remember that in a business intelligence project, 70% of the risks occur in the ETL module. So pay attention to the ETL and always try to identify any operational risk. I use the SSIS error logs only for truly unexpected errors; if you think there is a probability that some task will fail, don't wait to see the error log, try to mitigate the error before it happens!!!

“Operational risk is defined as the risk of loss resulting from inadequate or failed internal processes, people and systems or from external events. This definition includes legal risk, but excludes strategic and reputational risk.”
The European Commission’s Directive 2006/48/EC

Some tips for following best practices
I selected some important tips that can be followed in order to implement a good ETL project using best practices.
- Always comment your project.
- Always create the control flow and data flow logical models, and always do it before implementing the SSIS package.
- For every document you create, place it inside a template containing all the important information to share with all the teams involved in the project. It's very important that when you look at a document, you can quickly identify its subject and scope. For example, in the template I created for the Logical Data Flow, I added important information such as project, package, data flow, developer/team, date, version, purpose, references…
- Use naming conventions for the tasks and components of the SSIS packages.
- Use an SSIS log provider to catch error and warning events.
- It can be useful to implement not only the SSIS log provider but also some custom logs, using event handlers.
- Test your package! How?! Use SQL Data Generator to generate thousands of rows for your source data. This software is very good: for each column you can select all the possible values you want to randomize. (These values can come from a database, a text file…)
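If you don't have a data generator at hand, the same idea (drawing each column's value at random from a pool of allowed values) is easy to sketch. This is just an illustration of the concept, not how SQL Data Generator works internally:

```python
import random

def generate_rows(n, column_pools, seed=None):
    """Generate n test rows, picking each column's value at random
    from a pool of allowed values. (The pools could come from a
    database, a text file, etc., as in the tip above.)"""
    rng = random.Random(seed)  # seed it for reproducible test data
    return [{col: rng.choice(pool) for col, pool in column_pools.items()}
            for _ in range(n)]
```

A seeded generator is handy for regression testing: the same seed reproduces exactly the same "random" source file on every run.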

I hope this post is useful for you, and I also hope you can give me some feedback about it.
In the next post I will try to write something about data mining!!!


Amin said...

Hi Pedro,
It was nice and complete, but why didn't you make use of the Slowly Changing Dimension component that is available in SSIS?


Pedro said...

Hi Amin!!
Why didn't I use the SCD transform?
- Because the customer dimension is not an SCD type.
- This way is more understandable, more flexible, and has better performance. I'm redirecting the warning and error rows in the same data flow, and in that same data flow I have the redirected rows from the last execution as a source.

I don't think I would gain much advantage by using the SCD transformation... but if you think otherwise, tell me how I can improve this data flow using it.

Thanks for your post, Amin!!
It's fantastic to receive feedback!

Pedro said...

"Because the customer dimension is not an SCD type 2."

Amin said...

Dear Pedro,
I am so sorry about the post that I promised; the delay was because of the intense tasks assigned to me at the office.
Regarding the SCD, I meant that the SCD component makes it possible to recognize new and edited rows, so it could reduce the error-handling mechanism and the amount of data loaded into memory; moreover, the SCD component is not merely for SCD type 2.
In conclusion, you will also have to handle the inferred members, because some rows are redirected and waiting for the user's decision. What would happen if the user couldn't make a decision for all the data on time (like the post that I was supposed to write :( )? Could the stakeholders wait some days for the report?
Also, I might be wrong, because I do not know the situation precisely, and I think this is just a small part of the whole.
By the way, it was just another way of looking at the problem ;).

Best Regards,

Pedro said...

Hi Amin,
Your tips make sense, and I understand, but this specific project is a little different. I receive all the information needed from the text files, and I'm not responsible for the bad quality of that data. In the text files there is a column that tells me exactly what I need to do with each row... insert a new customer, update an existing customer, or inactivate an existing customer.

The main goal of this project is to identify the operational risks, avoid them, and let the end user control them.

The redirected rows will wait until the end user corrects the errors, and if he takes too much time to validate them, he is the one responsible!

But what advantage would I get from using the SCD that I don't already have with this data flow?

Note that I complicated the data flow a little; for example, if I try to update a customer that doesn't exist, I could simply ignore it, but instead I'm treating it as a warning event, not an error event.

I identified 4 risks in this data flow, 1 error and 3 warnings, and only the error event could cause a serious problem.

And this data flow has some extra logic behind it that I'm not describing here. I don't want the post to be too big and boring! :-)

Update your blog with some subjects about data mining... I'm taking a data mining class in my master's degree and I'm loving it!

So, update the blog and keep coming here to give me your important feedback!!!!
Cheers, friend!!

Amin said...

Dear Pedro,
In this case, using the SCD doesn't make sense, due to the fact that you are not responsible for the user's response time, and the status of the rows is defined beforehand.
I hope I can write more, and more frequently, but problems sometimes don't allow us to do what we want; still, I'll try harder.


Pedro said...

Thanks, Amin, for all your feedback!
Regarding your blog, try to apply subjects from your daily job to your blog!
I want to know more about what you are doing, because it seems to be very interesting!!

Amin said...

Hi Pedro,
If the name of my blog were not about business intelligence, I would definitely write about my daily activities, but I am a supervisor on a project, and what I do involves writing and validating RUP artifacts, designs, etc. The software I use includes “IBM Business Modeler”, “IBM Rational Software Architect”, “Rational RequisitePro”, and so on.

Because these activities are not related to BI, I haven't written anything about them.
It is hard if your job is not related to your interest ;)

Best wishes,

Ella said...

Hi Pedro,

I haven’t visited in a while and I see I have missed quite a bit!
What an amazing post! I was completely blown away: the layout of the ETL package, the VISIO flow chart and the architecture diagram, not to mention the text!
Very comprehensive and well written.

Thank you for sharing the knowledge,

Pedro said...

Hi Ella!
I'm very happy that you enjoyed my post! I hope to make better and better posts!! Your blog is also very good!!!
Thanks for your visit!!

Anonymous said...

Hi Pedro,

I found your blog helpful, especially "Some tips for using the best practices", because at the moment we are defining processes in our organization...
Keep it up!


