StrataFrame Forum

Efficiency in data processing with BOs

http://forum.strataframe.net/Topic14426.aspx

By Keith Chisarik - 2/19/2008

What do you all consider the most efficient way to process through large amounts of data?



1) Load the BO with all data upfront and then "filter", modify data, move on

2) Load the BO upfront and then "seek" (can you seek for more than one record even?)

3) Load the BO with only the small subset you want to work with, inside a loop (many calls to the database), modify data, move on

4) Don't use a BO at all; use ADO or LINQ to SQL.

5) Other



I have to adjust pricing for a very large volume of records, needing to update or insert 1-6 records at a time for a given combination of criteria, out of a total record count of about 200,000.



Just curious as to your opinions or past experience with this. I find myself having to worry about the most efficient way of doing things, since I have never been able to make the .NET/SQL stuff as fast as VFP.
By Trent L. Taylor - 2/19/2008

What do you all consider the most efficient way to process through large amounts of data?

There is really only ONE best way...especially if you are going to be pulling from a number of different tables and it may require a number of different queries...a CLR stored procedure.  This lets you still program in .NET, but the work takes place at the server.  However, this will only work for SQL Server.  Otherwise you will have to create stored procs in the database language you are using.
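Just to give you an idea of the shape of one, here is a bare-bones sketch (the table and column names are made up...you would obviously put your real logic in the body):

Imports System.Data
Imports System.Data.SqlClient
Imports Microsoft.SqlServer.Server

Public Class PricingProcs

    '-- Runs entirely on the server; the "context connection" keeps the work
    '   inside the calling session so the data never travels to the client.
    <SqlProcedure()> _
    Public Shared Sub AdjustPricing(ByVal adjustmentPercent As Decimal)
        Using conn As New SqlConnection("context connection=true")
            conn.Open()
            Using cmd As New SqlCommand( _
                "UPDATE Pricing SET pr_fee = pr_fee * (1 + @pct / 100)", conn)
                cmd.Parameters.AddWithValue("@pct", adjustmentPercent).SqlDbType = SqlDbType.Decimal
                cmd.ExecuteNonQuery()
            End Using
        End Using
    End Sub

End Class

You compile that into an assembly, deploy it to SQL Server, and then call the proc like any other stored procedure.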

Here is an example of the performance increase: we have a program that deals with entering payments that come back on an EOB.  Our first release sat on top of VFP, and it cratered if the EOB was very large...not to mention slow.  We now have this totally on SQL Server using CLR stored procs, and it is basically instantaneous.  We have another scenario where the end-user is posting claims, and there were literally 18+ queries that took place each time in order to check all the exceptions, get the correct fees, and so forth.  This totally died in VFP, but with CLR stored procs it is literally as fast as you can type...it is incredible!

The last example I will give you relates to our prescription data: there are some tables with millions of records which have to be queried and compared to check for items such as drug interactions.  Pulling this back to the client would never work, but by using a CLR stored procedure it is again basically instantaneous.

In the current beta, though there is no documentation yet, there is also a new feature that allows a single query to the server to bring back multiple result sets (assuming the database supports this [i.e. SQL Server]).  I have one example where I have a single query to the server but load 6 business objects...one trip...no stored procs...and FAST!

Dim cmd As New SqlCommand("SELECT * FROM Customers WHERE cs_pk = @cs_pk;" & _
                          "SELECT * FROM CustomerOrders WHERE or_cs_pk = @cs_pk")

'-- Add the parms
cmd.Parameters.AddWithValue("@cs_pk", MyCustomerPk).SqlDbType = SqlDbType.Int

'-- Now execute the single query and load both BOs in one trip
MicroFour.StrataFrame.Business.BusinessLayer.FillMultipleDataTables(cmd, CustomersBO, CustomerOrdersBO)

I know that you are using DB2 in certain conditions, so this is an issue and will definitely require you to "think out of the box."  DB2 is a more cumbersome and more limiting environment to work within.  I am not as familiar with it so I do not know if it supports the multiple result sets or not. 

All of your questions are valid, but trying to determine whether ADO or LINQ to SQL will help is actually not the issue here.  First, we use ADO.NET within the BO; second, LINQ to SQL is only going to strongly type your queries and really has nothing to do with performance.  The whole issue with large data is generally getting it back from the server, the number of trips, and how the query is formulated.  Also, different scenarios require different solutions, which is why I mentioned several options above.

I have to adjust pricing for a very large volume of records, needing to update or insert 1-6 records at a time for a given combination of criteria, out of a total record count of about 200,000.

In this case, you shouldn't be bringing all 200,000 records back to your side.  If you know what needs to be changed, you are better off working with a much smaller data set (i.e. the pk, code, and fee) and then having the update take place back on the server.  If the end-user has to enter all 200,000 fees, then this is going to happen one or a few at a time, not all 200,000 at once.  I would imagine that you already know what needs to be changed, so you could create a CLR stored procedure (or TSQL if you want pain Smile ) and have all of this work take place on the server so the data never makes it to the client...just the results and the information that the end-user needs.
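The client-side call then becomes something trivial along these lines (the proc name, parameters, and connection variable are just placeholders for whatever you actually create):

'-- Rough sketch only: usp_AdjustPricing and its parameters are placeholders
'   for whatever server-side proc you create.
Using conn As New SqlConnection(myConnectionString)
    Using cmd As New SqlCommand("usp_AdjustPricing", conn)
        cmd.CommandType = CommandType.StoredProcedure
        cmd.Parameters.AddWithValue("@VendorCode", vendorCode).SqlDbType = SqlDbType.VarChar
        cmd.Parameters.AddWithValue("@EffectiveDate", effectiveDate).SqlDbType = SqlDbType.DateTime
        conn.Open()
        cmd.ExecuteNonQuery()   '-- only the criteria cross the wire; the work happens on the server
    End Using
End Using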

Whoa...that was a big post, but I hope it gives you some ideas Smile

P.S. Using SQL Server versus VFP we have had in some cases 100 times better performance...SQL Server is WAY faster!  It all comes down to how you deal with the data.

By Greg McGuffey - 2/19/2008

Normally I'd consider using stored procedures, which would be the way to go if you were updating all 200,000 at once. But since you aren't...I'd be asking these questions:



- how do you determine what the 1-6 records are? How much user interaction is there? Is the logic complex?

- Should the process be cancellable/within a transaction?

- How fast is the connection to the db?


By Greg McGuffey - 2/19/2008

I really need to check if someone else posted before I post...nice info Trent.
By Trent L. Taylor - 2/19/2008

I really need to check if someone else posted before I post...nice info Trent.

LOL...I do it all the time.  Usually we are both posting at the same time when this happens BigGrin

By Keith Chisarik - 2/19/2008

I actually will be updating all 200,000 "at once." I have an EDI pricing update file that, from my initial test, updates close to 95% of the existing records and then adds more. So this isn't a user doing entry; it is an automated process that pulls in files (actually many), parses them, and then updates the DB2 pricing tables with the updates plus adds for new products or date-sensitive pricing.



I am using DB2 for this so I will have to see what capabilities I have there.



Thanks for the input.



Greg:

- how do you determine what the 1-6 records are? A concatenated key of 3 fields returns the records to be updated.

- How much user interaction is there? Just a button press....hopefully

- Is the logic complex? Not really; pricing is based on three fields.

- Should the process be cancellable/within a transaction? No, I want to be able to run it whenever the user likes, and it just pulls the new data files from the vendors and evaluates what does and does not need to change.

- How fast is the connection to the db? Right now it is a local LAN; eventually it will be via a VPN that isn't that great.
By Greg McGuffey - 2/19/2008

Ah, I forgot that you are living the love with DB2.



I'd say that you will likely be better served using DB2 to do the heavy lifting. It appears that DB2 has the equivalent of CLR stored procedures, and they appear to be able to be written in VB.NET or C# too! It has normal SQL sprocs as well. The nice thing about the compiled sprocs is that they are efficient at managing complex rules.



Now, the next question is: where are these files and who processes them?



If the user will be indicating the files to process, then I might try to load the files locally into a BO (doing the parsing along the way), then load that data into a table on DB2, then call the sproc to do the work.
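Something like this for the parsing piece (the delimiter and file layout are pure guesses on my part...adjust to whatever the EDI format really looks like):

'-- Rough sketch of the parse step; the delimiter and field layout are guesses.
Using reader As New IO.StreamReader(pricingFilePath)
    Do While Not reader.EndOfStream
        Dim line As String = reader.ReadLine()
        If line.Trim().Length = 0 Then Continue Do   '-- skip blank lines

        Dim fields As String() = line.Split("|"c)    '-- assumed pipe-delimited
        '-- ...add the parsed values to the staging BO here...
    Loop
End Using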



If the files are in some central repository, you could theoretically get DB2 to do the file parsing work also.



If it takes a while to do the processing, you could of course use another thread.



Just some ideas.



Good luck!
By Peter Jones - 2/19/2008

Hi Trent,

From your response I get the impression that you believe CLR is the way to go with SQL Server. Is this so and, if it is, what made you guys move away from standard SQL stored procs?

Cheers, Peter

By Trent L. Taylor - 2/19/2008

From your response I get the impression that you believe CLR is the way to go with SQL Server. Is this so and, if it is, what made you guys move away from standard SQL stored procs?

We still use standard TSQL stored procs for INSERTs, UPDATEs, and DELETEs (our CRUD settings).  But for anything past that, or anything that requires customization or custom queries on the server, we use CLR.  We haven't changed any of our standards or logic.  CLR stored procedures are much easier to work with and have more flexibility (especially for .NET developers).  You get all of the benefits of TSQL as it relates to server-side performance (you do take a slight hit using CLR over TSQL, but it is not really noticeable) but get to write all of your code in C# or VB.NET.  Also, the DDT will deploy CLR assemblies for you, which just adds to the ease of implementation.

By Keith Chisarik - 2/25/2008

I am still struggling with speed issues and require some help.



In breaking down the problem, I know that I will probably have to write my own insert sprocs on the DB2 side; for now I am ignoring that.



I am having speed issues before I ever save across the wire, just working with the BO. After much trial, it seems that the call to the Add() function of the BO is my first bottleneck. I invested in a code profiler and it clearly shows that to be the issue (see attachment), and I can confirm it: if I replace the Add() line with just about any other function, things are fast.



I was under the assumption that working with the BO was essentially the same as working with the underlying DataTable, so what am I doing wrong that bo.Add() in a loop is taking forever? Adding 87k records to the BO takes about 15 minutes, before BO.Save() is ever called. The BO has only 6 fields: 4 strings, a self-incrementing integer PK, and an integer row version field. PrimaryKeyIsAutoIncrement is true and row versioning is set to optimistic row version.



Any thoughts?
By Trent L. Taylor - 2/25/2008

Is the 109 the amount of time or the number of calls?  We too use code profilers, and they are great tools.  But in this example, there is a lot of stuff that can be happening once BO.Add() is called.  What do you have as a filter?  What do you have in the SetDefaultValues event?  What do you have as a Sort?  There is something else in your code that is causing the slowdown.  The fact that you are adding 87k records before a save is somewhat disconcerting, especially considering that this is all going to have to be sent across the wire.  I would be sending things across in smaller batches (wrap them in a transaction if you need to).  But this is really a side issue.  Look in the areas that I mentioned to see what you find.
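On the batches, all I mean is something along these lines (PricingBO, the batch size, and the parsedRows variable are placeholders, and I left the transaction piece out for brevity):

'-- Rough sketch of batching the save so each trip across the wire stays small.
Const BatchSize As Integer = 5000
Dim bo As New PricingBO()
Dim counter As Integer = 0

For Each fields As String() In parsedRows
    '-- ...add the parsed row to the BO here...
    counter += 1

    If counter Mod BatchSize = 0 Then
        bo.Save()              '-- send this batch across the wire
        bo = New PricingBO()   '-- start the next batch with a fresh BO
    End If
Next

bo.Save()   '-- save whatever is left in the final partial batch

That keeps each round trip small without changing any of your parsing logic.
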
By Keith Chisarik - 2/26/2008

No filter, sort, default values, nothing you mentioned.



It is a very simple BO with a single fill all method.



The 109 is seconds, and it covers just a very small portion of the operation, for illustration purposes.



I could do 87k adds to a cursor or table in VFP in the blink of an eye. I must be doing something wrong here.



Could it be the fact that I am using DB2 in some way? I wouldn't think so, since I am just doing the Add in memory, but at this point I am grasping. I am going to build a sample project with the same table structure/BO properties using SQL Server and see what happens.


By Trent L. Taylor - 2/26/2008

Keith,

I guess without getting my hands on it I cannot give you any more suggestions here.  For whatever it is worth, we ran a test today dealing with 400,000 records (medical transactions) on a combination of tables that may have as many as 50 fields.  We got the calculations and response down to under 1 second.  I know that this doesn't help you with your problem, but we have all been laughing around here lately because everything that we are doing was literally impossible with VFP....because we tried!  Even DB2 is a better database than VFP and can perform at a much better level simply due to the fact that it is a server, supports sprocs, doesn't stream the entire file over the network, etc.

The Add alone is not going to slow the world down.  To prove this, create a quick sample project that uses the SF sample database.  Then, using the Customers BO, add 80,000+ records dynamically (without committing them back to the server) to see what type of response you get.  I am willing to bet that it will take FAR less than 109 seconds.  Also, if you are not needing to update the UI, use the NewRow method instead, as this will not attempt to update any bound controls...which can improve performance as well.  At this point the database is irrelevant since you are dealing with the BO alone.
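Something like this (the BO instance and field names below are made up...use your real strongly-typed properties):

'-- Rough sketch: parsedRows holds each EDI line already split into its fields;
'   the BO instance name and field properties are made up.
For Each fields As String() In parsedRows
    PricingBO1.NewRow()                    '-- adds the row and makes it the CurrentRow
    PricingBO1.ProductCode = fields(0)     '-- strongly-typed field properties
    PricingBO1.VendorCode = fields(1)
    PricingBO1.PriceCode = fields(2)
    PricingBO1.UnitPrice = fields(3)
Next

'-- The auto-increment PK and the row version are handled for you, so they are
'   not set here.  Save() is still a separate step.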

By Keith Chisarik - 2/26/2008

NewRow() was the fix; it is lightning fast now, like I would expect. Total processing time for the 87k records is under 20 seconds, and that is with round trips between threads to update a progress bar in the UI layer.



Now to just get it across the wire Smile



I must have a control somewhere bound to the BO; I haven't found it yet, but it must be there.



From the NewRow() documentation: "Adds a new data row to the current data table and sets the CurrentRow to the new row (This method is recommended for programmatically adding new rows. To add new rows through a user interface, use Add(), as the editing state of the business object will be set and the Navigated event will be raised when the row is added.)"
By Trent L. Taylor - 2/27/2008

Good deal.  One other thing, you know that you can update on more than one thread at a time, right?  On a business object, there is a property called DataLayerAccessThreads.  By default this is set to one.  However, in a situation like this, you may want to increase this to as high as 7.  If I recall, there are diminishing returns after 7...but you can test between 2 and 7 to see where you get the best performance.
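It is just a property you set before the save, something like this (the BO instance name is made up):

'-- Save on multiple threads; values between 2 and 7 are worth testing.
PricingBO1.DataLayerAccessThreads = 4
PricingBO1.Save()
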
By Keith Chisarik - 2/27/2008

Nope.



DataLayerAccessThreads sounds VERY interesting. I searched the documentation for that and didn't find much. Can you direct me to where I can learn more on that?



Thanks!
By Trent L. Taylor - 2/27/2008

It is really pretty simple.  This property is on the BO and by default is set to 1.  Simply change this property to another value and it will begin updating on additional threads.  This can be a very effective tool when updating so many records at the same time.  I know that it has probably been a while, but we covered this in class a little; we didn't go into great detail, so it is one of those items that doesn't matter...until you need it Smile

What will happen is that records will begin to save on multiple threads, which helps to prevent an update bottleneck when updating large datasets.  This is especially important when updating over a VPN or slow connection.  The worse the bandwidth, the more performance gain you will get...but this will still dramatically improve the update time when dealing with so many records.

FWIW...and a little sneak peek...we are setting up a knowledge base that will continuously have new articles, samples, and videos.  There will also be how-tos, FAQs, etc.  This will make it SO much easier for us to put out new samples, help information, and the like versus requiring a new build, new help docs, and trying to squeeze more samples into the install.  This will be a more fluid environment, and this would be a good topic that may end up there.

By Keith Chisarik - 2/27/2008

Excellent, thanks a lot.