StrataFrame Forum

Surrogate or natural key?

http://forum.strataframe.net/Topic23023.aspx

By ChanKK - 5/6/2009

Hi,

Would anyone share their experience of using surrogate keys in database design? I know that a surrogate key is not extensible, but I have found that it is very troublesome when debugging. Besides, we are always required to join multiple tables in order to get a "meaningful" result, which could cause performance issues as well. Also, maintenance by technical support is not easy either.



However, I see that most vendors use surrogate keys.



Can anyone share?

Thank you
By Michel Levy - 5/6/2009

Hi,

I use surrogate keys in all tables (except a few static metadata tables) in SQL Server, and in VFP too since we have the autoincrement field.

Performance is dramatically increased, especially if you build a clustered index on it and add an integer column for the FK in the related tables (a rough sketch follows the list):

  • the data is stored physically in that index, so the rows of the child table are read straight from it when the JOIN runs, and your query involves less disk I/O.
  • index size decreases, both on the PK and the FK.
  • you don't need any cascade on update (declarative or by trigger), as these keys are never updated!
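
For what it's worth, here is a rough T-SQL sketch of that layout (the table and column names are made up purely for illustration):

  -- Parent table: int IDENTITY surrogate key, clustered on the PK,
  -- so the data rows live physically in the clustered index
  CREATE TABLE Companies (
      CompanyId   INT IDENTITY(1,1) NOT NULL,
      CompanyCode CHAR(10)          NOT NULL,
      CompanyName VARCHAR(60)       NOT NULL,
      CONSTRAINT PK_Companies PRIMARY KEY CLUSTERED (CompanyId)
  );

  -- Child table: carries only the 4-byte int FK, never the natural key
  CREATE TABLE Employees (
      EmployeeId   INT IDENTITY(1,1) NOT NULL,
      EmployeeNo   CHAR(15)          NOT NULL,
      EmployeeName VARCHAR(100)      NOT NULL,
      CompanyId    INT               NOT NULL,
      CONSTRAINT PK_Employees PRIMARY KEY CLUSTERED (EmployeeId),
      CONSTRAINT FK_Employees_Companies FOREIGN KEY (CompanyId)
          REFERENCES Companies (CompanyId)
  );

  -- Narrow index on the FK keeps the JOIN cheap
  CREATE NONCLUSTERED INDEX IX_Employees_CompanyId ON Employees (CompanyId);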

I don't see any trouble in debugging; on the contrary, it seems easier to me:
RI errors involve only the PK and FK, and business rule errors never involve the PK or FK. The same rule holds when writing an SP or UDF.

By Edhy Rijo - 5/6/2009

Michel Levy (05/06/2009)
I don't see any trouble in debugging; on the contrary, it seems easier to me:

RI errors involve only the PK and FK, and business rule errors never involve the PK or FK. The same rule holds when writing an SP or UDF.




Chankk,

I agree with Michel in all cases. In VFP I used to use GUIDs for my PK/FK since I did not trust its auto-increment feature, but using surrogates in the end made my life easier with my designs. I also use xCase for data modeling, which helps enforce the PK/FK. Now, using SF/MS-SQL with the freedom provided by the SF design, by using Custom Field Properties you can overcome the need to join tables just to display a description field, even though in some cases, when dealing with many records, you may need to use a view or scalar function to grab the data faster.



As for making it easier for tech support to deal with the data, I would say that is not relevant; these days you simply create a join query to debug the data record by record if needed, and there is really no need to look at the PK/FK in this case. Natural keys are still useful in some cases, but I try to avoid them unless there is a MUST for me to use them.
By Greg McGuffey - 5/6/2009

Chan,



I tend to agree with Michel and Edhy. However, I found this interesting discussion on Stack Overflow. It presents both sides, links to other articles, and has some good thoughts.



http://stackoverflow.com/questions/337503/whats-the-best-practice-for-primary-keys-in-tables



Something not mentioned here is that you should put either a unique index or a unique constraint on any natural keys.
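
For instance (just a sketch, and the table/column names here are hypothetical), with a surrogate PK you would still add something like:

  -- The surrogate stays the PK; the natural key is still protected from duplicates
  ALTER TABLE Companies
      ADD CONSTRAINT UQ_Companies_CompanyCode UNIQUE (CompanyCode);

  -- ...or, equivalently, a unique index
  CREATE UNIQUE NONCLUSTERED INDEX UX_Companies_CompanyCode
      ON Companies (CompanyCode);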



In my own experience, I've never been really happy when I've used a multi-column natural key... I've always ended up going back and refactoring the db (which isn't really all that fun) and the app to use a single-column surrogate key.
By Peter Jones - 5/6/2009

Hi,



Another vote for surrogate keys from me too. I always use GUIDs - they simply make life much easier in general, and without them database replication becomes a real problem.



Cheers, Peter
By ChanKK - 5/6/2009

Michel Levy (05/06/2009)


Performance is dramatically increased, especially if you build a clustered index on it and add an integer column for the FK in the related tables:
  • the data is stored physically in that index, so the rows of the child table are read straight from it when the JOIN runs, and your query involves less disk I/O.




But more table joins are required; doesn't that affect performance as well? I have a table that contains about 15 FKs. In order to get them all, doesn't that kill performance?



Edhy Rijo (05/06/2009)


Now, using SF/MS-SQL with the freedom provided by the SF design, by using Custom Field Properties you can overcome the need to join tables just to display a description field, even though in some cases, when dealing with many records, you may need to use a view or scalar function to grab the data faster.





Custom Field Properties only apply to the application. What about queries from a reporting tool? We might use a flexible reporting tool such as FoxFire! or Stonefield Query, which doesn't support SF BOs. Also, as mentioned above, it requires more JOINs just to get a code/description field. How is the performance?



Edhy Rijo (05/06/2009)


As for making it easier for tech support to deal with the data, I would say that is not relevant; these days you simply create a join query to debug the data record by record if needed, and there is really no need to look at the PK/FK in this case.




We have more than 300 tables; it would definitely require much more extra effort just to maintain these views. Besides, a view is usually slower than the underlying table AFAIK, unless it is an indexed view. Also, as mentioned, some tables contain > 10 FKs, and one such table actually contains up to a few million records. The JOINs will kill performance.



Furthermore, using surrogate keys makes data import via the back end very difficult. We have to do extra manipulation of the end-user data (usually Excel): add a new column just for the GUID, and then use some macro or function to "link" the child FK to the master PK.



Please advise on the issues that are "fighting" in my mind as above.



Thank you
By Edhy Rijo - 5/7/2009

ChanKK (05/06/2009)
Please advise on the issues that are "fighting" in my mind as above




Sorry ChanKK, I have nothing more to add at this time. You pretty much have your pros/cons of surrogate keys, so it is up to you and your application's needs to decide what will work best for the application and the support team.
By Greg McGuffey - 5/7/2009

Chan,



One of the primary reasons for using surrogate keys is performance. 15 FKs in a table isn't uncommon. A few million records isn't small, but it isn't huge either. 300 tables also isn't uncommon. A relational db is designed to allow relations BigGrin i.e. FKs all over the place. It is faster to join on a single column between tables than to join on multiple columns, and it is faster to join on an int column than on any other datatype. So using an int surrogate key (assuming you don't have a reason not to use one, like replication) is often a performance optimization.
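
To make that concrete, compare these two joins (purely hypothetical table and column names):

  -- Multi-column natural key join: every column must be compared, and the indexes are wide
  SELECT o.OrderNo, c.CustomerName
  FROM Orders o
  INNER JOIN Customers c
      ON  o.CompanyCode  = c.CompanyCode
      AND o.BranchCode   = c.BranchCode
      AND o.CustomerCode = c.CustomerCode;

  -- Single int surrogate key join: one narrow 4-byte comparison
  SELECT o.OrderNo, c.CustomerName
  FROM Orders o
  INNER JOIN Customers c ON o.CustomerId = c.CustomerId;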



If you're having trouble with performance, then views and stored procedures are the ticket. Yes, they are slower than an indexed query on a single table, but they are potentially the fastest way to pull data from multiple tables. You likely need to look at your indexes and maybe even get good at checking out the query plan and providing hints. I've seen a many-table view on a primary table with 30-50 million records return thousands of results in seconds... using SQL Server 97 after some optimizing (the same query took several hours before optimization). Having a large database means managing indexes and optimizing views and sprocs.



Now, you might actually need to ask another question, which is how normalized the database should be. Whether you use surrogate or natural keys, the use of keys means joins, with int surrogate keys typically being the fastest (I believe SQL Server is optimized for them). However, if you denormalize the database a bit, then you won't have to do as many joins, independent of the type of keys you use. Of course, as you denormalize the database, you also increase the potential for issues if the values need to be updated (i.e. a category is renamed). Another way to denormalize is to include keys of ancestors beyond the parent. I.e. if you have a record in Table A with a parent in Table B, whose parent is in Table C, you could include the key of the grandparent from Table C in the record in Table A to avoid the extra join (hope that made sense - a small sketch follows).
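
A tiny sketch of that last idea (TableA/TableB/TableC and their columns are just placeholders):

  -- Fully normalized: reaching the grandparent costs two joins
  SELECT a.SomeColumn, c.GrandParentName
  FROM TableA a
  INNER JOIN TableB b ON a.TableB_Id = b.TableB_Id
  INNER JOIN TableC c ON b.TableC_Id = c.TableC_Id;

  -- Slightly denormalized: TableA also stores TableC_Id, so one join is enough
  SELECT a.SomeColumn, c.GrandParentName
  FROM TableA a
  INNER JOIN TableC c ON a.TableC_Id = c.TableC_Id;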



So, I'd say that the question of surrogate vs. natural keys is one that first assumes that you will have normalized (to some degree) data and that you'll be doing joins to get data. Natural keys are promoted as the way to keep the design more logical (the keys all mean something to humans), while surrogate keys are used to reduce maintenance and to increase performance (the keys are easy for the machine to use).



The degree of normalization can have a big impact on how complex your data is to work with and maintain. Too normalized and you will have more views and sprocs to deal with, the data schema will be harder to understand, and you'll have to do more optimization to keep performance good. Non-normalized data is a nightmare to maintain and keep accurate. Often slightly denormalized data is the way to go for the best of both worlds. How denormalized really depends on your app.



Finally, performance issues really have more to do with optimization: setting up the correct indexes, maintaining those indexes, and managing views and sprocs. This can be significant work on large (data-set-wise), complex databases.



I hope that provided some more information for you, or some new directions to investigate!
By Peter Jones - 5/7/2009

Hi Chan,

On balance, surrogate keys are likely to give better performance than complex natural PKs, and they are always easier to maintain and use.

Our main transaction table at one site has about 9 million rows and 17 FKs. I've never wanted to return transaction details and data from all 17 FKs in one query, but I will often extract data from 8 to 10 FK relationships for reporting purposes.

I've never seen any performance problems (SQL Server 2000 / W2K3 Standard / 4GB memory / dual Xeon processors; the machine is about 3 years old). I would expect a query that returns 10,000 records to run in a few seconds. That being said, we always return data within a date range, and the transaction table has a clustered index on date, so getting at the transaction information itself will always be very quick. This is where your performance effort needs to go - getting at the transaction data quickly - SQL will always do a good job of associating the related FK info.
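
Roughly this shape of query is what I mean (the table and column names are invented for the sketch):

  -- The date range seeks straight into the clustered index on the transaction date;
  -- the FK joins to the lookup tables are then cheap
  SELECT t.TransactionId, t.Amount, c.CustomerName, p.ProductName
  FROM Transactions t
  INNER JOIN Customers c ON c.CustomerId = t.CustomerId
  INNER JOIN Products  p ON p.ProductId  = t.ProductId
  WHERE t.TransactionDate >= '20090101'
    AND t.TransactionDate <  '20090201';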

Cheers, Peter

By Trent L. Taylor - 5/11/2009

Thanks for the posts, guys!  Good stuff out here!
By ChanKK - 5/13/2009

Hi,

Thank you for all the informative replies.

We all understand the pros/cons of natural vs. surrogate keys, and ALL of us prefer surrogate rather than natural keys.

I am very interested to know how you all overcome the limitations of surrogate keys.



For example:



1. Too many joins; it is difficult to optimize indexes, especially when using a reporting tool like FoxFire! / Stonefield Query where the reports are user-defined and the filter conditions vary so much.



2. It is troublesome for the support team/developer to do troubleshooting, as we always need to check the master table first to get the surrogate key, and only then can we query the rest. It complicates the troubleshooting process. Even though a standard view/stored proc can be created (which joins all the FK tables), extra effort is required.



3. We have our own generic import module, which allows the customer to prepare data in Excel, define the corresponding schema in our app, and import the Excel data. For sure, the customer will not care about the surrogate key; the data is keyed the natural way. We still need additional effort to look up the corresponding surrogate from the master table using the natural key. It seems to defeat the purpose of the surrogate key.



4. We have a lot of processing which, again, still needs the natural key for lookups and searching. This also seems to defeat the purpose of the surrogate key.



Based on the examples I posted above, do you think surrogate keys are suitable in our environment? What is your approach to these issues?



What led you to decide to use surrogates?



Thank you for your efforts in sharing this.
By Peter Jones - 5/13/2009

Hi Chan,



I guess the bottom line is that you use whatever is appropriate for your app. If your data and your requirements make natural keys the way to go then that's the way to go.



Cheers, Peter
By Charles R Hankey - 5/14/2009

I offer my .02 for free:



Never ever ever ever key on data. Data is something the user expects to be able to change without regard for under-the-hood consequences. In 25 years I have heard at least 100 variations on the theme of "Our part numbers/student id numbers/field office names will never change".



Surrogates are single fields that uniquely identify a row; if every other value in that row changes, it will still be the same row. The users don't even know they are there.



I have never seen anyone sorry they used surrogate keys. I have been involved in many, many situations, in either mentoring or project rescue, where there was a great deal of regret about not using them.
By Charles R Hankey - 5/14/2009

In my hurry to rant on a favorite topic I ignored your actual questions Smile



1. Too many joins; it is difficult to optimize indexes, especially when using a reporting tool like FoxFire! / Stonefield Query where the reports are user-defined and the filter conditions vary so much.



This is more an issue of normalization than what kind of keys you use. The number of joins is determined by the normalization, not the keys that join the tables.




2. It is troublesome for the support team/developer to do troubleshooting, as we always need to check the master table first to get the surrogate key, and only then can we query the rest. It complicates the troubleshooting process. Even though a standard view/stored proc can be created (which joins all the FK tables), extra effort is required.



This again is an issue of normalization, not the type of PK/FK that is used. If you are talking about keying on actual redundant data between tables, you are compounding the problem of what happens if the user decides to start using a different part-numbering system, etc.



3. We have our own generic import module, which allows the customer to prepare data in Excel, define the corresponding schema in our app, and import the Excel data. For sure, the customer will not care about the surrogate key; the data is keyed the natural way. We still need additional effort to look up the corresponding surrogate from the master table using the natural key. It seems to defeat the purpose of the surrogate key.



Your import module must be validating the customer's data before putting it into the app; you assign your keys there. Of course you need to use the natural keys, but you can have indexes other than the PK index. There is nothing wrong with having candidate keys that perform a similar function and give performance benefits for particular tasks.
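
As a sketch of what that lookup can look like (the staging table name and the Employees/Companies columns are just stand-ins), the Excel rows go into a staging table exactly as the customer typed them, and one set-based statement resolves the surrogate FK through the natural key:

  INSERT INTO Employees (EmployeeNo, EmployeeName, CompanyId)
  SELECT s.EmployeeNo,
         s.EmployeeName,
         c.CompanyId            -- surrogate FK resolved via the natural key
  FROM ImportStaging_Employees s
  INNER JOIN Companies c ON c.CompanyCode = s.CompanyCode;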



4. We have a lot of processing which, again, still needs the natural key for lookups and searching. This also seems to defeat the purpose of the surrogate key.



It doesn't defeat the purpose of the surrogate key, which is to connect normalized tables and to identify a row for inserts and updates. You can have other indexes to increase performance for the user's lookups and searches.



Based on the examples I posted above, do you think surrogate keys are suitable in our environment?



Absolutely ! BigGrin
By Alex Luyando - 5/14/2009

So no strong opinion either way, Charles? Smile
By Trent L. Taylor - 5/14/2009

Good comments, Charles! Smile
By Charles R Hankey - 5/14/2009

I do feel a little less strongly than my belief that Buffy the Vampire Slayer is one of the greatest TV shows of all time BigGrin
By ChanKK - 5/14/2009

Hi,

Thank you for valuable comments.

Before anything else, I would like to ask: are Department Code and Company Code surrogate keys or natural keys?
By Michel Levy - 5/14/2009

Hi Chankk,

If a key means anything, it's a natural key. The only meaning of a surrogate key must be its ability to identify the row, not the data in the row.

So, if CompanyName = 'Microfour', the surrogate key would be an integer (or a GUID if you need replication), even if CompanyCode = 'Micro4' or CompanyCode = 'SFteam' or CompanyCode = whatever you change it to.

By ChanKK - 5/14/2009

Hi,

If CompanyCode is a natural key, then why do you say the surrogate key doesn't need more joins? For example, say I have the tables below:





Surrogate Key Approach

Table: Companies
  PrimaryKey: CompanyId
  CompanyId: INT
  CompanyCode: CHAR(10)
  CompanyName: VARCHAR(60)

Table: Employees
  PrimaryKey: EmployeeId
  EmployeeId: INT
  EmployeeNo: CHAR(15)
  EmployeeName: VARCHAR(100)
  CompanyId: INT

Expected Query Result: EmployeeNo, EmployeeName, CompanyCode

SELECT EmployeeNo, EmployeeName, c.CompanyCode
FROM Employees e INNER JOIN Companies c ON e.CompanyId = c.CompanyId





Natural Key Approach

Table: Companies
  PrimaryKey: CompanyCode
  CompanyCode: CHAR(10)
  CompanyName: VARCHAR(60)

Table: Employees
  PrimaryKey: EmployeeNo
  EmployeeNo: CHAR(15)
  EmployeeName: VARCHAR(100)
  CompanyCode: CHAR(10)

Expected Query Result: EmployeeNo, EmployeeName, CompanyCode

SELECT EmployeeNo, EmployeeName, CompanyCode
FROM Employees

As the example above shows, the surrogate key approach requires a JOIN while the natural key approach doesn't. This is why we think surrogate keys require more JOINs.



Of course, we can use a drop-down control and specify CompanyId as the ValueField and CompanyCode as the DisplayField to let the user choose the company code. However, what about the support team? Preparing views for them is an additional maintenance effort.



Thank you for sharing.
By Michel Levy - 5/14/2009

Ah, I understand now.

Yes, of course, you need a JOIN; and you think that all these necessary joins will affect performance and increase complexity for support.

But in your example, what happens when the data in the natural key 'CompanyCode' of the parent table 'Companies' is modified? You need to update all the child records in the table 'Employees', and you have probably coded a trigger for that purpose, or even better, a constraint with ON UPDATE CASCADE (a trigger is procedural, and slower than a declarative constraint).
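
Just to sketch the difference with your own tables (illustrative only):

  -- Natural-key design: every CompanyCode change must ripple into Employees
  ALTER TABLE Employees
      ADD CONSTRAINT FK_Employees_Companies
      FOREIGN KEY (CompanyCode) REFERENCES Companies (CompanyCode)
      ON UPDATE CASCADE;

  -- Surrogate-key design: the code lives only in Companies, so one UPDATE is enough
  -- (the CompanyId value 42 is made up)
  UPDATE Companies SET CompanyCode = 'SFteam' WHERE CompanyId = 42;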

With a surrogate key, you will never update the child tables, only one record in the parent table. Keep in mind that writing to a physical disk (which an update requires) is always slower than reading from memory (and your view will be served from the plan cache in memory, if you have good indexes).

And the more foreign keys you have in the child table, the more surrogate keys improve performance.

Additional maintenance effort in preparing views? I'm not sure about that. IMHO, there is less effort in development (both on the server side and the client side) and in maintenance.
As an example, let's say tomorrow you need to get the company name along with the employee (a new requirement from your main customer Tongue). You will need a view, OK? So why not have this view today, and simply add a field from the 'Companies' table to it when the time comes? Views bring you scalability. A sketch of such a view follows.
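
Something like this, again with your example tables (the view name is just an example):

  CREATE VIEW vw_EmployeeList
  AS
  SELECT e.EmployeeNo,
         e.EmployeeName,
         c.CompanyCode,
         c.CompanyName        -- tomorrow's "new requirement" is a one-line change here
  FROM Employees e
  INNER JOIN Companies c ON c.CompanyId = e.CompanyId;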

By Charles R Hankey - 5/14/2009

Your example is about normalization. A company code should be part of the company record in the company table. If you store that code again in the region_office table, you do not have normalized data. If you have twenty tables that use the company code, you have twenty things that are going to break when the user decides to change the company code system. If you are using the company_pk as the company_fk in all the other tables, the user can change the data he can see (the company code) to anything he wants in one place - the company table - and every region_office screen etc. will show the correct company code.



The joins you need in the meantime are annoying, sure, but it really is better design. In SF you can make pulling those related values (company_code) part of the SQL SELECT or sproc that gets the data, and then use a custom field property to give you a strongly typed property.