This Blog Has Moved

Please update your links, feeds, and bookmarks.

This blog now resides at: http://irfanghaffar.net

Hope to see you all there.

Rolling your own identities

I must say that I am not a database guy and my SQL skills are limited. But that is not an acceptable excuse, especially when we are working in a small team and there is no dedicated database person. We have no choice but to hone our database skills and be ready to switch between roles. That is exactly what I had to do when one of the system's users reported an issue where the database generated the same ID for two different records. I delved into the SP and found what in the database world is called “rolling your own identities”. It is a technique (hack, shortcut, or whatever you want to call it) for generating ID numbers by taking the MAX of a column and adding 1 to it. It is a widely known solution for situations where displaying a primary key column (with its seed and increment values set) on the client interface is not an option, for example when it is a GUID column, or when its ID numbers are out of sequence due to frequent delete and insert operations.

So, this is what I saw in the SP, which is actually a way of Rolling your own identities.


BEGIN TRANSACTION
/* select query -- @ItemNumber is declared earlier in the stored procedure */
SELECT TOP 1 @ItemNumber = ItemNumber + 1
FROM ItemTable
WHERE {some_condition}
ORDER BY ItemNumber DESC

/* insert query */
INSERT INTO ItemTable (ItemNumber, {column2}, {column3})
VALUES (@ItemNumber, {some_value}, {some_value})
COMMIT TRANSACTION

The /* select query */ above can also be written using the MAX function.


SELECT @ItemNumber = MAX(ItemNumber) + 1
FROM ItemTable
WHERE {some_condition}

What exactly is the problem

If you notice, the select and insert queries above are wrapped in a transaction, which matters in a distributed environment where multiple clients access the database (or at least this SP) at the same time. I always believed that the BEGIN TRANSACTION statement was the ultimate savior: that it creates a critical section internally and prevents two transactions from accessing the same database resource at the same time, like a lock statement in C#. If that were true, why did it allow two different transactions to execute the select query concurrently, letting both of them read the same max ItemNumber?
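
To make the race concrete, here is a rough sketch of how two connections can interleave under the default locking; the step comments and the sample value 41 are purely illustrative.

/* Session A: steps 1 and 3 */
DECLARE @ItemNumber INT
BEGIN TRANSACTION
SELECT TOP 1 @ItemNumber = ItemNumber + 1
FROM ItemTable
ORDER BY ItemNumber DESC             /* step 1: reads 41, so @ItemNumber = 42;
                                        the shared lock is released as soon as the row is read */
INSERT INTO ItemTable (ItemNumber, {column2}, {column3})
VALUES (@ItemNumber, {some_value}, {some_value})       /* step 3: inserts ItemNumber 42 */
COMMIT TRANSACTION

/* Session B: steps 2 and 4, running at the same time on another connection */
DECLARE @ItemNumber INT
BEGIN TRANSACTION
SELECT TOP 1 @ItemNumber = ItemNumber + 1
FROM ItemTable
ORDER BY ItemNumber DESC             /* step 2: also reads 41, because session A has not inserted yet */
INSERT INTO ItemTable (ItemNumber, {column2}, {column3})
VALUES (@ItemNumber, {some_value}, {some_value})       /* step 4: a second row with ItemNumber 42 */
COMMIT TRANSACTION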

Research ensues

My first clue was the transaction isolation level. It is a setting that controls the default locking behavior. The default transaction isolation level in SQL Server is Read Committed, which means only data committed by other transactions can be read, so dirty reads are avoided. But this is not what I wanted.
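
For reference, the level is set per connection; this is roughly how it can be inspected and changed (DBCC USEROPTIONS lists the current session settings, including the isolation level):

/* show the current session options, including "isolation level" */
DBCC USEROPTIONS

/* explicitly set the level for this connection (Read Committed is the default) */
SET TRANSACTION ISOLATION LEVEL READ COMMITTED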

What do you want then?

I wanted a way to bar other users from running the select query until the transaction had finished executing *both* the select and the insert statements. In other words, I wanted to turn my transaction into a truly atomic unit of work, so that every transaction gets a new ItemNumber whenever it runs its select statement. Since Read Committed doesn't serve that purpose, I decided to read up on the other transaction isolation levels and locking behaviors.

Read Uncommitted:

As explained in many places on the internet, under this level one can read data that has been modified by other transactions but not yet committed. No shared locks are issued and exclusive locks are not honored, so dirty reads are not prevented; it can also result in phantom data and nonrepeatable reads. I certainly didn't want this, as I needed more restrictive locking, not less. I moved on to Repeatable Read.
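
For completeness, opting into this level looks like the following; the NOLOCK table hint has the same effect for a single table:

/* per connection */
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED
SELECT MAX(ItemNumber) FROM ItemTable WHERE {some_condition}

/* or per table, via a hint */
SELECT MAX(ItemNumber) FROM ItemTable WITH (NOLOCK) WHERE {some_condition}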

Repeatable Read:

This one took a while to get into my head. This isolation level will not allow other users to update the data that has been read by the select query. But why do they call it repeatable? Because, within the SAME transaction, we may want to issue the same SELECT statement multiple times and get the same rows back each time.

Transaction 1

 
SELECT ItemNumber, ItemDetail FROM ItemTable
WHERE ItemNumber < 10

Transaction 2


UPDATE ItemTable
SET ItemDetail = {some_new_value}
WHERE ItemNumber = 5

Transaction 1 continues...

   
SELECT ItemNumber, ItemDetail FROM ItemTable
WHERE ItemNumber < 10

To make sure Transaction 2 doesn't update the records we have selected between multiple reads (meaning we may repeat our read later in the transaction), SQL Server maintains a lock on all the rows we have read until the transaction ends. This is certainly more restrictive locking than Read Committed: we keep ownership of the rows we read until the end of the transaction.

Unfortunately, even Repeatable Read can't create the kind of critical section I talked about earlier, for one reason: it still allows INSERTs. Yes, no other users or transactions can update the rows we have read, but they can always insert new rows amid the rows we have locked. These newly inserted rows are called phantom rows. And that is exactly our problem: we don't want other users to insert until our transaction has finished. Hmm… since I was desperate to find a solution, I moved on and read about the next isolation level.
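
A rough two-connection sketch of the phantom problem, assuming the same ItemTable as above (the comments describe the intended interleaving):

/* Connection 1 */
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ
BEGIN TRANSACTION
SELECT MAX(ItemNumber) FROM ItemTable      /* say this returns 41; the rows read stay share-locked */

/* Connection 2, while connection 1 is still open: the existing rows are locked
   against UPDATE and DELETE, but a brand-new row can still be inserted */
INSERT INTO ItemTable (ItemNumber, {column2}, {column3})
VALUES (42, {some_value}, {some_value})    /* succeeds: a phantom row */

/* Connection 1 continues */
SELECT MAX(ItemNumber) FROM ItemTable      /* now returns 42, so MAX + 1 was never safe here */
COMMIT TRANSACTION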

Serializable:

This isolation level places a range lock on all the data we have read, meaning the rows in that range cannot be updated or deleted, and no new rows can be inserted into that range either. This level does what I wanted, as it is the most restrictive locking. We can also use the HOLDLOCK table hint, which has the same effect as Serializable but applies only to the table it is attached to in the SELECT statement of a transaction.

/* select query */  
SELECT MAX(ItemNumber) FROM ItemTable WITH (HOLDLOCK)
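
Equivalently, the whole transaction can be switched to the Serializable level instead of hinting individual tables, roughly like this:

SET TRANSACTION ISOLATION LEVEL SERIALIZABLE
BEGIN TRANSACTION
SELECT MAX(ItemNumber) FROM ItemTable WHERE {some_condition}
/* ... insert goes here ... */
COMMIT TRANSACTION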

But why do they say “Serializable is prone to cause deadlocks”? Should I be worried about it? I think yes, because now that I was eyeing the most restrictive option, it was only fair to take every warning about it seriously.

Serializable and Repeatable Read may cause deadlocks

I can’t explain this better than MSDN:

“The transaction reads data, acquiring a shared (S) lock on the resource (page or row), and then modifies the data, which requires lock conversion to an exclusive (X) lock. If two transactions acquire shared-mode locks on a resource and then attempt to update data concurrently, one transaction attempts the lock conversion to an exclusive (X) lock. The shared-mode-to-exclusive lock conversion must wait because the exclusive lock for one transaction is not compatible with the shared-mode lock of the other transaction; a lock wait occurs. The second transaction attempts to acquire an exclusive (X) lock for its update. Because both transactions are converting to exclusive (X) locks, and they are each waiting for the other transaction to release its shared-mode lock, a deadlock occurs.”

To avoid this potential deadlock problem, update (U) locks are used. Since HOLDLOCK only applies shared range locks, we can complement it with the UPDLOCK hint to turn them into update range locks.


/* to see what locks HOLDLOCK applies – the same can be repeated for UPDLOCK as well */
BEGIN TRANSACTION
SELECT MAX(ItemNumber) FROM ItemTable WITH (HOLDLOCK)
EXEC sp_lock @@SPID
ROLLBACK

So, that means our final query should look something like this:

 
/* select query */
BEGIN TRANSACTION
SELECT @ItemNumber = MAX(ItemNumber) + 1
FROM ItemTable WITH (UPDLOCK, HOLDLOCK)
WHERE {some_condition}

/* insert query */
INSERT INTO ItemTable (ItemNumber, {column2}, {column3})
VALUES (@ItemNumber, {some_value}, {some_value})
COMMIT TRANSACTION

With the appropriate isolation level and locking hints, we should now be able to generate our own *unique* identities in a multi-user environment without worrying about deadlocks.
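
Putting it all together, here is a minimal sketch of how the pattern might be wrapped in a stored procedure. The procedure name, the error handling, and the ItemDetail parameter type are my own assumptions, not from the original SP; the table and columns follow the examples above.

CREATE PROCEDURE InsertItem
    @ItemDetail NVARCHAR(200)
AS
BEGIN
    SET NOCOUNT ON
    DECLARE @ItemNumber INT

    BEGIN TRY
        BEGIN TRANSACTION

        /* UPDLOCK + HOLDLOCK: the key range stays locked until commit, and the
           locks are taken as update locks so two callers cannot deadlock on
           the shared-to-exclusive conversion */
        SELECT @ItemNumber = ISNULL(MAX(ItemNumber), 0) + 1
        FROM ItemTable WITH (UPDLOCK, HOLDLOCK)

        INSERT INTO ItemTable (ItemNumber, ItemDetail)
        VALUES (@ItemNumber, @ItemDetail)

        COMMIT TRANSACTION
    END TRY
    BEGIN CATCH
        IF @@TRANCOUNT > 0
            ROLLBACK TRANSACTION

        /* re-raise so the caller still sees the original error message */
        DECLARE @ErrorMessage NVARCHAR(2048)
        SET @ErrorMessage = ERROR_MESSAGE()
        RAISERROR(@ErrorMessage, 16, 1)
    END CATCH
END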


HTH,

Using Generics Judiciously

Generics are one of the most important features C# offers, and they were one of the biggest changes introduced with C# 2.0. As the name suggests, they let developers write generic code against generic (unknown) types that are replaced with actual types at JIT time.

Generic types and functions, of course, offer many advantages: they improve performance, make code more expressive, and move a lot of safety from execution time to compile time. The advantage I like most is that they avoid the casting and duplicate code we used to write before generics were introduced. In the past I used the Object type to make code generic when the actual (specific) types were not known in advance.
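
As a small illustration (my own sketch, not code from the project below), an Object-based helper forces a cast at every call site, while the generic version stays strongly typed:

' Pre-generics style: the caller gets an Object back and has to cast it,
' and a wrong cast only fails at run time.
Public Shared Function FirstItemUntyped(ByVal items As System.Collections.ArrayList) As Object
    If items.Count > 0 Then Return items(0)
    Return Nothing
End Function

' Generic style: no casting, and type mistakes are caught at compile time.
Public Shared Function FirstItem(Of T)(ByVal items As System.Collections.Generic.List(Of T)) As T
    If items.Count > 0 Then Return items(0)
    Return Nothing
End Function

With the generic version, FirstItem(Of Integer) gives back an Integer directly; with the Object version the caller would have to write CInt(FirstItemUntyped(list)) and hope the cast is right.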

The thing I learnt today is that you can't decide whether functions should be converted into a generic one just by looking at their signatures. They may share the same name but not the same implementation. For example, consider the functions below, which I encountered:



Public Shared Function GetXSLTTransformedXML(ByVal xmlLocation As String, ByVal xsltLocation As String, ByVal args As XsltArgumentList) As String

    Dim document As New Linq.XDocument()
    Dim xsltTransformer As Xsl.XslCompiledTransform = New Xsl.XslCompiledTransform()
    Dim transformedXML As String = String.Empty

    Try
        document = Linq.XDocument.Load(xmlLocation)
        xsltTransformer.Load(xsltLocation, New Xsl.XsltSettings(False, True), New XmlUrlResolver())
        transformedXML = GetXSLTTransformedXML(document, xsltTransformer, args)

    Catch ex As Exception
        Throw ex

    End Try

    Return transformedXML

End Function

Public Shared Function GetXSLTTransformedXML(ByVal document As Linq.XDocument, ByVal xslt As XslCompiledTransform, ByVal args As XsltArgumentList) As String

    Dim memStream As New System.IO.MemoryStream()
    Dim writer As XmlTextWriter
    Dim streamReader As StreamReader
    Dim transformedXML As String = String.Empty

    Try
        writer = New XmlTextWriter(memStream, System.Text.Encoding.UTF8)
        xslt.Transform(document.CreateReader(), args, writer)
        writer.Flush()
        memStream.Position = 0
        streamReader = New StreamReader(memStream)
        transformedXML = streamReader.ReadToEnd()

    Catch ex As Exception
        Throw ex

    Finally
        memStream.Close()
        writer.Close()
    End Try

    Return transformedXML

End Function

I got excited when I looked at them and thought, let's convert them into one generic function, since both share the same name and return type and differ only in their parameter types. Later, I realized that I would not get the advantages generics bring to the code if the two functions do not share the same implementation as well.

For example, in the functions above I want to execute a different branch of code depending on the first parameter. Knowing this, if I go ahead and merge them into one generic function, I will end up type casting the first parameter to decide which code to execute, as the sketch below shows. If I have to type cast it, that kills the whole purpose of using generics. Not only that, it also kills the intuitiveness of the code: a fellow developer can easily tell which overload to call when he has the location of an XML document, and which to call when the document is already loaded in the cache. If it were a single generic function, that choice would be much less obvious.
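
Here is a rough sketch of what that merged function might look like (a hypothetical, simplified version of my own, not code from the project), just to show where the type tests and casts creep back in:

' Hypothetical merged version: the type parameter buys us nothing, because we
' immediately have to box, test and cast the argument to pick a branch.
Public Shared Function GetXSLTTransformedXML(Of TSource)(ByVal source As TSource, ByVal xslt As Xsl.XslCompiledTransform, ByVal args As XsltArgumentList) As String

    Dim boxed As Object = source

    If TypeOf boxed Is String Then
        ' caller passed a file path: load the document first
        Dim document As Linq.XDocument = Linq.XDocument.Load(CStr(boxed))
        Return GetXSLTTransformedXML(document, xslt, args)
    ElseIf TypeOf boxed Is Linq.XDocument Then
        ' caller passed an already-loaded document
        Return GetXSLTTransformedXML(CType(boxed, Linq.XDocument), xslt, args)
    Else
        Throw New ArgumentException("Unsupported source type: " & GetType(TSource).FullName)
    End If

End Function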

I don't mean that generics are bad in any way; I just believe they are not the best fit for this kind of situation. If I had a MakeList(Of T) kind of function, I would not think twice.


Public Function MakeList(Of T)(ByVal first As T, ByVal second As T) As List(Of T)
    'Builds the same list containing the parameters, irrespective of their type.
    Return New List(Of T)(New T() {first, second})
End Function

HTH,