Saturday, December 28, 2013

Positional MSAccess OleDb Parameters

A few years back, I wrote an MS Access data provider for SubSonic 2, the DAL generation tool.

It was a rather thankless task - the V2 project was pretty much obsolete by then - but it was extremely educational.
I'd like to share some tricks I was forced to come up with that I don't think I've ever seen published.
Surprisingly, JET SQL called from .NET can come up with the goods in many ways you might not expect.

If you want to check out the DataProvider code directly, it's 'AccessDataProvider.cs' on GitHub here.

Issue #1 - Positional Parameters

When connecting to an MS Access database in .NET, it's best to use an OleDb connection, since these are cited by Microsoft as being optimal for MS Access, and give us the widest range of functionality.
It is well known that parameters in OleDb queries are assigned by position only, and the names are disregarded entirely.
For example:
SELECT CategoryName, [comp 2] as x1, [cat 1] as x2, [comp 2] as x3 From Categories WHERE CategoryName=[cat 1];
would actually expect four individual parameters which would be assigned in order despite parameters 1 and 3, and 2 and 4 having the same names, effectively giving

SELECT CategoryName, ? as x1, ? as x2, ? as x3 From Categories WHERE CategoryName=?;

This makes it very difficult to construct the automated SQL statements required by a query tool such as SubSonic's.

After quite a bit of digging, what I found was that you actually CAN use named, positional parameters in access: the SQL just needs to be formatted in a precise way.
To return to the above example, the following ad-hoc SQL:

PARAMETERS [cat 1] Text(255), [comp 2] Text(255);
SELECT CategoryName, [comp 2] as x1, [cat 1] as x2, [comp 2] as x3 From Categories WHERE CategoryName=[cat 1];

will expect two parameters, and will re-use the values in the specified locations.
The types for the parameters (above: Text(255) ) are standard DDL types, as catalogued by Allen Browne here.

The parameter names must be standard JET SQL identifiers and as such may not contain the @ character, unlike in MS SQL Server's Transact SQL where the @ is mandatory.

When saved as a query in Access, SQL with parameter declarations are marked as a 'Stored Procedure' by OleDb, a different schema category from normal queries which are marked as a 'View' (see examples below on schema data retrieval).

Issue #2 - Fetching Database Metadata

In order to construct a DAL for an MS Access database, it is necessary to retrieve schema information about database objects.
The OleDbConnection offers a GetOleDbSchemaTable() method that returns most comprehensive schema data of any of the connection types.

DataTable dt = autoOleDbConn.GetOleDbSchemaTable(OleDbSchemaGuid.Columns, null);

However there are still two problems with the schema data retrieved.

(1) There is actually no way to determine whether a column is an Autonumber (the equivalent of a MSSQL identity).
An urban myth is circulating that COLUMN_FLAGS = 90 and DATA_TYPE = 3 (Int) means an Autonumber, but actually these conditions are met for any non-nullable Long field (of which an Autonumber is a subset).

(2) MS Access also has the 'Allow Zero Length String' property, which is unique among DBMS's as far as I know. This is not picked up by the schema data.

In the dataprovider, the way I solved this problem was to open the database object using DAO and use the DAO field attributes:
// Use DAO to extract parameters 
DAO.DBEngineClass dbc = new DAO.DBEngineClass();
DAO.Database db = dbc.OpenDatabase(
    GetAccessDBNameAndPathFromConnectionString(autoOleDbConn.ConnectionString), null, false, "");
...

    // if this is a table, extract addl DAO info
    DAO.Field dbField = db.TableDefs[tempTableName].Fields[tempColumnName];
    tempAllowEmptyString = dbField.AllowZeroLength;
    tempAutoInc = ((dbField.Attributes & (int)DAO.FieldAttributeEnum.dbAutoIncrField) != 0);
If any of this is not comprehensive enough, I'd recommend looking at the SubSonic provider code on GitHub as mentioned above.
There are a lot of useful bits and pieces there, including conversion between .NET DbType, .NET native type, DAO enumerated type and DAO DDL type.

Friday, December 27, 2013

VBA Collections: Using as a Hash Table for Fast String Lookup in Excel and Access

VBA collections are pretty simple beasts. They only have three methods: Add, Remove and Item, and a Count property. When adding values to the collection, we can include a string key for lookup.
Some sample code:
Dim TestColl As New Collection
TestColl.Add 153, "Greece"
TestColl.Add 23, "Japan"
now we can use an integer index to retrieve the n-th 1-based item (returns 23)
TestColl.Item(2)
or the string to retrieve the item by key (returns 153)
TestColl.Item("Greece")
We can also iterate the values:
Dim v As Variant

For Each v In TestColl 
  Debug.Print v
Next
There is some debate as to whether the collection uses a proper hashed lookup to retrieve the value when given a string key. Many sites/blogs recommend using the Dictionary object in the System.Scripting namespace.
While this may mostly work, over the years I have learned to avoid external dependencies in MS Access wherever possible. Hence I determined to investigate the lookup performance of the native VBA collection for both integer and string based retreival.

The following code builds a collection of 100,000 integer values and then looks up a particular value at the start, middle or end of the collection 10,000 times, so as to return a measurable time.
Public Sub TestCollectionItemLookup()
Dim TestColl As New Collection
Dim i As Long
Dim k As Long
Dim StartTime As Single
Dim Iterations As Long
    Iterations = 10000
    Debug.Print "Iterations=" & Iterations

    For i = 0 To 100000
        TestColl.Add i, CStr("k" & i)
    Next
    
    ' By Int Index, i=2
    StartTime = Timer
    
    For i = 0 To Iterations
        k = TestColl.Item(2)
    Next
    
    Debug.Print "By Int Index, i=2: t=" & Format(Timer - StartTime, "0.000000")
    
    ' By Int Index, i=500
    StartTime = Timer
    
    For i = 0 To Iterations
        k = TestColl.Item(500)
    Next
    
    Debug.Print "By Int Index, i=500: t=" & Format(Timer - StartTime, "0.000000")
    
    ' By Int Index, i=50000
    StartTime = Timer
    
    For i = 0 To Iterations
        k = TestColl.Item(50000)
    Next
    
    Debug.Print "By Int Index, i=50000: t=" & Format(Timer - StartTime, "0.000000")
    
    ' By Int Index, i=99999
    StartTime = Timer
    
    For i = 0 To Iterations
        k = TestColl.Item(99999)
    Next
    
    Debug.Print "By Int Index, i=99999: t=" & Format(Timer - StartTime, "0.000000")
    
    ' By String Index, i=500
    StartTime = Timer
    
    For i = 0 To Iterations
        k = TestColl.Item(CStr("k" & 500))
    Next
    
    Debug.Print "By String Index, i=500: t=" & Format(Timer - StartTime, "0.000000")
    
    ' By String Index, i=99999
    StartTime = Timer
    
    For i = 0 To Iterations
        k = TestColl.Item(CStr("k" & 99999))
    Next
    
    Debug.Print "By String Index, i=99999: t=" & Format(Timer - StartTime, "0.000000")
    
    '*****************************************************************
    
    ' By Int Index, random lookup
    StartTime = Timer
    
    For i = 0 To Iterations
        k = TestColl.Item(CLng(Rnd * 100000))
    Next
    
    Debug.Print "By Int Index, random lookup: t=" & Format(Timer - StartTime, "0.000000")
    
    ' By String Index, random lookup
    StartTime = Timer
    
    For i = 0 To Iterations
        k = TestColl.Item(CStr("k" & CLng(Rnd * 100000)))
    Next
    
    Debug.Print "By String Index, random lookup: t=" & Format(Timer - StartTime, "0.000000")
End Sub
Executing this sub printed the following results for me:
   
Iterations = 10000
By Int Index, i=2: t=0.007813
By Int Index, i=500: t=0.031250
By Int Index, i=50000: t=5.406250
By Int Index, i=99999: t=16.195310
By String Index, i=500: t=0.031250
By String Index, i=99999: t=0.027344
By Int Index, random lookup: t=6.164063
By String Index, random lookup: t=0.027344
We can draw the following conclusions:
- lookup in a collection by integer index just uses a linear for/next loop with a value comparison. Values at the start of the collection will be retrieved far more quickly than ones at the end
- lookup by string key IS hashed, and it makes no difference where in the array the values lies, although key lookup is at least an order of magnitude slower than direct array lookup would be (we would expect integer lookup for i=2 to be 2-3 times slower than an array lookup)
- the code also checks 10,000 tries of random retrieval, in case of some kind of caching was speeding up the key lookup after the first retrieval, but the results indicate this not to be the case, with the result very similar to the repeated lookup of a single value

Thus, VBA collections can be recommended for hashed string lookup for large collections, however should never be used for integer indexed lookup for large collections.
If integer lookup was required, it would be the best strategy to create an array and copy the values into it, then use the array for positional retrieval.