This blog post is sort of a note to myself: use-temp-tables-instead-of-table-variables-when-the-number-of-records-is-large. There it is, and I promise to remember it every time I write a DECLARE @something TABLE (… statement.
And now a bit of explanation. I recently worked on a project where we added a new fact table to our data warehouse. An SSIS package was created to get the data from the source, do the dimension lookups and populate the fact table. Simple stuff, except that I had to do some data manipulation while selecting data from the source. Now, I could have just written a stored procedure and used that as the data source, but we are all aware of the pitfalls of that, so I decided to use a table-valued function (TVF) instead. TVFs have predictable metadata output and SSIS plays nicely with them. However, you cannot use temp tables in a TVF, so invariably I ended up using a table variable. The package worked perfectly for smaller workloads, but as the number of records in the table variable grew, the performance degradation became more and more severe. A casual Google search turned up a raft of blog posts suggesting that the culprit might be the table variable. So I changed the function to a stored procedure with a temp table and used the stored procedure in SSIS. Sure enough, our performance issue was resolved as if it had never existed. Now, there are many blog posts by very intelligent people on why temp tables perform better than table variables, but I wanted to check first hand, and here's my little investigation.
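For context, the restructure looked roughly like the sketch below. The object names (dbo.fn_GetFactRows, dbo.usp_GetFactRows, dbo.SourceTable) are illustrative placeholders, not the real objects from the project.

/* Illustrative sketch only: object names are placeholders */

/* Version 1: multi-statement TVF, which only allows a table variable */
CREATE FUNCTION dbo.fn_GetFactRows ()
RETURNS @result TABLE (
    id INT
    ,pdate DATETIME
    ,sdate DATETIME
    )
AS
BEGIN
    INSERT INTO @result (id, pdate, sdate)
    SELECT id
        ,pdate
        ,sdate
    FROM dbo.SourceTable -- plus the extra data manipulation

    RETURN
END
GO

/* Version 2: stored procedure, free to stage the rows in a temp table */
CREATE PROCEDURE dbo.usp_GetFactRows
AS
BEGIN
    SET NOCOUNT ON

    CREATE TABLE #result (
        id INT
        ,pdate DATETIME
        ,sdate DATETIME
        )

    INSERT INTO #result (id, pdate, sdate)
    SELECT id
        ,pdate
        ,sdate
    FROM dbo.SourceTable -- same manipulation, now against a temp table

    SELECT id
        ,pdate
        ,sdate
    FROM #result
END
GO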
1. Create two temp tables and populate them with random data. These have nothing to do with the actual results; they are just used for data population.
/* Create base tables which will contain random generated data */
IF OBJECT_ID('tempdb..#basetable') IS NOT NULL
    DROP TABLE #basetable

IF OBJECT_ID('tempdb..#basetable2') IS NOT NULL
    DROP TABLE #basetable2

CREATE TABLE #basetable (
    id INT
    ,pdate DATETIME
    ,sdate DATETIME
    )

CREATE TABLE #basetable2 (
    id INT
    ,pdate DATETIME
    );

WITH id AS (
    SELECT id = 1

    UNION ALL

    SELECT id + 1
    FROM id
    WHERE id <= 1000
    )
INSERT INTO #basetable
SELECT id
    ,cast(cast(RAND() * 100000 AS INT) AS DATETIME)
    ,cast(cast(RAND() * 100000 AS INT) AS DATETIME)
FROM id
OPTION (MAXRECURSION 0)

;WITH id AS (
    SELECT id = 1

    UNION ALL

    SELECT id + 1
    FROM id
    WHERE id <= 1000
    )
INSERT INTO #basetable2
SELECT id
    ,cast(cast(RAND() * 100000 AS INT) AS DATETIME)
FROM id
OPTION (MAXRECURSION 0)
2. Get the records generated in the temp tables above into two table variables, join them, and update the first one with values from the second one.
/* Get the records generated above in two table variables.
   Join the table variables on id column and update pdate column */
DECLARE @table1 TABLE (
    id INT
    ,pdate DATETIME
    ,sdate DATETIME
    )

DECLARE @table2 TABLE (
    id INT
    ,pdate DATETIME
    )

INSERT INTO @table1
SELECT id
    ,pdate
    ,sdate
FROM #basetable

INSERT INTO @table2
SELECT id
    ,pdate
FROM #basetable2

UPDATE @table1
SET pdate = t2.pdate
FROM @table1 t1
LEFT JOIN @table2 t2 ON t1.id = t2.id
If you look at the estimated and actual execution plans below, they are exactly the same. The SQL Server optimizer decides to use a nested loop join because there are no statistics on table variables; it simply assumes the table variable contains a single row.
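(Side note: if you prefer a text view over the graphical plan, SET STATISTICS PROFILE returns EstimateRows and Rows per operator. The snippet below is a minimal, self-contained sketch of the one-row guess, separate from the test above, and assumes a version without deferred table variable compilation, i.e. before SQL Server 2019.)

SET STATISTICS PROFILE ON
GO

DECLARE @t TABLE (id INT)

INSERT INTO @t
SELECT TOP (10000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_objects a
CROSS JOIN sys.all_objects b

-- In the profile output the scan of @t shows EstimateRows = 1
-- while the Rows column shows 10000
SELECT COUNT(*) FROM @t
GO

SET STATISTICS PROFILE OFF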
Not to mention that this is very slow, especially when I increased the number of records to over a million.
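(If you want to reproduce the larger runs without waiting for the recursive CTE, one way, sketched below against the same #basetable from step 1, is a cross join over a catalog view; RAND(CHECKSUM(NEWID())) just gives each row its own random value.)

/* Sketch: reload #basetable with a million rows (do the same for #basetable2) */
TRUNCATE TABLE #basetable

INSERT INTO #basetable (id, pdate, sdate)
SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    ,cast(cast(RAND(CHECKSUM(NEWID())) * 100000 AS INT) AS DATETIME)
    ,cast(cast(RAND(CHECKSUM(NEWID())) * 100000 AS INT) AS DATETIME)
FROM sys.all_objects a
CROSS JOIN sys.all_objects b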
3. Get the records generated in the temp tables from step 1 into another set of temp tables, join them, and update the first one with values from the second one.
/* Get the records generated above in two temp tables.
   Join the two temp tables on id column and update pdate column */
IF OBJECT_ID('tempdb..#table1') IS NOT NULL
    DROP TABLE #table1

IF OBJECT_ID('tempdb..#table2') IS NOT NULL
    DROP TABLE #table2

CREATE TABLE #table1 (
    id INT
    ,pdate DATETIME
    ,sdate DATETIME
    )

CREATE TABLE #table2 (
    id INT
    ,pdate DATETIME
    )

INSERT INTO #table1
SELECT id
    ,pdate
    ,sdate
FROM #basetable

INSERT INTO #table2
SELECT id
    ,pdate
FROM #basetable2

UPDATE #table1
SET pdate = t2.pdate
FROM #table1 t1
LEFT JOIN #table2 t2 ON t1.id = t2.id
Now look at the estimated and actual execution plans. SQL Server estimated a nested loop join but actually used a hash match, which is faster in this situation. SQL Server was able to do this because statistics are available on temporary tables.
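(You can confirm that the auto-created column statistics are really there by querying tempdb's catalog views from the same session; a quick check, not part of the test itself.)

-- Lists the column statistics on #table1; the join column id should have an
-- auto-created _WA_Sys_... statistic after the update above
SELECT s.name AS stats_name
    ,c.name AS column_name
FROM tempdb.sys.stats s
INNER JOIN tempdb.sys.stats_columns sc
    ON sc.[object_id] = s.[object_id]
        AND sc.stats_id = s.stats_id
INNER JOIN tempdb.sys.columns c
    ON c.[object_id] = sc.[object_id]
        AND c.column_id = sc.column_id
WHERE s.[object_id] = OBJECT_ID('tempdb..#table1')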
This does not mean that table variables are not useful in any situation. They are brilliant when the number of rows in them is small (usually fewer than 1,000 rows as a rule of thumb); creating a temp table in that situation would not give much benefit, if any. Table variables are also lighter on the transaction log, since their scope is limited to the current execution context, and they lead to fewer stored procedure recompilations when used inside stored procedures.
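(One mitigation worth knowing if you are stuck with a table variable and the row count is not tiny: adding OPTION (RECOMPILE) to the statement lets the optimizer compile it with the table variable's actual row count instead of the one-row guess, at the cost of a recompile on every execution. Applied to the UPDATE in the step 2 batch, it would look like this.)

-- The UPDATE from step 2 with a statement-level recompile hint
-- (run it in place of the original UPDATE, inside the same batch as the DECLAREs)
UPDATE @table1
SET pdate = t2.pdate
FROM @table1 t1
LEFT JOIN @table2 t2 ON t1.id = t2.id
OPTION (RECOMPILE)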
So next time you decide to use a table variable, have a second thought. Think a bit about the effect on the query execution plan, and definitely don't use one if you are moving a large amount of data.
