Skip to main content

SQL Server vs Azure Data Warehouse vs Netezza vs Redshift

The Great Database Shoot Out

In Jan'17, I teamed up with Marc van der Zon (Data Scientist), to test and compare several database technologies for our organization's future analytics and BI platform. The technologies that made the shortlist were:
  • SQL Server, because it is the organization's existing BI platform.
  • Azure Data Warehouse, because of its high similarity to SQL Server.
  • Amazon's Redshift, because of its attractive cost, scalability and performance.
  • Netezza, because it is anaffordable on-prem appliance that performs well.

Considerations

We were primarily looking for the best bang-for-buck option, so price and performance were viewed as more important than how feature complete any particular option was. That said, what we regarded as important included:
  • A shallow learning curve for BI developers (i.e. no need for expert skills in order to get good performance)
  • Ability to re-use existing code  (i.e. ANSI-92 SQL compatibility)
  • Time-to-solution.   (i.e. does the platform help bypass IT department bureaucracy)

Testing Methodology

Important to modern insurers are rating factors - qualities about the insured items that signify a change in the risk of insuring that item. Our test was based on taking 1.8 billion rating factors, presented in 5th normal form using 5 relational tables, and transform them into a series of 5 flattened tables. In short, it is a massive denormalizing exercise which tends to lend itself to traditional relational database technologies.

Contenders

In the IT department's corner was an on-prem Dell 970 server, packed with 4 Xeon E7 4th generation quad core processors, teamed with 600GB of error correcting RAM, riding an EMC ExtremeIO solid-state disk SAN on dual 8GB/s HBAs.  Including software licenses, the ticket to ride is roughly $551K (ZAR 7,5 million), given that the ExtremeIO is a shared resource.

In the Data Scientist's corner was the other on-prem offering - an IBM Netezza - a 32TB analytics appliance with a solid reputation for performance, and a comparatively reasonable price tag of roughly 88K (ZAR 1.2 million).

Next up was Microsoft's cloud based Azure Data Warehouse - a massively parallel appliance which splits data and workloads across 60 distributions, and is provisioned using DWUs (Data Warehouse units) - which translate into the amount of RAM and query concurrency the appliance will yield.

We tested Azure at several price/performance points:
  • DWU 600    - costing ~$24k / yr   
  • DWU 1200  - costing ~$53k / yr
  • DWU 2000  - costing ~$88k / yr
Last up was Amazon's Redshift - a truly scalable MPP database based on the Prostgress database engine, in which you choose both the size (CPU & RAM) and number of nodes (logical machines) onto you would like your data and workloads distributed, and pay accordingly for the resources provisioned. The configurations we tested were:
  • 1 Node      -   costing ~$9k/yr      (ZAR 122K)
  • 12 Nodes  -   costing ~$12k/yr    (ZAR 161K)
  • 32 nodes   -   costing ~$31K/yr   (ZAR 429K)
For all of the above configurations, we based our pricing on the provisioning of 13TB data, with at least 3129 hours of up time per year.

Results

To graphically represent the results, we created two measures:
  1. Affordability  - being the inverse of cost.
  2. Productivity   - being the inverse of the query performance.
When plotted onto a Cartesian plane, the differences between these technologies really become apparent:



Both Azure, Netezza and Redshift utterly annihilate on-prem SQL Server in terms of value proposition, and all of them are hands-down a better choice for building out future BI platforms.

For me, Amazon's offering really stood out both on the performance-to-price-point curve, as well as in it's features and ease of use. The ability to scale while remaining on-line sounds like a nice-to-have, but our experience with Azure quickly revealed that an inability to scale online is a productivity killer.

Query Time
On Prem

Existing Server 0:59:00
New Server 0:43:40
Netezza 0:04:50
Azure



DWU600 0:43:30
DWU1200 0:15:25
DWU2400 0:09:28
Redshift

1 node 0:36:09
12 nodes 0:04:45
32 nodes 0:01:47


Notes

We encountered several minor annoyances during the set up and testing of Azure Data Warehouse - sometimes it would take upwards of 30 minutes for the database to pause, so subsequently scaling operations (which take the database offline) would take inordinately long to complete. When running below DWU600, some of the queries would terminate with errors or else take upwards of 4 hours to run - suggesting that the Azure offering doesn't handle low resource conditions particularly well. There were also significant inconsistencies in the elapsed time required for some queries to complete when run at different times, possibly the result of either congestion or contention? Azure DW's Polybase also proved to be a comparatively fussy affair, with lots of trial and error necessary to get the 5 CSV files loaded into tables.

The testing of Netezza went smoothly and query times were predictable, as one might expect from a dedicated appliance. There wasn't lots of configuration necessary to get things going, and the loading of data was a reasonably simple affair. 

Amazon's Redshift has a lot going for it. The documentation is 1st class, and the follow-along tutorials can be completed in a matter of minutes each. Some of the features that really set Redhisft apart from competitors were:
  1. It stays online when scaling up or down, and it scales reasonably quickly.
  2. It is an incredibly simple, single step process to load CSV data into Redshift.
  3. It analyses data as it is loaded and intelligently determines the best compression algorithm to use for storing the data.
  4. It allows lookup tables to be present, in their entirety, on every compute node.
    This significantly improves the performance of relational joins.

Finally, there's on-premises SQL Server machine, which the server team and DBAs say has many, many CPU cores, and gigabytes per second of awesome performance, and will be ready next Friday. Probably. After the latest service pack has been installed.

...It also has column store indexes. Did we mention how many CPU cores it has?

Conclusion

With no specific query tuning, Amazon's Redshift demonstrated performance that we were unable to obtain on any of the other platforms at a comparable price point. With respect to get-it-done productivity it stands head and shoulders above of the competition.

Netezza makes for an attractive alternative for use cases that require an on-premise platform, and its important to remember that large volumes of data can be shipped into and out of an on-premise server for free, whereas data charges do apply for both Microsoft and Amazon's cloud offerings.

Seriously, though...   if like me you've been a SQL Server stalwart most of your career, take Redshift for a spin.  It'll knock your socks off!

Comments

Popular posts from this blog

Reading Zip files in PowerQuery / M

Being a fan of PowerBI, I recently looked for a way to read zip files directly into the Data Model, and found this blog which showed a usable technique. Inspired by the possibilities revealed in Ken's solution, but frustrated by slow performance, I set out to learn the M language and write a faster alternative. UnzipContents The result of these efforts is an M function - UnzipContents - that you can paste into any PowerBI / PowerQuery report. It takes the contents of a ZIP file, and returns a list of files contained therein, along with their decompressed data: If you're not sure how to make this function available in your document, simply: Open up PowerQuery (either in Excel or in PowerBI) Create a new Blank Query. Open up the Advanced Editor  (found on the View tab in PowerBI). Copy-Paste the above code into the editor, then close the editor. In the properties window, rename the the function to  UnzipContents Usage Using the function is fairly straight forw

Power Query: Transforming YYYYMM dates (the quick way)

Accountants. Their unit of work seems to be the month, as if individual days don't exists, or don't count somehow. Nowhere is this better seen than in the notion of the accounting period , which all too often follows the form YYYYMM.  Try converting this directly into a date and Power Query starts making excuses faster than a kid with his hand caught in the cookie jar. The quick solution to this is to understand what Power Query's Table.TransformColumns does, and then leverage this knowledge to transform your YYYYMM values into proper date type columns. Table.TransformColumns As it's name suggests, this handy function allows you to convert the contents of a column from one type to another. The basic syntax is: = Table.TransformColumns( #"Your Source Table", { A list of tuples, specifying the columns and what functions to apply to each value} ) Lists {denoted by curly braces} are something you need to get comfortable with if you

Easily Move SQL Tables between Filegroups

Recently during a Data Warehouse project, I had the need to move many tables to a new file group. I didn't like any of the solutions that I found on Google, so decided to create on of my own. The result? MoveTablesToFilegroup Click here for a nifty stored proc allows you to easily move tables, indexes, heaps and even LOB data to different filegroups without breaking a sweat. To get going, copy-paste the code below into Management Studio, and then run it to create the needed stored procedure. Hopefully the arguments are self explanatory, but here are some examples: 1. Move all tables, indexes and heaps, from all schemas into the filegroup named SECONDARY: EXEC dbo.sp_MoveTablesToFileGroup @SchemaFilter = '%', -- chooses schemas using the LIKE operator @TableFilter  = '%', -- chooses tables using the LIKE operator @DataFileGroup = 'SECONDARY', -- The name of the filegroup to move index and in-row data to. @ClusteredIndexes = 1, --