Every company has different needs, and cloud data warehousing, analytical systems, and big data are no exception. You have a lot of options, and the best one for you depends on many factors. Below is a breakdown of Snowflake, Presto, and Hive to help you determine which is the best fit for your business.
First, let’s start with what cloud data warehousing and big data are. A data warehouse is a system that collects data from various sources within an organization and uses that data to provide information for better decision making. There are two types of data warehouses: cloud-based systems and traditional on-premise systems. More companies are switching to cloud-based warehouses because no hardware is needed, setup and scaling are quicker and easier, and complex analytical queries can run faster.
Big data is basically what it sounds like: a huge volume of data, both structured and unstructured. What matters is not so much the amount of information as what companies can do with it. The data can be analyzed to help businesses make better decisions about their next move. Don’t worry if your business is not technical; you can still find and leverage the right technology for your needs.
What is Snowflake
The primary purpose of Snowflake is to take full advantage of the cloud. Some of the main benefits include:
a. Quick decision making through analytics gathered from all your data, which can easily be shared with the people within your business.
b. Very little maintenance, and the ability to scale your analytics.
c. A better customer experience through fast, consistent, and relevant information.
Many other data warehouses can’t match Snowflake’s simplicity, performance, and cost. It is also easy to transfer the data you already have.
You can make the most of your data because Snowflake removes most of the management work. Since it runs in the cloud and there is no infrastructure for you to maintain, there is very little work on your end, and much of it is done automatically. Snowflake can natively load and optimize both structured and semi-structured data.
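To make "semi-structured" a little more concrete, here is a small Python sketch. It is not Snowflake code; it only illustrates the idea of records whose fields vary and of pulling nested values out by path, which is roughly the kind of access a warehouse with native semi-structured support gives you in SQL. The sample records and the get_path and to_rows helpers are made up for illustration.

```python
import json

# Hypothetical event records: same feed, but not every record has the same fields.
RAW_EVENTS = [
    '{"user": {"id": 1, "plan": "pro"}, "action": "login"}',
    '{"user": {"id": 2}, "action": "purchase", "amount": 19.99}',
]


def get_path(record, path, default=None):
    """Walk a dotted path such as 'user.plan' through nested dicts."""
    current = record
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


def to_rows(raw_events):
    """Turn semi-structured JSON into uniform rows, filling gaps with None."""
    rows = []
    for line in raw_events:
        record = json.loads(line)
        rows.append({
            "user_id": get_path(record, "user.id"),
            "plan": get_path(record, "user.plan"),
            "action": get_path(record, "action"),
            "amount": get_path(record, "amount"),
        })
    return rows


if __name__ == "__main__":
    for row in to_rows(RAW_EVENTS):
        print(row)
```

Missing fields simply come back as None, so records with different shapes can still end up in one table-like result.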
One of the most impressive features is how fast it can process information and complete tasks. Its database engine uses advanced optimization, so results tend to come back faster than with many other cloud and traditional data warehouses. Within seconds you can share data with anyone in the company, investors, and even customers with very little effort on your end.
Scalability is very important: you get the resources you want when you want them because Snowflake scales up and down as needed, without any disruption to your business. No matter how much data there is or how many users, Snowflake scales automatically. You can also integrate Snowflake quickly with ready-made or custom applications and tools.
The price point is also very interesting. Instead of paying a large set cost, you only pay for what you use. This usage-based pricing suits companies that would otherwise pay a lot up front to get everything set up and end up paying for capacity they don’t even use.
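A rough way to see the difference is to compare a flat monthly fee against a per-hour rate. The sketch below does that arithmetic in Python. The figures are invented purely for illustration and are not Snowflake’s actual prices.

```python
def fixed_cost(monthly_fee):
    """Fixed pricing: you pay the same amount however much you use."""
    return monthly_fee


def usage_cost(hours_used, rate_per_hour):
    """Usage-based pricing: you pay only for the hours you actually run."""
    return hours_used * rate_per_hour


# Hypothetical figures, chosen only to show the arithmetic.
MONTHLY_FEE = 3000.0   # flat fee for an always-on system
RATE_PER_HOUR = 4.0    # price per compute hour

if __name__ == "__main__":
    for hours in (100, 400, 750, 1000):
        usage = usage_cost(hours, RATE_PER_HOUR)
        cheaper = "usage-based" if usage < fixed_cost(MONTHLY_FEE) else "fixed"
        print(f"{hours} hours: usage-based = {usage:.2f}, "
              f"fixed = {MONTHLY_FEE:.2f}, cheaper or equal: {cheaper}")
```

With these made-up numbers the break-even point is 750 hours a month. Below that, paying for usage costs less than the flat fee; above it, the flat fee wins.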
What is Presto
Presto is an open-source, distributed SQL query engine. In other words, it is an engine that can run interactive analytical queries very fast against small or large data sources. It does not have its own storage system. It works really well with Hadoop, and the Amazon EMR Hadoop distribution comes with Presto packaged. Presto supports HDFS, HBase, Amazon S3, Amazon Redshift, Microsoft SQL Server, and more. Its architecture is somewhat like that of a massively parallel processing (MPP) database management system.
With Presto, you don’t have to move data into a different analytics system to query it. Instead, it runs queries where the data is currently stored, often returning results within seconds. Presto has been designed to support complex queries, sub-queries, distinct counts, percentiles, aggregations, joins (including left and right outer joins), and standard ANSI SQL semantics. Many companies use it, including Airbnb, Nasdaq, and Netflix.
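As a sketch of what such a query can look like, the Python snippet below builds a Presto SQL statement that combines a left outer join, a distinct count, and a percentile. Presto addresses tables as catalog.schema.table, so the two tables can live in different systems. The catalog, schema, table, and column names here are hypothetical; real names depend on how your cluster is configured. The snippet only builds and prints the SQL text and does not connect to anything.

```python
def qualified(catalog, schema, table):
    """Presto addresses a table as catalog.schema.table."""
    return f"{catalog}.{schema}.{table}"


# Hypothetical catalogs and tables, used only for illustration.
PAGE_VIEWS = qualified("hive", "web", "page_views")     # e.g. files in HDFS or S3
CUSTOMERS = qualified("sqlserver", "crm", "customers")  # e.g. an operational database

QUERY = f"""
SELECT c.region,
       count(DISTINCT v.user_id)          AS visitors,
       approx_percentile(v.load_ms, 0.95) AS p95_load_ms
FROM {PAGE_VIEWS} AS v
LEFT JOIN {CUSTOMERS} AS c
       ON v.user_id = c.user_id
GROUP BY c.region
ORDER BY visitors DESC
"""

if __name__ == "__main__":
    print(QUERY)
```

The data stays where it is: the page views are read from one source and the customer records from another, and nothing has to be copied into a separate analytics system first.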
Once a query is submitted, the request is processed in stages spread across worker nodes. Data is passed across the network between the stages rather than being written out to disk in between, which avoids excess I/O overhead. Adding worker nodes allows for more parallelism and quicker processing.
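The staged, parallel processing described above can be imitated in a few lines of Python. This is only a toy model, not how Presto is actually implemented: threads stand in for worker nodes, each "worker" reduces its share of the data to a small partial result, and a final stage merges those partial results. The function names and sample values are made up.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_aggregate(chunk):
    """Stage 1, run on each worker: reduce a chunk to a small (sum, count) pair."""
    return sum(chunk), len(chunk)


def final_aggregate(partials):
    """Stage 2: merge the partial results from the workers into one average."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None


def distributed_average(values, num_workers):
    """Split the data, aggregate each split in parallel, then merge the results."""
    chunks = [values[i::num_workers] for i in range(num_workers)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(partial_aggregate, chunks))
    return final_aggregate(partials)


if __name__ == "__main__":
    order_totals = [20.0, 35.5, 12.25, 80.0, 42.25, 10.0]  # made-up sample data
    print(distributed_average(order_totals, num_workers=3))
```

Only the small (sum, count) pairs move between the two stages, not the raw rows, and adding workers means each one has less data to get through.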
You can easily deploy Presto in the cloud, and it is a great option for cloud workloads. The cloud provides reliability, scalability, economies of scale, availability, and performance, which makes Presto an incredibly useful tool there. A cluster can be launched within minutes. With a managed service such as Amazon EMR, there is also little need to handle node provisioning, configuration, cluster setup, and tuning yourself.
One query can pull data from various sources, which lets you see analytics from various departments throughout the organization. Presto sits pretty much in the middle when it comes to speed and price: it is fast but not overly expensive. Usually you either have an expensive tool that provides very quick response times or a cheap or free one with very slow response times. Choosing Presto gives you a nice middle ground.
What is Apache Hive
Apache Hive provides the ability to run advanced jobs on HDFS and MapReduce. It’s an ETL and data warehousing tool that is best used with HDFS. SQL developers can write queries in Hive Query Language (HQL), which is very similar to standard SQL. Hive makes it a lot easier to analyze massive amounts of data, do data encapsulation, and run ad-hoc queries.
The downside of HQL is that it understands a limited set of commands. That said, it is still a great tool to use. You can run queries from various applications through Java Database Connectivity (JDBC) or Open Database Connectivity (ODBC). Also, queries tend to take minutes instead of seconds because Hive is based on Hadoop, so you would not want to use it if you need really fast responses. You also don’t want to use it if you have a lot of write operations, since Hive is read-based.
With that said, if you are familiar with IBM, it offers Db2 Big SQL, which makes it easier, more secure, and faster to access Hive.
Hive thrives at managing and querying structured data in tables, because Hive creates databases and tables and loads data into them. You may be wondering what the difference is between HQL and SQL. SQL executes on a traditional database, whereas Hive executes on Hadoop’s infrastructure. Depending on what your business is currently using, this may still be a good option for you.
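To give a feel for what running on Hadoop’s infrastructure means, here is a toy Python version of the map, shuffle, and reduce steps that a simple grouped count goes through in the MapReduce model. It is a simplified illustration and not Hive’s real execution engine; the employees data and the function names are made up. It computes roughly what a query like SELECT department, count(*) FROM employees GROUP BY department would return.

```python
from collections import defaultdict


def map_phase(rows):
    """Map: emit a (key, 1) pair for every input row."""
    for row in rows:
        yield row["department"], 1


def shuffle_phase(pairs):
    """Shuffle: group all values that share a key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by_department(rows):
    """Run the three phases in order, as one small 'job'."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    # Made-up rows standing in for a Hive table.
    employees = [
        {"name": "Ana", "department": "sales"},
        {"name": "Ben", "department": "engineering"},
        {"name": "Cy", "department": "sales"},
    ]
    print(count_by_department(employees))
```

On a real cluster each phase runs as a batch step across many machines, which helps explain why Hive queries tend to take minutes rather than seconds.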
There are three important factors to remember when it comes to the data your company needs: Volume, Velocity, and Variety. A lot of data needs to be collected from various sources, which can include sensor information, social media, and business transactions that occur throughout all the different departments. Those data streams then need to be dealt with at very high speed, often in real time. The data can arrive in various formats, including documents, emails, financial transactions, audio, and video. Your data warehouse has to be able to handle it all and, in the end, provide you with useful information.
Your company has many needs, and being able to gain high-quality insights into your business fast is very important. There is always a cloud data warehousing solution out there for you. If none of these seems to be the right fit, you may want to check out part 2 of this blog (coming soon), which talks more about Hadoop, Cassandra, and HBase. Comparing all six will help you determine the best course of action.