How to Reduce the cost of Google BigQuery Data Processing?

“Mysql Server has gone away” OR “Lost Connection to Mysql Server during query”- These are some of the most common errors a developer of a DBA looks at on his/her computer screen, whenever data in a MySQL table exceeds beyond a particular limit.

It might be interesting to know about Google BigQuery Cost Optimization.

This is due to the limitation of MySQL to process a very large amount of data if the query is not written in optimized form. Or if the query has been optimized, still it may take very much I/O to scan data if data to be processed is very large. Normally in a company like a startup, where much of focus is not on optimizing things, its very difficult to make a query on some of the MySQL tables like “tb_transactions”, “tb_users” etc.

A high stream of real-time data is being pushed in these tables and that too very frequently. As a result, data becomes very large and Mysql shows sign of break down if in real time large data is being queried from MySQL tables.

In that scenario, different companies follow different strategies, depending on bandwidth and expertise they are having

Some opt for warehousing the data in a form, which is recently being recognized as “Data Lake”.
Some try to distribute the data in the form of “Mysql Sharding”.
And some tries to opt for Google Big Data Query service, providing the same interface and data query language as SQL.

BigQuery is very reliable data tool by Google platform aimed at the interactive analysis of massively large datasets. Every Mysql query, giving pain to Mysql Server is executed very smoothly in Google BigQuery. Though it may take some finite time depending on the data being processed, Google makes sure that no query gets stuck in middle and every query is returning the result it is expected to give.

As there is no free lunch in this world, so this convenience of processing and analyzing large Mysql data tables brings some cost associated with it. Google charges from its users on the basis of data processed per query (with a min bar of 200 MB per query). As a result, companies with huge data sets have to pay a large sum of dollars at every month end.

And the irony is, this cost gets increased with each extra bit of data being stored on BigQuery (each passing day) and with each extra bit of data being queried. But these large sums of money can be avoided with some clever query skills and with a little bit of awareness.

Below are some of the best methods which can help a company or a developer to reduce the BigQuery cost to a much greater extent, irrespective of how much data is been increasing daily.

Implement these to achieve BigQuery Cost Optimization

Start using TABLE_DATE_RANGE() function more frequently.

Google provided this function for cost reduction purpose. Start storing the data in smaller tables per day. In place of storing whole transactional data in just one huge “tb_transactions” table, store in smaller tables like one “tb_transactions” table per day. eg “tb_transactions20170815” (for data of Aug 15, 2017). As a result, storing data in smaller tables will help to process only that much amount of data as required by the end user.

We just need to pass the date range explicitly in TABLE_DATE_RANGE(). Hence query execution cost will decrease per query. For info about this function and its syntax visit. Google BigQuery Reference

Avoid using “Select * FROM table_name”. In place of this, use “Select col_name_1, col_name_2 from table_name”.

It means select only required columns from a table in place of selecting all the columns or ‘*’. For this point, first, we have to understand how Google charges money. Google charges on the basis of total data (all rows of selected columns from a table) being processed as compared to only processed rows.

Each column from a table we are fetching is multiplied by its data type size and count of all the rows present in that table, to give the actual data for which we will be charged. e.g. let’s say we have an “id” col having data type as “int” (4 bytes) and that table has 1000 rows.

So querying even ‘select “id” from table_name limit 1’ will process 4*100 bytes data in place of expected 4*1 bytes data. This is because for BigQuery selecting a column means whole data (for that column) from the table to be processed and hence charged.

Using these two methods only, can help a company to bring down and optimize its BiqQuery Analytics costs to a much greater extent.