Big DataBig Data Processing

How to Reduce the cost of Google Big Query Data Processing?4 min read


“Mysql Server has gone away” OR “Lost Connection to Mysql Server during query”- These are some of the most common errors a developer of a DBA looks at on his/her computer screen, whenever data in a MySQL table exceeds beyond a particular limit. This is due to the limitation of MySQL to process a very large amount of data if the query is not written in optimized form. Or if the query has been optimized, still it may take very much I/O to scan data if data to be processed is very large. Normally in a company like a startup, where much of focus is not on optimizing things, its very difficult to make a query on some of the MySQL tables like “tb_transactions”, “tb_users” etc. A high stream of real-time data is being pushed in these tables and that too very frequently. As a result, data becomes very large and Mysql shows sign of break down if in real time large data is being queried from MySQL tables.

In that scenario, different companies follow different strategies, depending on bandwidth and expertise they are having

  • Some opt for warehousing the data in a form, which is recently being recognized as “Data Lake”.
  • Some try to distribute the data in the form of “Mysql Sharding”.
  • And some tries to opt for Google Big Query service, providing the same interface and data query language as SQL.

Big Query is very reliable data tool by Google aimed at interactive analysis of massively large datasets. Every Mysql query, giving pain to Mysql Server is executed very smoothly in Google Big Query. Though it may take some finite time depending on the data being processed, Google makes sure that no query gets stuck in middle and every query is returning the result it is expected to give.

As there is no free lunch in this world, so this convenience of processing and analyzing large Mysql data tables brings some cost associated with it. Google charges from its users on the basis of data processed per query (with a min bar of 200 MB per query). As a result, companies with huge data sets have to pay a large sum of dollars at every month end. And the irony is, this cost gets increased with each extra bit of data being stored on Big Query (each passing day) and with each extra bit of data being queried. But these large sums of money can be avoided with some clever query skills and with a little bit of awareness. Below are some of the methods which can help a company or a developer to reduce the big query cost to a much greater extent, irrespective of how much data is been increasing daily.

Start using TABLE_DATE_RANGE() function more frequently.

Google provided this function for cost reduction purpose. Start storing the data in smaller tables per day. In place of storing whole transactional data in just one huge “tb_transactions” table, store in smaller tables like one “tb_transactions” table per day. eg “tb_transactions20170815” (for data of Aug 15, 2017). As a result, storing data in smaller tables will help to process only that much amount of data as required by the end user. We just need to pass the date range explicitly in TABLE_DATE_RANGE(). Hence query execution cost will decrease per query. For info about this function and its syntax visit. Google Big Query Reference

Avoid using “Select * FROM table_name”. In place of this, use “Select col_name_1, col_name_2 from table_name”.

It means select only required columns from a table in place of selecting all the columns or ‘*’. For this point, first, we have to understand how Google charges money. Google charges on the basis of total data (all rows of selected columns from a table) being processed as compared to only processed rows. Each column from a table we are fetching is multiplied by its data type size and count of all the rows present in that table, to give the actual data for which we will be charged. e.g. let’s say we have an “id” col having data type as “int” (4 bytes) and that table has 1000 rows. So querying even ‘select “id” from table_name limit 1’ will process 4*100 bytes data in place of expected 4*1 bytes data. This is because for big query selecting a column means whole data (for that column) from the table to be processed and hence charged.

Using these two methods only , can help a company to bring down its Biq Query Analytics cost to a much greater extent.