Work with Large Datasets

Automating data table work in Catalytic is exciting. As you scale up to larger and larger data sets, a few strategies and building techniques help keep your processes smooth and efficient.

Introduction

Large data sets can mean big insights, but as data grows, processing time increases and data tables approach our system limits.

If you’re working with a large data set now, check the Catalytic System Limits page to understand any specific limitations. If your data approaches or exceeds a current system limit, the recommendations in this guide can help.

Architect a Workflow Around System Limits

With Catalytic, you can store data in an Excel file, data table, or CSV. To work with large datasets, consider how you architect the process around these options. For example, data tables have a current limit of 100,000 rows: if you’re building a Workflow and you know you’re going to store more than 100,000 rows of data, architect your primary data as a CSV, not a data table!
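Outside of Catalytic, the routing decision looks something like the minimal Python sketch below. The 100,000-row threshold comes from the data table limit above; the function and return values are purely illustrative, not Catalytic APIs.

```python
# Hypothetical sketch: decide where primary data should live based on row count.
# The 100,000-row threshold mirrors the data table limit described above.
DATA_TABLE_ROW_LIMIT = 100_000

def choose_storage(row_count: int) -> str:
    """Return the storage type to architect around for this data set."""
    if row_count > DATA_TABLE_ROW_LIMIT:
        return "csv"         # primary storage for large data sets
    return "data_table"      # fine for smaller data sets

print(choose_storage(250_000))  # "csv"
```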

There are various system limits enforced by Catalytic that ensure a high-quality experience for you and your team. These limits help prevent interruptions to your processes and automation.

One way to build within system limits is to consider the architecture of your process: how data is used and compared, how granular it is, and how it is stored and managed long term.

Determine the “Point of Least Granularity”

When you first design a Workflow, you may feel like tossing all the data you have into it. But to future-proof your process and create the most value, build a Workflow with data granularity in mind.

In the example below, each data table shows a unique view of the same data. Imagine a process that creates yearly reports: the Point of Least Granularity is probably the table on the right with the yearly data—it’s the point where you have enough detail to achieve your business outcome, and no more than necessary.

[Image: the same data set shown at different levels of granularity, with a yearly summary table on the right]

Start thinking about data granularity by asking what business outcome you’re looking for. With the example above, consider what data you actually need to meet that objective: the insight and analysis you’re trying to get to can likely be represented by a more specific, filtered data set.
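As a rough illustration of rolling data up to its Point of Least Granularity, here is a short pandas sketch (not a Catalytic action) that collapses dated records into one row per year. The column names "date" and "amount" are assumptions for the example.

```python
# Minimal pandas sketch: roll detailed records up to yearly totals,
# the level of granularity a yearly report actually needs.
import pandas as pd

daily = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-02", "2023-06-15", "2024-03-01"]),
    "amount": [120.0, 80.0, 95.0],
})

yearly = (
    daily.assign(year=daily["date"].dt.year)
         .groupby("year", as_index=False)["amount"]
         .sum()
)
print(yearly)  # one row per year: enough detail for a yearly report, and no more
```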

Roll up data into a dashboard or report

Similarly, look for ways to summarize data: you may be able to summarize 100 million rows of data in a 10-column, 10-row data table report. Data that’s summarized in a dashboard or report has two big benefits:

  1. Rolled up or summarized data is easier to store.
  2. Summarized data is at a human level—someone can look at it and make a decision based on it.

A data set with 100,000 rows doesn’t mean the report should be 100,000 rows; create a summary dashboard out of key statistics from the data. Calculate the median, mode, range, variance, and so on, then collect and compare this much smaller data.
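For a sense of what such a roll-up looks like in code, here is a hedged pandas sketch that reduces a large file to a one-row summary. The file name and the "value" column are assumptions made for the example.

```python
# Illustrative sketch: reduce a large dataset to a handful of summary
# statistics suitable for a dashboard or report.
import pandas as pd

df = pd.read_csv("large_dataset.csv")   # hypothetical source file

summary = pd.DataFrame([{
    "rows": len(df),
    "median": df["value"].median(),
    "mode": df["value"].mode().iloc[0],
    "range": df["value"].max() - df["value"].min(),
    "variance": df["value"].var(),
}])

# One row of key statistics instead of 100,000 rows of raw data.
summary.to_csv("summary_report.csv", index=False)
```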

Compare week to week, not day to day

If you need a weekly summary of a 10,000-row data set that grows by 1,000 rows a week, there are two ways to gather the data:

  1. Analyze the entire data set every week. The table keeps growing, and after a year it will have added 52,000 rows, reaching 62,000 rows, an increase of over 500%.

  2. Compare the changes in the data set week to week. Each week, only the new 1,000 rows are compared to the previous week’s 1,000 rows.

Comparing these smaller slices makes processing faster because you only look at the new data each week instead of the entire data set.
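The incremental approach can be sketched in Python as follows. This is a conceptual example outside Catalytic; the file name, the "value" column, and the checkpoint-counting approach are assumptions.

```python
# Sketch of week-over-week processing: only rows added since the last run
# are summarized, instead of re-reading the whole growing data set.
import pandas as pd

def weekly_summary(path: str, rows_already_processed: int) -> tuple[float, int]:
    df = pd.read_csv(path)
    new_rows = df.iloc[rows_already_processed:]      # roughly 1,000 new rows per week
    return new_rows["value"].mean(), len(df)         # weekly summary + new checkpoint

avg, checkpoint = weekly_summary("growing_dataset.csv", rows_already_processed=10_000)
```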

Work with CSVs and Tables

Our platform supports CSV files up to 1 GB, or 250 million cells. By comparison, data tables support 5 million cells, and Excel files are supported up to 1 MB.

[Image: comparison of storage limits for CSV files, data tables, and Excel files]

If you set up your process to use CSVs for large data sets, you get the most storage compared to Excel files or data tables. CSV files are also the fastest to process: what takes 5 minutes with a data table or Excel file could take seconds with a CSV.
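A quick back-of-the-envelope check can tell you which format a data set fits in. The sketch below is plain Python arithmetic using the cell limits quoted above; the function is illustrative, not part of Catalytic.

```python
# Rough check: does a data set of a given shape fit within the cell limits above?
CSV_CELL_LIMIT = 250_000_000    # CSV files: up to 250 million cells
TABLE_CELL_LIMIT = 5_000_000    # data tables: up to 5 million cells

def fits(rows: int, columns: int) -> dict:
    cells = rows * columns
    return {
        "cells": cells,
        "fits_data_table": cells <= TABLE_CELL_LIMIT,
        "fits_csv": cells <= CSV_CELL_LIMIT,
    }

print(fits(rows=100_000, columns=40))    # 4,000,000 cells: either format works
print(fits(rows=2_000_000, columns=50))  # 100,000,000 cells: CSV only
```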

There are also unique CSV actions that are perfect for large data sets.

Transfer between a CSV and data tables

Catalytic has plenty of actions that integrate CSVs and data tables. For example, if you configure a process to use data tables for smaller sets of data, you can save everything to a master CSV file for long-term, large-scale storage. Adding data from a table to a CSV takes two steps (a rough sketch follows the list):

  1. Use the Tables: Convert data table to text action to convert a data table into a CSV.
  2. Use the CSV: Update file with another file action to add the new CSV to the master CSV file.
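For intuition, here is a rough Python equivalent of those two steps. It is not how the Catalytic actions are implemented; the file names are assumptions, and the sketch assumes the master CSV already exists with a header row.

```python
# Conceptual equivalent of appending a data table export to a master CSV.
import pandas as pd

table_export = pd.read_csv("weekly_table_export.csv")  # step 1: the table, converted to CSV

# Step 2: append the new rows to the master file without rewriting it.
# header=False assumes master.csv already has a matching header row.
table_export.to_csv("master.csv", mode="a", header=False, index=False)
```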

Split up large data tables

Data tables typically grow in size over time, and it takes planning to split them up. Use the Tables: Create tables for each unique value action to split a table by the unique values in a column; it creates a new table containing all the rows for each unique value.
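Conceptually, the split works like the short pandas sketch below (not the action itself). The "region" column and file names are assumptions for illustration.

```python
# Conceptual sketch of splitting one large table into one smaller table
# per unique value in a chosen column.
import pandas as pd

df = pd.read_csv("orders.csv")

for region, rows in df.groupby("region"):
    rows.to_csv(f"orders_{region}.csv", index=False)  # one smaller table per region
```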

Avoid Batch Processing for Data Calculations

A batch process means starting multiple Workflow instances or actions as a batch based on a master list. A common example of a batch process is using the Start Workflow for Each Row action to send out thousands of emails all at once based on the rows of a table or Excel file.

Batch processes are a tool for the right situation. For processing data, like summing, multiplying, or calculating averages, batch processes are inefficient compared to more powerful single actions.

Consider actions like CSV: Summarize rows with formulas, which can count the number of rows in a CSV and calculate an average, create fields from different row values, or create arrays out of the data.

Similarly, actions like Tables: Apply formulas to columns and CSV: Apply formulas to columns can change thousands of rows in a single action, where batch processing would require thousands of instances for the same result.
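The difference is easiest to see in code. The hedged sketch below contrasts row-by-row (batch-style) thinking with a single column formula applied to every row at once; the file and column names are assumptions, and this is pandas rather than the Catalytic actions themselves.

```python
# Why one column formula beats thousands of per-row steps:
# a single vectorized operation updates every row at once.
import pandas as pd

df = pd.read_csv("line_items.csv")

# Batch-style thinking: one calculation per row (thousands of separate steps).
# totals = [row.quantity * row.unit_price for row in df.itertuples()]

# Column-formula thinking: one operation over the whole column.
df["total"] = df["quantity"] * df["unit_price"]
df.to_csv("line_items_with_totals.csv", index=False)
```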
