Why we're choosing to move from In-Memory Analytics to Out-of-Core
At TypeSift we pride ourselves on being a true engineering-first company. That's why we're always striving to make the core of our product, our Analytics Query Engine, as fast as it can possibly be. At the time of its initial design we chose an In-Memory approach, which is considered the standard in the industry. But as we continued to evolve alongside the needs of our clients, we realized there are strong motivations to move to an out-of-core model. That is, to run aggregations and analyses on data that resides on disk rather than in RAM. The engineering challenge is keeping the system "interactive", that is, being able to aggregate hundreds of millions of rows in under a second (the shoot-for-the-moon metric is 1Bn rows/second). But beyond raw speed, an out-of-core system offers several key advantages, which we outline below.
Read on to learn why we're researching this change and how this fits into a world of primarily in-memory analytics.
In-Memory Analytics has been the de facto standard for nearly 20 years. The premise is that keeping data in RAM and processing it there is significantly faster than pulling data off of disk (especially spinning disks). And this is definitely still the case. In-memory analytics rose to prominence in the early 2000s because the cost of RAM started to decline significantly. This drop in price gave business intelligence vendors access to the incredible speed of RAM, which in turn made Star Schemas a viable alternative to Cubes.
Although much slower than Cubes (which have constant lookup time on aggregates), Star Schemas are significantly more flexible. If a business user wanted to change a group-by, they could do it with a single click (or query) rather than waiting days for an engineer to build a Cube that includes the specific aggregation and intersection of dimensions they wanted.
Until the 2000s, the big limitation of Star Schemas in supporting interactive analytical workloads (ie. waiting less than a second to retrieve a result) was speed. Innovations in storage layout (specifically Columnar storage), in conjunction with various compression schemes, resulted in huge leaps forward, but these systems were still primarily designed for engineers. Interactive analytical workloads for the business (who didn't want to write SQL queries) were largely out of reach. Storing data entirely in RAM was a game changer. It allowed more business-focused software with more user-friendly interfaces to take advantage of cheap hardware (ie. RAM).
That said, shifting to RAM introduced a number of engineering hurdles, and the compromises they force have been accepted as standard for the past 20 years.
The cost of RAM has come down significantly over the years, but so too has the cost of disk. Even with advancements in storage technology that make disks faster, and therefore more expensive, the overall cost of storing something on disk is far lower than in RAM.
Something many folks don't realize about in-memory systems is that you can't easily decouple RAM and CPU in the way that you can decouple storage and CPU. Because of the way computers are architected, there's a limit to how much RAM a single CPU can access. It comes down to the number of DIMM slots a CPU has (which is where your RAM sticks fit in) and how much capacity each stick has. Once you fill a given CPU's slots (often only 2-4 of them), you must add more CPUs in order to add more RAM. In the analytics game we refer to this as the "CPU Tax", because you may be adding more CPU power than you need.
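A quick back-of-envelope sketch makes the CPU Tax concrete. All the numbers below are hypothetical (the per-socket RAM ceiling and the compute requirement are assumptions for illustration, not real hardware specs):

```python
# Back-of-envelope illustration of the "CPU Tax" described above.
# Assumed numbers: each CPU socket can address at most 512 GB of RAM,
# and the query workload only needs 2 sockets worth of compute.
import math

ram_needed_gb = 4096          # dataset we want resident in memory (4 TB)
max_ram_per_socket_gb = 512   # per-socket addressing limit (assumed)
sockets_needed_for_compute = 2

# To hold 4 TB in RAM we are forced to buy 8 sockets...
sockets_needed_for_ram = math.ceil(ram_needed_gb / max_ram_per_socket_gb)

# ...so 6 of those sockets exist only to carry DIMM slots.
cpu_tax = sockets_needed_for_ram - sockets_needed_for_compute

print(sockets_needed_for_ram)  # 8
print(cpu_tax)                 # 6
```

Under these assumed numbers, three quarters of the CPUs are purchased purely to host RAM, which is the tax the paragraph above describes.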
You might argue that if you have more data you need more cores to process it. But that assumes you want to aggregate all that data all the time, which is rarely the case. A classic example is historical data. Though infrequently, users may want to access data from a few years ago. Keeping it readily accessible means having more RAM, and therefore paying for more CPUs, even though you rarely touch it - an ongoing CPU tax on data you seldom access. The only other option is to play a game of Musical Chairs with your RAM, evicting some data in favour of other data in order to run your analysis, and thereby disrupting other people's workflows.
One frustration we've all experienced is a restart of the analytics server. Because RAM is ephemeral, your data is wiped and you have to wait for it to be restored. Some systems have clever mechanisms like "incremental loading" or a proprietary in-memory layout that is fast to load, but these only help to ameliorate the problem rather than eliminate it entirely. It would be wonderful to have the data immediately available for access in the event of a server restart.
Given the above frustrations with using RAM as your main storage medium for analytical workloads, what's the alternative? Well, if you can't keep your data in RAM, the only other place to put it is on disk, and to process it there - so-called "out-of-core" processing, the "core" in this case referring to RAM.
The irony is that before the early 2000s, the thought of sticking everything in RAM was unheard of - it was just too expensive and there wasn't enough of it. Nowadays putting everything in RAM is a foregone conclusion. It's designing for on-disk workloads that has become the challenge, because you still need to figure out how to keep things performant for the user.
Most of what you might consider "analytical databases" (Vertica, C-Store, MonetDB, Snowflake, Greenplum) operate in an out-of-core fashion because there's just too much data to fit into memory. Most of what you might consider "business intelligence" tools (for dashboarding and reporting) are in-memory. But why out-of-core for one and in-memory for another? There seems to be an unspoken rule that divides the two.
Traditionally, out-of-core systems are meant for engineers. You're taking business requirements away from a meeting, translating them into SQL stored procedures, spinning compute up and down in the cloud (if that's an option for you), and then coming back to the business with the results. As an engineer or Business Intelligence analyst, you typically have free rein over all the data, whereas the average business user does not.
However, if you want to visualize that data, apply some governance and restrict access to the underlying data, then you need a different server, and different software, where you can pull the results of that stored procedure into RAM and display it in a dashboard for slicing and dicing.
As I've said, this has been the standard architecture for almost 20 years. And it has worked wonderfully for that time - in most cases you don't need the full data set in-memory, just a slice of what's available. The issue is that, in 2021, that slice has been getting bigger and bigger. Users want access to more data, both current and historical, and far more data is being generated overall in 2021 than in 2001. There have been recent innovations around MPP (massively parallel processing) databases designed for the cloud, but again those systems are for engineers, not for direct use by the business. As far as interactive analytics on larger data sets goes, there hasn't been much available out-of-the-box, and so customers' analytics servers continue to grow in size alongside their data.
So wouldn't it be nice if you could have the best of both worlds? A system that lets you store a tremendous amount of data but also support interactive analytical workloads (ie. slice, dice and visualize) all in one system? And do this in a way that doesn't require us to add more cores than is necessary?
This is exactly the problem we're tackling.
TypeSift is broken into four modules - Apps (the ability to convert Excel Spreadsheets into custom applications without writing any code), Analytics (Business Intelligence, reporting and data warehousing), FP&A (Planning & Forecasting) and Syndication (pushing aggregate results to 3rd parties).
It's the Analytics piece that is designed as an in-memory system. We chose this because it was the best practice at the time. Like I said, if we were building a system in 1996, we wouldn't have even thought of storing things in-memory. But by 2016 trying to build an analytics system that operated on disk was only making your life more difficult than it needed to be.
But as our business evolved, and as we listened to our customers, we realized that there were subtle needs in the overall workflow that simply could not be met by an in-memory system. These were the same subtle issues we experienced in our business intelligence day jobs prior to founding the company, and in the years since, they still have not been solved by the majority of business intelligence vendors. I've outlined those subtle issues below: the common business intelligence pitfalls that nearly every company runs into. We realized that by moving to an out-of-core design we'd be able to achieve the following, which, to our knowledge, is not possible in the current state-of-the-art analytics systems.
Adding comments in a discussion, tagging users and annotating data points on a graph are part and parcel of a collaboration workflow. Except that this is not as straightforward to achieve in in-memory systems.
Something most folks don't realize about BI systems is that, depending on how the architecture was designed, it is entirely possible for historical data to change. Going back and rewriting data from the past isn't encouraged, but sometimes it's necessary to correct errors. Sometimes you have to make decisions before the data has rolled in, and when it does, you need to understand the original context under which you made that decision.
Although the golden rule of Business Intelligence is "Don't delete anything", the truth is that's more of a North Star guiding principle that companies strive towards, rather than a prerequisite.
This throws a wrench in the whole idea of "collaboration". How can I collaborate, comment and make decisions based on some data if that data is going to change later on? If I were to annotate a data point on a graph and then that data point moves, then what (quite literally) was the point?
This is the issue with in-memory analytics - because the storage (RAM) is ephemeral, so too are your decisions, comments and overall collaboration. There are only two ways to get around this:

1. Keep every historical version of the data loaded in RAM, so that comments and annotations always point at the data they were made against.

2. Keep snapshots of each version on disk, and pull the relevant snapshot back into memory whenever someone needs to revisit it.
In looking at the two options, only #2 seems to make any sense. The bottleneck of course being the amount of RAM you have and the time it takes to pull snapshots off of disk and into memory.
But out-of-core systems are designed to operate on data in-situ (in place) without pulling anything into memory. In fact, this is how modern operating systems work, and how things were done for decades prior to the In-Memory movement. So if we could keep full backups on disk, apply aggregations on that data in-place, and collaborate from there, then your comments, tags and annotations on any given data point will always be valid.
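As a sketch of what in-situ processing looks like, the snippet below memory-maps a file of integers and aggregates it without reading the whole file into RAM up front; the operating system pages bytes in as they're touched. The file name and layout (a flat array of little-endian 64-bit integers) are made up for illustration:

```python
# Sketch: aggregate values stored on disk "in place" via mmap, so the
# OS pages data in as needed instead of us loading the file into RAM.
# The file layout (a flat array of little-endian int64s) is assumed.
import mmap
import os
import struct

path = "values.bin"

# Write a small demo file: the numbers 0..999 as int64s.
with open(path, "wb") as f:
    for i in range(1000):
        f.write(struct.pack("<q", i))

total = 0
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # iter_unpack walks the mapped bytes without materializing
        # a second full copy of the file in memory.
        for (value,) in struct.iter_unpack("<q", mm):
            total += value

print(total)  # 499500 == sum(range(1000))
os.remove(path)
```

The same mapping trick is also why data is "immediately available" after a restart: nothing needs to be bulk-loaded before the first query can run.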
Server restarts are a reality. We would love to believe that they never happen, but it's a fact of life. Sometimes you just need to "turn it off and on again". The restart itself may only take a few minutes, but loading all the data back in is the real pain. It's not uncommon for businesses to wait 45 minutes, an hour, or even 3 hours for the data to be reloaded into RAM. But with a system driven primarily off of disk, this is a non-issue. The data is ready to go as soon as the server comes back up, regardless of the size of your data.
Here's another issue - what happens if you need to refresh all your reports with new data in the middle of the day? With an in-memory system, this means that various dashboards (and sometimes the entire system) will have to go down while "data is reloading". This isn't as big a deal for small businesses.
But the challenge for companies as they grow is that if one department wants to reload their data, they may have to take the system down for everyone else. The only way to avoid this is to build multiple versions of the same dashboard for every department (or even every team) so that only one dashboard is down at a time. But with traditional BI tools this usually means more RAM, more server costs, more document CAL (Client Access License) costs, longer ETL times, and overall more administration overhead for the BI team.
An out-of-core system doesn't have this problem. You could refresh the entire data warehouse in the middle of the day without impacting anyone's analytical workloads. The reason why is that you just keep everyone pointed at the most recent, valid data while new data is being processed. Once the new data is available, give users the option to now point their reports at the new data. We can do this because disk is so much cheaper, and therefore so much more of it is available than RAM.
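One way to implement "keep everyone pointed at the most recent valid data" is an atomic pointer swap on disk. The sketch below uses a symlink named `current` that readers always follow, while a new snapshot directory is built off to the side; the names and layout are our own illustrative convention, not a description of any particular product's internals:

```python
# Sketch of constant-uptime refresh via an atomic "current" pointer.
# Directory names and file layout are illustrative assumptions.
# Assumes a POSIX filesystem, where rename(2) is atomic.
import os
import tempfile

root = tempfile.mkdtemp()

def publish_snapshot(name, payload):
    """Build a new snapshot off to the side, then atomically repoint
    the `current` symlink at it. Readers never see a half-built state."""
    snap_dir = os.path.join(root, name)
    os.makedirs(snap_dir)
    with open(os.path.join(snap_dir, "data.txt"), "w") as f:
        f.write(payload)
    # Create the new link under a temp name, then rename it over
    # `current` - the rename is the atomic swap.
    tmp_link = os.path.join(root, "current.tmp")
    os.symlink(snap_dir, tmp_link)
    os.replace(tmp_link, os.path.join(root, "current"))

def read_current():
    with open(os.path.join(root, "current", "data.txt")) as f:
        return f.read()

publish_snapshot("v1", "monday's load")
publish_snapshot("v2", "midday refresh")  # v1 readers were never interrupted
print(read_current())  # midday refresh
```

Because old snapshot directories are never modified in place, a midday refresh can run for as long as it needs without taking any dashboard down.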
If users have the option to point their reports at the most recent data, then why not point it at any version in the past? This is the idea behind having multiple-versions of your data warehouse. It's a little bit like the "Time Machine" feature (or backup/restore concept) from your Macbook. Users can run their reports "as of" some specific date and quickly see the difference.
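Resolving an "as of" date to the right snapshot is then just a lookup over the published versions. The sketch below assumes a sorted list of publication dates (the dates themselves are made up):

```python
# Sketch: resolve an "as of" date to a data warehouse version.
# The list of publication dates is an illustrative assumption,
# kept in sorted order.
import bisect
from datetime import date

published = [date(2021, 1, 1), date(2021, 4, 1), date(2021, 7, 1)]

def version_as_of(asof):
    """Return the latest snapshot published on or before `asof`."""
    i = bisect.bisect_right(published, asof)
    if i == 0:
        raise LookupError("no snapshot exists that early")
    return published[i - 1]

print(version_as_of(date(2021, 5, 15)))  # 2021-04-01
```

Running a report "as of" May 15th would therefore read from the April 1st snapshot, exactly as it existed when decisions were made against it.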
During my time at Arc'teryx we would often say that the most recent data is the best data. If we couldn't have that, give users (1-day) old data. Failing that, it's better to have no data than wrong data. Obviously that last option is a non-starter, because you can't have people sitting around with nothing to do while your BI team is troubleshooting a bug in the ETL. But even providing 1-day old data could be a pain to achieve.
The idea of a graceful failover is that if some ETL reload goes awry, you still have the most recent valid version of the data warehouse that users can work on. This derives naturally from Constant Uptime and Multi-Version Data Warehousing. It gives your engineering team time and space to track down issues without completely derailing the business's day.
The one thing we haven't addressed about out-of-core systems is performance. After all, the whole reason for moving to In-Memory was to get speed. If we're moving out-of-memory aren't we giving that up?
Well, truthfully, yes. A little. There's always going to be a balance between how much data you can analyze and how quickly you can analyze it. And to be clear, we're not saying that every workload that was once in-memory should move onto disk, just as you can't say that every on-disk workload can move into RAM.
But 2021 looks very different from 2001. Storage technology has had 20 years to improve. Remember, there was a time when almost every laptop had a spinning disk in it, with a physical "read head" that moved around to find your data. That's basically unheard of now with SSDs (Solid State Drives), which have no moving parts.
Storage has continued to improve, both in performance and affordability. Spinning disks haven't totally disappeared; they're very stable and great for long-term storage, but these days they're typically found in the big server farms that power the cloud.
One issue with SSDs was that, although the medium is much faster than a spinning magnetic disk, the protocol through which the CPU talks to that data (SATA) was still the same. It was designed to talk to a spinning head, not to some newfangled flash technology. That too is starting to go away. NVMe (Non-volatile Memory Express) is also based on flash-technology, but it talks to the CPU directly over PCIe lanes, a much faster channel.
Lastly, there is some very promising research being done on PMEM (Persistent Memory), a kind of "long-lived" RAM. It too can talk directly to the CPU, but it does so over the CPU's Memory Bus, rather than PCIe. In fact, PMEM can fit directly into the RAM DIMM slots and operate either as system RAM or extended storage. It's still not as fast as RAM but is incredibly fast nonetheless, can store far more data, and best of all it persists. Although there have been some efforts by Intel to commercialize PMEM, it is still mostly in the lab and has yet to reach wider support by major cloud providers.
The beauty of having these technologies readily available in the cloud, and having software designed to work with them, is that the customer needn't worry about setting any of this up themselves. They can push the onus of engineering design and implementation onto the software vendor (ie., TypeSift).
So how fast can we really get? We believe it's possible to query up to 1 billion rows per second off of disk. The vast majority of companies' workloads fall far below 1 billion rows. And those that exceed it would build a bespoke architecture integrating best-of-breed systems, with TypeSift fitting in as an intermediate layer in their overall stack.
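It's worth a quick back-of-envelope on what that target implies for disk bandwidth. Assuming a columnar scan touches one 8-byte column per row (an assumption; compression and pruning would lower the real figure):

```python
# Back-of-envelope: what raw disk bandwidth does "1 billion rows per
# second" imply? Assumes one uncompressed 8-byte column is scanned per
# row - an illustrative assumption, since compression lowers this.
rows_per_second = 1_000_000_000
bytes_per_row = 8

required_gb_per_s = rows_per_second * bytes_per_row / 1e9
print(required_gb_per_s)  # 8.0
```

That works out to roughly 8 GB/s of sequential reads, which is within reach of a small number of modern NVMe drives working in parallel, before compression is even taken into account.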
We believe that for certain workloads, and certainly for the businesses we serve, there is tremendous opportunity in developing out-of-core processing for interactive analytical workloads. It's better suited for growth and scale, and brings enterprise-grade features to companies that would otherwise need to hire full engineering teams to achieve the same, much as we did at Arc'teryx. We think many vendors will move towards this paradigm in the next 10 years. We're also keeping a close watch on PMEM, as it strives to offer the best of both worlds and may replace RAM as the standard storage medium for interactive analytical workloads in the near future. But until then, Out-of-Core is likely the way forward.