7.6.22

There is only stream processing.

This post is about what to call stream processing. Strangely enough, it has little to do with terminology and almost everything to do with perspective.

If you are pressed for time (ha!), the conclusion is that "stream processing vs batch processing" is a false dichotomy. Batch processing is merely an implementation of stream processing. Every time people use these as opposites, they are making the wrong abstraction.

Can a reasonable definition be general?

A reasonable definition of data stream is a sequence of data chunks, where the length of the sequence is either "large" or unbounded.

This is a pretty general definition, especially if we leave open what "large" means. It covers, for example:

  • Anything transmitted over TCP, say a file or HTTP response
  • a Twitter feed
  • system or application logs, say metadata from any user's visit to this blog
  • the URLs of websites you visit
  • location data acquired from your mobile phone
  • personal details of children who start school every year

It also stands to reason that stream processing is processing of a data stream. A data stream is data that is separated in time.

When a definition is that general, engineers get suspicious. Should it not make a difference whether a chunk of data arrives every year or every millisecond? Every conversation about stream processing happens in a context where we know a lot more about the temporal nature of the sequence.

The same engineers, when dealing with distributed systems, have no problem referring to a large data set as a whole, even if they know full well that it may be distributed over many machines. We may well call a large data set data that is separated in space. A data stream could also be both, separated in time and space (maybe we could call that a "distributed data stream").

At the warehouse

Where am I going with this? Let's quickly establish that "batch processing" is but a form of stream processing, in the above definition.

What do people call batch processing? The etymology of this goes back to punch cards, but it is not about those.

Batch processing is frequently found in data warehousing, or extract-transform-load (ETL) processes. This is a setup that has been around since the 1970s, and this does not make it bad. What is essential is that data is periodically ingested (extract), say in the form of large files. It is then turned (transform) into a uniform representation that is suitable for various kinds of querying (load).

Accumulating data and processing it periodically: we have seen this before. Does the data fit the general definition of data stream? Surely, since there is a notion of new data coming in.

What could be alternative terminology? Tyler Akidau uses the word "batch engine" in this post on stream processing. So the good old periodic processing could be called "stream processing done with a batch engine."
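To make that phrase concrete, here is a minimal Python sketch under assumptions of my own: the stream is any iterator of numbers, and `batches`, `batch_engine` and `process` are hypothetical names, not taken from any existing system. The engine only ever sees one accumulated batch, yet the whole thing consumes a stream and produces a stream of results.

```python
from itertools import count, islice
from typing import Callable, Iterable, Iterator, List


def batches(stream: Iterable[int], size: int) -> Iterator[List[int]]:
    """Accumulate records from a stream into batches of `size` (the last may be shorter)."""
    it = iter(stream)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def batch_engine(batch: List[int]) -> int:
    """Stand-in for a periodic job (assumption): it just sums one batch."""
    return sum(batch)


def process(stream: Iterable[int], size: int,
            engine: Callable[[List[int]], int]) -> Iterator[int]:
    """Stream processing done with a batch engine: one result per batch."""
    for batch in batches(stream, size):
        yield engine(batch)


# A bounded input: three batches, three results.
results = list(process(range(10), 4, batch_engine))

# An unbounded input works too, as long as we only pull what we need.
first_of_endless = next(process(count(), 3, batch_engine))
```

The batch size plays the role of the period here; a real setup would more likely cut batches by time (say, one file per day), which does not change the shape of the argument.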

I promised that this is not only about terminology. When did people first feel the need to distinguish batch processing from stream processing?

Data stream management systems

The people who systematically needed to distinguish processing of data streams from simply large data sets were the database community. Dear academics, I don't mean to hurt any feelings, but I will just count all papers on "data stream processing" or "complex event processing" as database community.

The database researchers implemented efficient evaluation of a (reasonably) standard, high-level, declarative language: relational queries (SQL). Efficient evaluation and performance meant making the best use of available, bounded resources. As part of this journey, architectures appeared that look very similar to streams of partial results (such as the Volcano, or iterator, model).

Take a look at the relational query (SELECT becomes $\pi$, WHERE becomes $\sigma$, JOIN becomes $\Join$). What would happen if we flipped the direction of the arrows? Instead of bounded data like the "Employee" and "Building" tables, we could have continuous streams that go through the machinery.

Exercise: draw the diagram above, with all arrows reversed, and think about how this could be a useful stream processing system. Maybe there is a setup procedure that has to happen in each building before an engineer starts working from there and they would like to know about arrivals and new departures.

Data stream management systems (DSMS) became a thing 20 years ago when implementors realized that most of their stuff would continue to work when tuples come in continuously, mutatis mutandis.

  • Michael Stonebraker, of PostgreSQL fame, built a system called Aurora and founded a company, StreamBase Systems
  • At Stanford they wrote about the Stanford Stream Data Management System and Continuous Query Language (CQL)
  • (I'm not going to do a survey here)

Academic authors will delineate their field using descriptions as follows.

  • "high" rate of updates
  • requirement to provide results in "real time" (bounded time)
  • limited resources (e.g. memory)

We are now getting closer to the more specific meaning people attach to "stream processing:" not only do we receive the data in chunks at different times, we also need to produce results in a "short" amount of time, or with "bounded" machine resources.

In order to understand why database folks turned to streaming, one should know that evaluating queries in a DBMS with high performance is also a form of stream processing. When a user types a SQL query at the prompt (SELECT ... FROM ..., the all-caps giving that distinctive pre-history feel), the result should come in as quickly as possible, and there are surely some limits on the machine resources. So generalizing query evaluation to continuously arriving data really was a logical next step to overcome limitations.
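The following is a rough Python sketch of that idea, not a description of how any particular DBMS or DSMS is built. It writes iterator-style operators as generators; the operator names, the rows and the `arrivals` source are all made up for illustration. The point is that the same query plan runs unchanged over a bounded table and over an endless stream.

```python
from itertools import count, islice
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]


def scan(rows: Iterable[Row]) -> Iterator[Row]:
    """Leaf operator: hands out rows one at a time, from a table or a stream."""
    for row in rows:
        yield row


def select(pred: Callable[[Row], bool], child: Iterator[Row]) -> Iterator[Row]:
    """WHERE (sigma): pulls from its child and passes on matching rows."""
    return (row for row in child if pred(row))


def project(cols: List[str], child: Iterator[Row]) -> Iterator[Row]:
    """SELECT (pi): keeps only the named columns."""
    return ({c: row[c] for c in cols} for row in child)


def plan(source: Iterable[Row]) -> Iterator[Row]:
    """SELECT name FROM source WHERE building = 'B1', as a tree of iterators."""
    return project(["name"], select(lambda r: r["building"] == "B1", scan(source)))


# Bounded input: a small, made-up "Employee" table.
employees = [
    {"name": "Ada", "building": "B1"},
    {"name": "Bob", "building": "B2"},
    {"name": "Cy", "building": "B1"},
]
bounded = list(plan(employees))


def arrivals() -> Iterator[Row]:
    """Unbounded input (assumption): an endless sequence of arrival events."""
    for i in count():
        yield {"name": f"engineer-{i}", "building": "B1" if i % 2 == 0 else "B2"}


# The same plan over the endless stream; we pull only the first three results.
unbounded_prefix = list(islice(plan(arrivals()), 3))
```

Nothing in `select` or `project` needed to know whether the input ever ends. Operators that need to see all of their input, such as sorting or a general join, are where the "mutatis mutandis" comes in.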

Interlude: Windowing and micro-batching

If you did the exercise, you may wonder how a join is supposed to work when we take a streaming perspective? There are multiple answers.

A particularly nice and useful scenario is when the joins can be considered simple lookups, for example when each row in a data stream is enriched with something that is looked up via a service.
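As a small illustration of the lookup case, here is a Python sketch in which the "service" is just a dictionary. The names `BUILDING_CITY`, `lookup_city` and `enrich`, and the data, are hypothetical; in practice the lookup might be a remote call or a cache.

```python
from typing import Dict, Iterable, Iterator

# Hypothetical lookup "service", here simply an in-memory table.
BUILDING_CITY: Dict[str, str] = {"B1": "Zurich", "B2": "Bern"}


def lookup_city(building: str) -> str:
    """Look up the city for a building; unknown buildings get a placeholder."""
    return BUILDING_CITY.get(building, "unknown")


def enrich(stream: Iterable[Dict[str, str]]) -> Iterator[Dict[str, str]]:
    """Join-as-lookup: each row is handled on its own, no buffering required."""
    for row in stream:
        yield {**row, "city": lookup_city(row["building"])}


events = [
    {"name": "Ada", "building": "B1"},
    {"name": "Bob", "building": "B9"},
]
enriched = list(enrich(events))
```

Because every row is processed independently, this kind of join needs no window and no assumption about how the stream is timed.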

Another scenario is the windowed stream join: here we consider bounded segments (windows) of two data streams and join what we find in those. Usually, this requires some guarantees: the streams are a priori not synchronous. They may either be synchronous enough, or one may use some amount of buffering and periodically process what is in the buffer.
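Here is one way such a windowed join could look in Python. It is a sketch under simplifying assumptions: events are hypothetical `(timestamp, key, payload)` tuples, each input is already ordered by timestamp (the "synchronous enough" case), and windows are fixed, non-overlapping spans of `width` time units. The function names are mine.

```python
import heapq
from typing import Iterable, Iterator, List, Tuple

Event = Tuple[int, str, str]          # (timestamp, key, payload)
Joined = Tuple[int, str, str, str]    # (window, key, left payload, right payload)
Buffers = Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]


def join_buffers(window: int, buffers: Buffers) -> Iterator[Joined]:
    """Join what was buffered for one window: a plain nested-loop join on key."""
    left_buf, right_buf = buffers
    for lkey, lval in left_buf:
        for rkey, rval in right_buf:
            if lkey == rkey:
                yield (window, lkey, lval, rval)


def windowed_join(left: Iterable[Event], right: Iterable[Event],
                  width: int) -> Iterator[Joined]:
    """Join two time-ordered streams on key within windows of `width` time units."""
    tagged_left = ((ts, 0, key, val) for ts, key, val in left)
    tagged_right = ((ts, 1, key, val) for ts, key, val in right)
    current = None
    buffers: Buffers = ([], [])
    # heapq.merge lazily interleaves the two inputs by timestamp.
    for ts, side, key, val in heapq.merge(tagged_left, tagged_right):
        window = ts // width
        if current is not None and window != current:
            # Time has moved past the buffered window: process it and start over.
            yield from join_buffers(current, buffers)
            buffers = ([], [])
        current = window
        buffers[side].append((key, val))
    if current is not None:
        yield from join_buffers(current, buffers)


arrivals = [(1, "B1", "Ada arrives"), (4, "B2", "Bob arrives"), (11, "B1", "Cy arrives")]
setups = [(2, "B1", "setup done"), (12, "B2", "setup done"), (13, "B1", "setup done")]
joined = list(windowed_join(arrivals, setups, width=10))
```

Note that Bob's arrival at time 4 and the B2 setup at time 12 fall into different windows and therefore do not join; where to cut the windows is part of the guarantees one has to negotiate.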

Wait - did I just write "periodic processing"? That is right, so it looks like when stream processing is hardcore enough, it contains periodic processing again. This is where people usually say things like "micro-batch." There are simply scenarios in stream processing that cannot be done without periodic processing (everything that involves windows in processing time).

A horizontal argument

Now, the words "limited machine resources" meant something different 20 years ago. We can (and do) build systems that involve many machines and communication between them, sharding or "horizontal scaling". From the good old MapReduce to NoSQL and Craig Chambers' FlumeJava (which lives happily on as Apache Beam) to Spark and Cloud Dataflow, there is a series of systems and APIs that deal with stream processing in a distributed manner.

Tying the knot

The irony of talking about "batch processing" today is that periodic processing of large files also involves distributed stream processing underneath. When a large input data set is processed with a MapReduce, it is distributed across a set of workers that, in the map phase, locally produce partial results. The shuffle phase then takes care of getting all partial results to the right place for the reduce phase. The "separation in space" is also a "separation in time:" a particular partition can be "done" before others, which means that the results form a result stream.
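A toy Python sketch of this observation, with threads standing in for workers: it is not how MapReduce is implemented, it skips the shuffle by using a single reducer, and the function names and data are made up. What it shows is that partial results arrive in whatever order the partitions finish, and the reduce step simply consumes them as a stream.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Iterable, Iterator, List


def map_partition(lines: List[str]) -> Counter:
    """Map phase: a worker counts words in its own partition only."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def partial_results(partitions: List[List[str]]) -> Iterator[Counter]:
    """Yield each partition's partial result as soon as its worker is done."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(map_partition, p) for p in partitions]
        # as_completed yields in completion order, not submission order.
        for future in as_completed(futures):
            yield future.result()


def reduce_stream(partials: Iterable[Counter]) -> Counter:
    """Reduce phase: fold the stream of partial results into one total."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)
    return total


partitions = [["a b a"], ["b c"], ["a c c"]]
totals = reduce_stream(partial_results(partitions))
```

The final counts do not depend on the order in which partitions complete, which is exactly why one can usually ignore the temporal separation at this level.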

Depending on the level at which one is discussing a system, the performance expectations we have, and the trade-offs, one may be able to ignore spatial and temporal separation. It seems that recognizing the temporal separation always brings advantages.

8.4.22

Polaroids from Swiss Cyber Security Days SCSD Day 1

Here are a few things I took home from Day 1 of the Swiss Cyber Security Days conference (SCSD), which was two days ago. I work as a software engineer and am used to dealing with technical angles, but cyber security is one of the areas where technological change impacts society at large. I made the decision to go shortly after the Russia-Ukraine war broke out; if cyber security was a pressing problem before, the new political era should make it impossible for society to ignore, and I wanted to find out first-hand what that means.

(Note: the title is a kind of metaphor that is resolved at the end. The public domain images I include here are random things from the interwebs, not actually from the conference.)

Switzerland Officials and National Councillors

Day 1 was opened according to official protocol by Doris Fiala, member of the Swiss National Council, member of the parliamentary Security Policy Committee, and president of SCSD. She greeted many of the officials and members of parliament present. The morning featured Swiss officials, including high-ranking ones who are directly responsible for dealing with cyber security.

The speakers provided perspectives on Swiss national policy, the role of the federal government and the armed forces. The officials spoke in the national languages German and French, often switching from one to the other in the same talk, and there was simultaneous translation (like in UN conferences). A few highlights:

Florian Schütz has been acting as Swiss federal Cyber Security Delegate since 2019 and reported on strategy and outlook. He talked about the activities of the National Cyber Security Center NCSC, and how it relates to the Financial Sector security center which was founded this week in Zurich. Cyber security means something different in every sector, and a federal center would be limited in depth; this can be addressed by creating and supporting sector-specific cyber security centers where decision makers from companies in that sector can participate directly. He also pointed out that most companies in Switzerland are SMBs, and even if many are aware of the problem now, they do not know what concretely to do about the risk.

The majority of Swiss companies have less than 0.5M CHF revenue per year of which they might be able to set aside 3-5% for cyber security. This is not a lot! Getting companies to take the risk seriously and improve their security posture through awareness and training was picked up by other speakers.

The challenge for official institutions is that they have in general not been made aware of cyber security incidents, since companies are not incentivized to report breaches. There is a current proposal to make reporting mandatory, which would help improve visibility.

Major General Alain Vuitel is director of the Project Cyber Command of the Swiss Armed Forces, and talked about the process of establishing Cyber Command. This is a significant step, since the army is structured into four existing commands and the Cyber Command will become the fifth one. Vuitel pointed out the importance of information in warfare, that this is a factor in Ukraine, and the impact on civil society and the collateral damage from attacks on communication infrastructure like satellites and from disinformation campaigns.

The cyber attack on the KA-SAT satellite communications network on Feb 24th, the day Russia attacked Ukraine, led to loss of maintenance capabilities for a German operator of wind turbines. Also in the weeks leading up to the outbreak of war, there were cyber attacks on Ukrainian infrastructure. War is no longer just armed conflict; covert operations and information war mean that in times of "peace", there are attacks going on, and this accepted fact was observed in real time. Vuitel also pointed out that modern communication brings a new potential: every citizen with a mobile phone can act as a sensor, providing information, provided they have the freedom to act, which underlines the importance of communication infrastructure.

keyboard with lock - to stand for ransomware

Finally, he pointed out that while recruitment is challenging, Switzerland has mandatory military service and a militia system. This means the armed forces are able to reach large parts of the population and provide them with cyber security training, and it gave them the means to recruit their first cyber battalion.

Judith Bellaiche, National Councillor and CEO of SWICO, described the evolution of cyber security as reflected in parliamentary discussions over the last 15 years. Political discourse in the Swiss parliament reflects society, and the topic of cyber security evolved from early years of becoming aware, demanding reports and making sense of the topic to increasing political demands for new official competencies and laws. Most impressively, she summarized this as raw fear: the parliamentary representatives feel that the situation is out of control and demand action, with proposals to expand the role of the state. In order to validate this, representative surveys were conducted, and it turned out that the general population is even more afraid and leans toward expanding the role of the state. This is a political development which will likely have consequences. It would be a major paradigm change if the role of the state was expanded to protect private companies from cyber attacks; however, it does not look like the private sector can fix this themselves. The debate about expanding the technical and organizational competencies of the state is certain to affect Swiss society.

Practitioners and the Private Sector: Ransomware

In a podium discussion organized by the World Economic Forum on "What’s next for multistakeholder action against cybercrime," it was articulated that ransomware is the biggest topic in cyber security. Countless companies are victims of ransomware attacks, and the conference featured experience reports from incident responders. Jacky Fox, managing director of Accenture Security, told a war story of Irish hospitals losing their entire infrastructure: doctors had to resort to pen and paper, could not operate diagnostic machines or effectively treat patients, and incident responders had to face that their decisions were matters of life and death.

Serge Droz, seasoned incident responder and director of FIRST (Forum of Incident Response and Security Teams), pointed out that case data is important. Investigating a single case of ransomware is impossible, while data from 20-30 cases provides a basis for investigation, as one can recognize patterns. He pointed out the difference in roles in a multistakeholder action: incident responders have as first priority the return to normal conditions, while the prosecution's first priority is forensics and identifying the attacker.

Maya Bundt, Cyber Practice Leader at Swiss Re and representing insurances, pointed out how the idea of data sharing is much debated but rarely gets concrete. There are many different kinds of "data" involved in ransomware attacks, from technical security-relevant data like indicators of compromise to case-specific data (how did something happen) or data specific to the victim company, and in the debate this is often all mixed up. For insurances, what matters is cost: the amount of the ransom is often only a small fraction of the cost to a company, which needs to deal with data loss, getting back to business and forensics. It was pointed out that the role of the insurance company is to make the risk transferable, while the risk associated with a cyber security attack is not completely transferable. Insurances do play an important role since they can demand that companies implement practices and improve their posture, and they have an interest in doing so, because if too many companies are attacked, insurance gets so expensive that nobody will be able to afford it anymore.

The White House: Chris Inglis on getting left of the attack

Chris Inglis currently serves as the first US National Cyber Director and advisor to the President of the United States, Joe Biden, on cybersecurity, and of course the US perspective was a highlight. His talk was a balance of rallying support and putting "cyberspace" into perspective with a conceptual framework. Cyberspace is a permanent reality that affects everyone's life. It affects whether electricity is available, whether public transportation works and whether your local store is able to operate. Consequently, the defense of cyberspace is not about the defense of technical "stuff", but about the critical functions that are served by the stuff. From this, he derives three factors that cyber defense is about:
  • the stuff: already since the 60s-70s, we know about notions and challenges of technical quality
  • the people: not only the ones who are adjacent to stuff, but who work for and benefit from the critical functions that affect everyone's lives
  • the doctrine, or "roles and responsibilities" that we assign in society. He mentions the supply-chain attack SolarWinds and how many actors were aware but thought someone else is responsible, and that adversaries are actively looking out for weak doctrine.

Defense is fragmented, because cyberspace is not perceived as a universal, ambient sphere of society but as disconnected patches, which means adversaries can deal with a target one at a time but are not countered with a coordinated response.

He advocates for resilience by design, arguing that there are too many variables for anything to be called "secure." The point is to actually defend, not the "stuff" but the critical functions, to ensure that anomalies can be detected as early as possible and countered with maximal leverage, and to move away from current practices that assume an ineffective and meaningless division of labor. He points out that cyberspace is a shared resource, and this type of defense was done successfully many times before: in the automobile industry, in the aviation industry, for food and drugs. When asked about the role of regulation, he answered that first comes a general understanding of what constitutes common practice, tailored to the sector of interest, and only then might there be regulation as a last resort, preferably coordinated and not in 50 overlapping variants.

Wrapping up

There was a lot more to the conference, but I will end here. It is clear that society at large, including Switzerland, is definitely aware and that things will happen. I noticed people mentioning "supply chain" and also SBOM and the recent NIST standard, how to improve security posture in healthcare, which depends on industry as well as certifications (security patches can apparently break certification), and much more. I picked "polaroid" as a title because, just like the good old notions, practices and challenges around software quality that in Chris Inglis's words have accompanied the software industry since the 60s, polaroid (instant pictures) is something that comes, goes bankrupt, and comes back, since we cannot really do without it.

I share the optimism of some of the speakers like Serge Droz, who say that after many years of helplessness, thanks to raised awareness and more coordinated action, we as a society should be able to tackle the ransomware phenomenon. Beyond technical questions, establishing practices and security standards will require everyone, society at large, to take part in the effort.

29.3.22

The ultimate cross language guide to modules and packages

Intent \ Language   go        java      rust              npm       racket
source file         -         -         -                 -         module
logical grouping    package   package   module            module    collection
unit of release     module    module    package (crate)   package   package

racket is special in that a racket package (*) may happily mess with multiple collections.