Designing Surveillance-Intensive Applications (book)

Reading period: Feb 2017–May 2017

Designing Data-Intensive Applications. Martin Kleppmann. O’Reilly 2017 (preprint version)

a gift from O’Reilly at FOSDEM 2017
like Distributed Systems by Tannenbaum but with more real-world examples and application explanation

Brief notes

startups
- quick iteration of features (rather than scalability)
document databases
- XML, JSON
- lack joins, encourage denormalization
log based structures
- TODO LSM trees
- append only, databases
OLTP vs OLAP (transactions vs analytics)
- kingdom of RDBMSs
RPC
- design flaw that tris to hide inherent drawbacks of network (unreliability, latency, repeated delivery)
- SOAP (reliance on scaffolders, API generators)
“For a successful technology, reality must take precedence over public relations, for nature cannot be fooled.” by Feynman
mobile clients – extreme of distributed systems
- “As the rich history of broken calendar sync implementations demonstrates, multi-leader replication is a tricky thing to get right.”
- work in disconnected mode (very long latency)
TODO stages of consistency
- total ordering
- causal ordering
- reading your writes
- transactions
  - multi-version concurrency control (MVCC), snapshots
    - writers never block readers and vice versa
  - still write skew (FoL, room booking)
    - airlines and their overbooking policies (if it’s economic your data needn’t be consistent)
  - 2 phase locking vs 2 phase commit
  - SSI (serializable snapshot isolation) – new kid on the block
- strong consistency
- relation to distributed consensus problem
  - various assumptions (e.g. availability of random numbers)
  - most algorithms assume a single node can process all data (should it be necessary)
end-to-end constraints
- request ID to eliminate write skew(?)
immutability FTW
- except for data retention regulations
  - moral aspects of data processing (data = exhaust (cf. environmentalism) , s/data/surveillance/)
  - asymmetric relation between a user and service (cf. employee and employer)
    - social cost of not using social networking service
skewed workload
- followers of Justin Bieber on Twitter
partitioning and scalability
- initially there is only a single partition (with little data) but it’s a potential bottleneck
system model
- assumptions one has to take about system to provide guarantees
absolute time is a luxury
- even each CPU can have different time
- drift
- APIs (Google Spanner) that work with confidence intervals
- problem with finite windows (e.g. collecting events)
difference between computer science and software engineering
- “a real implementation may still have to include code to handle the case where something happens that was assumed to be impossible”
tools/systems integration
- Unix philosophy (utilities + pipes)
- idea of piping e-mail account and shopping history to a data analytics tool
- convergence of Map-Reduce and stored procedures in RDBMS (MPP?)
- stream processing (dual problem to RDBMS – static query, ephemeral data)
“People often specialize in one particular niche of technology, and remain unaware of the requirements that exist in other niches.”
beginnings of web/distributed applications
- the state: page could only scroll up and down