Analytics for not-so-big data with DuckDB

In the past decade the industry has seen hundreds of new databases. Most of these newcomers are operational databases, meant for online workloads and for serving as an application's primary datastore.

Database

Architecture

Big Data

A handful of new databases are meant for analytical use cases, mainly large-scale big data workloads. DuckDB is an interesting exception, because it's built for workloads that are too big for traditional databases, but not so big that they justify complicated big data tools. It's a lightweight, open-source, analytical database for people with gigabytes or single terabytes of data, not companies with hundreds of terabytes and teams of data engineers.

In this session we'll take DuckDB out for a test drive with live demos and discussion of interesting use cases. We'll see how to use it to quickly run analytical queries on data from multiple data sources. We'll look at how to use DuckDB to transform and manipulate diverse datasets, such as turning a bunch of raw CSV data in S3 into a set of tables in MySQL with a single command. We'll check out its embedded capabilities by running the database directly inside a Python application. And finally, we'll build a quick-and-dirty data lake with DuckDB, without any complicated big data tools.

David Ostrovsky

At age 9 little David found an old book called "Electronic Computational Machines" at the library and, after reading it in a single weekend, decided that this was what he wanted to do with his life. Three years later he finally got to touch a computer for the first time and discovered that it was totally worth the wait. One thing led to another and now he’s a software engineer at Meta. David is a software developer with over 25 years of industry experience, as well as a speaker, trainer, blogger and co-author of “Pro Couchbase Server”. He specializes in large-scale distributed system architecture.

NDC { Oslo }
