WTF Memory - Exploring Memory Usage

Overview
An educational repository for learning about memory usage in data libraries with the memray profiler. It includes examples of loading and processing data with DuckDB, Apache Arrow, and Delta Lake, and of exchanging data between them.
This project started as an attempt to improve the dbt-duckdb integration with Delta tables by understanding how much memory is consumed when data is passed from DuckDB to the Delta writer in Arrow format. A nice visualization of memory allocation during that transfer can be found here.
Learnings
- dbt just sends SQL commands to the database, and the queries are not always optimized for memory efficiency
- Arrow is very useful. DuckDB has to hand Arrow data to the caller as an iterator because its buffer manager can exchange memory blocks, so fixed pointers into them don’t work
- Changes to an open-source project have to be small and incremental to be accepted. I invested a lot of time but could not ensure backward compatibility, so my PR was not accepted
- Memray is excellent, and its graph representation helps visualize memory usage and understand what you’re doing wrong in your program
- Understanding RAM usage is important for the performance of a data application
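The point about handing data to the caller as an iterator can be illustrated without any of the libraries above. The following is a minimal sketch under stated assumptions, not code from this repository: it uses plain Python lists as a stand-in for Arrow record batches and the standard library's `tracemalloc` as a stand-in for memray. The function names and the sizes `ROWS` and `BATCH_SIZE` are hypothetical and chosen only for illustration.

```python
import tracemalloc
from typing import Callable, Iterator, List, Tuple

ROWS = 200_000       # illustrative row count, not taken from the repository
BATCH_SIZE = 10_000  # illustrative batch size


def produce_batches(rows: int, batch_size: int) -> Iterator[List[int]]:
    """Yield rows in fixed-size batches, like a reader handing out chunks."""
    for start in range(0, rows, batch_size):
        yield list(range(start, min(start + batch_size, rows)))


def consume_materialized() -> int:
    """Collect every batch first, then process: all rows are alive at once."""
    batches = list(produce_batches(ROWS, BATCH_SIZE))
    return sum(len(batch) for batch in batches)


def consume_streaming() -> int:
    """Process one batch at a time so earlier batches can be freed."""
    total = 0
    for batch in produce_batches(ROWS, BATCH_SIZE):
        total += len(batch)
    return total


def peak_bytes(func: Callable[[], int]) -> Tuple[int, int]:
    """Run func under tracemalloc and return (result, peak traced bytes)."""
    tracemalloc.start()
    try:
        result = func()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


if __name__ == "__main__":
    for func in (consume_materialized, consume_streaming):
        rows, peak = peak_bytes(func)
        print(f"{func.__name__}: rows={rows}, peak={peak / 1e6:.1f} MB")
```

Both functions process the same rows, but the streaming variant should report a much lower peak because only a batch or two is alive at any time. Note that `tracemalloc` only sees allocations made through Python's allocator, so for native libraries such as DuckDB or Arrow a profiler like memray is the better tool.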
Related tweets
Doing some baby unscientific experiments with different code flows while exporting for the dbt-duckdb
We are particularly interested in the difference between b (current flow) and c (potentially a new flow) #duckdb pic.twitter.com/0W0HG52lKQ
Aleksandar Milicevic (@milicevica23), March 4, 2024
Playing around with memray and delta-rs and i am not sure if i understand what i see
1. Why is .arrow() not stable but it takes around 1 GB?
2. What is happening with resident memory?
3. if we give arrow format to delta writer it copies data again? pic.twitter.com/fg38REN8fw
Aleksandar Milicevic (@milicevica23), February 25, 2024
