
Disposable Software Meets Data Analytics: On-Demand Insights with a Datalake and Python

a16z’s “Disposable Software” thesis—that we can build small, throwaway apps because the economics of software have changed—applies to analytics too. Here’s how to do data analytics on demand using a datalake and disposable Python scripts.

TL;DR

Store raw or semi-raw data in a datalake (e.g. S3 + Parquet). Skip heavy modeling and BI until you need them. Use disposable Python scripts (DuckDB, pandas, Polars) to answer one-off questions, generate reports, or feed simple dashboards. Keep the scripts if they’re useful; delete them if not. Analytics no longer has to be a permanent product—it can be on demand.


LogNroll Team


The Disposable Software Idea

In “Disposable Software”, Anish Acharya at Andreessen Horowitz argues that software used to be expensive and slow to build, so we built it for serious, lasting use cases—payroll, ERP, big consumer networks. Today, with LLMs and AI-native runtimes, we can spin up small, personal, or throwaway apps in an hour or less. Building little apps is starting to feel like doodling in a notebook; the limit is imagination, not ROI.

The same shift applies to data analytics. We used to justify only “serious” analytics: data warehouses, modeled schemas, BI tools, and pipelines that had to last. But not every question deserves a permanent solution. Sometimes you just need an answer once, or for a narrow audience. A datalake plus disposable Python scripts lets you do analytics on demand—without committing to a product.

Datalake: Store First, Interpret Later

A datalake is typically object storage (e.g. S3, GCS, or MinIO) where you land raw or semi-raw data—events, logs, exports, API snapshots—often in columnar formats like Parquet. You don’t have to define a single schema up front or build a full data model. You ingest and keep; you interpret when you need to (schema-on-read).

  • Low commitment: Add new sources and new shapes without redesigning the whole system.
  • Cheap storage: Object storage is inexpensive; you can keep more history and experiment.
  • Query when needed: Use DuckDB, Athena, or Python to query only when you have a question.

That fits the disposable mindset: you don’t have to build the “right” schema forever. You build enough structure to query (e.g. date-partitioned Parquet), then write small scripts for specific questions.

Disposable Python Scripts for Analytics

A disposable script reads from the lake, does a transform or aggregation, and outputs a result—CSV, chart, or a table for a simple dashboard. You keep it if it’s useful; you delete it if it was one-off. Tools like DuckDB and pandas (or Polars) make this fast without a dedicated analytics stack.

Example 1: One-Off Report from the Lake (DuckDB + Parquet)

Events are in the lake under s3://my-bucket/events/year=2026/month=03/*.parquet. You need a quick summary by day and event type. One script, no pipeline:

import duckdb

con = duckdb.connect()
# httpfs lets DuckDB read s3:// paths; AWS credentials are picked up
# from the environment
con.execute("""
    INSTALL httpfs; LOAD httpfs;
    SET s3_region = 'us-east-1';
""")

# Query Parquet in S3 directly
df = con.execute("""
    SELECT
        date_trunc('day', event_time) AS day,
        event_type,
        count(*) AS cnt
    FROM read_parquet('s3://my-bucket/events/year=2026/month=03/*.parquet')
    GROUP BY 1, 2
    ORDER BY 1, 2
""").fetchdf()

df.to_csv("monthly_event_summary.csv", index=False)
print(df)

Run it when you need it. If the question never comes back, you never run it again. If it does, you might turn it into a cron job or a small scheduled report—still “disposable” in spirit.

Example 2: Ad-Hoc Cohort with Pandas

You have user signups and activity in Parquet files under the lake. You want a one-time cohort: “Users who signed up in January and did at least one action in the first 7 days.” No warehouse, no dbt—just a script:

import pandas as pd

# Read from the lake (s3:// paths need s3fs installed)
signups = pd.read_parquet("s3://my-bucket/signups/2026-01.parquet")
# pandas reads a directory as a dataset; glob patterns are not supported here
actions = pd.read_parquet("s3://my-bucket/events/2026-01/")

signups["signup_ts"] = pd.to_datetime(signups["created_at"])
actions["event_ts"] = pd.to_datetime(actions["timestamp"])

# Join each user's actions to their signup row
cohort = signups.merge(actions, on="user_id", how="inner")

# Keep only actions within the first 7 days after signup
window = (cohort["event_ts"] >= cohort["signup_ts"]) & (
    cohort["event_ts"] <= cohort["signup_ts"] + pd.Timedelta(days=7)
)
cohort = cohort[window]

# Anyone left did at least one action in the window
result = signups[signups["user_id"].isin(cohort["user_id"].unique())]

result.to_csv("jan_activated_cohort.csv", index=False)

Once you have the CSV, you can share it, plug it into a slide, or feed a one-off chart. The script is the “app”; the datalake is the only durable part.

Example 3: On-Demand Dashboard Feed

You want a simple “daily active” metric for a small internal dashboard. Instead of a full BI pipeline, run a script (manually or on a schedule) that reads from the lake, aggregates, and writes a small JSON or CSV that the dashboard consumes:

import duckdb
import json
from pathlib import Path

con = duckdb.connect()
# Credentials and region come from the environment, as in Example 1
con.execute("INSTALL httpfs; LOAD httpfs;")

# Last 30 days DAU from events
dau = con.execute("""
    SELECT
        date_trunc('day', event_time)::date AS day,
        count(DISTINCT user_id) AS dau
    FROM read_parquet('s3://my-bucket/events/**/*.parquet')
    WHERE event_time >= current_date - 30
    GROUP BY 1
    ORDER BY 1
""").fetchdf()

out = Path("dashboard_data/dau.json")
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text(dau.to_json(orient="records", date_format="iso"))

The dashboard is a static page or a simple app that loads dau.json. The “pipeline” is a single script. If requirements change, you edit or replace the script—no need to treat it as permanent infrastructure.

When to Upgrade (and When Not To)

Disposable analytics don’t replace data warehouses or BI when you have recurring, governed, multi-team reporting. They’re for ad-hoc questions, one-off reports, and small-audience or personal dashboards. When a script becomes critical and runs every day, you might promote it to a proper job (Airflow, cron, or a small service). When a one-off stays one-off, you leave it as a script or delete it. The datalake remains the durable layer; the scripts are the disposable layer on top.

Conclusion

a16z’s disposable software idea—that we can build small, throwaway apps because the economics of software have changed—extends to analytics. With a datalake and disposable Python scripts, you can do data analytics on demand: answer one-off questions, generate reports, and feed simple dashboards without building a permanent BI product. Store first, interpret later; script when you need it; keep or throw away. Software creation used to be constrained by ROI; analytics can now be constrained more by curiosity and need than by infrastructure.

Read the original

Anish Acharya, “Disposable Software,” Andreessen Horowitz (a16z).
