BigQuery is nice, but it's definitely a major foot-gun in terms of cost. It's surprisingly easy to rack up high costs with, say, a misconfigured dashboard or a developer just testing things out.
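One mitigation is to dry-run queries to see what they would scan and to hard-cap billed bytes per job. A minimal sketch with the google-cloud-bigquery client; the project, dataset, and table names are made up:

```python
# Sketch: estimate scan size with a dry run, then cap billed bytes per query.
# Requires `pip install google-cloud-bigquery`; the table name is hypothetical.
from google.cloud import bigquery

client = bigquery.Client()
sql = "SELECT station_id, AVG(temp_c) FROM `my_project.weather.obs` GROUP BY station_id"

# Dry run: nothing is billed, but we learn how many bytes would be scanned.
dry = client.query(sql, job_config=bigquery.QueryJobConfig(dry_run=True, use_query_cache=False))
print(f"Would scan ~{dry.total_bytes_processed / 1e9:.1f} GB")

# Hard cap: the job fails loudly instead of silently costing money.
capped = bigquery.QueryJobConfig(maximum_bytes_billed=10 * 10**9)  # 10 GB
rows = client.query(sql, job_config=capped).result()
```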
> and may surprise you at any point in time with ever higher bills without higher usage.

What? Really? Do they change your pricing plan? How can they charge more for the same usage?
It's FEMA's NFHL. I can't recall the specific layer of the GDB file, but you could probably figure it out. Try loading Iowa into Redshift, and if that works for you I'd be quite surprised.

My org has a very large AWS spend, and we got to have a chat with some of their SWEs who work on the geospatial processing features for Redshift and Athena. We described what we needed and they said our only option was to aggregate the data first or drop the offending rows. Obviously we're not interested in compromising our work just to use a specific tool, so we opted for better tools.

The crux of the issue was that the problem column was the geometry itself, specifically MultiPolygon. You need to use the GEOMETRY data type for this [1]. However, our MultiPolygon values were tens to hundreds of MB, well outside the maximum size for the SUPER data type, which from what I can tell is 16 MB.

[1]: https://docs.aws.amazon.com/redshift/latest/dg/GeometryType-...
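For reference, a quick way to see how large those geometries actually are is to measure their serialized size with geopandas/shapely. A sketch; the file name and layer name are placeholders, since the exact NFHL layer isn't specified above:

```python
# Sketch: measure the serialized (WKB) size of each geometry in a file geodatabase.
# Requires geopandas; "SOME_NFHL_LAYER" and the .gdb path are placeholders.
import geopandas as gpd

gdf = gpd.read_file("NFHL_Iowa.gdb", layer="SOME_NFHL_LAYER")
wkb_mb = gdf.geometry.apply(lambda geom: len(geom.wkb) / 1e6)
print(wkb_mb.describe())  # rows well above ~16 MB won't fit in types like SUPER
```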
Things have been changing recently in AWS, but a few years ago the R family had some of the fastest instance uplinks to EBS. For example, only the largest M4 instance (m4.16xlarge) has a 10 Gbps uplink, whereas R5b starts at 10 Gbps for the lowest tier and goes up to 60 Gbps @ 260k IOPS. You can very easily choke EBS with a DB. EDIT: only newer C instances have a comparable uplink to EBS; C5 or C4 (and some C6) start from ~4.7 Gbps. Just compare the EBS bandwidth column in https://aws.amazon.com/ec2/instance-types/
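If you'd rather not eyeball the pricing page, the same numbers are available programmatically. A sketch with boto3; the instance types listed are just examples:

```python
# Sketch: compare baseline/maximum EBS bandwidth and IOPS across instance types.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
resp = ec2.describe_instance_types(
    InstanceTypes=["m4.16xlarge", "r5b.large", "r5b.24xlarge", "c5.2xlarge"]
)
for it in resp["InstanceTypes"]:
    ebs = it["EbsInfo"]["EbsOptimizedInfo"]
    print(
        it["InstanceType"],
        f'{ebs["BaselineBandwidthInMbps"]} Mbps baseline',
        f'{ebs["MaximumBandwidthInMbps"]} Mbps max',
        f'{ebs["MaximumIops"]} IOPS max',
    )
```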
Great post! Hi Ali! I think what's missing here is an analysis of what is gained by moving the weather data into an RDBMS. The motivation is to speed up queries, but what's the baseline? As someone very familiar with this tech landscape (maintainer of Xarray and Zarr, founder of https://earthmover.io/), I know that serverless solutions + object storage can deliver very low-latency performance (sub-second) for timeseries queries on weather data, much faster than the 30 minutes cited here, _if_ the data are chunked appropriately in Zarr. Given the difficulty of data ingestion described in this post, it's worth seriously evaluating those solutions before going down the RDBMS path.
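For a sense of what that access pattern looks like, here is a minimal sketch; the store URL and variable name are placeholders for whatever appropriately chunked Zarr copy of ERA5 you point it at:

```python
# Sketch: pull a point timeseries straight from a chunked Zarr store in object storage.
# Requires xarray, zarr, and gcsfs/s3fs depending on where the store lives.
import xarray as xr

store = "gs://some-bucket/era5-chunked-for-timeseries.zarr"  # placeholder URL
ds = xr.open_zarr(store, consolidated=True)

# One lat/lon, all timesteps: only the chunks covering this column are fetched.
point = ds["2m_temperature"].sel(latitude=41.6, longitude=266.6, method="nearest")
series = point.load()  # fast if the chunking favors the time axis
```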
Why? Most weather and climate datasets, including ERA5, are highly structured on regular latitude-longitude grids. Even if you were solely doing timeseries analyses for specific locations plucked from this grid, the strength of this sort of dataset is its intrinsic spatiotemporal structure and context, and it makes very little sense to completely destroy that structure unless you were solely and exclusively extracting point timeseries. And even then, you'd probably want to decimate the data pretty dramatically, since there is very little use for, say, a point timeseries of surface temperature in the middle of the ocean!

The vast majority of research and operational applications of datasets like ERA5 are probably better served by leveraging cloud-optimized replicas of the original dataset, such as ARCO-ERA5, published through the Google Public Datasets program [1]. These versions preserve the original structure and chunk it in ways that are amenable to massively parallel access via cloud storage. In almost any case I've encountered in my career, a generically chunked Zarr-based archive of a dataset like this will be more than performant enough for the majority of use cases one might care about.

[1]: https://cloud.google.com/storage/docs/public-datasets/era5
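If point timeseries really are the priority, the usual move is to keep the dataset in Zarr but write a copy whose chunks are long in time and small in space, rather than flattening it into rows. A rough sketch; the store paths, variable layout, and chunk sizes are illustrative, and for datasets of this size you would realistically use a tool like rechunker rather than a naive rewrite:

```python
# Sketch: re-chunk an ERA5-style cube so the time axis is contiguous per chunk,
# making "all timesteps at one location" reads cheap.
import xarray as xr

ds = xr.open_zarr("gs://source-era5.zarr")  # placeholder source store
ts_friendly = ds.chunk({"time": -1, "latitude": 20, "longitude": 20})
ts_friendly.to_zarr("gs://my-bucket/era5-timeseries.zarr", mode="w")
```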
Hey OP (assuming you're the author), you might be interested in a similar experiment I did about four years ago: same dataset, same target, similar goal! https://rdrn.me/optimising-sql/ It went through a similar sequence of investigations, but using regular Postgres rather than Timescale. With my setup I got another ~3x speedup over COPY by copying binary data directly (assuming your data is already in memory).
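For anyone curious what "copying binary data directly" can look like today, here is a rough sketch using psycopg 3's binary COPY support; the table and column names are made up, and the linked post describes the author's own approach, which may differ:

```python
# Sketch: stream rows into Postgres using COPY ... (FORMAT BINARY) via psycopg 3.
from datetime import datetime, timezone
import psycopg

rows = [(datetime(2020, 1, 1, tzinfo=timezone.utc), 52.5, 13.4, 4.2)]  # (time, lat, lon, t2m)

with psycopg.connect("dbname=weather") as conn, conn.cursor() as cur:
    with cur.copy("COPY era5 (time, lat, lon, t2m) FROM STDIN (FORMAT BINARY)") as copy:
        # Binary COPY needs explicit types so values are serialized correctly.
        copy.set_types(["timestamptz", "float8", "float8", "float8"])
        for row in rows:
            copy.write_row(row)
```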
Contributor to the PG bulk-loading docs you referenced here. Good survey of the techniques. I've done a good bit of this trying to speed up loading the OpenStreetMap database; there's a presentation at https://www.youtube.com/watch?v=BCMnu7xay2Y for my last public update. Since then, the advance of hardware, GIS improvements in PG15, and osm2pgsql adopting their middle-way-node-index-id-shift technique (which makes the largest but rarely used index 1/32 the size) have gotten my planet load times below 4 hours.

One suggestion aimed at the author here: some of your experiments take out WAL writing in a sort of indirect way, using pg_bulkload and COPY. There's one thing you could try that wasn't documented yet when my buddy Craig Ringer wrote the SO post you linked to: you can just turn off the WAL in the configuration. Yes, you will lose the tables in progress if there's a crash, and when things run for weeks those happen. With time-series data, it's not hard to structure the loading so you'll only lose the last chunk of work when that happens. WAL data isn't really necessary for bulk loading: crash, clean up the right edge of the loaded data, start back up. Here's the full set of postgresql.conf settings I run to disable the WAL and other overhead:

    wal_level = minimal
    max_wal_senders = 0
    synchronous_commit = off
    fsync = off
    full_page_writes = off
    autovacuum = off
    checkpoint_timeout = 60min

Finally, when loading in big chunks, to keep the vacuum work down I'd normally turn off autovacuum as above and then issue periodic VACUUM FREEZE commands running behind the currently loading date partition. (Talking normal PG here.) That skips some of the intermediate bookkeeping the database normally frets about, where new transactions are written but not yet visible to everyone.
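A sketch of that "vacuum behind the loading edge" idea, assuming per-day chunk tables; the table names are hypothetical, and the actual bulk loading is whatever COPY-based loader you already use:

```python
# Sketch: freeze each completed chunk while later chunks are still loading,
# so autovacuum (disabled during the bulk load) has nothing left to do afterwards.
import psycopg

loaded_chunks = ["era5_2020_01_01", "era5_2020_01_02"]  # hypothetical per-day chunk tables

# VACUUM cannot run inside a transaction block, hence autocommit=True.
with psycopg.connect("dbname=weather", autocommit=True) as conn:
    for table in loaded_chunks:
        conn.execute(f"VACUUM (FREEZE) {table}")
```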
Yeah, I'm curious about this too; I've been meaning to ask on the Timescale forums. My only guess is that there's some small extra overhead due to hypertable chunking. I know Timescale has a blog post from 2017 claiming a 20x higher insert rate, but that's for inserting into a table with an index. The general wisdom for loading huge amounts of data seems to be that you should insert into a table with no indexes and then build them afterwards, though. So with no index, inserting into a hypertable seems a bit slower.

Timescale blog post: https://medium.com/timescale/timescaledb-vs-6a696248104e
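A sketch of that load-then-index pattern via psycopg; the table shape and CSV file are illustrative:

```python
# Sketch: bulk-load into an index-free table, then build the index once at the end.
import psycopg

with psycopg.connect("dbname=weather") as conn, conn.cursor() as cur:
    cur.execute(
        "CREATE TABLE era5_plain (time timestamptz, lat float8, lon float8, t2m float8)"
    )
    # Stream the file through COPY in 1 MB pieces.
    with open("era5.csv") as f, cur.copy("COPY era5_plain FROM STDIN (FORMAT CSV)") as copy:
        while data := f.read(1 << 20):
            copy.write(data)
    # One index build at the end beats maintaining the index row by row during the load.
    cur.execute("CREATE INDEX ON era5_plain (time DESC)")
    conn.commit()
```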
Timescale hypertables automatically have an index on the timestamp. To make this more comparable you could create the same index on the normal table and test the ingestion rate.
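A sketch of that apples-to-apples setup; the table shape is illustrative, while create_hypertable is TimescaleDB's documented API:

```python
# Sketch: give the plain table the same time index a hypertable gets implicitly,
# then benchmark ingestion into each with the same loader.
import psycopg

STATEMENTS = [
    "CREATE TABLE plain_era5 (time timestamptz NOT NULL, lat float8, lon float8, t2m float8)",
    "CREATE INDEX ON plain_era5 (time DESC)",
    "CREATE TABLE hyper_era5 (time timestamptz NOT NULL, lat float8, lon float8, t2m float8)",
    "SELECT create_hypertable('hyper_era5', 'time')",
]

with psycopg.connect("dbname=weather") as conn:
    for stmt in STATEMENTS:
        conn.execute(stmt)
    conn.commit()
```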
ERA5 covers 1940 to the present. That's well before the satellite era (and the earlier data absolutely has more quality issues), but there's nothing from 170 years ago.
Yeah, don't use TimescaleDB, use ClickHouse - I have 10 years of NOAA climate data on my desktop that I query when I want to go on holidays ;-)
TimescaleDB primarily serves operational use cases: developers building products on top of live data, where you are regularly streaming in fresh data, and you often know what many queries look like a priori, because those are powering your live APIs, dashboards, and product experience.

That's different from a data warehouse or many traditional "OLAP" use cases, where you might dump a big dataset statically and then people occasionally run ad-hoc queries against it. This is the big weather-dataset file sitting on your desktop that you occasionally query while on holiday.

So it's less about "can you store weather data" and more about what that use case looks like. How are the queries shaped? Are you saving a single dataset for ad-hoc queries across the entire dataset, or continuously streaming in new data and aging out or de-prioritizing old data?

In most of the products we serve, customers are often interested in recent data in a very granular format ("shallow and wide"), or longer historical queries along a well-defined axis ("deep and narrow"). For example, this is where the benefits of TimescaleDB's segmented columnar compression emerge. It optimizes for the queries that are very common in your application, e.g., an IoT application that groups by or selects by deviceID, crypto/fintech analysis based on the ticker symbol, product analytics based on tenantID, etc.

If you look at ClickBench, what most of the queries say is: scan ALL the data in your database and GROUP BY one of the 100 columns in the web analytics logs. - https://github.com/ClickHouse/ClickBench/blob/main/clickhous... There are almost no time predicates in the benchmark that ClickHouse created, but perhaps that is not surprising given it was designed for ad-hoc weblog analytics at Yandex.

So yes, Timescale serves many products today that use weather data, but it has made different choices than ClickHouse (or things like DuckDB, pg_analytics, etc.) to serve those more operational use cases.
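To make the "segmented columnar compression" point concrete, here is a hedged sketch of the documented TimescaleDB settings, using an illustrative IoT-style hypertable named readings:

```python
# Sketch: enable TimescaleDB compression segmented by device and ordered by time,
# so "one device, recent window" queries only touch the relevant segments.
import psycopg

STATEMENTS = [
    "ALTER TABLE readings SET ("
    "  timescaledb.compress,"
    "  timescaledb.compress_segmentby = 'device_id',"
    "  timescaledb.compress_orderby = 'time DESC')",
    # Compress chunks older than a week; newer data stays row-oriented for fast inserts.
    "SELECT add_compression_policy('readings', INTERVAL '7 days')",
]

with psycopg.connect("dbname=iot") as conn:
    for stmt in STATEMENTS:
        conn.execute(stmt)
    conn.commit()
```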
So ClickHouse is a columnar DB. Any thoughts on whether the performance would be a wash if you just pivoted the Timescale hypertable and indexed on time + column in Timescale?
Tech lead for WEMC here; see https://tealtool.earth for straightforward charts of climate-related data for different countries and regions around the globe. For temperature and a few other variables, it shows historical data from the EU Copernicus service (C3S) along with three different projected series out to 2100; for CO2, it shows the latest historical data. The charts are concerning, and I am sure my co-workers are not hell-bent on faking data to scare people just to get more funding; they work too much and go to too many meetings.
You're trying to see a small change over a large fuzzy area. You need to look at the deviation from the mean (or some other reference), maybe filtered to remove the seasonal changes.
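In xarray terms, that's a climatology-anomaly calculation. A sketch; the store path and variable name are placeholders:

```python
# Sketch: remove the seasonal cycle by subtracting a monthly climatology,
# so the small long-term signal isn't drowned out by the annual swing.
import xarray as xr

ds = xr.open_zarr("era5_subset.zarr")  # placeholder store
t2m = ds["2m_temperature"]             # placeholder variable name

climatology = t2m.groupby("time.month").mean("time")
anomaly = t2m.groupby("time.month") - climatology  # deviation from the reference

# Spatially averaged, annual-mean anomaly: the usual "warming stripes" style series.
annual_anomaly = anomaly.mean(["latitude", "longitude"]).groupby("time.year").mean("time")
```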
I know what I am seeing. I know how a graph like that is made: 80 years of noisy data squeezed into a small row. But it's enough to tone down the panic talk about climate change.
Quick note to observe that all of the above, while true, becomes less of a practical issue as CPUs get faster relative to I/O, which they have done and probably will keep doing.
Can it actually reuse the query plan? Couldn't that eventually lead to a worse-performing plan, say after inserting so much data that a different plan becomes the better option?
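On the Postgres side this is a real concern with prepared statements: after a few executions the planner may switch to a cached generic plan, and if the table's data distribution has shifted a lot, that cached plan can be a poor fit until statistics are refreshed or the statement is re-prepared. A sketch of the knob that controls this behavior (plan_cache_mode exists in PostgreSQL 12+); the table and query are illustrative:

```python
# Sketch: force per-execution ("custom") planning for a session whose data
# distribution keeps shifting, instead of trusting a cached generic plan.
from datetime import date
import psycopg

with psycopg.connect("dbname=weather") as conn, conn.cursor() as cur:
    cur.execute("SET plan_cache_mode = force_custom_plan")  # PostgreSQL 12+
    cur.execute(
        "SELECT count(*) FROM era5 WHERE time >= %s",
        (date(2020, 1, 1),),
        prepare=True,  # server-side prepared statement
    )
    print(cur.fetchone())
```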
One thing you quickly realize with geospatial data is that it's incredibly nuanced. You have to be quite careful about understanding which coordinate reference system (CRS) is in use and, for visualization, which projection. The CRS is somewhat paranoia-inducing if you don't have good infrastructure set up with the right tools to carry that metadata along with your geospatial data.
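A small illustration of why carrying the CRS matters, sketched with geopandas; the input files are placeholders:

```python
# Sketch: always check and reconcile CRSs before measuring or joining layers.
import geopandas as gpd

parcels = gpd.read_file("parcels.shp")          # placeholder inputs
flood_zones = gpd.read_file("flood_zones.gpkg")
print(parcels.crs, flood_zones.crs)             # e.g. EPSG:4326 vs. a state plane CRS

# Reproject to a common, distance-preserving CRS before any area/length math;
# doing geometry math on raw lat/lon degrees silently gives nonsense units.
common = flood_zones.to_crs(parcels.estimate_utm_crs())
parcels_utm = parcels.to_crs(common.crs)
print(parcels_utm.area.head())                  # now in square metres
```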
I've tested everything AWS has to offer, plus Postgres/PostGIS, Spark/DB, Snowflake, Trino, and ArcGIS.
I'm convinced the best tool for large-scale geospatial work is Google BigQuery, and it's not even close. It took an expensive, multi-hour query running on PostGIS (deployed on an enormous m6a EC2 instance) down to less than 5 seconds, running in the BigQuery free tier. It does make sense if you think about it, since Google was early with Maps, but it is just stunning how much better they are in this specific niche.
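For flavor, this is roughly what such a query looks like against BigQuery's native GEOGRAPHY type; a sketch only, with made-up tables and columns, using BigQuery's documented ST_* geography functions:

```python
# Sketch: a spatial join in BigQuery using its native GEOGRAPHY type.
from google.cloud import bigquery

client = bigquery.Client()
sql = """
SELECT p.parcel_id, z.zone_code
FROM `my_project.geo.parcels` AS p
JOIN `my_project.geo.flood_zones` AS z
  ON ST_INTERSECTS(p.geom, z.geom)
"""
for row in client.query(sql).result():
    print(row.parcel_id, row.zone_code)
```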
This was using publicly available FEMA data that Snowflake and AWS services would just choke on, because the geometry column exceeded their maximum byte size. Spark doesn't even have geospatial data types, and the open-source packages that add support leave a lot to be desired.
This guy is running on-prem, so maybe this made sense, but I just would never bother. The storage for BQ would probably be less than $100/month for 20 TB.