BigQuery is nice, but it's definitely a major foot-gun in terms of cost. It's surprisingly easy to rack up high costs with, say, a misconfigured dashboard or a developer just testing things out.
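One mitigation is to dry-run queries to see what they would scan and to hard-cap billed bytes per job. A minimal sketch with the google-cloud-bigquery client; the project, dataset, and table names are made up:

```python
# Sketch: estimate scan size with a dry run, then cap billed bytes per query.
# Requires `pip install google-cloud-bigquery`; the table name is hypothetical.
from google.cloud import bigquery

client = bigquery.Client()
sql = "SELECT station_id, AVG(temp_c) FROM `my_project.weather.obs` GROUP BY station_id"

# Dry run: nothing is billed, but we learn how many bytes would be scanned.
dry = client.query(sql, job_config=bigquery.QueryJobConfig(dry_run=True, use_query_cache=False))
print(f"Would scan ~{dry.total_bytes_processed / 1e9:.1f} GB")

# Hard cap: the job fails loudly instead of silently costing money.
capped = bigquery.QueryJobConfig(maximum_bytes_billed=10 * 10**9)  # 10 GB
rows = client.query(sql, job_config=capped).result()
```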
> and may surprise you at any point in time with ever higher bills without higher usage.

What? Really? Do they change your pricing plan? How can they charge more for the same usage?
It's FEMA's NFHL. I can't recall the specific layer of the GDB file, but you could probably figure it out. Try loading Iowa into Redshift, and if that works for you I'd be quite surprised.

My org has a very large AWS spend, and we got to have a chat with some of their SWEs who work on the geospatial processing features for Redshift and Athena. We described what we needed and they said our only option was to aggregate the data first or drop the offending rows. Obviously we're not interested in compromising our work just to use a specific tool, so we opted for better tools.

The crux of the issue was that the problem column was the geometry itself, specifically MultiPolygon. You need to use the GEOMETRY data type for this [1]. However, our MultiPolygon values were tens to hundreds of MB, well outside the maximum size for the SUPER data type, which from what I can tell is 16 MB.

[1]: https://docs.aws.amazon.com/redshift/latest/dg/GeometryType-...
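For reference, a quick way to see how large those geometries actually are is to measure their serialized size with geopandas/shapely. A sketch; the file name and layer name are placeholders, since the exact NFHL layer isn't specified above:

```python
# Sketch: measure the serialized (WKB) size of each geometry in a file geodatabase.
# Requires geopandas; "SOME_NFHL_LAYER" and the .gdb path are placeholders.
import geopandas as gpd

gdf = gpd.read_file("NFHL_Iowa.gdb", layer="SOME_NFHL_LAYER")
wkb_mb = gdf.geometry.apply(lambda geom: len(geom.wkb) / 1e6)
print(wkb_mb.describe())  # rows well above ~16 MB won't fit in types like SUPER
```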
Things have been changing recently in AWS, but a few years ago the R family had some of the fastest instance uplinks to EBS. For example, only the largest M4 instance (m4.16xlarge) has a 10 Gbps uplink, whereas R5b starts at 10 Gbps for the lowest tier and goes up to 60 Gbps @ 260k IOPS. You can very easily choke EBS with a DB. EDIT: only newer C instances have a comparable uplink to EBS; C5 or C4 (and some C6) start from ~4.7 Gbps. Just compare the EBS bandwidth column in https://aws.amazon.com/ec2/instance-types/
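If you'd rather not eyeball the pricing page, the same numbers are available programmatically. A sketch with boto3; the instance types listed are just examples:

```python
# Sketch: compare baseline/maximum EBS bandwidth and IOPS across instance types.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
resp = ec2.describe_instance_types(
    InstanceTypes=["m4.16xlarge", "r5b.large", "r5b.24xlarge", "c5.2xlarge"]
)
for it in resp["InstanceTypes"]:
    ebs = it["EbsInfo"]["EbsOptimizedInfo"]
    print(
        it["InstanceType"],
        f'{ebs["BaselineBandwidthInMbps"]} Mbps baseline',
        f'{ebs["MaximumBandwidthInMbps"]} Mbps max',
        f'{ebs["MaximumIops"]} IOPS max',
    )
```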
Great post! Hi Ali! I think what's missing here is an analysis of what is gained by moving the weather data into an RDBMS. The motivation is to speed up queries, but what's the baseline? As someone very familiar with this tech landscape (maintainer of Xarray and Zarr, founder of https://earthmover.io/), I know that serverless solutions + object storage can deliver very low-latency performance (sub-second) for timeseries queries on weather data, much faster than the 30 minutes cited here, _if_ the data are chunked appropriately in Zarr. Given the difficulty of data ingestion described in this post, it's worth seriously evaluating those solutions before going down the RDBMS path.
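For a sense of what that access pattern looks like, here is a minimal sketch; the store URL and variable name are placeholders for whatever appropriately chunked Zarr copy of ERA5 you point it at:

```python
# Sketch: pull a point timeseries straight from a chunked Zarr store in object storage.
# Requires xarray, zarr, and gcsfs/s3fs depending on where the store lives.
import xarray as xr

store = "gs://some-bucket/era5-chunked-for-timeseries.zarr"  # placeholder URL
ds = xr.open_zarr(store, consolidated=True)

# One lat/lon, all timesteps: only the chunks covering this column are fetched.
point = ds["2m_temperature"].sel(latitude=41.6, longitude=266.6, method="nearest")
series = point.load()  # fast if the chunking favors the time axis
```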
Why? Most weather and climate datasets, including ERA5, are highly structured on regular latitude-longitude grids. Even if you were solely doing timeseries analyses for specific locations plucked from this grid, the strength of this sort of dataset is its intrinsic spatiotemporal structure and context, and it makes very little sense to completely destroy that structure unless you were solely and exclusively extracting point timeseries. And even then, you'd probably want to decimate the data pretty dramatically, since there is very little use for, say, a point timeseries of surface temperature in the middle of the ocean!

The vast majority of research and operational applications of datasets like ERA5 are probably better served by leveraging cloud-optimized replicas of the original dataset, such as ARCO-ERA5, published through the Google Public Datasets program [1]. These versions preserve the original structure and chunk it in ways that are amenable to massively parallel access via cloud storage. In almost any case I've encountered in my career, a generically chunked Zarr-based archive of a dataset like this will be more than performant enough for the majority of use cases one might care about.

[1]: https://cloud.google.com/storage/docs/public-datasets/era5
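If point timeseries really are the priority, the usual move is to keep the dataset in Zarr but write a copy whose chunks are long in time and small in space, rather than flattening it into rows. A rough sketch; the store paths, variable layout, and chunk sizes are illustrative, and for datasets of this size you would realistically use a tool like rechunker rather than a naive rewrite:

```python
# Sketch: re-chunk an ERA5-style cube so the time axis is contiguous per chunk,
# making "all timesteps at one location" reads cheap.
import xarray as xr

ds = xr.open_zarr("gs://source-era5.zarr")  # placeholder source store
ts_friendly = ds.chunk({"time": -1, "latitude": 20, "longitude": 20})
ts_friendly.to_zarr("gs://my-bucket/era5-timeseries.zarr", mode="w")
```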
Hey OP (assuming you're the author), you might be interested in a similar experiment I did about four years ago: same dataset, same target, similar goal! https://rdrn.me/optimising-sql/ It went through a similar sequence of investigations, but using regular Postgres rather than Timescale. With my setup I got another ~3x speedup over COPY by copying binary data directly (assuming your data is already in memory).
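For anyone curious what "copying binary data directly" can look like today, here is a rough sketch using psycopg 3's binary COPY support; the table and column names are made up, and the linked post describes the author's own approach, which may differ:

```python
# Sketch: stream rows into Postgres using COPY ... (FORMAT BINARY) via psycopg 3.
from datetime import datetime, timezone
import psycopg

rows = [(datetime(2020, 1, 1, tzinfo=timezone.utc), 52.5, 13.4, 4.2)]  # (time, lat, lon, t2m)

with psycopg.connect("dbname=weather") as conn, conn.cursor() as cur:
    with cur.copy("COPY era5 (time, lat, lon, t2m) FROM STDIN (FORMAT BINARY)") as copy:
        # Binary COPY needs explicit types so values are serialized correctly.
        copy.set_types(["timestamptz", "float8", "float8", "float8"])
        for row in rows:
            copy.write_row(row)
```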
Contributor to the PG bulk-loading docs you referenced here. Good survey of the techniques. I've done a good bit of this trying to speed up loading the OpenStreetMap database; there's a presentation at https://www.youtube.com/watch?v=BCMnu7xay2Y for my last public update. Since then, the advance of hardware, GIS improvements in PG15, and osm2pgsql adopting their middle-way-node-index-id-shift technique (which makes the largest but rarely used index 1/32 the size) have gotten my planet load times below 4 hours.

One suggestion aimed at the author here: some of your experiments take out WAL writing in a sort of indirect way, using pg_bulkload and COPY. There's one thing you could try that wasn't documented yet when my buddy Craig Ringer wrote the SO post you linked to: you can just turn off the WAL in the configuration. Yes, you will lose the tables in progress if there's a crash, and when things run for weeks those happen. With time-series data, it's not hard to structure the loading so you'll only lose the last chunk of work when that happens. WAL data isn't really necessary for bulk loading: crash, clean up the right edge of the loaded data, start back up. Here's the full set of postgresql.conf settings I run to disable the WAL and other overhead:

    wal_level = minimal
    max_wal_senders = 0
    synchronous_commit = off
    fsync = off
    full_page_writes = off
    autovacuum = off
    checkpoint_timeout = 60min

Finally, when loading in big chunks, to keep the vacuum work down I'd normally turn off autovacuum as above and then issue periodic VACUUM FREEZE commands running behind the currently loading date partition. (Talking normal PG here.) That skips some of the intermediate bookkeeping the database normally frets about, where new transactions are written but not yet visible to everyone.
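A sketch of that "vacuum behind the loading edge" idea, assuming per-day chunk tables; the table names are hypothetical, and the actual bulk loading is whatever COPY-based loader you already use:

```python
# Sketch: freeze each completed chunk while later chunks are still loading,
# so autovacuum (disabled during the bulk load) has nothing left to do afterwards.
import psycopg

loaded_chunks = ["era5_2020_01_01", "era5_2020_01_02"]  # hypothetical per-day chunk tables

# VACUUM cannot run inside a transaction block, hence autocommit=True.
with psycopg.connect("dbname=weather", autocommit=True) as conn:
    for table in loaded_chunks:
        conn.execute(f"VACUUM (FREEZE) {table}")
```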
Yeah, I'm curious about this too; I've been meaning to ask on the Timescale forums. My only guess is that there's some small extra overhead due to hypertable chunking. I know Timescale has a blog post from 2017 claiming a 20x higher insert rate, but that's for inserting into a table with an index. The general wisdom for loading huge amounts of data seems to be that you should insert into a table with no indexes and then build them afterwards, though. So with no index, inserting into a hypertable seems a bit slower.

Timescale blog post: https://medium.com/timescale/timescaledb-vs-6a696248104e
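A sketch of that load-then-index pattern via psycopg; the table shape and CSV file are illustrative:

```python
# Sketch: bulk-load into an index-free table, then build the index once at the end.
import psycopg

with psycopg.connect("dbname=weather") as conn, conn.cursor() as cur:
    cur.execute(
        "CREATE TABLE era5_plain (time timestamptz, lat float8, lon float8, t2m float8)"
    )
    # Stream the file through COPY in 1 MB pieces.
    with open("era5.csv") as f, cur.copy("COPY era5_plain FROM STDIN (FORMAT CSV)") as copy:
        while data := f.read(1 << 20):
            copy.write(data)
    # One index build at the end beats maintaining the index row by row during the load.
    cur.execute("CREATE INDEX ON era5_plain (time DESC)")
    conn.commit()
```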
Timescale hypertables automatically have an index on the timestamp. To make this more comparable you could create the same index on the normal table and test the ingestion rate.
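A sketch of that apples-to-apples setup; the table shape is illustrative, while create_hypertable is TimescaleDB's documented API:

```python
# Sketch: give the plain table the same time index a hypertable gets implicitly,
# then benchmark ingestion into each with the same loader.
import psycopg

STATEMENTS = [
    "CREATE TABLE plain_era5 (time timestamptz NOT NULL, lat float8, lon float8, t2m float8)",
    "CREATE INDEX ON plain_era5 (time DESC)",
    "CREATE TABLE hyper_era5 (time timestamptz NOT NULL, lat float8, lon float8, t2m float8)",
    "SELECT create_hypertable('hyper_era5', 'time')",
]

with psycopg.connect("dbname=weather") as conn:
    for stmt in STATEMENTS:
        conn.execute(stmt)
    conn.commit()
```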
ERA5 covers 1940 to the present. That's well before the satellite era (and the earlier data absolutely has more quality issues), but there's nothing from 170 years ago.
Yeah, don't use TimescaleDB, use ClickHouse - I have 10 years of NOAA climate data on my desktop that I query when I want to go on holidays ;-)
TimescaleDB primarily serves operational use cases: developers building products on top of live data, where you are regularly streaming in fresh data, and you often know what many queries look like a priori, because those are powering your live APIs, dashboards, and product experience.

That's different from a data warehouse or many traditional "OLAP" use cases, where you might dump a big dataset statically and then people occasionally run ad-hoc queries against it. This is the big weather-dataset file sitting on your desktop that you occasionally query while on holiday.

So it's less about "can you store weather data" and more about what that use case looks like. How are the queries shaped? Are you saving a single dataset for ad-hoc queries across the entire dataset, or continuously streaming in new data and aging out or de-prioritizing old data?

In most of the products we serve, customers are often interested in recent data in a very granular format ("shallow and wide"), or longer historical queries along a well-defined axis ("deep and narrow"). For example, this is where the benefits of TimescaleDB's segmented columnar compression emerge. It optimizes for the queries that are very common in your application, e.g., an IoT application that groups by or selects by deviceID, crypto/fintech analysis based on the ticker symbol, product analytics based on tenantID, etc.

If you look at ClickBench, what most of the queries say is: scan ALL the data in your database and GROUP BY one of the 100 columns in the web analytics logs. - https://github.com/ClickHouse/ClickBench/blob/main/clickhous... There are almost no time predicates in the benchmark that ClickHouse created, but perhaps that is not surprising given it was designed for ad-hoc weblog analytics at Yandex.

So yes, Timescale serves many products today that use weather data, but it has made different choices than ClickHouse (or things like DuckDB, pg_analytics, etc.) to serve those more operational use cases.
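To make the "segmented columnar compression" point concrete, here is a hedged sketch of the documented TimescaleDB settings, using an illustrative IoT-style hypertable named readings:

```python
# Sketch: enable TimescaleDB compression segmented by device and ordered by time,
# so "one device, recent window" queries only touch the relevant segments.
import psycopg

STATEMENTS = [
    "ALTER TABLE readings SET ("
    "  timescaledb.compress,"
    "  timescaledb.compress_segmentby = 'device_id',"
    "  timescaledb.compress_orderby = 'time DESC')",
    # Compress chunks older than a week; newer data stays row-oriented for fast inserts.
    "SELECT add_compression_policy('readings', INTERVAL '7 days')",
]

with psycopg.connect("dbname=iot") as conn:
    for stmt in STATEMENTS:
        conn.execute(stmt)
    conn.commit()
```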
So ClickHouse is a columnar DB. Any thoughts on whether the performance would be a wash if you just pivoted the Timescale hypertable and indexed on time + column in Timescale?
Tech lead for WEMC here; see https://tealtool.earth for straightforward charts of climate-related data for different countries and regions around the globe. For temperature and a few other variables, it shows historical data from the EU Copernicus service (C3S) along with three different projected series out to 2100; for CO2, it shows the latest historical data. The charts are concerning, and I am sure my co-workers are not hell-bent on faking data to scare people just to get more funding; they work too much and go to too many meetings.
You're trying to see a small change over a large fuzzy area. You need to look at the deviation from the mean (or some other reference), maybe filtered to remove the seasonal changes.
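In xarray terms, that's a climatology-anomaly calculation. A sketch; the store path and variable name are placeholders:

```python
# Sketch: remove the seasonal cycle by subtracting a monthly climatology,
# so the small long-term signal isn't drowned out by the annual swing.
import xarray as xr

ds = xr.open_zarr("era5_subset.zarr")  # placeholder store
t2m = ds["2m_temperature"]             # placeholder variable name

climatology = t2m.groupby("time.month").mean("time")
anomaly = t2m.groupby("time.month") - climatology  # deviation from the reference

# Spatially averaged, annual-mean anomaly: the usual "warming stripes" style series.
annual_anomaly = anomaly.mean(["latitude", "longitude"]).groupby("time.year").mean("time")
```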
I know what I am seeing. I know how a graph like that is made: 80 years of noisy data squeezed into a small row. But it's enough to tone down the panic talk about climate change.
Quick note to observe that all of the above, while true, becomes less of a practical issue as CPUs get faster relative to I/O, which they have done and probably will keep doing.
Can it actually reuse the query plan? Couldn't that eventually lead to a worse-performing plan, say after inserting so much data that a different plan becomes the better option?
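On the Postgres side this is a real concern with prepared statements: after a few executions the planner may switch to a cached generic plan, and if the table's data distribution has shifted a lot, that cached plan can be a poor fit until statistics are refreshed or the statement is re-prepared. A sketch of the knob that controls this behavior (plan_cache_mode exists in PostgreSQL 12+); the table and query are illustrative:

```python
# Sketch: force per-execution ("custom") planning for a session whose data
# distribution keeps shifting, instead of trusting a cached generic plan.
from datetime import date
import psycopg

with psycopg.connect("dbname=weather") as conn, conn.cursor() as cur:
    cur.execute("SET plan_cache_mode = force_custom_plan")  # PostgreSQL 12+
    cur.execute(
        "SELECT count(*) FROM era5 WHERE time >= %s",
        (date(2020, 1, 1),),
        prepare=True,  # server-side prepared statement
    )
    print(cur.fetchone())
```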
One thing you quickly realize with geospatial data is that it's incredibly nuanced. You have to be quite careful about understanding which coordinate reference system (CRS) is in use and, for visualization, which projection. The CRS is somewhat paranoia-inducing if you don't have good infrastructure set up with the right tools to carry that metadata along with your geospatial data.
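A small illustration of why carrying the CRS matters, sketched with geopandas; the input files are placeholders:

```python
# Sketch: always check and reconcile CRSs before measuring or joining layers.
import geopandas as gpd

parcels = gpd.read_file("parcels.shp")          # placeholder inputs
flood_zones = gpd.read_file("flood_zones.gpkg")
print(parcels.crs, flood_zones.crs)             # e.g. EPSG:4326 vs. a state plane CRS

# Reproject to a common, distance-preserving CRS before any area/length math;
# doing geometry math on raw lat/lon degrees silently gives nonsense units.
common = flood_zones.to_crs(parcels.estimate_utm_crs())
parcels_utm = parcels.to_crs(common.crs)
print(parcels_utm.area.head())                  # now in square metres
```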
I've tested everything AWS has to offer, plus Postgres/PostGIS, Spark/DB, Snowflake, Trino, and ArcGIS.
I'm convinced the best tool for large-scale geospatial work is Google BigQuery, and it's not even close. It took an expensive, multi-hour query running on PostGIS (deployed on an enormous m6a EC2 instance) down to less than 5 seconds, running in the BigQuery free tier. It does make sense if you think about it, since Google was early with Maps, but it is just stunning how much better they are in this specific niche.
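For flavor, this is roughly what such a query looks like against BigQuery's native GEOGRAPHY type; a sketch only, with made-up tables and columns, using BigQuery's documented ST_* geography functions:

```python
# Sketch: a spatial join in BigQuery using its native GEOGRAPHY type.
from google.cloud import bigquery

client = bigquery.Client()
sql = """
SELECT p.parcel_id, z.zone_code
FROM `my_project.geo.parcels` AS p
JOIN `my_project.geo.flood_zones` AS z
  ON ST_INTERSECTS(p.geom, z.geom)
"""
for row in client.query(sql).result():
    print(row.parcel_id, row.zone_code)
```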
This was using publicly available FEMA data that Snowflake and AWS services would just choke on, because the geometry column exceeded their maximum byte size. Spark doesn't even have geospatial data types, and the open-source packages that add support leave a lot to be desired.
This guy is running on-prem, so maybe this made sense, but I just would never bother. The storage for BQ would probably be less than $100/month for 20 TB.