I remember us once giving a supplier access to our internal bug tracker for a collaborative project. They were unable to get to the “…/openissue” endpoint.

I once worked for a company that blocked Cuban sites because of .cu (which is the Portuguese word for the end of your digestive system), but did not block porn sites (or so I was told ;-).

That seems like the sort of thing you check on your last day as you’re going out the door.

“The rumors are true!” (Although less amusing, you could also just ask the IT guys and gals)

Apache Arrow Columnar Format:
https://arrow.apache.org/docs/format/Columnar.html :

> The Arrow columnar format includes a language-agnostic in-memory data structure specification, metadata serialization, and a protocol for serialization and generic data transport. This document is intended to provide adequate detail to create a new implementation of the columnar format without the aid of an existing implementation. We utilize Google’s Flatbuffers project for metadata serialization, so it will be necessary to refer to the project’s Flatbuffers protocol definition files while reading this document. The columnar format has some key features:

> - Data adjacency for sequential access (scans)
> - O(1) (constant-time) random access
> - SIMD and vectorization-friendly
> - Relocatable without “pointer swizzling”, allowing for true zero-copy access in shared memory

Are the major SQL file formats already SIMD-optimized and zero-copy across TCP/IP? Arrow doesn't do full or partial indexes.

Apache Arrow supports the Feather and Parquet on-disk file formats. Feather is on-disk Arrow IPC, now with LZ4 compression by default or optionally ZSTD.

Some databases support Parquet as the database flat file format (the format that a DBMS process like PostgreSQL or MySQL provides a logged, permissioned, and cached query interface with query planning on top of). IIUC, with Parquet it's possible both to query data tables offline as files on disk with normal tools and to query them online through a persistent process with tunable parameters, optionally also enforcing schema and referential integrity centrally.

From https://stackoverflow.com/questions/48083405/what-are-the-di... :

> Parquet format is designed for long-term storage, where Arrow is more intended for short term or ephemeral storage

> Parquet is more expensive to write than Feather as it features more layers of encoding and compression. Feather is unmodified raw columnar Arrow memory. We will probably add simple compression to Feather in the future.

> Due to dictionary encoding, RLE encoding, and data page compression, Parquet files will often be much smaller than Feather files

> Parquet is a standard storage format for analytics that's supported by many different systems: Spark, Hive, Impala, various AWS services, in future by BigQuery, etc. So if you are doing analytics, Parquet is a good option as a reference storage format for query by multiple systems

Those systems index Parquet. Can they also index Feather IPC, which an application might already have to journal and/or log, and checkpoint?

Edit: What are some of the DLT solutions for indexing, given a consensus-controlled message spec designed for synchronization?

- cosmos/iavl: a Merkleized AVL+ tree (a balanced search tree with Merkle hashes and snapshots to prevent tampering and enable synchronization): https://github.com/cosmos/iavl/blob/master/docs/overview.md

- google/trillian maintains Merkle-hashed edges between rows in table order, but is centralized

- "EVM Query Language: SQL-Like Language for Ethereum" (2024) https://news.ycombinator.com/item?id=41124567 : [...]
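
For concreteness, here is a minimal sketch (assuming the pyarrow package is installed; table contents and file names are arbitrary) of writing the same table to both on-disk formats discussed above:

```python
# Minimal sketch, assuming pyarrow is installed (pip install pyarrow).
import pyarrow as pa
import pyarrow.feather as feather
import pyarrow.parquet as pq

table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# Feather v2 is the Arrow IPC file format on disk; compression is
# LZ4 by default, with ZSTD as an option.
feather.write_feather(table, "data.feather", compression="zstd")

# Parquet adds dictionary/RLE encoding and page compression on top,
# so it is slower to write but usually smaller on disk.
pq.write_table(table, "data.parquet", compression="zstd")

# Reading back: Feather round-trips the Arrow in-memory format;
# true zero-copy reads apply when the file is uncompressed.
roundtrip = feather.read_table("data.feather")
```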

It's probably complaining about the relative path; try replacing `-v ./pg-data:/var/lib/postgresql/data` with `-v "$PWD/pg-data:/var/lib/postgresql/data"`.

Was thinking the same thing when I saw those zeros in the checksum field. Perhaps the consequences are significant.

Here's a benchmarking exercise I found: https://www-staging.commandprompt.com/uploads/images/Command... With a tidy summary:

> Any application with a high shared buffers hit ratio: little difference.

> Any application with a high ratio of reads/writes: little difference.

> Data logging application with a low ratio of reads/inserts, and few updates and deletes: little difference.

> Application with an equal ratio of reads/inserts, or many updates or deletes, and a low shared buffers hit ratio (for example, an ETL workload), especially where the rows are scattered among disk pages: expect double or greater CPU and disk I/O use.

> Run pg_dump on a database where all rows have already been previously selected by applications: little difference.

> Run pg_dump on a database with large quantities of rows inserted to insert-only tables: expect roughly double CPU and disk I/O use.
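
To see which of those buckets a given workload falls into, the shared buffers hit ratio can be read from pg_stat_database. A minimal sketch, assuming the psycopg2 driver; the connection string is a placeholder:

```python
# Minimal sketch, assuming psycopg2 (pip install psycopg2-binary);
# the connection string below is a placeholder.
import psycopg2

conn = psycopg2.connect("dbname=mydb user=postgres")
with conn, conn.cursor() as cur:
    # blks_hit / (blks_hit + blks_read) approximates the shared
    # buffers hit ratio for the current database.
    cur.execute("""
        SELECT datname,
               round(100.0 * blks_hit / nullif(blks_hit + blks_read, 0), 2)
          FROM pg_stat_database
         WHERE datname = current_database()
    """)
    print(cur.fetchone())
conn.close()
```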

Humans work better. HN is small-scale enough that a moderator can come along, collapse the off-topic comments, and fix the title, and it's not an issue.

IIRC, if the original submitter edits the title once it has been posted, the edited version sticks, i.e. the filter only works the first time and you can override it if you notice it.

@drewsberry: I wish you had an RSS feed! I tried to subscribe to your blog but if there is one it's not linked.

(Enjoyed the post)

Next paragraph mentions TOAST, and this byte is related to that. On little-endian platforms the low-order bits of the first byte determine how the value is stored: if the two lowest bits are 00, the value is stored inline uncompressed and the whole first 4-byte word holds the total length; if they are 10, it is stored inline compressed; if the byte is exactly 0x01, a pointer into the TOAST table follows; and any other byte with the lowest bit set marks a short value of fewer than 127 bytes whose total length is the first byte >> 1. For 0x25 the lowest bit is 1, so the length is 0x25 >> 1 = 18: that byte followed by the 17 bytes of "Equatorial Guinea".

Edit: the reason endianness matters is that the same representation is also used in memory, where the whole first word is interpreted as one length value. The TOAST tag bits have to land in the first byte; on big endian that is most easily done by using the two highest-order bits of the word, which means the tag sits in the two highest bits of that byte, while on little endian the tag ends up in the lowest-order bits instead.
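
For concreteness, a tiny Python sketch of that little-endian decoding rule (the classification follows the varlena header layout in PostgreSQL's postgres.h; the input byte is the 0x25 from the post):

```python
# Decode the first byte of a little-endian varlena header.
def decode_varlena_header(first_byte: int) -> str:
    if first_byte == 0x01:
        # Exactly 0x01: a TOAST pointer follows (value stored out of line).
        return "TOAST pointer"
    if first_byte & 0x01:
        # Any other odd byte: short inline value; total length
        # (including this header byte) is the remaining 7 bits.
        total = (first_byte >> 1) & 0x7F
        return f"short inline value, total length {total} ({total - 1} data bytes)"
    if first_byte & 0x02:
        return "4-byte header, inline compressed value"
    return "4-byte header, inline uncompressed value"

# 0x25 >> 1 == 18: the header byte plus the 17 bytes of "Equatorial Guinea".
print(decode_varlena_header(0x25))
```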