(comments)

Original link: https://news.ycombinator.com/item?id=41097241

Although creating a column-aligned table looks simple, the author found it surprisingly challenging because of its hidden complexity. Writing such a function requires determining the maximum width of each column and adding the right amount of spacing or indentation based on those widths. In Python, for example, this involves iterating over the table data and building a format string that pads entries with spaces, yet even with language features like f-strings and built-in padding, the implementation remains convoluted and error-prone. Commenters also share their experience in other languages, including Lil (described as a Lisp-like functional language) and various shell scripts, and suggest that simpler approaches, such as permitting trailing whitespace padding, make the implementation easier. One commenter recalls an interview for a remote SRE position that involved a coding exercise in a web-based word processor, where they were asked to build a netstat-like tool in bash. Unable to complete that task, they offered to write trimmed-down replacements for ps and fuser instead; their approach was deemed acceptable, suggesting that real-world scenarios can closely mirror the challenges described in the article.

Related Articles

Original Article


> # turns out that the most difficult problem in computer science is aligning things

> # this one function looks simple but it took so fucking long

Heh. I can't count how many times I've written a column-aligning function in various programming languages, and each time it is a pain. And it sounds simple in my head - just get the max length of each column, and add spaces up to the next multiple of tab-size.

But even in Python with f-strings and all the fancy padding stuff it has, it ends up a convoluted, unreadable mess:

    # randomwordgenerator.com
    table = [
        ['agony', 'kick', 'pump'],
        ['frown', 'lonely', 'mutation'],
        ['sail', 'tasty', 'want'],
    ]

    tab_width = 4
    n_rows, n_cols = len(table), len(table[0])
    max_width = [
        max(len(table[r][c]) for r in range(n_rows))
        for c in range(n_cols)
    ]

    for r in range(n_rows):
        for c in range(n_cols):
            item = table[r][c]
            if c == n_cols - 1:
                # do not print space after last item
                print(item)
            else:
                # only print newline after last item
                # EDIT: found a bug here after commenting...
                # width = int(((max_width[c] + 1) / tab_width) * tab_width)
                width = ((max_width[c] // tab_width) + 1) * tab_width

                print(f'{item:<{width}}', end='')
Even while writing this code for the comment for the hundredth time, I've had to fix at least 5 bugs! Truly horrible.


I've literally added Pandas to some of my projects just so I could use a DataFrame to print a nicely formatted table instead of writing code like this.

Surely there's a library out there to do this job, it seems like such a common use-case. I'm surprised it's not in the standard library to be honest!
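(For what it's worth, a third-party package like tabulate, which is on PyPI rather than in the stdlib, will do the whole job in one call - a minimal sketch, using the same table as above:)

    # pip install tabulate
    from tabulate import tabulate

    table = [
        ['agony', 'kick', 'pump'],
        ['frown', 'lonely', 'mutation'],
        ['sail', 'tasty', 'want'],
    ]

    # tablefmt="plain" gives space-aligned columns with no borders or header rule
    print(tabulate(table, tablefmt="plain"))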



You could probably do something a little simpler:
  widths = [len(max(col, key=len)) for col in zip(*table)]

  print("\n".join(" | ".join([f"{r:<{w}}" for r, w in zip(row, widths)]) for row in table))

  # Gives:

  # agony | kick   | pump    
  # frown | lonely | mutation
  # sail  | tasty  | want


Ha, clever. It didn't occur to me that I could use zip for extracting columns from a table.

I personally prefer readable to compact, so this is what I ended up with, inspired by your version:

    tab_width = 4
    columns = zip(*table)
    column_widths = [
        max(len(item) for item in column)
        for column in columns
    ]
    column_indents = [
        tab_width * ((width // tab_width) + 1)
        for width in column_widths
    ]

    for row in table:
        items = [
            f'{item:<{indent}}'
            for item, indent in zip(row, column_indents)
        ]
        print(''.join(items).rstrip())


Hah nice - yes I had it on three lines, which breaks out the logic a little better, but I thought it'd be good to just have a couple of lines to paste in and it would work. Someone elsewhere said they import pandas just to print a table! Surely overkill.



I answered a question once with a rather simple O(n) solution:

https://stackoverflow.com/questions/10865483/print-results-i...

    sql = "SELECT * FROM someTable"
    cursor.execute(sql)
    conn.commit()
    results = cursor.fetchall()

    widths = []
    columns = []
    tavnit = '|'
    separator = '+' 

    for cd in cursor.description:
        widths.append(max(cd[2], len(cd[0])))
        columns.append(cd[0])

    for w in widths:
        tavnit += " %-"+"%ss |" % (w,)
        separator += '-'*w + '--+'

    print(separator)
    print(tavnit % tuple(columns))
    print(separator)
    for row in results:
        print(tavnit % row)
    print(separator)
Am I missing anything, are there any glaring bugs? I don't see this as a particularly difficult problem.


I had a go at doing this in Lil, with the simplification of using spaces for alignment instead of tabs, since tab-formatting can be very brittle:
    on format_tab x do
     w:max each r in x count @ r end
     f:(list "%%-%is ")format -1 drop w
     ("\n",""fuse f,"%s")format x
    end
The basic idea is to assemble a format string which right-pads all but the last column, and then use that format string uniformly across all rows.
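The same approach sketched in Python, for readers who don't speak Lil (a rough equivalent of the idea, not a line-by-line translation):

    # build one %-style format string that right-pads every column except the
    # last, then apply it uniformly to every row
    widths = [max(len(item) for item in col) for col in zip(*table)]
    fmt = ''.join(f'%-{w}s ' for w in widths[:-1]) + '%s'
    print('\n'.join(fmt % tuple(row) for row in table))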

And then in K for comparison (somewhat clumsily):

    ft:{,/"\n"/((1+|/#:''t)$/:t:-1_'x),'-1#'x}
Much simpler still if we permit trailing whitespace padding:
    ft:{,/"\n"/(1+|/#:''x)$/:x}


I’m at the other end of things: I often have to parse column-aligned data. Sounds easy, except that values can contain spaces, are padded with spaces, are sometimes misaligned and sometimes overflow their column.

Maybe you and I can make a pact here and now to just not column-align data, but rather use some simpler human-readable format? Win-win?



> Maybe you and I can make a pact here and now to just not column-align data, but rather use some simpler human-readable format? Win-win?

To be fair, most of scripts that print tabular data for human reading also contain "--json-out" flag.

Maybe I can just use YAML as a compromise? :)



Have you ever used the Octopus Deploy command line tool? It says on every command “oh you can use -f json” except that almost none of its commands actually implement it and you get the human readable output which you then have to sed/awk/grep your way around in…



I meant to write "most of my scripts". My mistake.

> Have you ever used the Octopus Deploy command line tool?

Thankfully, no :) but I've had similar experience with other pieces of software. Some even provide ability to output JSON, but you need to find the right incantation to do so (looking at you, Docker CLI, and your --format="{{json .}}"!)



Oh man, seriously! It's way harder than it seems like it would be. It can also be tempting to make use of tabs, but IME that usually makes it worse because you will hit edge cases that mess it up, so then you have to start tracking different behavior. It's a hell of a trek.



How about this?
  max_width = max(len(x) for y in table for x in y)

  items = []

  for row in table:
      rowstr = "".join(
          el + " " * (max_width - len(el)) for el in row
      )
      items.append(rowstr)
      # or, to print inline
      #for el in row:
      #    print(el, end = " " * (max_width - len(el)))
      #print()

  print("\n".join(items))


    from itertools import zip_longest

    tab_width = 4

    col_max_widths = [(v := max(map(len, a))) + tab_width - (v % tab_width)
                        for a in zip_longest(*table, fillvalue='')]

    for row in table:
        print(''.join(c.ljust(cw) for cw, c in zip(col_max_widths, row)))


Isn't zip_longest only useful for zipping sequences of different lengths? What exactly is it doing here that couldn't be done with plain zip?

Also that code (IMHO, of course) still gives me a headache when I try to mentally parse it.



Regular zip works, I just don't like the worst case: if one row has fewer columns, the entire column will be quietly dropped, since builtin zip works as zip_shortest rather than raising an exception. I never use builtin zip on input data.
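A tiny demonstration of the difference, with a deliberately ragged table:

    from itertools import zip_longest

    ragged = [['a', 'b', 'c'],
              ['d', 'e']]  # second row is missing its last column

    print(list(zip(*ragged)))
    # [('a', 'd'), ('b', 'e')]            -- the 'c' column is silently dropped
    print(list(zip_longest(*ragged, fillvalue='')))
    # [('a', 'd'), ('b', 'e'), ('c', '')] -- missing cell padded instead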



If you want to launch another process, might as well launch ps.

The challenge here is how to get ps in an environment where the command won't run. And in that case, awk won't run either.

As for why it might happen, well just save the following shell script and run it on a Linux system.

    #! /bin/sh
    $0 &
    perl -e 'push @big, 1 while 1' &
    $0
Now that your system is struggling, figure out how to rescue it.

(True story. I once worked with a careless programmer who would make mistakes whose results looked like that fairly regularly. It was...an education.)



> An interview question for a position that requires knowledge of bash/linux/stuff could be:

> What if you're ssh'd into a machine, you're in your trusty bash shell, but unfortunately you cannot spawn any new processes because literally all other pids are taken. What do you do?

I'd look in the /proc/[pid]/ filesystem for visibility into what processes are exhausting the PID space.

`kill` is a shell builtin in bash, you don't have to rely on forking a new process like /bin/kill. If you can find out the parent process whose children are exhausting PIDs you're well on your way to stopping it and getting a handle on things again.

And I'll be darned, this script parses /proc. No | pipes or $( .. ) substitutions that would need to spawn another bash subshell process either. Pretty clean.



My answer in an interview was “exec Python”. Then you can call all the posix functions you need without launching separate commands.

This went over quite well.
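Something along these lines, as a rough sketch of what you get once you've exec'd the interpreter (the offending PID is left hypothetical):

    import os, signal

    # enumerate processes without forking: /proc is just a directory listing
    pids = [int(d) for d in os.listdir('/proc') if d.isdigit()]
    print(len(pids), 'processes')

    # read a process's command line straight out of procfs
    with open(f'/proc/{pids[0]}/cmdline', 'rb') as f:
        print(f.read().replace(b'\0', b' ').decode(errors='replace'))

    # and signal it directly via the kill(2) syscall, no /bin/kill required
    # os.kill(offending_pid, signal.SIGTERM)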



It's funny, because at university, you would be assessed (perhaps) on such a question, and you would not be allowed to use these things! And yet, in "real life", this is exactly how you'd go about accomplishing the task.



Heh! But for real, though. Then you have a repl with access to all the functions in the os module. You can glob files to iterate over /proc. You can send signals. You can open network connections. As far as emergency shells go, you could do far, far worse.

Edit: also, all valid JSON is valid Python. Do not `eval(input_data)` in prod or I will haunt you. But, in an emergency…



I mean realistically speaking: if I can do `foo = `, check `type(foo)`, and output foo again to double-check what the REPL thinks foo contains, then I'm pretty safe to `eval(foo)`.

Sure, you could fake it with custom objects and all of that, but not when I'm pasting a string value into a REPL. If you had hijacked my workstation, shell or the remote python to the point you can exploit that... Yeah. I don't think you'd need me as a user then anymore.



> Here is what people have been saying about ctypes.sh:

"that's disgusting"

"this has got to stop"

"you've gone too far with this"

"is this a joke?"

"I never knew the c could stand for Cthulhu."



I'd probably just reboot the machine, honestly. You'll be back up and running faster than spending time in a hobbled environment hunting down and killing the parent processes. And if you're out of PIDs probably a lot of other things are in a bad state. Just start clean.



> I'd look in the /proc/[pid]/ filesystem for visibility into what processes are exhausting the PID space.

From the source code:

    # so initially i was hoping you could get everything from /proc//status
    # because it's easy to parse (in most cases) but apparently you can't get
    # things like the cpu% :(


You can calculate a CPU% from the tick information (utime, stime, and friends) in /proc/[pid]/stat. I've done it once in a script, after spending considerable time reading the proc manual.
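Roughly like this, if I remember it right - sample utime+stime twice and divide the delta by elapsed wall-clock time (the function names here are just for illustration):

    import os, time

    def cpu_ticks(pid):
        with open(f'/proc/{pid}/stat') as f:
            stat = f.read()
        # split after the closing ')' so a comm field containing spaces can't
        # shift things; utime and stime are then the 12th and 13th fields
        fields = stat.rsplit(')', 1)[1].split()
        return int(fields[11]) + int(fields[12])

    def cpu_percent(pid, interval=1.0):
        hz = os.sysconf('SC_CLK_TCK')  # clock ticks per second
        before = cpu_ticks(pid)
        time.sleep(interval)
        after = cpu_ticks(pid)
        return 100.0 * (after - before) / hz / interval

    print(cpu_percent(os.getpid()))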



Specifically, the issue here was that the information is scattered between `/proc/[pid]/stat{,us}`, and for some of it you have to look in `/proc` itself, for things like the major-number-to-driver mapping (for figuring out which TTY something is running on).

Realistically you can get a useful `ps` by catting/grepping `/proc/[pid]/status` for all the processes, but the goal here was to replicate exactly the output of procps `ps aux`, except for the bugs in column alignment, which she fixed intentionally.
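For the curious, /proc/[pid]/status really is just key/value text, so a useful subset falls out of a few lines (a sketch with a made-up helper name, not what the script itself does):

    def read_status(pid):
        # /proc/<pid>/status is "Key:<tab>value" lines, trivially parseable
        info = {}
        with open(f'/proc/{pid}/status') as f:
            for line in f:
                key, _, value = line.partition(':')
                info[key] = value.strip()
        return info

    st = read_status(1)
    print(st['Name'], st['State'], st.get('VmRSS', 'n/a'))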



'echo' is a shell builtin. argv[] length restrictions only apply to exec. it's the same reason the script works, which uses more or less the same technique, only in a 'for' loop, which is also a builtin.

even if it were an issue.. say on a terminal without working scrollback.. you can just as easily:

    echo 1*
and so forth.


echo 1*; echo 2*; ...

Break it into tenths (ninths, maybe, with no leading zeroes?), or finer granularity if necessary.

The argument list isn't nearly as constrained as it was a decade ago. "echo {00000001..10000000}" works in bash on most modern distros where shells on earlier systems would have choked on a tiny ARG_MAX.



sudo forks at least once (bash spawns /usr/bin/sudo), but also will fork to execute the command if logging is enabled (see the manual page for sudo(8)).

you can `exec sudo` but this will hose you if it tries to fork (because now you've lost your bash).



If you're out of pids, you can't ssh back in (though this raises the question of how you ssh'd in in the first place). And hopefully you have root ssh logins disabled.

But I think a prerequisite is that you already have a root shell; some systems don't allow accessing all of /proc unless you're root, and if you figure out what process is exhausting all your pids and want to kill it, you probably need to be root to do that, unless you're very lucky and that process happens to be running under your regular user account.

At any rate, you'd need to `exec restart now`, because just `restart now` would try to fork. (Also, there's no `restart` command; I think you meant `reboot`, and it doesn't need arguments. `shutdown -r now` would also do it.)



Would exiting the ssh session not free up the pid again? Also yes, I meant `reboot` not `restart`, and I always forget it's only shutdown that needs the `now`, not reboot



Re sub processes, genuinely curious, how do
   [[ $cmdline ]] && exec {cmdline}>&-
and
  exec {cmdline}< "$dir"/cmdline || continue
work?


This is actually in the POSIX standard for the shell.

"The redirection operator:

  [n]>&word

"shall duplicate one output file descriptor from another, or shall close one. If word evaluates to one or more digits, the file descriptor denoted by n, or standard output if n is not specified, shall be made to be a copy of the file descriptor denoted by word; if the digits in word do not represent a file descriptor already open for output, a redirection error shall result; see Consequences of Shell Errors. If word evaluates to '-', file descriptor n, or standard output if n is not specified, is closed. Attempts to close a file descriptor that is not open shall not constitute an error. If word evaluates to something else, the behavior is unspecified."

https://pubs.opengroup.org/onlinepubs/9699919799/utilities/V...



[[ is a keyword, and exec is a builtin. With the {name}< syntax, exec is opening a file descriptor and assigning its numerical value to $name, and {name}>&- closes it.
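Under the hood those redirections are just open(2)/close(2) on a numbered descriptor; a rough Python analogue of the same pair of operations (the path is only illustrative):

    import os

    # {cmdline}< "$dir"/cmdline : open the file and keep the raw fd number
    fd = os.open('/proc/1/cmdline', os.O_RDONLY)
    args = os.read(fd, 4096).split(b'\0')

    # {cmdline}>&- : close that descriptor again
    os.close(fd)

    print(fd, args)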



Back in 2011, I interviewed at a large-ish tech company from the US for an "SRE" role (and never had I heard that term before) that, amongst other things, created an online, browser-based alternative to MS Office. Some rounds in, after the usual phone screening, the task was some sort of supervised programming in a $thatcompany Docs document while talking to the interviewer on speaker phone. Since I had rated "shell scripting" and "Linux" fairly highly in my mandatory self-assessment sheet that I had turned in weeks prior, I was tasked with conjuring a `netstat` replacement in bash.

I quickly realized I couldn't do it (because I did not, at the time, know where and how exactly socket information was held in /proc/), but I offered to write trimmed-down replacements of `ps` and `fuser` instead. The interviewer deemed that - and my solutions eventually produced in that wretched, browser-based word processor - acceptable, and a few weeks later, I shipped out to the on-site interview series.

Now I wonder if the hypothetical scenario presented as the motivation for this exercise is more grounded in reality than Izabera (thanks for all your help in #bash over the years btw!) would care to admit... ;)



Necessarily, as this is the kind of API (a file system with a well-defined structure) that the kernel exposes for querying/providing that kind of information :)



I've never dug too far beyond procfs on Linux, but always assumed there was a more formal C API backing most of these things up like with sysctl (or in general on the BSDs). Relying so heavily on string parsing seems great for quick shell scripts but less great for other languages.



> What if you're ssh'd into a machine, you're in your trusty bash shell, but unfortunately you cannot spawn any new processes because literally all other pids are taken. What do you do?

A long ago for fun I created an interactive website to explore this type of scenario. https://oops.cmdchallenge.com



This is cool, but once I beat it, it just brought me back to the first step and wouldn't let me see the others. That's frustrating because I wanted to check out the "View Solutions" lists for the other steps to see what other approaches I could have taken.



Izabera is one of the gurus in #bash@libera (formerly freenode).

I love all those gurus. They've taught me so much over the past decade.



That's some pretty clean bash. In my experience most bash code is badly written and inefficient, but this is a good example of code that isn't.



> What if you're ssh'd into a machine, you're in your trusty bash shell, but unfortunately you cannot spawn any new processes because literally all other pids are taken. What do you do?

But what if I'm instead in my trusty POSIX shell without bash support? The bash script is not POSIX compliant :(



This does not work in bash 3.2, but it is functional on bash 4.2.
  $ ./psaux.bash
  ./psaux.bash: line 182: printf: `(': invalid format character
  ./psaux.bash: line 185: printf: `(': invalid format character
  $ rpm -q bash
  bash-3.2-33.el5_11.4.0.1