This blog post recounts a recent debugging session that uncovered a surprising set of issues involving Bash, getcwd(), and OverlayFS.
What began as a simple customer bug report evolved into a deep dive worth sharing.
Initial Bug Report
A customer reported that OpenSSH scp failed after switching to OverlayFS. We found the following error in the logs:
shell-init: error retrieving current directory: \
getcwd: cannot access parent directories: Inappropriate ioctl for device
After analyzing the report, we realized the message didn’t come from scp itself but from the Bash shell.
We asked the key question: why couldn’t Bash determine the current working directory, and why did it fail with ENOTTY (Inappropriate ioctl for device)?
Ruling Out the Kernel
Because the issue appeared after the introduction of OverlayFS, we reviewed the OverlayFS source code in the Linux kernel for any code paths that return ENOTTY.
Although such paths exist, we considered hitting them highly unlikely.
Bash is written in C and uses glibc.
We examined the glibc system call wrapper for getcwd() but found no logic that could return ENOTTY.
The wrapper mainly handles buffer allocation and falls back to a generic implementation if the system call fails.
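For context, the two usual ways to call getcwd() through glibc look roughly like this (a minimal usage sketch, not glibc’s internal code); passing a NULL buffer and size 0 relies on the glibc extension that allocates the buffer for the caller:
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    /* POSIX-style call: the caller supplies the buffer. */
    char buf[PATH_MAX];
    if (getcwd(buf, sizeof(buf)) != NULL)
        printf("cwd: %s\n", buf);
    else
        perror("getcwd");

    /* glibc extension: with a NULL buffer, getcwd() allocates one. */
    char *cwd = getcwd(NULL, 0);
    if (cwd != NULL) {
        printf("cwd: %s\n", cwd);
        free(cwd);
    }
    return 0;
}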
To test whether the system call itself was failing, we enabled system call tracing.
Surprisingly, the trace revealed that the getcwd() system call was never invoked.
Since glibc offers multiple getcwd() implementations depending on the system, we double-checked that we had reviewed the correct Linux-specific one.
We found no code path that bypassed the system call.
Bash’s Home-Made getcwd()
A hunch led us to check how Bash links to the getcwd symbol:
$ nm -D bash | grep getcwd
...
000c7b10 T getcwd
...
This showed that Bash includes its own getcwd() function rather than relying on glibc’s version.
We expected this output instead:
$ nm -D bash | grep getcwd
...
U getcwd
...
Surprised, we inspected the Bash source and confirmed that it does contain a getcwd() implementation, but one guarded by the following condition:
#if !defined (HAVE_GETCWD)
Developers originally intended this fallback for ancient Unix systems lacking the getcwd() system call.
On Linux, HAVE_GETCWD should normally be defined, and we confirmed in config.h that it was.
At first, this puzzled us: under these conditions, the fallback implementation should never be compiled.
But further inspection of config-bot.h revealed this logic:
#if defined (HAVE_GETCWD) && defined (GETCWD_BROKEN) && !defined (SOLARIS)
# undef HAVE_GETCWD
#endif
Sure enough, our config.h defined GETCWD_BROKEN.
That explained why Bash used its internal fallback.
But why was getcwd() considered broken in the first place?
Cross-Compilation Confusion
We examined the output of the configure script in detail to trace the origin of GETCWD_BROKEN.
We found this line:
checking if getcwd() will dynamically allocate memory with 0 size... \
configure: WARNING: cannot check whether getcwd allocates memory when cross-compiling \
-- defaulting to no
The check in aclocal.m4 sets GETCWD_BROKEN if it can’t confirm that getcwd() allocates memory with a zero-size buffer.
Since the build occurred in a cross-compilation environment, the test defaulted to failure.
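The result cannot be determined because this is a run-time feature test: configure has to compile a small program, execute it, and look at its exit status, which it cannot do when the binary is built for a different target architecture. Conceptually, the probe looks something like this (a simplified sketch under that assumption, not Bash’s actual test program):
#include <stdlib.h>
#include <unistd.h>

/* Exit with 0 if getcwd() allocates a buffer when asked to,
 * non-zero otherwise. Configure records the exit status, which
 * is unavailable when the test binary cannot run on the build host. */
int main(void)
{
    char *cwd = getcwd(NULL, 0);

    if (cwd == NULL)
        return 1;
    free(cwd);
    return 0;
}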
In other words, Bash’s configure script cannot evaluate this check when cross-compiling and conservatively marks getcwd() as broken. Since Bash is cross-compiled for ARM in this specific setup, this made sense. We then wondered why the issue wasn’t more widespread; after all, both Bash and OverlayFS are common in embedded systems.
Next, we looked into how major embedded Linux projects like Yocto handle cross-compiling Bash.
Although the Bash Yocto recipe didn’t mention getcwd, we found this line in meta/site/common-glibc:
bash_cv_getcwd_malloc=${bash_cv_getcwd_malloc=yes}
Yocto explicitly overrides the test result to avoid the fallback. The embedded Linux build system we used didn’t apply such a workaround, which explained the difference. After we implemented a similar override, the issue vanished.
Root Cause Analysis
At this point, we had identified and fixed the bug. But several questions remained:
- Why did the issue appear only with OverlayFS?
- Why did Bash’s fallback getcwd() fail?
During testing, we observed another error message:
shell-init: error retrieving current directory: \
getcwd: cannot access parent directories: Success
This indicated that errno was sometimes set to 0, suggesting no error occurred, yet getcwd() still failed.
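The odd “Success” suffix is simply what glibc prints for an errno of 0, as a quick illustration shows:
#include <errno.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    errno = 0;
    /* With errno == 0, glibc reports "Success". */
    printf("%s\n", strerror(errno));  /* prints: Success */
    perror("getcwd");                 /* prints: getcwd: Success */
    return 0;
}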
OverlayFS and Inode Numbers
To answer the remaining questions, we analyzed Bash’s getcwd() implementation.
On Linux, you can determine the current working directory in two ways:
- Use the getcwd() system call
- Read the /proc/self/cwd symlink
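The first approach is the plain getcwd() library call; the second resolves the /proc/self/cwd symlink with readlink(), roughly like this (a sketch assuming Linux with procfs mounted):
#include <limits.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char path[PATH_MAX];
    /* readlink() does not NUL-terminate, so reserve one byte. */
    ssize_t len = readlink("/proc/self/cwd", path, sizeof(path) - 1);

    if (len < 0) {
        perror("readlink");
        return 1;
    }
    path[len] = '\0';
    printf("cwd: %s\n", path);
    return 0;
}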
Bash’s implementation used neither, aiming to support systems lacking these features. In fact, the fallback dates back to the last millennium. It used a classic Unix algorithm to reconstruct the working directory path:
- It calls stat(".") to obtain the inode number of the current directory.
- It calls readdir("..") to read the parent directory’s entries.
- It compares inode numbers to identify the name of "." within the parent.
It repeats this process recursively to climb the full path.
Note that this simplified description omits many details.
In practice, you must compare both the inode number (st_ino) and the device number (st_dev) to work correctly across mount points.
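To make the algorithm concrete, here is a condensed sketch of a single step of that climb, finding the current directory’s name inside its parent (our own illustration, not Bash’s code; the real fallback also assembles the full path and handles many more corner cases):
#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Find the name of the current directory within its parent by
 * matching inode (and, across mount points, device) numbers. */
static int name_in_parent(char *name, size_t size)
{
    struct stat self, parent, st;
    struct dirent *entry;
    DIR *dir;
    int found = 0;

    if (stat(".", &self) < 0 || stat("..", &parent) < 0)
        return -1;
    if ((dir = opendir("..")) == NULL)
        return -1;

    while ((entry = readdir(dir)) != NULL) {
        if (self.st_dev == parent.st_dev) {
            /* Same filesystem: trust the d_ino from readdir().
             * This is exactly the assumption OverlayFS breaks. */
            if (entry->d_ino != self.st_ino)
                continue;
        } else {
            /* Crossing a mount point: stat() the entry and
             * compare both device and inode numbers. */
            char path[PATH_MAX];
            snprintf(path, sizeof(path), "../%s", entry->d_name);
            if (stat(path, &st) < 0)
                continue;
            if (st.st_dev != self.st_dev || st.st_ino != self.st_ino)
                continue;
        }
        snprintf(name, size, "%s", entry->d_name);
        found = 1;
        break;
    }
    closedir(dir);
    return found ? 0 : -1;
}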
Tracing revealed that the fallback getcwd() failed on the very first path component: stat(".") returned an inode number N, but readdir("..") returned no matching entry with inode number N.
OverlayFS merges two directories, a lower (read-only) and an upper (writable) layer.
When calling readdir() on a directory, OverlayFS combines entries from both layers without performing full lookups. It returns the underlying inode numbers directly, unmodified.
This design means that inode numbers from readdir() don’t guarantee uniqueness or stability in the merged view.
Two entries might even share an inode number without being hard links.
OverlayFS uses this approach to provide fast directory listings; performing a full lookup for each entry would incur a performance penalty.
Conversely, stat() triggers a full lookup: OverlayFS allocates an inode object that provides stable and unique inode numbers.
That stability is crucial for tools like find or du.
Bash’s fallback getcwd() assumes that the inode number from stat() matches one returned by readdir().
OverlayFS breaks that assumption.
We eventually realized that the OverlayFS documentation acknowledges this limitation: for directories, the inode number from readdir() may not match the number from stat().
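The mismatch is easy to observe with a small checker that lists a directory and prints every entry whose d_ino from readdir() differs from the st_ino that stat() reports for the same name (our own diagnostic, not part of Bash); on an OverlayFS mount without xino, merged directories can show such differences:
#include <dirent.h>
#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char *argv[])
{
    const char *dirname = argc > 1 ? argv[1] : ".";
    DIR *dir = opendir(dirname);
    struct dirent *entry;

    if (dir == NULL) {
        perror("opendir");
        return 1;
    }
    while ((entry = readdir(dir)) != NULL) {
        char path[4096];
        struct stat st;

        snprintf(path, sizeof(path), "%s/%s", dirname, entry->d_name);
        if (lstat(path, &st) < 0)        /* lstat(): don't follow symlinks */
            continue;
        /* Report entries where the two inode numbers disagree. */
        if (st.st_ino != entry->d_ino)
            printf("%s: readdir=%llu stat=%llu\n", entry->d_name,
                   (unsigned long long)entry->d_ino,
                   (unsigned long long)st.st_ino);
    }
    closedir(dir);
    return 0;
}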
The Role of the xino Feature
OverlayFS can deliver stable inode numbers via readdir() when the xino feature is active.
On 64-bit systems, OverlayFS can encode extra data (e.g., instance numbers) into the inode fields to prevent collisions.
This works without requiring a full lookup and does not hurt readdir() performance.
However, 32-bit systems lack this space, so the xino feature is not available there.
We encountered the original problem on a 32-bit ARM platform, which explained why the issue occurred there.
Incorrect Use of readdir() in Bash
One question remained: why did getcwd() sometimes fail with ENOTTY?
Upon inspecting Bash’s getcwd(), we noticed that it misused readdir() slightly:
- readdir() returns NULL both on end of directory (EOF) and on error.
- To distinguish between an error condition and the end of the directory list, the caller must set errno to zero before calling readdir().
- If readdir() returns NULL and errno == 0, it means EOF.
- Bash forgot to reset errno before the call. For about 30 years, no one noticed.
As a result, when readdir() returned NULL without a match, Bash incorrectly assumed an error.
It returned NULL and left errno in an undefined state.
Sometimes, ENOTTY from a previous system call remained, producing misleading errors.
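The fix is conceptually simple: clear errno before calling readdir() and treat NULL as an error only when errno is non-zero afterwards. A sketch of the correct pattern (not the actual Bash patch):
#include <dirent.h>
#include <errno.h>
#include <stdio.h>

/* Correct readdir() loop: NULL alone is not an error;
 * only NULL with errno != 0 signals a real failure. */
static int list_dir(const char *path)
{
    DIR *dir = opendir(path);
    struct dirent *entry;

    if (dir == NULL)
        return -1;

    errno = 0;                        /* reset before readdir() */
    while ((entry = readdir(dir)) != NULL) {
        printf("%s\n", entry->d_name);
        errno = 0;                    /* reset before the next call */
    }

    if (errno != 0) {                 /* NULL and errno set: error */
        perror("readdir");
        closedir(dir);
        return -1;
    }

    closedir(dir);                    /* NULL and errno == 0: EOF */
    return 0;
}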
We have reported the issue to the GNU Bash project. Once the bug report becomes publicly visible, it will be linked here.
Conclusion
This bug hunt revealed several contributing factors:
- A misconfigured cross-compilation environment caused Bash to use its fallback getcwd().
- OverlayFS introduced subtle inode behavior differences, especially on 32-bit systems.
- Bash’s fallback getcwd() relied on assumptions that failed with OverlayFS.
- A decades-old oversight in Bash’s error handling created misleading errno values.
While we resolved the issue with a simple build tweak, the investigation highlighted deeper lessons about portability assumptions, legacy code, and filesystem complexity.