Release Notes (v2.5.0)
May 18, 2024
The v2.5.0 release includes major new features along with numerous fixes and improvements.
New command-line program name.
Task update command capabilities.
Shell completions.
UNIX signal interrupts for graceful shutdowns.
Configuration management improvements.
Tagging system improvements.
Features
What’s in a name?
The original hyper-shell name is still valid and will remain available for backwards
compatibility. But going forward we are dropping the hyphen everywhere, preferring simply
hypershell in writing and hs at the command-line.
But why?
At the outset of the project we liked the hyphen for the program name at the command-line; it was aesthetic and distinct. But the hyphen cannot go everywhere we need it to. It cannot go in the environment variables, it cannot go in the package name, and it feels out of place on the file system. It remains a point of confusion for users who are unsure how to refer to the project, so we’re deprecating the hyphen everywhere.
Furthermore, it has been our experience that hyper-shell is simply a lot of characters to type
when developing workflows interactively. It has the benefit of uniqueness on any given system, and
it remains a good practice to use in scripts in the same way long-form options are. Conversely, just
as it is more ergonomic to use short-form, single-letter options at the command-line, even stacked,
so too is the ability to invoke programs with fewer characters.
New capabilities for task management
Previously the hs task update command would only allow you to specify a single task
UUID, a field to change, and the new value.
In Issue #27 we consider the much-needed capability to cancel a task. But what about other
sorts of actions (singular and bulk) such as deletion, reversion, or even updating tags en masse?
Instead of creating many additional subcommands, we have extended the original
update command to allow for arbitrarily many field and tag modifications
in a single call. As well, we’ve added --cancel, --revert, and --delete
as special cases that apply managed changes.
We’ve ported over the same machinery that underlies the hs task search command
to allow updates to target tasks based on search. In the following example, we
modify the exit_status of all tasks whose arguments end in this file pattern.
Update fields and tags with positional arguments
hs task update exit_status=-10 -w 'args ~ .20230627.gz$'
These sorts of updates are sent to the database in one transaction. We can actually
apply updates based on relative position as well using --order-by and --limit;
these changes however require us to pull task IDs and apply changes iteratively.
In the following example, we apply a special tag to the most recently submitted task.
Apply tag to most recently submitted task
hs task update mark:false --order-by submit_time --desc --limit 1
Though not shown in these snippets, the update command first submits a count query to check how many tasks are affected. It will prompt the user with this information and ask for confirmation before applying the change.
To force the update to proceed without user intervention, use --no-confirm.
In the following example we cancel all tasks that meet our criteria.
Cancel remaining tasks without confirmation
hs task update --cancel --remaining --no-confirm
The --remaining, --completed, --failed, and --succeeded options
expand in the same way as with hs task search, based on the tasks’ exit_status.
The --cancel option implies setting schedule_time to now and exit_status
to -1. A task’s schedule_time remains null until the scheduler thread on the server
selects it; thus setting it to now means the task cannot be selected.
The --revert option applies the same field changes as when the server starts
and identifies orphaned tasks. Basically, the task retains its ID, submit details, and tags.
It will be as if the task had never been scheduled.
The --delete option physically removes task records from the database.
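The count, confirm, and cancel steps described above can be sketched in a few lines of Python against SQLite. This is only an illustration under stated assumptions, not HyperShell's implementation: the task table is a hypothetical, simplified schema, the function name cancel_remaining is made up, and "remaining" is assumed to mean that a task has no exit_status yet.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical, simplified task table; not HyperShell's actual schema.
SCHEMA = """
CREATE TABLE task (
    id            INTEGER PRIMARY KEY,
    args          TEXT NOT NULL,
    schedule_time TEXT,
    exit_status   INTEGER
)
"""

# Assumption: "remaining" means the task has no exit_status yet.
REMAINING = "exit_status IS NULL"


def cancel_remaining(conn, confirm=lambda count: True):
    """Count affected tasks, ask for confirmation, then cancel in one transaction."""
    (count,) = conn.execute(f"SELECT COUNT(*) FROM task WHERE {REMAINING}").fetchone()
    if count == 0 or not confirm(count):
        return 0
    now = datetime.now(timezone.utc).isoformat()
    with conn:  # commits on success, rolls back on error
        conn.execute(
            f"UPDATE task SET schedule_time = ?, exit_status = -1 WHERE {REMAINING}",
            (now,),
        )
    return count


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.executemany(
    "INSERT INTO task (args, exit_status) VALUES (?, ?)",
    [("echo a", 0), ("echo b", None), ("echo c", None)],
)
conn.commit()

# The default confirm callback always agrees, similar in spirit to --no-confirm.
changed = cancel_remaining(conn)
print(f"cancelled {changed} task(s)")
```

Passing a confirm callback that prompts the user would mirror the interactive behavior; returning False from it leaves the table untouched.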
Shell completions!
If you use BASH or ZSH you can now autocomplete subcommands, options, and arguments! (Sorry CSH/TCSH and PowerShell users, nothing for these yet.)
We’ll try to convey some of the cool completions here, but you’ll have to see for yourself.
At the command-line, press <TAB> (once, or twice depending on your shell) to trigger
completions.
Configuration: When using hs config get or hs config set, you get not only
standard, static option completion; the positional arguments also complete with the application
parameters and valid options. The shell completion function introspects your current configuration
and offers these. Further, when setting values, some options are pre-populated with valid
enumerations (e.g., logging.level). Notably, console.theme completes with all
of the valid theme names.
Search: When invoking hs task search or hs task update, the positional arguments
represent task fields, which are completed for you. Beyond this, when filtering on tags
with -t or --with-tag, it first completes with all valid, existing, distinct tag keys.
If you follow that key with a : character, it completes with all existing, distinct
values for that particular key. This is all run on the database side and unless you have
a database with ~10M+ records, should complete in one second or less.
Note
This feature is so useful, you might be interested to poll the database for this
information directly using one of two new options for search:
hs task search --tag-keys or hs task search --tag-values <KEY>.
Server and Client: When invoking the server and client programs there are additional
smart completions. For the client, when completing the --host option, we parse your
known hosts (~/.ssh/known_hosts and /etc/hosts) and offer them. This is particularly
useful in a Linux cluster environment. For the server, create an ad-hoc authkey with
-k by tab completing a 16-digit key generated as a checksum from /dev/urandom.
Note
The completion definition file must be installed to the correct location on your system or sourced in your login profile in order for completions to be enabled.
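To make the two-stage tag completion described under Search more concrete, here is a minimal Python sketch of the matching logic. It is an in-memory stand-in and not the shipped completion function, which runs its queries on the database side; the function name complete_tag and the sample tag data are hypothetical.

```python
def complete_tag(word, tags):
    """Return completion candidates for a partially typed KEY or KEY:VALUE."""
    if ":" not in word:
        # Stage one: offer distinct tag keys matching what has been typed.
        keys = sorted({key for tag in tags for key in tag})
        return [key for key in keys if key.startswith(word)]
    # Stage two: after the colon, offer distinct values for that key only.
    key, _, partial = word.partition(":")
    values = sorted({str(tag[key]) for tag in tags if key in tag})
    return [f"{key}:{value}" for value in values if value.startswith(partial)]


# Hypothetical tag sets, one dict per task.
TAGS = [{"site": "a", "group": "x"}, {"site": "b"}, {"site": "a"}]

print(complete_tag("s", TAGS))      # ['site']
print(complete_tag("site:", TAGS))  # ['site:a', 'site:b']
```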
UNIX signal interrupts
HyperShell has the capacity to heal from clients going missing. We’ve had heartbeats implemented for a long time. The client timeout feature allows for dynamic clusters to automatically scale down when task pressure is low. Unfortunately however, up until now we did not have the ability to choose to scale down because of external factors. An example of this in the context of typical HPC environments is the finite lifetime of job allocations. Imagine the database and server running externally in a persistent fashion and clients popping into existence on a cluster (using a scheduler like Slurm). In this environment, jobs can run up against their walltime limit in a matter of hours depending on the configuration. This would be a known scenario; and an unfortunate waste of resources to allow tasks to begin execution knowing the client will be unceremoniously killed by the scheduler, causing the eviction process to unfold and the orphaned tasks to get reverted and rescheduled.
Wouldn’t it be nice if you had some kind of hook into the system that would send your program a signal that it is nearing a cliff and should drain tasks and shut down as soon as possible? Thankfully, most modern HPC schedulers do indeed offer this feature. And now we have added a signal handling facility to HyperShell.
The SIGUSR1 and SIGUSR2 signals are intended for application developers to program
against as fixed, recognized signals. We now use them for both the client and server to indicate
a less catastrophic escalation of shutdown requests.
Sending the SIGUSR1 signal will trigger the schedulers to halt and begin shutdown procedures.
On the client side, this means that all current tasks (and any in the local queue) will be allowed
to complete, but the system will drain and shut down at the completion of these tasks.
Sending the SIGUSR2 signal implies the same, but on the client side will set a flag to send
local interrupts to tasks to bring them down faster. As described in the previous release with regard
to the task.timeout feature, we send SIGINT, SIGTERM, and SIGKILL in an escalating
fashion to halt running tasks.
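The two-level behavior of SIGUSR1 and SIGUSR2 can be pictured with a small Python sketch that uses the standard signal module. The ShutdownState class, its flag names, and the helper function are hypothetical and only illustrate the idea of a drain flag and an interrupt flag; HyperShell's internals may be organized quite differently. The sketch assumes a POSIX system and must run in the main thread.

```python
import signal


class ShutdownState:
    """Flags a client loop could poll between tasks (hypothetical names)."""

    def __init__(self):
        self.drain = False      # stop taking new tasks, finish current ones
        self.interrupt = False  # additionally interrupt running tasks


state = ShutdownState()


def on_usr1(signum, frame):
    # Graceful: let current and locally queued tasks complete, then exit.
    state.drain = True


def on_usr2(signum, frame):
    # Same as SIGUSR1, but also ask for running tasks to be interrupted.
    state.drain = True
    state.interrupt = True


signal.signal(signal.SIGUSR1, on_usr1)
signal.signal(signal.SIGUSR2, on_usr2)


def should_accept_new_tasks():
    """A scheduler loop would stop pulling tasks once draining begins."""
    return not state.drain
```

A batch scheduler, or a plain kill -USR1 <PID> from another terminal, would then flip the drain flag without terminating the process outright.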
With regard to signals, we have also added a user-configurable parameter
-S, --signalwait (or task.signalwait, 10 seconds by default). This is the period
in seconds the client will wait between signal escalations when halting a task.
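The escalation with a wait period between signals might look roughly like the following Python sketch built on subprocess. The halt function and the ESCALATION tuple are hypothetical names used for illustration; only the signal order and the 10 second default come from the text above.

```python
import signal
import subprocess
import sys

# Order of escalation as described above.
ESCALATION = (signal.SIGINT, signal.SIGTERM, signal.SIGKILL)


def halt(proc, signalwait=10.0):
    """Send escalating signals to proc, waiting signalwait seconds between each."""
    sent = []
    for sig in ESCALATION:
        if proc.poll() is not None:
            break  # process already exited
        proc.send_signal(sig)
        sent.append(sig)
        try:
            proc.wait(timeout=signalwait)
            break  # process exited within the wait period
        except subprocess.TimeoutExpired:
            continue  # still running, escalate to the next signal
    proc.wait()  # reap the process; blocks until it has actually exited
    return sent


if __name__ == "__main__":
    # Stand-in for a long-running task.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print([sig.name for sig in halt(child, signalwait=2.0)])
```

A well-behaved task exits on the first signal, so the later, harsher signals are only reached when a task ignores or blocks the earlier ones.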
Configuration management improvements
At the command-line, the hs config commands allow the use of --system or --user as
an option to target either of these locations. We’ve now added --local to all of the commands
and --default on the get command.
The hs config which command now provides much richer output, showing the site from
which an option takes precedence with improved presentation and a comparison to the default value.
A new --site option limits output to only the site information (e.g.,
system, user, local, env, default).
Query for site of configuration parameter
hs config which logging.level
Output
debug (user: /home/user/.hypershell/config.toml | default: warning)
hs config which logging.level --site
Output
user
Further, we’ve added a new HYPERSHELL_CONFIG_FILE environment variable. When set, it disables
system, user, and local configuration files in favor of only the named file. Setting this
variable as empty results in only environment variables being considered. This can be useful in
situations where many instances of the program need to coexist on the same system and incidental
modification of the user-level configuration file might break jobs.
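The three cases for HYPERSHELL_CONFIG_FILE (unset, set to a path, set to empty) can be summarized in a short Python sketch. The function config_files and the example paths are hypothetical; the sketch only restates which files would be read, and the order of the returned list is not meant to express precedence.

```python
ENV_VAR = "HYPERSHELL_CONFIG_FILE"


def config_files(environ, system, user, local):
    """Return the configuration files that would be read for this environment."""
    if ENV_VAR not in environ:
        # Variable unset: the usual system, user, and local files all apply.
        return [system, user, local]
    named = environ[ENV_VAR]
    # Variable set: only the named file applies; empty means no files at all,
    # leaving environment variables as the only source.
    return [named] if named else []


# Hypothetical paths, for illustration only.
SITES = dict(system="/etc/hypershell.toml",
             user="/home/user/.hypershell/config.toml",
             local="./.hypershell/config.toml")

print(config_files({}, **SITES))
print(config_files({ENV_VAR: "/scratch/job42.toml"}, **SITES))
print(config_files({ENV_VAR: ""}, **SITES))
```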
Tagging system improvements
Previously, all tag values were considered text. We have modified the encoding to understand and store any valid JSON value type. Note however that some limitations apply and special handling has been implemented where possible; e.g., SQLite considers true and false to be synonymous with 1 and 0, respectively.
Previously, all task metadata was injected into tasks’ environment variables (e.g., TASK_SUBMIT_HOST).
Tag data was specifically stripped however because its more complex JSON was not amenable to simple
encoding. We now handle it directly and re-inject tags with a TASK_TAG_ prefix. This means
tag data is available at runtime. So a task submitted with --tag site:b would have
TASK_TAG_SITE=b defined at runtime.
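From inside a running task, the injected variables can be collected with a few lines of Python. This is a sketch from the task's point of view; the helper name task_tags is hypothetical, and lowercasing the key is an assumption made here for convenience, since the original case of a tag key cannot be recovered from the environment variable name.

```python
import os

PREFIX = "TASK_TAG_"


def task_tags(environ=None):
    """Collect TASK_TAG_* variables into a dict keyed by lowercased tag name."""
    environ = os.environ if environ is None else environ
    return {
        name[len(PREFIX):].lower(): value
        for name, value in environ.items()
        if name.startswith(PREFIX)
    }


# Simulated environment of a task submitted with --tag site:b
example = {"TASK_TAG_SITE": "b", "TASK_SUBMIT_HOST": "node1", "PATH": "/usr/bin"}
print(task_tags(example))  # {'site': 'b'}
```

Values arrive as strings, so a task that expects a number or boolean would still need to parse them itself.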
Fixes
IntegrityError: duplicate violates unique constraint #29
Under high throughput with many clients connecting and a database involved, the server may evict a client because the eviction policy is set too low (seconds); the client is evicted and its tasks reverted. But if the client in question was only delayed (whether due to the network, throughput, or the size of the database slowing down operations), it will collide with the existing database record when the server re-registers the client upon the next heartbeat message.
In other words, if the server ever evicted a client too aggressively when the client was not actually gone, there was no protection mechanism to avoid the collision.
We fix this by allowing re-registration; though you should avoid this scenario.
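The collision and one way to tolerate it can be reproduced with SQLite in a few lines of Python. This is a sketch of the general idea of re-registration, not the actual fix: the client table, its columns, and the register and evict functions are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified client table; not HyperShell's actual schema.
conn.execute(
    "CREATE TABLE client ("
    "id TEXT PRIMARY KEY, host TEXT NOT NULL, evicted INTEGER NOT NULL DEFAULT 0)"
)


def register(conn, client_id, host):
    """Insert a client record, or revive the existing one on a duplicate ID."""
    try:
        with conn:
            conn.execute("INSERT INTO client (id, host) VALUES (?, ?)", (client_id, host))
        return "registered"
    except sqlite3.IntegrityError:
        # The record already exists (e.g., the client was evicted but is alive).
        with conn:
            conn.execute(
                "UPDATE client SET host = ?, evicted = 0 WHERE id = ?",
                (host, client_id),
            )
        return "re-registered"


def evict(conn, client_id):
    """Mark a client as evicted without removing its record."""
    with conn:
        conn.execute("UPDATE client SET evicted = 1 WHERE id = ?", (client_id,))


first = register(conn, "c1", "node1")
evict(conn, "c1")                       # server gives up on a merely delayed client
second = register(conn, "c1", "node1")  # next heartbeat arrives after all
print(first, second)
```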