Release Notes (v2.5.0)

May 18, 2024

The v2.5.0 release includes major new features along with numerous fixes and improvements.

  • New command-line program name.

  • Task update command capabilities.

  • Shell completions.

  • UNIX signal interrupts for graceful shutdowns.

  • Configuration management improvements.

  • Tagging system improvements.


Features


What’s in a name?

The original hyper-shell name is still valid and will remain available for backwards compatibility. Going forward, however, we are dropping the hyphen everywhere, preferring simply hypershell in writing and hs at the command line.

But why?

At the outset of the project we liked the hyphen in the program name at the command line; it was aesthetic and distinct. But the hyphen cannot go everywhere we need it to: it cannot go in environment variables, it cannot go in the package name, and it feels out of place on the file system. It has also remained a point of confusion for users unsure how to refer to the project, so we are deprecating the hyphen everywhere.

Furthermore, in our experience hyper-shell is simply a lot of characters to type when developing workflows interactively. The long name has the benefit of uniqueness on any given system, and it remains good practice in scripts, in the same way long-form options are. Conversely, just as short-form, single-letter options are more ergonomic at the command line (even stacked), so too is invoking programs with fewer characters.


New capabilities for task management

Previously, the hs task update command would only allow you to specify a single task UUID, a field to change, and the new value. In Issue #27 we considered the much-needed capability to cancel a task. But what about other sorts of actions (singular and bulk), such as deletion, reversion, or even updating tags en masse?

Instead of creating many additional subcommands, we have extended the original update command to allow arbitrarily many field and tag modifications in a single call. We have also added --cancel, --revert, and --delete as special cases that apply managed changes.

We’ve ported over the same machinery that underlies the hs task search command to allow updates to target tasks based on search. In the following example, we modify the exit_status of all tasks whose arguments end in this file pattern.

Update fields and tags with positional arguments

hs task update exit_status=-10 -w 'args ~ .20230627.gz$'
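
Because update now accepts arbitrarily many positional modifications, a single call can combine fields (key=value) and tags (key:value). A hypothetical sketch of such a combined call (the reviewed tag key is made up for illustration):

```shell
# Set a field and apply a tag on all matching tasks in one call
# ('reviewed' is a hypothetical tag key, shown for illustration):
hs task update exit_status=0 reviewed:true -w 'args ~ .20230627.gz$'
```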

These sorts of updates are sent to the database in one transaction. We can also apply updates based on relative position using --order-by and --limit; these changes, however, require us to pull task IDs and apply changes iteratively.

In the following example, we apply a special tag to the most recently submitted task.

Apply tag to most recently submitted task

hs task update mark:false --order-by submit_time --desc --limit 1

Though not shown in these snippets, the update command first submits a count query to check how many tasks are affected. It will prompt the user with this information and ask for confirmation before applying the change.

To force the update to proceed without user intervention, use --no-confirm. In the following example we cancel all tasks that meet our criteria.

Cancel remaining tasks without confirmation

hs task update --cancel --remaining --no-confirm

The --remaining, --completed, --failed, and --succeeded options expand in the same way as with hs task search and the tasks’ exit_status.

The --cancel option implies setting schedule_time to now and exit_status to -1. The schedule_time remains null until the scheduler thread on the server selects the task; setting it to now therefore ensures it cannot be selected. The --revert option applies the same field changes the server applies at startup when it identifies orphaned tasks: the task retains its ID, submit details, and tags, as if it had never been scheduled. The --delete option physically removes task records from the database.


Shell completions!

If you use BASH or ZSH you can now autocomplete subcommands, options, and arguments! (Sorry CSH/TCSH and PowerShell users, nothing for these yet).

We’ll try to convey some of the cool completions here, but you’ll have to see for yourself. At the command line, press <TAB> (once or twice, depending on your shell) to trigger completions.

Configuration: When using hs config get or hs config set, not only do you get standard, static option completion, but the positional arguments complete to the application parameters and valid options. The shell completion function introspects your current configuration and offers these. Further, when setting values, some options are pre-populated with valid enumerations (e.g., logging.level). Notably, console.theme completes with all of the valid theme names.

Search: When invoking hs task search or hs task update, the positional arguments represent task fields, which are completed for you. Beyond this, when filtering on tags with -t or --with-tag, completion first offers all valid, existing, distinct tag keys. If you follow a key with a : character, it completes with all existing, distinct values for that particular key. This all runs on the database side and, unless you have a database with ~10M+ records, should complete in a second or less.

Note

This feature is so useful that you may want to poll the database for this information directly, using one of two new options for search: hs task search --tag-keys or hs task search --tag-values <KEY>.

Server and Client: When invoking the server and client programs there are additional smart completions. For the client, when completing the --host option, we parse your known hosts (~/.ssh/known_hosts and /etc/hosts) and offer them. This is particularly useful in a Linux cluster environment. For the server, you can create an ad-hoc authkey with -k by tab-completing a 16-digit key generated as a checksum from /dev/urandom.
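
The offered key amounts to a checksum over random bytes. A minimal sketch of one way to derive such a 16-digit key (the actual completion script may use a different checksum):

```shell
# Derive a 16-hex-digit ad-hoc key from /dev/urandom
# (illustrative only; not necessarily HyperShell's exact method):
head -c 32 /dev/urandom | sha256sum | cut -c1-16
```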

Note

The completion definition file must be installed to the correct location on your system or sourced in your login profile in order for completions to be enabled.
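
For example, Bash users might add a line like the following to their login profile (the path here is a placeholder for wherever the completion file was installed on your system):

```shell
# Enable completions at login (install path is hypothetical):
source /path/to/hypershell-completion.bash
```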


UNIX signal interrupts

HyperShell has the capacity to heal from clients going missing; heartbeats have been implemented for a long time, and the client timeout feature allows dynamic clusters to automatically scale down when task pressure is low. Until now, however, we did not have the ability to choose to scale down because of external factors. A typical example in HPC environments is the finite lifetime of job allocations.

Imagine the database and server running externally in a persistent fashion and clients popping into existence on a cluster (using a scheduler like Slurm). In this environment, jobs can run up against their walltime limit in a matter of hours, depending on the configuration. This is a known scenario, and it is an unfortunate waste of resources to allow tasks to begin execution knowing the client will be unceremoniously killed by the scheduler, triggering the eviction process and causing the orphaned tasks to be reverted and rescheduled.

Wouldn’t it be nice if you had some kind of hook into the system that would send your program a signal when it is nearing a cliff, so it could drain tasks and shut down as soon as possible? Thankfully, most modern HPC schedulers do indeed offer this feature. And now we have added a signal handling facility to HyperShell.

The SIGUSR1 and SIGUSR2 signals are intended for application developers to program against as fixed, recognized signals. We now use them, for both the client and server, as an escalating pair of shutdown requests, far less catastrophic than an outright kill.

Sending the SIGUSR1 signal triggers the schedulers to halt and begin shutdown procedures. On the client side, this means all current tasks (and any in the local queue) are allowed to complete, after which the system drains and shuts down.
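
The drain behavior follows the familiar trap-and-finish pattern. A minimal sketch in plain shell (illustration only, not HyperShell source):

```shell
# Trap SIGUSR1, finish the current "task", then shut down
# instead of being killed mid-task.
DRAIN=0
trap 'DRAIN=1' USR1

# Simulate the scheduler warning us shortly before walltime:
( sleep 1; kill -USR1 $$ ) &

while [ "$DRAIN" -eq 0 ]; do
    sleep 0.2    # stand-in for executing the next queued task
done
echo "drained: current tasks finished, shutting down"
```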

Sending the SIGUSR2 signal implies the same, but on the client side also sets a flag to send local interrupts to tasks so they come down faster. As described in the previous release with regard to the task.timeout feature, we send SIGINT, SIGTERM, and SIGKILL in an escalating fashion to halt running tasks.

With regard to signals, we have also added a user-configurable parameter, -S, --signalwait (or task.signalwait; 10 seconds by default). This is the period in seconds the client waits between signal escalations when halting a task.


Configuration management improvements

At the command line, the hs config commands allow the use of --system or --user as an option to target either of these locations. We've now added --local to all of the commands and --default to the get command.
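
As a sketch, the new options compose with the existing commands like so (logging.level is just one example parameter):

```shell
# Read the built-in default for a parameter:
hs config get logging.level --default

# Write to the local configuration rather than user or system:
hs config set logging.level debug --local
```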

The hs config which command now provides much richer output, showing not only the site from which an option takes precedence but also a comparison to the default value, with improved presentation. A new --site option limits output to only the site information (e.g., system, user, local, env, default).

Query for site of configuration parameter

hs config which logging.level
Output
debug (user: /home/user/.hypershell/config.toml | default: warning)
hs config which logging.level --site
Output
user

Further, we’ve added a new HYPERSHELL_CONFIG_FILE environment variable. When set, it disables the system, user, and local configuration files in favor of only the named file. Setting this variable to the empty string results in only environment variables being considered. This can be useful in situations where many instances of the program need to coexist on the same system and incidental modification of the user-level configuration file might break jobs.
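
For example, a batch job might pin its processes to a job-specific file, or opt out of files entirely (the path here is hypothetical):

```shell
# Use only this file; system, user, and local files are ignored:
export HYPERSHELL_CONFIG_FILE="$HOME/jobs/12345/hypershell.toml"

# Or set it empty: no config files at all, environment variables only:
export HYPERSHELL_CONFIG_FILE=
```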


Tagging system improvements

Previously, all tag values were considered text. We have modified the encoding to understand and store any valid JSON value type. Note, however, that some limitations apply and special handling has been implemented where possible; e.g., SQLite considers true and false to be synonymous with 1 and 0, respectively.

Previously, all task metadata was injected into tasks’ environment variables (e.g., TASK_SUBMIT_HOST). Tag data was specifically stripped, however, because its more complex JSON was not amenable to simple encoding. We now handle tags directly and re-inject them with a TASK_TAG_ prefix, meaning tag data is available at runtime. A task submitted with --tag site:b would have TASK_TAG_SITE=b defined at runtime.
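
The naming convention can be illustrated in plain shell (this sketches the mapping only, not HyperShell’s implementation):

```shell
# A tag "site:b" becomes TASK_TAG_SITE=b in the task's environment:
key="site"; value="b"
name="TASK_TAG_$(printf '%s' "$key" | tr '[:lower:]' '[:upper:]')"
export "$name=$value"
echo "$TASK_TAG_SITE"    # prints: b
```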



Fixes


Tags not duplicated on task retry #26

When a task fails and -r, --max-retries is used to automatically re-submit it, tags were not properly replicated. This was a simple omission in the code that duplicates task metadata.

This fix will likely be applied as a patch release in v2.4.1 as well.


IntegrityError: duplicate violates unique constraint #29

At high throughput, with many clients connecting and a database involved, the server may decide to evict a client because the eviction policy is set too low (on the order of seconds); the client is evicted and its tasks reverted. But if the client in question was in fact only delayed (whether due to the network, throughput, or the size of the database slowing down operations), it will collide with the existing database record when the server re-registers the client upon the next heartbeat message.

In other words, if the server is ever too aggressive in claiming a client should be evicted when that client is not actually gone, there was no protection mechanism to avoid the collision.

We fix this by allowing re-registration, though you should still avoid this scenario by not setting the eviction policy too aggressively.