Components
Components Diagram
Handlers
Each handler is associated to a service and is used to process different events and to keep the different services of the platform synchronized. Handlers are useful for example to process user or permission changes, or to manage file modification events.
FileSystem
The FileSystem handler is used to manage some aspects of the directory/file structure associated with the platform. It is specifically used to handle user workspaces and the user access to different other directories. The handler is responsible for creating/deleting a workspace if a user is created/deleted.
Here are some typical directories found in the user_workspaces:
/user_workspaces/<user_name>/notebooks -> /jupyterhub_user_data/<user_name>
/user_workspaces/<user_name>/shapefile_datastore # Managed by the `GeoServer` handler
/user_workspaces/public/wps_outputs # Managed by the `FileSystem` handler
JupyterHub user data
A symlink is created in the user workspace to give access to the notebook directory of the user. The notebook directory is the directory that was originally mounted in bird-house as the writable-workspace. When adding the Cowbird component to bird-house (see here), the mounted writable-workspace becomes the user workspace managed by Cowbird (e.g.: /user_workspaces/<user_name>), which requires a symlink to access the notebook directory originally used by bird-house and JupyterHub.
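The workspace and symlink creation described above can be sketched as follows. This is only an illustration under stated assumptions, not Cowbird's actual implementation: the function name create_user_workspace, its parameters, and the user name alice are hypothetical, and the directory roots are created in a temporary directory so that the example can run anywhere.

```python
import os
import tempfile
from pathlib import Path


def create_user_workspace(workspaces_root: Path, jupyterhub_data_root: Path, user_name: str) -> Path:
    """Create the workspace of a user and link its notebooks directory (hypothetical helper)."""
    workspace = workspaces_root / user_name
    workspace.mkdir(parents=True, exist_ok=True)

    notebooks_link = workspace / "notebooks"
    notebooks_target = jupyterhub_data_root / user_name
    # Replace a stale symlink, but never touch a real file or directory.
    if notebooks_link.is_symlink():
        notebooks_link.unlink()
    if not notebooks_link.exists():
        os.symlink(notebooks_target, notebooks_link, target_is_directory=True)
    return workspace


# Demonstration in a temporary directory standing in for the real mount points.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "jupyterhub_user_data" / "alice").mkdir(parents=True)
    workspace = create_user_workspace(root / "user_workspaces", root / "jupyterhub_user_data", "alice")
    link = workspace / "notebooks"
    link_is_symlink = link.is_symlink()
    link_target = os.readlink(link)
    expected_target = str(root / "jupyterhub_user_data" / "alice")
```

The helper is idempotent: calling it again for the same user refreshes the symlink without failing.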
WPS outputs data
The WPS outputs data, generated by the different WPS birds and Weaver, is stored in a distinct data directory and must be made accessible to the users via their workspace.
There are two types of data:
Public data: This data must be accessible to every user.
User specific data: This data must only be accessible to the related user.
The public WPS outputs data is made accessible by generating hardlinks from the WPS outputs data directory to the user workspaces location (e.g.: /user_workspaces/public/wps_outputs). When a JupyterLab instance is started via bird-house, the directory containing the hardlinks will be mounted as a volume, such that the users obtain access to its content via their JupyterLab instance. The volume will be made read-only to prevent a user from modifying the public data.
The user WPS outputs data is made accessible by generating hardlinks from a WPS outputs data directory containing user data to a subdirectory found in the related user’s workspace. For example, with a source path /wps_outputs/<bird-name>/users/<user-id>/<job-id>/<output-file>, a hardlink is generated at the path /user_workspaces/<user-name>/wps_outputs/<bird-name>/<job-id>/<output-file>. The hardlink path uses a similar structure as found in the source path, but removes the redundant users and <user-id> path segments. The hardlink files will be automatically available to the user on a JupyterLab instance since the workspace is mounted as a volume. Any file that is found under a directory /wps_outputs/<bird-name>/users/<user-id>/ is considered to be user data and any outside file is considered public.
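The path mapping described above can be sketched with a small function. This is an assumption-laden illustration rather than Cowbird's code: the function name hardlink_path and the user_names lookup table (user id to user name) are hypothetical, and the layout used under the public directory (mirroring the source path) is assumed, since the text only gives the public root.

```python
from pathlib import PurePosixPath
from typing import Dict

WPS_OUTPUTS_ROOT = PurePosixPath("/wps_outputs")
WORKSPACES_ROOT = PurePosixPath("/user_workspaces")
PUBLIC_WORKSPACE = "public"


def hardlink_path(source: str, user_names: Dict[str, str]) -> PurePosixPath:
    """Return the workspace path where the hardlink of a WPS output file would be created."""
    parts = PurePosixPath(source).relative_to(WPS_OUTPUTS_ROOT).parts
    # User data layout: <bird-name>/users/<user-id>/<job-id>/.../<output-file>
    if len(parts) > 3 and parts[1] == "users":
        bird_name, _, user_id, *rest = parts
        user_name = user_names[user_id]  # assumed lookup from user id to user name
        return WORKSPACES_ROOT / user_name / "wps_outputs" / bird_name / PurePosixPath(*rest)
    # Anything outside a users/<user-id>/ directory is considered public.
    return WORKSPACES_ROOT / PUBLIC_WORKSPACE / "wps_outputs" / PurePosixPath(*parts)


user_link = hardlink_path("/wps_outputs/weaver/users/42/job-1/out.nc", {"42": "alice"})
public_link = hardlink_path("/wps_outputs/weaver/job-2/out.nc", {})
```

With these inputs, the user file maps to /user_workspaces/alice/wps_outputs/weaver/job-1/out.nc and the public file to /user_workspaces/public/wps_outputs/weaver/job-2/out.nc.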
The permissions found on the user data are synchronized with the permissions found on Magpie. If Magpie uses a secure-data-proxy service, this service handles the permissions of those files (see here). If a file does not have a corresponding route on the secure-data-proxy service, it will use the closest parent permissions.
Note
If the access to a WPS outputs file is allowed, the file access will be read-only and any write permissions from the secure-data-proxy service will be ignored. This is because WPS outputs are produced by external processes and the resulting data should remain constant.
Note
The files/directories permissions are only applied to others (see Components - Usage of ‘others’ permissions section for details).
Warning
The route resources found under the secure-data-proxy service must match exactly a path on the filesystem, starting with the directory name wps_outputs, and following with the desired children directories/file names.
If the file does not have any read or write permissions, the hardlink will not be available in the user’s workspace.
Note
Permissions should not be modified via the file system, but should only be managed via the secure-data-proxy service on Magpie. Permission modifications on the file system will be ignored.
Refer to DAC-571 for more details on the design choices for the management of permissions.
If no secure-data-proxy service is found, all user files are assumed to be available with read permissions for the user.
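The resolution rules for user WPS outputs (closest parent route, read-only access, and the fallback when no secure-data-proxy service exists) can be sketched as follows. This is a simplified model under stated assumptions: the function resolve_file_access and the route_permissions table are hypothetical, permissions are reduced to plain "read"/"write" strings instead of Magpie's full permission model, and returning no access when no route matches at all is an assumption not stated in the text.

```python
from pathlib import PurePosixPath
from typing import Dict, Optional, Set


def resolve_file_access(file_path: str, route_permissions: Optional[Dict[str, Set[str]]]) -> Set[str]:
    """Return the effective access on a user WPS output file (simplified model)."""
    if route_permissions is None:
        # No secure-data-proxy service: user files are assumed readable by the user.
        return {"read"}
    path = PurePosixPath(file_path)
    # Check the file's own route first, then walk up to the closest parent route.
    for candidate in (path, *path.parents):
        permissions = route_permissions.get(str(candidate))
        if permissions is not None:
            # WPS outputs stay read-only: write permissions are ignored.
            return permissions & {"read"}
    return set()  # assumption: no matching route means no access


routes = {
    "wps_outputs/weaver/users/42": {"read", "write"},
    "wps_outputs/weaver/users/42/job-1/secret.nc": set(),
}
inherited = resolve_file_access("wps_outputs/weaver/users/42/job-1/out.nc", routes)
blocked = resolve_file_access("wps_outputs/weaver/users/42/job-1/secret.nc", routes)
no_service = resolve_file_access("wps_outputs/weaver/users/42/job-1/out.nc", None)
```

In this sketch, a file whose resolved access is empty would not get a hardlink in the user's workspace.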
Note that different design choices were made to respect the constraints of the file system and to prevent the user from accessing forbidden data:
To avoid having to copy every file from the WPS outputs data directory to the corresponding user workspaces, the usage of links was chosen.
The usage of hardlinks was chosen instead of symlinks. The reason is that using symlinks would require mounting the WPS outputs data directory on the user JupyterLab instance. This is because symlinks require access to their source files, contrary to hardlinks which can be mounted individually without access to source files. Mounting the whole WPS outputs data directory would give the user access to all the data. Even if the data is not made available on the JupyterLab browser via the UI, the data would still be accessible via the terminal found on JupyterLab. Using hardlinks lets us mount only the public directory instead of the whole WPS outputs directory, which contains a mix of public and user data.
Changing permissions of the linked workspace files to control user access was not an option since symlinks and hardlinks always use the same permissions as their original source file. Symlinks/hardlinks cannot have custom file permissions independent from their source file. Therefore, it is not possible for example to give access to a user and prevent the access to another user for a specific file using the file permissions, even if users have their own hardlinks or symlinks in their personal workspace.
Another considered option was to add anonymous volumes to hide some parts of the WPS outputs data directory. This could have been useful to mount the whole WPS outputs directory on the user JupyterLab instance and to hide the user data subdirectories found in the WPS outputs by using anonymous volumes for each user subdirectory. This would automatically mount the public data while hiding the user related data. It was still considered easier and safer to go with the hardlinks option to prevent potential errors which could accidentally give access to other users’ data. For example, if a bird directory was created while a user JupyterLab instance was running, it would require both the instance and JupyterHub to be restarted in order to generate the new required anonymous volume to hide the user data from that new bird. Without a restart of the instance, the user could potentially have access to some of the new user data found in the new bird directory.
In conclusion, the best option was to use hardlinks, which do not require access to the original source file, to create separate user and public data access points, and volume mounting to control which locations are made available to the user.
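The hardlink properties that drive these design choices can be checked with a short experiment. This is a sketch run in a temporary directory, with hypothetical file names; it relies only on standard os and pathlib calls and assumes a POSIX file system where both paths live on the same device.

```python
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    source = root / "wps_outputs" / "weaver" / "job-1" / "out.txt"
    source.parent.mkdir(parents=True)
    source.write_text("result")

    public_link = root / "user_workspaces" / "public" / "wps_outputs" / "weaver" / "job-1" / "out.txt"
    public_link.parent.mkdir(parents=True)
    os.link(source, public_link)  # hardlink: a second name for the same inode

    # Both names point to the same inode, so they share content and permissions.
    same_inode = os.stat(source).st_ino == os.stat(public_link).st_ino

    # Changing the mode through the link changes it for the source too,
    # which is why per-user permissions cannot be set on individual links.
    os.chmod(public_link, 0o604)
    shared_mode = (os.stat(source).st_mode & 0o777) == 0o604

    # The link stays readable even when the source path is not reachable,
    # which is what allows mounting only the directory containing the links.
    source.unlink()
    content_after_unlink = public_link.read_text()
```

A symlink created in place of the hardlink would have become dangling after the last step, which matches the reasoning above about having to mount the whole WPS outputs directory.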
See also
Refer to DAC-149 for details on the design choices for managing the access to WPS outputs data.
GeoServer
The GeoServer handler is used to keep the internal representation on the GeoServer server along with the user workspace in sync with the rest of the platform.
If a new user is created on Magpie, a GeoServer workspace is automatically created for the user, along with a datastore directory in the user workspace to contain the different shapefiles of the user. Similarly, if the user is deleted on Magpie, the GeoServer workspace of the user is automatically deleted to keep the services synchronized.
The workspace and file permissions are also synchronized between Magpie and GeoServer. For example, if a permission is added or removed in Magpie, the file found in the user’s datastore must have corresponding permissions in order to reflect the actual user access permissions.
Since the Magpie permissions on a resource from a service of type GeoServer are not the same as traditional Unix permissions (e.g.: rwx) on the workspace/shapefiles, some design choices were made in order to have a coherent synchronization:
Permission/Category definition
All permissions on Magpie on a resource from a service of type GeoServer are classified as either readable or
writable in order to associate them to the actual path permissions.
If the path receives a read permission, all Magpie permissions fitting the readable category will be enabled.
If a Magpie permission from the readable category is added, the path will be updated to have read permissions. This update on the file system will trigger a synchronization with Magpie, to add all other readable permissions on Magpie. For example, if the GetFeature permission is added to a Layer resource on Magpie, the associated shapefile will receive read permissions because GetFeature is a permission from the readable category. Since the file permissions are modified, it will trigger another event to synchronize permissions with Magpie, enabling, on the Layer resource, all other readable permissions: DescribeFeatureType, DescribeStoredQueries, GetCapabilities, etc. The same process would apply if we use permissions from the writable category in this last example.
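The two-step synchronization described above can be modelled with plain sets. This is a minimal sketch, not Cowbird's implementation: the function names are hypothetical, and READABLE_CATEGORY only contains the four permissions named in the text, which is an incomplete list (the text ends it with "etc."). The writable category would follow the same pattern with its own set of permissions.

```python
from typing import Set

# Readable permissions named in the text; the real category contains more entries.
READABLE_CATEGORY = frozenset(
    {"GetFeature", "DescribeFeatureType", "DescribeStoredQueries", "GetCapabilities"}
)


def path_needs_read(enabled_on_magpie: Set[str]) -> bool:
    """Magpie -> file system: any enabled readable permission gives the path a read bit."""
    return bool(enabled_on_magpie & READABLE_CATEGORY)


def sync_magpie_from_path(enabled_on_magpie: Set[str], path_has_read: bool) -> Set[str]:
    """File system -> Magpie: a read bit on the path enables the whole readable category."""
    if path_has_read:
        return set(enabled_on_magpie) | READABLE_CATEGORY
    return set(enabled_on_magpie)


# Step 1: GetFeature is added on the Layer resource, so the shapefile becomes readable.
enabled = {"GetFeature"}
has_read = path_needs_read(enabled)

# Step 2: the file permission change triggers a new event that completes the category.
synced = sync_magpie_from_path(enabled, has_read)
```

After both steps, synced holds every permission of the readable category, which mirrors the GetFeature example.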
Permission creation conditions
Note that permissions are only added to Magpie if necessary. For example, if a file needs to allow a readable permission on Magpie, but that permission already resolves to allow because of a recursive permission on a parent resource, no permission will be added. The permission already resolves to the required access, and avoiding unnecessary permissions simplifies permission resolution.
Also, in the case where a user has all the readable permissions enabled on Magpie, for example, and a single one
of them is deleted, Cowbird will not change the file permissions since other permissions from the readable category
will still be found on Magpie. This means that a synchronization will not be triggered and Magpie permissions will
stay the same, meaning all the readable permissions activated except for the one removed.
If eventually a change is applied to the file (e.g.: changing the permissions from r-- to rw-), it would trigger a synchronization, and the one Magpie permission that was removed earlier would be re-enabled, because of the read permission found on the file.
Consequently, partial usage of readable permissions on Magpie is not recommended, since the disabled readable permissions risk being automatically re-enabled the next time the file is updated. To disable read access via Magpie, it is better to disable all the permissions from the readable category at once, which will trigger a change to remove the read permission on the associated path and prevent Magpie permissions from being re-enabled unexpectedly.
The same would apply if we use write permissions in this last example.
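The creation conditions and the partial-removal pitfall can be illustrated with a small simulation. This is a simplified model under stated assumptions: the helper names are hypothetical, READABLE_CATEGORY only lists the permissions named in the text, and permission resolution is reduced to a set of permission names that already resolve to allow.

```python
from typing import Set

# Readable permissions named in the text; the real category contains more entries.
READABLE_CATEGORY = frozenset(
    {"GetFeature", "DescribeFeatureType", "DescribeStoredQueries", "GetCapabilities"}
)


def permissions_to_create(resolving_to_allow: Set[str]) -> Set[str]:
    """Readable permissions to add on Magpie; those already resolving to allow are skipped."""
    return set(READABLE_CATEGORY) - resolving_to_allow


def file_keeps_read(enabled_on_magpie: Set[str]) -> bool:
    """The read bit stays on the file while at least one readable permission remains."""
    return bool(enabled_on_magpie & READABLE_CATEGORY)


# All readable permissions are enabled, then a single one is removed on Magpie.
enabled = set(READABLE_CATEGORY)
enabled.discard("GetCapabilities")

# The file keeps its read bit, so no synchronization is triggered at this point.
still_readable = file_keeps_read(enabled)

# A later change on the file (e.g. r-- to rw-) triggers a synchronization,
# which re-creates the permission that was removed earlier.
recreated = permissions_to_create(enabled)
```

Removing the whole category at once makes file_keeps_read return False, so the read bit is dropped and nothing is re-created later.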
The files/directories permissions are only applied to others (see Components - Usage of ‘others’ permissions section for details).
If a permission is applied to a group in Magpie, the GeoServer handler will detect the permission change but will
not do anything since Magpie groups are different than the groups found on the file system. Also, it would not make
sense to update the path associated to a resource for all the users of a group, since the path is supposed to be
associated to a single user anyway.
Note that even if group permission changes on Magpie are not handled by Cowbird, a group permission could still
have an impact on permission resolution. For example, if a shapefile needs to allow a readable permission on Magpie, but that permission already resolves to allow because of a group permission, no permission will be added.
File/Layer permissions
File events will only be processed in the case of the .shp file, since it is considered to be the main component of a shapefile. The other extensions associated with a shapefile will not be processed if they trigger an event, and will only be updated in the case of a change on the .shp file.
Shapefiles will only be assigned read or write permissions on the file system. execute permissions are not needed for shapefiles.
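The event filtering described above can be sketched as follows. This is an illustration only: the helper names is_main_shapefile and sidecar_files are hypothetical, as are the file names, and matching the companion files by shared base name is an assumption about how the other extensions are located.

```python
from pathlib import PurePosixPath
from typing import List

MAIN_EXTENSION = ".shp"


def is_main_shapefile(path: str) -> bool:
    """Only events on the .shp file are processed."""
    return PurePosixPath(path).suffix == MAIN_EXTENSION


def sidecar_files(shp_path: str, directory_listing: List[str]) -> List[str]:
    """Other files of the same shapefile, updated only when the .shp file changes."""
    main = PurePosixPath(shp_path)
    return [
        name
        for name in directory_listing
        if PurePosixPath(name).parent == main.parent
        and PurePosixPath(name).stem == main.stem
        and PurePosixPath(name) != main
    ]


datastore = "/user_workspaces/alice/shapefile_datastore"
listing = [
    f"{datastore}/roads.shp",
    f"{datastore}/roads.shx",
    f"{datastore}/roads.dbf",
    f"{datastore}/rivers.shp",
]
processed = [name for name in listing if is_main_shapefile(name)]
companions = sidecar_files(f"{datastore}/roads.shp", listing)
```

Here only roads.shp and rivers.shp would produce processed events, and a change on roads.shp would also be applied to roads.shx and roads.dbf.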
Directory/Workspace permissions
Workspaces will always keep their execute permissions even if they don’t have any permissions enabled on Magpie. This enables access to the child files, in case a child resource has permissions enabled on Magpie. Since a child resource has priority on Magpie if its permissions are enabled, it makes sense to allow access to the file on the file system too. Note that if the directory only has execute permissions, the file will only be accessible via a direct path or URL, and it will not be accessible via a file browser, including the JupyterLab file browser. This still allows users to share their files using a path or URL. To allow browsing the directory’s content, the read permission is also required on the directory, which can be obtained by enabling permissions of the readable category on the corresponding workspace on Magpie.
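The resulting permission bits for workspaces and shapefiles can be summarized with a small helper. This is a sketch under stated assumptions: the function others_bits is hypothetical, and its readable and writable flags stand for "at least one Magpie permission of that category is enabled on the resource".

```python
import stat


def others_bits(is_directory: bool, readable: bool, writable: bool) -> int:
    """Compute the 'others' permission bits of a workspace directory or a shapefile."""
    bits = 0
    if readable:
        bits |= stat.S_IROTH
    if writable:
        bits |= stat.S_IWOTH
    if is_directory:
        # Workspaces always keep execute so that child files stay reachable by path.
        bits |= stat.S_IXOTH
    return bits


workspace_without_permissions = others_bits(is_directory=True, readable=False, writable=False)
browsable_workspace = others_bits(is_directory=True, readable=True, writable=False)
editable_shapefile = others_bits(is_directory=False, readable=True, writable=True)
```

The three results correspond to --x (reachable by direct path only), r-x (browsable), and rw- (a shapefile never gets execute).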
Operations to avoid
Note that some operations should be avoided, as they are undesirable and not supported for now.
- Using subdirectories in the shapefile datastore :
Using subdirectories in the shapefile datastore directory is not supported for now. Only the shapefiles found directly under the datastore directory will be processed by the GeoServer handler and subdirectories will be ignored. This also corresponds to the design in Magpie where a Workspace resource can only have children resources of the type Layer, and cannot have a Workspace type resource as a child.
- Renaming a directory :
The directories associated with the GeoServer workspace are the user workspace (named by the user’s name) and the datastore directory (which uses a default value). Both of these values should never change and renaming them manually might break the monitoring, preventing Cowbird from receiving future file events.
- Renaming a shapefile (.shp only) :
This operation is actually supported, but it should be avoided if possible. It will trigger multiple events on the file system (an update on the parent directory, and a delete followed by a create event on the file), which should keep the information in GeoServer and Magpie up to date by simply generating new resources. A risk with this is that the delete event will delete the other shapefile files, and the user could lose some data. It is better to have a copy of the shapefile before applying this operation. Note that renaming one of the other extensions (not the .shp file) will not trigger any event since only the main file triggers events.
- Deleting a directory :
This operation will only display a warning in Cowbird’s logs. It should never be done manually, since it will create inconsistencies with the GeoServer workspace and the Magpie resources. The user workspace and the datastore directory should only be deleted when a user is deleted via Magpie.
Usage of ‘others’ permissions
With the GeoServer and the FileSystem handlers, the permissions applied on the files/directories are only applied to others, and the permissions on the user and on the group are not modified. The user and group associated with the paths will be the admin user/group (root by default), while the user who will interact with the paths, for example in JupyterHub, is a distinct user, which is why the permissions are applied to others. This also prevents the user from changing the permissions if they decide to interact with the terminal accessible via JupyterLab.
Note that, consequently, the concept of a Magpie group is not used on the file system for now, since the group on the file system does not correspond to the groups found on Magpie.
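Updating only the others bits while leaving the user and group bits untouched can be sketched with the standard os and stat modules. The helper name set_others_permissions is hypothetical, and the example runs on a temporary file rather than on a real workspace path.

```python
import os
import stat
import tempfile


def set_others_permissions(path: str, others_bits: int) -> int:
    """Replace the 'others' bits of a path and keep its user and group bits as they are."""
    current = stat.S_IMODE(os.stat(path).st_mode)
    new_mode = (current & ~stat.S_IRWXO) | (others_bits & stat.S_IRWXO)
    os.chmod(path, new_mode)
    return new_mode


with tempfile.NamedTemporaryFile() as handle:
    os.chmod(handle.name, 0o640)  # rw- for the user, r-- for the group, --- for others
    updated = set_others_permissions(handle.name, stat.S_IROTH)
    final_mode = stat.S_IMODE(os.stat(handle.name).st_mode)
```

Starting from 0o640, granting read to others yields 0o644: the owner and group bits are unchanged, which matches the behaviour described for both handlers.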