Managing your data

Data serves as the foundation for data analysis, especially in the field of bioinformatics. However, managing data poses numerous challenges, such as ensuring secure transfer and storage, optimizing data utilization during analysis, systematically matching input data types, and more.

To address these challenges, Bioinfopipe offers robust solutions. Leveraging the power of the AWS S3 storage system, users can securely store and efficiently access their data, utilizing specific storage schemes tailored to their usage frequency and cost considerations. Bioinfopipe also introduces a comprehensive file type system that facilitates the seamless matching of input data files and output files from upstream tools, enabling smooth data integration into the analysis workflow. Moreover, the platform incorporates a dataset functionality, empowering users to group related data files into datasets, which can be directly applied to various analysis tools. This streamlined approach proves particularly advantageous when dealing with large-scale analysis involving multiple input files.

In this article, you will learn how to browse data files, perform actions on them, manage file types, and handle datasets within the Bioinfopipe platform.


1. Browsing data files

To browse your data, follow these steps:

  1. Go to the menu bar and click on Data -> Browse data. This will open the web GUI called 'Browse and manage your data' where you can access your data.
  2. The data files are organized in a folder structure. In the main table, you will see a list of folders and files starting from your root folder, which is named after your user ID. You can open a folder to view its contents, including files and sub-folders.
  3. To navigate back to a parent folder, simply click on the corresponding parent folder link in the 'Path directory links' section located above the button bar.
  4. On the left pane, you will find the data folder tree structure with folder links. These links provide a quick way to access specific folders within the structure.

By following these steps, you can easily browse and navigate through your data files using the Bioinfopipe platform.


To let users get started quickly, there are 4 reserved folders:

__uploaded__ : You can put uploaded data files here.

__datasets__ : You can put datasets here.

__analysis__ : This is a project folder used to hold the analysis sessions.

__sharedme__ : All data files shared with you will be put in this folder.

Four columns of properties are shown in the main table:

Name : The folder name or file name.

Type : The folder type ('Project' or 'Regular') for folders, or the file types for data files.

Size : The size of data files.

Last modified : The time when the file was last modified.

You can get more properties for a data file by clicking the 'Info' icon button on the right-hand side of the table. A right pane will pop up showing the following properties:

File name : The file name.

File ID : A pseudo ID specified under the namespace of the user ID.

File path : The full path of the file.

File type : The file types assigned to the file.

Size : The file size.

Storage class : The user-specified S3 storage class for the file.

Created at : The time when the file object was created.

Updated at : The time when the file object was updated.

Encryption : Whether the server-side encryption algorithm AES256 was used when storing this object in Amazon S3.

Shared : Whether the file is shared with other users.

Owned by : The owner of the file.

S3 checked : The time when the file object in S3 was last checked.

S3 object : Whether there is an S3 object related to the file.
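
Many of these properties mirror the underlying S3 object metadata. If you have direct access to the bucket, you can inspect the S3-side counterparts yourself; below is a minimal boto3 sketch, where the bucket and key names are hypothetical placeholders:

import boto3

# Inspect the S3 object metadata behind fields such as 'Size',
# 'Storage class', 'Encryption' and 'Last modified'.
# The bucket and key names below are hypothetical placeholders.
s3 = boto3.client("s3")
head = s3.head_object(Bucket="bioinfopipe-org1",
                      Key="upload/project-1/sample.fastq.gz")

print(head["ContentLength"])                 # Size in bytes
print(head.get("StorageClass", "STANDARD"))  # Storage class (absent means STANDARD)
print(head.get("ServerSideEncryption"))      # Encryption, e.g. 'AES256'
print(head["LastModified"])                  # Last modified timestamp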


1.1. Uploading data files

By clicking the 'Upload data' button you can choose to upload data 'From local', 'From URLs' or 'Map S3 key' from the dropdown menu.

From local

By choosing 'From local' you can upload data files from your local machine. A form titled 'Upload files from local' will pop up, containing the following fields:

Target folder : The folder where the uploaded data files will be placed; it is the current folder by default.

File : By clicking the 'Browse' button, your local OS file browser window will pop up for you to select the files you would like to upload.

Storage class : Here you can specify an S3 storage class for your data based on how often you access it, as well as performance and cost; e.g. you can choose 'GLACIER' if you rarely retrieve the data. The storage classes are described as follows:

  • STANDARD :  Frequently accessed data (more than once a month) with millisecond access.
  • INTELLIGENT_TIERING : Data with unknown, changing, or unpredictable access patterns.
  • STANDARD_IA : Long-lived, infrequently accessed data (once a month) with millisecond access.
  • ONEZONE_IA : Recreatable, infrequently accessed data (once a month) with millisecond access.
  • GLACIER_IR : For long-lived archive data accessed once a quarter with instant retrieval in milliseconds.
  • GLACIER : For long-term backups and archives with retrieval option from 1 minute to 12 hours.
  • DEEP_ARCHIVE :  For long-term data archiving that is accessed once or twice in a year and can be restored within 12 hours.
  • OUTPOSTS : Ideal for workloads with local data residency requirements, and to satisfy demanding performance needs by keeping data close to on-premises applications.
  • REDUCED_REDUNDANCY : Enables customers to store noncritical, reproducible data at lower levels of redundancy than Amazon S3’s standard storage.

'INTELLIGENT_TIERING' is applied by default, which is ideal if access patterns are unpredictable or change over time. However, if you know your data will be accessed infrequently, STANDARD_IA or ONEZONE_IA may be cheaper. For long-term archival with minimal access, GLACIER_IR, GLACIER, or DEEP_ARCHIVE are significantly cheaper. For more information, please check the AWS S3 storage class documentation.
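
If you later transfer data into the bucket yourself (see 'Map S3 key' below), the same storage classes can be set at upload time. Here is a minimal boto3 sketch, assuming hypothetical bucket and key names:

import boto3

# Upload a file with an explicit storage class and AES256 server-side
# encryption. The bucket and key names are hypothetical placeholders.
s3 = boto3.client("s3")
s3.upload_file(
    Filename="/home/path/sample.fastq.gz",
    Bucket="bioinfopipe-org1",
    Key="upload/project-1/sample.fastq.gz",
    ExtraArgs={
        "StorageClass": "INTELLIGENT_TIERING",  # or STANDARD_IA, GLACIER, ...
        "ServerSideEncryption": "AES256",
    },
)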

From URLs

By choosing 'From URLs' you can upload data files by transferring data residing on FTP/HTTP servers. A form titled 'Upload files from FTP/HTTP' will pop up, containing the following fields:

Target folder : The folder where the uploaded data files will be placed; it is the current folder by default.

URL : Specify the URL address of a file on an FTP/HTTP server, or a directory URL from FTP. Don't forget to put a '/' at the end of the URL when specifying a directory, e.g. 'ftp://ftp.ncbi.nlm.nih.gov/toolbox/vms_util/'. When a directory is specified, all files and sub-directories will be uploaded recursively, and the current directory and its sub-directories will be kept in the target folder.

File Pattern : If a directory is specified in the URL field, you can upload a subset of files by putting a file pattern with wildcards, e.g. '*.fastq.gz'. Multiple patterns are allowed and need to be comma-separated. Leave it empty and all files and sub-directories will be uploaded. See the sketch after this list for how such patterns select files.

User name : Specify the user name if it is required by FTP/HTTP servers.

Password : Specify the password if it is required by FTP/HTTP servers.
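
The file patterns follow ordinary shell-style wildcards. The following Python sketch illustrates how comma-separated patterns select a subset of files (an illustration of the behaviour described above, not Bioinfopipe's actual code):

import fnmatch

def select_files(filenames, patterns):
    """Keep files matching any of the comma-separated wildcard patterns;
    an empty pattern string keeps everything."""
    pats = [p.strip() for p in patterns.split(",") if p.strip()]
    if not pats:
        return list(filenames)
    return [f for f in filenames
            if any(fnmatch.fnmatch(f, p) for p in pats)]

files = ["s1_R1.fastq.gz", "s1_R2.fastq.gz", "notes.txt"]
print(select_files(files, "*.fastq.gz"))    # ['s1_R1.fastq.gz', 's1_R2.fastq.gz']
print(select_files(files, "*.txt, *_R1*"))  # ['s1_R1.fastq.gz', 'notes.txt']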

By clicking the 'Submit' button, a job will be created which applies Nextflow with the Wget tool. An upload bar will then show up under the main buttons. In the upload bar, you can find the uploading information including Job ID, URL and target folder path. There is also a job state at the right-hand side of the bar; the state can be:

  • Queuing : Waiting for the job to be sent to the AWS Batch service.
  • Uploading : The uploading process is running.
  • Success : All files were successfully uploaded to the target folder. You can view the Wget standard output (non-verbose) by clicking the 'View info' icon button.
  • Error : There was an error in the process. You can find further information by clicking the 'View info' icon button.
  • Cancelling : The job has been requested to be cancelled and is waiting for cancellation. You can cancel an upload job while it is in the 'Queuing' or 'Uploading' state by clicking the 'Remove/Cancel job' icon button.
  • Cancelled : The upload job was cancelled.

Once an upload job has finished successfully or been cancelled, you can remove it by clicking the 'Remove/Cancel job' icon button.

Map S3 key

You can directly transfer your data to the Bioinfopipe S3 bucket and map the files into a specified folder. The root folder must be named 'upload' within your Bioinfopipe bucket; see the example:

aws s3 cp /home/path/data_folder s3://bioinfopipe-org1/upload/project-1 --recursive

The example above shows an AWS command-line operation transferring data to the 'upload/project-1' folder within your S3 Bioinfopipe bucket 'bioinfopipe-org1'. The data will be stored in the S3 folder structure as 'upload/project-1/data_folder/sub-folder'.

Choosing 'Map S3 key' will pop up a form titled 'Map files from S3 key', which contains the following fields:

Target folder : Specifies the folder where the uploaded data files will be placed. By default, this is set to the current folder.

S3 key : Specify the S3 key of a data path to map. If an S3 directory path is provided, all files and sub-directories within it will be mapped recursively, maintaining the directory structure in the target folder. Only the directories following the last '/' symbol in the S3 key will be preserved.
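
As a small illustration of that last rule (with hypothetical paths):

# Only the directories after the last '/' in the S3 key are preserved.
s3_key = "upload/project-1/data_folder"
preserved_root = s3_key.rsplit("/", 1)[-1]   # 'data_folder'

# A file at s3://<bucket>/upload/project-1/data_folder/run1/a.fastq.gz
# is therefore mapped to <target folder>/data_folder/run1/a.fastq.gz
print(f"my_target_folder/{preserved_root}/run1/a.fastq.gz")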




1.2. Creating a folder

To create a sub-folder inside a folder, just click the 'Create folder' button; a form will pop up with the following fields:

Name : Specify the sub-folder name.

Type : Specify the folder type, which can be 'Regular' (default) or 'Project'; the only difference is that 'Project' folders can hold analysis sessions.

You can edit the folder properties by clicking the 'Edit folder' icon button in the row related to the folder; and by clicking the corresponding 'Delete folder' icon button, the folder and all its sub-folders and files will be deleted.


2. Handling data files

In Bioinfopipe, data files are directly mapped to S3 file objects, and they are logically linked to folders. This means that a file can easily be moved to a different folder without changing its S3 key, since renaming an S3 key is a heavy operation in the S3 system, especially for big files.

Various actions can be performed on data files to manage them effectively. These actions include setting file types, moving/copying files, sharing files, and others. With these functionalities, users have the flexibility to organize their data files and perform the actions needed to handle their data efficiently within the Bioinfopipe platform.


2.1. Setting file types for files

Once you have uploaded your data files, it is recommended to set their file types promptly. File types play a crucial role when creating datasets or setting up instant analysis pipelines, as they allow the platform to determine if the output files of an upstream tool match the input file types of its downstream tools.

To set the file types manually, follow these steps:

  1. Select one or multiple files by checking the corresponding checkboxes in the main table.
  2. Click the 'Actions' button and choose 'Set file type' from the dropdown menu.
  3. A modal popup titled 'Choose file types' will appear, displaying a tree of file types. From here, you can select one or multiple file types from various file type categories. If you are a Pro/Org user, you can also select your private file types.

Note: It is not possible to select both child file types and their parent or super-parents simultaneously. Specifying a file type means that the file can be categorized as that specific file type and all its descendants.
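
The descendant rule can be pictured with a small sketch, using a hypothetical type tree (not Bioinfopipe's actual hierarchy):

# Hypothetical file type tree: parent -> children.
TREE = {"SEQS": ["FASTQ", "FASTA"], "FASTQ": ["FASTQ-GZ"]}

def descendants(node):
    """The type itself plus everything below it in the tree."""
    out = {node}
    for child in TREE.get(node, []):
        out |= descendants(child)
    return out

# Tagging a file as 'SEQS' means it can be categorized as any of:
print(descendants("SEQS"))  # {'SEQS', 'FASTQ', 'FASTA', 'FASTQ-GZ'}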

Alternatively, you can try setting file types automatically by selecting the 'AI Set File Type' option from the dropdown menu. This will attempt to identify file types based on filenames using AI LLMs.



2.2. Sharing your data files

If you are a subscribed user on the Org plan, you can share your data with any other subscribed users in the same organization, so they can work directly on your data without it being transferred to their storage space.

To share data files, first select one or multiple files by checking the corresponding checkboxes in the main table, then choose 'Share data' from the dropdown menu under the 'Actions' button. A form titled 'Share selected data' will pop up with the following fields:

Share to : The users or teams you would like to share with; you can put user email addresses or team IDs, all separated by ';'.

Share group : Alternatively, you can select a pre-defined sharing group set up in your account admin, which will override the 'Share to' field.

Days for sharing : You can set the sharing time span in days; it is 30 days by default.


2.3. Other actions

You can download or delete an individual file/folder by clicking its related icon buttons in the main table. You can also perform actions on selected files, folders and datasets: just select the target files, folders, or datasets, then choose the action from the dropdown menu under the 'Actions' button. The actions are described as follows:

Set file type : Set file types for files.

Move data to : Move a set of files/folders to a specified folder, so you can rearrange your files/folders into a better organized hierarchy.

Copy data to : Copy a set of files/folders to a specified folder. This is useful if you want to save your data under a different storage class.

Link data to : Link a set of files/datasets to a specified folder, where a link folder will be created linking the selected files/datasets. This is similar to a soft link in a Linux system, and is convenient for linking a set of files to a project folder for analysis without actually moving them. It is good practice to keep your raw data in a fixed folder and link the files needed, or create datasets, for your projects.

Share data : Share your data with other users. Currently only users within an Org subscription can share data.

Delete data : Delete a set of files/folders/datasets in bulk.

Transfer storage class : Transfer one or multiple files to a new storage class. It may take a while to transfer a large set of files. See the sketch below for the equivalent S3 operation.
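
A storage class transfer is equivalent to an S3 copy onto the same key with a new class. Below is a minimal boto3 sketch of that equivalent operation, with hypothetical bucket and key names (the platform performs this for you; this is not necessarily its actual implementation):

import boto3

# Re-copy an object onto itself with a new storage class.
s3 = boto3.client("s3")
s3.copy_object(
    Bucket="bioinfopipe-org1",
    Key="upload/project-1/sample.fastq.gz",
    CopySource={"Bucket": "bioinfopipe-org1",
                "Key": "upload/project-1/sample.fastq.gz"},
    StorageClass="GLACIER_IR",     # the new storage class
    MetadataDirective="COPY",      # keep the existing metadata
)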


3. Handling file types

In Bioinfopipe, each file object can be associated with a set of file types, which define the range of file types applicable to the file. Additionally, when configuring a tool, its input/output file mappings can be assigned a set of file types to specify the allowed file types for those inputs/outputs.

To access the file types management functionality, simply click the 'Manage file types' button in the Datafile browser. This will open a console page titled 'Manage file types,' where you can view the public file types and create your own file types if you are a Pro/Org user.

You can view file types in a table or in a tree structure by clicking the switch button 'Table/Tree'. The columns in the table view are described as follows:

Numbering : Indicates the level of the file type in the tree structure.

Name : The name of the file type.

Parent : The parent name of the file type. 

Created at : The time when the file type was created.

Category : Indicates whether the file type is a category; a category is used to group a set of related file types.

By clicking the 'View details' icon button, you can view more properties of the selected file type; apart from the above, the other properties are:

ID : A pseudo ID specified under the namespace of the user ID.

Order : The order under its parent.

Description : A brief description for this file type.

File extensions : Possible file extensions used for this file type.

Invisible : Indicates whether this file type is hidden from modal popups.

Owner : The owner of this file type.


3.1. Creating a root category

Before creating your own file types, you need to create root categories to hold the sub file types. For example, you can define 'Format type', 'Experiment type', and 'Area' as root categories. The principle is that the first root category should be the file format, and the following root categories should be attributes of the data.

To create a root category, click the 'Create root category' button, which will pop up a form containing the following fields:

Name : The name of the root category.

Parent : The parent name of the root category which is fixed as your user ID.

Description : Put a brief description for the root category.

Order : The order of the root category; it is 0 by default, which means it will automatically be set to the current largest order number plus one.

Category : This is fixed as 'category'.

Attribute : Specify whether the root category is for file formats or for file attributes. Check it if you want to create a file attribute category.

Invisible : Check it if you want to hide this file type from modal popups.

You can edit or delete already created root categories by clicking the corresponding 'Edit' or 'Delete' icon buttons.


3.2. Creating a child file type

To create a child file type under a root category or parent file type, just click the corresponding 'Add child' icon button; a modal form will pop up containing the following fields:

Name : The name of the child file type.

Parent : The parent name of the child file type which is fixed as the selected parent file type.

File extensions : Put possible extensions for this file type, separated by ','.

Description : Put a brief description for the child file type.

Order : The order of the child file type; it is 0 by default, which means it will automatically be set to the current largest order number plus one.

Category : Check it if you want to define a file type category.

Invisible : Check it if you want to hide this file type from modal popups.

You can edit or delete any file types you have created by clicking the corresponding 'Edit' or 'Delete' icon buttons.


3.3. Scope of file types

We can assign a set of file types to data file objects, dataset objects, and input/output objects of tools in Bioinfopipe. This set of file types typically consists of two types of file properties:

File formats: Every set of file types must have at least one file format assigned. You can select a single file format or multiple formats from the same parent category, indicating that the file can be any of those formats. You can also select multiple formats from different parent categories.

Data attributes: Assigning data attributes to objects is optional. It is good practice to assign relevant data attributes to objects that have a known association with those attributes, such as sequence read types. You can select multiple attributes from the same parent category. Any unselected attributes will automatically be assigned to the object, indicating that the object can be associated with any of those unselected attributes.

The set of file types defines the scope of possible data types based on the specified file formats and data attributes. For example, if a set of file types includes two file formats (Fmt1 and Fmt2) and two sibling attributes (Attr1 and Attr2), the scope of data types will be constrained by the following rule:

( Fmt1 OR Fmt2 ) AND ( Attr1 OR Attr2 ) OR ( any of the unselected attributes )

By following this rule, we can determine whether the scope of one set of file types is contained within the scope of another. Any rule obtained from the rule above by replacing the file types in each bracketed term with their descendants is guaranteed to have a scope contained within the scope of the rule above.
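
One plausible reading of the example rule, sketched with the hypothetical names Fmt1/Fmt2 and Attr1/Attr2 plus an unselected sibling Attr3 (this illustrates the boolean logic only, not Bioinfopipe's actual implementation):

SCOPE_FORMATS = {"Fmt1", "Fmt2"}                # selected formats
SELECTED_ATTRS = {"Attr1", "Attr2"}             # selected sibling attributes
SIBLING_CATEGORY = {"Attr1", "Attr2", "Attr3"}  # the whole attribute category

def in_scope(fmt, attrs):
    """( Fmt1 OR Fmt2 ) AND ( Attr1 OR Attr2 OR any unselected attribute )."""
    if fmt not in SCOPE_FORMATS:
        return False
    # Attributes from the constrained category must be among the selected
    # ones; attributes from unconstrained categories pass automatically.
    return all(a in SELECTED_ATTRS or a not in SIBLING_CATEGORY for a in attrs)

print(in_scope("Fmt1", {"Attr2"}))          # True
print(in_scope("Fmt1", {"SomeOtherAttr"}))  # True: unconstrained category
print(in_scope("Fmt3", {"Attr1"}))          # False: format out of scope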

Using this concept, Bioinfopipe offers file type matching functionality in the analysis job settings. When selecting input files from the data selector or pipe-in links from upstream tools, users are informed about how the selected input files match the required file types of input parameters. This helps to avoid selecting incorrect input files or incompatible pipe-in links.


4. Handling datasets

A dataset is an object in Bioinfopipe that allows you to group a batch of input data files along with relevant metadata. It provides a way to organize and describe a set of data files comprehensively. By creating datasets, you can group your raw or processed data files together and add manual metadata to provide detailed descriptions of the data.

One of the advantages of using datasets is that they can be directly applied as inputs for tools in an analysis job. Instead of selecting individual files for each analysis job with multiple input files, you can simply use a dataset that contains the desired set of data files. This offers convenience and efficiency, particularly when you need to repeatedly apply the same set of data files in multiple analysis jobs.

By utilizing datasets, you can streamline your analysis workflow and avoid the need for repetitive file selection, ultimately saving time and effort.

Currently there are 3 predefined dataset types:

BasicSamples : A common dataset type for any sample-based data files. The predefined columns for batch metadata include: FileID, Unit, Label and SampleID. The predefined columns of sample metadata include: SampleID, SampleName, TechReplicate, BioReplicate, Treatment and Condition.

PairedEndSeq : A dataset type for paired-end sequencing data files with file types SEQS and PairedEnd. The predefined columns for batch metadata include: FileID-1, FileID-2, Unit, Label, SampleID, Run, Lane. The predefined columns of sample metadata include: SampleID, SampleName, TechReplicate, BioReplicate, Treatment and Condition.

SingleEndSeq : A dataset type for single-end sequencing data files with file types SEQS and SingleEnd. The predefined columns for batch metadata include: FileID, Unit, Label, SampleID, Run, Lane. The predefined columns of sample metadata include: SampleID, SampleName, TechReplicate, BioReplicate, Treatment and Condition.

The columns defined above for batch metadata are:

FileID : The pseudo file ID.

FileID-1 : The pseudo file ID for forward paired-end reads.

FileID-2 : The pseudo file ID for reverse paired-end reads.

Unit : The unit number in the batch.

Label : Specify a unique label name for this unit, which will be used in creating output file names; no spaces allowed. Try to make this label as short as possible while covering the necessary information about the experiment or sample.

SampleID : The sample ID defined in sample metadata.

The columns defined above for sample metadata are:

SampleID : Specify a unique sample ID.

SampleName : Specify a unique sample name.

TechReplicate : Can be a number or word used to group technical replicates.

BioReplicate : Can be a number or word used to group biological replicates.

Treatment : A word used to indicate a treatment group.

Condition : A word used to indicate a condition group.


4.1. Creating a dataset

To create a new dataset, click the 'Create dataset' button, which will open a form page titled 'Create a new dataset' containing 2 sub-forms, 'General settings' and 'Describe dataset'. Their fields are described as follows:

Name : Specify a dataset name.

Folder : Select a project folder or the reserved '__datasets__' folder to hold the dataset.

Dataset type : Specify a dataset type from predefined dataset types.

Description : Put a description for this dataset. 

By clicking the 'Create' button, a new dataset will be created and a dataset console will open for editing it.


4.2. Adding data files to a dataset

After creating a new dataset, the first thing to do is add data files. Clicking the 'Add datafile' button opens a modal popup showing a data file selector, from which you can select the files you would like to add to the dataset. After clicking the 'Save' button, you will see the selected data files in the table in the 'Data files' tab. The selected data files will also be shown in the table in the 'Metadata - batch' tab, where paired-end files are automatically assigned to the 'FileID-1' and 'FileID-2' columns based on their file names, with part of each name extracted as the Label.
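
The paired-end auto-assignment can be sketched as follows, assuming the common '_R1'/'_R2' filename convention (Bioinfopipe's actual heuristic may differ):

import re

files = ["liver_R1.fastq.gz", "liver_R2.fastq.gz",
         "heart_R1.fastq.gz", "heart_R2.fastq.gz"]

units = {}
for name in files:
    m = re.match(r"(.+)_R([12])\.fastq\.gz$", name)
    if m:
        label, read = m.groups()
        # The part before '_R1'/'_R2' becomes the Label.
        units.setdefault(label, {})[f"FileID-{read}"] = name

for label, pair in units.items():
    print(label, pair)
# liver {'FileID-1': 'liver_R1.fastq.gz', 'FileID-2': 'liver_R2.fastq.gz'}
# heart {'FileID-1': 'heart_R1.fastq.gz', 'FileID-2': 'heart_R2.fastq.gz'}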

You can remove one file or multiple files by clicking the 'Remove file' icon button in the tab 'Data files'.


4.3. Editing dataset metadata

There are 2 metadata tables for batch and samples respectively.

Metadata in the tab 'Metadata - batch' shows the batch list, where each row corresponds to a set of input files for a unit run. You can edit this table by clicking the pencil icon button, which will open a new page showing an editable Excel-like table where you can modify or add columns (attributes) and rows (units), and finally save the table.

Metadata in the tab 'Metadata - samples' shows the list of samples, where each row corresponds to a defined sample. It can be edited in the same way via its pencil icon button.

Note: the values in the 'Unit' and 'Label' columns should be unique for each row in the batch metadata table, and the values in the 'SampleID' and 'SampleName' columns should be unique for each row in the sample metadata table.


4.4. Copying as new dataset

By clicking the 'Copy as new' button, you can create a new dataset based on the current dataset. It will open a form page titled 'Copy dataset - <the current dataset name>', where the fields are grouped into 3 sub-forms: 'General settings', 'Describe dataset' and 'Set new batch'. The fields are described as follows:

Name : The new dataset name.

Folder : Specify a project folder or the reserved '__datasets__' folder for the new dataset.

Dataset type : Select a predefined dataset type for the new dataset.

Description : Put a description about this new dataset.

Meta columns : Here you can select 1 or 2 category columns to define a new dataset, which will be created by merging batch units based on the combination of the selected columns. You can rename the merged cells, which hold concatenated values by default. For example, you can choose 'SampleID' as the merging column; then all units with the same sample are merged into one unit whose set of input files forms a multi-input unit. Such a dataset can be applied to tools accepting multiple input files in a unit run, as sketched below.
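
The merge can be pictured with a toy pandas sketch over hypothetical values (the platform performs the equivalent merge for you):

import pandas as pd

batch = pd.DataFrame({
    "Unit":     [1, 2, 3, 4],
    "Label":    ["s1_run1", "s1_run2", "s2_run1", "s2_run2"],
    "FileID":   ["f1", "f2", "f3", "f4"],
    "SampleID": ["s1", "s1", "s2", "s2"],
})

# Merge units sharing a SampleID into one multi-input unit; merged cells
# hold concatenated values by default, mirroring the behaviour above.
merged = (batch.groupby("SampleID", as_index=False)
               .agg({"FileID": list, "Label": "_".join}))
print(merged)
#   SampleID    FileID            Label
# 0       s1  [f1, f2]  s1_run1_s1_run2
# 1       s2  [f3, f4]  s2_run1_s2_run2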