5.1. UCTB.dataset package

5.1.1. UCTB.dataset.data_loader module

class UCTB.dataset.data_loader.NodeTrafficLoader(dataset, city=None, data_range='all', train_data_length='all', test_ratio=0.1, closeness_len=6, period_len=7, trend_len=4, target_length=1, graph='Correlation', threshold_distance=1000, threshold_correlation=0, threshold_interaction=500, normalize=True, workday_parser=<function is_work_day_america>, with_lm=True, with_tpe=False, data_dir=None, **kwargs)

Bases: object

The data loader that extracts and processes data from a DataSet object.

Parameters
  • dataset (str) – Either the path to the dataset pickle file or the name of the dataset.

  • city (str or None) – The name of the city, or None if dataset is a file path. Default: None

  • data_range – The range of data extracted from self.dataset for further use. If set to 'all', all data in self.dataset will be used. If set to a float between 0.0 and 1.0, that leading proportion of the data in self.dataset will be used. If set to a list of two integers [start, end], the data from day start to day (end - 1) in self.dataset will be used. Default: 'all'

  • train_data_length – The length of the train data. If set to 'all', all data in the split train set will be used. If set to an int, the latest train_data_length days of data will be used as the train set. Default: 'all'

  • test_ratio (float) – The proportion of the data held out as the test set when the data are split into a train set and a test set. Default: 0.1

  • closeness_len (int) – The length of the closeness history. The closeness_len consecutive time slots immediately before the target will be used as closeness history. Default: 6

  • period_len (int) – The length of the period history. The data of the exact same time slot in the preceding period_len consecutive days will be used as period history. Default: 7

  • trend_len (int) – The length of the trend history. The data of the exact same time slot in the preceding trend_len consecutive weeks (every seven days) will be used as trend history. Default: 4

  • target_length (int) – The number of steps to predict from one piece of history data. Currently must be 1. Default: 1

  • graph (str) – The types of graphs used in neural methods. The value should name a subset of { 'Correlation', 'Distance', 'Interaction', 'Line', 'Neighbor', 'Transfer' }, with multiple graph names joined by '-', and the dataset must contain the data needed for the selected graphs. Default: 'Correlation'

  • threshold_distance (float) – Used in building the distance graph. If distance of two nodes in meters is larger than threshold_distance, the corresponding position of the distance graph will be 1 and otherwise 0. Default: 1000

  • threshold_correlation (float) – Used in building the correlation graph. If the Pearson correlation coefficient is larger than threshold_correlation, the corresponding position of the correlation graph will be 1 and otherwise 0. Default: 0

  • threshold_interaction (float) – Used in building the interaction graph. If, in the latest 12 months, the number of interactions between two nodes is larger than threshold_interaction, the corresponding position of the interaction graph will be 1 and otherwise 0. Default: 500

  • normalize (bool) – If True, apply min-max normalization to the data. Default: True

  • workday_parser – Used to build external features to be used in neural methods. Default: is_work_day_america

  • with_lm (bool) – If True, the data loader will build the graphs named in graph. Default: True

  • with_tpe (bool) – If True, the data loader will build time position embeddings. Default: False

  • data_dir (str or None) – The dataset directory. If set to None, a directory will be created. If dataset is a file path, data_dir should also be None. Default: None
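The data_range, test_ratio, and train_data_length rules described above can be sketched in plain Python. This is only an illustration, not UCTB's implementation: the helper names select_range and split_train_test are hypothetical, and the floor rounding and the choice to take the test set from the end of the series are assumptions.

```python
def select_range(num_slots, daily_slots, data_range="all"):
    """Return (start, end) slot indices selected by data_range (illustrative only)."""
    if data_range == "all":
        return 0, num_slots
    if isinstance(data_range, float):
        # Keep the leading proportion of the data (assumption: round down).
        return 0, int(num_slots * data_range)
    # Otherwise a [start_day, end_day] pair: days start_day .. end_day - 1.
    start_day, end_day = data_range
    return start_day * daily_slots, end_day * daily_slots


def split_train_test(num_slots, daily_slots, test_ratio=0.1, train_data_length="all"):
    """Return (train_start, train_end, test_end) slot indices.

    Assumption: the test set is the final test_ratio share of the slots.
    """
    test_len = int(num_slots * test_ratio)
    train_end = num_slots - test_len
    train_start = 0
    if train_data_length != "all":
        # Keep only the latest train_data_length days of the train split.
        train_start = max(0, train_end - train_data_length * daily_slots)
    return train_start, train_end, num_slots


# Ten days of hourly data: 240 slots, 24 slots per day.
full_range = select_range(240, 24, "all")
half_range = select_range(240, 24, 0.5)
day_range = select_range(240, 24, [2, 5])
default_split = split_train_test(240, 24, test_ratio=0.1)
short_split = split_train_test(240, 24, test_ratio=0.1, train_data_length=3)
```

With train_data_length=3, only the three most recent days before the test split remain in the train set in this sketch.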

dataset

The DataSet object storing basic data.

Type

DataSet

daily_slots

The number of time slots in one single day.

Type

int

station_number

The number of nodes.

Type

int

external_dim

The number of dimensions of external features.

Type

int

train_closeness

The closeness history of the train set data. When with_tpe is False, its shape is [train_time_slot_num, station_number, closeness_len, 1]. Along the closeness_len dimension, data are arranged from earlier time slots to later time slots. If closeness_len is set to 0, train_closeness will be an empty ndarray. train_period, train_trend, test_closeness, test_period, and test_trend have similar shapes and are constructed the same way.

Type

np.ndarray
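The shape and ordering of the history arrays can be illustrated with a small NumPy sketch. It assumes the lags follow the parameter descriptions (closeness: the preceding slots; period: the same slot on preceding days; trend: the same slot in preceding weeks). The helper build_history and the toy traffic array are hypothetical and do not reproduce UCTB's code.

```python
import numpy as np


def build_history(traffic, target_slots, offsets):
    """Gather traffic at (t - offset) for every target slot t.

    traffic: array of shape [time_slot_num, station_number]
    offsets: positive lags in time slots, ordered from largest to smallest,
             so the result runs from earlier to later time slots.
    Returns an array of shape [len(target_slots), station_number, len(offsets), 1].
    """
    target_slots = np.asarray(target_slots, dtype=int)
    offsets = np.asarray(offsets, dtype=int)
    index = target_slots[:, None] - offsets[None, :]   # [T, L]
    history = traffic[index]                            # [T, L, N]
    return history.transpose(0, 2, 1)[..., None]        # [T, N, L, 1]


daily_slots = 24
closeness_len, period_len, trend_len = 3, 2, 1

# Toy data: 200 hourly slots, 2 stations, traffic[t, n] = 2 * t + n.
traffic = np.arange(200 * 2).reshape(200, 2)

# Targets start once the longest lag (one week) is available.
target_slots = np.arange(trend_len * 7 * daily_slots, 200)

closeness_offsets = list(range(closeness_len, 0, -1))
period_offsets = [k * daily_slots for k in range(period_len, 0, -1)]
trend_offsets = [k * 7 * daily_slots for k in range(trend_len, 0, -1)]

closeness = build_history(traffic, target_slots, closeness_offsets)
period = build_history(traffic, target_slots, period_offsets)
trend = build_history(traffic, target_slots, trend_offsets)
```

Ordering the offsets from largest to smallest is what places earlier time slots first along the history dimension.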

train_y

The train set data. Its shape is [train_time_slot_num, station_number, 1]. test_y has similar shape and construction.

Type

np.ndarray

LM

If with_lm is True, the list of Laplacian matrices of graphs listed in graph.

Type

list
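As a rough illustration of how a thresholded correlation graph can lead to a Laplacian matrix, the sketch below thresholds Pearson correlations as described for threshold_correlation and then forms the unnormalized Laplacian L = D - A. The choice of Laplacian form, the removal of self-loops, and the helper names are all assumptions; this section does not state which form UCTB stores in LM.

```python
import numpy as np


def correlation_adjacency(traffic, threshold_correlation=0.0):
    """Binary graph: 1 where the Pearson correlation exceeds the threshold.

    traffic: array of shape [time_slot_num, station_number]
    """
    corr = np.corrcoef(traffic, rowvar=False)            # [N, N]
    adjacency = (corr > threshold_correlation).astype(np.float32)
    np.fill_diagonal(adjacency, 0.0)                     # assumption: no self-loops
    return adjacency


def unnormalized_laplacian(adjacency):
    """L = D - A, where D is the diagonal degree matrix (assumed form)."""
    degree = np.diag(adjacency.sum(axis=1))
    return degree - adjacency


# Three toy stations: 0 and 1 move together, 2 moves in the opposite direction.
toy_traffic = np.array([
    [1.0, 2.0, 4.0],
    [2.0, 4.0, 3.0],
    [3.0, 6.0, 2.0],
    [4.0, 8.0, 1.0],
])
toy_adjacency = correlation_adjacency(toy_traffic, threshold_correlation=0.0)
toy_laplacian = unnormalized_laplacian(toy_adjacency)
```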

make_concat(node='all', is_train=True)

Concatenates the closeness, period, and trend history data for use as model inputs.

Parameters
  • node (int or 'all') – Selects the nodes to include. If set to 'all', the concatenation result for all nodes is returned. If set to an integer, it is the index of the selected node. Default: 'all'

  • is_train (bool) – If set to True, train_closeness, train_period, and train_trend will be concatenated. If set to False, test_closeness, test_period, and test_trend will be concatenated. Default: True

Returns

Returns an ndarray of shape [time_slot_num, station_number, closeness_len + period_len + trend_len, 1], where time_slot_num is the temporal length of the train set if is_train is True, or of the test set if is_train is False. Along the third dimension (the concatenated history axis), data are arranged as earlier closeness -> later closeness -> earlier period -> later period -> earlier trend -> later trend.

Return type

np.ndarray
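The concatenation order described above can be mimicked with np.concatenate along the history axis. This is a minimal sketch under the stated shapes, not the library's make_concat; the function name concat_history and the constant-filled placeholder arrays are hypothetical, and node selection is left out.

```python
import numpy as np


def concat_history(closeness, period, trend):
    """Join [T, N, len, 1] arrays along axis 2: closeness, then period, then trend."""
    return np.concatenate([closeness, period, trend], axis=2)


time_slot_num, station_number = 5, 3
closeness_len, period_len, trend_len = 6, 7, 4

# Placeholder arrays filled with 0, 1 and 2 so the ordering is visible.
closeness = np.zeros((time_slot_num, station_number, closeness_len, 1))
period = np.ones((time_slot_num, station_number, period_len, 1))
trend = np.full((time_slot_num, station_number, trend_len, 1), 2.0)

merged = concat_history(closeness, period, trend)
```

Because each input already runs from earlier to later slots, the merged axis keeps that order within each of the three segments.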

5.1.2. UCTB.dataset.dataset module

class UCTB.dataset.dataset.DataSet(dataset, city=None, data_dir=None)

Bases: object

An object storing basic data from a formatted pickle file.

See also Build your own datasets.

Parameters
  • dataset (str) – Either the path to the dataset pickle file or the name of the dataset.

  • city (str or None) – The name of the city, or None if dataset is a file path. Default: None

  • data_dir (str or None) – The dataset directory. If set to None, a directory will be created. If dataset is a file path, data_dir should also be None. Default: None

data

The data directly from the pickle file. data may have a data['contribute_data'] dict to store supplementary data.

Type

dict

time_range

From data['TimeRange'], in the format [YYYY-MM-DD, YYYY-MM-DD], indicating the time range of the data.

Type

list

time_fitness

From data['TimeFitness'], indicating the length of a single time slot in minutes.

Type

int
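Since time_fitness is the length of one time slot in minutes, the number of slots per day follows by simple arithmetic. The sketch below shows that relation; the function name and the check that the slot length divides a day evenly are assumptions added for illustration.

```python
MINUTES_PER_DAY = 24 * 60


def daily_slots_from_fitness(time_fitness):
    """Number of time slots in one day for a slot length given in minutes."""
    if time_fitness <= 0 or MINUTES_PER_DAY % time_fitness != 0:
        # Assumption of this sketch: slots must tile a day exactly.
        raise ValueError("time_fitness must evenly divide 1440 minutes")
    return MINUTES_PER_DAY // time_fitness


hourly = daily_slots_from_fitness(60)
half_hourly = daily_slots_from_fitness(30)
five_minute = daily_slots_from_fitness(5)
```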

node_traffic

Data recording the main stream data of the nodes during the time range. From data['Node']['TrafficNode'], with shape [time_slot_num, node_num].

Type

np.ndarray

node_monthly_interaction

Data recording the monthly interaction of pairs of nodes. Its shape is [month_num, node_num, node_num]. It comes from data['Node']['TrafficMonthlyInteraction'] and is used to build the interaction graph. It is an optional attribute and can be set to an empty list if the interaction graph is not needed.

Type

np.ndarray
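To illustrate how monthly interaction counts relate to the threshold_interaction rule of NodeTrafficLoader, the sketch below sums the latest 12 months and applies the threshold. It is an assumption-laden example: the function name is hypothetical, and whether UCTB symmetrizes the counts or treats them as directed is not stated here, so the counts are used as given.

```python
import numpy as np


def interaction_adjacency(monthly_interaction, threshold_interaction=500):
    """1 where the interaction count over the latest 12 months exceeds the threshold.

    monthly_interaction: array of shape [month_num, node_num, node_num]
    """
    recent_total = monthly_interaction[-12:].sum(axis=0)   # [node_num, node_num]
    return (recent_total > threshold_interaction).astype(np.float32)


# Toy data: 14 months, 2 nodes, 50 interactions per month for every pair ...
monthly = np.full((14, 2, 2), 50.0)
# ... except node 0 -> node 1, which only sees 10 per month.
monthly[:, 0, 1] = 10.0

interaction_graph = interaction_adjacency(monthly, threshold_interaction=500)
```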

node_station_info

A dict storing the coordinates of nodes. It should be formatted as {id (may be arbitrary): [id (when sorted, should be consistent with the index of node_traffic), latitude, longitude, other notes]}. It comes from data['Node']['StationInfo'] and is used to build the distance graph. It is an optional attribute and can be set to an empty list if the distance graph is not needed.

Type

dict
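Putting the attributes above together, a pickle payload with the keys named in this section might look like the sketch below. It covers only those keys; the "Build your own datasets" guide may require additional fields, and all values, station names, and coordinates are made up for illustration. Whether the end date in TimeRange is inclusive is not specified in this section.

```python
import pickle

import numpy as np

time_slot_num, node_num, month_num = 48, 2, 1

dataset_dict = {
    # Time range as [YYYY-MM-DD, YYYY-MM-DD]; illustrative dates only.
    "TimeRange": ["2020-01-01", "2020-01-03"],
    # Length of one time slot in minutes.
    "TimeFitness": 60,
    "Node": {
        # Shape [time_slot_num, node_num].
        "TrafficNode": np.zeros((time_slot_num, node_num)),
        # Shape [month_num, node_num, node_num]; optional per the text above.
        "TrafficMonthlyInteraction": np.zeros((month_num, node_num, node_num)),
        # {arbitrary id: [sortable id, latitude, longitude, notes]}
        "StationInfo": {
            "a": [0, 40.71, -74.00, "hypothetical station A"],
            "b": [1, 40.73, -73.99, "hypothetical station B"],
        },
    },
}

# Round-trip through pickle in memory to confirm the structure survives.
restored = pickle.loads(pickle.dumps(dataset_dict))
```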