5.1. UCTB.dataset package
5.1.1. UCTB.dataset.data_loader module
class UCTB.dataset.data_loader.NodeTrafficLoader(dataset, city=None, data_range='all', train_data_length='all', test_ratio=0.1, closeness_len=6, period_len=7, trend_len=4, target_length=1, graph='Correlation', threshold_distance=1000, threshold_correlation=0, threshold_interaction=500, normalize=True, workday_parser=<function is_work_day_america>, with_lm=True, with_tpe=False, data_dir=None, **kwargs)
Bases: object
The data loader that extracts and processes data from a DataSet object.
Parameters
dataset (str) – The path of the dataset pickle file, or the name of the dataset.
city (str or None) – None if dataset is a file path; otherwise the name of the city. Default: None
data_range – The range of data extracted from self.dataset for further use. If set to 'all', all data in self.dataset will be used. If set to a float between 0.0 and 1.0, the leading (earliest) proportion of the data in self.dataset will be used. If set to a list of two integers [start, end], the data from day start to day (end - 1) of self.dataset will be used. Default: 'all'
train_data_length – The length of the train data. If set to 'all', all data in the split train set will be used. If set to an int, the latest train_data_length days of data will be used as the train set. Default: 'all'
test_ratio (float) – The proportion of the data assigned to the test set when the data is split into a train set and a test set. Default: 0.1
closeness_len (int) – The length of the closeness history. The preceding closeness_len consecutive time slots of data will be used as the closeness history. Default: 6
period_len (int) – The length of the period history. The data at the same time slot in each of the preceding period_len consecutive days will be used as the period history. Default: 7
trend_len (int) – The length of the trend history. The data at the same time slot in each of the preceding trend_len consecutive weeks (every seven days) will be used as the trend history. Default: 4
target_length (int) – The number of steps to predict from one piece of history data. Currently this must be 1. Default: 1
graph (str) – The types of graphs used in neural methods. The graphs should be a subset of {'Correlation', 'Distance', 'Interaction', 'Line', 'Neighbor', 'Transfer'}, concatenated by '-', and the dataset should contain the data for the selected graphs. Default: 'Correlation'
threshold_distance (float) – Used in building the distance graph. If the distance between two nodes in meters is larger than threshold_distance, the corresponding position of the distance graph will be 1, and otherwise 0. Default: 1000
threshold_correlation (float) – Used in building the correlation graph. If the Pearson correlation coefficient is larger than threshold_correlation, the corresponding position of the correlation graph will be 1, and otherwise 0. Default: 0
threshold_interaction (float) – Used in building the interaction graph. If, in the latest 12 months, the number of interactions between two nodes is larger than threshold_interaction, the corresponding position of the interaction graph will be 1, and otherwise 0. Default: 500
normalize (bool) – If True, min-max normalization is applied to the data. Default: True
workday_parser – Used to build external features for use in neural methods. Default: is_work_day_america
with_lm (bool) – If True, the data loader will build graphs according to graph. Default: True
with_tpe (bool) – If True, the data loader will build time position embeddings. Default: False
data_dir (str or None) – The dataset directory. If set to None, a directory will be created. If dataset is a file path, data_dir should be None too. Default: None
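The three accepted forms of data_range can be read as a slice over the time axis. The following is a minimal sketch of that reading, not UCTB's implementation: resolve_data_range and slots_per_day are hypothetical names introduced here, and rounding down in the float case is an assumption.

```python
def resolve_data_range(data_range, total_slots, slots_per_day):
    """Return the (start, end) slot indices selected by data_range.

    Hypothetical helper for illustration only; it is not part of UCTB.
    """
    if data_range == 'all':
        # Use every time slot.
        return 0, total_slots
    if isinstance(data_range, float):
        # Keep the leading fraction of the data (rounding down is an assumption).
        if not 0.0 <= data_range <= 1.0:
            raise ValueError('a float data_range must lie in [0.0, 1.0]')
        return 0, int(total_slots * data_range)
    # Otherwise expect [start, end] in days: day start through day (end - 1).
    start_day, end_day = data_range
    return start_day * slots_per_day, end_day * slots_per_day


# 10 days of hourly data give 240 time slots.
print(resolve_data_range('all', 240, 24))    # (0, 240)
print(resolve_data_range(0.5, 240, 24))      # (0, 120)
print(resolve_data_range([2, 5], 240, 24))   # (48, 120)
```

The returned pair is half-open, so the [2, 5] case covers days 2, 3 and 4, matching the "start day to (end - 1) day" wording above.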
train_closeness
The closeness history of the train set data. When with_tpe is False, its shape is [train_time_slot_num, station_number, closeness_len, 1]. Along the closeness_len dimension, data are arranged from earlier time slots to later time slots. If closeness_len is set to 0, train_closeness will be an empty ndarray. train_period, train_trend, test_closeness, test_period, and test_trend have similar shape and construction.
Type
np.ndarray
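To make the three history types concrete, the sketch below lists which earlier time slots would feed a single target slot t. It is an illustration under stated assumptions, not UCTB's code: history_indices and slots_per_day are hypothetical names, and the exact offsets UCTB uses may differ.

```python
def history_indices(t, closeness_len, period_len, trend_len, slots_per_day):
    """Return slot indices for closeness, period and trend, earliest first.

    Hypothetical helper for illustration only; it is not part of UCTB.
    """
    # Closeness: the closeness_len slots immediately before t.
    closeness = [t - k for k in range(closeness_len, 0, -1)]
    # Period: the same time slot on each of the previous period_len days.
    period = [t - k * slots_per_day for k in range(period_len, 0, -1)]
    # Trend: the same time slot on each of the previous trend_len weeks.
    trend = [t - k * 7 * slots_per_day for k in range(trend_len, 0, -1)]
    return closeness, period, trend


closeness, period, trend = history_indices(
    t=1000, closeness_len=3, period_len=2, trend_len=2, slots_per_day=24)
print(closeness)  # [997, 998, 999]
print(period)     # [952, 976]
print(trend)      # [664, 832]
```

Each list runs from earlier to later slots, mirroring the ordering described for train_closeness, and a length of 0 yields an empty list.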
train_y
The train set data. Its shape is [train_time_slot_num, station_number, 1]. test_y has similar shape and construction.
Type
np.ndarray
make_concat(node='all', is_train=True)
A function that concatenates all closeness, period, and trend history data for use as model inputs.
Parameters
node (int or 'all') – The index of the node to select. If set to 'all', the concatenation result of all nodes is returned. If set to an integer, it is used as the index of the selected node. Default: 'all'
is_train (bool) – If set to True, train_closeness, train_period, and train_trend will be concatenated. If set to False, test_closeness, test_period, and test_trend will be concatenated. Default: True
Returns
An ndarray with shape [time_slot_num, station_number, closeness_len + period_len + trend_len, 1], where time_slot_num is the temporal length of the train set data if is_train is True, or of the test set data if is_train is False. Along the concatenated dimension (axis 2, of length closeness_len + period_len + trend_len), data are arranged as earlier closeness -> later closeness -> earlier period -> later period -> earlier trend -> later trend.
Return type
np.ndarray
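The shape and ordering of the returned array can be reproduced with plain NumPy. This is a sketch of the documented behaviour using placeholder arrays, not the body of make_concat; the array sizes and fill values are chosen only for illustration.

```python
import numpy as np

time_slot_num, station_number = 5, 3
closeness_len, period_len, trend_len = 6, 7, 4

# Stand-in arrays with the documented shapes; the values are placeholders
# (0 for closeness, 1 for period, 2 for trend) so the ordering is visible.
closeness = np.zeros((time_slot_num, station_number, closeness_len, 1))
period = np.ones((time_slot_num, station_number, period_len, 1))
trend = np.full((time_slot_num, station_number, trend_len, 1), 2.0)

# Join along the history axis (axis 2): closeness, then period, then trend.
concat = np.concatenate([closeness, period, trend], axis=2)
print(concat.shape)        # (5, 3, 17, 1)
print(concat[0, 0, :, 0])  # six 0s, then seven 1s, then four 2s
```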
5.1.2. UCTB.dataset.dataset module
class UCTB.dataset.dataset.DataSet(dataset, city=None, data_dir=None)
Bases: object
An object storing basic data from a formatted pickle file.
See also Build your own datasets.
Parameters
dataset (str) – The path of the dataset pickle file, or the name of the dataset.
city (str or None) – None if dataset is a file path; otherwise the name of the city. Default: None
data_dir (str or None) – The dataset directory. If set to None, a directory will be created. If dataset is a file path, data_dir should be None too. Default: None
data
The data loaded directly from the pickle file. data may have a data['contribute_data'] dict to store supplementary data.
Type
time_range
From data['TimeRange'], in the format [YYYY-MM-DD, YYYY-MM-DD], indicating the time range of the data.
Type
node_traffic
Data recording the main stream data of the nodes during the time range. From data['Node']['TrafficNode'], with shape [time_slot_num, node_num].
Type
np.ndarray
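As an illustration of how an array shaped [time_slot_num, node_num] relates to the threshold_correlation rule of NodeTrafficLoader, the sketch below thresholds pairwise Pearson correlations. It is an assumption-laden example rather than UCTB's graph builder: correlation_graph is a hypothetical name, and keeping the diagonal (self-correlation) as 1 is a choice made here.

```python
import numpy as np


def correlation_graph(node_traffic, threshold_correlation=0.0):
    """Binary adjacency from Pearson correlation between node series.

    node_traffic has shape [time_slot_num, node_num].
    Hypothetical helper for illustration only; it is not part of UCTB.
    """
    # np.corrcoef treats rows as variables, so pass nodes as rows.
    corr = np.corrcoef(node_traffic.T)
    return (corr > threshold_correlation).astype(np.float32)


# Nodes 0 and 1 rise together; node 2 falls over the same slots.
traffic = np.array([[1.0, 2.0, 9.0],
                    [2.0, 4.0, 7.0],
                    [3.0, 6.0, 4.0],
                    [4.0, 8.0, 1.0]])
graph = correlation_graph(traffic, threshold_correlation=0.0)
print(graph)
```

A node whose series is constant has undefined correlation (NaN), which compares as not larger than the threshold and therefore yields 0 in this sketch.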
node_monthly_interaction
Data recording the monthly interaction between pairs of nodes. Its shape is [month_num, node_num, node_num]. It comes from data['Node']['TrafficMonthlyInteraction'] and is used to build the interaction graph. It is an optional attribute and can be set to an empty list if the interaction graph is not needed.
Type
np.ndarray
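The interaction graph rule described for threshold_interaction (count interactions over the latest 12 months, then compare with the threshold) can be sketched as follows. This is a minimal illustration, not UCTB's implementation: interaction_graph is a hypothetical name, and summing the monthly counts is an assumption about how "the number of times of interaction" is obtained.

```python
import numpy as np


def interaction_graph(monthly_interaction, threshold_interaction=500):
    """Binary adjacency from interaction counts over the latest 12 months.

    monthly_interaction has shape [month_num, node_num, node_num].
    Hypothetical helper for illustration only; it is not part of UCTB.
    """
    # Keep the latest 12 months (fewer if less data is available).
    recent = np.asarray(monthly_interaction)[-12:]
    # Total interactions per node pair over that window (an assumption).
    total = recent.sum(axis=0)
    return (total > threshold_interaction).astype(np.float32)


# 14 months, 2 nodes: 50 trips per month from node 0 to node 1,
# and 30 per month in the other direction.
monthly = np.zeros((14, 2, 2))
monthly[:, 0, 1] = 50
monthly[:, 1, 0] = 30
graph = interaction_graph(monthly, threshold_interaction=500)
print(graph)
```

Here the 0 -> 1 pair totals 600 over 12 months and exceeds the threshold, while the 1 -> 0 pair totals 360 and does not, so the resulting matrix is not symmetric.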
node_station_info
A dict storing the coordinates of nodes. It shall be formatted as {id (may be arbitrary): [id (when sorted, should be consistent with the index of node_traffic), latitude, longitude, other notes]}. It comes from data['Node']['StationInfo'] and is used to build the distance graph. It is an optional attribute and can be set to an empty list if the distance graph is not needed.
Type