Imputation
Overview
Imputation Scenarios
MPSTime supports univariate time-series imputation with three key patterns of missing data:
- Individual missing points (e.g., values missing at $t = 10, 20, 80$)
- Single contiguous blocks (e.g., values missing from $t = 25-70$)
- Multiple contiguous blocks (e.g., values missing from $t = 5-10$, $t = 25-50$ and $t = 80-90$)
MPSTime can also handle any combination of these patterns. For instance, you might need to impute a single contiguous block from $t = 10-30$, plus individual missing points at $t = 50$ and $t=80$.
Setup
The first step is to train an MPS. Here, we'll train an MPS in an unsupervised manner (no class labels) using a noisy trendy sinusoid.
# Fix rng seed
using Random
rng = Xoshiro(1)
# dataset size
ntimepoints = 100
ntrain_instances = 300
ntest_instances = 200
# generate the train and test datasets
X_train, _ = trendy_sine(ntimepoints, ntrain_instances; sigma=0.1, rng=rng);
X_test, _ = trendy_sine(ntimepoints, ntest_instances; sigma=0.1, rng=rng);
# hyper parameters and training
opts = MPSOptions(d=10, chi_max=40, sigmoid_transform=false);
mps, info, test_states= fitMPS(X_train, opts);
Next, we initialize an imputation problem. This does a lot of necessary pre-computation:
julia> imp = init_imputation_problem(mps, X_test);
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Summary:
- Dataset has 300 training samples and 200 testing samples.
Slicing MPS into individual states...
- 1 class(es) were detected.
- Time independent encoding - Legendre - detected.
- d = 10, chi_max = 40
Re-encoding the training data to get the encoding arguments...
Created 1 ImputationProblem struct(s) containing class-wise mps and test samples.
A summary of the imputation problem setup is printed to verify the model parameters and dataset information. For multi-class data, you can pass y_test
to init_imputation_problem
in order to exploit the labels / class information while doing imputation.
Imputing missing values
Single-block Imputation
Now, decide how you want to impute the missing data. The necessary options are:
class::Integer
: The class of the time-series instance we are going to impute, leave as zero for "unlabelled" data (i.e., all data belong to the same class).impute_sites
: The MPS sites (time points) that are missing (inclusive).instance_idx
: The time-series instance from the chosen class in the test set.method
: The imputation method to use. Can be trajectories (ITS), median, mode, mean, etc...
In this example, we will consider a single block of contiguous missing values, simulated from a missing-at-random mechanism (MAR). We will use the median to impute the missing values, as well as computing a 1-Nearest Neighbor Imputation (1-NNI) baseline for comparison:
class = 0
pm = 0.8 # 80% missing data
instance_idx = 59 # time series instance in test set
_, impute_sites = mar(X_test[instance_idx, :], pm; state=42) # simulate MAR mechanism
method = :median
imputed_ts, pred_err, target_ts, stats, plots = MPS_impute(
imp,
class,
instance_idx,
impute_sites,
method;
NN_baseline=true, # whether to also do a baseline imputation using 1-NNI
plot_fits=true, # whether to plot the fits
)
Several outputs are returned from MPS_impute
:
imputed_ts
: The imputed time-series instance, containing the original data points and the predicted values.pred_err
: The prediction error for each imputed value, given a known ground-truth.target_ts
: The original time-series instance containing missing values.stats
: A collection of statistical metrics (MAE and MAPE) evaluating imputation performance with respect to a ground truth. Includes baseline performance whenNN_baseline=true
.plots
: Stores plot object(s) in an array for visualization whenplot_fits=true
.
We can inspect the imputation performance in a summary table:
julia> using PrettyTables
julia> pretty_table(stats[1]; header=["Metric", "Value"], header_crayon= crayon"yellow bold", tf = tf_unicode_rounded);
╭─────────┬───────────╮
│ Metric │ Value │
├─────────┼───────────┤
│ MAE │ 0.0855211 │
│ MAPE │ 0.125649 │
│ NN_MAE │ 0.11304 │
│ NN_MAPE │ 0.168085 │
╰─────────┴───────────╯
Here, MAE and MAPE correspond to the mean absolute error (MAE) and mean absolute percentage error (MAPE) for the MPS, while the NN_ prefix corresponds to the same errors for the 1-NNI baseline.
To plot the imputed time series, we can call the plot function as follows:
julia> using Plots
julia> plot(plots...)
The solid orange line depicts the "ground-truth" (observed) time-series values, the dotted blue line is the MPS-imputed data points and the dotted red line is the 1-NNI baseline. The blue shading indicates the uncertainty due to encoding error.
There are a lot of other options, and many more imputation methods to choose from! See MPS_impute
for more details.
Multi-block Imputation
Building on the previous example of single-block imputation, MPSTime can also be used to impute missing values in multiple blocks of contiguous points. For example, consider missing points between $t = 10-25$, $t = 40-60$ and $t = 75-90$:
class = 0
impute_sites = vcat(collect(10:25), collect(40:60), collect(65:90))
instance_idx = 32
method = :median
imputed_ts, pred_err, target_ts, stats, plots = MPS_impute(
imp,
class,
instance_idx,
impute_sites,
method;
NN_baseline=true, # whether to also do a baseline imputation using 1-NNI
plot_fits=true, # whether to plot the fits
)
Individual Point Imputation
To impute individual points rather than ranges of consecutive points (blocks), we can simply pass their respective time points into the imputation function as a vector:
impute_sites = [10] # only impute t = 10
impute_sites = [10, 25, 50] # impute multiple individual points
Plotting Trajectories
To plot individual trajectories from the conditional distribution, use method=:ITS
. Here, we'll plot 10 randomly selected trajectories for the missing points by setting the num_trajectories
keyword:
class = 0
impute_sites = collect(10:90)
instance_idx = 59
method = :ITS
imputed_ts, pred_err, target_ts, stats, plots = MPS_impute(
imp,
class,
instance_idx,
impute_sites,
method;
NN_baseline=false, # whether to also do a baseline imputation using 1-NN
plot_fits=true, # whether to plot the fits
num_trajectories=10, # number of trajectories to plot
rejection_threshold=2.5 # limits how unlikely we allow the random trajectories to be.
# there are more options! see [`MPS_impute`](@ref)
)
plot(plots...)
Plotting cumulative distribution functions
It can be interesting to inspect the probability distribution being sampled from at each missing time point. To enable this, we provide the get_cdfs
function, which works very similarly to MPS_impute
, only it returns the CDF at each missing time point in the encoding domain.
cdfs, ts, pred_err, target = get_cdfs(
imp,
class,
instance_idx,
impute_sites
);
xvals = imp.x_guess_range.xvals[1:10:end]
plot(xvals, cdfs[1][1:10:end]; legend=:none)
p = last([plot!(xvals, cdfs[i][1:10:end]) for i in eachindex(cdfs)])
ylabel!("cdf(x)")
xlabel!("x_t")
title!("CDF at each time point.")
Docstrings
MPSTime.init_imputation_problem
— Methodinit_imputation_problem(W::TrainedMPS, X_test::AbstractMatrix, y_test::AbstractArray=zeros(Int, size(X_test,1)), [custom_encoding::MPSTime.Encoding]; <keyword arguments>) -> imp::ImputationProblem
init_imputation_problem(W::TrainedMPS, X_test::AbstractMatrix, [custom_encoding::MPSTime.Encoding]); <keyword arguments>) -> imp::ImputationProblem
Initialise an imputation problem using a trained MPS and relevent test data.
This involves a lot of pre-computation, which can be quite time intensive for data-driven bases. For unclassed/unsupervised data y_test
may be omitted. If the MPS was trained with a custom encoding, then this encoding must be passed to init_imputation_problem
.
Keyword Arguments
guess_range::Union{Nothing, Tuple{<:Real,<:Real}}=nothing
: The range of values that guesses are allowed to take. This range is applied to normalised, encoding-adjusted time-series data. To allow any guess, leave as nothing, or set to encoding.range (e.g. [(-1., 1.) for the legendre encoding]).dx::Float64 = 1e-4
: The spacing between possible guesses in normalised, encoding-adjusted units. When imputing missing data with an MPS method, the imputed values will be selected from range(guess_range...; step=dx)verbosity::Integer=1
: The verbosity of the initialisation process. Useful for debugging, or to completely suppress output.test_encoding::Bool=true
: Whether to double check the encoding and scaling options are correct. This is strongly recommended but has a slight performance cost, so may be disabled.
MPSTime.MPS_impute
— FunctionMPS_impute(imp::ImputationProblem,
class::Any,
instance::Integer,
missing_sites::AbstractVector{<:Integer},
method::Symbol=:median;
<keyword arguments>) -> (imputed_instance::Vector, errors::Vector, target::Vector, stats::Dict, p::Vector{Plots.Plot})
Impute the missing_sites
using an MPS-based approach, selecting the trajectory from the conditioned distribution with method
See init_imputation_problem
for constructing an ImputationProblem
instance out of a trained MPS. The
instance` number is relative to the class, so class 1, instance 2 would be distinct from class 2, instance 2.
Imputation Methods
:median
: For each missing value, compute the probability density function of the possible outcomes from the MPS, and choose the median. This method is the most robust to outliers. Keywords:get_wmad::Bool=true
: Whether to return an 'error' vector that computes the Weighted Median Absolute Deviation (WMAD) of each imputed value.
:mean
: For each missing value, compute the probability density function of the possible outcomes from the MPS, and choose the expected value. Keywords:get_std::Bool=true
: Whether to return an 'error' vector that computes standard deviation of each imputed value.
:mode
: For each missing value, choose the most likely outcome predicted by the MPS. Keywords:max_jump::Union{Number,Nothing}=nothing
: The largest jump allowed between two adjacent imputations. Leave asnothing
to allow any jump. Helpful to suppress 'spikes' caused by poor support near the encoding domain edges.
:ITS
: For each missing value, choose a value at random with probability weighted by the probability density function of the possible outcomes. Keywords:rseed::Integer=1
: Random seed for producing the trajectories.- `num_trajectories::Integer=1: Number of trajectories to compute.
rejection_threshold::Union{Float64, Symbol}=:none
: Number of WMADs allowed between adjacent points. Setting this low helps suppress rapidly varying trajectories that occur by bad luck.max_trials::Integer=10
: Number of attempts allowed to make guesses conform to rejection_threshold before giving up.
:kNearestNeighbour
: Select thek
nearest neighbours in the training set using Euclidean distance to the known data. Keyword:k
: Number of nearest neighbours to return. SeekNN_impute
Keyword Arguments
impute_order::Symbol=:forwards
: Whether to impute the missing values:forwards
(left to right) or:backwards
(right to left)NN_baseline::Bool=true
: Whether to also impute the missing data with a k-Nearest Neighbour baseline.n_baselines::Integer=1
: How many nearest neighbour baselines to compute.plot_fits::Bool=true
: Whether to make a plot showing the target timeseries, the missing values, and the imputed region. If false, then p will be an empty vector. The plot will show the NN_baseline (if it was computed), as well as every trajectory if using the :ITS method.get_metrics::Bool=true
: Whether to compute imputation metrics, if false, thenstats
, will be empty.full_metrics::Bool=false
: Whether to compute every metric (MAPE, SMAPE, MAE, MSE, RMSE) or just MAE and MAPE.print_metric_table::Bool=false
: Whether to print thestats
as a table.invert_transform::Bool=true
:, # Whether to undo the sigmoid transform/minmax normalisation before returning the imputed points. If this is false, imputed_instance, errors, target timeseries, stats, and plot y-axis will all be scaled by the data preprocessing / normalisation and fit to the encoding domain.kwargs...
: Extra keywords passed to the imputation method. See the Imputation Methods section.
MPSTime.get_cdfs
— Functionget_cdfs(imp::ImputationProblem,
class::Any,
instance::Integer,
missing_sites::AbstractVector{<:Integer},
method::Symbol=:median;
<keyword arguments>) -> (cdfs::Vector{Vector}, ts::Vector, pred_err::Vector, target_timeseries_full::Vector)
Impute the missing_sites
using an MPS-based approach, selecting the trajectory from the conditioned distribution with method
, and returns the cumulative distribution function used to infer each missing value.
See MPS_impute
for a list of imputation methods and keyword arguments (does not support plotting, stats, or kNN baselines). See init_imputation_problem
for constructing an ImputationProblem instance. The instance
number is relative to the class, so class 1, instance 2 would be distinct from class 2, instance 2.
MPSTime.trendy_sine
— Functiontrendy_sine(T::Int, n::Int; period=nothing, slope=nothing, phase=nothing, sigma=0.0,
rng=Random.GLOBAL_RNG, return_metadata=true) -> Tuple{Matrix{Float64}, Dict{Symbol, Any}}
Generate n
time series of length T
, each composed of a sine wave with an optional linear trend and Gaussian noise defined by the equation:
\[x_t = \sin\left(\frac{2\pi}{\tau}t + \psi\right) + \frac{mt}{T} + \sigma n_t\]
with period $\tau$, time point $\t$, linear trend slope $\m$, phase offset $\psi$, noise scale $\sigma$ and $\n_t \sim \N(0,1)$
Arguments
T::Int
: Length of each time seriesn::Int
: Number of time series instances to generate
Keyword Arguments
period
: Period of the sinusoid, τnothing
: Random values between 1.0 and 50.0 (default)Float64
: Fixed period for all time seriesTuple
: Bounds for uniform random values, e.g., (1.0, 20.0) → τ ~ U(1.0, 20.0)Vector
: Sample from discrete uniform distribution, e.g., τ ∈ 10, 20, 30
slope
: Linear trend gradient, mnothing
: Random values bewteen -5.0 and 5.0 (default)Float64
: Fixed slope for all time seriesTuple
: Bounds for uniform random values, e.g., (-3.0, 3.0) → m ~ U(-3.0, 3.0)Vector
: Sample from discrete uniform distribution, e.g., m ∈ -3.0, 0.0, 3.0
phase
: Phase offset, ψnothing
: Random values between 0 and 2π (default)Float64
: Fixed phase for all time seriesTuple
: Bounds for uniform random values, e.g., (0.0, π) → ψ ~ U(0.0, π)Vector
: Sample from discrete uniform distribution
sigma::Real
: Standard deviation of Gaussian noise, σ (default: 0.0)rng::AbstractRNG
: Random number generator for reproducibility (default: Random.GLOBAL_RNG)return_metadata::Bool
: Return generation parameters (default: true)
Returns
- Matrix{Float64} of shape (n, T)
- Dictionary of generation parameters (:period, :slope, :phase, :sigma, :T, :n)
MPSTime.state_space
— Functionstate_space(T::Int, n::Int, s::Int=2; sigma::Float64=0.3, rng::AbstractRNG}=Random.GLOBAL_RNG) -> Matrix{Float64}
Generate n
time series of length T
each from a state space model with residual terms drawn from a normal distribution N(0, sigma
) and lag order s
. Time series are generated from the following model:
\[\x_t = \mu_t + \theta_t + \eta_t \mu_t = \mu_{t-1} + \lambda_{t-1} + \xi_t \lambda_t = \lambda_{t-1} + \zeta_{t} \theta_t = \sum_{j=1}^{s-1} - \theta_{t-j} + \omega_t\]
where $\x_t$ is the $\t$-th value in the time series, and the residual terms $\eta_t$, $\xi_t$, $\zeta_t$ and $\omega_t$ are randomly drawn from a normal distribution $\N(0, \sigma)$.
Arguments
T
– Time series length.n
– Number of time-series instances.
Keyword Arguments
s
– Lag order (optional, default:2
).sigma
– Noise standard deviation (optional, default:0.3
).rng
– Random number generator of typeAbstractRNG
(optional, default:Random.GLOBAL_RNG
).
Returns
A Matrix{Float64} of shape (n, T) containing the simulated time-series instances.
Internal imputation methods:
Internal imputation methods
MPSTime.impute_median
— Functionimpute missing data points using the median of the conditional distribution (single site rdm ρ).
Arguments
class_mps::MPS
:opts::Options
: MPS parameters.enc_args::AbstractVector
x_guess_range::EncodedDataRange
timeseries::AbstractVector{<:Number}
: The input time series data that will be imputed.timeseries_enc::MPS
: The encoded version of the time series represented as a product state.imputation_sites::Vector{Int}
: Indices in the time series where imputation is to be performed.get_wmad::Bool
: Whether to compute the weighted median absolute deviation (WMAD) during imputation (default isfalse
).
Returns
A tuple containing:
median_values::Vector{Float64}
: The imputed median values at the specified imputation sites.wmad_value::Union{Nothing, Float64}
: The weighted median absolute deviation ifget_wmad
is true; otherwise,nothing
.
MPSTime.impute_ITS
— FunctionImpute a SINGLE trajectory using inverse transform sampling (ITS).
MPSTime.kNN_impute
— FunctionkNN_impute(imp::ImputationProblem,
class::Any, instance::Integer,
missing_sites::AbstractVector{<:Integer};
k::Integer=1) -> [neighbour1::Vector, neighbour2::Vector, ...]
Impute missing_sites
using the k
nearest neighbours in the test set, based on Euclidean distance.
See init_imputation_problem
for constructing an ImputationProblem instance. The instance
number is relative to the class, so class 1, instance 2 would be distinct from class 2, instance 2.