Imputation

Overview

Imputation Scenarios

MPSTime supports univariate time-series imputation with three key patterns of missing data:

  1. Individual missing points (e.g., values missing at $t = 10, 20, 80$)
  2. Single contiguous blocks (e.g., values missing from $t = 25-70$)
  3. Multiple contiguous blocks (e.g., values missing from $t = 5-10$, $t = 25-50$ and $t = 80-90$)

MPSTime can also handle any combination of these patterns. For instance, you might need to impute a single contiguous block from $t = 10-30$, plus individual missing points at $t = 50$ and $t=80$.

Setup

The first step is to train an MPS. Here, we'll train an MPS in an unsupervised manner (no class labels) using a noisy trendy sinusoid.

# Fix rng seed
using Random
rng = Xoshiro(1)

# dataset size
ntimepoints = 100
ntrain_instances = 300
ntest_instances = 200

# generate the train and test datasets
X_train, _ = trendy_sine(ntimepoints, ntrain_instances; sigma=0.1, rng=rng);
X_test, _ = trendy_sine(ntimepoints, ntest_instances; sigma=0.1, rng=rng);

# hyper parameters and training
opts = MPSOptions(d=10, chi_max=40, sigmoid_transform=false);
mps, info, test_states= fitMPS(X_train, opts);

Next, we initialize an imputation problem. This does a lot of necessary pre-computation:

julia> imp = init_imputation_problem(mps, X_test);

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
                         Summary:

 - Dataset has 300 training samples and 200 testing samples.
Slicing MPS into individual states...
 - 1 class(es) were detected.
 - Time independent encoding - Legendre - detected.
 - d = 10, chi_max = 40
Re-encoding the training data to get the encoding arguments...

 Created 1 ImputationProblem struct(s) containing class-wise mps and test samples.

A summary of the imputation problem setup is printed to verify the model parameters and dataset information. For multi-class data, you can pass y_test to init_imputation_problem in order to exploit the labels / class information while doing imputation.

Imputing missing values

Single-block Imputation

Now, decide how you want to impute the missing data. The necessary options are:

  • class::Integer: The class of the time-series instance we are going to impute, leave as zero for "unlabelled" data (i.e., all data belong to the same class).
  • impute_sites: The MPS sites (time points) that are missing (inclusive).
  • instance_idx: The time-series instance from the chosen class in the test set.
  • method: The imputation method to use. Can be trajectories (ITS), median, mode, mean, etc...

In this example, we will consider a single block of contiguous missing values, simulated from a missing-at-random mechanism (MAR). We will use the median to impute the missing values, as well as computing a 1-Nearest Neighbor Imputation (1-NNI) baseline for comparison:

class = 0
pm = 0.8 # 80% missing data
instance_idx = 59 # time series instance in test set
_, impute_sites = mar(X_test[instance_idx, :], pm; state=42) # simulate MAR mechanism
method = :median

imputed_ts, pred_err, target_ts, stats, plots = MPS_impute(
    imp,
    class, 
    instance_idx, 
    impute_sites, 
    method; 
    NN_baseline=true, # whether to also do a baseline imputation using 1-NNI
    plot_fits=true, # whether to plot the fits
)

Several outputs are returned from MPS_impute:

  • imputed_ts: The imputed time-series instance, containing the original data points and the predicted values.
  • pred_err: The prediction error for each imputed value, given a known ground-truth.
  • target_ts: The original time-series instance containing missing values.
  • stats: A collection of statistical metrics (MAE and MAPE) evaluating imputation performance with respect to a ground truth. Includes baseline performance when NN_baseline=true.
  • plots: Stores plot object(s) in an array for visualization when plot_fits=true.

We can inspect the imputation performance in a summary table:

julia> using PrettyTables
julia> pretty_table(stats[1]; header=["Metric", "Value"], header_crayon= crayon"yellow bold", tf = tf_unicode_rounded);
╭─────────┬───────────╮
│  Metric │     Value │
├─────────┼───────────┤
│     MAE │ 0.0855211 │
│    MAPE │  0.125649 │
│  NN_MAE │   0.11304 │
│ NN_MAPE │  0.168085 │
╰─────────┴───────────╯

Here, MAE and MAPE correspond to the mean absolute error (MAE) and mean absolute percentage error (MAPE) for the MPS, while the NN_ prefix corresponds to the same errors for the 1-NNI baseline.

To plot the imputed time series, we can call the plot function as follows:

julia> using Plots
julia> plot(plots...)

The solid orange line depicts the "ground-truth" (observed) time-series values, the dotted blue line is the MPS-imputed data points and the dotted red line is the 1-NNI baseline. The blue shading indicates the uncertainty due to encoding error.

There are a lot of other options, and many more imputation methods to choose from! See MPS_impute for more details.

Multi-block Imputation

Building on the previous example of single-block imputation, MPSTime can also be used to impute missing values in multiple blocks of contiguous points. For example, consider missing points between $t = 10-25$, $t = 40-60$ and $t = 75-90$:

class = 0
impute_sites = vcat(collect(10:25), collect(40:60), collect(65:90))
instance_idx = 32
method = :median

imputed_ts, pred_err, target_ts, stats, plots = MPS_impute(
    imp,
    class, 
    instance_idx, 
    impute_sites, 
    method; 
    NN_baseline=true, # whether to also do a baseline imputation using 1-NNI
    plot_fits=true, # whether to plot the fits
)

Individual Point Imputation

To impute individual points rather than ranges of consecutive points (blocks), we can simply pass their respective time points into the imputation function as a vector:

impute_sites = [10] # only impute t = 10
impute_sites = [10, 25, 50] # impute multiple individual points

Plotting Trajectories

To plot individual trajectories from the conditional distribution, use method=:ITS. Here, we'll plot 10 randomly selected trajectories for the missing points by setting the num_trajectories keyword:

class = 0
impute_sites = collect(10:90)
instance_idx = 59
method = :ITS

imputed_ts, pred_err, target_ts, stats, plots = MPS_impute(
    imp,
    class, 
    instance_idx, 
    impute_sites, 
    method; 
    NN_baseline=false, # whether to also do a baseline imputation using 1-NN
    plot_fits=true, # whether to plot the fits
    num_trajectories=10, # number of trajectories to plot
    rejection_threshold=2.5 # limits how unlikely we allow the random trajectories to be.
    # there are more options! see [`MPS_impute`](@ref)
)

plot(plots...)

Plotting cumulative distribution functions

It can be interesting to inspect the probability distribution being sampled from at each missing time point. To enable this, we provide the get_cdfs function, which works very similarly to MPS_impute, only it returns the CDF at each missing time point in the encoding domain.

cdfs, ts, pred_err, target = get_cdfs(
    imp, 
    class, 
    instance_idx, 
    impute_sites
    );

xvals = imp.x_guess_range.xvals[1:10:end]

plot(xvals, cdfs[1][1:10:end]; legend=:none)
p = last([plot!(xvals, cdfs[i][1:10:end]) for i in eachindex(cdfs)])
ylabel!("cdf(x)")
xlabel!("x_t")
title!("CDF at each time point.")

Docstrings

MPSTime.init_imputation_problemMethod
init_imputation_problem(W::TrainedMPS, X_test::AbstractMatrix, y_test::AbstractArray=zeros(Int, size(X_test,1)), [custom_encoding::MPSTime.Encoding]; <keyword arguments>) -> imp::ImputationProblem
init_imputation_problem(W::TrainedMPS, X_test::AbstractMatrix, [custom_encoding::MPSTime.Encoding]); <keyword arguments>) -> imp::ImputationProblem

Initialise an imputation problem using a trained MPS and relevent test data.

This involves a lot of pre-computation, which can be quite time intensive for data-driven bases. For unclassed/unsupervised data y_test may be omitted. If the MPS was trained with a custom encoding, then this encoding must be passed to init_imputation_problem.

Keyword Arguments

  • guess_range::Union{Nothing, Tuple{<:Real,<:Real}}=nothing: The range of values that guesses are allowed to take. This range is applied to normalised, encoding-adjusted time-series data. To allow any guess, leave as nothing, or set to encoding.range (e.g. [(-1., 1.) for the legendre encoding]).
  • dx::Float64 = 1e-4: The spacing between possible guesses in normalised, encoding-adjusted units. When imputing missing data with an MPS method, the imputed values will be selected from range(guess_range...; step=dx)
  • verbosity::Integer=1: The verbosity of the initialisation process. Useful for debugging, or to completely suppress output.
  • test_encoding::Bool=true: Whether to double check the encoding and scaling options are correct. This is strongly recommended but has a slight performance cost, so may be disabled.
source
MPSTime.MPS_imputeFunction
MPS_impute(imp::ImputationProblem, 
           class::Any, 
           instance::Integer, 
           missing_sites::AbstractVector{<:Integer}, 
           method::Symbol=:median; 
           <keyword arguments>) -> (imputed_instance::Vector, errors::Vector, target::Vector, stats::Dict, p::Vector{Plots.Plot})

Impute the missing_sites using an MPS-based approach, selecting the trajectory from the conditioned distribution with method

See init_imputation_problem for constructing an ImputationProbleminstance out of a trained MPS. Theinstance` number is relative to the class, so class 1, instance 2 would be distinct from class 2, instance 2.

Imputation Methods

  • :median: For each missing value, compute the probability density function of the possible outcomes from the MPS, and choose the median. This method is the most robust to outliers. Keywords:

    • get_wmad::Bool=true: Whether to return an 'error' vector that computes the Weighted Median Absolute Deviation (WMAD) of each imputed value.
  • :mean: For each missing value, compute the probability density function of the possible outcomes from the MPS, and choose the expected value. Keywords:

    • get_std::Bool=true: Whether to return an 'error' vector that computes standard deviation of each imputed value.
  • :mode: For each missing value, choose the most likely outcome predicted by the MPS. Keywords:

    • max_jump::Union{Number,Nothing}=nothing: The largest jump allowed between two adjacent imputations. Leave as nothing to allow any jump. Helpful to suppress 'spikes' caused by poor support near the encoding domain edges.
  • :ITS: For each missing value, choose a value at random with probability weighted by the probability density function of the possible outcomes. Keywords:

    • rseed::Integer=1: Random seed for producing the trajectories.
    • `num_trajectories::Integer=1: Number of trajectories to compute.
    • rejection_threshold::Union{Float64, Symbol}=:none: Number of WMADs allowed between adjacent points. Setting this low helps suppress rapidly varying trajectories that occur by bad luck.
    • max_trials::Integer=10: Number of attempts allowed to make guesses conform to rejection_threshold before giving up.
  • :kNearestNeighbour: Select the k nearest neighbours in the training set using Euclidean distance to the known data. Keyword:

    • k: Number of nearest neighbours to return. See kNN_impute

Keyword Arguments

  • impute_order::Symbol=:forwards: Whether to impute the missing values :forwards (left to right) or :backwards (right to left)
  • NN_baseline::Bool=true: Whether to also impute the missing data with a k-Nearest Neighbour baseline.
  • n_baselines::Integer=1: How many nearest neighbour baselines to compute.
  • plot_fits::Bool=true: Whether to make a plot showing the target timeseries, the missing values, and the imputed region. If false, then p will be an empty vector. The plot will show the NN_baseline (if it was computed), as well as every trajectory if using the :ITS method.
  • get_metrics::Bool=true: Whether to compute imputation metrics, if false, then stats, will be empty.
  • full_metrics::Bool=false: Whether to compute every metric (MAPE, SMAPE, MAE, MSE, RMSE) or just MAE and MAPE.
  • print_metric_table::Bool=false: Whether to print the stats as a table.
  • invert_transform::Bool=true:, # Whether to undo the sigmoid transform/minmax normalisation before returning the imputed points. If this is false, imputed_instance, errors, target timeseries, stats, and plot y-axis will all be scaled by the data preprocessing / normalisation and fit to the encoding domain.
  • kwargs...: Extra keywords passed to the imputation method. See the Imputation Methods section.
source
MPSTime.get_cdfsFunction
get_cdfs(imp::ImputationProblem, 
           class::Any, 
           instance::Integer, 
           missing_sites::AbstractVector{<:Integer}, 
           method::Symbol=:median; 
           <keyword arguments>) -> (cdfs::Vector{Vector}, ts::Vector, pred_err::Vector, target_timeseries_full::Vector)

Impute the missing_sites using an MPS-based approach, selecting the trajectory from the conditioned distribution with method, and returns the cumulative distribution function used to infer each missing value.

See MPS_impute for a list of imputation methods and keyword arguments (does not support plotting, stats, or kNN baselines). See init_imputation_problem for constructing an ImputationProblem instance. The instance number is relative to the class, so class 1, instance 2 would be distinct from class 2, instance 2.

source
MPSTime.trendy_sineFunction
trendy_sine(T::Int, n::Int; period=nothing, slope=nothing, phase=nothing, sigma=0.0, 
    rng=Random.GLOBAL_RNG, return_metadata=true) -> Tuple{Matrix{Float64}, Dict{Symbol, Any}}

Generate n time series of length T, each composed of a sine wave with an optional linear trend and Gaussian noise defined by the equation:

\[x_t = \sin\left(\frac{2\pi}{\tau}t + \psi\right) + \frac{mt}{T} + \sigma n_t\]

with period $\tau$, time point $\t$, linear trend slope $\m$, phase offset $\psi$, noise scale $\sigma$ and $\n_t \sim \N(0,1)$

Arguments

  • T::Int: Length of each time series
  • n::Int: Number of time series instances to generate

Keyword Arguments

  • period: Period of the sinusoid, τ
    • nothing: Random values between 1.0 and 50.0 (default)
    • Float64: Fixed period for all time series
    • Tuple: Bounds for uniform random values, e.g., (1.0, 20.0) → τ ~ U(1.0, 20.0)
    • Vector: Sample from discrete uniform distribution, e.g., τ ∈ 10, 20, 30
  • slope: Linear trend gradient, m
    • nothing: Random values bewteen -5.0 and 5.0 (default)
    • Float64: Fixed slope for all time series
    • Tuple: Bounds for uniform random values, e.g., (-3.0, 3.0) → m ~ U(-3.0, 3.0)
    • Vector: Sample from discrete uniform distribution, e.g., m ∈ -3.0, 0.0, 3.0
  • phase: Phase offset, ψ
    • nothing: Random values between 0 and 2π (default)
    • Float64: Fixed phase for all time series
    • Tuple: Bounds for uniform random values, e.g., (0.0, π) → ψ ~ U(0.0, π)
    • Vector: Sample from discrete uniform distribution
  • sigma::Real: Standard deviation of Gaussian noise, σ (default: 0.0)
  • rng::AbstractRNG: Random number generator for reproducibility (default: Random.GLOBAL_RNG)
  • return_metadata::Bool: Return generation parameters (default: true)

Returns

  • Matrix{Float64} of shape (n, T)
  • Dictionary of generation parameters (:period, :slope, :phase, :sigma, :T, :n)
source
MPSTime.state_spaceFunction
state_space(T::Int, n::Int, s::Int=2; sigma::Float64=0.3, rng::AbstractRNG}=Random.GLOBAL_RNG) -> Matrix{Float64}

Generate n time series of length T each from a state space model with residual terms drawn from a normal distribution N(0, sigma) and lag order s. Time series are generated from the following model:

\[\x_t = \mu_t + \theta_t + \eta_t \mu_t = \mu_{t-1} + \lambda_{t-1} + \xi_t \lambda_t = \lambda_{t-1} + \zeta_{t} \theta_t = \sum_{j=1}^{s-1} - \theta_{t-j} + \omega_t\]

where $\x_t$ is the $\t$-th value in the time series, and the residual terms $\eta_t$, $\xi_t$, $\zeta_t$ and $\omega_t$ are randomly drawn from a normal distribution $\N(0, \sigma)$.

Arguments

  • T – Time series length.
  • n – Number of time-series instances.

Keyword Arguments

  • s – Lag order (optional, default: 2).
  • sigma – Noise standard deviation (optional, default: 0.3).
  • rng – Random number generator of type AbstractRNG (optional, default: Random.GLOBAL_RNG).

Returns

A Matrix{Float64} of shape (n, T) containing the simulated time-series instances.

source

Internal imputation methods:

Internal imputation methods

MPSTime.impute_medianFunction

impute missing data points using the median of the conditional distribution (single site rdm ρ).

Arguments

  • class_mps::MPS:
  • opts::Options: MPS parameters.
  • enc_args::AbstractVector
  • x_guess_range::EncodedDataRange
  • timeseries::AbstractVector{<:Number}: The input time series data that will be imputed.
  • timeseries_enc::MPS: The encoded version of the time series represented as a product state.
  • imputation_sites::Vector{Int}: Indices in the time series where imputation is to be performed.
  • get_wmad::Bool: Whether to compute the weighted median absolute deviation (WMAD) during imputation (default is false).

Returns

A tuple containing:

  • median_values::Vector{Float64}: The imputed median values at the specified imputation sites.
  • wmad_value::Union{Nothing, Float64}: The weighted median absolute deviation if get_wmad is true; otherwise, nothing.
source
MPSTime.kNN_imputeFunction
kNN_impute(imp::ImputationProblem, 
           class::Any, instance::Integer, 
           missing_sites::AbstractVector{<:Integer}; 
           k::Integer=1) -> [neighbour1::Vector, neighbour2::Vector, ...]

Impute missing_sites using the k nearest neighbours in the test set, based on Euclidean distance.

See init_imputation_problem for constructing an ImputationProblem instance. The instance number is relative to the class, so class 1, instance 2 would be distinct from class 2, instance 2.

source