Sunday, May 22, 2022

Time Series Clustering with Dynamic Time Warping



Table of Contents



Section 1 - Problem Definition
Section 1.1 - Project Summary

Section 2 - Data Preparation
Section 2.1 - Working Directory and Required Libraries
Section 2.2 - Import Data

Section 3 - Exploratory Data Analysis
Section 3.1 - Plot of Time Series Data
Section 3.2 - Dynamic Time Warping (DTW)

Section 4 - Model Development
Section 4.1 - Creating Distance Measure Models
Section 4.2 - Time Series Distance Measure Visualizations

Section 5 - Evaluation





Section 1 - Problem Definition



This project compares three distance measures for time series clustering: Dynamic Time Warping (DTW), Euclidean distance, and the Global Alignment Kernel (GAK). The analysis covers the required R libraries, data import, exploratory data analysis, distance measure model development, visualizations, and model evaluation with Cluster Validity Indices (CVIs).



Section 1.1 - Project Summary

This project is implemented in R Markdown format, with the following objectives:
1) Present a comparison of Dynamic Time Warping (DTW), Euclidean Distance, and the Global Alignment Kernel (GAK).
2) Evaluate the time series distance measures via Cluster Validity Indices (CVIs).
3) Display the results of the time series distance measures via meaningful, target-oriented visualizations.
4) Describe the Data Science techniques used to produce the results.



Section 2 - Data Preparation

Section 2.1 - Working Directory and Required Libraries

The interactive programming for this Time Series Distance Measure Evaluation was done in the RStudio Cloud Integrated Development Environment (IDE). In addition to the basic capabilities of the R programming language, several R packages provide the functions used for the distance measure algorithms.

The base R function “setwd()” sets the working directory of the R environment to the directory that contains the source files. How long that setting persists depends on the operating system and on the state of the IDE session. The “getwd()” function is then used to verify that the working directory points to the right location.

This section also loads the required R packages. The function “library()” attaches a package to the current session so that its functions can be called. The packages loaded here cover data manipulation, report tables, distance measures, and time series clustering.



# setwd("C:/Users/...")
# getwd()

library(tidyr)
library(dplyr)
library(knitr)
library(dtw)
library(BBmisc)
library(dtwclust)
library(TSdist)



Section 2.2 - Import Data

The dataset imported for the distance measure evaluation is a data frame of 35,040 observations of electricity load data for the one-year period of 1/1/2014 - 1/1/2015. The fields are separated by semicolons. After import, the date field is converted to the R Date class. Then two of the time series are extracted for the distance measure evaluation, each reduced to 214 observations by the ts() call, because computing distances over all 35,040 observations at once would be computationally prohibitive.

data <- read.csv("Electrictiy_load.csv", sep = ";")
data$date <- as.Date(data$date, format="%d.%m.%Y")

customer_1 <- ts(data$Customer1, start=c(2014, 1, 1),
                 end=c(2015, 1, 1), frequency=213)

customer_2 <- ts(data$Customer2, start=c(2014, 1, 1),
                 end=c(2015, 1, 1), frequency=213)
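
As a quick check of how ts() sizes a series from start, end, and frequency, here is a minimal sketch on synthetic data. The vector load_values is a hypothetical stand-in for a 35,040-row column, not the electricity data, and the start and end are written as (year, period) pairs, which is an assumption about how the call above is interpreted. In this sketch the result has 214 observations, and because the input is longer than that, ts() keeps only the leading values.

```r
# Hypothetical stand-in for a 35,040-row load column (not the real data)
load_values <- seq_len(35040)

# Same start/end/frequency pattern as above, written as (year, period) pairs
s <- ts(load_values, start = c(2014, 1), end = c(2015, 1), frequency = 213)

n_obs <- length(s)       # number of observations kept by ts()
kept  <- as.numeric(s)   # the values that ended up in the series

print(n_obs)
print(head(kept))
```

It is worth running a check like this on the real columns, since it shows exactly which observations enter the distance calculations that follow.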



Section 3 - Exploratory Data Analysis

Exploratory Data Analysis is an approach to analyzing datasets that summarizes their main characteristics, in order to decide on subsequent Time Series Clustering methods. The quality of the dataset should be examined to determine how useful the available data is. Regardless of its sophistication, a time series distance measure algorithm is limited by the accuracy of the data. If the data you are working with was collected or labeled by humans, reviewing a subset of it helps estimate the likely rate of human error.

The data should also be reviewed for missing values. Often, missing values can be replaced with the median of the corresponding dataset column. However, the more missing values the dataset contains, the less accurate the results of the Time Series Clustering are expected to be. The dataset chosen for Time Series Clustering should also be the right type of data for the insights that are needed. For example, if your company sells electronics in the US and plans to expand into Europe, you should gather data that supports Time Series Clustering of both markets.
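
The median replacement described above can be sketched in base R as follows. The vector usage is hypothetical example data, not a column from the electricity load file, and the approach is only one of several reasonable imputation choices.

```r
# Hypothetical column with two missing readings
usage <- c(4.2, NA, 3.9, 5.1, NA, 4.7)

# Count missing values first, to judge whether imputation is reasonable
n_missing <- sum(is.na(usage))

# Replace each missing value with the median of the observed values
col_median <- median(usage, na.rm = TRUE)
usage_filled <- ifelse(is.na(usage), col_median, usage)

print(n_missing)
print(usage_filled)
```

For a time series, interpolating between neighboring observations may preserve the shape better than a single column-wide median, so the choice should depend on how the gaps are distributed.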


Section 3.1 - Plot of Time Series Data

In Figure 1, the two time series extracted from the electricity load dataset are plotted over the same time frame, in order to visualize the data that is then processed with time series distance measuring and time series clustering.

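xrange <- range(data$date)
yrange <- range(c(data$Customer1, data$Customer2))

plot(xrange, yrange, xaxt = "n", type = "n",
     xlab = "time", ylab = "value",
     main = "Figure 1. Plot of Time Series Data")
# Label the x-axis once per month rather than once per observation
ticks <- seq(xrange[1], xrange[2], by = "month")
axis(1, at = ticks, labels = format(ticks, "%b %y"), cex.axis = .7)
# Plot each series against the date column so it lines up with the x-axis
lines(data$date, data$Customer1, col = 'blue', type = 'l')
lines(data$date, data$Customer2, col = 'magenta', type = 'l')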



Section 3.2 - Dynamic Time Warping (DTW)

In Figure 2, the dtw() function from the dtw package is used to calculate the DTW distance between the two series, and the resulting warping path is plotted. Diagonal segments of the path represent one-to-one matching. Vertical and horizontal segments represent many-to-one matching.

plot(dtw(customer_1, customer_2), xlab="customer_1", ylab="customer_2", main="Figure 2. DTW Matching")
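
To make the matching idea concrete, here is a minimal base R sketch of the dynamic programming recursion behind DTW. It is an illustration under simplifying assumptions, not the dtw package's implementation: the function name dtw_distance is hypothetical, the local cost is the absolute difference, and every step (diagonal, vertical, horizontal) has equal weight, whereas the dtw package's default step pattern weights steps differently, so its distances can differ from the ones computed here.

```r
# Simplified DTW: each cell adds its local cost to the cheapest of the
# three neighboring cells (diagonal, above, left)
dtw_distance <- function(x, y) {
  n <- length(x)
  m <- length(y)
  D <- matrix(Inf, nrow = n + 1, ncol = m + 1)
  D[1, 1] <- 0
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- abs(x[i] - y[j])
      D[i + 1, j + 1] <- cost + min(D[i, j], D[i, j + 1], D[i + 1, j])
    }
  }
  D[n + 1, m + 1]
}

# Two hypothetical series with the same shape, shifted by one step
a <- c(0, 1, 2, 1, 0)
b <- c(0, 0, 1, 2, 1)

dtw_ab       <- dtw_distance(a, b)
euclidean_ab <- sqrt(sum((a - b)^2))

print(dtw_ab)
print(euclidean_ab)
```

The shifted pair is closer under this DTW sketch than under point-by-point Euclidean distance, because the warping path is allowed to match one point to several neighbors.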

In Figure 3, a “threeway” variant of Figure 2 displays each series along its own axis, with the warping path in the main panel. The “keep=TRUE” parameter of the dtw() function is required for this plot type.

plot(dtw(customer_1, customer_2, keep=TRUE),
     xlab="customer_1", ylab="customer_2", type="threeway",
     main="Figure 3. Threeway DTW Matching Plot")

Figure 4 uses type=“twoway” to plot both series together, with lines connecting the matched points. The blue time series is customer_1, and the magenta time series is customer_2.

plot(dtw(customer_1,customer_2,keep=TRUE), type="twoway",
     col=c('blue', 'magenta'),
     main="Figure 4. Twoway DTW Matching Plot")



Section 4 - Model Development

In Section 4, the Dynamic Time Warping, Euclidean Distance, and Global Alignment Kernel models for Time Series Clustering are developed for the electricity load data. The “dtwclust” package for time series clustering allows “DTW”, “Euclidean”, or “GAK” to be specified as the distance measure, so the three measures can be evaluated side by side. The two time series are first normalized so that they are compared on similar numeric scales. The 214 time series dates are then grouped into six clusters.
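
The normalization step can be illustrated with a small base R sketch of z-score standardization, which is what "standardize" conventionally means: subtract the mean and divide by the standard deviation. The data frame loads and the helper standardize are hypothetical, and whether BBmisc::normalize applies the same calculation column by column to this dataset is an assumption to verify against the package documentation.

```r
# Hypothetical two-column data frame on very different scales
loads <- data.frame(a = c(10, 12, 14, 16),
                    b = c(1000, 1500, 1250, 1750))

# z-score: subtract the column mean, divide by the column standard deviation
standardize <- function(v) (v - mean(v)) / sd(v)
loads_z <- as.data.frame(lapply(loads, standardize))

print(colMeans(loads_z))
print(sapply(loads_z, sd))
```

After this step both columns have mean 0 and standard deviation 1, so neither dominates a distance calculation just because of its units.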



Section 4.1 - Creating Distance Measure Models

customer_data <- data.frame(customer_1, customer_2)
customer.data.norm <- BBmisc::normalize(customer_data,
                                        method="standardize")

dtw_clust <- tsclust(customer.data.norm, type="partitional",
                     k=6L, distance="dtw", centroid="pam")

euclidean_clust <- tsclust(customer.data.norm, type="partitional",
                           k=6L, distance="Euclidean",
                           centroid="pam")

gak_clust <- tsclust(customer.data.norm, type="partitional",
                     k=6L, distance="gak",
                     centroid="pam")



Section 4.2 - Time Series Distance Measure Visualizations

Figure 5 shows the series and centroid plot for the six Dynamic Time Warping clusters. Figure 6 shows the same plot for the six Euclidean Distance clusters, and Figure 7 for the six Global Alignment Kernel clusters. In each plot, the dashed line represents the medoid time series. The electricity load data for the two customers is separated into six general trends that represent upward and downward trends of electricity usage.

Tables 1, 2 and 3 display the assignment of the six clusters for the 214 data sample dates in the electricity load data.
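
Since the plots mark the medoid of each cluster, a short base R sketch of what a medoid is may help: it is the member of a group with the smallest total distance to the other members. The matrix series below is hypothetical, and plain Euclidean distance is used only for illustration; it is not how dtwclust computes its PAM centroids internally.

```r
# Hypothetical set of short series, one per row
series <- rbind(c(0, 0), c(1, 1), c(10, 10))

# Pairwise Euclidean distances between the rows
d <- as.matrix(dist(series))

# The medoid is the member with the smallest total distance to the others
medoid_index <- unname(which.min(rowSums(d)))

print(medoid_index)
print(series[medoid_index, ])
```

Unlike a mean, a medoid is always one of the observed series, which is why it can be drawn as an actual member of the cluster.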

cat("Figure 5. DTW Time Series Clusters")
## Figure 5. DTW Time Series Clusters
plot(dtw_clust, type = "sc")

cat("Figure 6. Euclidean Time Series Clusters")
## Figure 6. Euclidean Time Series Clusters
plot(euclidean_clust, type = "sc")

cat("Figure 7. Global Alignment Kernel Time Series Clusters")
## Figure 7. Global Alignment Kernel Time Series Clusters
plot(gak_clust, type = "sc")

kable(t(cbind(customer.data.norm[,0], cluster = dtw_clust@cluster)),
      caption = "Table 1. DTW Cluster Assignments of Time Series
      Dates")

Table 1. DTW Cluster Assignments of Time Series Dates
cluster: 3365556366363333333555533333312221111111444111211122111122221144412222222222222221444422113666333365553333333366366655533363632222112111444111112222111222111114441122222222222211444412213333333365556663333333335555

kable(t(cbind(customer.data.norm[,0],
              cluster = euclidean_clust@cluster)),
      caption = "Table 2. Euclidean Cluster Assignments of Time
      Series Dates")

Table 2. Euclidean Cluster Assignments of Time Series Dates
cluster: 2251114244242222222511522222263336666666111666366633666633336611163333333333333335111533662444422251112422222244244511122244426333663666111666663333333333666631116633333333333366111163362222222251114442222222225111

kable(t(cbind(customer.data.norm[,0], cluster = gak_clust@cluster)),
      caption = "Table 3. Global Alignment Kernel Cluster
      Assignments of Time Series Dates")

Table 3. Global Alignment Kernel Cluster Assignments of Time Series Dates
cluster: 2213336266262222226133126222255555555555333555555544555544445533354444444444544445333145552666622213332222222266266133362266625444554555333555555544554444555543335544444444544455333354452222622213336662666222221333



Figures 8 - 16 take a closer look at the first cluster for each distance measure: a combined series and centroid plot, followed by a series plot showing the members of the first cluster, and a centroids plot showing the series chosen as that cluster's medoid.

Figures 17, 18 and 19 are hierarchical dendrograms of the time series dates within the six clusters, for Dynamic Time Warping, Euclidean Distance, and Global Alignment Kernel Measures.

cat("Figure 8. DTW Series/Centroid Plot")
## Figure 8. DTW Series/Centroid Plot
plot(dtw_clust, type = "sc", clus = 1L)

cat("Figure 9. DTW Series Plot")
## Figure 9. DTW Series Plot
plot(dtw_clust, type = "series", clus = 1L)

cat("Figure 10. DTW Centroids Plot")
## Figure 10. DTW Centroids Plot
plot(dtw_clust, type = "centroids", clus = 1L)

cat("Figure 11. Euclidean Series/Centroid Plot")
## Figure 11. Euclidean Series/Centroid Plot
plot(euclidean_clust, type = "sc", clus = 1L)

cat("Figure 12. Euclidean Series Plot")
## Figure 12. Euclidean Series Plot
plot(euclidean_clust, type = "series", clus = 1L)

cat("Figure 13. Euclidean Centroids Plot")
## Figure 13. Euclidean Centroids Plot
plot(euclidean_clust, type = "centroids", clus = 1L)

cat("Figure 14. Global Alignment Kernel Series/Centroid Plot")
## Figure 14. Global Alignment Kernel Series/Centroid Plot
plot(gak_clust, type = "sc", clus = 1L)

cat("Figure 15. Global Alignment Kernel Series Plot")
## Figure 15. Global Alignment Kernel Series Plot
plot(gak_clust, type = "series", clus = 1L)

cat("Figure 16. Global Alignment Kernel Centroids Plot")
## Figure 16. Global Alignment Kernel Centroids Plot
plot(gak_clust, type = "centroids", clus = 1L)

set.seed(123)

clust.hier.dtw <- tsclust(customer.data.norm, type = "h",
                          k = 6L, distance = "dtw")
clust.hier.euclidean <- tsclust(customer.data.norm, type = "h",
                                k = 6L, distance = "euclidean")
clust.hier.gak <- tsclust(customer.data.norm, type = "h",
                          k = 6L, distance = "gak")

plot(clust.hier.dtw, main="Figure 17. DTW Dendrogram")

plot(clust.hier.euclidean, main="Figure 18. Euclidean Distance Dendrogram")

plot(clust.hier.gak, main="Figure 19. Global Alignment Kernel Distance Dendrogram")
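
The idea behind the dendrograms can be sketched with base R's hclust() and cutree() on a tiny hypothetical dataset. The matrix pts, the Euclidean distance, and the average linkage are all assumptions chosen for illustration; the tsclust() calls above use their own distance measures and settings.

```r
# Hypothetical rows: two tight groups that are far apart from each other
pts <- rbind(c(0, 0), c(0.1, 0), c(5, 5), c(5.1, 5))

# Build the tree from pairwise distances (average linkage is an assumption)
tree <- hclust(dist(pts), method = "average")

# Cut the tree into k groups, analogous to requesting k clusters
groups <- cutree(tree, k = 2)

print(groups)
```

Cutting the same tree at a different k gives a coarser or finer grouping without refitting, which is the main practical difference from the partitional clustering used earlier.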



Section 5 - Evaluation

For the final evaluation of the distance measures in this project, Cluster Validity Indices (CVIs) are used to assess the quality of the six clusters of time series data produced via Dynamic Time Warping, Euclidean Distance, and Global Alignment Kernel. Of the seven internal indices reported by the cvi() function, Sil, SF, CH, and D are to be maximized, while DB, DBstar, and COP are to be minimized.



kable(cvi(dtw_clust), caption = "Table 4. DTW - Cluster Validity Indices")

Table 4. DTW - Cluster Validity Indices
Index    Value
Sil      0.4161095
SF       0.0290512
CH       111.8179800
DB       1.0677429
DBstar   1.8132669
D        0.0128444
COP      0.1154342

kable(cvi(euclidean_clust), caption = "Table 5. Euclidean Distance - Cluster Validity Indices")

Table 5. Euclidean Distance - Cluster Validity Indices
Index    Value
Sil      0.4290294
SF       0.1217342
CH       130.4489850
DB       0.8211406
DBstar   1.8619141
D        0.0186354
COP      0.1152264

kable(cvi(gak_clust), caption = "Table 6. Global Alignment Kernel Distance - Cluster Validity Indices")

Table 6. Global Alignment Kernel Distance - Cluster Validity Indices
Index    Value
Sil      0.5571970
SF       0.6153544
CH       1100.6889099
DB       1.0270383
DBstar   10.0607217
D        0.0006024
COP      0.0433589
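
To read the three tables side by side, the following sketch copies the reported values into one matrix and picks the best-scoring measure per index. It assumes the usual direction for each index (Sil, SF, CH, and D maximized; DB, DBstar, and COP minimized). The object names cvis, maximize, and best are hypothetical.

```r
# Values copied from Tables 4, 5 and 6
cvis <- rbind(
  DTW       = c(Sil = 0.4161095, SF = 0.0290512, CH = 111.8179800,
                DB = 1.0677429, DBstar = 1.8132669, D = 0.0128444,
                COP = 0.1154342),
  Euclidean = c(0.4290294, 0.1217342, 130.4489850,
                0.8211406, 1.8619141, 0.0186354, 0.1152264),
  GAK       = c(0.5571970, 0.6153544, 1100.6889099,
                1.0270383, 10.0607217, 0.0006024, 0.0433589)
)

# Indices where a larger value is better; the rest are better when smaller
maximize <- c("Sil", "SF", "CH", "D")

# For each index, name the measure with the best value
best <- sapply(colnames(cvis), function(ix) {
  col <- cvis[, ix]
  rownames(cvis)[if (ix %in% maximize) which.max(col) else which.min(col)]
})

print(best)
```

The comparison should be treated as indicative rather than conclusive, since the indices for each clustering are presumably computed with that clustering's own distance measure and may not share a common scale.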
