## Introduction

When it comes to exploratory data analysis, we’ll often encounter data series with missing values. The challenge is deciding which time series to keep and how to score them. The simplest approach is to compute the total percentage of missing data. But this has a big flaw: it can’t differentiate the quality of time series that have the same number of missing data points positioned differently.

Let’s take a look at the following two data vectors: [NA,1,NA,1,NA,1,NA,1] and [1,1,1,1,NA,NA,NA,NA].
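As a minimal sketch, a quick check confirms that the total missing percentage cannot tell these two vectors apart. The helper name `missing_share` is made up for illustration.

```r
# The two example vectors from the text.
v1 <- c(NA, 1, NA, 1, NA, 1, NA, 1)
v2 <- c(1, 1, 1, 1, NA, NA, NA, NA)

# Share of missing values: the mean of a logical vector is the fraction of TRUEs.
missing_share <- function(x) mean(is.na(x))

missing_share(v1)  # 0.5
missing_share(v2)  # 0.5
```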

The recovery rate for these two vectors is different. The first time series is usually considered easier to impute, i.e. to estimate its missing values, because each gap is a single point next to observed values, whereas the second series ends in a run of four missing points. Because of this difference, I have come up with another method to score the quality of a series: the porosity score. The concept is derived from environmental physics. It computes an adjusted porosity score for the time series vector by considering how the missing or bad data is positioned, that is, whether the gaps are isolated points (for example, one at every k-th index) or contiguous runs, and how large each block of missing data is. It then adjusts each block's impact on the overall dataset accordingly.

The porosity score proposed here penalizes each block of missing data by its size. The bigger a contiguous hole is, the worse the data.
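To make the block idea concrete, here is a small sketch for the two vectors above. It assumes that weighting a block by its own size means a block of size n contributes n^2 to the penalty. The helper name `block_sizes` is hypothetical, and base R's `rle` is used here only as one convenient way to find the runs.

```r
v1 <- c(NA, 1, NA, 1, NA, 1, NA, 1)
v2 <- c(1, 1, 1, 1, NA, NA, NA, NA)

# Hypothetical helper: sizes of the consecutive runs of NA in a vector.
block_sizes <- function(x) {
  runs <- rle(is.na(x))
  runs$lengths[runs$values]
}

block_sizes(v1)  # 1 1 1 1
block_sizes(v2)  # 4

# Each block weighted by its own size: a block of size n contributes n^2.
sum(block_sizes(v1)^2)  # 4
sum(block_sizes(v2)^2)  # 16
```

Both vectors are missing half of their values, yet the single block of four is penalized four times as heavily as the four isolated gaps.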

## Define function

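The function is defined below as PorosityScore. By default it returns a list with the total, adjusted, and penalty scores together with the missing block sizes. With batch = TRUE it returns a single number, which is the penalty score unless adjusted = TRUE. The penalty score is the recommended metric. It penalizes each block of missing data differently: the penalty weight for a missing block of size 4 is 4, while it is 1 for a block of size 1, so the two blocks contribute 16 and 1 to the score. This makes sense because the bigger the hole, the worse the data should be. Blocks of size tolerance or smaller are left out of the adjusted score.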

```
PorosityScore <- function(tsIn, tolerance = 0, missingValue = NA, batch = FALSE,
                          adjusted = FALSE, penalty = TRUE) {
  # Flag missing entries: NA by default, or a user-supplied sentinel value.
  if (is.na(missingValue)) {
    isMissing <- is.na(tsIn)
  } else {
    isMissing <- !is.na(tsIn) & tsIn == missingValue
  }
  # Share of missing points, ignoring how they are positioned.
  totalPorosity <- sum(isMissing) / length(tsIn)
  # Sizes of the consecutive runs of missing values (the "holes").
  runs <- rle(isMissing)
  blockSizeVec <- runs$lengths[runs$values]
  if (length(blockSizeVec) == 0) {
    # No missing data: the adjusted and penalty scores are zero.
    adjPorosity <- 0
    PenaltyPorosity <- 0
    blockSizeVec <- NA
  } else {
    # Ignore blocks whose size does not exceed `tolerance`.
    resVecAdj <- blockSizeVec[blockSizeVec > tolerance]
    adjPorosity <- sum(resVecAdj) / length(tsIn)
    # Weight each block by its own size, so a block of size n contributes n^2.
    PenaltyPorosity <- sum(resVecAdj^2)
  }
  resultlist <- list("total.porosity.score" = totalPorosity,
                     "adjusted.porosity.score" = adjPorosity,
                     "PenaltyPorosity" = PenaltyPorosity,
                     "missing.blocksize" = blockSizeVec)
  if (!batch) {
    return(resultlist)
  }
  # Batch mode returns a single number: adjusted first, then penalty.
  if (adjusted) {
    return(adjPorosity)
  }
  if (penalty) {
    return(PenaltyPorosity)
  }
  # Fallback when both flags are off, instead of silently returning NULL.
  totalPorosity
}
```

## Example

Let’s look at two examples.

```
print("dataset one")
```

`## [1] "dataset one"`

```
a <- c(1,2,NA,3,NA,NA,4,5,6,7,8,NA,9,10,NA,NA)
result <- PorosityScore(a)
print(result)
```

```
## $total.porosity.score
## [1] 0.375
##
## $adjusted.porosity.score
## [1] 0.375
##
## $PenaltyPorosity
## [1] 10
##
## $missing.blocksize
## [1] 1 2 1 2
```

```
print("dataset two")
```

`## [1] "dataset two"`

```
a2 <- c(1,NA,2,3,4,NA,4,NA,6,NA,8,NA,9,10,NA)
result2 <- PorosityScore(a2)
print(result2)
```

```
## $total.porosity.score
## [1] 0.4
##
## $adjusted.porosity.score
## [1] 0.4
##
## $PenaltyPorosity
## [1] 6
##
## $missing.blocksize
## [1] 1 1 1 1 1 1
```


## Conclusion

As we can see, the function successfully distinguishes time series with different missing patterns. In the above example, the second vector has the higher total porosity (0.4 versus 0.375), but the first vector has the greater penalty score (10 versus 6) because its missing values sit in larger blocks. We can use this score to rank numeric features with missing data and filter out the worst ones.
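The ranking step could look roughly like the sketch below. Everything in it is illustrative: `penalty_score` is a hypothetical stand-alone helper that mirrors the penalty score with zero tolerance, the data frame is made up, and the cutoff of 4 is an arbitrary assumption.

```r
# Hypothetical helper mirroring the penalty score: sum of squared NA-block sizes.
penalty_score <- function(x) {
  runs <- rle(is.na(x))
  sum(runs$lengths[runs$values]^2)
}

# Made-up data frame for illustration only.
df <- data.frame(
  scattered = c(NA, 1, NA, 1, NA, 1, NA, 1),
  blocked   = c(1, 1, 1, 1, NA, NA, NA, NA),
  complete  = c(1, 2, 3, 4, 5, 6, 7, 8)
)

# One score per column, then rank from best (lowest) to worst.
scores <- sapply(df, penalty_score)
sort(scores)  # complete 0, scattered 4, blocked 16

# Keep columns whose score does not exceed the (arbitrary) cutoff.
keep <- names(scores)[scores <= 4]
df_filtered <- df[, keep, drop = FALSE]
```

Because the penalty score is not divided by the series length, a ranking like this is most meaningful when the columns being compared have the same number of rows, as they do in a data frame.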