[#6698] Using different values to indicate sparsity

Date: 2021-01-27 00:31
Priority: 3
State: Open
Submitted by: Andrew Lindsay (knacko)
Assigned to: Nobody (None)
Product: Software A
Operating System: All
Component: R
Summary: Using different values to indicate sparsity

Detailed description
Using zero as the collapsing value for sparse matrices is not useful if zero is a meaningful explicit value in your data, for example when zero indicates no relationship between two variables. Such explicit zeros can no longer be told apart from missing entries once a row is retrieved from the sparse matrix.

> m <- sparseMatrix(c(1,2,2), c(1,2,3), x = c(0,100,100))
> m
2 x 3 sparse Matrix of class "dgCMatrix"

[1,] 0 . .
[2,] . 100 100
> m[1,]
[1] 0 0 0 <- Cannot discriminate datapoints

In addition, descriptive statistics (mean, SD, etc.) computed on rows are wrong, because missing entries are returned as zeros.

> m[2,]
[1] 0 100 100
> sd(m[2,])
[1] 57.73503 <- Should be zero

A fix would be to allow NA as an option for the collapsing value. This would keep zero available as an explicit value, and many summary functions accept na.rm as a parameter.
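Until something like that exists, the effect can be approximated by hand. The following is only a sketch: rowWithNA() is a hypothetical helper, not part of Matrix, and it assumes the matrix is a "dgCMatrix" so that the slots @i, @p and @x can be read directly.

```r
library(Matrix)

# Hypothetical helper (not part of Matrix): row i of a dgCMatrix as a dense
# vector, with NA instead of 0 wherever no entry is stored.
rowWithNA <- function(x, i) {
    stopifnot(is(x, "dgCMatrix"))
    cols <- rep.int(seq_len(ncol(x)), diff(x@p))  # column of each stored entry
    keep <- x@i == i - 1L                         # @i holds 0-based row indices
    out <- rep(NA_real_, ncol(x))
    out[cols[keep]] <- x@x[keep]
    out
}

m <- sparseMatrix(i = c(1, 2, 2), j = c(1, 2, 3), x = c(0, 100, 100))

r1 <- rowWithNA(m, 1)   # the explicit zero survives, the other entries are NA
r2 <- rowWithNA(m, 2)
sd(r2, na.rm = TRUE)    # computed over the two stored values only
mean(r2, na.rm = TRUE)
```

The result is a dense vector, so this only makes sense for single rows or for matrices with a modest number of columns.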

Even more useful would be a user-selectable collapsing value. In a sparse matrix holding only binary data in which 1 is far more common than 0, collapsing on 1 instead of 0 would reduce the overall size of the object.
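The binary case can be imitated today by storing the complement of the data, so that the common value 1 becomes the implicit one. This is a sketch under assumptions: the 95% share of ones and the names dense, direct, complement and restored are made up for illustration, and the caller has to remember that the implicit value is 1.

```r
library(Matrix)

set.seed(1)
# Illustrative data: a 100 x 100 binary matrix in which about 95% of entries are 1.
dense <- matrix(rbinom(1e4, size = 1, prob = 0.95), nrow = 100, ncol = 100)

direct     <- Matrix(dense, sparse = TRUE)      # stores every 1
complement <- Matrix(1 - dense, sparse = TRUE)  # stores only the positions of the 0s

length(direct@x)       # number of stored entries when 0 is the implicit value
length(complement@x)   # number of stored entries when 1 is the implicit value
object.size(direct)
object.size(complement)

# Recover the original data from the complement.
restored <- 1 - as.matrix(complement)
```

The same trick does not generalise to non-binary data, which is why a built-in collapsing value would still be needed there.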

Comments:

Date: 2022-08-21 17:22
Sender: Mikael Jagan

It would be "nice" if <sparseMatrix>[drop=TRUE] would return a sparseVector preserving structural zeros rather than a traditional vector, which does not distinguish between structural and nonstructural zeros.

However, I suspect that such a change would break a lot of existing code. I guess that we could provide utilities, e.g., getSparseRow() and getSparseCol(), with that functionality, rather than changing our methods for the extract operator `[`...

In the meantime, you can use these versions, which do no argument checking:

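# Row i of x as a sparseVector (no argument checking).
getSparseRow <- function(x, i) {
    d <- x@Dim
    v <- as(x, "sparseVector")  # column-major order
    # Row i occupies positions i, i + nrow, i + 2 * nrow, ...
    k <- seq.int(from = i, by = d[1L], length.out = d[2L])
    v[k]
}

# Column j of x as a sparseVector (no argument checking).
getSparseCol <- function(x, j) {
    m <- x@Dim[1L]
    v <- as(x, "sparseVector")  # column-major order
    # Column j occupies the m consecutive positions starting at 1 + (j - 1) * m.
    k <- seq.int(from = 1 + (j - 1) * m, length.out = m)
    v[k]
}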

Then you could do (for example):

> (r <- getSparseRow(m, 2))
sparse vector (nnz/length = 2/3) of class "dsparseVector"
[1] . 100 100
> sd(r@x)
[1] 0
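If the statistic is needed for every row, the stored entries can also be grouped by row in one pass instead of extracting rows one at a time. This is a rough sketch that assumes a "dgCMatrix" and reads its @i and @x slots directly; the names rows, storedMean, storedSd and storedN are illustrative only.

```r
library(Matrix)

m <- sparseMatrix(i = c(1, 2, 2), j = c(1, 2, 3), x = c(0, 100, 100))

# Assumption: m is a dgCMatrix, so @i holds the 0-based row index of every
# stored entry (explicit zeros included) and @x holds the matching values.
rows <- factor(m@i + 1L, levels = seq_len(nrow(m)))

storedMean <- tapply(m@x, rows, mean)          # mean of stored entries per row
storedSd   <- tapply(m@x, rows, sd)            # NA for rows with fewer than two stored entries
storedN    <- tabulate(rows, nbins = nrow(m))  # number of stored entries per row
```

Rows without any stored entry come back as NA from tapply(), which keeps them distinguishable from rows whose stored values happen to be zero.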
