scorelingam

Overview

This package implements the estimation of a topological ordering for a Linear Structural Equation Model (SEM) with non-Gaussian errors, as outlined in Ruiz et. al (2022+):

G Ruiz, OH Madrid-Padilla, Q Zhou. “Sequentially learning the topological ordering of directed acyclic graphs with likelihood ratio scores.” Transactions on Machine Learning Research, 2022+. [Link]

Package Docummentation

See the manual for function documentation. See below for some examples.

Installation

The package devtools is needed for installation.

devtools::install_github(repo='gabriel-ruiz/scorelingam')

Load scorelingam once installed.

library(scorelingam)

An example for our sorting procedure: p = 5

Generate data

Linear SEM Parameters

p = 5; numRoots = 1; numParentsMin = 1; numParentsMax = 2
scaleParam = runif(n=p,min=0.5,max=1.2)
causalOrder = 5:1
lingamParams = randomWeightedAdjacencyMatrix(p=p,num.roots=numRoots,pa.min=numParentsMin,pa.max=numParentsMax,
                    pa.wt.min = 0.25,pa.wt.max = 0.9,prob.pos = 0.5,perm = causalOrder)

# weighted adjacency matrix
print(lingamParams$B)

##           [,1]       [,2]       [,3]      [,4] [,5]
## [1,] 0.0000000  0.0000000  0.0000000 0.0000000    0
## [2,] 0.3839051  0.0000000  0.0000000 0.0000000    0
## [3,] 0.4824674 -0.8666502  0.0000000 0.0000000    0
## [4,] 0.0000000  0.0000000  0.6784132 0.0000000    0
## [5,] 0.0000000  0.0000000 -0.5450611 0.6573171    0

Data matrix with n = 5000

When generating the data matrix, the possible options for the family argument are ‘laplace’, ‘logistic’, and ‘t’, in which case the additional argument df is needed (df=10 is the default)

n = 5000
X = generateData(B=lingamParams$B,scale = scaleParam,perm = causalOrder,n = n,family = 'laplace')
dim(X)

## [1] 5000    5

Obtain a topological ordering estimate

When estimating a topological ordering for the linear SEM, the possible options for the family argument are ‘laplace’, ‘logistic’, and ‘t’, in which case the additional argument df is needed (df=10 is the default).

Specify the neighborhood sets

# neighborhoods specified to be all other nodes
mbhat = lapply(1:ncol(X),function(j){(1:ncol(X))[-j]}) 

Obtain the ordering estimate

estOrder = scorelingam(X=X,mb=mbhat,numUpdates=ncol(X),family='laplace')
print(estOrder)

##      [,1]
## [1,]    5
## [2,]    4
## [3,]    3
## [4,]    2
## [5,]    1

Check the accuracy of the ordering

Calculate the proportion of parents sorted after a child (lower is better).

# check errors (ideally close to zero)
checkSortingErrors(estOrder=estOrder,A=lingamParams$B)

## [1] 0

Check accuracy of estimated weighted adjacency matrix

paHat = getParents(mb=mbhat,ordering=estOrder)
(Bhat = getWeights(X=scale(X,scale=F,center=T),pa=paHat))

##              [,1]        [,2]       [,3]      [,4] [,5]
## [1,]  0.000000000  0.00000000  0.0000000 0.0000000    0
## [2,]  0.391906115  0.00000000  0.0000000 0.0000000    0
## [3,]  0.494511285 -0.85062512  0.0000000 0.0000000    0
## [4,] -0.008728058 -0.03435524  0.6737897 0.0000000    0
## [5,]  0.007557937  0.02683172 -0.5371846 0.6568952    0

# maximum entry-wise difference in absolute value
norm(x=lingamParams$B-Bhat,type = 'i')

## [1] 0.0477068

A higher dimensional example: p = 10, 000

Generate data

Linear SEM Parameters

p = 10000;numRoots = 100; numParentsMin = 1; numParentsMax = 2
scaleParam = runif(n=p,min=0.5,max=1.2)
causalOrder = c(seq(2,p,by=2),seq(1,p-1,by=2))
lingamParams = randomWeightedAdjacencyMatrix(p=p,num.roots=numRoots,pa.min=numParentsMin,
                                             pa.max=numParentsMax, pa.wt.min = 0.25,
                                             pa.wt.max = 0.9,prob.pos = 0.5,perm = causalOrder)

Data matrix with n = 5000

When generating the data matrix, the possible options for the family argument are ‘laplace’, ‘logistic’, and ‘t’, in which case the additional argument df is needed (df=10 is the default)

n = 5000
X = generateData(B=lingamParams$B,scale = scaleParam,perm = causalOrder,n = n,family = 'laplace')
dim(X)

## [1]  5000 10000

Obtain a topological ordering estimate and time the algorithm.

When estimating a topological ordering for the Linear SEM, the possible options for the family argument are ‘laplace’, ‘logistic’, and ‘t’, in which case the additional argument df is needed (df=10 is the default).

Specify the neighborhood sets

# neighborhoods specified to be true markov blankets
mbhat = moralize(lingamParams$B) 

Obtain the ordering estimate

start = Sys.time()
estOrder = scorelingam(X=X,mb=mbhat,numUpdates=5,family='laplace')
end = Sys.time()
difftime(end,start,units='mins')

## Time difference of 2.0367 mins

Check accuracy of estimated ordering

Calculate the proportion of parents sorted after child (lower is better).

checkSortingErrors(estOrder=estOrder,A=lingamParams$B)

## [1] 0.08549762