Advanced Usage¶
Once you have installed (for Python users) or downloaded (for C++ users) the package, the module can be used in two different ways: by importing genif into Python or by including the respective sources in your C++ project.
Python¶
Basic operation¶
The main functionality of this package is implemented in the genif.GeneralizedIsolationForest class. For convenience, this class follows the well-established convention of libraries like scikit-learn, which require classifiers to provide a genif.GeneralizedIsolationForest.fit(), a genif.GeneralizedIsolationForest.fit_predict() and a genif.GeneralizedIsolationForest.predict() method. A very basic example using this library can hence be given as follows:
import numpy as np
from genif import GeneralizedIsolationForest
# Create some random demo data.
N = 1000
d = 50
X = np.random.random((N, d))
# Create the GIF classifier.
gif = GeneralizedIsolationForest(k=10, n_models=50, sample_size=256,
                                 kernel="rbf", kernel_scaling=[0.05], sigma=0.01)
# Fit the classifier and make predictions.
y_pred = gif.fit_predict(X)
For this example, we chose to divide every data region into k=10 subregions. The algorithm fits a total of 50 trees, each on a sample of 256 observations drawn from the provided dataset. To decide when tree induction terminates, the algorithm relies on an RBF kernel with a scaling of 0.05: induction stops as soon as the average pairwise kernel value of the observations in a particular subregion exceeds 0.01.
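The termination criterion described above can be illustrated with a minimal sketch. It assumes an RBF kernel of the form exp(-||x - y||^2 / scaling); the library may parametrize its kernel differently, and the helper name average_pairwise_rbf is hypothetical and not part of genif.

```python
import numpy as np


def average_pairwise_rbf(X, scaling):
    """Mean RBF kernel value over all distinct pairs of rows in X.

    Assumption: k(x, y) = exp(-||x - y||^2 / scaling). This is an
    illustrative form, not necessarily the one used inside genif.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    if n < 2:
        raise ValueError("need at least two observations")
    # Squared Euclidean distances between all pairs of rows.
    diff = X[:, None, :] - X[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    kernel = np.exp(-sq_dist / scaling)
    # Average over the strictly upper triangle, so each pair counts once.
    upper = np.triu_indices(n, k=1)
    return float(kernel[upper].mean())


rng = np.random.default_rng(0)
tight = rng.normal(scale=0.01, size=(20, 3))   # closely packed observations
spread = rng.normal(scale=5.0, size=(20, 3))   # widely scattered observations

sigma = 0.01
print(average_pairwise_rbf(tight, scaling=0.05) > sigma)
print(average_pairwise_rbf(spread, scaling=0.05) > sigma)
```

Under this assumed kernel, a closely packed subregion yields an average kernel value near one and would stop being split, whereas a widely scattered subregion stays far below sigma and would be divided further.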
The GIF algorithm internally fits a forest of Generalized Isolation Trees and estimates inlier probabilities for each data region it finds. After the procedure has finished, the returned value is assigned to y_pred, a vector of N = 1000 entries, each giving the probability that the corresponding input vector is an inlier. Probability values near one therefore indicate conforming (i.e. “normal”) behaviour. Conversely, probability values near zero indicate non-conforming (i.e. “anomalous”) behaviour.
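Since low probabilities mark anomalous observations, a common follow-up is to rank the inputs by their predicted value. The following sketch uses a hand-written stand-in for the prediction vector, so it runs without genif; the helper most_anomalous is a hypothetical name introduced for illustration.

```python
import numpy as np

# Stand-in for the vector returned by fit_predict(); the values are illustrative.
y_pred = np.array([0.97, 0.12, 0.88, 0.03, 0.91, 0.45])


def most_anomalous(probabilities, n):
    """Return the indices of the n observations with the lowest inlier probability."""
    probabilities = np.asarray(probabilities, dtype=float)
    # A stable sort keeps the original order among equal probabilities.
    order = np.argsort(probabilities, kind="stable")
    return order[:n]


top = most_anomalous(y_pred, n=2)
print(top)          # indices of the two most anomalous observations
print(y_pred[top])  # their inlier probabilities
```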
Independent training and testing¶
It is also possible to fit the GIF on one dataset and use another dataset for the actual predictions. In this case, you need to call genif.GeneralizedIsolationForest.fit() and genif.GeneralizedIsolationForest.predict() independently:
import numpy as np
from genif import GeneralizedIsolationForest
# Create some random demo data.
X_training = np.random.random((1000, 50))
X_testing = np.random.random((10, 50))
# Create the GIF classifier.
gif = GeneralizedIsolationForest(k=10, n_models=50, sample_size=256,
                                 kernel="rbf", kernel_scaling=[0.05], sigma=0.01)
# Fit the classifier and make predictions.
y_pred = gif.fit(X_training).predict(X_testing)
Warning
Remember that calling predict without a prior call to fit raises an exception.
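When only a single dataset is available, the training and testing parts can be obtained by splitting its rows. The sketch below uses plain NumPy; the helper split_rows and its parameters are hypothetical and not part of genif. The two resulting arrays can then be passed to fit and predict as shown above.

```python
import numpy as np


def split_rows(X, test_fraction=0.2, seed=0):
    """Randomly split the rows of X into a training and a testing part."""
    X = np.asarray(X)
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must lie strictly between 0 and 1")
    rng = np.random.default_rng(seed)
    # Shuffle the row indices, then cut them into two disjoint groups.
    indices = rng.permutation(X.shape[0])
    n_test = max(1, int(round(test_fraction * X.shape[0])))
    return X[indices[n_test:]], X[indices[:n_test]]


X = np.random.random((1000, 50))
X_training, X_testing = split_rows(X, test_fraction=0.1)
print(X_training.shape, X_testing.shape)
```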
Varying the kernel¶
You may also want to choose another kernel to check for tree induction termination. Besides the RBF kernel, the class of Matérn kernels is supported with \(\nu \in \left\lbrace 1/2, 3/2, 5/2 \right\rbrace\), which can be selected in code by replacing rbf with matern-d1, matern-d3 or matern-d5, respectively. Please keep in mind that the Matérn kernels expect the scaling vector to contain as many entries as the input vectors have dimensions. Thus, GIF may be called like this:
import numpy as np
from genif import GeneralizedIsolationForest
# Create some random demo data.
d = 50
X = np.random.random((1000, d))
# Create the GIF classifier.
gif = GeneralizedIsolationForest(k=10, n_models=50, sample_size=256,
                                 kernel="matern-d3", kernel_scaling=np.repeat(0.5, d), sigma=0.01)
# Fit the classifier and make predictions.
y_pred = gif.fit_predict(X)
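To avoid passing a scaling vector of the wrong length, its construction can be wrapped in a small helper. This is a sketch only: make_kernel_scaling is a hypothetical name, and it assumes, mirroring the examples on this page, that the RBF kernel takes a single scaling entry while the Matérn kernels take one entry per input dimension.

```python
import numpy as np

# Kernel identifiers as named in the text above.
MATERN_KERNELS = ("matern-d1", "matern-d3", "matern-d5")


def make_kernel_scaling(kernel, n_features, value):
    """Build a scaling vector whose length matches the chosen kernel.

    Assumption: "rbf" uses one entry, Matérn kernels use n_features entries.
    """
    if kernel == "rbf":
        return np.array([value], dtype=float)
    if kernel in MATERN_KERNELS:
        return np.repeat(float(value), n_features)
    raise ValueError(f"unsupported kernel: {kernel!r}")


print(make_kernel_scaling("rbf", 50, 0.05).shape)
print(make_kernel_scaling("matern-d3", 50, 0.5).shape)
```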
Remember that GIF returns probability values, which you may want to binarize. In that case, you need to find an appropriate probability threshold and apply it to the prediction vector.
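A minimal sketch of such a binarization is given below. It works on a hand-written stand-in for the prediction vector, and both the helper name binarize and the label convention (1 for inliers, 0 for anomalies) are assumptions made for illustration. The threshold is either fixed by hand or derived from a quantile of the predicted probabilities.

```python
import numpy as np


def binarize(probabilities, threshold):
    """Label each observation: 1 if its inlier probability reaches the threshold, else 0."""
    probabilities = np.asarray(probabilities, dtype=float)
    return (probabilities >= threshold).astype(int)


# Stand-in for the vector returned by fit_predict(); the values are illustrative.
y_pred = np.array([0.97, 0.12, 0.88, 0.03, 0.91, 0.45])

# Option 1: a fixed, hand-picked threshold.
labels_fixed = binarize(y_pred, threshold=0.5)

# Option 2: a data-driven threshold that flags roughly the lowest quarter as anomalous.
quantile_threshold = np.quantile(y_pred, 0.25)
labels_quantile = binarize(y_pred, threshold=quantile_threshold)

print(labels_fixed)
print(labels_quantile)
```

Which option is appropriate depends on the application: a fixed threshold is easy to interpret, while a quantile-based one controls the share of observations flagged as anomalous.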
C++¶
The C++ interface is of interest to users who want to embed this algorithm in their existing programs or add more functionality to the existing sources (which we highly appreciate; merge requests are always welcome). For the C++ part of this section, we discuss the general project setup routine rather than the parametrization options for GIF. If you are interested in those, please take a look at the Python subsection above, as the necessary parameters are largely the same.
Using the library within other projects¶
The genif sources are distributed as a “header-only” library within the CMake project model. Hence, no explicit compilation or linking is needed. For this section, we will assume that your project is also organized as a CMake project.
To include GIF in your package, follow these steps:
1. Optional: Create a separate subdirectory holding library folders (e.g. lib).
2. Recursively clone the GIF source code repository by issuing either
git clone --recurse-submodules git@github.com:philippjh/genif.git
or (for submodule enthusiasts)
git submodule add --recurse-submodules git@github.com:philippjh/genif.git && git submodule update --init --recursive
3. Add the subdirectory to your CMakeLists.txt file (i.e. add_subdirectory(lib/genif)).
4. Link “your” target to the “virtual” target libgenif, which makes all necessary header files available to your project. This can be accomplished by target_link_libraries(yourtarget PUBLIC libgenif).
You are now ready to use the GIF library within your C++ project. All GIF-related symbols are packed into the genif namespace, so do not forget to either prepend genif:: or use a using namespace directive.
A short demonstration listing may be given as follows:
#include <iostream>
#include <string>
#include <genif/gif/GeneralizedIsolationForest.h>

int main() {
    // Create some parameters.
    const unsigned int k = 10;
    const unsigned int nModels = 100;
    const unsigned int sampleSize = 256;
    const std::string kernelId = "rbf";
    // Use a fixed, positive scaling; Random() draws from [-1, 1] and may yield a negative value.
    const Eigen::VectorXd kernelScaling = Eigen::VectorXd::Constant(1, 0.05);
    const double sigma = 0.02;
    const int workerCount = -1;

    // Create some random data to classify. The type is declared explicitly instead of
    // using auto, so that the random expression is evaluated exactly once.
    const unsigned int N = 1000;
    const unsigned int d = 50;
    const Eigen::MatrixXd X = Eigen::MatrixXd::Random(N, d);

    // Fit the forest and predict inlier probabilities for X.
    genif::GeneralizedIsolationForest gif(k, nModels, sampleSize, kernelId, kernelScaling, sigma, workerCount);
    auto yPred = gif.fitPredict(X);
    std::cout << "Prediction:" << std::endl << std::endl << yPred << std::endl;
    return 0;
}
As you can see, GIF uses the Eigen library for matrix-vector operations, which is included automatically when you add the library to your CMakeLists.txt.