Machine learning classes¶
These classes are simple wrappers around machine learning classes from other Python packages, used to perform basic tasks. While they are also designed to be accessed in C++ by O₂scl, they do not require the installation of O₂scl to function.
Interpolators¶
- o2sclpy.interpm_sklearn_gp: Gaussian process interpolation from scikit-learn
- o2sclpy.interpm_sklearn_mlpr: Multilayer perceptron regression from scikit-learn
- o2sclpy.interpm_sklearn_dtr: Decision-tree regression from scikit-learn
- o2sclpy.interpm_tf_dnn: Regression using a simple TensorFlow neural network
- o2sclpy.interpm_torch_dnn: Regression using a simple PyTorch neural network
Classifiers¶
- o2sclpy.classify_sklearn_gnb: Gaussian naive Bayes classifier from scikit-learn
- o2sclpy.classify_sklearn_mlpc: Multilayer perceptron classifier from scikit-learn
- o2sclpy.classify_sklearn_dtc: Decision-tree classification from scikit-learn
Probability density functions¶
- Gaussian mixture model in o2sclpy.gmm_sklearn and Bayesian Gaussian mixture model in o2sclpy.bgmm_sklearn.
- Kernel density estimators: o2sclpy.kde_sklearn and o2sclpy.kde_scipy.
- Normalizing flows using Torch and the nflows package: o2sclpy.nflows_nsf.
Class documentation¶
- class o2sclpy.bgmm_sklearn¶
Use scikit-learn to generate a Bayesian Gaussian mixture model of a specified set of data.
This is an experimental interface to provide easier interaction with C++.
- components(v)¶
For a point (or set of points) specified in v, use the Gaussian mixture to compute the density (or densities) of each component as a contiguous numpy array. Each array has entries which sum to 1.
- get_data()¶
Return the properties of the Gaussian mixture model as contiguous numpy arrays. This function returns, in order, the weights, the means, the covariances, the precisions (the inverse of the covariances), and the Cholesky decomposition of the precisions.
- log_pdf(x)¶
Return the per-sample average log likelihood of the data as a single floating point value given the vector or vectors specified in x.
- o2graph_to_bgmm(o2scl, amp, link, args)¶
The function providing the ‘to-bgmm’ command for o2graph.
- predict(v)¶
Predict the labels (the index of the Gaussian) given a vector or vectors v and return them in a one-dimensional numpy array with data type int64.
- sample(n_samples=1)¶
Sample the Gaussian mixture model, returning a tuple with two components, the first being a 2D array of the coordinates of the new samples and the second being a 1D array of the labels for each new sample.
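As a minimal sketch of this workflow (synthetic data and illustrative keyword values only), one can fit a model and then draw new samples:

```python
import numpy as np
import o2sclpy

# Synthetic two-cluster data with shape (n_samples, n_coordinates)
rng = np.random.default_rng(42)
in_data = np.vstack([rng.normal(0.0, 1.0, size=(500, 2)),
                     rng.normal(5.0, 0.5, size=(500, 2))])

bg = o2sclpy.bgmm_sklearn()
bg.set_data(in_data, n_components=2)

# sample() returns a tuple: a 2D array of coordinates and a 1D
# array of component labels
coords, labels = bg.sample(n_samples=10)
print(coords.shape, labels)
```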
- score_samples(x)¶
Given a vector (or list of vectors) in
x, return the log likelihood at each point as a numpy array.
- set_data(in_data, verbose=0, n_components=2, covariance_type='full', tol=0.001, reg_covar=1e-06, max_iter=100, n_init=1)¶
Fit the mixture model with the specified input data, a numpy array of shape (n_samples, n_coordinates).
- set_data_str(in_data, options)¶
Set the input data used to fit the model, using a string to specify the keyword arguments.
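For example, assuming the options string is a comma-separated list of key=value pairs (as suggested by string_to_dict() below), a sketch might look like:

```python
import numpy as np
import o2sclpy

rng = np.random.default_rng(0)
in_data = rng.normal(size=(200, 2))

bg = o2sclpy.bgmm_sklearn()
# Assumption: keyword arguments are encoded as comma-separated
# key=value pairs
bg.set_data_str(in_data, 'n_components=3,max_iter=200')
```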
- class o2sclpy.classify_sklearn_dtc¶
Classify a data set using scikit-learn’s decision tree classifier.
See https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html .
- eval(v)¶
Evaluate the classifier at point
v. If self.outformat is equal to ‘list’, then the output is a Python list; otherwise, the output is a numpy array.
- eval_list(v)¶
Evaluate the classifier at the array of points stored in
v.
- load(filename, obj_prefix)¶
Load the classifier from an HDF5 file named filename as a string named obj_prefix.
- save(filename, obj_prefix='classify_sklearn_dtc')¶
Save the classifier to an HDF5 file named filename as a string named obj_prefix.
- set_data(in_data, out_data, outformat='numpy', verbose=0, test_size=0.0, criterion='gini', splitter='best', max_depth=None, max_features=None, random_state=None)¶
Set the input and output data to train the classifier
The variable in_data should be an array of shape (n_points, n_dim), and out_data can be of shape (n_points) or (n_points, 1). AWS, 12/4/24: I’m not sure if this class works with more than one output label.
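A minimal training sketch with synthetic labels, following the shapes described above (all values illustrative):

```python
import numpy as np
import o2sclpy

# in_data has shape (n_points, n_dim); out_data has shape (n_points)
rng = np.random.default_rng(1)
x = rng.uniform(size=(200, 2))
y = (x[:, 0] + x[:, 1] > 1.0).astype(np.int64)

dtc = o2sclpy.classify_sklearn_dtc()
dtc.set_data(x, y, max_depth=4)

# Classify a single point and then a list of points
print(dtc.eval(np.array([0.2, 0.9])))
print(dtc.eval_list(x[:5]))
```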
- set_data_str(in_data, out_data, options)¶
Set the input and output data to train the classifier, using a string to specify the keyword arguments.
- verbose = 0¶
Verbosity parameter (default 0)
- class o2sclpy.classify_sklearn_gnb¶
Classify a data set using scikit-learn’s Gaussian naive Bayes classifier.
See https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html .
- eval(v)¶
Evaluate the classifier at point
v.
- eval_list(v)¶
Evaluate the classifier at the array of points stored in
v.
- load(filename, obj_prefix='classify_sklearn_gnb')¶
Load the classifier from an HDF5 file named filename as a string named obj_prefix.
- save(filename, obj_prefix='classify_sklearn_gnb')¶
Save the classifier to an HDF5 file named filename as a string named obj_prefix.
- set_data(in_data, out_data, outformat='numpy', test_size=0.0, priors=None, var_smoothing=1e-09, verbose=0, transform_in='none')¶
Set the input and output data to train the classifier
The variable in_data should be an array of shape (n_points, n_dim), and out_data can be of shape (n_points) or (n_points, 1). AWS, 12/4/24: I’m not sure if this class works with more than one output label.
- set_data_str(in_data, out_data, options)¶
Set the input and output data to train the classifier, using a string to specify the keyword arguments.
- class o2sclpy.classify_sklearn_mlpc¶
Classify a data set using scikit-learn’s multi-layer perceptron classifier.
See https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html .
- eval(v)¶
Evaluate the classifier at point
v.
- eval_list(v)¶
Evaluate the classifier at the array of points stored in
v.
- load(filename, obj_prefix)¶
Load the classifier from an HDF5 file named filename as a string named obj_prefix.
- save(filename, obj_prefix='classify_sklearn_mlpc')¶
Save the classifier to an HDF5 file named filename as a string named obj_prefix.
- set_data(in_data, out_data, transform_in='none', outformat='numpy', test_size=0.0, hlayers=(100,), activation='relu', solver='adam', alpha=0.0001, batch_size='auto', learning_rate='constant', max_iter=200, random_state=None, verbose=False, early_stopping=False, n_iter_no_change=10, tol=0.0001)¶
Set the input and output data to train the classifier
The variable in_data should be an array of shape (n_points, n_dim), and out_data can be of shape (n_points) or (n_points, 1). AWS, 12/4/24: I don’t think this class works with more than one output label.
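A sketch along the same lines, with a small hidden-layer specification (all values illustrative):

```python
import numpy as np
import o2sclpy

rng = np.random.default_rng(2)
x = rng.uniform(size=(300, 2))
y = (x[:, 0] > x[:, 1]).astype(np.int64)

mlpc = o2sclpy.classify_sklearn_mlpc()
mlpc.set_data(x, y, hlayers=(32, 32), max_iter=400)
print(mlpc.eval(np.array([0.3, 0.7])))
```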
- set_data_str(in_data, out_data, options)¶
Set the input and output data to train the classifier, using a string to specify the keyword arguments.
- class o2sclpy.gmm_sklearn¶
Use scikit-learn to generate a Gaussian mixture model of a specified set of data.
This is an experimental interface to provide easier interaction with C++.
- components(v)¶
For a point (or set of points) specified in v, use the Gaussian mixture to compute the density (or densities) of each component as a contiguous numpy array. Each array has entries which sum to 1.
- get_data()¶
Return the properties of the Gaussian mixture model as contiguous numpy arrays. This function returns, in order, the weights, the means, the covariances, the precisions (the inverse of the covariances), and the Cholesky decomposition of the precisions.
- log_pdf(x)¶
Return the per-sample average log likelihood of the data as a single floating point value given the vector or vectors specified in x.
- o2graph_to_gmm(o2scl, amp, link, args)¶
The function providing the ‘to-gmm’ command for o2graph.
- predict(v)¶
Predict the labels (the index of the Gaussian) given a vector or vectors v and return them in a one-dimensional numpy array with data type int64.
- sample(n_samples=1)¶
Sample the Gaussian mixture model, returning a tuple with two components, the first being a 2D array of the coordinates of the new samples and the second being a 1D array of the labels for each new sample.
- score_samples(x)¶
Given a vector (or list of vectors) in
x, return the log likelihood at each point as a numpy array.
- set_data(in_data, verbose=0, n_components=2, covariance_type='full', tol=0.001, reg_covar=1e-06, max_iter=100, n_init=1)¶
Fit the mixture model with the specified input data, a numpy array of shape (n_samples, n_coordinates).
- set_data_str(in_data, options)¶
Set the input data used to fit the model, using a string to specify the keyword arguments.
- class o2sclpy.nflows_nsf¶
Neural spline flow probability density distribution from the normflows package, which uses PyTorch.
This class is experimental.
This code was originally based on https://github.com/VincentStimper/normalizing-flows/blob/master/examples/circular_nsf.ipynb .
- log_pdf(x)¶
Return the log likelihood
The value x can be a single point, expressed as a one-dimensional list or numpy array, or a series of points specified as a numpy array. If x contains only one point, then only a single floating-point value is returned. Otherwise, the return type is a list or numpy array, depending on the value of outformat.
- pdf(x)¶
Return the likelihood
- sample(n_samples=1)¶
Sample the distribution
The output is a list or numpy array, depending on which option was specified to set_data() or set_data_str(). The list or numpy array is only one-dimensional if n_samples is 1.
- set_data(in_data, verbose=0, num_layers=20, num_hidden_channels=128, max_iter=20000, outformat='numpy', adam_lr=0.0001, adam_decay=0.0001)¶
Fit the normalizing flow with the specified input data, a numpy array of shape (n_samples, n_coordinates).
The argument adam_lr is the Adam learning rate (the PyTorch default is 1.0e-3), and adam_decay is the Adam weight decay (the PyTorch default is 0).
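A minimal sketch of fitting and sampling the flow (max_iter is kept small here only so the example runs quickly; the fit will be rough):

```python
import numpy as np
import o2sclpy

# Correlated 2D data with shape (n_samples, n_coordinates)
rng = np.random.default_rng(3)
z = rng.normal(size=(1000, 2))
in_data = np.column_stack([z[:, 0], 0.5*z[:, 0] + 0.2*z[:, 1]])

nf = o2sclpy.nflows_nsf()
nf.set_data(in_data, max_iter=1000, adam_lr=1.0e-4)

# Draw samples and evaluate their log density
samples = nf.sample(n_samples=5)
print(nf.log_pdf(samples))
```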
- set_data_str(in_data, options='')¶
Set the input data used to fit the model, using a string to specify the keyword arguments.
- class o2sclpy.kde_sklearn¶
Use scikit-learn to generate a KDE.
This is an experimental interface to provide easier interaction with C++.
Todo
Fix the comparison between sklearn and scipy, making sure they both produce the same log_pdf() in the correct conditions. Ensure the integral is normalized when appropriate.
See https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KernelDensity.html .
- get_bandwidth()¶
Return the bandwidth
- log_pdf(x)¶
Return the log likelihood
- pdf(x)¶
Return the likelihood
- sample(n_samples=1)¶
Sample the distribution
- set_data(in_data, bw_array, verbose=0, kernel='gaussian', metric='euclidean', outformat='numpy', transform='unit', bandwidth='none')¶
Fit the KDE with the specified input data, a numpy array of shape (n_samples, n_coordinates).
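A minimal sketch; the role of bw_array is assumed here to be a list of candidate bandwidths considered during the fit:

```python
import numpy as np
import o2sclpy

rng = np.random.default_rng(4)
in_data = rng.normal(size=(500, 1))

kde = o2sclpy.kde_sklearn()
# Assumption: bw_array lists candidate bandwidths for the fit
kde.set_data(in_data, [0.1, 0.2, 0.5, 1.0])

print(kde.get_bandwidth())
print(kde.log_pdf(np.array([0.0])))
```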
- set_data_str(in_data, bw_array, options)¶
Set the input data used to fit the KDE, using a string to specify the keyword arguments.
- class o2sclpy.kde_scipy¶
Use scipy to generate a KDE.
This is an experimental and very simplified interface, mostly to provide easier interaction with C++.
- get_bandwidth()¶
Return the bandwidth
- log_pdf(x)¶
Return the log likelihood
- pdf(x)¶
Return the likelihood
- sample(n_samples=1)¶
Sample the distribution
- set_data(in_data, verbose=0, weights=None, outformat='numpy', bw_method=None, transform='unit')¶
Fit the KDE with the specified input data, a numpy array of shape (n_samples, n_coordinates).
- set_data_str(in_data, weights, options)¶
Set the input data used to fit the KDE, using a string to specify the keyword arguments.
- string_to_dict(s)¶
Convert a string to a dictionary, converting strings to values when necessary.
- class o2sclpy.interpm_sklearn_dtr¶
Interpolate one or many multidimensional data sets using scikit-learn’s decision tree regression.
See https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html .
- eval(v)¶
Evaluate the regression at point
v.
- eval_list(v)¶
Evaluate the regression at the array of points stored in v.
- load(filename, obj_name)¶
Load the interpolation settings from a file
- outformat = 'numpy'¶
Output format, either ‘numpy’, ‘c++’, or ‘list’ (default ‘numpy’)
- save(filename, obj_name)¶
Save the interpolation settings to an HDF5 file
- score = 0.0¶
The most recent score value, computed by set_data() when a nonzero test_size is given
- set_data(in_data, out_data, outformat='numpy', verbose=0, test_size=0.0, criterion='squared_error', splitter='best', transform_in='none', transform_out='none', max_depth=None, random_state=None)¶
Set the input and output data to train the interpolator
- set_data_str(in_data, out_data, options)¶
Set the input and output data to train the interpolator, using a string to specify the keyword arguments.
- verbose = 0¶
Verbosity parameter (default 0)
- class o2sclpy.interpm_sklearn_gp¶
Interpolate one or many multidimensional data sets using a Gaussian process from scikit-learn
AWS, 3/12/25: I think sklearn uses the log of the marginal likelihood as the optimization function.
The variables verbose and outformat can be changed at any time.
Todo
Calculate derivatives
Allow sampling, as done in interpm_krige
Allow different minimizers?
- apply(v, f)¶
Apply the kernel-like function f to the training data and return the result (doesn’t work yet).
- eval(v)¶
Evaluate the GP at point
v. The input v should be a one-dimensional numpy array, and the output is a one-dimensional numpy array, unless outformat is ‘list’, in which case the output is a Python list.
- eval_list(v)¶
Evaluate the GP at the list of points given in
v. The input v should be a two-dimensional numpy array of size (n_points, n_inputs). If outformat is ‘native’, then the output is a two-dimensional numpy array. If outformat is ‘list’, then the output is a list. If outformat is ‘c++’, then the output is a contiguous one-dimensional numpy array.
- eval_unc(v)¶
Evaluate the GP and its uncertainty at point
v. (AWS, 3/27/24: Keep in mind that o2scl::interpm_python.eval_unc() expects the return type to be a tuple of numpy arrays.)
- load(filename, obj_name)¶
Load the interpolation settings from a string named
obj_name stored in an HDF5 file named filename.
- outformat = 'native'¶
Output format, either ‘native’, ‘c++’, or ‘list’ (default ‘native’)
- save(filename, obj_name)¶
Save the interpolation settings to an HDF5 file.
This function uses the sklearn get_params() function to obtain the sklearn parameters. A tuple is created using the class parameters and the sklearn parameters, and this tuple is pickled to a string. Finally, this function stores that string with name obj_name in the HDF5 file named filename.
- score = 0.0¶
The most recent score value, computed by set_data() when a nonzero test_size is given
- set_data(in_data, out_data, kernel=None, test_size=0.0, normalize_y=True, transform_in='none', alpha=1e-10, outformat='native', verbose=0, random_state=None)¶
Set the input and output data to train the Gaussian process. The variable in_data should be a numpy array with shape
(n_points, in_dim) and out_data should be a numpy array with shape (n_points, out_dim). (Sklearn calls these shapes (n_samples, n_features) and (n_samples, n_targets).) If kernel is None, then the default kernel, 1.0*RBF(1.0,(1e-2,1e2)), is used. The value alpha is added to the diagonal elements of the kernel matrix during fitting.
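A minimal sketch which trains on a one-dimensional function with the default kernel and then evaluates both the interpolation and its uncertainty:

```python
import numpy as np
import o2sclpy

# Shapes (n_points, in_dim) and (n_points, out_dim)
x = np.linspace(0.0, 1.0, 20).reshape(20, 1)
y = np.sin(4.0*x)

gp = o2sclpy.interpm_sklearn_gp()
gp.set_data(x, y)  # default kernel 1.0*RBF(1.0,(1e-2,1e2))

print(gp.eval(np.array([0.5])))
# eval_unc() returns a tuple of numpy arrays
val, unc = gp.eval_unc(np.array([0.5]))
```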
- set_data_str(in_data, out_data, options)¶
Set the input and output data to train the interpolator, using a string to specify the keyword arguments.
The GP kernel, if specified, should be the last option specified in the string (this enables easier parsing of the option string). The eval() function is used to convert the string to a sklearn kernel.
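For example (a sketch, with the same data as in the previous example), the kernel appears as the final option:

```python
import numpy as np
import o2sclpy

x = np.linspace(0.0, 1.0, 20).reshape(20, 1)
y = np.sin(4.0*x)

gp = o2sclpy.interpm_sklearn_gp()
# The kernel comes last, which simplifies parsing of the
# option string
gp.set_data_str(x, y, 'verbose=1,kernel=1.0*RBF(1.0,(1e-2,1e2))')
```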
- verbose = 0¶
Verbosity parameter (default 0)
- class o2sclpy.interpm_sklearn_mlpr¶
Interpolate one or many multidimensional data sets using scikit-learn’s multi-layer perceptron regressor.
- eval(v)¶
Evaluate the MLP at point
v.
- eval_list(v)¶
Evaluate the MLP at the array of points stored in v.
- eval_unc(v)¶
Empty function because this interpolator does not currently provide uncertainties
- load(filename, obj_name)¶
Load the interpolation settings from a file
- outformat = 'numpy'¶
Output format, either ‘numpy’, ‘c++’, or ‘list’ (default ‘numpy’)
- save(filename, obj_name)¶
Save the interpolation settings to an HDF5 file
- score = 0.0¶
The most recent score value, computed by set_data() when a nonzero test_size is given
- set_data(in_data, out_data, outformat='numpy', test_size=0.0, hlayers=(100,), activation='relu', transform_in='none', transform_out='none', solver='adam', alpha=0.0001, batch_size='auto', learning_rate='adaptive', max_iter=500, random_state=1, verbose=0, early_stopping=True, tol=0.0001, n_iter_no_change=10)¶
Set the input and output data to train the interpolator.
Activation functions are ‘identity’, ‘logistic’, ‘tanh’, or ‘relu’.
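A minimal sketch using the ‘tanh’ activation (all values illustrative):

```python
import numpy as np
import o2sclpy

x = np.linspace(0.0, 1.0, 200).reshape(200, 1)
y = np.exp(-x)*np.cos(8.0*x)

mlpr = o2sclpy.interpm_sklearn_mlpr()
mlpr.set_data(x, y, hlayers=(64, 64), activation='tanh')
print(mlpr.eval(np.array([0.25])))
```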
- set_data_str(in_data, out_data, options)¶
Set the input and output data to train the interpolator, using a string to specify the keyword arguments.
- verbose = 0¶
Verbosity parameter (default 0)
- class o2sclpy.interpm_tf_dnn¶
Interpolate one or many multidimensional data sets using a neural network from TensorFlow
This is a simple implementation of a neural network with early stopping.
The variables verbose and outformat can be changed at any time.
Todo
Calculate derivatives
‘native’ output format?
add_data() for successive improvements
Allow user to control CPU vs. GPU
Allow user to control early stopping monitor
- check_gpu()¶
Check if Tensorflow is likely to use the GPU
- eval(v)¶
Evaluate the NN at point
v. The input v should be a one-dimensional numpy array, and the output is a one-dimensional numpy array, unless outformat is ‘list’, in which case the output is a Python list.
- eval_list(v)¶
Evaluate the neural network at the list of points given in
v.
- eval_unc(v)¶
Empty function because this interpolator does not currently provide uncertainties
- load(filename)¶
Load the interpolator from a pair of .keras and .o2 files.
- outformat = 'numpy'¶
Output format, either ‘numpy’ or ‘list’ (default ‘numpy’)
- save(filename)¶
Save the interpolation settings to a pair of files: a .keras file for the TensorFlow model and a .o2 file for additional data.
- set_data(in_data, out_data, outformat='numpy', verbose=0, activations=['relu'], batch_size=None, epochs=100, transform_in='none', transform_out='none', test_size=0.0, evaluate=False, hlayers=[8, 8], loss='mean_squared_error', es_min_delta=0.0001, es_patience=100, es_start=50, tf_logs='1', tf_onednn_opts='1')¶
Set the input and output data to train the interpolator
Some activation functions are: ‘relu’, ‘sigmoid’, ‘tanh’. If the number of activation functions specified in
activations is smaller than the number of layers, then the activation function list is reused using the modulus operator. The keyword argument tf_logs specifies the value of the environment variable TF_CPP_MIN_LOG_LEVEL.
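A sketch illustrating the reuse of a short activation list across three hidden layers (training kept short; all values illustrative):

```python
import numpy as np
import o2sclpy

x = np.linspace(0.0, 1.0, 500).reshape(500, 1)
y = np.sin(6.0*x)

ti = o2sclpy.interpm_tf_dnn()
# One activation for three layers: the list is reused via the
# modulus operator, so all three layers use 'relu'
ti.set_data(x, y, hlayers=[16, 16, 16], activations=['relu'],
            epochs=200)
print(ti.eval(np.array([0.5])))
```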
- set_data_str(in_data, out_data, options)¶
Set the input and output data to train the interpolator, using a string to specify the keyword arguments.
- verbose = 0¶
Verbosity parameter (default 0)
- class o2sclpy.interpm_torch_dnn¶
Interpolate one or many multidimensional data sets using PyTorch.
Todo
Calculate second derivatives
More activation functions
move function_approx class outside of function
better handling of torch tensors as input and output
‘native’ output format
partial derivatives inefficient because always computes gradient
add_data() for successive improvements
Allow user to control CPU vs. GPU
- deriv(v, i)¶
Evaluate the derivative of the NN at point
v with respect to the variable with index i.
- eval(v)¶
Evaluate the NN at point
v.
- eval_list(v)¶
Evaluate the NN at the list of points given in
v.
- eval_unc(v)¶
Empty function because this interpolator does not currently provide uncertainties
- load(filename, device=None)¶
Load the interpolation settings from a file
- outformat = 'numpy'¶
Output format, either ‘numpy’, ‘c++’, or ‘list’ (default ‘numpy’)
- save(filename)¶
Save the interpolation settings to a file (no custom object support).
- set_data(in_data, out_data, outformat='numpy', verbose=0, hlayers=[8, 8], epochs=100, transform_in='none', transform_out='none', test_size=0.0, activation='relu', patience=20, device=None, seed=None, layer_norm=True)¶
Early stopping is controlled by patience; if patience is 0, then the training never stops early.
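A sketch combining training, evaluation, and the deriv() method described above (all values illustrative):

```python
import numpy as np
import o2sclpy

x = np.linspace(0.0, 1.0, 200).reshape(200, 1)
y = np.sin(4.0*x)

td = o2sclpy.interpm_torch_dnn()
td.set_data(x, y, hlayers=[16, 16], epochs=200, patience=20)

# Value and d/dx at x=0.5; the exact derivative is 4*cos(2.0)
print(td.eval(np.array([0.5])))
print(td.deriv(np.array([0.5]), 0))
```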
- verbose = 0¶
Verbosity parameter (default 0)