Tutorial: Categorize support issues using multiclass classification with ML.NET
This tutorial shows how to use ML.NET to create a GitHub issue classifier that trains a model to classify and predict the Area label for a GitHub issue, using a C# .NET console application in Visual Studio.
In this tutorial, you learn how to:
- Prepare your data
- Transform the data
- Train the model
- Evaluate the model
- Predict with the trained model
- Deploy and Predict with a loaded model
You can find the source code for this tutorial at the dotnet/samples repository.
Prerequisites
- Visual Studio 2022 with the ".NET Desktop Development" workload installed.
- The GitHub issues tab-separated file (issues_train.tsv).
- The GitHub issues test tab-separated file (issues_test.tsv).
Create a console application
Create a project
Create a C# Console Application called "GitHubIssueClassification". Select Next.
Choose .NET 7 as the framework to use. Select Create.
Create a directory named Data in your project to save your data set files:
In Solution Explorer, right-click on your project and select Add > New Folder. Type "Data" and press Enter.
Create a directory named Models in your project to save your model:
In Solution Explorer, right-click on your project and select Add > New Folder. Type "Models" and press Enter.
Install the Microsoft.ML NuGet Package:
Note
This sample uses the latest stable version of the NuGet packages mentioned unless otherwise stated.
In Solution Explorer, right-click on your project and select Manage NuGet Packages. Choose "nuget.org" as the Package source, select the Browse tab, search for Microsoft.ML and select the Install button. Select the OK button on the Preview Changes dialog and then select the I Accept button on the License Acceptance dialog if you agree with the license terms for the packages listed.
Prepare your data
Download the issues_train.tsv and the issues_test.tsv data sets and save them to the Data folder you created previously. The first dataset trains the machine learning model and the second can be used to evaluate how accurate your model is.
In Solution Explorer, right-click each of the *.tsv files and select Properties. Under Advanced, change the value of Copy to Output Directory to Copy if newer.
Create classes and define paths
Add the following additional using directives to the top of the Program.cs file:
using Microsoft.ML;
using GitHubIssueClassification;
Create three global fields to hold the paths to the recently downloaded files, and global variables for the MLContext, DataView, and PredictionEngine:
- _trainDataPath has the path to the dataset used to train the model.
- _testDataPath has the path to the dataset used to evaluate the model.
- _modelPath has the path where the trained model is saved.
- _mlContext is the MLContext that provides processing context.
- _trainingDataView is the IDataView used to process the training dataset.
- _predEngine is the PredictionEngine<TSrc,TDst> used for single predictions.
Add the following code to the line directly below the using directives to specify those paths and the other variables:
string _appPath = Path.GetDirectoryName(Environment.GetCommandLineArgs()[0]) ?? ".";
string _trainDataPath = Path.Combine(_appPath, "..", "..", "..", "Data", "issues_train.tsv");
string _testDataPath = Path.Combine(_appPath, "..", "..", "..", "Data", "issues_test.tsv");
string _modelPath = Path.Combine(_appPath, "..", "..", "..", "Models", "model.zip");
MLContext _mlContext;
PredictionEngine<GitHubIssue, IssuePrediction> _predEngine;
ITransformer _trainedModel;
IDataView _trainingDataView;
Create some classes for your input data and predictions. Add a new class to your project:
In Solution Explorer, right-click the project, and then select Add > New Item.
In the Add New Item dialog box, select Class and change the Name field to GitHubIssueData.cs. Then, select the Add button.
The GitHubIssueData.cs file opens in the code editor. Add the following using directive to the top of GitHubIssueData.cs:
using Microsoft.ML.Data;
Remove the existing class definition and add the following code, which has two classes, GitHubIssue and IssuePrediction, to the GitHubIssueData.cs file:
public class GitHubIssue
{
[LoadColumn(0)]
public string? ID { get; set; }
[LoadColumn(1)]
public string? Area { get; set; }
[LoadColumn(2)]
public required string Title { get; set; }
[LoadColumn(3)]
public required string Description { get; set; }
}
public class IssuePrediction
{
[ColumnName("PredictedLabel")]
public string? Area;
}
The Label is the column you want to predict. The identified Features are the inputs you give the model to predict the Label.
Use the LoadColumnAttribute to specify the indices of the source columns in the data set.
GitHubIssue is the input dataset class and has the following String fields:
- The first column, ID (GitHub issue ID).
- The second column, Area (the prediction for training).
- The third column, Title (GitHub issue title), is the first feature used for predicting the Area.
- The fourth column, Description, is the second feature used for predicting the Area.
IssuePrediction is the class used for prediction after the model has been trained. It has a single string (Area) and a PredictedLabel ColumnName attribute. The PredictedLabel is used during prediction and evaluation. For evaluation, an input with training data, the predicted values, and the model are used.
All ML.NET operations start in the MLContext class. Initializing mlContext creates a new ML.NET environment that can be shared across the model creation workflow objects. It's similar, conceptually, to DbContext in Entity Framework.
Initialize variables
Initialize the _mlContext global variable with a new instance of MLContext with a random seed (seed: 0) for repeatable/deterministic results across multiple trainings. Replace the Console.WriteLine("Hello World!") line with the following code:
_mlContext = new MLContext(seed: 0);
Load the data
ML.NET uses the IDataView interface as a flexible, efficient way of describing numeric or text tabular data. IDataView can load data either from text files or in real time (for example, from a SQL database or log files).
To initialize and load the _trainingDataView global variable in order to use it for the pipeline, add the following code after the mlContext initialization:
_trainingDataView = _mlContext.Data.LoadFromTextFile<GitHubIssue>(_trainDataPath, hasHeader: true);
The LoadFromTextFile() method defines the data schema and reads in the file. It takes in the data path variables and returns an IDataView.
Add the following after calling the LoadFromTextFile() method:
var pipeline = ProcessData();
The ProcessData method executes the following tasks:
- Extracts and transforms the data.
- Returns the processing pipeline.
Create the ProcessData method at the bottom of the Program.cs file using the following code:
IEstimator<ITransformer> ProcessData()
{
}
Extract features and transform the data
As you want to predict the Area GitHub label for a GitHubIssue, use the MapValueToKey() method to transform the Area column into a numeric key type Label column (a format accepted by classification algorithms) and add it as a new dataset column:
var pipeline = _mlContext.Transforms.Conversion.MapValueToKey(inputColumnName: "Area", outputColumnName: "Label")
Next, call mlContext.Transforms.Text.FeaturizeText, which transforms the text columns (Title and Description) into a numeric vector for each, called TitleFeaturized and DescriptionFeaturized respectively. Append the featurization for both columns to the pipeline with the following code:
.Append(_mlContext.Transforms.Text.FeaturizeText(inputColumnName: "Title", outputColumnName: "TitleFeaturized"))
.Append(_mlContext.Transforms.Text.FeaturizeText(inputColumnName: "Description", outputColumnName: "DescriptionFeaturized"))
The last step in data preparation combines all of the feature columns into the Features column using the Concatenate() method. By default, a learning algorithm processes only features from the Features column. Append this transformation to the pipeline with the following code:
.Append(_mlContext.Transforms.Concatenate("Features", "TitleFeaturized", "DescriptionFeaturized"))
Next, append AppendCacheCheckpoint to cache the DataView, so that you might get better performance when you iterate over the data multiple times using the cache, as with the following code:
.AppendCacheCheckpoint(_mlContext);
Warning
Use AppendCacheCheckpoint for small/medium datasets to lower training time. Do NOT use it (remove .AppendCacheCheckpoint()) when handling very large datasets.
Return the pipeline at the end of the ProcessData method.
return pipeline;
This step handles preprocessing/featurization. Using additional components available in ML.NET can enable better results with your model.
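For reference, once all of the preceding snippets are appended, the assembled ProcessData method should look roughly like the following:
IEstimator<ITransformer> ProcessData()
{
    // Map the Area column to a numeric key Label column, featurize the text columns,
    // combine the features into a single Features column, and cache the DataView.
    var pipeline = _mlContext.Transforms.Conversion.MapValueToKey(inputColumnName: "Area", outputColumnName: "Label")
        .Append(_mlContext.Transforms.Text.FeaturizeText(inputColumnName: "Title", outputColumnName: "TitleFeaturized"))
        .Append(_mlContext.Transforms.Text.FeaturizeText(inputColumnName: "Description", outputColumnName: "DescriptionFeaturized"))
        .Append(_mlContext.Transforms.Concatenate("Features", "TitleFeaturized", "DescriptionFeaturized"))
        .AppendCacheCheckpoint(_mlContext);

    return pipeline;
}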
Build and train the model
Add the following call to the BuildAndTrainModel method as the next line after the call to the ProcessData() method:
var trainingPipeline = BuildAndTrainModel(_trainingDataView, pipeline);
The BuildAndTrainModel method executes the following tasks:
- Creates the training algorithm class.
- Trains the model.
- Predicts area based on training data.
- Returns the model.
Create the BuildAndTrainModel method, just after the declaration of the ProcessData() method, using the following code:
IEstimator<ITransformer> BuildAndTrainModel(IDataView trainingDataView, IEstimator<ITransformer> pipeline)
{
}
About the classification task
Classification is a machine learning task that uses data to determine the category, type, or class of an item or row of data and is frequently one of the following types:
- Binary: either A or B.
- Multiclass: multiple categories that can be predicted by using a single model.
For this type of problem, use a Multiclass classification learning algorithm, since your issue category prediction can be one of multiple categories (multiclass) rather than just two (binary).
Append the machine learning algorithm to the data transformation definitions by adding the following as the first line of code in BuildAndTrainModel():
var trainingPipeline = pipeline.Append(_mlContext.MulticlassClassification.Trainers.SdcaMaximumEntropy("Label", "Features"))
.Append(_mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabel"));
The SdcaMaximumEntropy is your multiclass classification training algorithm. It's appended to the pipeline and accepts the featurized Title and Description (Features) and the Label input parameters to learn from the historic data.
Train the model
Fit the model to the trainingDataView data and return the trained model by adding the following as the next line of code in the BuildAndTrainModel() method:
_trainedModel = trainingPipeline.Fit(trainingDataView);
The Fit() method trains your model by transforming the dataset and applying the training.
The PredictionEngine is a convenience API, which allows you to pass in and then perform a prediction on a single instance of data. Add this as the next line in the BuildAndTrainModel() method:
_predEngine = _mlContext.Model.CreatePredictionEngine<GitHubIssue, IssuePrediction>(_trainedModel);
Predict with the trained model
Add a GitHub issue to test the trained model's prediction in the BuildAndTrainModel() method by creating an instance of GitHubIssue:
GitHubIssue issue = new GitHubIssue() {
Title = "WebSockets communication is slow in my machine",
Description = "The WebSockets communication used under the covers by SignalR looks like is going slow in my development machine.."
};
Use the Predict() function to make a prediction on a single row of data:
var prediction = _predEngine.Predict(issue);
Use the model: prediction results
Display the GitHubIssue and corresponding Area label prediction in order to share the results and act on them accordingly. Create a display for the results using the following Console.WriteLine() code:
Console.WriteLine($"=============== Single Prediction just-trained-model - Result: {prediction.Area} ===============");
Return the model trained to use for evaluation
Return the model at the end of the BuildAndTrainModel method.
return trainingPipeline;
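For reference, the assembled BuildAndTrainModel method, built from the snippets in this section, should look roughly like the following:
IEstimator<ITransformer> BuildAndTrainModel(IDataView trainingDataView, IEstimator<ITransformer> pipeline)
{
    // Append the multiclass trainer and map the predicted key back to a readable label value.
    var trainingPipeline = pipeline.Append(_mlContext.MulticlassClassification.Trainers.SdcaMaximumEntropy("Label", "Features"))
        .Append(_mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabel"));

    // Train the model on the training data.
    _trainedModel = trainingPipeline.Fit(trainingDataView);

    // Create a prediction engine and test the just-trained model on a single issue.
    _predEngine = _mlContext.Model.CreatePredictionEngine<GitHubIssue, IssuePrediction>(_trainedModel);

    GitHubIssue issue = new GitHubIssue() {
        Title = "WebSockets communication is slow in my machine",
        Description = "The WebSockets communication used under the covers by SignalR looks like is going slow in my development machine.."
    };

    var prediction = _predEngine.Predict(issue);

    Console.WriteLine($"=============== Single Prediction just-trained-model - Result: {prediction.Area} ===============");

    return trainingPipeline;
}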
Evaluate the model
Now that you've created and trained the model, you need to evaluate it with a different dataset for quality assurance and validation. In the Evaluate method, the model created in BuildAndTrainModel is passed in to be evaluated. Create the Evaluate method, just after BuildAndTrainModel, as in the following code:
void Evaluate(DataViewSchema trainingDataViewSchema)
{
}
The Evaluate method executes the following tasks:
- Loads the test dataset.
- Creates the multiclass evaluator.
- Evaluates the model and creates metrics.
- Displays the metrics.
Add a call to the new method, right under the BuildAndTrainModel method call, using the following code:
Evaluate(_trainingDataView.Schema);
As you did previously with the training dataset, load the test dataset by adding the following code to the Evaluate method:
var testDataView = _mlContext.Data.LoadFromTextFile<GitHubIssue>(_testDataPath, hasHeader: true);
The Evaluate() method computes the quality metrics for the model using the specified dataset. It returns a MulticlassClassificationMetrics object that contains the overall metrics computed by multiclass classification evaluators.
To display the metrics to determine the quality of the model, you need to get them first.
Notice the use of the Transform() method of the machine learning _trainedModel global variable (an ITransformer) to input the features and return predictions. Add the following code to the Evaluate method as the next line:
var testMetrics = _mlContext.MulticlassClassification.Evaluate(_trainedModel.Transform(testDataView));
The following metrics are evaluated for multiclass classification:
- Micro Accuracy: Every sample-class pair contributes equally to the accuracy metric. You want Micro Accuracy to be as close to one as possible.
- Macro Accuracy: Every class contributes equally to the accuracy metric. Minority classes are given equal weight as the larger classes. You want Macro Accuracy to be as close to one as possible.
- Log-loss: See Log Loss. You want Log-loss to be as close to zero as possible.
- Log-loss reduction: Ranges from [-inf, 1.00], where 1.00 is perfect predictions and 0 indicates mean predictions. You want Log-loss reduction to be as close to one as possible.
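To make the difference between Micro Accuracy and Macro Accuracy concrete, the following small sketch uses hypothetical per-class counts (not values from the tutorial data) to show how a rare class affects each metric:
// Illustrative only: hypothetical counts for three classes.
// Class A: 90 samples, 81 correct; class B: 9 samples, 6 correct; class C: 1 sample, 0 correct.

// Micro accuracy weights every sample equally: total correct / total samples.
double microAccuracy = (81.0 + 6 + 0) / (90 + 9 + 1);        // 87 / 100 = 0.87

// Macro accuracy weights every class equally: average of the per-class accuracies.
double macroAccuracy = (81.0 / 90 + 6.0 / 9 + 0.0 / 1) / 3;  // (0.9 + 0.667 + 0.0) / 3 ≈ 0.522

Console.WriteLine($"Micro: {microAccuracy:0.###}, Macro: {macroAccuracy:0.###}");
The rare class drags Macro Accuracy down much more than Micro Accuracy, which is why the two metrics can diverge on imbalanced datasets such as GitHub issue areas.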
Display the metrics for model validation
Use the following code to display the metrics, share the results, and then act on them:
Console.WriteLine($"*************************************************************************************************************");
Console.WriteLine($"* Metrics for Multi-class Classification model - Test Data ");
Console.WriteLine($"*------------------------------------------------------------------------------------------------------------");
Console.WriteLine($"* MicroAccuracy: {testMetrics.MicroAccuracy:0.###}");
Console.WriteLine($"* MacroAccuracy: {testMetrics.MacroAccuracy:0.###}");
Console.WriteLine($"* LogLoss: {testMetrics.LogLoss:#.###}");
Console.WriteLine($"* LogLossReduction: {testMetrics.LogLossReduction:#.###}");
Console.WriteLine($"*************************************************************************************************************");
Save the model to a file
Once satisfied with your model, save it to a file to make predictions at a later time or in another application. Add the following code to the Evaluate method.
SaveModelAsFile(_mlContext, trainingDataViewSchema, _trainedModel);
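At this point, the assembled Evaluate method should look roughly like the following (the metric display is abbreviated here; use the full Console.WriteLine block shown above):
void Evaluate(DataViewSchema trainingDataViewSchema)
{
    // Load the test dataset.
    var testDataView = _mlContext.Data.LoadFromTextFile<GitHubIssue>(_testDataPath, hasHeader: true);

    // Evaluate the trained model against the test data.
    var testMetrics = _mlContext.MulticlassClassification.Evaluate(_trainedModel.Transform(testDataView));

    // Display the metrics (abbreviated; see the full display code above).
    Console.WriteLine($"* MicroAccuracy:    {testMetrics.MicroAccuracy:0.###}");
    Console.WriteLine($"* MacroAccuracy:    {testMetrics.MacroAccuracy:0.###}");
    Console.WriteLine($"* LogLoss:          {testMetrics.LogLoss:#.###}");
    Console.WriteLine($"* LogLossReduction: {testMetrics.LogLossReduction:#.###}");

    // Save the trained model to a file.
    SaveModelAsFile(_mlContext, trainingDataViewSchema, _trainedModel);
}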
Create the SaveModelAsFile method below your Evaluate method.
void SaveModelAsFile(MLContext mlContext, DataViewSchema trainingDataViewSchema, ITransformer model)
{
}
Add the following code to your SaveModelAsFile method. This code uses the Save method to serialize and store the trained model as a zip file.
mlContext.Model.Save(model, trainingDataViewSchema, _modelPath);
Deploy and Predict with a model
Add a call to the new method, right under the Evaluate method call, using the following code:
PredictIssue();
Create the PredictIssue method, just after the Evaluate method (and just before the SaveModelAsFile method), using the following code:
void PredictIssue()
{
}
The PredictIssue method executes the following tasks:
- Loads the saved model.
- Creates a single issue of test data.
- Predicts area based on test data.
- Combines test data and predictions for reporting.
- Displays the predicted results.
Load the saved model into your application by adding the following code to the PredictIssue method:
ITransformer loadedModel = _mlContext.Model.Load(_modelPath, out var modelInputSchema);
Add a GitHub issue to test the trained model's prediction in the PredictIssue method by creating an instance of GitHubIssue:
GitHubIssue singleIssue = new GitHubIssue() { Title = "Entity Framework crashes", Description = "When connecting to the database, EF is crashing" };
As you did previously, create a PredictionEngine instance with the following code:
_predEngine = _mlContext.Model.CreatePredictionEngine<GitHubIssue, IssuePrediction>(loadedModel);
The PredictionEngine is a convenience API, which allows you to perform a prediction on a single instance of data. PredictionEngine is not thread-safe. It's acceptable to use in single-threaded or prototype environments. For improved performance and thread safety in production environments, use the PredictionEnginePool service, which creates an ObjectPool of PredictionEngine objects for use throughout your application. See this guide on how to use PredictionEnginePool in an ASP.NET Core Web API; a registration sketch follows the note below.
Note
The PredictionEnginePool service extension is currently in preview.
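As a minimal sketch only, the following shows how such a pool could be registered in an ASP.NET Core app. It assumes the Microsoft.Extensions.ML NuGet package; the model name "GitHubIssueModel" shown here is a placeholder value, not part of this tutorial:
// Program.cs of a hypothetical ASP.NET Core app (requires the Microsoft.Extensions.ML package).
var builder = WebApplication.CreateBuilder(args);

// Register a pool of PredictionEngine objects built from the saved model.zip file.
builder.Services.AddPredictionEnginePool<GitHubIssue, IssuePrediction>()
    .FromFile(modelName: "GitHubIssueModel", filePath: "Models/model.zip", watchForChanges: true);

var app = builder.Build();

// Resolve PredictionEnginePool<GitHubIssue, IssuePrediction> (for example, via constructor
// injection in a controller) and call:
// var prediction = predictionEnginePool.Predict(modelName: "GitHubIssueModel", example: issue);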
Use the PredictionEngine to predict the Area GitHub label by adding the following code to the PredictIssue method for the prediction:
var prediction = _predEngine.Predict(singleIssue);
Use the loaded model for prediction
Display Area in order to categorize the issue and act on it accordingly. Create a display for the results using the following Console.WriteLine() code:
Console.WriteLine($"=============== Single Prediction - Result: {prediction.Area} ===============");
Results
Your results should be similar to the following. As the pipeline processes, it displays messages. You may see warnings or processing messages. These messages have been removed from the following results for clarity.
=============== Single Prediction just-trained-model - Result: area-System.Net ===============
*************************************************************************************************************
* Metrics for Multi-class Classification model - Test Data
*------------------------------------------------------------------------------------------------------------
* MicroAccuracy: 0.738
* MacroAccuracy: 0.668
* LogLoss: .919
* LogLossReduction: .643
*************************************************************************************************************
=============== Single Prediction - Result: area-System.Data ===============
Congratulations! You've now successfully built a machine learning model for classifying and predicting an Area label for a GitHub issue. You can find the source code for this tutorial at the dotnet/samples repository.
Next steps
In this tutorial, you learned how to:
- Prepare your data
- Transform the data
- Train the model
- Evaluate the model
- Predict with the trained model
- Deploy and Predict with a loaded model
Advance to the next tutorial to learn more