Imbalanced dataset binary classification
I am new to machine learning and data science, and for an assignment I have a dataset with a 9:1 class imbalance for binary classification. Could you please guide me on how to approach it? Also, which classifier works best for imbalanced binary classification?
Regards.
machine-learning classification binary-data unbalanced-classes
asked Apr 8 at 10:31 by Sid_Mirza (new contributor)
Comment by Stephan Kolassa (Apr 8 at 19:10): Related: Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?
1 Answer
You got off on the wrong foot by conceptualizing this as a classification problem. The fact that $Y$ is binary has nothing to do with trying to make classifications. And when the balance of $Y$ is far from 1:1 you need to think about modeling tendencies for $Y$, not modeling $Y$. In other words, the appropriate task is to estimate $P(Y=1 | X)$ using a model such as the binary logistic regression model. The logistic model is a direct probability estimator. Details may be found here and here.
Once you have a validated probability model and a utility/cost/loss function you can generate optimum decisions. The probabilities help to trade off the consequences of wrong decisions.
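A minimal sketch of this approach, using scikit-learn's LogisticRegression on synthetic placeholder data with roughly the question's 9:1 imbalance (the data and variable names are illustrative assumptions, not from the thread):

```python
# Minimal sketch: estimate P(Y = 1 | X) directly with binary logistic
# regression. The synthetic data below is a stand-in for the real features.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 5))                            # placeholder features
y = (X[:, 0] + rng.normal(size=5000) > 1.8).astype(int)   # roughly 9:1 imbalance

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# No oversampling or class reweighting: the imbalance is information the
# probability model should reflect, not a defect to be corrected.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]                 # estimated P(Y = 1 | X)
```

The estimated probabilities, not hard 0/1 labels, are the model's output; turning them into decisions is a separate step that requires a loss function.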
answered Apr 8 at 11:59 by Frank Harrell
Comment by Sid_Mirza (Apr 8 at 17:18): Thanks, Frank Harrell. The features are floating-point values and the target is binary, like the 'Y' you describe. I applied linear regression, random forests, decision trees, and some ensemble methods; linear regression gave an AUC of 78.2%, while random forests and LightGBM performed better. Now I want to increase the AUC. Here is the list of parameters I used for LightGBM:
Comment by Sid_Mirza (Apr 8 at 17:21): params = {"objective": "binary", "metric": "auc", "boosting": "gbdt", "max_depth": -1, "num_leaves": 13, "learning_rate": 0.01, "bagging_freq": 5, "bagging_fraction": 0.4, "feature_fraction": 0.05, "min_data_in_leaf": 80, "min_sum_hessian_in_leaf": 10, "tree_learner": "serial", "boost_from_average": "false", "bagging_seed": random_state, "verbosity": 1, "seed": random_state}
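For context, a sketch of how such a parameter dictionary would be passed to LightGBM's training API; the train/validation arrays, the value of random_state, and the round count are assumptions for illustration:

```python
# Hypothetical usage of the params dictionary above with lightgbm.train;
# X_train, y_train, X_valid, y_valid are assumed to already exist.
import lightgbm as lgb

random_state = 42  # assumed value; the comment does not specify it

train_set = lgb.Dataset(X_train, label=y_train)
valid_set = lgb.Dataset(X_valid, label=y_valid, reference=train_set)

booster = lgb.train(params, train_set,
                    num_boost_round=1000,   # illustrative round count
                    valid_sets=[valid_set])

# With objective="binary", predict() returns estimated P(Y = 1 | X).
probs = booster.predict(X_valid)
```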
Comment by Frank Harrell (Apr 9 at 4:11): Is the binary target derived from a floating-point continuous outcome variable? If so, you will need to go back to that variable and not use an information-losing dichotomization.
Comment by Sid_Mirza (2 days ago): Yes, all 100+ attributes have continuous values, on the basis of which we have to classify the target in binary form, either yes or no.
Comment by Frank Harrell (yesterday): I assume by that you mean that the target originated as binary in its rawest form. You are still trying to cast the problem inappropriately as classification. You cannot do anything but estimate tendencies, nor should you. Once you have probability estimates you can make optimum decisions given the loss function.
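As a concrete illustration of that last point, the expected-loss-minimizing decision rule follows directly from the estimated probability and the two misclassification costs (the costs below are made-up numbers, not from the thread):

```python
# Turning probability estimates into decisions under an assumed cost
# structure. "Acting" on a case means treating it as class 1.
cost_fp = 1.0   # assumed cost of acting when the case is actually 0
cost_fn = 9.0   # assumed cost of not acting when the case is actually 1

# Expected loss of acting:      (1 - p) * cost_fp
# Expected loss of not acting:        p * cost_fn
# Act whenever p * cost_fn > (1 - p) * cost_fp,
# i.e. p > cost_fp / (cost_fp + cost_fn).
threshold = cost_fp / (cost_fp + cost_fn)   # 0.1 with these costs
decisions = probs > threshold               # probs from a validated model
```

The conventional 0.5 cutoff is just the special case of equal costs, which is rarely appropriate when the class balance and the consequences of errors are asymmetric.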