{"id":417,"date":"2017-08-09T23:53:51","date_gmt":"2017-08-09T15:53:51","guid":{"rendered":"http:\/\/vinta.ws\/code\/?p=417"},"modified":"2026-02-18T01:20:35","modified_gmt":"2026-02-17T17:20:35","slug":"feature-engineering","status":"publish","type":"post","link":"https:\/\/vinta.ws\/code\/feature-engineering.html","title":{"rendered":"Feature Engineering: Common Methods"},"content":{"rendered":"<p>Feature engineering is the umbrella term for the whole process of turning raw data into features. It is essentially a craft, and it rewards creativity.<\/p>\n<p>This post is updated from time to time.<\/p>\n<h2>Missing Value Imputation<\/h2>\n<p>The simplest, most brute-force approach is to drop the rows that contain missing values.<\/p>\n<p>For numerical features, missing values can be replaced with:<\/p>\n<ul>\n<li><code>0<\/code>: the downside is that it can be confused with values that are genuinely 0<\/li>\n<li><code>-999<\/code>: a sentinel value that does not occur in normal data; a poorly chosen sentinel can itself become an outlier and needs special handling<\/li>\n<li>Mean: the average value<\/li>\n<li>Median: compared with the mean, it is not distorted by outliers<\/li>\n<\/ul>\n<p>For categorical features, missing values can be replaced with:<\/p>\n<ul>\n<li>Mode: the most common value<\/li>\n<li>Replace with &quot;Others&quot; 
or a similar placeholder value<\/li>\n<\/ul>\n<p>Suppose you want to impute <code>age<\/code> and you also have a feature such as <code>gender<\/code>. You can compute the mean, median, or mode of <code>age<\/code> separately for men and women and fill in the missing values per group. A more sophisticated approach is to take the rows without missing values, train a regression or classification model on them, and use that model to predict the missing values.<\/p>\n<p>Some algorithms can tolerate missing values, though. In that case you can add a <code>has_missing_value<\/code> column (known as an NA indicator column).<\/p>\n<p>ref:<br \/>\n<a href=\"http:\/\/adataanalyst.com\/machine-learning\/comprehensive-guide-feature-engineering\/\">http:\/\/adataanalyst.com\/machine-learning\/comprehensive-guide-feature-engineering\/<\/a><br \/>\n<a href=\"https:\/\/stats.stackexchange.com\/questions\/28860\/why-adding-an-na-indicator-column-instead-of-value-imputation-for-randomforest\">https:\/\/stats.stackexchange.com\/questions\/28860\/why-adding-an-na-indicator-column-instead-of-value-imputation-for-randomforest<\/a><\/p>\n<h2>Outliers Detection<\/h2>\n<p>The most intuitive way to spot outliers is to plot the data: use a box plot for a single feature and a scatter plot for a pair of features.<\/p>\n<p>Outliers are usually handled by removing them or transforming them (for example, log transformation or 
binning); you can also apply the same techniques used for missing values.<\/p>\n<p>ref:<br \/>\n<a href=\"https:\/\/www.analyticsvidhya.com\/blog\/2016\/01\/guide-data-exploration\/\">https:\/\/www.analyticsvidhya.com\/blog\/2016\/01\/guide-data-exploration\/<\/a><br \/>\n<a href=\"https:\/\/www.douban.com\/note\/413022836\/\">https:\/\/www.douban.com\/note\/413022836\/<\/a><\/p>\n<h2>Duplicate Entries Removal<\/h2>\n<p>Duplicate or redundant entries refer in particular to rows whose features are all identical but whose target variable differs.<\/p>\n<h2>Feature Scaling<\/h2>\n<h3>Standardization<\/h3>\n<p>In raw data, features differ in meaning and units, so their value ranges can vary enormously. A binary feature takes the value 0 or 1, while a price feature might span [0, 1000000]. With ranges that far apart, a model may be biased toward the feature with the larger range. The fix is to bring features of different scales onto the same scale, which is called standardization or normalization.<\/p>\n<p>In the narrow sense, standardization means computing the z-score so that the data has a mean of 0 and a variance of 1.<\/p>\n<p>ref:<br \/>\n<a href=\"https:\/\/spark.apache.org\/docs\/latest\/ml-features.html#standardscaler\">https:\/\/spark.apache.org\/docs\/latest\/ml-features.html#standardscaler<\/a><br \/>\n<a 
href=\"http:\/\/scikit-learn.org\/stable\/modules\/preprocessing.html#standardization-or-mean-removal-and-variance-scaling\">http:\/\/scikit-learn.org\/stable\/modules\/preprocessing.html#standardization-or-mean-removal-and-variance-scaling<\/a><br \/>\n<a href=\"https:\/\/www.quora.com\/How-bad-is-it-to-standardize-dummy-variables\">https:\/\/www.quora.com\/How-bad-is-it-to-standardize-dummy-variables<\/a><\/p>\n<h3>Normalization<\/h3>\n<p>Normalization scales each sample to unit norm (every sample has a norm of 1), which is useful when computing dot products or the similarity between two samples. Besides standardization and normalization, there is also min-max scaling, which uses the minimum and maximum to rescale the data into [0, 1] or [-1, 1]; this method is sensitive to outliers.<\/p>\n<p>Standardization is applied to each feature separately (per column); normalization is applied to each observation (per row).<\/p>\n<p>SVM, logistic regression, and other algorithms that use a squared loss function need standardization; Vector Space Models need normalization; tree-based algorithms generally need neither, because they are insensitive to scale.<\/p>\n<p>ref:<br \/>\n<a href=\"https:\/\/spark.apache.org\/docs\/latest\/ml-features.html#normalizer\">https:\/\/spark.apache.org\/docs\/latest\/ml-features.html#normalizer<\/a><br 
\/>\n<a href=\"http:\/\/scikit-learn.org\/stable\/modules\/preprocessing.html#normalization\">http:\/\/scikit-learn.org\/stable\/modules\/preprocessing.html#normalization<\/a><br \/>\n<a href=\"https:\/\/www.qcloud.com\/community\/article\/689521\">https:\/\/www.qcloud.com\/community\/article\/689521<\/a><\/p>\n<h2>Feature Transformation<\/h2>\n<p>The following apply to continuous features:<\/p>\n<h3>Rounding<\/h3>\n<p>Some features are precise to n decimal places. If you do not need that much precision, consider <code>round(value * m)<\/code> or <code>round(log(value))<\/code>; you can even treat the rounded values as a categorical feature.<\/p>\n<pre class=\"line-numbers\"><code class=\"language-txt\">confidence  round(confidence * 10)\n0.9594      10\n0.1254      1\n0.1854      2\n0.5454      5\n0.3655      4<\/code><\/pre>\n<h3>Log Transformation<\/h3>\n<p>Because log(x) grows more slowly as x gets larger, taking the log compresses large values and expands small ones; in other words, it compresses the &quot;long tail&quot; and stretches out the &quot;head&quot;. If x originally ranges over [100, 1000], then <code>log(x, 10)<\/code> ranges over [2, 3]. <code>log(1 + x)<\/code> and <code>log(x \/ (1 - x))<\/code> are also common.<\/p>\n<p>A similar option is the square root or the cube root (the cube root also works for negative numbers).<\/p>\n<p>ref:<br \/>\n<a 
href=\"https:\/\/www.safaribooksonline.com\/library\/view\/mastering-feature-engineering\/9781491953235\/ch02.html\">https:\/\/www.safaribooksonline.com\/library\/view\/mastering-feature-engineering\/9781491953235\/ch02.html<\/a><\/p>\n<h3>Binarization<\/h3>\n<p>Pick a threshold for a numerical feature: values above it become 1 and values below it become 0. Take <code>score<\/code>: if you only care about pass or fail, map the score to 1 (<code>score &gt;= 60<\/code>) and 0 (<code>score &lt; 60<\/code>). Or, for a beer sales analysis, you can add an <code>age &gt;= 18<\/code> feature to flag adults.<\/p>\n<p>The same idea works for a categorical feature such as <code>color<\/code>: if you do not care which color it actually is, you can turn it into <code>has_color<\/code>.<\/p>\n<p>ref:<br \/>\n<a href=\"https:\/\/spark.apache.org\/docs\/latest\/ml-features.html#binarizer\">https:\/\/spark.apache.org\/docs\/latest\/ml-features.html#binarizer<\/a><\/p>\n<h3>Binning<\/h3>\n<p>Also known as bucketization.<\/p>\n<p>Take a feature like <code>age<\/code>. You can split all ages into n ranges, such as 0-20, 20-40, 40-60, and so on, or 0-18, 18-40, 40-70, and so on (equal width or equal frequency), and then map each individual age to one of the ranges. If age 26 falls into the second 
bucket, the new feature's value is 2. Here the bucket boundaries are specified by hand. Another way is to split according to the distribution of the data, known as quantization or quantile binning, where you only need to specify the number of buckets.<\/p>\n<p>The same concept applies elsewhere. A datetime feature can be split into morning, noon, afternoon, and evening. For a categorical feature, you can first run <code>SELECT count() ... GROUP BY<\/code> and then replace values that occur fewer times than some threshold with something like &quot;Other&quot;. Or, if you have an <code>occupation<\/code> feature and do not need very precise job information, you can change individual values such as &quot;Web Developer&quot;, &quot;iOS Developer&quot;, or &quot;DBA&quot; to &quot;Software Engineer&quot;.<\/p>\n<p>Binarization and binning both discretize continuous features, which improves the model's ability to generalize over non-linear relationships.<\/p>\n<p>ref:<br \/>\n<a href=\"https:\/\/spark.apache.org\/docs\/latest\/ml-features.html#bucketizer\">https:\/\/spark.apache.org\/docs\/latest\/ml-features.html#bucketizer<\/a><br \/>\n<a href=\"https:\/\/spark.apache.org\/docs\/latest\/ml-features.html#quantilediscretizer\">https:\/\/spark.apache.org\/docs\/latest\/ml-features.html#quantilediscretizer<\/a><br \/>\n<a 
href=\"https:\/\/github.com\/collectivemedia\/spark-ext#optimal-binning\">https:\/\/github.com\/collectivemedia\/spark-ext#optimal-binning<\/a><br \/>\n<a href=\"https:\/\/www.qcloud.com\/community\/article\/689521\">https:\/\/www.qcloud.com\/community\/article\/689521<\/a><\/p>\n<p>The following apply to categorical features:<\/p>\n<p>ref:<br \/>\n<a href=\"https:\/\/en.wikipedia.org\/wiki\/Categorical_variable\">https:\/\/en.wikipedia.org\/wiki\/Categorical_variable<\/a><br \/>\n<a href=\"https:\/\/www.safaribooksonline.com\/library\/view\/introduction-to-machine\/9781449369880\/ch04.html\">https:\/\/www.safaribooksonline.com\/library\/view\/introduction-to-machine\/9781449369880\/ch04.html<\/a><\/p>\n<h3>Integer Encoding<\/h3>\n<p>Also known as label encoding.<\/p>\n<p>Map each category to a number. One option is to assign 0, 1, 2, 3, 4, and so on arbitrarily; another is to assign values in order of frequency, for example 0 for the most common value, then 1, 2, 3, and so on. For categorical features that have some natural order (called ordinal), such as &quot;Diamond member&quot;, &quot;Platinum member&quot;, &quot;Gold member&quot;, and &quot;Regular member&quot;, mapping directly to numbers is probably fine. For features with no obvious ordering, such as <code>color<\/code> or <code>city<\/code>, one-hot encoding is a better fit. That said, it does not really matter if you use a tree-based 
algorithm.<\/p>\n<p>Some categorical features are also represented as numbers (an id, for example). The difference from continuous features is that the differences between values, or their magnitudes, carry little meaning for a categorical feature.<\/p>\n<p>ref:<br \/>\n<a href=\"http:\/\/breezedeus.github.io\/2014\/11\/15\/breezedeus-feature-processing.html\">http:\/\/breezedeus.github.io\/2014\/11\/15\/breezedeus-feature-processing.html<\/a><br \/>\n<a href=\"http:\/\/phunters.lofter.com\/post\/86d56_194e956\">http:\/\/phunters.lofter.com\/post\/86d56_194e956<\/a><\/p>\n<h3>One-hot Encoding (OHE)<\/h3>\n<p>If a feature has m distinct values (for example Taipei, Beijing, Tokyo), one-hot encoding turns it into a vector of length m:<\/p>\n<pre class=\"line-numbers\"><code class=\"language-txt\">city    city_Taipei city_Beijing city_Tokyo\nTaipei  1           0            0\nBeijing 0           1            0\nTokyo   0           0            1<\/code><\/pre>\n<p>You can also use dummy coding instead, which only needs a vector of length m - 1:<\/p>\n<pre class=\"line-numbers\"><code class=\"language-txt\">city    city_Taipei city_Beijing\nTaipei  1           0\nBeijing 0           1\nTokyo   0           0<\/code><\/pre>\n<p>The drawbacks of OHE are that it can greatly increase the feature dimensionality and that it cannot handle previously unseen values.<\/p>\n<p>ref:<br \/>\n<a href=\"http:\/\/scikit-learn.org\/stable\/modules\/preprocessing.html#preprocessing-categorical-features\">http:\/\/scikit-learn.org\/stable\/modules\/preprocessing.html#preprocessing-categorical-features<\/a><br \/>\n<a 
href=\"https:\/\/blog.myyellowroad.com\/using-categorical-data-in-machine-learning-with-python-from-dummy-variables-to-deep-category-66041f734512\">https:\/\/blog.myyellowroad.com\/using-categorical-data-in-machine-learning-with-python-from-dummy-variables-to-deep-category-66041f734512<\/a><\/p>\n<h3>Bin-counting<\/h3>\n<p>In computational advertising, for example, if you have each user's number of ad impressions (both clicked and not clicked) and number of ad clicks, you can compute each user's click-through rate and use that probability to represent the user. The same approach works in the other direction for ad ids.<\/p>\n<pre class=\"line-numbers\"><code class=\"language-txt\">ad_id   ad_views  ad_clicks  ad_ctr\n412533  18339     1355       0.074\n423334  335       12         0.036\n345664  1244      132        0.106\n349833  35387     1244       0.035<\/code><\/pre>\n<p>ref:<br \/>\n<a href=\"https:\/\/blogs.technet.microsoft.com\/machinelearning\/2015\/02\/17\/big-learning-made-easy-with-counts\/\">https:\/\/blogs.technet.microsoft.com\/machinelearning\/2015\/02\/17\/big-learning-made-easy-with-counts\/<\/a><\/p>\n<p>From another angle, suppose you have a <code>brand<\/code> feature and users' purchase records show that, of the people who buy brand A, 70% also buy brand B and 40% also buy brand C, while of the people who buy brand D, 10% also buy brand A and 10% also buy brand E. You can then represent each brand like this:<\/p>\n<pre 
class=\"line-numbers\"><code class=\"language-txt\">brand  A    B    C    D    E\nA      1.0  0.7  0.4  0.0  0.0\nB      ...\nC      ...\nD      0.1  0.0  0.0  1.0  0.1\nE      ...<\/code><\/pre>\n<p>ref:<br \/>\n<a href=\"http:\/\/phunters.lofter.com\/post\/86d56_194e956\">http:\/\/phunters.lofter.com\/post\/86d56_194e956<\/a><\/p>\n<h3>LabelCount Encoding<\/h3>\n<p>Similar to bin-counting, this also uses existing counts or other statistics. The difference is that LabelCount Encoding ends up using the rank rather than the value itself. The advantage is that it is insensitive to outliers.<\/p>\n<pre class=\"line-numbers\"><code class=\"language-txt\">ad_id   ad_clicks  ad_rank\n412533  1355       1\n423334  12         4\n345664  132        3\n349833  1244       2<\/code><\/pre>\n<p>ref:<br \/>\n<a href=\"https:\/\/www.slideshare.net\/gabrielspmoreira\/feature-engineering-getting-most-out-of-data-for-predictive-models-tdc-2017\/47\">https:\/\/www.slideshare.net\/gabrielspmoreira\/feature-engineering-getting-most-out-of-data-for-predictive-models-tdc-2017\/47<\/a><\/p>\n<h3>Count Vectorization<\/h3>\n<p>Besides text features, this method also works for comma-separated categorical features. Take a movie <code>genre<\/code> feature whose values look like <code>Action,Sci-Fi,Drama<\/code>: you can first use <code>RegexTokenizer<\/code> to turn it into <code>Array(&quot;action&quot;, &quot;sci-fi&quot;, &quot;drama&quot;)<\/code> and then use <code>CountVectorizer<\/code> to turn that into a vector.<\/p>\n<p>ref:<br \/>\n<a 
href=\"https:\/\/spark.apache.org\/docs\/latest\/ml-features.html#countvectorizer\">https:\/\/spark.apache.org\/docs\/latest\/ml-features.html#countvectorizer<\/a><\/p>\n<h3>Feature Hashing<\/h3>\n<p>Take user ids as an example: a hash function maps each user id to one of the values <code>(hashed_1, hashed_2, ..., hashed_m)<\/code>. You choose m to be much smaller than the number of possible user ids, so the drawback is collisions (if your model is robust enough, you can ignore them). The advantage is that previously unseen values and rare values are handled gracefully. You are not limited to hashing a single value; you can hash a vector as well.<\/p>\n<p>The hashed feature can be represented as a single numeric column (for example <code>2<\/code>) or as a multi-column binary representation similar to one-hot encoding (for example <code>[0, 0, 1]<\/code>).<\/p>\n<pre class=\"line-numbers\"><code class=\"language-py\">import hashlib\n\ndef hash_func(s, n_bins=100000):\n    # Hash the string with MD5 and fold it into a bucket in [1, n_bins - 1]\n    s = s.encode('utf-8')\n    return int(hashlib.md5(s).hexdigest(), 16) % (n_bins - 1) + 1\n\nprint(hash_func('some categorical value'))<\/code><\/pre>\n<p>ref:<br \/>\n<a href=\"https:\/\/github.com\/apache\/spark\/pull\/18513\">https:\/\/github.com\/apache\/spark\/pull\/18513<\/a><br \/>\n<a href=\"https:\/\/spark.apache.org\/docs\/latest\/ml-features.html#feature-transformation\">https:\/\/spark.apache.org\/docs\/latest\/ml-features.html#feature-transformation<\/a><br \/>\n<a 
href=\"https:\/\/www.slideshare.net\/gabrielspmoreira\/feature-engineering-getting-most-out-of-data-for-predictive-models-tdc-2017\/42\">https:\/\/www.slideshare.net\/gabrielspmoreira\/feature-engineering-getting-most-out-of-data-for-predictive-models-tdc-2017\/42<\/a><\/p>\n<h3>Mean Encoding<\/h3>\n<p>ref:<br \/>\n<a href=\"https:\/\/zhuanlan.zhihu.com\/p\/26308272\">https:\/\/zhuanlan.zhihu.com\/p\/26308272<\/a><\/p>\n<h3>Category Embedding<\/h3>\n<p>ref:<br \/>\n<a href=\"https:\/\/arxiv.org\/abs\/1604.06737\">https:\/\/arxiv.org\/abs\/1604.06737<\/a><br \/>\n<a href=\"https:\/\/www.slideshare.net\/HJvanVeen\/feature-engineering-72376750\/17\">https:\/\/www.slideshare.net\/HJvanVeen\/feature-engineering-72376750\/17<\/a><br \/>\n<a href=\"https:\/\/blog.myyellowroad.com\/using-categorical-data-in-machine-learning-with-python-from-dummy-variables-to-deep-category-42fd0a43b009\">https:\/\/blog.myyellowroad.com\/using-categorical-data-in-machine-learning-with-python-from-dummy-variables-to-deep-category-42fd0a43b009<\/a><\/p>\n<h3>User Profile<\/h3>\n<p>Use a user profile to represent each user id: the user's age, gender, occupation, income, place of residence, preferred tags, and so on, so that each user becomes a feature vector. Beyond single-dimension features, you can also build features such as &quot;which genres of songs the user has listened to&quot; or &quot;which categories of articles the user has viewed (in the last 30 days)&quot;, expressed as TF-IDF. Alternatively, average the vectors of all the articles a user liked and use the result as that user's 
profile. For example, if a user often follows articles about recommender systems, the weights for &quot;CB&quot;, &quot;CF&quot;, and &quot;recommender system&quot; in their profile will be relatively high.<\/p>\n<p>ref:<br \/>\n<a href=\"https:\/\/mp.weixin.qq.com\/s\/w87-dyG9Ap9xJ_HZu0Qn-w\">https:\/\/mp.weixin.qq.com\/s\/w87-dyG9Ap9xJ_HZu0Qn-w<\/a><br \/>\n<a href=\"https:\/\/medium.com\/unstructured\/how-feature-engineering-can-help-you-do-well-in-a-kaggle-competition-part-i-9cc9a883514d\">https:\/\/medium.com\/unstructured\/how-feature-engineering-can-help-you-do-well-in-a-kaggle-competition-part-i-9cc9a883514d<\/a><\/p>\n<h4>Rare Categorical Variables<\/h4>\n<p>First count the occurrences of each category, then replace every category whose count is below some threshold with a value such as &quot;Others&quot;. A clustering algorithm can achieve the same goal. You can also simply create a new binary feature called rare to flag the relatively uncommon data points.<\/p>\n<h4>Unseen Categorical Variables<\/h4>\n<p>Take Spark ML as an example. When you fit a <code>StringIndexer<\/code> (and a <code>OneHotEncoder<\/code>) on the training set and then apply it to the test set, there is a chance that some categorical values appear only in the test set. To a transformer that has only seen the training set, these are unseen values.<\/p>\n<p>For unseen values, 
there are a few common approaches:<\/p>\n<ul>\n<li>Encode the categorical feature using the whole training set plus test set<\/li>\n<li>Drop the rows that contain unseen values<\/li>\n<li>Replace unseen values with a known value such as &quot;Others&quot;. <code>StringIndexer<\/code>'s <code>.setHandleInvalid(&quot;keep&quot;)<\/code> does essentially this<\/li>\n<\/ul>\n<p>With the first approach, once you take the transformer to production you will inevitably still run into unseen values. Online feature engineering usually works differently, though: for example, precompute the features of every user or item (updated periodically or triggered when new data arrives), store them in a NoSQL store such as Redis keyed by id, and let the model fetch the ready-made feature vector by user id \/ item id when it needs one.<\/p>\n<p>ref:<br \/>\n<a href=\"https:\/\/stackoverflow.com\/questions\/34681534\/spark-ml-stringindexer-handling-unseen-labels\">https:\/\/stackoverflow.com\/questions\/34681534\/spark-ml-stringindexer-handling-unseen-labels<\/a><\/p>\n<h4>Large Categorical Variables<\/h4>\n<p>For very large categorical features (id-like features, for example), if you use logistic regression you can simply brute-force it with one-hot encoding. Otherwise, use feature hashing or bin-counting as described above; with GBDT 
you can even feed in the raw id directly, as long as there are enough trees.<\/p>\n<p>ref:<br \/>\n<a href=\"https:\/\/www.zhihu.com\/question\/34819617\">https:\/\/www.zhihu.com\/question\/34819617<\/a><\/p>\n<h2>Feature Construction<\/h2>\n<p>Feature construction means manually creating new features from existing ones, usually to work around the fact that ordinary linear models cannot learn non-linear features. A key question is whether you can find a way to bring &quot;extra information&quot; into the features, although you also need to watch out for data bias.<\/p>\n<p>If you have a lot of user purchase data, besides aggregating it into a feature like <code>total spend<\/code>, you can also slice it into <code>spend in last week<\/code>, <code>spend in last month<\/code>, and <code>spend in last year<\/code>, features that can express a trend.<\/p>\n<p>Examples:<\/p>\n<ul>\n<li><code>author_avg_page_view<\/code>: the average page views across all articles by the article's author<\/li>\n<li><code>user_visited_days_since_doc_published<\/code>: the number of days between the article's publication and the user's visit<\/li>\n<li><code>user_history_doc_sim_categories<\/code>: how close the categories of all articles the user has read are to this article's categories, measured as TF-IDF 
similarity<\/li>\n<li><code>user_history_doc_sim_topics<\/code>: the TF-IDF similarity between the body text of all articles the user has read and this article's body text<\/li>\n<\/ul>\n<p>ref:<br \/>\n<a href=\"https:\/\/medium.com\/unstructured\/how-feature-engineering-can-help-you-do-well-in-a-kaggle-competition-part-i-9cc9a883514d\">https:\/\/medium.com\/unstructured\/how-feature-engineering-can-help-you-do-well-in-a-kaggle-competition-part-i-9cc9a883514d<\/a><br \/>\n<a href=\"https:\/\/www.safaribooksonline.com\/library\/view\/large-scale-machine\/9781785888748\/ch04s02.html\">https:\/\/www.safaribooksonline.com\/library\/view\/large-scale-machine\/9781785888748\/ch04s02.html<\/a><br \/>\n<a href=\"https:\/\/www.slideshare.net\/HJvanVeen\/feature-engineering-72376750\/23\">https:\/\/www.slideshare.net\/HJvanVeen\/feature-engineering-72376750\/23<\/a><\/p>\n<h3>Temporal Features<\/h3>\n<p>For date \/ time data, besides converting to a timestamp and extracting day, month, and year into new columns, you can also bin the hour (morning, noon, evening, and so on) or bin the day (weekday vs. weekend). You can also look up the weather, holidays, or events on that date, giving features such as <code>is_national_holiday<\/code> or <code>has_sport_events<\/code>.<\/p>\n<p>Going a step further, datetime data can usually also be turned into features such as <code>spend_hours_last_week<\/code> or <code>spend_money_last_week<\/code> that express a trend.<\/p>\n<h3>Text 
Features<\/h3>\n<p>ref:<br \/>\n<a href=\"https:\/\/www.slideshare.net\/HJvanVeen\/feature-engineering-72376750\/57\">https:\/\/www.slideshare.net\/HJvanVeen\/feature-engineering-72376750\/57<\/a><\/p>\n<h3>Spatial Features<\/h3>\n<p>If you have features such as <code>city<\/code> or <code>address<\/code>, you can derive two new features, <code>latitude<\/code> and <code>longitude<\/code> (you will of course need an external API or data source for that), and then combine them into a feature like <code>median_income_within_2_miles<\/code>.<\/p>\n<p>ref:<br \/>\n<a href=\"https:\/\/www.slideshare.net\/HJvanVeen\/feature-engineering-72376750\/47\">https:\/\/www.slideshare.net\/HJvanVeen\/feature-engineering-72376750\/47<\/a><\/p>\n<h3>Cyclical Features<\/h3>\n<p>ref:<br \/>\n<a href=\"http:\/\/blog.davidkaleko.com\/feature-engineering-cyclical-features.html\">http:\/\/blog.davidkaleko.com\/feature-engineering-cyclical-features.html<\/a><\/p>\n<h2>Feature Interaction<\/h2>\n<p>Suppose you have two continuous features <code>A<\/code> and <code>B<\/code>; you can build new features with <code>A + B<\/code>, <code>A - B<\/code>, <code>A * B<\/code>, or <code>A \/ B<\/code>. For example, <code>house_age_at_purchase = house_purchase_date - house_built_date<\/code> or <code>click_through_rate = n_clicks \/ n_impressions<\/code>.<\/p>\n<p>A similar technique is Polynomial Expansion: with degree 2, the two features <code>(x, y)<\/code> become the five features <code>(x, x * x, y, x * y, y * y)<\/code>.<\/p>\n<p>ref:<br \/>\n<a 
href=\"https:\/\/spark.apache.org\/docs\/latest\/ml-features.html#polynomialexpansion\">https:\/\/spark.apache.org\/docs\/latest\/ml-features.html#polynomialexpansion<\/a><br \/>\n<a href=\"https:\/\/elitedatascience.com\/feature-engineering-best-practices\">https:\/\/elitedatascience.com\/feature-engineering-best-practices<\/a><\/p>\n<h2>Feature Combination<\/h2>\n<p>Also known as feature crossing.<\/p>\n<p>Feature combination mainly targets categorical features, while feature interaction applies to continuous features. The idea is much the same in both cases: join two or more features in some way to form a new feature. It is usually used to address the fact that a plain linear model cannot learn nonlinear relationships on its own.<\/p>\n<p>Suppose there are two features, <code>gender<\/code> and <code>wealth<\/code>, with 2 and 3 possible values respectively. The simplest approach is string concatenation, which yields a new feature <code>gender_wealth<\/code> with 2 x 3 = 6 possible values. Because it is a categorical feature, you can apply <code>StringIndexer<\/code> and <code>OneHotEncoder<\/code> to <code>gender_wealth<\/code> directly. You can of course also combine a continuous feature with a categorical one, for example <code>age_wealth<\/code>; the only difference is that the nonzero entry of the vector holds <code>age<\/code> itself instead of 1.<\/p>\n<p>Suppose C is a categorical feature and N is a continuous 
feature. Several combinations are meaningful:<\/p>\n<ul>\n<li><code>median(N) GROUP BY C<\/code>: median<\/li>\n<li><code>mean(N) GROUP BY C<\/code>: arithmetic mean<\/li>\n<li><code>mode(N) GROUP BY C<\/code>: mode<\/li>\n<li><code>min(N) GROUP BY C<\/code>: minimum<\/li>\n<li><code>max(N) GROUP BY C<\/code>: maximum<\/li>\n<li><code>std(N) GROUP BY C<\/code>: standard deviation<\/li>\n<li><code>var(N) GROUP BY C<\/code>: variance<\/li>\n<li><code>N - median(N) GROUP BY C<\/code>: deviation from the group median<\/li>\n<\/ul>\n<pre class=\"line-numbers\"><code class=\"language-txt\">user_id  age  gender  wealth  gender_wealth  gender_wealth_ohe   age_wealth\n1        56   male    rich    male_rich      [1, 0, 0, 0, 0, 0]  [56, 0, 0]\n2        30   male    middle  male_middle    [0, 1, 0, 0, 0, 0]  [0, 30, 0]\n3        19   female  rich    female_rich    [0, 0, 0, 1, 0, 0]  [19, 0, 0]\n4        62   female  poor    female_poor    [0, 0, 0, 0, 0, 1]  [0, 0, 62]\n5        78   male    poor    male_poor      [0, 0, 1, 0, 0, 0]  [0, 0, 78]\n6        34   female  middle  female_middle  [0, 0, 0, 0, 1, 0]  [0, 34, 0]<\/code><\/pre>\n<p>ref:<br \/>\n<a href=\"http:\/\/breezedeus.github.io\/2014\/11\/15\/breezedeus-feature-processing.html\">http:\/\/breezedeus.github.io\/2014\/11\/15\/breezedeus-feature-processing.html<\/a><br \/>\n<a href=\"http:\/\/phunters.lofter.com\/post\/86d56_194e956\">http:\/\/phunters.lofter.com\/post\/86d56_194e956<\/a><br \/>\n<a href=\"https:\/\/zhuanlan.zhihu.com\/p\/26444240\">https:\/\/zhuanlan.zhihu.com\/p\/26444240<\/a><br \/>\n<a href=\"http:\/\/blog.csdn.net\/mytestmy\/article\/details\/40933235\">http:\/\/blog.csdn.net\/mytestmy\/article\/details\/40933235<\/a><\/p>\n<h2>Feature Extraction<\/h2>\n<p>Usually refers to dimensionality reduction.<\/p>\n<ul>\n<li>Principal Component Analysis (PCA)<\/li>\n<li>Latent Dirichlet 
Allocation (LDA)<\/li>\n<li>Latent Semantic Analysis (LSA)<\/li>\n<\/ul>\n<h2>Feature Selection<\/h2>\n<p>Feature selection means using some method to automatically pick the useful features out of all available features.<\/p>\n<p>ref:<br \/>\n<a href=\"http:\/\/scikit-learn.org\/stable\/modules\/feature_selection.html\">http:\/\/scikit-learn.org\/stable\/modules\/feature_selection.html<\/a><\/p>\n<h3>Filter Method<\/h3>\n<p>Uses an evaluation metric (dispersion, correlation, Information Gain, and so on) to measure the relationship between each individual feature and the target variable, one feature at a time; a common choice is the Chi-Square Test. No model is involved in this kind of feature selection.<\/p>\n<p>As for correlation, a higher correlation with the target variable is not necessarily better.<\/p>\n<p>ref:<br \/>\n<a href=\"https:\/\/spark.apache.org\/docs\/latest\/ml-features.html#chisqselector\">https:\/\/spark.apache.org\/docs\/latest\/ml-features.html#chisqselector<\/a><br \/>\n<a href=\"http:\/\/files.cnblogs.com\/files\/XBWer\/%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0%E3%81%AE%E7%89%B9%E5%BE%81.pdf\">http:\/\/files.cnblogs.com\/files\/XBWer\/%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0%E3%81%AE%E7%89%B9%E5%BE%81.pdf<\/a><\/p>\n<h3>Wrapper Method<\/h3>\n<p>Uses a model to predict your target 
variable and treats feature selection as a combinatorial optimization problem: search for the feature subset that gives the model its best evaluation result. The drawback is that it is very time-consuming, so it is not often used in practice.<\/p>\n<p>ref:<br \/>\n<a href=\"http:\/\/www.cnblogs.com\/heaad\/archive\/2011\/01\/02\/1924088.html\">http:\/\/www.cnblogs.com\/heaad\/archive\/2011\/01\/02\/1924088.html<\/a><\/p>\n<h3>Embedded Method<\/h3>\n<p>Usually uses an algorithm that assigns coefficients or importances to features, such as Logistic Regression (especially with an L1 penalty) or GBDT, ranks all features by those weights or importances, and keeps the top n as the feature subset.<\/p>\n<p>ref:<br \/>\n<a href=\"http:\/\/scikit-learn.org\/stable\/modules\/feature_selection.html#feature-selection-using-selectfrommodel\">http:\/\/scikit-learn.org\/stable\/modules\/feature_selection.html#feature-selection-using-selectfrommodel<\/a><br \/>\n<a href=\"https:\/\/www.zhihu.com\/question\/28641663\">https:\/\/www.zhihu.com\/question\/28641663<\/a><\/p>\n<h2>Feature Learning<\/h2>\n<p>Also known as Representation Learning or Automated Feature Engineering.<\/p>\n<ul>\n<li>GBDT<\/li>\n<li>Neural Network: Restricted Boltzmann Machines<\/li>\n<li>Deep Learning: Autoencoder<\/li>\n<\/ul>\n<p>ref:<br \/>\n<a href=\"https:\/\/zhuanlan.zhihu.com\/p\/26444240\">https:\/\/zhuanlan.zhihu.com\/p\/26444240<\/a><\/p>\n<h2>Data Leakage<\/h2>\n<p>Data leakage means that you have, directly or indirectly, included information derived from the target variable in your features, information that would not be available at prediction time.<\/p>\n<p>ref:<br 
\/>\n<a href=\"https:\/\/zhuanlan.zhihu.com\/p\/26444240\">https:\/\/zhuanlan.zhihu.com\/p\/26444240<\/a><\/p>\n<h2>Target Engineering<\/h2>\n<p>Although it is not strictly part of feature engineering, you can also transform the target variable \/ label (the value your model is trying to predict), for example with <code>log(y + 1)<\/code> or <code>exp(y) - 1<\/code> (each is the inverse of the other).<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Feature engineering is a craft that rewards creativity.<\/p>\n","protected":false},"author":1,"featured_media":418,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[97],"tags":[111,98],"class_list":["post-417","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-about-ai","tag-feature-engineering","tag-machine-learning"],"_links":{"self":[{"href":"https:\/\/vinta.ws\/code\/wp-json\/wp\/v2\/posts\/417","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/vinta.ws\/code\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/vinta.ws\/code\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/vinta.ws\/code\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/vinta.ws\/code\/wp-json\/wp\/v2\/comments?post=417"}],"version-history":[{"count":0,"href":"https:\/\/vinta.ws\/code\/wp-json\/wp\/v2\/posts\/417\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/vinta.ws\/code\/wp-json\/wp\/v2\/media\/418"}],"wp:attachment":[{"href":"https:\/\/vinta.ws\/code\/wp-json\/wp\/v2\/media?parent=417"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/vinta.ws\/code\/wp-json\/wp\/v2\/categories?post=417"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\
/\/vinta.ws\/code\/wp-json\/wp\/v2\/tags?post=417"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}