This was my first time taking part in a feature-engineering style competition on Kaggle, so I've written up my experience in detail below:
You only get 5 submissions per day. If you really want a good final rank, you could register a second account, and apparently you can even join two teams at once (I saw someone I know doing exactly that). I went in with a just-give-it-a-try attitude and didn't want to drag anyone down, so I neither registered a second account for extra submissions nor joined a team. (Teaming up doesn't require knowing your teammates in person, by the way.)

In this competition each Kernel is limited to 1 hour of runtime, and a single run takes a long time. I got lazy: after a kernel finishes you can keep debugging on its page without losing the ability to submit that kernel's output, so when one kernel finished I copied its URL into another tab and then submitted the output from the original tab as a submission. The result was that I submitted the same kernel's output twice! That laziness nearly sank me: on the last day a very useful kernel was published, and I had already wasted two of that day's submission chances. On the final day, when a strong new public kernel drops, the leaderboard can shift a lot. Oh well, lesson learned.

The day before the Stage 1 deadline someone released the source code of a very good feature-engineering approach, and it runs much faster than the other methods. Let's study it. The source code is pasted below.
import multiprocessing as mp
import pandas as pd
from time import time
from scipy.sparse import csr_matrix
import os
from sklearn.linear_model import Ridge
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer, TfidfTransformer
from sklearn.metrics import mean_squared_log_error
from sklearn.preprocessing import OneHotEncoder
import numpy as np
import gc
from sklearn.base import BaseEstimator, TransformerMixin
import re
from pandas.api.types import is_numeric_dtype, is_categorical_dtype

os.environ['MKL_NUM_THREADS'] = '4'
os.environ['OMP_NUM_THREADS'] = '4'
os.environ['JOBLIB_START_METHOD'] = 'forkserver'

INPUT_PATH = r'../input'


def dameraulevenshtein(seq1, seq2):
    """Calculate the Damerau-Levenshtein distance between sequences.

    This method has not been modified from the original.
    Source: http://mwh.geek.nz/2009/04/26/python-damerau-levenshtein-distance/

    This distance is the number of additions, deletions, substitutions,
    and transpositions needed to transform the first sequence into the
    second. Although generally used with strings, any sequences of
    comparable objects will work.

    Transpositions are exchanges of *consecutive* characters; all other
    operations are self-explanatory.

    This implementation is O(N*M) time and O(M) space, for N and M the
    lengths of the two sequences.

    >>> dameraulevenshtein('ba', 'abc')
    2
    >>> dameraulevenshtein('fee', 'deed')
    2

    It works with arbitrary sequences too:
    >>> dameraulevenshtein('abcd', ['b', 'a', 'c', 'd', 'e'])
    2
    """
    # codesnippet:D0DE4716-B6E6-4161-9219-2903BF8F547F
    # Conceptually, this is based on a len(seq1) + 1 * len(seq2) + 1 matrix.
    # However, only the current and two previous rows are needed at once,
    # so we only store those.
    oneago = None
    thisrow = list(range(1, len(seq2) + 1)) + [0]
    for x in range(len(seq1)):
        # Python lists wrap around for negative indices, so put the
        # leftmost column at the *end* of the list. This matches with
        # the zero-indexed strings and saves extra calculation.
        twoago, oneago, thisrow = (oneago, thisrow, [0] * len(seq2) + [x + 1])
        for y in range(len(seq2)):
            delcost = oneago[y] + 1
            addcost = thisrow[y - 1] + 1
            subcost = oneago[y - 1] + (seq1[x] != seq2[y])
            thisrow[y] = min(delcost, addcost, subcost)
            # This block deals with transpositions
            if (x > 0 and y > 0 and seq1[x] == seq2[y - 1]
                    and seq1[x - 1] == seq2[y] and seq1[x] != seq2[y]):
                thisrow[y] = min(thisrow[y], twoago[y - 2] + 1)
    return thisrow[len(seq2) - 1]


class SymSpell:
    def __init__(self, max_edit_distance=3, verbose=0):
        self.max_edit_distance = max_edit_distance
        self.verbose = verbose
        # 0: top suggestion
        # 1: all suggestions of smallest edit distance
        # 2: all suggestions <= max_edit_distance (slower, no early termination)
        self.dictionary = {}
        self.longest_word_length = 0

    def get_deletes_list(self, w):
        """given a word, derive strings with up to max_edit_distance characters deleted"""
        deletes = []
        queue = [w]
        for d in range(self.max_edit_distance):
            temp_queue = []
            for word in queue:
                if len(word) > 1:
                    for c in range(len(word)):  # character index
                        word_minus_c = word[:c] + word[c + 1:]
                        if word_minus_c not in deletes:
                            deletes.append(word_minus_c)
                        if word_minus_c not in temp_queue:
                            temp_queue.append(word_minus_c)
            queue = temp_queue

        return deletes

    def create_dictionary_entry(self, w):
        '''add word and its derived deletions to dictionary'''
        # check if word is already in dictionary
        # dictionary entries are in the form: (list of suggested corrections,
        # frequency of word in corpus)
        new_real_word_added = False
        if w in self.dictionary:
            # increment count of word in corpus
            self.dictionary[w] = (self.dictionary[w][0], self.dictionary[w][1] + 1)
        else:
            self.dictionary[w] = ([], 1)
            self.longest_word_length = max(self.longest_word_length, len(w))

        if self.dictionary[w][1] == 1:
            # first appearance of word in corpus
            # n.b.
            # word may already be in dictionary as a derived word
            # (deleting character from a real word)
            # but counter of frequency of word in corpus is not incremented
            # in those cases)
            new_real_word_added = True
            deletes = self.get_deletes_list(w)
            for item in deletes:
                if item in self.dictionary:
                    # add (correct) word to delete's suggested correction list
                    self.dictionary[item][0].append(w)
                else:
                    # note frequency of word in corpus is not incremented
                    self.dictionary[item] = ([w], 0)

        return new_real_word_added

    def create_dictionary_from_arr(self, arr, token_pattern=r'[a-z]+'):
        total_word_count = 0
        unique_word_count = 0

        for line in arr:
            # separate by words by non-alphabetical characters
            words = re.findall(token_pattern, line.lower())
            for word in words:
                total_word_count += 1
                if self.create_dictionary_entry(word):
                    unique_word_count += 1

        print("total words processed: %i" % total_word_count)
        print("total unique words in corpus: %i" % unique_word_count)
        print("total items in dictionary (corpus words and deletions): %i" % len(self.dictionary))
        print(" edit distance for deletions: %i" % self.max_edit_distance)
        print(" length of longest word in corpus: %i" % self.longest_word_length)
        return self.dictionary

    def create_dictionary(self, fname):
        total_word_count = 0
        unique_word_count = 0

        with open(fname) as file:
            for line in file:
                # separate by words by non-alphabetical characters
                words = re.findall('[a-z]+', line.lower())
                for word in words:
                    total_word_count += 1
                    if self.create_dictionary_entry(word):
                        unique_word_count += 1

        print("total words processed: %i" % total_word_count)
        print("total unique words in corpus: %i" % unique_word_count)
        print("total items in dictionary (corpus words and deletions): %i" % len(self.dictionary))
        print(" edit distance for deletions: %i" % self.max_edit_distance)
        print(" length of longest word in corpus: %i" % self.longest_word_length)
        return self.dictionary

    def get_suggestions(self, string, silent=False):
        """return list of suggested corrections for potentially incorrectly spelled word"""
        if (len(string) - self.longest_word_length) > self.max_edit_distance:
            if not silent:
                print("no items in dictionary within maximum edit distance")
            return []

        suggest_dict = {}
        min_suggest_len = float('inf')

        queue = [string]
        q_dictionary = {}  # items other than string that we've checked

        while len(queue) > 0:
            q_item = queue[0]  # pop
            queue = queue[1:]

            # early exit
            if ((self.verbose < 2) and (len(suggest_dict) > 0) and
                    ((len(string) - len(q_item)) > min_suggest_len)):
                break

            # process queue item
            if (q_item in self.dictionary) and (q_item not in suggest_dict):
                if self.dictionary[q_item][1] > 0:
                    # word is in dictionary, and is a word from the corpus, and
                    # not already in suggestion list so add to suggestion
                    # dictionary, indexed by the word with value (frequency in
                    # corpus, edit distance)
                    # note q_items that are not the input string are shorter
                    # than input string since only deletes are added (unless
                    # manual dictionary corrections are added)
                    assert len(string) >= len(q_item)
                    suggest_dict[q_item] = (self.dictionary[q_item][1],
                                            len(string) - len(q_item))
                    # early exit
                    if (self.verbose < 2) and (len(string) == len(q_item)):
                        break
                    elif (len(string) - len(q_item)) < min_suggest_len:
                        min_suggest_len = len(string) - len(q_item)

                # the suggested corrections for q_item as stored in
                # dictionary (whether or not q_item itself is a valid word
                # or merely a delete) can be valid corrections
                for sc_item in self.dictionary[q_item][0]:
                    if sc_item not in suggest_dict:

                        # compute edit distance
                        # suggested items should always be
                        # longer (unless manual corrections are added)
                        assert len(sc_item) > len(q_item)

                        # q_items that are not input should be shorter
                        # than original string
                        # (unless manual corrections added)
                        assert len(q_item) <= len(string)

                        if len(q_item) == len(string):
                            assert q_item == string

                        item_dist = len(sc_item) - len(q_item)

                        # item in suggestions list should not be the same as
                        # the string itself
                        assert sc_item != string

                        # calculate edit distance using, for example,
                        # Damerau-Levenshtein distance
                        item_dist = dameraulevenshtein(sc_item, string)

                        # do not add words with greater edit distance if
                        # verbose setting not on
                        if (self.verbose < 2) and (item_dist > min_suggest_len):
                            pass
                        elif item_dist <= self.max_edit_distance:
                            assert sc_item in self.dictionary  # should already be in dictionary if in suggestion list
                            suggest_dict[sc_item] = (self.dictionary[sc_item][1], item_dist)
                            if item_dist < min_suggest_len:
                                min_suggest_len = item_dist

                        # depending on order words are processed, some words
                        # with different edit distances may be entered into
                        # suggestions; trim suggestion dictionary if verbose
                        # setting not on
                        if self.verbose < 2:
                            suggest_dict = {k: v for k, v in suggest_dict.items() if v[1] <= min_suggest_len}

            # now generate deletes (e.g. a substring of string or of a delete)
            # from the queue item
            # as additional items to check -- add to end of queue
            assert len(string) >= len(q_item)

            # do not add words with greater edit distance if verbose setting
            # is not on
            if (self.verbose < 2) and ((len(string) - len(q_item)) > min_suggest_len):
                pass
            elif (len(string) - len(q_item)) < self.max_edit_distance and len(q_item) > 1:
                for c in range(len(q_item)):  # character index
                    word_minus_c = q_item[:c] + q_item[c + 1:]
                    if word_minus_c not in q_dictionary:
                        queue.append(word_minus_c)
                        q_dictionary[word_minus_c] = None  # arbitrary value, just to identify we checked this

        # queue is now empty: convert suggestions in dictionary to
        # list for output
        if not silent and self.verbose != 0:
            print("number of possible corrections: %i" % len(suggest_dict))
            print(" edit distance for deletions: %i" % self.max_edit_distance)

        # output option 1
        # sort results by ascending order of edit distance and descending
        # order of frequency
        # and return list of suggested word corrections only:
        # return sorted(suggest_dict, key = lambda x:
        #               (suggest_dict[x][1], -suggest_dict[x][0]))

        # output option 2
        # return list of suggestions with (correction,
        # (frequency in corpus, edit distance)):
        as_list = suggest_dict.items()
        # outlist = sorted(as_list, key=lambda (term, (freq, dist)): (dist, -freq))
        outlist = sorted(as_list, key=lambda x: (x[1][1], -x[1][0]))

        if self.verbose == 0:
            return outlist[0]
        else:
            return outlist

        '''
        Option 1:
        ['file', 'five', 'fire', 'fine', ...]

        Option 2:
        [('file', (5, 0)),
         ('five', (67, 1)),
         ('fire', (54, 1)),
         ('fine', (17, 1))...]
        '''

    def best_word(self, s, silent=False):
        try:
            return self.get_suggestions(s, silent)[0]
        except:
            return None


class ItemSelector(BaseEstimator, TransformerMixin):
    def __init__(self, field, start_time=time()):
        self.field = field
        self.start_time = start_time

    def fit(self, x, y=None):
        return self

    def transform(self, dataframe):
        print(f'[{time()-self.start_time}] select {self.field}')

        dt = dataframe[self.field].dtype
        if is_categorical_dtype(dt):
            return dataframe[self.field].cat.codes[:, None]
        elif is_numeric_dtype(dt):
            return dataframe[self.field][:, None]
        else:
            return dataframe[self.field]


class DropColumnsByDf(BaseEstimator, TransformerMixin):
    def __init__(self, min_df=1, max_df=1.0):
        self.min_df = min_df
        self.max_df = max_df

    def fit(self, X, y=None):
        m = X.tocsc()
        self.nnz_cols = ((m != 0).sum(axis=0) >= self.min_df).A1
        if self.max_df < 1.0:
            max_df = m.shape[0] * self.max_df
            self.nnz_cols = self.nnz_cols & ((m != 0).sum(axis=0) <= max_df).A1
        return self

    def transform(self, X, y=None):
        m = X.tocsc()
        return m[:, self.nnz_cols]


def get_rmsle(y_true, y_pred):
    return np.sqrt(mean_squared_log_error(np.expm1(y_true), np.expm1(y_pred)))


def split_cat(text):
    try:
        cats = text.split("/")
        return cats[0], cats[1], cats[2], cats[0] + '/' + cats[1]
    except:
        print("no category")
        return 'other', 'other', 'other', 'other/other'


def brands_filling(dataset):
    vc = dataset['brand_name'].value_counts()
    brands = vc[vc > 0].index
    brand_word = r"[a-z0-9*/+\-'’?!.,|&%®™ôèéü]+"

    many_w_brands = brands[brands.str.contains(' ')]
    one_w_brands = brands[~brands.str.contains(' ')]

    ss2 = SymSpell(max_edit_distance=0)
    ss2.create_dictionary_from_arr(many_w_brands, token_pattern=r'.+')

    ss1 = SymSpell(max_edit_distance=0)
    ss1.create_dictionary_from_arr(one_w_brands, token_pattern=r'.+')

    two_words_re = re.compile(r"(?=(\s[a-z0-9*/+\-'’?!.,|&%®™ôèéü]+\s[a-z0-9*/+\-'’?!.,|&%®™ôèéü]+))")

    def find_in_str_ss2(row):
        for doc_word in two_words_re.finditer(row):
            print(doc_word)
            suggestion = ss2.best_word(doc_word.group(1), silent=True)
            if suggestion is not None:
                return doc_word.group(1)
        return ''

    def find_in_list_ss1(list):
        for doc_word in list:
            suggestion = ss1.best_word(doc_word, silent=True)
            if suggestion is not None:
                return doc_word
        return ''

    def find_in_list_ss2(list):
        for doc_word in list:
            suggestion = ss2.best_word(doc_word, silent=True)
            if suggestion is not None:
                return doc_word
        return ''

    print(f"Before empty brand_name: {len(dataset[dataset['brand_name'] == ''].index)}")

    n_name = dataset[dataset['brand_name'] == '']['name'].str.findall(
        pat=r"^[a-z0-9*/+\-'’?!.,|&%®™ôèéü]+\s[a-z0-9*/+\-'’?!.,|&%®™ôèéü]+")
    dataset.loc[dataset['brand_name'] == '', 'brand_name'] = [find_in_list_ss2(row) for row in n_name]

    n_desc = dataset[dataset['brand_name'] == '']['item_description'].str.findall(
        pat=r"^[a-z0-9*/+\-'’?!.,|&%®™ôèéü]+\s[a-z0-9*/+\-'’?!.,|&%®™ôèéü]+")
    dataset.loc[dataset['brand_name'] == '', 'brand_name'] = [find_in_list_ss2(row) for row in n_desc]

    n_name = dataset[dataset['brand_name'] == '']['name'].str.findall(pat=brand_word)
    dataset.loc[dataset['brand_name'] == '', 'brand_name'] = [find_in_list_ss1(row) for row in n_name]

    desc_lower = dataset[dataset['brand_name'] == '']['item_description'].str.findall(pat=brand_word)
    dataset.loc[dataset['brand_name'] == '', 'brand_name'] = [find_in_list_ss1(row) for row in desc_lower]

    print(f"After empty brand_name: {len(dataset[dataset['brand_name'] == ''].index)}")

    del ss1, ss2
    gc.collect()


def preprocess_regex(dataset, start_time=time()):
    karats_regex = r'(\d)([\s-]?)(karat|karats|carat|carats|kt)([^\w])'
    karats_repl = r'\1k\4'

    unit_regex = r'(\d+)[\s-]([a-z]{2})(\s)'
    unit_repl = r'\1\2\3'

    dataset['name'] = dataset['name'].str.replace(karats_regex, karats_repl)
    dataset['item_description'] = dataset['item_description'].str.replace(karats_regex, karats_repl)
    print(f'[{time() - start_time}] Karats normalized.')

    dataset['name'] = dataset['name'].str.replace(unit_regex, unit_repl)
    dataset['item_description'] = dataset['item_description'].str.replace(unit_regex, unit_repl)
    print(f'[{time() - start_time}] Units glued.')


def preprocess_pandas(train, test, start_time=time()):
    train = train[train.price > 0.0].reset_index(drop=True)
    print('Train shape without zero price: ', train.shape)

    nrow_train = train.shape[0]
    y_train = np.log1p(train["price"])
    merge: pd.DataFrame = pd.concat([train, test])

    del train
    del test
    gc.collect()

    merge['has_category'] = (merge['category_name'].notnull()).astype('category')
    print(f'[{time() - start_time}] Has_category filled.')

    # fill missing category values and convert everything to lower case
    merge['category_name'] = merge['category_name'] \
        .fillna('other/other/other') \
        .str.lower() \
        .astype(str)
    merge['general_cat'], merge['subcat_1'], merge['subcat_2'], merge['gen_subcat1'] = \
        zip(*merge['category_name'].apply(lambda x: split_cat(x)))
    print(f'[{time() - start_time}] Split categories completed.')

    merge['has_brand'] = (merge['brand_name'].notnull()).astype('category')
    print(f'[{time() - start_time}] Has_brand filled.')

    merge['gencat_cond'] = merge['general_cat'].map(str) + '_' + merge['item_condition_id'].astype(str)
    merge['subcat_1_cond'] = merge['subcat_1'].map(str) + '_' + merge['item_condition_id'].astype(str)
    merge['subcat_2_cond'] = merge['subcat_2'].map(str) + '_' + merge['item_condition_id'].astype(str)
    print(f'[{time() - start_time}] Categories and item_condition_id concatenated.')

    merge['name'] = merge['name'] \
        .fillna('') \
        .str.lower() \
        .astype(str)
    merge['brand_name'] = merge['brand_name'] \
        .fillna('') \
        .str.lower() \
        .astype(str)
    merge['item_description'] = merge['item_description'] \
        .fillna('') \
        .str.lower() \
        .replace(to_replace='No description yet', value='')
    print(f'[{time() - start_time}] Missing filled.')

    preprocess_regex(merge, start_time)

    brands_filling(merge)
    print(f'[{time() - start_time}] Brand name filled.')

    merge['name'] = merge['name'] + ' ' + merge['brand_name']
    print(f'[{time() - start_time}] Name concatenated.')

    merge['item_description'] = merge['item_description'] \
        + ' ' + merge['name'] \
        + ' ' + merge['subcat_1'] \
        + ' ' + merge['subcat_2'] \
        + ' ' + merge['general_cat'] \
        + ' ' + merge['brand_name']
    print(f'[{time() - start_time}] Item description concatenated.')

    merge.drop(['price', 'test_id', 'train_id'], axis=1, inplace=True)

    return merge, y_train, nrow_train


def intersect_drop_columns(train: csr_matrix, valid: csr_matrix, min_df=0):
    t = train.tocsc()
    v = valid.tocsc()
    nnz_train = ((t != 0).sum(axis=0) >= min_df).A1
    nnz_valid = ((v != 0).sum(axis=0) >= min_df).A1
    nnz_cols = nnz_train & nnz_valid
    res = t[:, nnz_cols], v[:, nnz_cols]
    return res


if __name__ == '__main__':
    mp.set_start_method('forkserver', True)

    start_time = time()

    train = pd.read_table(os.path.join(INPUT_PATH, 'train.tsv'),
                          engine='c',
                          dtype={'item_condition_id': 'category',
                                 'shipping': 'category'}
                          )
    test = pd.read_table(os.path.join(INPUT_PATH, 'test.tsv'),
                         engine='c',
                         dtype={'item_condition_id': 'category',
                                'shipping': 'category'}
                         )
    print(f'[{time() - start_time}] Finished to load data')
    print('Train shape: ', train.shape)
    print('Test shape: ', test.shape)
    submission: pd.DataFrame = test[['test_id']]

    merge, y_train, nrow_train = preprocess_pandas(train, test, start_time)

    meta_params = {'name_ngram': (1, 2),
                   'name_max_f': 75000,
                   'name_min_df': 10,
                   'category_ngram': (2, 3),
                   'category_token': '.+',
                   'category_min_df': 10,
                   'brand_min_df': 10,
                   'desc_ngram': (1, 3),
                   'desc_max_f': 150000,
                   'desc_max_df': 0.5,
                   'desc_min_df': 10}

    stopwords = frozenset(['the', 'a', 'an', 'is', 'it', 'this', ])  # 'i', 'so', 'its', 'am', 'are'])

    vectorizer = FeatureUnion([
        ('name', Pipeline([
            ('select', ItemSelector('name', start_time=start_time)),
            ('transform', HashingVectorizer(
                ngram_range=(1, 2),
                n_features=2 ** 28,
                norm='l2',
                lowercase=False,
                stop_words=stopwords
            )),
            ('drop_cols', DropColumnsByDf(min_df=2))
        ])),
        ('category_name', Pipeline([
            ('select', ItemSelector('category_name', start_time=start_time)),
            ('transform', HashingVectorizer(
                ngram_range=(1, 1),
                token_pattern='.+',
                tokenizer=split_cat,
                n_features=2 ** 28,
                norm='l2',
                lowercase=False
            )),
            ('drop_cols', DropColumnsByDf(min_df=2))
        ])),
        ('brand_name', Pipeline([
            ('select', ItemSelector('brand_name', start_time=start_time)),
            ('transform', CountVectorizer(
                token_pattern='.+',
                min_df=2,
                lowercase=False
            )),
        ])),
        ('gencat_cond', Pipeline([
            ('select', ItemSelector('gencat_cond', start_time=start_time)),
            ('transform', CountVectorizer(
                token_pattern='.+',
                min_df=2,
                lowercase=False
            )),
        ])),
        ('subcat_1_cond', Pipeline([
            ('select', ItemSelector('subcat_1_cond', start_time=start_time)),
            ('transform', CountVectorizer(
                token_pattern='.+',
                min_df=2,
                lowercase=False
            )),
        ])),
        ('subcat_2_cond', Pipeline([
            ('select', ItemSelector('subcat_2_cond', start_time=start_time)),
            ('transform', CountVectorizer(
                token_pattern='.+',
                min_df=2,
                lowercase=False
            )),
        ])),
        ('has_brand', Pipeline([
            ('select', ItemSelector('has_brand', start_time=start_time)),
            ('ohe', OneHotEncoder())
        ])),
        ('shipping', Pipeline([
            ('select', ItemSelector('shipping', start_time=start_time)),
            ('ohe', OneHotEncoder())
        ])),
        ('item_condition_id', Pipeline([
            ('select', ItemSelector('item_condition_id', start_time=start_time)),
            ('ohe', OneHotEncoder())
        ])),
        ('item_description', Pipeline([
            ('select', ItemSelector('item_description', start_time=start_time)),
            ('hash', HashingVectorizer(
                ngram_range=(1, 3),
                n_features=2 ** 27,
                dtype=np.float32,
                norm='l2',
                lowercase=False,
                stop_words=stopwords
            )),
            ('drop_cols', DropColumnsByDf(min_df=2)),
        ]))
    ], n_jobs=1)

    sparse_merge = vectorizer.fit_transform(merge)
    print(f'[{time() - start_time}] Merge vectorized')
    print(sparse_merge.shape)

    tfidf_transformer = TfidfTransformer()

    X = tfidf_transformer.fit_transform(sparse_merge)
    print(f'[{time() - start_time}] TF/IDF completed')

    X_train = X[:nrow_train]
    print(X_train.shape)

    X_test = X[nrow_train:]
    del merge
    del sparse_merge
    del vectorizer
    del tfidf_transformer
    gc.collect()

    X_train, X_test = intersect_drop_columns(X_train, X_test, min_df=1)
    print(f'[{time() - start_time}] Drop only in train or test cols: {X_train.shape[1]}')
    gc.collect()

    ridge = Ridge(solver='auto', fit_intercept=True, alpha=0.4, max_iter=250, normalize=False, tol=0.01)
    ridge.fit(X_train, y_train)
    print(f'[{time() - start_time}] Train Ridge completed. Iterations: {ridge.n_iter_}')

    predsR = ridge.predict(X_test)
    print(f'[{time() - start_time}] Predict Ridge completed.')

    submission.loc[:, 'price'] = np.expm1(predsR)
    submission.loc[submission['price'] < 0.0, 'price'] = 0.0
    submission.to_csv("submission_ridge.csv", index=False)

To follow along below, I pulled the first 5 rows of the training set out into a separate file named train.
Record the start time: start_time = time()
If you instead read the file directly like this: train = pd.read_csv('train.csv', sep='\t') and then look at the type of each column, you can see:
train.dtypes
Out[2]:
train_id               int64
name                  object
item_condition_id      int64
category_name         object
brand_name            object
price                float64
shipping               int64
item_description      object

The author's read, by contrast, uses engine='c', which in theory speeds up parsing, and passes dtype to control the type each column gets in the DataFrame. To see the results faster I only keep the first 5 rows (this doesn't affect the conclusions):
INPUT_PATH = r'../input'
train = pd.read_table(os.path.join(INPUT_PATH, 'train.tsv'),
                      engine='c',
                      dtype={'item_condition_id': 'category',
                             'shipping': 'category'}
                      )
train5 = train.head(5)

train5.dtypes
Out[2]:
train_id               int64
name                  object
item_condition_id   category
category_name         object
brand_name            object
price                float64
shipping            category
item_description      object
dtype: object

Note also that if the .astype('category') call is left out, train5['has_category'] ends up with dtype 'bool'.
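A minimal sketch of that point, assuming train5 was built as above (it mirrors the has_category flag the kernel creates in preprocess_pandas):

# without the cast, the boolean mask keeps dtype 'bool'
print(train5['category_name'].notnull().dtype)                      # bool

# with the cast it becomes a pandas Categorical, which ItemSelector later
# converts to integer codes via .cat.codes
print(train5['category_name'].notnull().astype('category').dtype)   # category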
A detailed explanation of frozenset: https://www.programiz.com/python-programming/methods/built-in/frozenset
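In short, frozenset builds an immutable, hashable set; the kernel uses one for the stop-word list it passes to HashingVectorizer. A quick sketch:

stopwords = frozenset(['the', 'a', 'an', 'is', 'it', 'this'])
print('the' in stopwords)    # True, O(1) membership test just like a normal set
# stopwords.add('so')        # AttributeError: a frozenset cannot be modified after creation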
A tutorial on doing data mining elegantly with sklearn: http://www.cnblogs.com/jasonfreak/p/5448462.html (The tutorial is very good, but in my opinion it has one mistake, and I'm not sure I'm right about this: on the question of partial parallelism, the author probably didn't know about the ItemSelector pattern, so he implemented partial parallelism himself; see the sketch after the FeatureUnion description below.) Official documentation: http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html: "Concatenates results of multiple transformer objects."
This estimator applies a list of transformer objects in parallel to the input data, then concatenates the results. This is useful to combine several feature extraction mechanisms into a single transformer.
Parameters of the transformers may be set using its name and the parameter name separated by a '__'. A transformer may be replaced entirely by setting the parameter with its name to another transformer, or removed by setting to None. A Chinese translation of the docs (with very useful examples!): http://sklearn.apachecn.org/cn/0.19.0/modules/pipeline.html
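To make the partial-parallelism point concrete, here is a minimal toy sketch of the same pattern the kernel uses (my own example, not the kernel's code; ColumnSelector is a stripped-down stand-in for the kernel's ItemSelector): each FeatureUnion branch first selects one column, then vectorizes it, and the branch outputs are concatenated column-wise.

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.feature_extraction.text import CountVectorizer

class ColumnSelector(BaseEstimator, TransformerMixin):
    """Pick a single column so each FeatureUnion branch only sees its own input."""
    def __init__(self, field):
        self.field = field
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.field]

toy = pd.DataFrame({'name': ['red shirt', 'blue jeans'],
                    'brand_name': ['gap', 'levis']})

union = FeatureUnion([
    ('name', Pipeline([('select', ColumnSelector('name')),
                       ('vec', CountVectorizer())])),
    ('brand', Pipeline([('select', ColumnSelector('brand_name')),
                        ('vec', CountVectorizer())])),
])
features = union.fit_transform(toy)  # sparse matrix: name tokens and brand tokens side by side
print(features.shape)

The kernel just scales this idea up: ten branches, a HashingVectorizer/CountVectorizer/OneHotEncoder per column, plus the custom DropColumnsByDf step.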
For a detailed discussion of fit_transform vs transform, see: http://blog.csdn.net/anecdotegyb/article/details/74857055. The gist:
Q: In scikit-learn, what exactly is the difference between fit_transform() and transform(), and can they be used interchangeably?
A: Both apply some uniform processing to the data (standardization to N(0,1), scaling/mapping to a fixed interval, normalization, regularization, and so on).
fit_transform(partData) first fits on part of the data, computing that part's overall statistics (mean, variance, max/min, etc., depending on the transformation), and then transforms partData with them, producing the standardized/normalized result.
transform(restData) then applies those same statistics learned from the earlier fit (the same mean, variance, max/min, etc.) to the remaining data, which guarantees that part and rest are processed in exactly the same way.
You must call fit_transform(partData) first and only then transform(restData); calling transform(partData) directly, before any fit, raises an error.
If, after fit_transform(partData), you call fit_transform(restData) instead of transform(restData), the result is still normalized, but the two splits are no longer on the same "standard" and will differ noticeably.
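A small sketch of that rule with a scaler (my own example, not from the kernel): fit_transform learns the statistics from the first chunk of data and transforms it, and transform then reuses those same statistics on the rest.

import numpy as np
from sklearn.preprocessing import StandardScaler

part = np.array([[1.0], [2.0], [3.0]])
rest = np.array([[4.0], [5.0]])

scaler = StandardScaler()
part_scaled = scaler.fit_transform(part)  # learns mean=2, std~0.816 from part, then scales part
rest_scaled = scaler.transform(rest)      # reuses the SAME mean/std, so both splits share one "standard"

# calling scaler.fit_transform(rest) here instead would re-estimate mean/std on rest,
# putting the two splits on different scales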
https://stats.stackexchange.com/questions/177082/sklearn-combine-multiple-feature-sets-in-pipeline
http://ju.outofmemory.cn/entry/74958