Abstract | Data resampling is an engineering technique that changes the underlying data distribution and the proportions of instances and classes. Applying any resampling technique may therefore influence phenomena such as concept drift, class imbalance, and anomalies: it may produce, exaggerate, or eliminate them, whether they are viewed as a problem or as a characteristic of the data. Resampling, such as over- or under-sampling, resolves some challenges while introducing others. One such challenge is its impact on concept drift in a data stream. This paper examines concept drift produced by data resampling, its nature, and how its complexity can be used as an indicator of performance. It also examines the different concept drifts produced by applying over- and under-sampling techniques. Although concept drift, class imbalance, and resampling techniques have been studied extensively, sampling-induced concept drift as a separate phenomenon remains under-researched. This phenomenon has its own complexity and can affect the model; that impact can be measured using concept drift complexity, especially when the complexity value serves as a baseline for the overall complexity of drift in a dataset. |
---|