Abstract: Open-source datasets accelerate the development of deep learning, but unauthorized data usage occurs frequently. To protect dataset copyright, this study proposes a dataset watermarking algorithm. The watermark is embedded into the dataset before release; when a model is trained on the dataset, the watermark transfers to the model, so illegal dataset usage can be traced by verifying whether the watermark exists in a suspect model. However, existing dataset watermarking algorithms cannot provide effective and covert black-box verification under small perturbations. To address this problem, this study proposes, for the first time, embedding the watermark through a style attribute that is independent of the image content and label, and constraining the perturbation of the original dataset so that labels need not be modified. The covertness and validity of the watermark are thus ensured without introducing inconsistency between image content and label or requiring an extra surrogate model. In the verification stage, only the prediction results of the suspect model are used, and the judgment is made via a hypothesis test. The proposed method is compared with five existing methods on the CIFAR-10 dataset, and the experimental results validate its effectiveness and fidelity. In addition, ablation experiments verify the necessity of the proposed style refinement module and the effectiveness of the algorithm under various hyper-parameter settings and datasets.
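The black-box verification step described above can be sketched as a one-sided binomial hypothesis test: query the suspect model on watermarked samples and test whether its hit rate on the watermark-induced behavior significantly exceeds chance. The function names, the target-label formulation, and the parameter defaults below are illustrative assumptions, not the paper's exact procedure.

```python
import math

def binom_sf(k, n, p):
    """P[X >= k] for X ~ Binomial(n, p), computed exactly from the pmf."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def verify_watermark(predictions, target_label, p0=0.1, alpha=0.01):
    """Decide whether a suspect model carries the watermark (hypothetical sketch).

    predictions  : labels the suspect model assigns to watermarked queries
    target_label : label the watermark is designed to elicit
    p0           : hit rate expected from a clean model, e.g. 1/num_classes
                   = 0.1 for a 10-class dataset such as CIFAR-10
    alpha        : significance level of the one-sided test

    H0: the model is clean (hit rate <= p0). Rejecting H0 at level alpha
    is taken as evidence that the model was trained on the watermarked data.
    """
    n = len(predictions)
    k = sum(1 for y in predictions if y == target_label)
    p_value = binom_sf(k, n, p0)
    return p_value < alpha, p_value

# 40 hits out of 100 queries is far above the 10% chance rate: reject H0.
decision, p = verify_watermark([0] * 40 + [1] * 60, target_label=0)
```

Because the test uses only the model's predicted labels, it requires no access to model weights or gradients, which is what makes the verification black-box.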