1. Problem description

While training a model with YOLOv4, right after the pretrained data had finished loading,
training failed with BrokenPipeError: [Errno 32] Broken pipe

2. Analysis and solution

The official description, quoted from the related pull request:

The memory leak is caused by the difference in using FileMapping(mmap) on Windows. On Windows, FileMapping objects should be closed by all related processes and then it can be released. And there's no other way to explicitly delete it.(Like shm_unlink)
When multiprocessing is on, the child process will create a FileMapping and then the main process will open it. After that, at some time, the child process will try to release it but its reference count is non-zero so it cannot be released at that time. But the current code does not provide a chance to let it close again when possible.
This PR targets #5590.
Current Progress:
The memory leak when num_worker=1 should be solved. However, further work has to be done for more workers.
Error type 1(unrelated filemapping handle get killed):

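In short: on Windows, a file-mapping object is released only after every process that holds it has closed it, and there is no explicit delete call such as shm_unlink. With multiprocessing enabled, a child process creates the mapping and the main process then opens it. When the child later tries to release it, the reference count is still non-zero, so the release fails, and the current code never retries.

Status according to the PR: the memory leak with a single worker (num_worker=1) should be fixed, but more work is needed for multiple workers.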
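
The difference between closing a handle and explicitly deleting a shared block can be sketched with Python's standard multiprocessing.shared_memory module. This is only an analogy under stated assumptions: it is not the code path PyTorch uses, both handles live in one process for brevity, and the names creator, reader and received are made up for the illustration.

```python
from multiprocessing import shared_memory

# "creator" plays the role of the worker process that creates the block.
creator = shared_memory.SharedMemory(create=True, size=16)
creator.buf[:5] = b"hello"

# "reader" plays the role of the main process that opens the same block by name.
reader = shared_memory.SharedMemory(name=creator.name)
received = bytes(reader.buf[:5])  # copy the data out before closing

# Every holder has to close its own handle.
reader.close()
creator.close()

# POSIX: this explicitly deletes the block (shm_unlink under the hood).
# Windows: this call has no effect; the block goes away only once
# every handle to it has been closed.
creator.unlink()
```

On Windows the last call does nothing, which mirrors the quoted description: if one side still holds a handle, the other side has no way to force the release.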

So this is a problem with multi-process data loading on Windows, and it is related to the DataLoader class.
The code involved:

DataLoader(train_dataset, shuffle=True, batch_size=batch_size, num_workers=num_workers,
           pin_memory=True, drop_last=True, collate_fn=yolo_dataset_collate)

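Solution:
Set num_workers to 0
(the num_workers argument of torch.utils.data.DataLoader)
(In my own tests a small value such as 2 also works. Reading files with only the main process is too slow; right now I rather miss goroutines :worried:)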
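
A minimal sketch of applying this workaround only on Windows. The helper safe_num_workers is hypothetical (it is not part of PyTorch), and the DataLoader call is left as a comment so the snippet runs without the training code. The sketch also places the entry point under an `if __name__ == "__main__":` guard, because Windows starts worker processes with "spawn" and re-imports the script; a missing guard is a commonly reported cause of this same BrokenPipeError when num_workers is greater than 0.

```python
import sys


def safe_num_workers(requested: int, platform: str = sys.platform) -> int:
    """Return the worker count to pass to DataLoader (hypothetical helper).

    On Windows (sys.platform == "win32") fall back to 0, so that all data
    loading happens in the main process; elsewhere keep the requested value.
    """
    if requested < 0:
        raise ValueError("num_workers must be >= 0")
    return 0 if platform.startswith("win") else requested


if __name__ == "__main__":
    # Worker processes re-import this module on Windows, so anything that
    # builds a DataLoader with num_workers > 0 belongs under this guard.
    num_workers = safe_num_workers(4)
    # train_loader = DataLoader(train_dataset, shuffle=True, batch_size=batch_size,
    #                           num_workers=num_workers, pin_memory=True,
    #                           drop_last=True, collate_fn=yolo_dataset_collate)
```

If a small non-zero value such as 2 works on your machine, the fallback `0` in the helper can be replaced by `min(requested, 2)`.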

3. Another error

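I did not manage to take a screenshot of this one.
It was CUDNN_STATUS_INTERNAL_ERROR together with an insufficient page file error.
The fixes commonly suggested online for the cuDNN internal error: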

  • import torch
    torch.cuda.set_device(0)
  • import os
    os.environ['CUDA_VISIBLE_DEVICES'] = '0'  # must be set before CUDA is initialized
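
The two suggestions above can be combined, keeping in mind that the environment variable only takes effect if it is set before CUDA is initialized in the process. The sketch below assumes a single-GPU setup; restrict_visible_gpus is a hypothetical helper, not a PyTorch function, and the PyTorch part is skipped if PyTorch is not installed.

```python
import os


def restrict_visible_gpus(indices):
    """Expose only the listed GPU indices to CUDA (hypothetical helper)."""
    value = ",".join(str(i) for i in indices)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Call this before the first CUDA call in the process; doing it before
# `import torch` is the simplest way to be sure.
restrict_visible_gpus([0])

try:
    import torch

    if torch.cuda.is_available():
        # Index 0 here means the first of the *visible* devices.
        torch.cuda.set_device(0)
except ImportError:
    pass  # PyTorch is not installed; the environment variable is still set
```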

Fix for the insufficient page file: increase the paging file size (virtual memory; it can be placed on the C: drive).
