From: Andrea Arcangeli <aarcange@redhat.com>
Date: Thu, 7 May 2009 23:44:41 +0200
Subject: [mm] fork vs fast gup race fix
Message-id: 20090507214441.GA11745@random.random
O-Subject: Re: [patch 3/3] fork-o_direct-race
Bugzilla: 471613

Think of a thread writing constantly to the last 512 bytes of a page while
another thread reads and writes to/from the first 512 bytes of the same page.
We can lose O_DIRECT reads (or any other get_user_pages write=1 I/O, not just
bio/O_DIRECT) the moment we mark any pte wrprotected because a third,
unrelated thread forks off a child.

This fixes it by copying the anon page (instead of sharing it) within fork,
if there can be any direct I/O in flight to the page. That takes care of
O_DIRECT reads (writes to memory, reads from disk). Checking the page_count
under the PT lock guarantees no get_user_pages can be running under us:
whoever wants to write to the page has to break any cow first, and that
requires taking the PT lock in follow_page before increasing the page count.
We are also guaranteed that mapcount is 1 if fork is write-protecting the
pte, so the PT lock is enough to serialize against
get_user_pages->get_page.

Another problem is O_DIRECT writes to disk: if the parent touches a shared
anon page before the child does, the child's do_wp_page will take over the
anon page and map it read-write even though it is still under direct I/O from
the parent thread pool. This requires de-cowing the pages in gup more
aggressively (i.e. setting FOLL_WRITE temporarily on anon pages to de-cow
them, and always assuming write=1 in the hugetlb follow_page variant).

gup-fast is taken care of without flushing the smp-tlb for every parent pte
that is wrprotected, by wrprotecting the pte before checking the page count
vs mapcount. gup-fast then re-checks that the pte is still available in write
mode after having increased the page count, solving the race without a flood
of IPIs in fork. The COW triggered inside fork will run while the parent pte
is readonly, to provide the usual per-page atomic copy from parent to child
during fork. Timings will be altered, however, by having to copy the pages
that might be under O_DIRECT.

Once this race is fixed, the testcase, instead of showing corruption, is
capable of triggering a glibc NPTL race condition where fork_pre_cow copies
the internal NPTL stack list in anonymous memory while some parent thread may
be modifying it, which results in a userland deadlock when the fork child
tries to free the stacks before returning from fork. We are flushing the tlb
after wrprotecting the pte that maps the anon page if we take the
fork_pre_cow path, so we should be providing a per-page atomic copy from
parent to child. Indeed, the race can also trigger without this patch and
without fork_pre_cow; to trigger it, the wrprotect event must happen exactly
in the middle of a list_add/list_del run by some NPTL thread that is mangling
the stack list while fork runs. A preliminary NPTL fix for this race, exposed
by this patch, is happening in the glibc repository, but I think glibc would
be better off using a smart lock capable of jumping in and out of a signal
handler, rather than going out of order rcu-style, which sounds too complex.

The pagevec code calls get_page while the page is sitting in the pagevec
(before it becomes PageLRU), and in doing so it can generate false positives.
To avoid slowing down fork even for pages that could never possibly be under
O_DIRECT write=1, the PG_gup bitflag is added; this eliminates most of the
overhead of the fix in fork.
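The wrprotect-then-check ordering in fork and the pin-then-recheck ordering
in gup-fast form a classic store-buffering pair: each side must execute a
full memory barrier between its store and its load, or both can miss the
other's update and the lost-I/O window reopens. Below is a minimal userspace
sketch of just that invariant, using C11 atomics; the two flags are stand-ins
for "fork wrprotected the pte" and "gup-fast pinned the page", not kernel
interfaces.

/* sb_litmus.c - store-buffering litmus test mirroring the fork/gup-fast
 * barrier pairing; compile with 'gcc -O2 -o sb_litmus sb_litmus.c -lpthread'.
 * Userspace sketch only: the atomic flags stand in for the real pte and
 * page-pin operations.
 */
#define _GNU_SOURCE 1
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int pte_wrprotected;	/* fork's store */
static atomic_int page_pinned;		/* gup-fast's store */
static int r_fork, r_gup;		/* what each side observed */
static pthread_barrier_t start;

static void *fork_side(void *arg)
{
	pthread_barrier_wait(&start);
	atomic_store_explicit(&pte_wrprotected, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* smp_mb() */
	r_fork = atomic_load_explicit(&page_pinned, memory_order_relaxed);
	return NULL;
}

static void *gup_side(void *arg)
{
	pthread_barrier_wait(&start);
	atomic_store_explicit(&page_pinned, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* smp_mb() */
	r_gup = atomic_load_explicit(&pte_wrprotected, memory_order_relaxed);
	return NULL;
}

int main(void)
{
	int i, both_missed = 0;

	for (i = 0; i < 100000; i++) {
		pthread_t a, b;

		pte_wrprotected = page_pinned = 0;
		pthread_barrier_init(&start, NULL, 2);
		pthread_create(&a, NULL, fork_side, NULL);
		pthread_create(&b, NULL, gup_side, NULL);
		pthread_join(a, NULL);
		pthread_join(b, NULL);
		pthread_barrier_destroy(&start);
		/*
		 * With both fences, at least one side must see the other's
		 * store: either fork sees the pin (and copies the page) or
		 * gup-fast sees the wrprotect (and bails to the slow path).
		 * Both going blind would be the lost-I/O window.
		 */
		if (r_fork == 0 && r_gup == 0)
			both_missed++;
	}
	printf("both sides blind: %d (must stay 0 with the fences)\n",
	       both_missed);
	return 0;
}

Drop either fence and the 0/0 outcome typically becomes observable on
x86-64 within a few thousand iterations (store buffering); that is exactly
why both the fork side and gup_pte_range() issue smp_mb() in the patch.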
I had to add src_vma/dst_vma to use the proper ->mm pointers, and in the case
of the track_pfn_vma_copy PAT code this fixes a bug: previously vma was the
dst_vma, while track_pfn_vma_copy has to run on the src_vma (the dst_vma at
that point is guaranteed to have zero ptes instantiated/allocated).

There are two testcases that reproduce the bug, and they reproduce it both
for regular anon pages and, with libhugetlbfs, for hugepages too. The patch
works for both. The glibc race is also eventually reproducible, both with
anon pages and with hugepages, using the dma_thread testcase (the forkscrew
testcase isn't capable of reproducing the NPTL race condition in fork).

========== dma_thread.c =======
/* compile with 'gcc -g -o dma_thread dma_thread.c -lpthread' */

#define _GNU_SOURCE 1

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <memory.h>
#include <pthread.h>
#include <getopt.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>

#define FILESIZE (12*1024*1024)
#define READSIZE  (1024*1024)

#define FILENAME  "test_%.04d.tmp"
#define FILECOUNT 100
#define MIN_WORKERS 2
#define MAX_WORKERS 256
#define PAGE_SIZE 4096

#define true 1
#define false 0

typedef int bool;

bool done = false;
int workers = 2;

#define PATTERN (0xfa)

static void usage (void)
{
    fprintf(stderr, "\nUsage: dma_thread [-h | -a <alignment> [ -w <workers>]\n"
            "\nWith no arguments, generate test files and exit.\n"
            "-h Display this help and exit.\n"
            "-a align read buffer to offset <alignment>.\n"
            "-w number of worker threads, 2 (default) to 256,\n"
            "   defaults to number of cores.\n\n"
            "Run first with no arguments to generate files.\n"
            "Then run with -a <alignment> = 512 or 0. \n");
}

typedef struct {
    pthread_t tid;
    int worker_number;
    int fd;
    int offset;
    int length;
    int pattern;
    unsigned char *buffer;
} worker_t;

void *worker_thread(void * arg)
{
    int bytes_read;
    int i,k;
    worker_t *worker = (worker_t *) arg;
    int offset = worker->offset;
    int fd = worker->fd;
    unsigned char *buffer = worker->buffer;
    int pattern = worker->pattern;
    int length = worker->length;

    if (lseek(fd, offset, SEEK_SET) < 0) {
        fprintf(stderr, "Failed to lseek to %d on fd %d: %s.\n",
                offset, fd, strerror(errno));
        exit(1);
    }

    bytes_read = read(fd, buffer, length);
    if (bytes_read != length) {
        fprintf(stderr, "read failed on fd %d: bytes_read %d, %s\n",
                fd, bytes_read, strerror(errno));
        exit(1);
    }

    /* Corruption check */
    for (i = 0; i < length; i++) {
        if (buffer[i] != pattern) {
            printf("Bad data at 0x%.06x: %p, \n", i, buffer + i);
            printf("Data dump starting at 0x%.06x:\n", i - 8);
            printf("Expect 0x%x followed by 0x%x:\n", pattern, PATTERN);

            for (k = 0; k < 16; k++) {
                printf("%02x ", buffer[i - 8 + k]);
                if (k == 7) {
                    printf("\n");
                }
            }

            printf("\n");
            abort();
        }
    }

    return 0;
}

void *fork_thread (void *arg)
{
    pid_t pid;

    while (!done) {
        pid = fork();
        if (pid == 0) {
            exit(0);
        } else if (pid < 0) {
            fprintf(stderr, "Failed to fork child.\n");
            exit(1);
        }
        waitpid(pid, NULL, 0 );
        usleep(100);
    }

    return NULL;
}

int main(int argc, char *argv[])
{
    unsigned char *buffer = NULL;
    char filename[1024];
    int fd;
    bool dowrite = true;
    pthread_t fork_tid;
    int c, n, j;
    worker_t *worker;
    int align = 0;
    int offset, rc;

    workers = sysconf(_SC_NPROCESSORS_ONLN);

    while ((c = getopt(argc, argv, "a:hw:")) != -1) {
        switch (c) {
        case 'a':
            align = atoi(optarg);
            if (align < 0 || align > PAGE_SIZE) {
                printf("Bad alignment %d.\n", align);
                exit(1);
            }
            dowrite = false;
            break;

        case 'h':
            usage();
            exit(0);
            break;

        case 'w':
            workers = atoi(optarg);
            if (workers < MIN_WORKERS || workers > MAX_WORKERS) {
                fprintf(stderr, "Worker count %d not between "
                        "%d and %d, inclusive.\n",
                        workers, MIN_WORKERS, MAX_WORKERS);
                usage();
                exit(1);
            }
            dowrite = false;
            break;

        default:
            usage();
            exit(1);
        }
    }

    if (argc > 1 && (optind < argc)) {
        fprintf(stderr, "Bad command line.\n");
        usage();
        exit(1);
    }

    if (dowrite) {
        buffer = malloc(FILESIZE);
        if (buffer == NULL) {
            fprintf(stderr, "Failed to malloc write buffer.\n");
            exit(1);
        }

        for (n = 1; n <= FILECOUNT; n++) {
            sprintf(filename, FILENAME, n);
            fd = open(filename, O_RDWR|O_CREAT|O_TRUNC, 0666);
            if (fd < 0) {
                printf("create failed(%s): %s.\n", filename, strerror(errno));
                exit(1);
            }
            memset(buffer, n, FILESIZE);
            printf("Writing file %s.\n", filename);
            if (write(fd, buffer, FILESIZE) != FILESIZE) {
                printf("write failed (%s)\n", filename);
            }

            close(fd);
            fd = -1;
        }

        free(buffer);
        buffer = NULL;

        printf("done\n");
        exit(0);
    }

    printf("Using %d workers.\n", workers);

    worker = malloc(workers * sizeof(worker_t));
    if (worker == NULL) {
        fprintf(stderr, "Failed to malloc worker array.\n");
        exit(1);
    }

    for (j = 0; j < workers; j++) {
        worker[j].worker_number = j;
    }

    printf("Using alignment %d.\n", align);

    posix_memalign((void **)&buffer, PAGE_SIZE, READSIZE + align);
    printf("Read buffer: %p.\n", buffer);

    for (n = 1; n <= FILECOUNT; n++) {

        sprintf(filename, FILENAME, n);
        for (j = 0; j < workers; j++) {
            if ((worker[j].fd = open(filename, O_RDONLY|O_DIRECT)) < 0) {
                fprintf(stderr, "Failed to open %s: %s.\n",
                        filename, strerror(errno));
                exit(1);
            }

            worker[j].pattern = n;
        }

        printf("Reading file %d.\n", n);

        for (offset = 0; offset < FILESIZE; offset += READSIZE) {
            memset(buffer, PATTERN, READSIZE + align);
            for (j = 0; j < workers; j++) {
                worker[j].offset = offset + j * PAGE_SIZE;
                worker[j].buffer = buffer + align + j * PAGE_SIZE;
                worker[j].length = PAGE_SIZE;
            }
            /* The final worker reads whatever is left over. */
            worker[workers - 1].length = READSIZE - PAGE_SIZE * (workers - 1);

            done = 0;
            rc = pthread_create(&fork_tid, NULL, fork_thread, NULL);
            if (rc != 0) {
                fprintf(stderr, "Can't create fork thread: %s.\n",
                        strerror(rc));
                exit(1);
            }

            for (j = 0; j < workers; j++) {
                rc = pthread_create(&worker[j].tid, NULL, worker_thread,
                                    worker + j);
                if (rc != 0) {
                    fprintf(stderr, "Can't create worker thread %d: %s.\n",
                            j, strerror(rc));
                    exit(1);
                }
            }

            for (j = 0; j < workers; j++) {
                rc = pthread_join(worker[j].tid, NULL);
                if (rc != 0) {
                    fprintf(stderr, "Failed to join worker thread %d: %s.\n",
                            j, strerror(rc));
                    exit(1);
                }
            }

            /* Let the fork thread know it's ok to exit */
            done = 1;
            rc = pthread_join(fork_tid, NULL);
            if (rc != 0) {
                fprintf(stderr, "Failed to join fork thread: %s.\n",
                        strerror(rc));
                exit(1);
            }
        }

        /* Close the fd's for the next file. */
        for (j = 0; j < workers; j++) {
            close(worker[j].fd);
        }
    }

    return 0;
}
========== dma_thread.c =======

========== forkscrew.c ========
/*
 * Copyright 2009, Red Hat, Inc.
 *
 * Author: Jeff Moyer <jmoyer@redhat.com>
 *
 * This program attempts to expose a race between O_DIRECT I/O and the fork()
 * path in a multi-threaded program. In order to reliably reproduce the
 * problem, it is best to perform a dd from the device under test to /dev/null
 * as this makes the read I/O slow enough to orchestrate the problem.
 *
 * Running: ./forkscrew
 *
 * It is expected that a file named "data" exists in the current working
 * directory, and that its contents are something other than 0x2a. A simple
 * dd if=/dev/zero of=data bs=1M count=1 should be sufficient.
 */
#define _GNU_SOURCE 1
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <pthread.h>
#include <libaio.h>

pthread_cond_t worker_cond = PTHREAD_COND_INITIALIZER;
pthread_mutex_t worker_mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t fork_cond = PTHREAD_COND_INITIALIZER;
pthread_mutex_t fork_mutex = PTHREAD_MUTEX_INITIALIZER;
char *buffer;
int fd;

/* pattern filled into the in-memory buffer */
#define PATTERN 0x2a // '*'

void usage(void)
{
    fprintf(stderr, "\nUsage: forkscrew\n"
            "it is expected that a file named \"data\" is in the current\n"
            "working directory. It should be at least 3*pagesize in size\n");
}

void dump_buffer(char *buf, int len)
{
    int i;
    int last_off, last_val;

    last_off = -1;
    last_val = -1;
    for (i = 0; i < len; i++) {
        if (last_off < 0) {
            last_off = i;
            last_val = buf[i];
            continue;
        }
        if (buf[i] != last_val) {
            printf("%d - %d: %d\n", last_off, i - 1, last_val);
            last_off = i;
            last_val = buf[i];
        }
    }
    if (last_off != len - 1)
        printf("%d - %d: %d\n", last_off, i-1, last_val);
}

int check_buffer(char *bufp, int len, int pattern)
{
    int i;

    for (i = 0; i < len; i++) {
        if (bufp[i] == pattern)
            return 1;
    }
    return 0;
}

void *forker_thread(void *arg)
{
    pthread_mutex_lock(&fork_mutex);
    pthread_cond_signal(&fork_cond);
    pthread_cond_wait(&fork_cond, &fork_mutex);

    switch (fork()) {
    case 0:
        sleep(1);
        printf("child dumping buffer:\n");
        dump_buffer(buffer + 512, 2*getpagesize());
        exit(0);
    case -1:
        perror("fork");
        exit(1);
    default:
        break;
    }

    pthread_cond_signal(&fork_cond);
    pthread_mutex_unlock(&fork_mutex);
    wait(NULL);
    return (void *)0;
}

void *worker(void *arg)
{
    int first = (int)arg;
    char *bufp;
    int pagesize = getpagesize();
    int ret;
    int corrupted = 0;

    if (first) {
        io_context_t aioctx;
        struct io_event event;
        struct iocb *iocb = malloc(sizeof *iocb);

        if (!iocb) {
            perror("malloc");
            exit(1);
        }
        memset(&aioctx, 0, sizeof(aioctx));
        ret = io_setup(1, &aioctx);
        if (ret != 0) {
            errno = -ret;
            perror("io_setup");
            exit(1);
        }

        bufp = buffer + 512;
        io_prep_pread(iocb, fd, bufp, pagesize, 0);

        /* submit the I/O */
        io_submit(aioctx, 1, &iocb);

        /* tell the fork thread to run */
        pthread_mutex_lock(&fork_mutex);
        pthread_cond_signal(&fork_cond);
        /* wait for the fork to happen */
        pthread_cond_wait(&fork_cond, &fork_mutex);
        pthread_mutex_unlock(&fork_mutex);

        /* release the other worker to issue I/O */
        pthread_mutex_lock(&worker_mutex);
        pthread_cond_signal(&worker_cond);
        pthread_mutex_unlock(&worker_mutex);

        ret = io_getevents(aioctx, 1, 1, &event, NULL);
        if (ret != 1) {
            errno = -ret;
            perror("io_getevents");
            exit(1);
        }
        if (event.res != pagesize) {
            errno = -event.res;
            perror("read error");
            exit(1);
        }
        io_destroy(aioctx);

        /* check buffer, should be corrupt */
        if (check_buffer(bufp, pagesize, PATTERN)) {
            printf("worker 0 failed check\n");
            dump_buffer(bufp, pagesize);
            corrupted = 1;
        }
    } else {
        bufp = buffer + 512 + pagesize;

        pthread_mutex_lock(&worker_mutex);
        pthread_cond_signal(&worker_cond); /* tell main we're ready */
        /* wait for the first I/O and the fork */
        pthread_cond_wait(&worker_cond, &worker_mutex);
        pthread_mutex_unlock(&worker_mutex);

        /* submit overlapping I/O */
        ret = read(fd, bufp, pagesize);
        if (ret != pagesize) {
            perror("read");
            exit(1);
        }

        /* check buffer, should be fine */
        if (check_buffer(bufp, pagesize, PATTERN)) {
            printf("worker 1 failed check -- abnormal\n");
            dump_buffer(bufp, pagesize);
            corrupted = 1;
        }
    }

    return (void *)corrupted;
}

int main(int argc, char **argv)
{
    pthread_t workers[2];
    pthread_t forker;
    int ret, rc = 0;
    void *thread_ret;
    int pagesize = getpagesize();

    fd = open("data", O_DIRECT|O_RDONLY);
    if (fd < 0) {
        perror("open");
        exit(1);
    }

    ret = posix_memalign((void **)&buffer, pagesize, 3 * pagesize);
    if (ret != 0) {
        errno = ret;
        perror("posix_memalign");
        exit(1);
    }
    memset(buffer, PATTERN, 3*pagesize);

    pthread_mutex_lock(&fork_mutex);
    ret = pthread_create(&forker, NULL, forker_thread, NULL);
    pthread_cond_wait(&fork_cond, &fork_mutex);
    pthread_mutex_unlock(&fork_mutex);

    pthread_mutex_lock(&worker_mutex);
    ret |= pthread_create(&workers[0], NULL, worker, (void *)0);
    if (ret) {
        perror("pthread_create");
        exit(1);
    }
    pthread_cond_wait(&worker_cond, &worker_mutex);
    pthread_mutex_unlock(&worker_mutex);

    ret = pthread_create(&workers[1], NULL, worker, (void *)1);
    if (ret != 0) {
        perror("pthread_create");
        exit(1);
    }

    pthread_join(forker, NULL);
    pthread_join(workers[0], &thread_ret);
    if (thread_ret != 0)
        rc = 1;
    pthread_join(workers[1], &thread_ret);
    if (thread_ret != 0)
        rc = 1;

    if (rc != 0) {
        printf("parent dumping full buffer\n");
        dump_buffer(buffer + 512, 2 * pagesize);
    }

    close(fd);
    free(buffer);
    exit(rc);
}
========== forkscrew.c ========

========== forkscrewreverse.c ========
#define _GNU_SOURCE 1
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <memory.h>
#include <pthread.h>
#include <getopt.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>

#define FILESIZE (40*1024*1024)
#define BUFSIZE (40*1024*1024)

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static const char *filename = "file.dat";
static int fd;
static void *buffer;

#define PAGE_SIZE 4096

static void store(void)
{
    int i;

    if (usleep(50*1000) == -1)
        perror("usleep"), exit(1);
    printf("child storing\n");
    fflush(stdout);
    for (i = 0; i < BUFSIZE; i++)
        ((char *)buffer)[i] = 0xff;
    _exit(0);
}

static void *writer(void *arg)
{
    int i;

    if (pthread_mutex_lock(&lock) == -1)
        perror("pthread_mutex_lock"), exit(1);
    printf("thread writing\n");
    fflush(stdout);
    for (i = 0; i < FILESIZE / BUFSIZE; i++) {
        size_t count = BUFSIZE;
        ssize_t ret;
        do {
            ret = write(fd, buffer, count);
            if (ret == -1) {
                if (errno != EINTR)
                    perror("write"), exit(1);
                ret = 0;
            }
            count -= ret;
        } while (count);
    }
    printf("thread writing done\n");
    fflush(stdout);
    if (pthread_mutex_unlock(&lock) == -1)
        perror("pthread_mutex_unlock"), exit(1);
    return NULL;
}

int main(int argc, char *argv[])
{
    int i;
    int status;
    pthread_t writer_thread;
    pid_t store_proc;

    posix_memalign(&buffer, PAGE_SIZE, BUFSIZE);
    printf("Write buffer: %p.\n", buffer);
    for (i = 0; i < BUFSIZE; i++)
        ((char *)buffer)[i] = 0x00;

    fd = open(filename, O_RDWR|O_DIRECT);
    if (fd == -1)
        perror("open"), exit(1);

    if (pthread_mutex_lock(&lock) == -1)
        perror("pthread_mutex_lock"), exit(1);

    if (pthread_create(&writer_thread, NULL, writer, NULL) == -1)
        perror("pthread_create"), exit(1);

    store_proc = fork();
    if (store_proc == -1)
        perror("fork"), exit(1);
    if (!store_proc)
        store();

    if (pthread_mutex_unlock(&lock) == -1)
        perror("pthread_mutex_unlock"), exit(1);

    if (usleep(10*1000) == -1)
        perror("usleep"), exit(1);

    printf("parent storing\n");
    fflush(stdout);
    for (i = 0; i < BUFSIZE; i++)
        ((char *)buffer)[i] = 0x11;

    do {
        pid_t w;
        w = waitpid(store_proc, &status, WUNTRACED | WCONTINUED);
        if (w == -1)
            perror("waitpid"), exit(1);
    } while (!WIFEXITED(status) && !WIFSIGNALED(status));

    if (pthread_join(writer_thread, NULL) == -1)
        perror("pthread_join"), exit(1);
    exit(0);
}
========== forkscrewreverse.c ========

Normally I test with "dma_thread -a 512 -w 40".
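The hugepage runs below assume libhugetlbfs is installed and a hugetlbfs
mount at /mnt/huge with enough huge pages reserved; a typical setup (the
mount point is simply the path the commands below reference) looks like:

mkdir -p /mnt/huge
mount -t hugetlbfs none /mnt/huge
echo 128 > /proc/sys/vm/nr_hugepages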
To reproduce or verify the fix with hugepages, run the testcases like this:

LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ../test/dma_thread -a 512 -w 40
LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./forkscrew
LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./forkscrewreverse

This is a fixed version of the original patch from Nick Piggin.

KSM has the same problem as fork, and it also checks the page_count after a
ptep_clear_flush_notify (the _flush sends the smp-tlb-flush that stops
gup-fast, so KSM doesn't depend on the gup-fast changes above that allow fork
not to flush the smp-tlb at every pte wrprotected, and the _notify ensures
all secondary ptes are zapped and any page pin is released by mmu-notifier
subsystems that take page pins, as KVM currently does).

BTW, I guess it's pure luck that ENOSPC != VM_FAULT_OOM in hugetlb.c; mixing
-errno with -VM_FAULT_* is total breakage that will have to be cleaned up
(either don't use -ENOSPC, or use -ENOMEM instead of VM_FAULT_OOM). I didn't
address it in this patch as it's unrelated.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

diff --git a/arch/x86_64/mm/gup.c b/arch/x86_64/mm/gup.c
index acb5d93..750f186 100644
--- a/arch/x86_64/mm/gup.c
+++ b/arch/x86_64/mm/gup.c
@@ -90,6 +90,26 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
 		BUG_ON(!pfn_valid(pte_pfn(pte)));
 		page = pte_page(pte);
 		get_page(page);
+		if (PageAnon(page)) {
+			if (!PageGUP(page))
+				SetPageGUP(page);
+			smp_mb();
+			/*
+			 * Fork doesn't want to flush the smp-tlb for
+			 * every pte that it marks readonly, but newly
+			 * created shared anon pages cannot have
+			 * direct-io going to them, so check if fork
+			 * made the page shared before we took the
+			 * page pin.
+			 * de-cow to make direct read from memory safe.
+			 */
+			if ((pte_val(gup_get_pte(ptep)) &
+			     (mask | _PAGE_SPECIAL)) != (mask|_PAGE_RW)) {
+				put_page(page);
+				pte_unmap(ptep);
+				return 0;
+			}
+		}
 		pages[*nr] = page;
 		(*nr)++;

@@ -99,24 +119,16 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
 	return 1;
 }

-static inline void get_head_page_multiple(struct page *page, int nr)
-{
-	BUG_ON(page != compound_head(page));
-	BUG_ON(page_count(page) == 0);
-	atomic_add(nr, &page->_count);
-}
-
-static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,
-		unsigned long end, int write, struct page **pages, int *nr)
+static noinline int gup_huge_pmd(pmd_t *pmdp, unsigned long addr,
+		unsigned long end, struct page **pages, int *nr)
 {
 	unsigned long mask;
-	pte_t pte = *(pte_t *)&pmd;
+	pte_t pte = *(pte_t *)pmdp;
 	struct page *head, *page;
 	int refs;

-	mask = _PAGE_PRESENT|_PAGE_USER;
-	if (write)
-		mask |= _PAGE_RW;
+	/* de-cow to make direct read from memory safe */
+	mask = _PAGE_PRESENT|_PAGE_USER|_PAGE_RW;
 	if ((pte_val(pte) & mask) != mask)
 		return 0;
 	/* hugepages are never "special" */
@@ -128,12 +140,21 @@ static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,
 	page = head + ((addr & ~HPAGE_MASK) >> PAGE_SHIFT);
 	do {
 		BUG_ON(compound_head(page) != head);
+		get_page(head);
+		if (!PageGUP(head))
+			SetPageGUP(head);
+		smp_mb();
+		if ((pte_val(*(pte_t *)pmdp) & mask) != mask) {
+			put_page(page);
+			return 0;
+		}
 		pages[*nr] = page;
 		(*nr)++;
 		page++;
 		refs++;
 	} while (addr += PAGE_SIZE, addr != end);
-	get_head_page_multiple(head, refs);
+	BUG_ON(page_count(head) == 0);
+	BUG_ON(head != compound_head(head));

 	return 1;
 }
@@ -152,7 +173,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 		if (pmd_none(pmd))
 			return 0;
 		if (unlikely(pmd_large(pmd))) {
-			if (!gup_huge_pmd(pmd, addr, next, write, pages, nr))
+			if (!gup_huge_pmd(pmdp, addr, next, pages, nr))
 				return 0;
 		} else {
 			if (!gup_pte_range(pmd, addr, next, write, pages, nr))
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index d0f4d51..a1c6c8d 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -14,8 +14,8 @@ static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
 }

 int hugetlb_sysctl_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *);
-int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *);
-int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *, struct page **, struct vm_area_struct **, unsigned long *, int *, int, int);
+int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *, struct vm_area_struct *);
+int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *, struct page **, struct vm_area_struct **, unsigned long *, int *, int);
 void unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long);
 void __unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long);
 int hugetlb_prefault(struct address_space *, struct vm_area_struct *);
@@ -126,9 +126,9 @@ static inline unsigned long hugetlb_total_pages(void)
 	return 0;
 }

-#define follow_hugetlb_page(m,v,p,vs,a,b,i,w)	({ BUG(); 0; })
+#define follow_hugetlb_page(m,v,p,vs,a,b,i)	({ BUG(); 0; })
 #define follow_huge_addr(mm, addr, write)	ERR_PTR(-EINVAL)
-#define copy_hugetlb_page_range(src, dst, vma)	({ BUG(); 0; })
+#define copy_hugetlb_page_range(src, dst, dst_vma, src_vma)	({ BUG(); 0; })
 #define hugetlb_prefault(mapping, vma)		({ BUG(); 0; })
 #define unmap_hugepage_range(vma, start, end)	BUG()
 #define hugetlb_report_meminfo(buf)		0
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b8a0d95..99ebb1c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -794,7 +794,8 @@ void free_pgd_range(struct mmu_gather **tlb, unsigned long addr,
 void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *start_vma,
 		unsigned long floor, unsigned long ceiling);
 int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
-			struct vm_area_struct *vma);
+			struct vm_area_struct *dst_vma,
+			struct vm_area_struct *src_vma);
 int zeromap_page_range(struct vm_area_struct *vma, unsigned long from,
 			unsigned long size, pgprot_t prot);
 void unmap_mapping_range(struct address_space *mapping,
@@ -1186,11 +1187,12 @@ int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
 			unsigned long pfn);

 struct page *follow_page(struct vm_area_struct *, unsigned long address,
-			unsigned int foll_flags);
+			unsigned int *foll_flags);
 #define FOLL_WRITE	0x01	/* check pte is writable */
 #define FOLL_TOUCH	0x02	/* mark page accessed */
 #define FOLL_GET	0x04	/* do get_page on page */
 #define FOLL_ANON	0x08	/* give ZERO_PAGE if no pgtable */
+#define FOLL_COW	0x10	/* COW already happened */

 #ifdef CONFIG_XEN
 typedef int (*pte_fn_t)(pte_t *pte, struct page *pmd_page, unsigned long addr,
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 63c8e4c..6105cad 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -86,6 +86,7 @@
 #define PG_reclaim		17	/* To be reclaimed asap */
 #define PG_nosave_free		18	/* Free, should not be written */
 #define PG_buddy		19	/* Page is free, on buddy lists */
+#define PG_gup			20	/* Page pin may be because of gup */

 #define PG_xpmem		27	/* Testing for xpmem. */

 /* PG_owner_priv_1 users should have descriptive aliases */
@@ -239,6 +240,10 @@
 #define __SetPageCompound(page)	__set_bit(PG_compound, &(page)->flags)
 #define __ClearPageCompound(page) __clear_bit(PG_compound, &(page)->flags)

+#define SetPageGUP(page)	set_bit(PG_gup, &(page)->flags)
+#define PageGUP(page)		test_bit(PG_gup, &(page)->flags)
+#define __ClearPageGUP(page)	__clear_bit(PG_gup, &(page)->flags)
+
 /*
  * PG_reclaim is used in combination with PG_compound to mark the
  * head and tail of a compound page
diff --git a/kernel/fork.c b/kernel/fork.c
index 2e375a5..8ef2897 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -386,7 +386,7 @@ static inline int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 		rb_parent = &tmp->vm_rb;

 		mm->map_count++;
-		retval = copy_page_range(mm, oldmm, mpnt);
+		retval = copy_page_range(mm, oldmm, tmp, mpnt);

 		if (tmp->vm_ops && tmp->vm_ops->open)
 			tmp->vm_ops->open(tmp);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f737968..d98c8d1 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -357,18 +357,22 @@ static void set_huge_ptep_writable(struct vm_area_struct *vma,
 	lazy_mmu_prot_update(entry);
 }

+static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
+		       unsigned long address, pte_t *ptep, pte_t pte);

 int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
-			    struct vm_area_struct *vma)
+			    struct vm_area_struct *dst_vma,
+			    struct vm_area_struct *src_vma)
 {
-	pte_t *src_pte, *dst_pte, entry;
+	pte_t *src_pte, *dst_pte, entry, orig_entry;
 	struct page *ptepage;
 	unsigned long addr;
-	int cow;
+	int cow, forcecow, oom;

-	cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
+	cow = (src_vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;

-	for (addr = vma->vm_start; addr < vma->vm_end; addr += HPAGE_SIZE) {
+	for (addr = src_vma->vm_start; addr < src_vma->vm_end;
+	     addr += HPAGE_SIZE) {
 		src_pte = huge_pte_offset(src, addr);
 		if (!src_pte)
 			continue;
@@ -378,18 +382,50 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 		/* if the page table is shared dont copy or take references */
 		if (dst_pte == src_pte)
 			continue;
+		oom = 0;
 		spin_lock(&dst->page_table_lock);
 		spin_lock(&src->page_table_lock);
-		if (!huge_pte_none(huge_ptep_get(src_pte))) {
-			if (cow)
-				huge_ptep_set_wrprotect(src, addr, src_pte);
-			entry = huge_ptep_get(src_pte);
+		orig_entry = entry = huge_ptep_get(src_pte);
+		forcecow = 0;
+		if (!huge_pte_none(entry)) {
 			ptepage = pte_page(entry);
 			get_page(ptepage);
+			if (cow && pte_write(entry)) {
+				huge_ptep_set_wrprotect(src, addr, src_pte);
+				smp_mb();
+				if (PageGUP(ptepage))
+					forcecow = 1;
+				entry = huge_ptep_get(src_pte);
+			}
 			set_huge_pte_at(dst, addr, dst_pte, entry);
 		}
-		spin_unlock(&src->page_table_lock);
+		spin_unlock(&src->page_table_lock);
+		if (forcecow) {
+			int cow_ret;
+			/* force atomic copy from parent to child */
+			flush_tlb_range(src_vma, addr, addr+HPAGE_SIZE);
+			cow_ret = hugetlb_cow(dst, dst_vma, addr,
+					      dst_pte, entry);
+			/*
+			 * We hold mmap_sem in write mode and
+			 * the VM doesn't know about hugepages,
+			 * so the src_pte/dst_pte can't change
+			 * from under us even without both
+			 * page_table_lock held the whole time.
+			 */
+			spin_lock(&src->page_table_lock);
+			BUG_ON(!pte_same(huge_ptep_get(src_pte),
+					 entry));
+			set_huge_pte_at(src, addr,
+					src_pte,
+					orig_entry);
+			spin_unlock(&src->page_table_lock);
+			if (cow_ret != VM_FAULT_MINOR)
+				oom = 1;
+		}
 		spin_unlock(&dst->page_table_lock);
+		if (oom)
+			goto nomem;
 	}
 	return 0;
@@ -620,14 +656,14 @@ static int huge_zeropage_ok(pte_t *ptep, int write, int shared)

 int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			struct page **pages, struct vm_area_struct **vmas,
-			unsigned long *position, int *length, int i,
-			int write)
+			unsigned long *position, int *length, int i)
 {
 	unsigned long pfn_offset;
 	unsigned long vaddr = *position;
 	int remainder = *length;
 	int zeropage_ok = 0;
 	int shared = vma->vm_flags & VM_SHARED;
+	const int write = 1;

 	spin_lock(&mm->page_table_lock);
 	while (vaddr < vma->vm_end && remainder) {
@@ -664,6 +700,8 @@ int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		page = pte_page(huge_ptep_get(pte));
same_page:
 		if (pages) {
+			if (write && !PageGUP(page))
+				SetPageGUP(page);
 			if (zeropage_ok)
 				pages[i] = ZERO_PAGE(0);
 			else
diff --git a/mm/memory.c b/mm/memory.c
index dd41265..06b0bec 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -475,14 +475,16 @@ out:
  * covered by this vma.
  */

-static inline void
+static inline int
 copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-		pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma,
+		pte_t *dst_pte, pte_t *src_pte,
+		struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 		unsigned long addr, int *rss)
 {
-	unsigned long vm_flags = vma->vm_flags;
+	unsigned long vm_flags = src_vma->vm_flags;
 	pte_t pte = *src_pte;
 	struct page *page;
+	int forcecow = 0;

 	/* pte contains position in swap or file, so copy. */
 	if (unlikely(!pte_present(pte))) {
@@ -513,15 +515,6 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	}

 	/*
-	 * If it's a COW mapping, write protect it both
-	 * in the parent and the child
-	 */
-	if (is_cow_mapping(vm_flags)) {
-		ptep_set_wrprotect(src_mm, addr, src_pte);
-		pte = *src_pte;
-	}
-
-	/*
 	 * If it's a shared mapping, mark it clean in
 	 * the child
 	 */
@@ -529,27 +522,87 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		pte = pte_mkclean(pte);
 	pte = pte_mkold(pte);

-	page = vm_normal_page(vma, addr, pte);
+	/*
+	 * If it's a COW mapping, write protect it both
+	 * in the parent and the child.
+	 */
+	if (is_cow_mapping(vm_flags) && pte_write(pte)) {
+		/*
+		 * Serialization against gup-fast happens by
+		 * wrprotecting the pte and checking the PG_gup flag
+		 * and the number of page pins after that. If gup-fast
+		 * boosts the page_count after we checked it, it will
+		 * also take the slow path because it will find the
+		 * pte wrprotected.
+		 */
+		ptep_set_wrprotect(src_mm, addr, src_pte);
+	}
+
+	page = vm_normal_page(src_vma, addr, pte);
 	if (page) {
 		get_page(page);
 		page_dup_rmap(page);
+		if (is_cow_mapping(vm_flags) && pte_write(pte) &&
+		    PageAnon(page)) {
+			smp_mb();
+			if (PageGUP(page)) {
+				if (unlikely(TestSetPageLocked(page)))
+					forcecow = 1;
+				else {
+					BUG_ON(page_mapcount(page) != 2);
+					if (unlikely(page_count(page) !=
+						     page_mapcount(page)
+						     + !!PageSwapCache(page)))
+						forcecow = 1;
+					unlock_page(page);
+				}
+			}
+		}
 		rss[!!PageAnon(page)]++;
 	}

+	if (is_cow_mapping(vm_flags) && pte_write(pte)) {
+		pte = pte_wrprotect(pte);
+		if (forcecow) {
+			/* force atomic copy from parent to child */
+			flush_tlb_page(src_vma, addr);
+			/*
+			 * Don't set the dst_pte here to be
+			 * safer, as fork_pre_cow might return
+			 * -EAGAIN and restart.
+			 */
+			goto out;
+		}
+	}
+
out_set_pte:
 	set_pte_at(dst_mm, addr, dst_pte, pte);
+out:
+	return forcecow;
 }

+static int fork_pre_cow(struct mm_struct *dst_mm,
+			struct mm_struct *src_mm,
+			struct vm_area_struct *dst_vma,
+			struct vm_area_struct *src_vma,
+			unsigned long address,
+			pte_t **dst_ptep, pte_t **src_ptep,
+			spinlock_t **dst_ptlp, spinlock_t **src_ptlp,
+			pmd_t *dst_pmd, pmd_t *src_pmd);
+
 static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-		pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma,
+		pmd_t *dst_pmd, pmd_t *src_pmd,
+		struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 		unsigned long addr, unsigned long end)
 {
 	pte_t *src_pte, *dst_pte;
 	spinlock_t *src_ptl, *dst_ptl;
 	int progress = 0;
 	int rss[2];
+	int forcecow;

again:
+	forcecow = 0;
 	rss[1] = rss[0] = 0;
 	dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
 	if (!dst_pte)
@@ -559,6 +612,9 @@ again:
 	spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);

 	do {
+		if (forcecow)
+			break;
+
 		/*
 		 * We are holding two locks at this point - either of them
 		 * could generate latencies in another task on another CPU.
@@ -574,22 +630,54 @@ again:
 			progress++;
 			continue;
 		}
-		copy_one_pte(dst_mm, src_mm, dst_pte, src_pte, vma, addr, rss);
+		forcecow = copy_one_pte(dst_mm, src_mm, dst_pte, src_pte,
+					dst_vma, src_vma, addr, rss);
 		progress += 8;
 	} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);

+	if (unlikely(forcecow)) {
+		pte_t *_src_pte = src_pte-1, *_dst_pte = dst_pte-1;
+		/*
+		 * Try to COW the child page as direct I/O is working
+		 * on the parent page, and so we have to mark the parent
+		 * pte read-write again before dropping the PT lock and
+		 * mmap_sem, to avoid the page being cowed in the
+		 * parent and any direct I/O getting lost.
+		 */
+		forcecow = fork_pre_cow(dst_mm, src_mm,
+					dst_vma, src_vma,
+					addr-PAGE_SIZE,
+					&_dst_pte, &_src_pte,
+					&dst_ptl, &src_ptl,
+					dst_pmd, src_pmd);
+		src_pte = _src_pte + 1;
+		dst_pte = _dst_pte + 1;
+		/* after the page copy set the parent pte writeable again */
+		set_pte_at(src_mm, addr-PAGE_SIZE, src_pte-1,
+			   pte_mkwrite(*(src_pte-1)));
+		if (unlikely(forcecow == -EAGAIN)) {
+			dst_pte--;
+			src_pte--;
+			addr -= PAGE_SIZE;
+			rss[1]--;
+		}
+	}
+
 	spin_unlock(src_ptl);
 	pte_unmap_nested(src_pte - 1);
 	add_mm_rss(dst_mm, rss[0], rss[1]);
 	pte_unmap_unlock(dst_pte - 1, dst_ptl);
 	cond_resched();
+	if (unlikely(forcecow == -ENOMEM))
+		return -ENOMEM;
 	if (addr != end)
 		goto again;
 	return 0;
 }

 static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-		pud_t *dst_pud, pud_t *src_pud, struct vm_area_struct *vma,
+		pud_t *dst_pud, pud_t *src_pud,
+		struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 		unsigned long addr, unsigned long end)
 {
 	pmd_t *src_pmd, *dst_pmd;
@@ -604,14 +692,15 @@ static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src
 		if (pmd_none_or_clear_bad(src_pmd))
 			continue;
 		if (copy_pte_range(dst_mm, src_mm, dst_pmd, src_pmd,
-						vma, addr, next))
+				   dst_vma, src_vma, addr, next))
 			return -ENOMEM;
 	} while (dst_pmd++, src_pmd++, addr = next, addr != end);
 	return 0;
 }

 static inline int copy_pud_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-		pgd_t *dst_pgd, pgd_t *src_pgd, struct vm_area_struct *vma,
+		pgd_t *dst_pgd, pgd_t *src_pgd,
+		struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 		unsigned long addr, unsigned long end)
 {
 	pud_t *src_pud, *dst_pud;
@@ -626,19 +715,20 @@ static inline int copy_pud_range(struct mm_struct *dst_mm, struct mm_struct *src
 		if (pud_none_or_clear_bad(src_pud))
 			continue;
 		if (copy_pmd_range(dst_mm, src_mm, dst_pud, src_pud,
-						vma, addr, next))
+				   dst_vma, src_vma, addr, next))
 			return -ENOMEM;
 	} while (dst_pud++, src_pud++, addr = next, addr != end);
 	return 0;
 }

 int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-		struct vm_area_struct *vma)
+		struct vm_area_struct *dst_vma,
+		struct vm_area_struct *src_vma)
 {
 	pgd_t *src_pgd, *dst_pgd;
 	unsigned long next;
-	unsigned long addr = vma->vm_start;
-	unsigned long end = vma->vm_end;
+	unsigned long addr = src_vma->vm_start;
+	unsigned long end = src_vma->vm_end;
 	int ret;

 	/*
@@ -647,13 +737,14 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	 * readonly mappings. The tradeoff is that copy_page_range is more
 	 * efficient than faulting.
 	 */
-	if (!(vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_PFNMAP|VM_INSERTPAGE))) {
-		if (!vma->anon_vma)
+	if (!(src_vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_PFNMAP|VM_INSERTPAGE))) {
+		if (!src_vma->anon_vma)
 			return 0;
 	}

-	if (is_vm_hugetlb_page(vma))
-		return copy_hugetlb_page_range(dst_mm, src_mm, vma);
+	if (is_vm_hugetlb_page(src_vma))
+		return copy_hugetlb_page_range(dst_mm, src_mm,
+					       dst_vma, src_vma);

 	/*
 	 * We need to invalidate the secondary MMU mappings only when
@@ -661,7 +752,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	 * parent mm. And a permission downgrade will only happen if
 	 * is_cow_mapping() returns true.
 	 */
-	if (is_cow_mapping(vma->vm_flags))
+	if (is_cow_mapping(src_vma->vm_flags))
 		mmu_notifier_invalidate_range_start(src_mm, addr, end);

 	ret = 0;
@@ -672,15 +763,15 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		if (pgd_none_or_clear_bad(src_pgd))
 			continue;
 		if (unlikely(copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd,
-					    vma, addr, next))) {
+					    dst_vma, src_vma, addr, next))) {
 			ret = -ENOMEM;
 			break;
 		}
 	} while (dst_pgd++, src_pgd++, addr = next, addr != end);

-	if (is_cow_mapping(vma->vm_flags))
+	if (is_cow_mapping(src_vma->vm_flags))
 		mmu_notifier_invalidate_range_end(src_mm,
-						  vma->vm_start, end);
+						  src_vma->vm_start, end);
 	return ret;
 }

@@ -983,7 +1074,7 @@ EXPORT_SYMBOL_GPL(zap_page_range);
  * Do a quick page-table lookup for a single page.
  */
 struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
-			unsigned int flags)
+			unsigned int *flagsp)
 {
 	pgd_t *pgd;
 	pud_t *pud;
@@ -992,6 +1083,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 	spinlock_t *ptl;
 	struct page *page;
 	struct mm_struct *mm = vma->vm_mm;
+	unsigned long flags = *flagsp;

 	page = follow_huge_addr(mm, address, flags & FOLL_WRITE);
 	if (!IS_ERR(page)) {
@@ -1032,8 +1124,19 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 	if (unlikely(!page))
 		goto bad_page;

-	if (flags & FOLL_GET)
+	if (flags & FOLL_GET) {
+		if (PageAnon(page)) {
+			/* de-cow to make direct read from memory safe */
+			if (!pte_write(pte) && !(flags & FOLL_COW)) {
+				page = NULL;
+				*flagsp |= FOLL_WRITE;
+				goto unlock;
+			}
+			if (!PageGUP(page))
+				SetPageGUP(page);
+		}
 		get_page(page);
+	}
 	if (flags & FOLL_TOUCH) {
 		if ((flags & FOLL_WRITE) &&
 		    !pte_dirty(pte) && !PageDirty(page))
@@ -1173,7 +1276,7 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,

 		if (is_vm_hugetlb_page(vma)) {
 			i = follow_hugetlb_page(mm, vma, pages, vmas,
-						&start, &len, i, write);
+						&start, &len, i);
 			continue;
 		}

@@ -1190,7 +1293,7 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 				foll_flags |= FOLL_WRITE;

 			cond_resched();
-			while (!(page = follow_page(vma, start, foll_flags))) {
+			while (!(page = follow_page(vma, start, &foll_flags))) {
 				int ret;
 				ret = __handle_mm_fault(mm, vma, start,
 						foll_flags & FOLL_WRITE);
@@ -1200,8 +1303,10 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 				 * decided not to set pte_write. We can thus safely do
 				 * subsequent page lookups as if they were reads.
 				 */
-				if (ret & VM_FAULT_WRITE)
+				if (ret & VM_FAULT_WRITE) {
 					foll_flags &= ~FOLL_WRITE;
+					foll_flags |= FOLL_COW;
+				}

 				switch (ret & ~VM_FAULT_WRITE) {
 				case VM_FAULT_MINOR:
@@ -1218,6 +1323,7 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 					BUG();
 				}
 			}
+			foll_flags &= ~FOLL_COW;
 			if (IS_ERR(page))
 				return i ? i : PTR_ERR(page);
 			if (pages) {
@@ -1724,6 +1830,79 @@ static inline void cow_user_page(struct page *dst, struct page *src, unsigned lo
 	copy_user_highpage(dst, src, va);
 }

+static int fork_pre_cow(struct mm_struct *dst_mm,
+			struct mm_struct *src_mm,
+			struct vm_area_struct *dst_vma,
+			struct vm_area_struct *src_vma,
+			unsigned long address,
+			pte_t **dst_ptep, pte_t **src_ptep,
+			spinlock_t **dst_ptlp, spinlock_t **src_ptlp,
+			pmd_t *dst_pmd, pmd_t *src_pmd)
+{
+	pte_t _src_pte, _dst_pte;
+	struct page *old_page, *new_page;
+
+	_src_pte = **src_ptep;
+	_dst_pte = **dst_ptep;
+	old_page = vm_normal_page(src_vma, address, **src_ptep);
+	BUG_ON(!old_page);
+	get_page(old_page);
+	spin_unlock(*src_ptlp);
+	pte_unmap_nested(*src_ptep);
+	pte_unmap_unlock(*dst_ptep, *dst_ptlp);
+
+	new_page = alloc_page_vma(GFP_HIGHUSER, dst_vma, address);
+	if (unlikely(!new_page)) {
+		*dst_ptep = pte_offset_map_lock(dst_mm, dst_pmd, address,
+						dst_ptlp);
+		*src_ptep = pte_offset_map_nested(src_pmd, address);
+		*src_ptlp = pte_lockptr(src_mm, src_pmd);
+		spin_lock_nested(*src_ptlp, SINGLE_DEPTH_NESTING);
+		return -ENOMEM;
+	}
+	cow_user_page(new_page, old_page, address);
+
+	*dst_ptep = pte_offset_map_lock(dst_mm, dst_pmd, address, dst_ptlp);
+	*src_ptep = pte_offset_map_nested(src_pmd, address);
+	*src_ptlp = pte_lockptr(src_mm, src_pmd);
+	spin_lock_nested(*src_ptlp, SINGLE_DEPTH_NESTING);
+
+	/*
+	 * The src pte can be unmapped by the VM from under us after
+	 * dropping the src_ptlp, but it can't be cowed from under us
+	 * as fork holds the mmap_sem in write mode.
+	 */
+	if (!pte_same(**src_ptep, _src_pte))
+		goto eagain;
+	if (!pte_same(**dst_ptep, _dst_pte))
+		goto eagain;
+
+	page_remove_rmap(old_page);
+	page_cache_release(old_page);
+	page_cache_release(old_page);
+
+	flush_cache_page(src_vma, address, pte_pfn(**src_ptep));
+	_dst_pte = mk_pte(new_page, dst_vma->vm_page_prot);
+	_dst_pte = maybe_mkwrite(pte_mkdirty(_dst_pte), dst_vma);
+	page_add_new_anon_rmap(new_page, dst_vma, address);
+	lru_cache_add_active(new_page);
+	set_pte_at(dst_mm, address, *dst_ptep, _dst_pte);
+	update_mmu_cache(dst_vma, address, _dst_pte);
+	lazy_mmu_prot_update(_dst_pte);
+	return 0;
+
+eagain:
+	page_cache_release(old_page);
+	page_cache_release(new_page);
+	/*
+	 * Later we'll repeat the copy of this pte, so here we have to
+	 * undo the mapcount and page count taken in copy_one_pte.
+	 */
+	page_remove_rmap(old_page);
+	page_cache_release(old_page);
+	return -EAGAIN;
+}
+
 /*
  * This routine handles present pages, when users try to write
  * to a shared page. It is done by copying the page to a new address
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4eacdbe..4bbf86b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -155,6 +155,7 @@ static void bad_page(struct page *page)
 			1 << PG_slab	|
 			1 << PG_swapcache |
 			1 << PG_writeback |
+			1 << PG_gup |
 			1 << PG_buddy );
 	set_page_count(page, 0);
 	reset_page_mapcount(page);
@@ -401,6 +402,8 @@ static inline int free_pages_check(struct page *page)
 		bad_page(page);
 	if (PageDirty(page))
 		__ClearPageDirty(page);
+	if (PageGUP(page))
+		__ClearPageGUP(page);
 	/*
 	 * For now, we report if PG_reserved was found set, but do not
 	 * clear it, and do not free the page. But we shall soon need
@@ -547,6 +550,7 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
 			1 << PG_swapcache |
 			1 << PG_writeback |
 			1 << PG_reserved |
+			1 << PG_gup |
 			1 << PG_buddy ))))
 		bad_page(page);

diff --git a/mm/swap.c b/mm/swap.c
index 6898238..6ba1333 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -49,6 +49,8 @@ static void put_compound_page(struct page *page)
 	if (put_page_testzero(page)) {
 		compound_page_dtor *dtor;

+		if (PageGUP(page))
+			__ClearPageGUP(page);
 		dtor = get_compound_page_dtor(page);
 		(*dtor)(page);
 	}