All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 0/5] large folios swap-in: handle refault cases first
@ 2024-04-09  8:26 Barry Song
  2024-04-09  8:26 ` [PATCH v2 1/5] mm: swap: introduce swap_free_nr() for batched swap_free() Barry Song
                   ` (4 more replies)
  0 siblings, 5 replies; 54+ messages in thread
From: Barry Song @ 2024-04-09  8:26 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: baolin.wang, chrisl, david, hanchuanhua, hannes, hughd, kasong,
	ryan.roberts, surenb, v-songbaohua, willy, xiang, ying.huang,
	yosryahmed, yuzhao, ziy, linux-kernel

From: Barry Song <v-songbaohua@oppo.com>

This patch is extracted from the large folio swapin series[1], primarily addressing
the handling of scenarios involving large folios in the swap cache. Currently, it is
particularly focused on addressing the refaulting of mTHP, which is still undergoing
reclamation. This approach aims to streamline code review and expedite the integration
of this segment into the MM tree.

It relies on Ryan's swap-out series v7[2], leveraging the helper function
swap_pte_batch() introduced by that series.

Presently, do_swap_page only encounters a large folio in the swap
cache before the large folio is released by vmscan. However, the code
should remain equally useful once we support large folio swap-in via
swapin_readahead(). This approach can effectively reduce page faults
and eliminate most redundant checks and early exits for MTE restoration
in recent MTE patchset[3].

The large folio swap-in for SWP_SYNCHRONOUS_IO and swapin_readahead()
will be split into separate patch sets and sent at a later time.

-v2:
 - rebase on top of mm-unstable in which Ryan's swap_pte_batch() has changed
   a lot.
 - remove folio_add_new_anon_rmap() for !folio_test_anon()
   as currently large folios are always anon(refault).
 - add mTHP swpin refault counters

-v1:
  Link: https://lore.kernel.org/linux-mm/20240402073237.240995-1-21cnbao@gmail.com/

Differences with the original large folios swap-in series
 - collect r-o-b, acked;
 - rename swap_nr_free to swap_free_nr, according to Ryan;
 - limit the maximum kernel stack usage for swap_free_nr, Ryan;
 - add output argument in swap_pte_batch to expose if all entries are
   exclusive
 - many clean refinements, handle the corner case folio's virtual addr
   might not be naturally aligned

[1] https://lore.kernel.org/linux-mm/20240304081348.197341-1-21cnbao@gmail.com/
[2] https://lore.kernel.org/linux-mm/20240408183946.2991168-1-ryan.roberts@arm.com/
[3] https://lore.kernel.org/linux-mm/20240322114136.61386-1-21cnbao@gmail.com/

Barry Song (1):
  mm: swap_pte_batch: add an output argument to reture if all swap
    entries are exclusive
  mm: add per-order mTHP swpin_refault counter

Chuanhua Han (3):
  mm: swap: introduce swap_free_nr() for batched swap_free()
  mm: swap: make should_try_to_free_swap() support large-folio
  mm: swap: entirely map large folios found in swapcache

 include/linux/huge_mm.h |  1 +
 include/linux/swap.h    |  5 +++
 mm/huge_memory.c        |  2 ++
 mm/internal.h           |  9 +++++-
 mm/madvise.c            |  2 +-
 mm/memory.c             | 69 ++++++++++++++++++++++++++++++++---------
 mm/swapfile.c           | 51 ++++++++++++++++++++++++++++++
 7 files changed, 123 insertions(+), 16 deletions(-)

Appendix:

The following program can generate numerous instances where large folios
are hit in the swap cache if we enable 64KiB mTHP,

#echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled

#define DATA_SIZE (128UL * 1024)
#define PAGE_SIZE (4UL * 1024)
#define LARGE_FOLIO_SIZE (64UL * 1024)

static void *write_data(void *addr)
{
	unsigned long i;

	for (i = 0; i < DATA_SIZE; i += PAGE_SIZE)
		memset(addr + i, (char)i, PAGE_SIZE);
}

static void *read_data(void *addr)
{
	unsigned long i;

	for (i = 0; i < DATA_SIZE; i += PAGE_SIZE) {
		if (*((char *)addr + i) != (char)i) {
			perror("mismatched data");
			_exit(-1);
		}
	}
}

static void *pgout_data(void *addr)
{
	madvise(addr, DATA_SIZE, MADV_PAGEOUT);
}

int main(int argc, char **argv)
{
	for (int i = 0; i < 10000; i++) {
		pthread_t tid1, tid2;
		void *addr = mmap(NULL, DATA_SIZE * 2, PROT_READ | PROT_WRITE,
				MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
		unsigned long aligned_addr = ((unsigned long)addr + LARGE_FOLIO_SIZE) &
				~(LARGE_FOLIO_SIZE - 1);

		if (addr == MAP_FAILED) {
			perror("fail to malloc");
			return -1;
		}

		write_data(aligned_addr);

		if (pthread_create(&tid1, NULL, pgout_data, (void *)aligned_addr)) {
			perror("fail to pthread_create");
			return -1;
		}

		if (pthread_create(&tid2, NULL, read_data, (void *)aligned_addr)) {
			perror("fail to pthread_create");
			return -1;
		}

		pthread_join(tid1, NULL);
		pthread_join(tid2, NULL);
		munmap(addr, DATA_SIZE * 2);
	}

	return 0;
}

# cat /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/anon_swpout
932
# cat /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/anon_swpin_refault 
1488

-- 
2.34.1


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH v2 1/5] mm: swap: introduce swap_free_nr() for batched swap_free()
  2024-04-09  8:26 [PATCH v2 0/5] large folios swap-in: handle refault cases first Barry Song
@ 2024-04-09  8:26 ` Barry Song
  2024-04-10 23:37   ` SeongJae Park
                     ` (2 more replies)
  2024-04-09  8:26 ` [PATCH v2 2/5] mm: swap: make should_try_to_free_swap() support large-folio Barry Song
                   ` (3 subsequent siblings)
  4 siblings, 3 replies; 54+ messages in thread
From: Barry Song @ 2024-04-09  8:26 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: baolin.wang, chrisl, david, hanchuanhua, hannes, hughd, kasong,
	ryan.roberts, surenb, v-songbaohua, willy, xiang, ying.huang,
	yosryahmed, yuzhao, ziy, linux-kernel

From: Chuanhua Han <hanchuanhua@oppo.com>

While swapping in a large folio, we need to free swaps related to the whole
folio. To avoid frequently acquiring and releasing swap locks, it is better
to introduce an API for batched free.

Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
Co-developed-by: Barry Song <v-songbaohua@oppo.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 include/linux/swap.h |  5 +++++
 mm/swapfile.c        | 51 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 56 insertions(+)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 11c53692f65f..b7a107e983b8 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -483,6 +483,7 @@ extern void swap_shmem_alloc(swp_entry_t);
 extern int swap_duplicate(swp_entry_t);
 extern int swapcache_prepare(swp_entry_t);
 extern void swap_free(swp_entry_t);
+extern void swap_free_nr(swp_entry_t entry, int nr_pages);
 extern void swapcache_free_entries(swp_entry_t *entries, int n);
 extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
 int swap_type_of(dev_t device, sector_t offset);
@@ -564,6 +565,10 @@ static inline void swap_free(swp_entry_t swp)
 {
 }
 
+void swap_free_nr(swp_entry_t entry, int nr_pages)
+{
+}
+
 static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
 {
 }
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 28642c188c93..f4c65aeb088d 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1356,6 +1356,57 @@ void swap_free(swp_entry_t entry)
 		__swap_entry_free(p, entry);
 }
 
+/*
+ * Free up the maximum number of swap entries at once to limit the
+ * maximum kernel stack usage.
+ */
+#define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 512 ? 512 : SWAPFILE_CLUSTER)
+
+/*
+ * Called after swapping in a large folio, batched free swap entries
+ * for this large folio, entry should be for the first subpage and
+ * its offset is aligned with nr_pages
+ */
+void swap_free_nr(swp_entry_t entry, int nr_pages)
+{
+	int i, j;
+	struct swap_cluster_info *ci;
+	struct swap_info_struct *p;
+	unsigned int type = swp_type(entry);
+	unsigned long offset = swp_offset(entry);
+	int batch_nr, remain_nr;
+	DECLARE_BITMAP(usage, SWAP_BATCH_NR) = { 0 };
+
+	/* all swap entries are within a cluster for mTHP */
+	VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
+
+	if (nr_pages == 1) {
+		swap_free(entry);
+		return;
+	}
+
+	remain_nr = nr_pages;
+	p = _swap_info_get(entry);
+	if (p) {
+		for (i = 0; i < nr_pages; i += batch_nr) {
+			batch_nr = min_t(int, SWAP_BATCH_NR, remain_nr);
+
+			ci = lock_cluster_or_swap_info(p, offset);
+			for (j = 0; j < batch_nr; j++) {
+				if (__swap_entry_free_locked(p, offset + i * SWAP_BATCH_NR + j, 1))
+					__bitmap_set(usage, j, 1);
+			}
+			unlock_cluster_or_swap_info(p, ci);
+
+			for_each_clear_bit(j, usage, batch_nr)
+				free_swap_slot(swp_entry(type, offset + i * SWAP_BATCH_NR + j));
+
+			bitmap_clear(usage, 0, SWAP_BATCH_NR);
+			remain_nr -= batch_nr;
+		}
+	}
+}
+
 /*
  * Called after dropping swapcache to decrease refcnt to swap entries.
  */
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v2 2/5] mm: swap: make should_try_to_free_swap() support large-folio
  2024-04-09  8:26 [PATCH v2 0/5] large folios swap-in: handle refault cases first Barry Song
  2024-04-09  8:26 ` [PATCH v2 1/5] mm: swap: introduce swap_free_nr() for batched swap_free() Barry Song
@ 2024-04-09  8:26 ` Barry Song
  2024-04-15  7:11   ` Huang, Ying
  2024-04-09  8:26 ` [PATCH v2 3/5] mm: swap_pte_batch: add an output argument to reture if all swap entries are exclusive Barry Song
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 54+ messages in thread
From: Barry Song @ 2024-04-09  8:26 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: baolin.wang, chrisl, david, hanchuanhua, hannes, hughd, kasong,
	ryan.roberts, surenb, v-songbaohua, willy, xiang, ying.huang,
	yosryahmed, yuzhao, ziy, linux-kernel

From: Chuanhua Han <hanchuanhua@oppo.com>

The function should_try_to_free_swap() operates under the assumption that
swap-in always occurs at the normal page granularity, i.e., folio_nr_pages
= 1. However, in reality, for large folios, add_to_swap_cache() will
invoke folio_ref_add(folio, nr). To accommodate large folio swap-in,
this patch eliminates this assumption.

Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
Co-developed-by: Barry Song <v-songbaohua@oppo.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Acked-by: Chris Li <chrisl@kernel.org>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
---
 mm/memory.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index 78422d1c7381..2702d449880e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3856,7 +3856,7 @@ static inline bool should_try_to_free_swap(struct folio *folio,
 	 * reference only in case it's likely that we'll be the exlusive user.
 	 */
 	return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) &&
-		folio_ref_count(folio) == 2;
+		folio_ref_count(folio) == (1 + folio_nr_pages(folio));
 }
 
 static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v2 3/5] mm: swap_pte_batch: add an output argument to reture if all swap entries are exclusive
  2024-04-09  8:26 [PATCH v2 0/5] large folios swap-in: handle refault cases first Barry Song
  2024-04-09  8:26 ` [PATCH v2 1/5] mm: swap: introduce swap_free_nr() for batched swap_free() Barry Song
  2024-04-09  8:26 ` [PATCH v2 2/5] mm: swap: make should_try_to_free_swap() support large-folio Barry Song
@ 2024-04-09  8:26 ` Barry Song
  2024-04-11 14:54   ` Ryan Roberts
  2024-04-09  8:26 ` [PATCH v2 4/5] mm: swap: entirely map large folios found in swapcache Barry Song
  2024-04-09  8:26 ` [PATCH v2 5/5] mm: add per-order mTHP swpin_refault counter Barry Song
  4 siblings, 1 reply; 54+ messages in thread
From: Barry Song @ 2024-04-09  8:26 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: baolin.wang, chrisl, david, hanchuanhua, hannes, hughd, kasong,
	ryan.roberts, surenb, v-songbaohua, willy, xiang, ying.huang,
	yosryahmed, yuzhao, ziy, linux-kernel

From: Barry Song <v-songbaohua@oppo.com>

Add a boolean argument named any_shared. If any of the swap entries are
non-exclusive, set any_shared to true. The function do_swap_page() can
then utilize this information to determine whether the entire large
folio can be reused.

Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 mm/internal.h | 9 ++++++++-
 mm/madvise.c  | 2 +-
 mm/memory.c   | 2 +-
 3 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 9d3250b4a08a..cae39c372bfc 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -238,7 +238,8 @@ static inline pte_t pte_next_swp_offset(pte_t pte)
  *
  * Return: the number of table entries in the batch.
  */
-static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte)
+static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte,
+				bool *any_shared)
 {
 	pte_t expected_pte = pte_next_swp_offset(pte);
 	const pte_t *end_ptep = start_ptep + max_nr;
@@ -248,12 +249,18 @@ static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte)
 	VM_WARN_ON(!is_swap_pte(pte));
 	VM_WARN_ON(non_swap_entry(pte_to_swp_entry(pte)));
 
+	if (any_shared)
+		*any_shared |= !pte_swp_exclusive(pte);
+
 	while (ptep < end_ptep) {
 		pte = ptep_get(ptep);
 
 		if (!pte_same(pte, expected_pte))
 			break;
 
+		if (any_shared)
+			*any_shared |= !pte_swp_exclusive(pte);
+
 		expected_pte = pte_next_swp_offset(expected_pte);
 		ptep++;
 	}
diff --git a/mm/madvise.c b/mm/madvise.c
index f59169888b8e..d34ca6983227 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -671,7 +671,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 			entry = pte_to_swp_entry(ptent);
 			if (!non_swap_entry(entry)) {
 				max_nr = (end - addr) / PAGE_SIZE;
-				nr = swap_pte_batch(pte, max_nr, ptent);
+				nr = swap_pte_batch(pte, max_nr, ptent, NULL);
 				nr_swap -= nr;
 				free_swap_and_cache_nr(entry, nr);
 				clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
diff --git a/mm/memory.c b/mm/memory.c
index 2702d449880e..c4a52e8d740a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1638,7 +1638,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			folio_put(folio);
 		} else if (!non_swap_entry(entry)) {
 			max_nr = (end - addr) / PAGE_SIZE;
-			nr = swap_pte_batch(pte, max_nr, ptent);
+			nr = swap_pte_batch(pte, max_nr, ptent, NULL);
 			/* Genuine swap entries, hence a private anon pages */
 			if (!should_zap_cows(details))
 				continue;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v2 4/5] mm: swap: entirely map large folios found in swapcache
  2024-04-09  8:26 [PATCH v2 0/5] large folios swap-in: handle refault cases first Barry Song
                   ` (2 preceding siblings ...)
  2024-04-09  8:26 ` [PATCH v2 3/5] mm: swap_pte_batch: add an output argument to reture if all swap entries are exclusive Barry Song
@ 2024-04-09  8:26 ` Barry Song
  2024-04-11 15:33   ` Ryan Roberts
  2024-04-15  8:37   ` Huang, Ying
  2024-04-09  8:26 ` [PATCH v2 5/5] mm: add per-order mTHP swpin_refault counter Barry Song
  4 siblings, 2 replies; 54+ messages in thread
From: Barry Song @ 2024-04-09  8:26 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: baolin.wang, chrisl, david, hanchuanhua, hannes, hughd, kasong,
	ryan.roberts, surenb, v-songbaohua, willy, xiang, ying.huang,
	yosryahmed, yuzhao, ziy, linux-kernel

From: Chuanhua Han <hanchuanhua@oppo.com>

When a large folio is found in the swapcache, the current implementation
requires calling do_swap_page() nr_pages times, resulting in nr_pages
page faults. This patch opts to map the entire large folio at once to
minimize page faults. Additionally, redundant checks and early exits
for ARM64 MTE restoring are removed.

Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
Co-developed-by: Barry Song <v-songbaohua@oppo.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 mm/memory.c | 64 +++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 52 insertions(+), 12 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index c4a52e8d740a..9818dc1893c8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3947,6 +3947,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	pte_t pte;
 	vm_fault_t ret = 0;
 	void *shadow = NULL;
+	int nr_pages = 1;
+	unsigned long start_address = vmf->address;
+	pte_t *start_pte = vmf->pte;
+	bool any_swap_shared = false;
 
 	if (!pte_unmap_same(vmf))
 		goto out;
@@ -4137,6 +4141,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	 */
 	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
 			&vmf->ptl);
+
+	/* We hit large folios in swapcache */
+	if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio)) {
+		int nr = folio_nr_pages(folio);
+		int idx = folio_page_idx(folio, page);
+		unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
+		unsigned long folio_end = folio_start + nr * PAGE_SIZE;
+		pte_t *folio_ptep;
+		pte_t folio_pte;
+
+		if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
+			goto check_pte;
+		if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
+			goto check_pte;
+
+		folio_ptep = vmf->pte - idx;
+		folio_pte = ptep_get(folio_ptep);
+		if (!is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte)) ||
+		    swap_pte_batch(folio_ptep, nr, folio_pte, &any_swap_shared) != nr)
+			goto check_pte;
+
+		start_address = folio_start;
+		start_pte = folio_ptep;
+		nr_pages = nr;
+		entry = folio->swap;
+		page = &folio->page;
+	}
+
+check_pte:
 	if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
 		goto out_nomap;
 
@@ -4190,6 +4223,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 			 */
 			exclusive = false;
 		}
+
+		/* Reuse the whole large folio iff all entries are exclusive */
+		if (nr_pages > 1 && any_swap_shared)
+			exclusive = false;
 	}
 
 	/*
@@ -4204,12 +4241,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	 * We're already holding a reference on the page but haven't mapped it
 	 * yet.
 	 */
-	swap_free(entry);
+	swap_free_nr(entry, nr_pages);
 	if (should_try_to_free_swap(folio, vma, vmf->flags))
 		folio_free_swap(folio);
 
-	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
-	dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
+	folio_ref_add(folio, nr_pages - 1);
+	add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
+	add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
+
 	pte = mk_pte(page, vma->vm_page_prot);
 
 	/*
@@ -4219,33 +4258,34 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	 * exclusivity.
 	 */
 	if (!folio_test_ksm(folio) &&
-	    (exclusive || folio_ref_count(folio) == 1)) {
+	    (exclusive || (folio_ref_count(folio) == nr_pages &&
+			   folio_nr_pages(folio) == nr_pages))) {
 		if (vmf->flags & FAULT_FLAG_WRITE) {
 			pte = maybe_mkwrite(pte_mkdirty(pte), vma);
 			vmf->flags &= ~FAULT_FLAG_WRITE;
 		}
 		rmap_flags |= RMAP_EXCLUSIVE;
 	}
-	flush_icache_page(vma, page);
+	flush_icache_pages(vma, page, nr_pages);
 	if (pte_swp_soft_dirty(vmf->orig_pte))
 		pte = pte_mksoft_dirty(pte);
 	if (pte_swp_uffd_wp(vmf->orig_pte))
 		pte = pte_mkuffd_wp(pte);
-	vmf->orig_pte = pte;
 
 	/* ksm created a completely new copy */
 	if (unlikely(folio != swapcache && swapcache)) {
-		folio_add_new_anon_rmap(folio, vma, vmf->address);
+		folio_add_new_anon_rmap(folio, vma, start_address);
 		folio_add_lru_vma(folio, vma);
 	} else {
-		folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
-					rmap_flags);
+		folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, start_address,
+					 rmap_flags);
 	}
 
 	VM_BUG_ON(!folio_test_anon(folio) ||
 			(pte_write(pte) && !PageAnonExclusive(page)));
-	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
-	arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
+	set_ptes(vma->vm_mm, start_address, start_pte, pte, nr_pages);
+	vmf->orig_pte = ptep_get(vmf->pte);
+	arch_do_swap_page(vma->vm_mm, vma, start_address, pte, pte);
 
 	folio_unlock(folio);
 	if (folio != swapcache && swapcache) {
@@ -4269,7 +4309,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	}
 
 	/* No need to invalidate - it was non-present before */
-	update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
+	update_mmu_cache_range(vmf, vma, start_address, start_pte, nr_pages);
 unlock:
 	if (vmf->pte)
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v2 5/5] mm: add per-order mTHP swpin_refault counter
  2024-04-09  8:26 [PATCH v2 0/5] large folios swap-in: handle refault cases first Barry Song
                   ` (3 preceding siblings ...)
  2024-04-09  8:26 ` [PATCH v2 4/5] mm: swap: entirely map large folios found in swapcache Barry Song
@ 2024-04-09  8:26 ` Barry Song
  2024-04-10 23:15   ` SeongJae Park
                     ` (2 more replies)
  4 siblings, 3 replies; 54+ messages in thread
From: Barry Song @ 2024-04-09  8:26 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: baolin.wang, chrisl, david, hanchuanhua, hannes, hughd, kasong,
	ryan.roberts, surenb, v-songbaohua, willy, xiang, ying.huang,
	yosryahmed, yuzhao, ziy, linux-kernel

From: Barry Song <v-songbaohua@oppo.com>

Currently, we are handling the scenario where we've hit a
large folio in the swapcache, and the reclaiming process
for this large folio is still ongoing.

Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 include/linux/huge_mm.h | 1 +
 mm/huge_memory.c        | 2 ++
 mm/memory.c             | 1 +
 3 files changed, 4 insertions(+)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index c8256af83e33..b67294d5814f 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -269,6 +269,7 @@ enum mthp_stat_item {
 	MTHP_STAT_ANON_ALLOC_FALLBACK,
 	MTHP_STAT_ANON_SWPOUT,
 	MTHP_STAT_ANON_SWPOUT_FALLBACK,
+	MTHP_STAT_ANON_SWPIN_REFAULT,
 	__MTHP_STAT_COUNT
 };
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d8d2ed80b0bf..fb95345b0bde 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -556,12 +556,14 @@ DEFINE_MTHP_STAT_ATTR(anon_alloc, MTHP_STAT_ANON_ALLOC);
 DEFINE_MTHP_STAT_ATTR(anon_alloc_fallback, MTHP_STAT_ANON_ALLOC_FALLBACK);
 DEFINE_MTHP_STAT_ATTR(anon_swpout, MTHP_STAT_ANON_SWPOUT);
 DEFINE_MTHP_STAT_ATTR(anon_swpout_fallback, MTHP_STAT_ANON_SWPOUT_FALLBACK);
+DEFINE_MTHP_STAT_ATTR(anon_swpin_refault, MTHP_STAT_ANON_SWPIN_REFAULT);
 
 static struct attribute *stats_attrs[] = {
 	&anon_alloc_attr.attr,
 	&anon_alloc_fallback_attr.attr,
 	&anon_swpout_attr.attr,
 	&anon_swpout_fallback_attr.attr,
+	&anon_swpin_refault_attr.attr,
 	NULL,
 };
 
diff --git a/mm/memory.c b/mm/memory.c
index 9818dc1893c8..acc023795a4d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4167,6 +4167,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		nr_pages = nr;
 		entry = folio->swap;
 		page = &folio->page;
+		count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_SWPIN_REFAULT);
 	}
 
 check_pte:
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 5/5] mm: add per-order mTHP swpin_refault counter
  2024-04-09  8:26 ` [PATCH v2 5/5] mm: add per-order mTHP swpin_refault counter Barry Song
@ 2024-04-10 23:15   ` SeongJae Park
  2024-04-11  1:46     ` Barry Song
  2024-04-11 15:53   ` Ryan Roberts
  2024-04-17  0:45   ` Huang, Ying
  2 siblings, 1 reply; 54+ messages in thread
From: SeongJae Park @ 2024-04-10 23:15 UTC (permalink / raw)
  To: Barry Song
  Cc: SeongJae Park, akpm, linux-mm, baolin.wang, chrisl, david,
	hanchuanhua, hannes, hughd, kasong, ryan.roberts, surenb,
	v-songbaohua, willy, xiang, ying.huang, yosryahmed, yuzhao, ziy,
	linux-kernel

Hi Barry,

On Tue,  9 Apr 2024 20:26:31 +1200 Barry Song <21cnbao@gmail.com> wrote:

> From: Barry Song <v-songbaohua@oppo.com>
> 
> Currently, we are handling the scenario where we've hit a
> large folio in the swapcache, and the reclaiming process
> for this large folio is still ongoing.
> 
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  include/linux/huge_mm.h | 1 +
>  mm/huge_memory.c        | 2 ++
>  mm/memory.c             | 1 +
>  3 files changed, 4 insertions(+)
> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index c8256af83e33..b67294d5814f 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -269,6 +269,7 @@ enum mthp_stat_item {
>  	MTHP_STAT_ANON_ALLOC_FALLBACK,
>  	MTHP_STAT_ANON_SWPOUT,
>  	MTHP_STAT_ANON_SWPOUT_FALLBACK,
> +	MTHP_STAT_ANON_SWPIN_REFAULT,
>  	__MTHP_STAT_COUNT
>  };
>  
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index d8d2ed80b0bf..fb95345b0bde 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -556,12 +556,14 @@ DEFINE_MTHP_STAT_ATTR(anon_alloc, MTHP_STAT_ANON_ALLOC);
>  DEFINE_MTHP_STAT_ATTR(anon_alloc_fallback, MTHP_STAT_ANON_ALLOC_FALLBACK);
>  DEFINE_MTHP_STAT_ATTR(anon_swpout, MTHP_STAT_ANON_SWPOUT);
>  DEFINE_MTHP_STAT_ATTR(anon_swpout_fallback, MTHP_STAT_ANON_SWPOUT_FALLBACK);
> +DEFINE_MTHP_STAT_ATTR(anon_swpin_refault, MTHP_STAT_ANON_SWPIN_REFAULT);
>  
>  static struct attribute *stats_attrs[] = {
>  	&anon_alloc_attr.attr,
>  	&anon_alloc_fallback_attr.attr,
>  	&anon_swpout_attr.attr,
>  	&anon_swpout_fallback_attr.attr,
> +	&anon_swpin_refault_attr.attr,
>  	NULL,
>  };
>  
> diff --git a/mm/memory.c b/mm/memory.c
> index 9818dc1893c8..acc023795a4d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4167,6 +4167,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  		nr_pages = nr;
>  		entry = folio->swap;
>  		page = &folio->page;
> +		count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_SWPIN_REFAULT);
>  	}

From the latest mm-unstable tree, I get below kunit build failure and
'git bisect' points this patch.

    $ ./tools/testing/kunit/kunit.py run --build_dir ../kunit.out/
    [16:07:40] Configuring KUnit Kernel ...
    [16:07:40] Building KUnit Kernel ...
    Populating config with:
    $ make ARCH=um O=../kunit.out/ olddefconfig
    Building with:
    $ make ARCH=um O=../kunit.out/ --jobs=36
    ERROR:root:.../mm/memory.c: In function ‘do_swap_page’:
    .../mm/memory.c:4169:17: error: implicit declaration of function ‘count_mthp_stat’ [-Werror=implicit-function-declaration]
     4169 |                 count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_SWPIN_REFAULT);
          |                 ^~~~~~~~~~~~~~~
    .../mm/memory.c:4169:53: error: ‘MTHP_STAT_ANON_SWPIN_REFAULT’ undeclared (first use in this function)
     4169 |                 count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_SWPIN_REFAULT);
          |                                                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
    .../mm/memory.c:4169:53: note: each undeclared identifier is reported only once for each function it appears in
    cc1: some warnings being treated as errors

My kunit build config doesn't have CONFIG_TRANSPARE_HUGEPAGE.  Maybe that's the
reason and this patch, or the patch that introduced the function and the enum
need to take care of the case?


Thanks,
SJ

[...]

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 1/5] mm: swap: introduce swap_free_nr() for batched swap_free()
  2024-04-09  8:26 ` [PATCH v2 1/5] mm: swap: introduce swap_free_nr() for batched swap_free() Barry Song
@ 2024-04-10 23:37   ` SeongJae Park
  2024-04-11  1:27     ` Barry Song
  2024-04-11 14:30   ` Ryan Roberts
  2024-04-15  6:17   ` Huang, Ying
  2 siblings, 1 reply; 54+ messages in thread
From: SeongJae Park @ 2024-04-10 23:37 UTC (permalink / raw)
  To: Barry Song
  Cc: SeongJae Park, akpm, linux-mm, baolin.wang, chrisl, david,
	hanchuanhua, hannes, hughd, kasong, ryan.roberts, surenb,
	v-songbaohua, willy, xiang, ying.huang, yosryahmed, yuzhao, ziy,
	linux-kernel

Hi Barry,

On Tue,  9 Apr 2024 20:26:27 +1200 Barry Song <21cnbao@gmail.com> wrote:

> From: Chuanhua Han <hanchuanhua@oppo.com>
> 
> While swapping in a large folio, we need to free swaps related to the whole
> folio. To avoid frequently acquiring and releasing swap locks, it is better
> to introduce an API for batched free.
> 
> Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  include/linux/swap.h |  5 +++++
>  mm/swapfile.c        | 51 ++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 56 insertions(+)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 11c53692f65f..b7a107e983b8 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
[...]
> +void swap_free_nr(swp_entry_t entry, int nr_pages)
> +{
> +}

I found the latest mm-unstable fails build when CONFIG_SWAP is not set with
errors including below, and 'git bisect' points this patch.

    do_mounts.c:(.text+0x6): multiple definition of `swap_free_nr'; init/main.o:main.c:(.text+0x9c): first defined here

I think this should be defined as 'static inline'?  I confirmed adding the two
keywords as below fixes the build failure.

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 4bf5090de0fd..5fd60d733ba8 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -562,7 +562,7 @@ static inline void swap_free(swp_entry_t swp)
 {
 }

-void swap_free_nr(swp_entry_t entry, int nr_pages)
+static inline void swap_free_nr(swp_entry_t entry, int nr_pages)
 {
 }


Thanks,
SJ

[...]

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 1/5] mm: swap: introduce swap_free_nr() for batched swap_free()
  2024-04-10 23:37   ` SeongJae Park
@ 2024-04-11  1:27     ` Barry Song
  0 siblings, 0 replies; 54+ messages in thread
From: Barry Song @ 2024-04-11  1:27 UTC (permalink / raw)
  To: SeongJae Park
  Cc: akpm, linux-mm, baolin.wang, chrisl, david, hanchuanhua, hannes,
	hughd, kasong, ryan.roberts, surenb, v-songbaohua, willy, xiang,
	ying.huang, yosryahmed, yuzhao, ziy, linux-kernel

On Thu, Apr 11, 2024 at 11:38 AM SeongJae Park <sj@kernel.org> wrote:
>
> Hi Barry,
>
> On Tue,  9 Apr 2024 20:26:27 +1200 Barry Song <21cnbao@gmail.com> wrote:
>
> > From: Chuanhua Han <hanchuanhua@oppo.com>
> >
> > While swapping in a large folio, we need to free swaps related to the whole
> > folio. To avoid frequently acquiring and releasing swap locks, it is better
> > to introduce an API for batched free.
> >
> > Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> > Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > ---
> >  include/linux/swap.h |  5 +++++
> >  mm/swapfile.c        | 51 ++++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 56 insertions(+)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 11c53692f65f..b7a107e983b8 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> [...]
> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
> > +{
> > +}
>
> I found the latest mm-unstable fails build when CONFIG_SWAP is not set with
> errors including below, and 'git bisect' points this patch.
>
>     do_mounts.c:(.text+0x6): multiple definition of `swap_free_nr'; init/main.o:main.c:(.text+0x9c): first defined here
>
> I think this should be defined as 'static inline'?  I confirmed adding the two
> keywords as below fixes the build failure.

definitely yes.
It's highly likely that it was an oversight. we definitely meant
static inline for !CONFIG_SWAP.
Thanks! will fix it in v3.


>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 4bf5090de0fd..5fd60d733ba8 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -562,7 +562,7 @@ static inline void swap_free(swp_entry_t swp)
>  {
>  }
>
> -void swap_free_nr(swp_entry_t entry, int nr_pages)
> +static inline void swap_free_nr(swp_entry_t entry, int nr_pages)
>  {
>  }
>
>
> Thanks,
> SJ
>
Thanks
Barry

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 5/5] mm: add per-order mTHP swpin_refault counter
  2024-04-10 23:15   ` SeongJae Park
@ 2024-04-11  1:46     ` Barry Song
  2024-04-11 16:14       ` SeongJae Park
  0 siblings, 1 reply; 54+ messages in thread
From: Barry Song @ 2024-04-11  1:46 UTC (permalink / raw)
  To: sj
  Cc: 21cnbao, akpm, baolin.wang, chrisl, david, hanchuanhua, hannes,
	hughd, kasong, linux-kernel, linux-mm, ryan.roberts, surenb,
	v-songbaohua, willy, xiang, ying.huang, yosryahmed, yuzhao, ziy

>> +		count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_SWPIN_REFAULT);
>>  	}
>> 
> From the latest mm-unstable tree, I get below kunit build failure and
> 'git bisect' points this patch.
> 
>     $ ./tools/testing/kunit/kunit.py run --build_dir ../kunit.out/
>     [16:07:40] Configuring KUnit Kernel ...
>     [16:07:40] Building KUnit Kernel ...
>     Populating config with:
>     $ make ARCH=um O=../kunit.out/ olddefconfig
>     Building with:
>     $ make ARCH=um O=../kunit.out/ --jobs=36
>     ERROR:root:.../mm/memory.c: In function ‘do_swap_page’:
>     .../mm/memory.c:4169:17: error: implicit declaration of function ‘count_mthp_stat’ [-Werror=implicit-function-declaration]
>      4169 |                 count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_SWPIN_REFAULT);
>           |                 ^~~~~~~~~~~~~~~
>     .../mm/memory.c:4169:53: error: ‘MTHP_STAT_ANON_SWPIN_REFAULT’ undeclared (first use in this function)
>      4169 |                 count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_SWPIN_REFAULT);
>           |                                                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
>     .../mm/memory.c:4169:53: note: each undeclared identifier is reported only once for each function it appears in
>     cc1: some warnings being treated as errors
> 
> My kunit build config doesn't have CONFIG_TRANSPARE_HUGEPAGE.  Maybe that's the
> reason and this patch, or the patch that introduced the function and the enum
> need to take care of the case?

Hi SeongJae,
Thanks very much, can you check if the below fix the build? If yes, I will
include this fix while sending v3.

Subject: [PATCH] mm: fix build errors on CONFIG_TRANSPARENT_HUGEPAGE=N

Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 mm/memory.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index acc023795a4d..1d587d1eb432 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4142,6 +4142,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
 			&vmf->ptl);
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	/* We hit large folios in swapcache */
 	if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio)) {
 		int nr = folio_nr_pages(folio);
@@ -4171,6 +4172,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	}
 
 check_pte:
+#endif
 	if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
 		goto out_nomap;
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 1/5] mm: swap: introduce swap_free_nr() for batched swap_free()
  2024-04-09  8:26 ` [PATCH v2 1/5] mm: swap: introduce swap_free_nr() for batched swap_free() Barry Song
  2024-04-10 23:37   ` SeongJae Park
@ 2024-04-11 14:30   ` Ryan Roberts
  2024-04-12  2:07     ` Chuanhua Han
  2024-04-15  6:17   ` Huang, Ying
  2 siblings, 1 reply; 54+ messages in thread
From: Ryan Roberts @ 2024-04-11 14:30 UTC (permalink / raw)
  To: Barry Song, akpm, linux-mm
  Cc: baolin.wang, chrisl, david, hanchuanhua, hannes, hughd, kasong,
	surenb, v-songbaohua, willy, xiang, ying.huang, yosryahmed,
	yuzhao, ziy, linux-kernel

On 09/04/2024 09:26, Barry Song wrote:
> From: Chuanhua Han <hanchuanhua@oppo.com>
> 
> While swapping in a large folio, we need to free swaps related to the whole
> folio. To avoid frequently acquiring and releasing swap locks, it is better
> to introduce an API for batched free.
> 
> Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>

Couple of nits; feel free to ignore.

Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>

> ---
>  include/linux/swap.h |  5 +++++
>  mm/swapfile.c        | 51 ++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 56 insertions(+)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 11c53692f65f..b7a107e983b8 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -483,6 +483,7 @@ extern void swap_shmem_alloc(swp_entry_t);
>  extern int swap_duplicate(swp_entry_t);
>  extern int swapcache_prepare(swp_entry_t);
>  extern void swap_free(swp_entry_t);
> +extern void swap_free_nr(swp_entry_t entry, int nr_pages);
>  extern void swapcache_free_entries(swp_entry_t *entries, int n);
>  extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
>  int swap_type_of(dev_t device, sector_t offset);
> @@ -564,6 +565,10 @@ static inline void swap_free(swp_entry_t swp)
>  {
>  }
>  
> +void swap_free_nr(swp_entry_t entry, int nr_pages)
> +{
> +}
> +
>  static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
>  {
>  }
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 28642c188c93..f4c65aeb088d 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1356,6 +1356,57 @@ void swap_free(swp_entry_t entry)
>  		__swap_entry_free(p, entry);
>  }
>  
> +/*
> + * Free up the maximum number of swap entries at once to limit the
> + * maximum kernel stack usage.
> + */
> +#define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 512 ? 512 : SWAPFILE_CLUSTER)
> +
> +/*
> + * Called after swapping in a large folio, batched free swap entries
> + * for this large folio, entry should be for the first subpage and
> + * its offset is aligned with nr_pages
> + */
> +void swap_free_nr(swp_entry_t entry, int nr_pages)
> +{
> +	int i, j;
> +	struct swap_cluster_info *ci;
> +	struct swap_info_struct *p;
> +	unsigned int type = swp_type(entry);
> +	unsigned long offset = swp_offset(entry);
> +	int batch_nr, remain_nr;
> +	DECLARE_BITMAP(usage, SWAP_BATCH_NR) = { 0 };
> +
> +	/* all swap entries are within a cluster for mTHP */
> +	VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
> +
> +	if (nr_pages == 1) {
> +		swap_free(entry);
> +		return;
> +	}
> +
> +	remain_nr = nr_pages;
> +	p = _swap_info_get(entry);
> +	if (p) {

nit: perhaps return early if (!p) ? Then you dedent the for() block.

> +		for (i = 0; i < nr_pages; i += batch_nr) {
> +			batch_nr = min_t(int, SWAP_BATCH_NR, remain_nr);
> +
> +			ci = lock_cluster_or_swap_info(p, offset);
> +			for (j = 0; j < batch_nr; j++) {
> +				if (__swap_entry_free_locked(p, offset + i * SWAP_BATCH_NR + j, 1))
> +					__bitmap_set(usage, j, 1);
> +			}
> +			unlock_cluster_or_swap_info(p, ci);
> +
> +			for_each_clear_bit(j, usage, batch_nr)
> +				free_swap_slot(swp_entry(type, offset + i * SWAP_BATCH_NR + j));
> +

nit: perhaps change to for (;;), and do the checks here to avoid clearing the
bitmap on the last run:

			i += batch_nr;
			if (i < nr_pages)
				break;

> +			bitmap_clear(usage, 0, SWAP_BATCH_NR);
> +			remain_nr -= batch_nr;
> +		}
> +	}
> +}
> +
>  /*
>   * Called after dropping swapcache to decrease refcnt to swap entries.
>   */


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 3/5] mm: swap_pte_batch: add an output argument to reture if all swap entries are exclusive
  2024-04-09  8:26 ` [PATCH v2 3/5] mm: swap_pte_batch: add an output argument to reture if all swap entries are exclusive Barry Song
@ 2024-04-11 14:54   ` Ryan Roberts
  2024-04-11 15:00     ` David Hildenbrand
  0 siblings, 1 reply; 54+ messages in thread
From: Ryan Roberts @ 2024-04-11 14:54 UTC (permalink / raw)
  To: Barry Song, akpm, linux-mm
  Cc: baolin.wang, chrisl, david, hanchuanhua, hannes, hughd, kasong,
	surenb, v-songbaohua, willy, xiang, ying.huang, yosryahmed,
	yuzhao, ziy, linux-kernel

On 09/04/2024 09:26, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
> 
> Add a boolean argument named any_shared. If any of the swap entries are
> non-exclusive, set any_shared to true. The function do_swap_page() can
> then utilize this information to determine whether the entire large
> folio can be reused.
> 
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  mm/internal.h | 9 ++++++++-
>  mm/madvise.c  | 2 +-
>  mm/memory.c   | 2 +-
>  3 files changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/internal.h b/mm/internal.h
> index 9d3250b4a08a..cae39c372bfc 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -238,7 +238,8 @@ static inline pte_t pte_next_swp_offset(pte_t pte)
>   *
>   * Return: the number of table entries in the batch.
>   */
> -static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte)
> +static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte,
> +				bool *any_shared)

Please update the docs in the comment above this for the new param; follow
folio_pte_batch()'s docs as a template.

>  {
>  	pte_t expected_pte = pte_next_swp_offset(pte);
>  	const pte_t *end_ptep = start_ptep + max_nr;
> @@ -248,12 +249,18 @@ static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte)
>  	VM_WARN_ON(!is_swap_pte(pte));
>  	VM_WARN_ON(non_swap_entry(pte_to_swp_entry(pte)));
>  
> +	if (any_shared)
> +		*any_shared |= !pte_swp_exclusive(pte);

This is different from the approach in folio_pte_batch(). It inits *any_shared
to false and does NOT include the value of the first pte. I think that's odd,
personally and I prefer your approach. I'm not sure if there was a good reason
that David chose the other approach? Regardless, I think both functions should
follow the same pattern here.

If sticking with your approach, why is this initial flag being ORed? Surely it
should just be initialized to get rid of any previous guff?

Thanks,
Ryan


> +
>  	while (ptep < end_ptep) {
>  		pte = ptep_get(ptep);
>  
>  		if (!pte_same(pte, expected_pte))
>  			break;
>  
> +		if (any_shared)
> +			*any_shared |= !pte_swp_exclusive(pte);
> +
>  		expected_pte = pte_next_swp_offset(expected_pte);
>  		ptep++;
>  	}
> diff --git a/mm/madvise.c b/mm/madvise.c
> index f59169888b8e..d34ca6983227 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -671,7 +671,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>  			entry = pte_to_swp_entry(ptent);
>  			if (!non_swap_entry(entry)) {
>  				max_nr = (end - addr) / PAGE_SIZE;
> -				nr = swap_pte_batch(pte, max_nr, ptent);
> +				nr = swap_pte_batch(pte, max_nr, ptent, NULL);
>  				nr_swap -= nr;
>  				free_swap_and_cache_nr(entry, nr);
>  				clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
> diff --git a/mm/memory.c b/mm/memory.c
> index 2702d449880e..c4a52e8d740a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1638,7 +1638,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>  			folio_put(folio);
>  		} else if (!non_swap_entry(entry)) {
>  			max_nr = (end - addr) / PAGE_SIZE;
> -			nr = swap_pte_batch(pte, max_nr, ptent);
> +			nr = swap_pte_batch(pte, max_nr, ptent, NULL);
>  			/* Genuine swap entries, hence a private anon pages */
>  			if (!should_zap_cows(details))
>  				continue;


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 3/5] mm: swap_pte_batch: add an output argument to reture if all swap entries are exclusive
  2024-04-11 14:54   ` Ryan Roberts
@ 2024-04-11 15:00     ` David Hildenbrand
  2024-04-11 15:36       ` Ryan Roberts
  0 siblings, 1 reply; 54+ messages in thread
From: David Hildenbrand @ 2024-04-11 15:00 UTC (permalink / raw)
  To: Ryan Roberts, Barry Song, akpm, linux-mm
  Cc: baolin.wang, chrisl, hanchuanhua, hannes, hughd, kasong, surenb,
	v-songbaohua, willy, xiang, ying.huang, yosryahmed, yuzhao, ziy,
	linux-kernel

On 11.04.24 16:54, Ryan Roberts wrote:
> On 09/04/2024 09:26, Barry Song wrote:
>> From: Barry Song <v-songbaohua@oppo.com>
>>
>> Add a boolean argument named any_shared. If any of the swap entries are
>> non-exclusive, set any_shared to true. The function do_swap_page() can
>> then utilize this information to determine whether the entire large
>> folio can be reused.
>>
>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>> ---
>>   mm/internal.h | 9 ++++++++-
>>   mm/madvise.c  | 2 +-
>>   mm/memory.c   | 2 +-
>>   3 files changed, 10 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/internal.h b/mm/internal.h
>> index 9d3250b4a08a..cae39c372bfc 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -238,7 +238,8 @@ static inline pte_t pte_next_swp_offset(pte_t pte)
>>    *
>>    * Return: the number of table entries in the batch.
>>    */
>> -static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte)
>> +static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte,
>> +				bool *any_shared)
> 
> Please update the docs in the comment above this for the new param; follow
> folio_pte_batch()'s docs as a template.
> 
>>   {
>>   	pte_t expected_pte = pte_next_swp_offset(pte);
>>   	const pte_t *end_ptep = start_ptep + max_nr;
>> @@ -248,12 +249,18 @@ static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte)
>>   	VM_WARN_ON(!is_swap_pte(pte));
>>   	VM_WARN_ON(non_swap_entry(pte_to_swp_entry(pte)));
>>   
>> +	if (any_shared)
>> +		*any_shared |= !pte_swp_exclusive(pte);
> 
> This is different from the approach in folio_pte_batch(). It inits *any_shared
> to false and does NOT include the value of the first pte. I think that's odd,
> personally and I prefer your approach. I'm not sure if there was a good reason
> that David chose the other approach?

Because in my case calling code does

nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr, flags,
		     &any_writable);

...

if (any_writable)
	pte = pte_mkwrite(pte, src_vma);

...

and later checks in another function pte_write().

So if the common pattern is that the original PTE will be used for 
checks, then it doesn't make sense to unnecessary checks+setting for the 
first PTE.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 4/5] mm: swap: entirely map large folios found in swapcache
  2024-04-09  8:26 ` [PATCH v2 4/5] mm: swap: entirely map large folios found in swapcache Barry Song
@ 2024-04-11 15:33   ` Ryan Roberts
  2024-04-11 23:30     ` Barry Song
  2024-04-15  8:37   ` Huang, Ying
  1 sibling, 1 reply; 54+ messages in thread
From: Ryan Roberts @ 2024-04-11 15:33 UTC (permalink / raw)
  To: Barry Song, akpm, linux-mm
  Cc: baolin.wang, chrisl, david, hanchuanhua, hannes, hughd, kasong,
	surenb, v-songbaohua, willy, xiang, ying.huang, yosryahmed,
	yuzhao, ziy, linux-kernel

On 09/04/2024 09:26, Barry Song wrote:
> From: Chuanhua Han <hanchuanhua@oppo.com>
> 
> When a large folio is found in the swapcache, the current implementation
> requires calling do_swap_page() nr_pages times, resulting in nr_pages
> page faults. This patch opts to map the entire large folio at once to
> minimize page faults. Additionally, redundant checks and early exits
> for ARM64 MTE restoring are removed.
> 
> Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  mm/memory.c | 64 +++++++++++++++++++++++++++++++++++++++++++----------
>  1 file changed, 52 insertions(+), 12 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index c4a52e8d740a..9818dc1893c8 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3947,6 +3947,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	pte_t pte;
>  	vm_fault_t ret = 0;
>  	void *shadow = NULL;
> +	int nr_pages = 1;
> +	unsigned long start_address = vmf->address;
> +	pte_t *start_pte = vmf->pte;

possible bug?: there are code paths that assign to vmf-pte below in this
function, so couldn't start_pte be stale in some cases? I'd just do the
assignment (all 4 of these variables in fact) in an else clause below, after any
messing about with them is complete.

nit: rename start_pte -> start_ptep ?

> +	bool any_swap_shared = false;

Suggest you defer initialization of this to your "We hit large folios in
swapcache" block below, and init it to:

	any_swap_shared = !pte_swp_exclusive(vmf->pte);

Then the any_shared semantic in swap_pte_batch() can be the same as for
folio_pte_batch().

>  
>  	if (!pte_unmap_same(vmf))
>  		goto out;
> @@ -4137,6 +4141,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	 */
>  	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
>  			&vmf->ptl);

bug: vmf->pte may be NULL and you are not checking it until check_pte:. Byt you
are using it in this block. It also seems odd to do all the work in the below
block under the PTL but before checking if the pte has changed. Suggest moving
both checks here.

> +
> +	/* We hit large folios in swapcache */
> +	if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio)) {

What's the start_pte check protecting?

> +		int nr = folio_nr_pages(folio);
> +		int idx = folio_page_idx(folio, page);
> +		unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
> +		unsigned long folio_end = folio_start + nr * PAGE_SIZE;
> +		pte_t *folio_ptep;
> +		pte_t folio_pte;
> +
> +		if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
> +			goto check_pte;
> +		if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
> +			goto check_pte;
> +
> +		folio_ptep = vmf->pte - idx;
> +		folio_pte = ptep_get(folio_ptep);
> +		if (!is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte)) ||
> +		    swap_pte_batch(folio_ptep, nr, folio_pte, &any_swap_shared) != nr)
> +			goto check_pte;
> +
> +		start_address = folio_start;
> +		start_pte = folio_ptep;
> +		nr_pages = nr;
> +		entry = folio->swap;
> +		page = &folio->page;
> +	}
> +
> +check_pte:
>  	if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
>  		goto out_nomap;
>  
> @@ -4190,6 +4223,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  			 */
>  			exclusive = false;
>  		}
> +
> +		/* Reuse the whole large folio iff all entries are exclusive */
> +		if (nr_pages > 1 && any_swap_shared)
> +			exclusive = false;

If you init any_shared with the firt pte as I suggested then you could just set
exclusive = !any_shared at the top of this if block without needing this
separate fixup.
>  	}
>  
>  	/*
> @@ -4204,12 +4241,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	 * We're already holding a reference on the page but haven't mapped it
>  	 * yet.
>  	 */
> -	swap_free(entry);
> +	swap_free_nr(entry, nr_pages);
>  	if (should_try_to_free_swap(folio, vma, vmf->flags))
>  		folio_free_swap(folio);
>  
> -	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> -	dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
> +	folio_ref_add(folio, nr_pages - 1);
> +	add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> +	add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> +
>  	pte = mk_pte(page, vma->vm_page_prot);
>  
>  	/*
> @@ -4219,33 +4258,34 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	 * exclusivity.
>  	 */
>  	if (!folio_test_ksm(folio) &&
> -	    (exclusive || folio_ref_count(folio) == 1)) {
> +	    (exclusive || (folio_ref_count(folio) == nr_pages &&
> +			   folio_nr_pages(folio) == nr_pages))) {
>  		if (vmf->flags & FAULT_FLAG_WRITE) {
>  			pte = maybe_mkwrite(pte_mkdirty(pte), vma);
>  			vmf->flags &= ~FAULT_FLAG_WRITE;
>  		}
>  		rmap_flags |= RMAP_EXCLUSIVE;
>  	}
> -	flush_icache_page(vma, page);
> +	flush_icache_pages(vma, page, nr_pages);
>  	if (pte_swp_soft_dirty(vmf->orig_pte))
>  		pte = pte_mksoft_dirty(pte);
>  	if (pte_swp_uffd_wp(vmf->orig_pte))
>  		pte = pte_mkuffd_wp(pte);

I'm not sure about all this... you are smearing these SW bits from the faulting
PTE across all the ptes you are mapping. Although I guess actually that's ok
because swap_pte_batch() only returns a batch with all these bits the same?

> -	vmf->orig_pte = pte;

Instead of doing a readback below, perhaps:

	vmf->orig_pte = pte_advance_pfn(pte, nr_pages);

>  
>  	/* ksm created a completely new copy */
>  	if (unlikely(folio != swapcache && swapcache)) {
> -		folio_add_new_anon_rmap(folio, vma, vmf->address);
> +		folio_add_new_anon_rmap(folio, vma, start_address);
>  		folio_add_lru_vma(folio, vma);
>  	} else {
> -		folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
> -					rmap_flags);
> +		folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, start_address,
> +					 rmap_flags);
>  	}
>  
>  	VM_BUG_ON(!folio_test_anon(folio) ||
>  			(pte_write(pte) && !PageAnonExclusive(page)));
> -	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
> -	arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
> +	set_ptes(vma->vm_mm, start_address, start_pte, pte, nr_pages);
> +	vmf->orig_pte = ptep_get(vmf->pte);
> +	arch_do_swap_page(vma->vm_mm, vma, start_address, pte, pte);
>  
>  	folio_unlock(folio);
>  	if (folio != swapcache && swapcache) {
> @@ -4269,7 +4309,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	}
>  
>  	/* No need to invalidate - it was non-present before */
> -	update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
> +	update_mmu_cache_range(vmf, vma, start_address, start_pte, nr_pages);
>  unlock:
>  	if (vmf->pte)
>  		pte_unmap_unlock(vmf->pte, vmf->ptl);


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 3/5] mm: swap_pte_batch: add an output argument to reture if all swap entries are exclusive
  2024-04-11 15:00     ` David Hildenbrand
@ 2024-04-11 15:36       ` Ryan Roberts
  0 siblings, 0 replies; 54+ messages in thread
From: Ryan Roberts @ 2024-04-11 15:36 UTC (permalink / raw)
  To: David Hildenbrand, Barry Song, akpm, linux-mm
  Cc: baolin.wang, chrisl, hanchuanhua, hannes, hughd, kasong, surenb,
	v-songbaohua, willy, xiang, ying.huang, yosryahmed, yuzhao, ziy,
	linux-kernel

On 11/04/2024 16:00, David Hildenbrand wrote:
> On 11.04.24 16:54, Ryan Roberts wrote:
>> On 09/04/2024 09:26, Barry Song wrote:
>>> From: Barry Song <v-songbaohua@oppo.com>
>>>
>>> Add a boolean argument named any_shared. If any of the swap entries are
>>> non-exclusive, set any_shared to true. The function do_swap_page() can
>>> then utilize this information to determine whether the entire large
>>> folio can be reused.
>>>
>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>>> ---
>>>   mm/internal.h | 9 ++++++++-
>>>   mm/madvise.c  | 2 +-
>>>   mm/memory.c   | 2 +-
>>>   3 files changed, 10 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/mm/internal.h b/mm/internal.h
>>> index 9d3250b4a08a..cae39c372bfc 100644
>>> --- a/mm/internal.h
>>> +++ b/mm/internal.h
>>> @@ -238,7 +238,8 @@ static inline pte_t pte_next_swp_offset(pte_t pte)
>>>    *
>>>    * Return: the number of table entries in the batch.
>>>    */
>>> -static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte)
>>> +static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte,
>>> +                bool *any_shared)
>>
>> Please update the docs in the comment above this for the new param; follow
>> folio_pte_batch()'s docs as a template.
>>
>>>   {
>>>       pte_t expected_pte = pte_next_swp_offset(pte);
>>>       const pte_t *end_ptep = start_ptep + max_nr;
>>> @@ -248,12 +249,18 @@ static inline int swap_pte_batch(pte_t *start_ptep, int
>>> max_nr, pte_t pte)
>>>       VM_WARN_ON(!is_swap_pte(pte));
>>>       VM_WARN_ON(non_swap_entry(pte_to_swp_entry(pte)));
>>>   +    if (any_shared)
>>> +        *any_shared |= !pte_swp_exclusive(pte);
>>
>> This is different from the approach in folio_pte_batch(). It inits *any_shared
>> to false and does NOT include the value of the first pte. I think that's odd,
>> personally and I prefer your approach. I'm not sure if there was a good reason
>> that David chose the other approach?
> 
> Because in my case calling code does
> 
> nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr, flags,
>              &any_writable);
> 
> ...
> 
> if (any_writable)
>     pte = pte_mkwrite(pte, src_vma);
> 
> ...
> 
> and later checks in another function pte_write().
> 
> So if the common pattern is that the original PTE will be used for checks, then
> it doesn't make sense to unnecessary checks+setting for the first PTE.

Yep understood. And I think adopting your semantics for any_shared actually
simplifies the code in patch 4 too; I've just commented that.




^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 5/5] mm: add per-order mTHP swpin_refault counter
  2024-04-09  8:26 ` [PATCH v2 5/5] mm: add per-order mTHP swpin_refault counter Barry Song
  2024-04-10 23:15   ` SeongJae Park
@ 2024-04-11 15:53   ` Ryan Roberts
  2024-04-11 23:01     ` Barry Song
  2024-04-17  0:45   ` Huang, Ying
  2 siblings, 1 reply; 54+ messages in thread
From: Ryan Roberts @ 2024-04-11 15:53 UTC (permalink / raw)
  To: Barry Song, akpm, linux-mm
  Cc: baolin.wang, chrisl, david, hanchuanhua, hannes, hughd, kasong,
	surenb, v-songbaohua, willy, xiang, ying.huang, yosryahmed,
	yuzhao, ziy, linux-kernel

On 09/04/2024 09:26, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
> 
> Currently, we are handling the scenario where we've hit a
> large folio in the swapcache, and the reclaiming process
> for this large folio is still ongoing.
> 
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  include/linux/huge_mm.h | 1 +
>  mm/huge_memory.c        | 2 ++
>  mm/memory.c             | 1 +
>  3 files changed, 4 insertions(+)
> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index c8256af83e33..b67294d5814f 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -269,6 +269,7 @@ enum mthp_stat_item {
>  	MTHP_STAT_ANON_ALLOC_FALLBACK,
>  	MTHP_STAT_ANON_SWPOUT,
>  	MTHP_STAT_ANON_SWPOUT_FALLBACK,
> +	MTHP_STAT_ANON_SWPIN_REFAULT,

I don't see any equivalent counter for small folios. Is there an analogue?

>  	__MTHP_STAT_COUNT
>  };
>  
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index d8d2ed80b0bf..fb95345b0bde 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -556,12 +556,14 @@ DEFINE_MTHP_STAT_ATTR(anon_alloc, MTHP_STAT_ANON_ALLOC);
>  DEFINE_MTHP_STAT_ATTR(anon_alloc_fallback, MTHP_STAT_ANON_ALLOC_FALLBACK);
>  DEFINE_MTHP_STAT_ATTR(anon_swpout, MTHP_STAT_ANON_SWPOUT);
>  DEFINE_MTHP_STAT_ATTR(anon_swpout_fallback, MTHP_STAT_ANON_SWPOUT_FALLBACK);
> +DEFINE_MTHP_STAT_ATTR(anon_swpin_refault, MTHP_STAT_ANON_SWPIN_REFAULT);
>  
>  static struct attribute *stats_attrs[] = {
>  	&anon_alloc_attr.attr,
>  	&anon_alloc_fallback_attr.attr,
>  	&anon_swpout_attr.attr,
>  	&anon_swpout_fallback_attr.attr,
> +	&anon_swpin_refault_attr.attr,
>  	NULL,
>  };
>  
> diff --git a/mm/memory.c b/mm/memory.c
> index 9818dc1893c8..acc023795a4d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4167,6 +4167,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  		nr_pages = nr;
>  		entry = folio->swap;
>  		page = &folio->page;
> +		count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_SWPIN_REFAULT);

I don't think this is the point of no return yet? There's the pte_same() check
immediately below (although I've suggested that needs to be moved to earlier),
but also the folio_test_uptodate() check. Perhaps this should go after that?

>  	}
>  
>  check_pte:


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 5/5] mm: add per-order mTHP swpin_refault counter
  2024-04-11  1:46     ` Barry Song
@ 2024-04-11 16:14       ` SeongJae Park
  0 siblings, 0 replies; 54+ messages in thread
From: SeongJae Park @ 2024-04-11 16:14 UTC (permalink / raw)
  To: Barry Song
  Cc: SeongJae Park, akpm, baolin.wang, chrisl, david, hanchuanhua,
	hannes, hughd, kasong, linux-kernel, linux-mm, ryan.roberts,
	surenb, v-songbaohua, willy, xiang, ying.huang, yosryahmed,
	yuzhao, ziy

Hi Barry,

On Thu, 11 Apr 2024 13:46:36 +1200 Barry Song <21cnbao@gmail.com> wrote:

> >> +		count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_SWPIN_REFAULT);
> >>  	}
> >> 
> > From the latest mm-unstable tree, I get below kunit build failure and
> > 'git bisect' points this patch.
> > 
> >     $ ./tools/testing/kunit/kunit.py run --build_dir ../kunit.out/
> >     [16:07:40] Configuring KUnit Kernel ...
> >     [16:07:40] Building KUnit Kernel ...
> >     Populating config with:
> >     $ make ARCH=um O=../kunit.out/ olddefconfig
> >     Building with:
> >     $ make ARCH=um O=../kunit.out/ --jobs=36
> >     ERROR:root:.../mm/memory.c: In function ‘do_swap_page’:
> >     .../mm/memory.c:4169:17: error: implicit declaration of function ‘count_mthp_stat’ [-Werror=implicit-function-declaration]
> >      4169 |                 count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_SWPIN_REFAULT);
> >           |                 ^~~~~~~~~~~~~~~
> >     .../mm/memory.c:4169:53: error: ‘MTHP_STAT_ANON_SWPIN_REFAULT’ undeclared (first use in this function)
> >      4169 |                 count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_SWPIN_REFAULT);
> >           |                                                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >     .../mm/memory.c:4169:53: note: each undeclared identifier is reported only once for each function it appears in
> >     cc1: some warnings being treated as errors
> > 
> > My kunit build config doesn't have CONFIG_TRANSPARE_HUGEPAGE.  Maybe that's the
> > reason and this patch, or the patch that introduced the function and the enum
> > need to take care of the case?
> 
> Hi SeongJae,
> Thanks very much, can you check if the below fix the build? If yes, I will
> include this fix while sending v3.

Thank you for quick and kind reply :) I confirmed this fixes the build failure.

> 
> Subject: [PATCH] mm: fix build errors on CONFIG_TRANSPARENT_HUGEPAGE=N
> 
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Tested-by: SeongJae Park <sj@kernel.org>


Thanks,
SJ

> ---
>  mm/memory.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index acc023795a4d..1d587d1eb432 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4142,6 +4142,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
>  			&vmf->ptl);
>  
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  	/* We hit large folios in swapcache */
>  	if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio)) {
>  		int nr = folio_nr_pages(folio);
> @@ -4171,6 +4172,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	}
>  
>  check_pte:
> +#endif
>  	if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
>  		goto out_nomap;
>  
> -- 
> 2.34.1
> 
> 

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 5/5] mm: add per-order mTHP swpin_refault counter
  2024-04-11 15:53   ` Ryan Roberts
@ 2024-04-11 23:01     ` Barry Song
  0 siblings, 0 replies; 54+ messages in thread
From: Barry Song @ 2024-04-11 23:01 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: akpm, linux-mm, baolin.wang, chrisl, david, hanchuanhua, hannes,
	hughd, kasong, surenb, v-songbaohua, willy, xiang, ying.huang,
	yosryahmed, yuzhao, ziy, linux-kernel

On Fri, Apr 12, 2024 at 3:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 09/04/2024 09:26, Barry Song wrote:
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > Currently, we are handling the scenario where we've hit a
> > large folio in the swapcache, and the reclaiming process
> > for this large folio is still ongoing.
> >
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > ---
> >  include/linux/huge_mm.h | 1 +
> >  mm/huge_memory.c        | 2 ++
> >  mm/memory.c             | 1 +
> >  3 files changed, 4 insertions(+)
> >
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index c8256af83e33..b67294d5814f 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -269,6 +269,7 @@ enum mthp_stat_item {
> >       MTHP_STAT_ANON_ALLOC_FALLBACK,
> >       MTHP_STAT_ANON_SWPOUT,
> >       MTHP_STAT_ANON_SWPOUT_FALLBACK,
> > +     MTHP_STAT_ANON_SWPIN_REFAULT,
>
> I don't see any equivalent counter for small folios. Is there an analogue?

Indeed, we don't count refaults for small folios, as their refault
mechanism is much
simpler compared to large folios. Implementing this counter can enhance the
system's visibility to users.

Personally, having this counter and observing a non-zero value greatly enhances
my confidence when debugging this refault series. Otherwise, it feels like being
blind to what's happening inside the system :-)

>
> >       __MTHP_STAT_COUNT
> >  };
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index d8d2ed80b0bf..fb95345b0bde 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -556,12 +556,14 @@ DEFINE_MTHP_STAT_ATTR(anon_alloc, MTHP_STAT_ANON_ALLOC);
> >  DEFINE_MTHP_STAT_ATTR(anon_alloc_fallback, MTHP_STAT_ANON_ALLOC_FALLBACK);
> >  DEFINE_MTHP_STAT_ATTR(anon_swpout, MTHP_STAT_ANON_SWPOUT);
> >  DEFINE_MTHP_STAT_ATTR(anon_swpout_fallback, MTHP_STAT_ANON_SWPOUT_FALLBACK);
> > +DEFINE_MTHP_STAT_ATTR(anon_swpin_refault, MTHP_STAT_ANON_SWPIN_REFAULT);
> >
> >  static struct attribute *stats_attrs[] = {
> >       &anon_alloc_attr.attr,
> >       &anon_alloc_fallback_attr.attr,
> >       &anon_swpout_attr.attr,
> >       &anon_swpout_fallback_attr.attr,
> > +     &anon_swpin_refault_attr.attr,
> >       NULL,
> >  };
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 9818dc1893c8..acc023795a4d 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4167,6 +4167,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >               nr_pages = nr;
> >               entry = folio->swap;
> >               page = &folio->page;
> > +             count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_SWPIN_REFAULT);
>
> I don't think this is the point of no return yet? There's the pte_same() check
> immediately below (although I've suggested that needs to be moved to earlier),
> but also the folio_test_uptodate() check. Perhaps this should go after that?
>

swap_pte_batch() == nr_pages should have passed the test for pte_same.
folio_test_uptodate(folio)) should be also unlikely to be true as we are
not reading from swap devices for refault case.

but i agree we can move all the refault handling after those two "goto
out_nomap".

> >       }
> >
> >  check_pte:
>

Thanks
Barry

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 4/5] mm: swap: entirely map large folios found in swapcache
  2024-04-11 15:33   ` Ryan Roberts
@ 2024-04-11 23:30     ` Barry Song
  2024-04-12 11:31       ` Ryan Roberts
  0 siblings, 1 reply; 54+ messages in thread
From: Barry Song @ 2024-04-11 23:30 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: akpm, linux-mm, baolin.wang, chrisl, david, hanchuanhua, hannes,
	hughd, kasong, surenb, v-songbaohua, willy, xiang, ying.huang,
	yosryahmed, yuzhao, ziy, linux-kernel

On Fri, Apr 12, 2024 at 3:33 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 09/04/2024 09:26, Barry Song wrote:
> > From: Chuanhua Han <hanchuanhua@oppo.com>
> >
> > When a large folio is found in the swapcache, the current implementation
> > requires calling do_swap_page() nr_pages times, resulting in nr_pages
> > page faults. This patch opts to map the entire large folio at once to
> > minimize page faults. Additionally, redundant checks and early exits
> > for ARM64 MTE restoring are removed.
> >
> > Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> > Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > ---
> >  mm/memory.c | 64 +++++++++++++++++++++++++++++++++++++++++++----------
> >  1 file changed, 52 insertions(+), 12 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index c4a52e8d740a..9818dc1893c8 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3947,6 +3947,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >       pte_t pte;
> >       vm_fault_t ret = 0;
> >       void *shadow = NULL;
> > +     int nr_pages = 1;
> > +     unsigned long start_address = vmf->address;
> > +     pte_t *start_pte = vmf->pte;
>
> possible bug?: there are code paths that assign to vmf-pte below in this
> function, so couldn't start_pte be stale in some cases? I'd just do the
> assignment (all 4 of these variables in fact) in an else clause below, after any
> messing about with them is complete.
>
> nit: rename start_pte -> start_ptep ?

Agreed.

>
> > +     bool any_swap_shared = false;
>
> Suggest you defer initialization of this to your "We hit large folios in
> swapcache" block below, and init it to:
>
>         any_swap_shared = !pte_swp_exclusive(vmf->pte);
>
> Then the any_shared semantic in swap_pte_batch() can be the same as for
> folio_pte_batch().
>
> >
> >       if (!pte_unmap_same(vmf))
> >               goto out;
> > @@ -4137,6 +4141,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >        */
> >       vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> >                       &vmf->ptl);
>
> bug: vmf->pte may be NULL and you are not checking it until check_pte:. Byt you
> are using it in this block. It also seems odd to do all the work in the below
> block under the PTL but before checking if the pte has changed. Suggest moving
> both checks here.

agreed.

>
> > +
> > +     /* We hit large folios in swapcache */
> > +     if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio)) {
>
> What's the start_pte check protecting?

This is exactly protecting the case vmf->pte==NULL but for some reason it was
assigned in the beginning of the function incorrectly. The intention of the code
was actually doing start_pte = vmf->pte after "vmf->pte = pte_offset_map_lock".

>
> > +             int nr = folio_nr_pages(folio);
> > +             int idx = folio_page_idx(folio, page);
> > +             unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
> > +             unsigned long folio_end = folio_start + nr * PAGE_SIZE;
> > +             pte_t *folio_ptep;
> > +             pte_t folio_pte;
> > +
> > +             if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
> > +                     goto check_pte;
> > +             if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
> > +                     goto check_pte;
> > +
> > +             folio_ptep = vmf->pte - idx;
> > +             folio_pte = ptep_get(folio_ptep);
> > +             if (!is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte)) ||
> > +                 swap_pte_batch(folio_ptep, nr, folio_pte, &any_swap_shared) != nr)
> > +                     goto check_pte;
> > +
> > +             start_address = folio_start;
> > +             start_pte = folio_ptep;
> > +             nr_pages = nr;
> > +             entry = folio->swap;
> > +             page = &folio->page;
> > +     }
> > +
> > +check_pte:
> >       if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
> >               goto out_nomap;
> >
> > @@ -4190,6 +4223,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >                        */
> >                       exclusive = false;
> >               }
> > +
> > +             /* Reuse the whole large folio iff all entries are exclusive */
> > +             if (nr_pages > 1 && any_swap_shared)
> > +                     exclusive = false;
>
> If you init any_shared with the firt pte as I suggested then you could just set
> exclusive = !any_shared at the top of this if block without needing this
> separate fixup.

Since your swap_pte_batch() function checks that all PTEs have the same
exclusive bits, I'll be removing any_shared first in version 3 per David's
suggestions. We could potentially develop "any_shared" as an incremental
patchset later on .

> >       }
> >
> >       /*
> > @@ -4204,12 +4241,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >        * We're already holding a reference on the page but haven't mapped it
> >        * yet.
> >        */
> > -     swap_free(entry);
> > +     swap_free_nr(entry, nr_pages);
> >       if (should_try_to_free_swap(folio, vma, vmf->flags))
> >               folio_free_swap(folio);
> >
> > -     inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> > -     dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
> > +     folio_ref_add(folio, nr_pages - 1);
> > +     add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> > +     add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> > +
> >       pte = mk_pte(page, vma->vm_page_prot);
> >
> >       /*
> > @@ -4219,33 +4258,34 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >        * exclusivity.
> >        */
> >       if (!folio_test_ksm(folio) &&
> > -         (exclusive || folio_ref_count(folio) == 1)) {
> > +         (exclusive || (folio_ref_count(folio) == nr_pages &&
> > +                        folio_nr_pages(folio) == nr_pages))) {
> >               if (vmf->flags & FAULT_FLAG_WRITE) {
> >                       pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> >                       vmf->flags &= ~FAULT_FLAG_WRITE;
> >               }
> >               rmap_flags |= RMAP_EXCLUSIVE;
> >       }
> > -     flush_icache_page(vma, page);
> > +     flush_icache_pages(vma, page, nr_pages);
> >       if (pte_swp_soft_dirty(vmf->orig_pte))
> >               pte = pte_mksoft_dirty(pte);
> >       if (pte_swp_uffd_wp(vmf->orig_pte))
> >               pte = pte_mkuffd_wp(pte);
>
> I'm not sure about all this... you are smearing these SW bits from the faulting
> PTE across all the ptes you are mapping. Although I guess actually that's ok
> because swap_pte_batch() only returns a batch with all these bits the same?

Initially, I didn't recognize the issue at all because the tested
architecture arm64
didn't include these bits. However, after reviewing your latest swpout series,
which verifies the consistent bits for soft_dirty and uffd_wp, I now
feel  its safety
even for platforms with these bits.

>
> > -     vmf->orig_pte = pte;
>
> Instead of doing a readback below, perhaps:
>
>         vmf->orig_pte = pte_advance_pfn(pte, nr_pages);

Nice !

>
> >
> >       /* ksm created a completely new copy */
> >       if (unlikely(folio != swapcache && swapcache)) {
> > -             folio_add_new_anon_rmap(folio, vma, vmf->address);
> > +             folio_add_new_anon_rmap(folio, vma, start_address);
> >               folio_add_lru_vma(folio, vma);
> >       } else {
> > -             folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
> > -                                     rmap_flags);
> > +             folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, start_address,
> > +                                      rmap_flags);
> >       }
> >
> >       VM_BUG_ON(!folio_test_anon(folio) ||
> >                       (pte_write(pte) && !PageAnonExclusive(page)));
> > -     set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
> > -     arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
> > +     set_ptes(vma->vm_mm, start_address, start_pte, pte, nr_pages);
> > +     vmf->orig_pte = ptep_get(vmf->pte);
> > +     arch_do_swap_page(vma->vm_mm, vma, start_address, pte, pte);
> >
> >       folio_unlock(folio);
> >       if (folio != swapcache && swapcache) {
> > @@ -4269,7 +4309,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >       }
> >
> >       /* No need to invalidate - it was non-present before */
> > -     update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
> > +     update_mmu_cache_range(vmf, vma, start_address, start_pte, nr_pages);
> >  unlock:
> >       if (vmf->pte)
> >               pte_unmap_unlock(vmf->pte, vmf->ptl);
>

Thanks
Barry

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 1/5] mm: swap: introduce swap_free_nr() for batched swap_free()
  2024-04-11 14:30   ` Ryan Roberts
@ 2024-04-12  2:07     ` Chuanhua Han
  2024-04-12 11:28       ` Ryan Roberts
  0 siblings, 1 reply; 54+ messages in thread
From: Chuanhua Han @ 2024-04-12  2:07 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Barry Song, akpm, linux-mm, baolin.wang, chrisl, david,
	hanchuanhua, hannes, hughd, kasong, surenb, v-songbaohua, willy,
	xiang, ying.huang, yosryahmed, yuzhao, ziy, linux-kernel

Ryan Roberts <ryan.roberts@arm.com> 于2024年4月11日周四 22:30写道:
>
> On 09/04/2024 09:26, Barry Song wrote:
> > From: Chuanhua Han <hanchuanhua@oppo.com>
> >
> > While swapping in a large folio, we need to free swaps related to the whole
> > folio. To avoid frequently acquiring and releasing swap locks, it is better
> > to introduce an API for batched free.
> >
> > Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> > Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>
> Couple of nits; feel free to ignore.
>
> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
>
> > ---
> >  include/linux/swap.h |  5 +++++
> >  mm/swapfile.c        | 51 ++++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 56 insertions(+)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 11c53692f65f..b7a107e983b8 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -483,6 +483,7 @@ extern void swap_shmem_alloc(swp_entry_t);
> >  extern int swap_duplicate(swp_entry_t);
> >  extern int swapcache_prepare(swp_entry_t);
> >  extern void swap_free(swp_entry_t);
> > +extern void swap_free_nr(swp_entry_t entry, int nr_pages);
> >  extern void swapcache_free_entries(swp_entry_t *entries, int n);
> >  extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
> >  int swap_type_of(dev_t device, sector_t offset);
> > @@ -564,6 +565,10 @@ static inline void swap_free(swp_entry_t swp)
> >  {
> >  }
> >
> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
> > +{
> > +}
> > +
> >  static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
> >  {
> >  }
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 28642c188c93..f4c65aeb088d 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -1356,6 +1356,57 @@ void swap_free(swp_entry_t entry)
> >               __swap_entry_free(p, entry);
> >  }
> >
> > +/*
> > + * Free up the maximum number of swap entries at once to limit the
> > + * maximum kernel stack usage.
> > + */
> > +#define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 512 ? 512 : SWAPFILE_CLUSTER)
> > +
> > +/*
> > + * Called after swapping in a large folio, batched free swap entries
> > + * for this large folio, entry should be for the first subpage and
> > + * its offset is aligned with nr_pages
> > + */
> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
> > +{
> > +     int i, j;
> > +     struct swap_cluster_info *ci;
> > +     struct swap_info_struct *p;
> > +     unsigned int type = swp_type(entry);
> > +     unsigned long offset = swp_offset(entry);
> > +     int batch_nr, remain_nr;
> > +     DECLARE_BITMAP(usage, SWAP_BATCH_NR) = { 0 };
> > +
> > +     /* all swap entries are within a cluster for mTHP */
> > +     VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
> > +
> > +     if (nr_pages == 1) {
> > +             swap_free(entry);
> > +             return;
> > +     }
> > +
> > +     remain_nr = nr_pages;
> > +     p = _swap_info_get(entry);
> > +     if (p) {
>
> nit: perhaps return early if (!p) ? Then you dedent the for() block.

Agreed!

>
> > +             for (i = 0; i < nr_pages; i += batch_nr) {
> > +                     batch_nr = min_t(int, SWAP_BATCH_NR, remain_nr);
> > +
> > +                     ci = lock_cluster_or_swap_info(p, offset);
> > +                     for (j = 0; j < batch_nr; j++) {
> > +                             if (__swap_entry_free_locked(p, offset + i * SWAP_BATCH_NR + j, 1))
> > +                                     __bitmap_set(usage, j, 1);
> > +                     }
> > +                     unlock_cluster_or_swap_info(p, ci);
> > +
> > +                     for_each_clear_bit(j, usage, batch_nr)
> > +                             free_swap_slot(swp_entry(type, offset + i * SWAP_BATCH_NR + j));
> > +
>
> nit: perhaps change to for (;;), and do the checks here to avoid clearing the
> bitmap on the last run:
>
>                         i += batch_nr;
>                         if (i < nr_pages)
>                                 break;
Great, thank you for your advice!
>
> > +                     bitmap_clear(usage, 0, SWAP_BATCH_NR);
> > +                     remain_nr -= batch_nr;
> > +             }
> > +     }
> > +}
> > +
> >  /*
> >   * Called after dropping swapcache to decrease refcnt to swap entries.
> >   */
>
>


-- 
Thanks,
Chuanhua

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 1/5] mm: swap: introduce swap_free_nr() for batched swap_free()
  2024-04-12  2:07     ` Chuanhua Han
@ 2024-04-12 11:28       ` Ryan Roberts
  2024-04-12 11:38         ` Chuanhua Han
  0 siblings, 1 reply; 54+ messages in thread
From: Ryan Roberts @ 2024-04-12 11:28 UTC (permalink / raw)
  To: Chuanhua Han
  Cc: Barry Song, akpm, linux-mm, baolin.wang, chrisl, david,
	hanchuanhua, hannes, hughd, kasong, surenb, v-songbaohua, willy,
	xiang, ying.huang, yosryahmed, yuzhao, ziy, linux-kernel

On 12/04/2024 03:07, Chuanhua Han wrote:
> Ryan Roberts <ryan.roberts@arm.com> 于2024年4月11日周四 22:30写道:
>>
>> On 09/04/2024 09:26, Barry Song wrote:
>>> From: Chuanhua Han <hanchuanhua@oppo.com>
>>>
>>> While swapping in a large folio, we need to free swaps related to the whole
>>> folio. To avoid frequently acquiring and releasing swap locks, it is better
>>> to introduce an API for batched free.
>>>
>>> Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
>>> Co-developed-by: Barry Song <v-songbaohua@oppo.com>
>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>>
>> Couple of nits; feel free to ignore.
>>
>> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
>>
>>> ---
>>>  include/linux/swap.h |  5 +++++
>>>  mm/swapfile.c        | 51 ++++++++++++++++++++++++++++++++++++++++++++
>>>  2 files changed, 56 insertions(+)
>>>
>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>> index 11c53692f65f..b7a107e983b8 100644
>>> --- a/include/linux/swap.h
>>> +++ b/include/linux/swap.h
>>> @@ -483,6 +483,7 @@ extern void swap_shmem_alloc(swp_entry_t);
>>>  extern int swap_duplicate(swp_entry_t);
>>>  extern int swapcache_prepare(swp_entry_t);
>>>  extern void swap_free(swp_entry_t);
>>> +extern void swap_free_nr(swp_entry_t entry, int nr_pages);
>>>  extern void swapcache_free_entries(swp_entry_t *entries, int n);
>>>  extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
>>>  int swap_type_of(dev_t device, sector_t offset);
>>> @@ -564,6 +565,10 @@ static inline void swap_free(swp_entry_t swp)
>>>  {
>>>  }
>>>
>>> +void swap_free_nr(swp_entry_t entry, int nr_pages)
>>> +{
>>> +}
>>> +
>>>  static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
>>>  {
>>>  }
>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>>> index 28642c188c93..f4c65aeb088d 100644
>>> --- a/mm/swapfile.c
>>> +++ b/mm/swapfile.c
>>> @@ -1356,6 +1356,57 @@ void swap_free(swp_entry_t entry)
>>>               __swap_entry_free(p, entry);
>>>  }
>>>
>>> +/*
>>> + * Free up the maximum number of swap entries at once to limit the
>>> + * maximum kernel stack usage.
>>> + */
>>> +#define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 512 ? 512 : SWAPFILE_CLUSTER)
>>> +
>>> +/*
>>> + * Called after swapping in a large folio, batched free swap entries
>>> + * for this large folio, entry should be for the first subpage and
>>> + * its offset is aligned with nr_pages
>>> + */
>>> +void swap_free_nr(swp_entry_t entry, int nr_pages)
>>> +{
>>> +     int i, j;
>>> +     struct swap_cluster_info *ci;
>>> +     struct swap_info_struct *p;
>>> +     unsigned int type = swp_type(entry);
>>> +     unsigned long offset = swp_offset(entry);
>>> +     int batch_nr, remain_nr;
>>> +     DECLARE_BITMAP(usage, SWAP_BATCH_NR) = { 0 };
>>> +
>>> +     /* all swap entries are within a cluster for mTHP */
>>> +     VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
>>> +
>>> +     if (nr_pages == 1) {
>>> +             swap_free(entry);
>>> +             return;
>>> +     }
>>> +
>>> +     remain_nr = nr_pages;
>>> +     p = _swap_info_get(entry);
>>> +     if (p) {
>>
>> nit: perhaps return early if (!p) ? Then you dedent the for() block.
> 
> Agreed!
> 
>>
>>> +             for (i = 0; i < nr_pages; i += batch_nr) {
>>> +                     batch_nr = min_t(int, SWAP_BATCH_NR, remain_nr);
>>> +
>>> +                     ci = lock_cluster_or_swap_info(p, offset);
>>> +                     for (j = 0; j < batch_nr; j++) {
>>> +                             if (__swap_entry_free_locked(p, offset + i * SWAP_BATCH_NR + j, 1))
>>> +                                     __bitmap_set(usage, j, 1);
>>> +                     }
>>> +                     unlock_cluster_or_swap_info(p, ci);
>>> +
>>> +                     for_each_clear_bit(j, usage, batch_nr)
>>> +                             free_swap_slot(swp_entry(type, offset + i * SWAP_BATCH_NR + j));
>>> +
>>
>> nit: perhaps change to for (;;), and do the checks here to avoid clearing the
>> bitmap on the last run:
>>
>>                         i += batch_nr;
>>                         if (i < nr_pages)
>>                                 break;
> Great, thank you for your advice!

Or maybe leave the for() as is, but don't explicitly init the bitmap at the
start of the function and instead call:

	bitmap_clear(usage, 0, SWAP_BATCH_NR);

At the start of each loop?

>>
>>> +                     bitmap_clear(usage, 0, SWAP_BATCH_NR);
>>> +                     remain_nr -= batch_nr;
>>> +             }
>>> +     }
>>> +}
>>> +
>>>  /*
>>>   * Called after dropping swapcache to decrease refcnt to swap entries.
>>>   */
>>
>>
> 
> 


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 4/5] mm: swap: entirely map large folios found in swapcache
  2024-04-11 23:30     ` Barry Song
@ 2024-04-12 11:31       ` Ryan Roberts
  0 siblings, 0 replies; 54+ messages in thread
From: Ryan Roberts @ 2024-04-12 11:31 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, baolin.wang, chrisl, david, hanchuanhua, hannes,
	hughd, kasong, surenb, v-songbaohua, willy, xiang, ying.huang,
	yosryahmed, yuzhao, ziy, linux-kernel

On 12/04/2024 00:30, Barry Song wrote:
> On Fri, Apr 12, 2024 at 3:33 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 09/04/2024 09:26, Barry Song wrote:
>>> From: Chuanhua Han <hanchuanhua@oppo.com>
>>>
>>> When a large folio is found in the swapcache, the current implementation
>>> requires calling do_swap_page() nr_pages times, resulting in nr_pages
>>> page faults. This patch opts to map the entire large folio at once to
>>> minimize page faults. Additionally, redundant checks and early exits
>>> for ARM64 MTE restoring are removed.
>>>
>>> Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
>>> Co-developed-by: Barry Song <v-songbaohua@oppo.com>
>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>>> ---
>>>  mm/memory.c | 64 +++++++++++++++++++++++++++++++++++++++++++----------
>>>  1 file changed, 52 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index c4a52e8d740a..9818dc1893c8 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -3947,6 +3947,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>       pte_t pte;
>>>       vm_fault_t ret = 0;
>>>       void *shadow = NULL;
>>> +     int nr_pages = 1;
>>> +     unsigned long start_address = vmf->address;
>>> +     pte_t *start_pte = vmf->pte;
>>
>> possible bug?: there are code paths that assign to vmf-pte below in this
>> function, so couldn't start_pte be stale in some cases? I'd just do the
>> assignment (all 4 of these variables in fact) in an else clause below, after any
>> messing about with them is complete.
>>
>> nit: rename start_pte -> start_ptep ?
> 
> Agreed.
> 
>>
>>> +     bool any_swap_shared = false;
>>
>> Suggest you defer initialization of this to your "We hit large folios in
>> swapcache" block below, and init it to:
>>
>>         any_swap_shared = !pte_swp_exclusive(vmf->pte);
>>
>> Then the any_shared semantic in swap_pte_batch() can be the same as for
>> folio_pte_batch().
>>
>>>
>>>       if (!pte_unmap_same(vmf))
>>>               goto out;
>>> @@ -4137,6 +4141,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>        */
>>>       vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
>>>                       &vmf->ptl);
>>
>> bug: vmf->pte may be NULL and you are not checking it until check_pte:. Byt you
>> are using it in this block. It also seems odd to do all the work in the below
>> block under the PTL but before checking if the pte has changed. Suggest moving
>> both checks here.
> 
> agreed.
> 
>>
>>> +
>>> +     /* We hit large folios in swapcache */
>>> +     if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio)) {
>>
>> What's the start_pte check protecting?
> 
> This is exactly protecting the case vmf->pte==NULL but for some reason it was
> assigned in the beginning of the function incorrectly. The intention of the code
> was actually doing start_pte = vmf->pte after "vmf->pte = pte_offset_map_lock".
> 
>>
>>> +             int nr = folio_nr_pages(folio);
>>> +             int idx = folio_page_idx(folio, page);
>>> +             unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
>>> +             unsigned long folio_end = folio_start + nr * PAGE_SIZE;
>>> +             pte_t *folio_ptep;
>>> +             pte_t folio_pte;
>>> +
>>> +             if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
>>> +                     goto check_pte;
>>> +             if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
>>> +                     goto check_pte;
>>> +
>>> +             folio_ptep = vmf->pte - idx;
>>> +             folio_pte = ptep_get(folio_ptep);
>>> +             if (!is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte)) ||
>>> +                 swap_pte_batch(folio_ptep, nr, folio_pte, &any_swap_shared) != nr)
>>> +                     goto check_pte;
>>> +
>>> +             start_address = folio_start;
>>> +             start_pte = folio_ptep;
>>> +             nr_pages = nr;
>>> +             entry = folio->swap;
>>> +             page = &folio->page;
>>> +     }
>>> +
>>> +check_pte:
>>>       if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
>>>               goto out_nomap;
>>>
>>> @@ -4190,6 +4223,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>                        */
>>>                       exclusive = false;
>>>               }
>>> +
>>> +             /* Reuse the whole large folio iff all entries are exclusive */
>>> +             if (nr_pages > 1 && any_swap_shared)
>>> +                     exclusive = false;
>>
>> If you init any_shared with the firt pte as I suggested then you could just set
>> exclusive = !any_shared at the top of this if block without needing this
>> separate fixup.
> 
> Since your swap_pte_batch() function checks that all PTEs have the same
> exclusive bits, I'll be removing any_shared first in version 3 per David's
> suggestions. We could potentially develop "any_shared" as an incremental
> patchset later on .

Ahh yes, good point. I'll admit that your conversation about this went over my
head at the time since I hadn't yet looked at this.

> 
>>>       }
>>>
>>>       /*
>>> @@ -4204,12 +4241,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>        * We're already holding a reference on the page but haven't mapped it
>>>        * yet.
>>>        */
>>> -     swap_free(entry);
>>> +     swap_free_nr(entry, nr_pages);
>>>       if (should_try_to_free_swap(folio, vma, vmf->flags))
>>>               folio_free_swap(folio);
>>>
>>> -     inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
>>> -     dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
>>> +     folio_ref_add(folio, nr_pages - 1);
>>> +     add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
>>> +     add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
>>> +
>>>       pte = mk_pte(page, vma->vm_page_prot);
>>>
>>>       /*
>>> @@ -4219,33 +4258,34 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>        * exclusivity.
>>>        */
>>>       if (!folio_test_ksm(folio) &&
>>> -         (exclusive || folio_ref_count(folio) == 1)) {
>>> +         (exclusive || (folio_ref_count(folio) == nr_pages &&
>>> +                        folio_nr_pages(folio) == nr_pages))) {
>>>               if (vmf->flags & FAULT_FLAG_WRITE) {
>>>                       pte = maybe_mkwrite(pte_mkdirty(pte), vma);
>>>                       vmf->flags &= ~FAULT_FLAG_WRITE;
>>>               }
>>>               rmap_flags |= RMAP_EXCLUSIVE;
>>>       }
>>> -     flush_icache_page(vma, page);
>>> +     flush_icache_pages(vma, page, nr_pages);
>>>       if (pte_swp_soft_dirty(vmf->orig_pte))
>>>               pte = pte_mksoft_dirty(pte);
>>>       if (pte_swp_uffd_wp(vmf->orig_pte))
>>>               pte = pte_mkuffd_wp(pte);
>>
>> I'm not sure about all this... you are smearing these SW bits from the faulting
>> PTE across all the ptes you are mapping. Although I guess actually that's ok
>> because swap_pte_batch() only returns a batch with all these bits the same?
> 
> Initially, I didn't recognize the issue at all because the tested
> architecture arm64
> didn't include these bits. However, after reviewing your latest swpout series,
> which verifies the consistent bits for soft_dirty and uffd_wp, I now
> feel  its safety
> even for platforms with these bits.

Yep, agreed.

> 
>>
>>> -     vmf->orig_pte = pte;
>>
>> Instead of doing a readback below, perhaps:
>>
>>         vmf->orig_pte = pte_advance_pfn(pte, nr_pages);
> 
> Nice !
> 
>>
>>>
>>>       /* ksm created a completely new copy */
>>>       if (unlikely(folio != swapcache && swapcache)) {
>>> -             folio_add_new_anon_rmap(folio, vma, vmf->address);
>>> +             folio_add_new_anon_rmap(folio, vma, start_address);
>>>               folio_add_lru_vma(folio, vma);
>>>       } else {
>>> -             folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
>>> -                                     rmap_flags);
>>> +             folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, start_address,
>>> +                                      rmap_flags);
>>>       }
>>>
>>>       VM_BUG_ON(!folio_test_anon(folio) ||
>>>                       (pte_write(pte) && !PageAnonExclusive(page)));
>>> -     set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
>>> -     arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
>>> +     set_ptes(vma->vm_mm, start_address, start_pte, pte, nr_pages);
>>> +     vmf->orig_pte = ptep_get(vmf->pte);
>>> +     arch_do_swap_page(vma->vm_mm, vma, start_address, pte, pte);
>>>
>>>       folio_unlock(folio);
>>>       if (folio != swapcache && swapcache) {
>>> @@ -4269,7 +4309,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>       }
>>>
>>>       /* No need to invalidate - it was non-present before */
>>> -     update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
>>> +     update_mmu_cache_range(vmf, vma, start_address, start_pte, nr_pages);
>>>  unlock:
>>>       if (vmf->pte)
>>>               pte_unmap_unlock(vmf->pte, vmf->ptl);
>>
> 
> Thanks
> Barry


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 1/5] mm: swap: introduce swap_free_nr() for batched swap_free()
  2024-04-12 11:28       ` Ryan Roberts
@ 2024-04-12 11:38         ` Chuanhua Han
  0 siblings, 0 replies; 54+ messages in thread
From: Chuanhua Han @ 2024-04-12 11:38 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Barry Song, akpm, linux-mm, baolin.wang, chrisl, david,
	hanchuanhua, hannes, hughd, kasong, surenb, v-songbaohua, willy,
	xiang, ying.huang, yosryahmed, yuzhao, ziy, linux-kernel

Ryan Roberts <ryan.roberts@arm.com> 于2024年4月12日周五 19:28写道:
>
> On 12/04/2024 03:07, Chuanhua Han wrote:
> > Ryan Roberts <ryan.roberts@arm.com> 于2024年4月11日周四 22:30写道:
> >>
> >> On 09/04/2024 09:26, Barry Song wrote:
> >>> From: Chuanhua Han <hanchuanhua@oppo.com>
> >>>
> >>> While swapping in a large folio, we need to free swaps related to the whole
> >>> folio. To avoid frequently acquiring and releasing swap locks, it is better
> >>> to introduce an API for batched free.
> >>>
> >>> Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> >>> Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> >>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> >>
> >> Couple of nits; feel free to ignore.
> >>
> >> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
> >>
> >>> ---
> >>>  include/linux/swap.h |  5 +++++
> >>>  mm/swapfile.c        | 51 ++++++++++++++++++++++++++++++++++++++++++++
> >>>  2 files changed, 56 insertions(+)
> >>>
> >>> diff --git a/include/linux/swap.h b/include/linux/swap.h
> >>> index 11c53692f65f..b7a107e983b8 100644
> >>> --- a/include/linux/swap.h
> >>> +++ b/include/linux/swap.h
> >>> @@ -483,6 +483,7 @@ extern void swap_shmem_alloc(swp_entry_t);
> >>>  extern int swap_duplicate(swp_entry_t);
> >>>  extern int swapcache_prepare(swp_entry_t);
> >>>  extern void swap_free(swp_entry_t);
> >>> +extern void swap_free_nr(swp_entry_t entry, int nr_pages);
> >>>  extern void swapcache_free_entries(swp_entry_t *entries, int n);
> >>>  extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
> >>>  int swap_type_of(dev_t device, sector_t offset);
> >>> @@ -564,6 +565,10 @@ static inline void swap_free(swp_entry_t swp)
> >>>  {
> >>>  }
> >>>
> >>> +void swap_free_nr(swp_entry_t entry, int nr_pages)
> >>> +{
> >>> +}
> >>> +
> >>>  static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
> >>>  {
> >>>  }
> >>> diff --git a/mm/swapfile.c b/mm/swapfile.c
> >>> index 28642c188c93..f4c65aeb088d 100644
> >>> --- a/mm/swapfile.c
> >>> +++ b/mm/swapfile.c
> >>> @@ -1356,6 +1356,57 @@ void swap_free(swp_entry_t entry)
> >>>               __swap_entry_free(p, entry);
> >>>  }
> >>>
> >>> +/*
> >>> + * Free up the maximum number of swap entries at once to limit the
> >>> + * maximum kernel stack usage.
> >>> + */
> >>> +#define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 512 ? 512 : SWAPFILE_CLUSTER)
> >>> +
> >>> +/*
> >>> + * Called after swapping in a large folio, batched free swap entries
> >>> + * for this large folio, entry should be for the first subpage and
> >>> + * its offset is aligned with nr_pages
> >>> + */
> >>> +void swap_free_nr(swp_entry_t entry, int nr_pages)
> >>> +{
> >>> +     int i, j;
> >>> +     struct swap_cluster_info *ci;
> >>> +     struct swap_info_struct *p;
> >>> +     unsigned int type = swp_type(entry);
> >>> +     unsigned long offset = swp_offset(entry);
> >>> +     int batch_nr, remain_nr;
> >>> +     DECLARE_BITMAP(usage, SWAP_BATCH_NR) = { 0 };
> >>> +
> >>> +     /* all swap entries are within a cluster for mTHP */
> >>> +     VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
> >>> +
> >>> +     if (nr_pages == 1) {
> >>> +             swap_free(entry);
> >>> +             return;
> >>> +     }
> >>> +
> >>> +     remain_nr = nr_pages;
> >>> +     p = _swap_info_get(entry);
> >>> +     if (p) {
> >>
> >> nit: perhaps return early if (!p) ? Then you dedent the for() block.
> >
> > Agreed!
> >
> >>
> >>> +             for (i = 0; i < nr_pages; i += batch_nr) {
> >>> +                     batch_nr = min_t(int, SWAP_BATCH_NR, remain_nr);
> >>> +
> >>> +                     ci = lock_cluster_or_swap_info(p, offset);
> >>> +                     for (j = 0; j < batch_nr; j++) {
> >>> +                             if (__swap_entry_free_locked(p, offset + i * SWAP_BATCH_NR + j, 1))
> >>> +                                     __bitmap_set(usage, j, 1);
> >>> +                     }
> >>> +                     unlock_cluster_or_swap_info(p, ci);
> >>> +
> >>> +                     for_each_clear_bit(j, usage, batch_nr)
> >>> +                             free_swap_slot(swp_entry(type, offset + i * SWAP_BATCH_NR + j));
> >>> +
> >>
> >> nit: perhaps change to for (;;), and do the checks here to avoid clearing the
> >> bitmap on the last run:
> >>
> >>                         i += batch_nr;
> >>                         if (i < nr_pages)
> >>                                 break;
Should be:
if (i >= nr_pages)
    break;
> > Great, thank you for your advice!
>
> Or maybe leave the for() as is, but don't explicitly init the bitmap at the
> start of the function and instead call:
>
>         bitmap_clear(usage, 0, SWAP_BATCH_NR);
>
> At the start of each loop?
Yeah, that's OK, actually these two ways are similar, both are to
reduce the number of bitmap_clear calls.
>
> >>
> >>> +                     bitmap_clear(usage, 0, SWAP_BATCH_NR);
> >>> +                     remain_nr -= batch_nr;
> >>> +             }
> >>> +     }
> >>> +}
> >>> +
> >>>  /*
> >>>   * Called after dropping swapcache to decrease refcnt to swap entries.
> >>>   */
> >>
> >>
> >
> >
>


-- 
Thanks,
Chuanhua

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 1/5] mm: swap: introduce swap_free_nr() for batched swap_free()
  2024-04-09  8:26 ` [PATCH v2 1/5] mm: swap: introduce swap_free_nr() for batched swap_free() Barry Song
  2024-04-10 23:37   ` SeongJae Park
  2024-04-11 14:30   ` Ryan Roberts
@ 2024-04-15  6:17   ` Huang, Ying
  2024-04-15  7:04     ` Barry Song
  2 siblings, 1 reply; 54+ messages in thread
From: Huang, Ying @ 2024-04-15  6:17 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, baolin.wang, chrisl, david, hanchuanhua, hannes,
	hughd, kasong, ryan.roberts, surenb, v-songbaohua, willy, xiang,
	yosryahmed, yuzhao, ziy, linux-kernel

Barry Song <21cnbao@gmail.com> writes:

> From: Chuanhua Han <hanchuanhua@oppo.com>
>
> While swapping in a large folio, we need to free swaps related to the whole
> folio. To avoid frequently acquiring and releasing swap locks, it is better
> to introduce an API for batched free.
>
> Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  include/linux/swap.h |  5 +++++
>  mm/swapfile.c        | 51 ++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 56 insertions(+)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 11c53692f65f..b7a107e983b8 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -483,6 +483,7 @@ extern void swap_shmem_alloc(swp_entry_t);
>  extern int swap_duplicate(swp_entry_t);
>  extern int swapcache_prepare(swp_entry_t);
>  extern void swap_free(swp_entry_t);
> +extern void swap_free_nr(swp_entry_t entry, int nr_pages);
>  extern void swapcache_free_entries(swp_entry_t *entries, int n);
>  extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
>  int swap_type_of(dev_t device, sector_t offset);
> @@ -564,6 +565,10 @@ static inline void swap_free(swp_entry_t swp)
>  {
>  }
>  
> +void swap_free_nr(swp_entry_t entry, int nr_pages)
> +{
> +}
> +
>  static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
>  {
>  }
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 28642c188c93..f4c65aeb088d 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1356,6 +1356,57 @@ void swap_free(swp_entry_t entry)
>  		__swap_entry_free(p, entry);
>  }
>  
> +/*
> + * Free up the maximum number of swap entries at once to limit the
> + * maximum kernel stack usage.
> + */
> +#define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 512 ? 512 : SWAPFILE_CLUSTER)
> +
> +/*
> + * Called after swapping in a large folio,

IMHO, it's not good to document the caller in the function definition.
Because this will discourage function reusing.

> batched free swap entries
> + * for this large folio, entry should be for the first subpage and
> + * its offset is aligned with nr_pages

Why do we need this?

> + */
> +void swap_free_nr(swp_entry_t entry, int nr_pages)
> +{
> +	int i, j;
> +	struct swap_cluster_info *ci;
> +	struct swap_info_struct *p;
> +	unsigned int type = swp_type(entry);
> +	unsigned long offset = swp_offset(entry);
> +	int batch_nr, remain_nr;
> +	DECLARE_BITMAP(usage, SWAP_BATCH_NR) = { 0 };
> +
> +	/* all swap entries are within a cluster for mTHP */
> +	VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
> +
> +	if (nr_pages == 1) {
> +		swap_free(entry);
> +		return;
> +	}

Is it possible to unify swap_free() and swap_free_nr() into one function
with acceptable performance?  IIUC, the general rule in mTHP effort is
to avoid duplicate functions between mTHP and normal small folio.
Right?

> +
> +	remain_nr = nr_pages;
> +	p = _swap_info_get(entry);
> +	if (p) {
> +		for (i = 0; i < nr_pages; i += batch_nr) {
> +			batch_nr = min_t(int, SWAP_BATCH_NR, remain_nr);
> +
> +			ci = lock_cluster_or_swap_info(p, offset);
> +			for (j = 0; j < batch_nr; j++) {
> +				if (__swap_entry_free_locked(p, offset + i * SWAP_BATCH_NR + j, 1))
> +					__bitmap_set(usage, j, 1);
> +			}
> +			unlock_cluster_or_swap_info(p, ci);
> +
> +			for_each_clear_bit(j, usage, batch_nr)
> +				free_swap_slot(swp_entry(type, offset + i * SWAP_BATCH_NR + j));
> +
> +			bitmap_clear(usage, 0, SWAP_BATCH_NR);
> +			remain_nr -= batch_nr;
> +		}
> +	}
> +}
> +
>  /*
>   * Called after dropping swapcache to decrease refcnt to swap entries.
>   */

put_swap_folio() implements batching in another method.  Do you think
that it's good to use the batching method in that function here?  It
avoids to use bitmap operations and stack space.

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 1/5] mm: swap: introduce swap_free_nr() for batched swap_free()
  2024-04-15  6:17   ` Huang, Ying
@ 2024-04-15  7:04     ` Barry Song
  2024-04-15  8:06       ` Barry Song
  2024-04-15  8:19       ` Huang, Ying
  0 siblings, 2 replies; 54+ messages in thread
From: Barry Song @ 2024-04-15  7:04 UTC (permalink / raw)
  To: Huang, Ying
  Cc: akpm, linux-mm, baolin.wang, chrisl, david, hanchuanhua, hannes,
	hughd, kasong, ryan.roberts, surenb, v-songbaohua, willy, xiang,
	yosryahmed, yuzhao, ziy, linux-kernel

On Mon, Apr 15, 2024 at 6:19 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > From: Chuanhua Han <hanchuanhua@oppo.com>
> >
> > While swapping in a large folio, we need to free swaps related to the whole
> > folio. To avoid frequently acquiring and releasing swap locks, it is better
> > to introduce an API for batched free.
> >
> > Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> > Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > ---
> >  include/linux/swap.h |  5 +++++
> >  mm/swapfile.c        | 51 ++++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 56 insertions(+)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 11c53692f65f..b7a107e983b8 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -483,6 +483,7 @@ extern void swap_shmem_alloc(swp_entry_t);
> >  extern int swap_duplicate(swp_entry_t);
> >  extern int swapcache_prepare(swp_entry_t);
> >  extern void swap_free(swp_entry_t);
> > +extern void swap_free_nr(swp_entry_t entry, int nr_pages);
> >  extern void swapcache_free_entries(swp_entry_t *entries, int n);
> >  extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
> >  int swap_type_of(dev_t device, sector_t offset);
> > @@ -564,6 +565,10 @@ static inline void swap_free(swp_entry_t swp)
> >  {
> >  }
> >
> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
> > +{
> > +}
> > +
> >  static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
> >  {
> >  }
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 28642c188c93..f4c65aeb088d 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -1356,6 +1356,57 @@ void swap_free(swp_entry_t entry)
> >               __swap_entry_free(p, entry);
> >  }
> >
> > +/*
> > + * Free up the maximum number of swap entries at once to limit the
> > + * maximum kernel stack usage.
> > + */
> > +#define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 512 ? 512 : SWAPFILE_CLUSTER)
> > +
> > +/*
> > + * Called after swapping in a large folio,
>
> IMHO, it's not good to document the caller in the function definition.
> Because this will discourage function reusing.

ok. right now there is only one user that is why it is added. but i agree
we can actually remove this.

>
> > batched free swap entries
> > + * for this large folio, entry should be for the first subpage and
> > + * its offset is aligned with nr_pages
>
> Why do we need this?

This is a fundamental requirement for the existing kernel, folio's
swap offset is naturally aligned from the first moment add_to_swap
to add swapcache's xa. so this comment is describing the existing
fact. In the future, if we want to support swap-out folio to discontiguous
and not-aligned offsets, we can't pass entry as the parameter, we should
instead pass ptep or another different data struct which can connect
multiple discontiguous swap offsets.

I feel like we only need "for this large folio, entry should be for
the first subpage" and drop "and its offset is aligned with nr_pages",
the latter is not important to this context at all.

>
> > + */
> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
> > +{
> > +     int i, j;
> > +     struct swap_cluster_info *ci;
> > +     struct swap_info_struct *p;
> > +     unsigned int type = swp_type(entry);
> > +     unsigned long offset = swp_offset(entry);
> > +     int batch_nr, remain_nr;
> > +     DECLARE_BITMAP(usage, SWAP_BATCH_NR) = { 0 };
> > +
> > +     /* all swap entries are within a cluster for mTHP */
> > +     VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
> > +
> > +     if (nr_pages == 1) {
> > +             swap_free(entry);
> > +             return;
> > +     }
>
> Is it possible to unify swap_free() and swap_free_nr() into one function
> with acceptable performance?  IIUC, the general rule in mTHP effort is
> to avoid duplicate functions between mTHP and normal small folio.
> Right?

I don't see why. but we have lots of places calling swap_free(), we may
have to change them all to call swap_free_nr(entry, 1); the other possible
way is making swap_free() a wrapper of swap_free_nr() always using
1 as the argument. In both cases, we are changing the semantics of
swap_free_nr() to partially freeing large folio cases and have to drop
"entry should be for the first subpage" then.

Right now, the semantics is
* swap_free_nr() for an entire large folio;
* swap_free() for one entry of either a large folio or a small folio

>
> > +
> > +     remain_nr = nr_pages;
> > +     p = _swap_info_get(entry);
> > +     if (p) {
> > +             for (i = 0; i < nr_pages; i += batch_nr) {
> > +                     batch_nr = min_t(int, SWAP_BATCH_NR, remain_nr);
> > +
> > +                     ci = lock_cluster_or_swap_info(p, offset);
> > +                     for (j = 0; j < batch_nr; j++) {
> > +                             if (__swap_entry_free_locked(p, offset + i * SWAP_BATCH_NR + j, 1))
> > +                                     __bitmap_set(usage, j, 1);
> > +                     }
> > +                     unlock_cluster_or_swap_info(p, ci);
> > +
> > +                     for_each_clear_bit(j, usage, batch_nr)
> > +                             free_swap_slot(swp_entry(type, offset + i * SWAP_BATCH_NR + j));
> > +
> > +                     bitmap_clear(usage, 0, SWAP_BATCH_NR);
> > +                     remain_nr -= batch_nr;
> > +             }
> > +     }
> > +}
> > +
> >  /*
> >   * Called after dropping swapcache to decrease refcnt to swap entries.
> >   */
>
> put_swap_folio() implements batching in another method.  Do you think
> that it's good to use the batching method in that function here?  It
> avoids to use bitmap operations and stack space.

Chuanhua has strictly limited the maximum stack usage to several
unsigned long, so this should be safe. on the other hand, i believe this
implementation is more efficient, as  put_swap_folio() might lock/
unlock much more often whenever __swap_entry_free_locked returns
0.

>
> --
> Best Regards,
> Huang, Ying

Thanks
Barry

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 2/5] mm: swap: make should_try_to_free_swap() support large-folio
  2024-04-09  8:26 ` [PATCH v2 2/5] mm: swap: make should_try_to_free_swap() support large-folio Barry Song
@ 2024-04-15  7:11   ` Huang, Ying
  0 siblings, 0 replies; 54+ messages in thread
From: Huang, Ying @ 2024-04-15  7:11 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, baolin.wang, chrisl, david, hanchuanhua, hannes,
	hughd, kasong, ryan.roberts, surenb, v-songbaohua, willy, xiang,
	yosryahmed, yuzhao, ziy, linux-kernel

Barry Song <21cnbao@gmail.com> writes:

> From: Chuanhua Han <hanchuanhua@oppo.com>
>
> The function should_try_to_free_swap() operates under the assumption that
> swap-in always occurs at the normal page granularity, i.e., folio_nr_pages
                                                              ~~~~~~~~~~~~~~

nits: folio_nr_pages() is better for understanding.

Otherwise, LGTM, Thanks!

Reviewed-by: "Huang, Ying" <ying.huang@intel.com>

> = 1. However, in reality, for large folios, add_to_swap_cache() will
> invoke folio_ref_add(folio, nr). To accommodate large folio swap-in,
> this patch eliminates this assumption.
>
> Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> Acked-by: Chris Li <chrisl@kernel.org>
> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>

> ---
>  mm/memory.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 78422d1c7381..2702d449880e 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3856,7 +3856,7 @@ static inline bool should_try_to_free_swap(struct folio *folio,
>  	 * reference only in case it's likely that we'll be the exlusive user.
>  	 */
>  	return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) &&
> -		folio_ref_count(folio) == 2;
> +		folio_ref_count(folio) == (1 + folio_nr_pages(folio));
>  }
>  
>  static vm_fault_t pte_marker_clear(struct vm_fault *vmf)

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 1/5] mm: swap: introduce swap_free_nr() for batched swap_free()
  2024-04-15  7:04     ` Barry Song
@ 2024-04-15  8:06       ` Barry Song
  2024-04-15  8:19       ` Huang, Ying
  1 sibling, 0 replies; 54+ messages in thread
From: Barry Song @ 2024-04-15  8:06 UTC (permalink / raw)
  To: Huang, Ying
  Cc: akpm, linux-mm, baolin.wang, chrisl, david, hanchuanhua, hannes,
	hughd, kasong, ryan.roberts, surenb, v-songbaohua, willy, xiang,
	yosryahmed, yuzhao, ziy, linux-kernel

On Mon, Apr 15, 2024 at 7:04 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Mon, Apr 15, 2024 at 6:19 PM Huang, Ying <ying.huang@intel.com> wrote:
> >
> > Barry Song <21cnbao@gmail.com> writes:
> >
> > > From: Chuanhua Han <hanchuanhua@oppo.com>
> > >
> > > While swapping in a large folio, we need to free swaps related to the whole
> > > folio. To avoid frequently acquiring and releasing swap locks, it is better
> > > to introduce an API for batched free.
> > >
> > > Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> > > Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> > > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > > ---
> > >  include/linux/swap.h |  5 +++++
> > >  mm/swapfile.c        | 51 ++++++++++++++++++++++++++++++++++++++++++++
> > >  2 files changed, 56 insertions(+)
> > >
> > > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > > index 11c53692f65f..b7a107e983b8 100644
> > > --- a/include/linux/swap.h
> > > +++ b/include/linux/swap.h
> > > @@ -483,6 +483,7 @@ extern void swap_shmem_alloc(swp_entry_t);
> > >  extern int swap_duplicate(swp_entry_t);
> > >  extern int swapcache_prepare(swp_entry_t);
> > >  extern void swap_free(swp_entry_t);
> > > +extern void swap_free_nr(swp_entry_t entry, int nr_pages);
> > >  extern void swapcache_free_entries(swp_entry_t *entries, int n);
> > >  extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
> > >  int swap_type_of(dev_t device, sector_t offset);
> > > @@ -564,6 +565,10 @@ static inline void swap_free(swp_entry_t swp)
> > >  {
> > >  }
> > >
> > > +void swap_free_nr(swp_entry_t entry, int nr_pages)
> > > +{
> > > +}
> > > +
> > >  static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
> > >  {
> > >  }
> > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > index 28642c188c93..f4c65aeb088d 100644
> > > --- a/mm/swapfile.c
> > > +++ b/mm/swapfile.c
> > > @@ -1356,6 +1356,57 @@ void swap_free(swp_entry_t entry)
> > >               __swap_entry_free(p, entry);
> > >  }
> > >
> > > +/*
> > > + * Free up the maximum number of swap entries at once to limit the
> > > + * maximum kernel stack usage.
> > > + */
> > > +#define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 512 ? 512 : SWAPFILE_CLUSTER)
> > > +
> > > +/*
> > > + * Called after swapping in a large folio,
> >
> > IMHO, it's not good to document the caller in the function definition.
> > Because this will discourage function reusing.
>
> ok. right now there is only one user that is why it is added. but i agree
> we can actually remove this.
>
> >
> > > batched free swap entries
> > > + * for this large folio, entry should be for the first subpage and
> > > + * its offset is aligned with nr_pages
> >
> > Why do we need this?
>
> This is a fundamental requirement for the existing kernel, folio's
> swap offset is naturally aligned from the first moment add_to_swap
> to add swapcache's xa. so this comment is describing the existing
> fact. In the future, if we want to support swap-out folio to discontiguous
> and not-aligned offsets, we can't pass entry as the parameter, we should
> instead pass ptep or another different data struct which can connect
> multiple discontiguous swap offsets.
>
> I feel like we only need "for this large folio, entry should be for
> the first subpage" and drop "and its offset is aligned with nr_pages",
> the latter is not important to this context at all.

upon further consideration, the comment is inaccurate since we do support
nr_pages == 1, and do_swap_page() has indeed been invoked with this value.
Therefore, we should completely remove the comment.

>
> >
> > > + */
> > > +void swap_free_nr(swp_entry_t entry, int nr_pages)
> > > +{
> > > +     int i, j;
> > > +     struct swap_cluster_info *ci;
> > > +     struct swap_info_struct *p;
> > > +     unsigned int type = swp_type(entry);
> > > +     unsigned long offset = swp_offset(entry);
> > > +     int batch_nr, remain_nr;
> > > +     DECLARE_BITMAP(usage, SWAP_BATCH_NR) = { 0 };
> > > +
> > > +     /* all swap entries are within a cluster for mTHP */
> > > +     VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
> > > +
> > > +     if (nr_pages == 1) {
> > > +             swap_free(entry);
> > > +             return;
> > > +     }
> >
> > Is it possible to unify swap_free() and swap_free_nr() into one function
> > with acceptable performance?  IIUC, the general rule in mTHP effort is
> > to avoid duplicate functions between mTHP and normal small folio.
> > Right?
>
> I don't see why. but we have lots of places calling swap_free(), we may
> have to change them all to call swap_free_nr(entry, 1); the other possible
> way is making swap_free() a wrapper of swap_free_nr() always using
> 1 as the argument. In both cases, we are changing the semantics of
> swap_free_nr() to partially freeing large folio cases and have to drop
> "entry should be for the first subpage" then.
>
> Right now, the semantics is
> * swap_free_nr() for an entire large folio;
> * swap_free() for one entry of either a large folio or a small folio
>
> >
> > > +
> > > +     remain_nr = nr_pages;
> > > +     p = _swap_info_get(entry);
> > > +     if (p) {
> > > +             for (i = 0; i < nr_pages; i += batch_nr) {
> > > +                     batch_nr = min_t(int, SWAP_BATCH_NR, remain_nr);
> > > +
> > > +                     ci = lock_cluster_or_swap_info(p, offset);
> > > +                     for (j = 0; j < batch_nr; j++) {
> > > +                             if (__swap_entry_free_locked(p, offset + i * SWAP_BATCH_NR + j, 1))
> > > +                                     __bitmap_set(usage, j, 1);
> > > +                     }
> > > +                     unlock_cluster_or_swap_info(p, ci);
> > > +
> > > +                     for_each_clear_bit(j, usage, batch_nr)
> > > +                             free_swap_slot(swp_entry(type, offset + i * SWAP_BATCH_NR + j));
> > > +
> > > +                     bitmap_clear(usage, 0, SWAP_BATCH_NR);
> > > +                     remain_nr -= batch_nr;
> > > +             }
> > > +     }
> > > +}
> > > +
> > >  /*
> > >   * Called after dropping swapcache to decrease refcnt to swap entries.
> > >   */
> >
> > put_swap_folio() implements batching in another method.  Do you think
> > that it's good to use the batching method in that function here?  It
> > avoids to use bitmap operations and stack space.
>
> Chuanhua has strictly limited the maximum stack usage to several
> unsigned long, so this should be safe. on the other hand, i believe this
> implementation is more efficient, as  put_swap_folio() might lock/
> unlock much more often whenever __swap_entry_free_locked returns
> 0.
>
> >
> > --
> > Best Regards,
> > Huang, Ying
>
> Thanks
> Barry

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 1/5] mm: swap: introduce swap_free_nr() for batched swap_free()
  2024-04-15  7:04     ` Barry Song
  2024-04-15  8:06       ` Barry Song
@ 2024-04-15  8:19       ` Huang, Ying
  2024-04-15  8:34         ` Barry Song
  1 sibling, 1 reply; 54+ messages in thread
From: Huang, Ying @ 2024-04-15  8:19 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, baolin.wang, chrisl, david, hanchuanhua, hannes,
	hughd, kasong, ryan.roberts, surenb, v-songbaohua, willy, xiang,
	yosryahmed, yuzhao, ziy, linux-kernel

Barry Song <21cnbao@gmail.com> writes:

> On Mon, Apr 15, 2024 at 6:19 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Barry Song <21cnbao@gmail.com> writes:
>>
>> > From: Chuanhua Han <hanchuanhua@oppo.com>
>> >
>> > While swapping in a large folio, we need to free swaps related to the whole
>> > folio. To avoid frequently acquiring and releasing swap locks, it is better
>> > to introduce an API for batched free.
>> >
>> > Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
>> > Co-developed-by: Barry Song <v-songbaohua@oppo.com>
>> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>> > ---
>> >  include/linux/swap.h |  5 +++++
>> >  mm/swapfile.c        | 51 ++++++++++++++++++++++++++++++++++++++++++++
>> >  2 files changed, 56 insertions(+)
>> >
>> > diff --git a/include/linux/swap.h b/include/linux/swap.h
>> > index 11c53692f65f..b7a107e983b8 100644
>> > --- a/include/linux/swap.h
>> > +++ b/include/linux/swap.h
>> > @@ -483,6 +483,7 @@ extern void swap_shmem_alloc(swp_entry_t);
>> >  extern int swap_duplicate(swp_entry_t);
>> >  extern int swapcache_prepare(swp_entry_t);
>> >  extern void swap_free(swp_entry_t);
>> > +extern void swap_free_nr(swp_entry_t entry, int nr_pages);
>> >  extern void swapcache_free_entries(swp_entry_t *entries, int n);
>> >  extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
>> >  int swap_type_of(dev_t device, sector_t offset);
>> > @@ -564,6 +565,10 @@ static inline void swap_free(swp_entry_t swp)
>> >  {
>> >  }
>> >
>> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
>> > +{
>> > +}
>> > +
>> >  static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
>> >  {
>> >  }
>> > diff --git a/mm/swapfile.c b/mm/swapfile.c
>> > index 28642c188c93..f4c65aeb088d 100644
>> > --- a/mm/swapfile.c
>> > +++ b/mm/swapfile.c
>> > @@ -1356,6 +1356,57 @@ void swap_free(swp_entry_t entry)
>> >               __swap_entry_free(p, entry);
>> >  }
>> >
>> > +/*
>> > + * Free up the maximum number of swap entries at once to limit the
>> > + * maximum kernel stack usage.
>> > + */
>> > +#define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 512 ? 512 : SWAPFILE_CLUSTER)
>> > +
>> > +/*
>> > + * Called after swapping in a large folio,
>>
>> IMHO, it's not good to document the caller in the function definition.
>> Because this will discourage function reusing.
>
> ok. right now there is only one user that is why it is added. but i agree
> we can actually remove this.
>
>>
>> > batched free swap entries
>> > + * for this large folio, entry should be for the first subpage and
>> > + * its offset is aligned with nr_pages
>>
>> Why do we need this?
>
> This is a fundamental requirement for the existing kernel, folio's
> swap offset is naturally aligned from the first moment add_to_swap
> to add swapcache's xa. so this comment is describing the existing
> fact. In the future, if we want to support swap-out folio to discontiguous
> and not-aligned offsets, we can't pass entry as the parameter, we should
> instead pass ptep or another different data struct which can connect
> multiple discontiguous swap offsets.
>
> I feel like we only need "for this large folio, entry should be for
> the first subpage" and drop "and its offset is aligned with nr_pages",
> the latter is not important to this context at all.

IIUC, all these are requirements of the only caller now, not the
function itself.  If only part of the all swap entries of a mTHP are
called with swap_free_nr(), can swap_free_nr() still do its work?  If
so, why not make swap_free_nr() as general as possible?

>>
>> > + */
>> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
>> > +{
>> > +     int i, j;
>> > +     struct swap_cluster_info *ci;
>> > +     struct swap_info_struct *p;
>> > +     unsigned int type = swp_type(entry);
>> > +     unsigned long offset = swp_offset(entry);
>> > +     int batch_nr, remain_nr;
>> > +     DECLARE_BITMAP(usage, SWAP_BATCH_NR) = { 0 };
>> > +
>> > +     /* all swap entries are within a cluster for mTHP */
>> > +     VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
>> > +
>> > +     if (nr_pages == 1) {
>> > +             swap_free(entry);
>> > +             return;
>> > +     }
>>
>> Is it possible to unify swap_free() and swap_free_nr() into one function
>> with acceptable performance?  IIUC, the general rule in mTHP effort is
>> to avoid duplicate functions between mTHP and normal small folio.
>> Right?
>
> I don't see why.

Because duplicated implementation are hard to maintain in the long term.

> but we have lots of places calling swap_free(), we may
> have to change them all to call swap_free_nr(entry, 1); the other possible
> way is making swap_free() a wrapper of swap_free_nr() always using
> 1 as the argument. In both cases, we are changing the semantics of
> swap_free_nr() to partially freeing large folio cases and have to drop
> "entry should be for the first subpage" then.
>
> Right now, the semantics is
> * swap_free_nr() for an entire large folio;
> * swap_free() for one entry of either a large folio or a small folio

As above, I don't think the these semantics are important for
swap_free_nr() implementation.

>>
>> > +
>> > +     remain_nr = nr_pages;
>> > +     p = _swap_info_get(entry);
>> > +     if (p) {
>> > +             for (i = 0; i < nr_pages; i += batch_nr) {
>> > +                     batch_nr = min_t(int, SWAP_BATCH_NR, remain_nr);
>> > +
>> > +                     ci = lock_cluster_or_swap_info(p, offset);
>> > +                     for (j = 0; j < batch_nr; j++) {
>> > +                             if (__swap_entry_free_locked(p, offset + i * SWAP_BATCH_NR + j, 1))
>> > +                                     __bitmap_set(usage, j, 1);
>> > +                     }
>> > +                     unlock_cluster_or_swap_info(p, ci);
>> > +
>> > +                     for_each_clear_bit(j, usage, batch_nr)
>> > +                             free_swap_slot(swp_entry(type, offset + i * SWAP_BATCH_NR + j));
>> > +
>> > +                     bitmap_clear(usage, 0, SWAP_BATCH_NR);
>> > +                     remain_nr -= batch_nr;
>> > +             }
>> > +     }
>> > +}
>> > +
>> >  /*
>> >   * Called after dropping swapcache to decrease refcnt to swap entries.
>> >   */
>>
>> put_swap_folio() implements batching in another method.  Do you think
>> that it's good to use the batching method in that function here?  It
>> avoids to use bitmap operations and stack space.
>
> Chuanhua has strictly limited the maximum stack usage to several
> unsigned long,

512 / 8 = 64 bytes.

So, not trivial.

> so this should be safe. on the other hand, i believe this
> implementation is more efficient, as  put_swap_folio() might lock/
> unlock much more often whenever __swap_entry_free_locked returns
> 0.

There are 2 most common use cases,

- all swap entries have usage count == 0
- all swap entries have usage count != 0

In both cases, we only need to lock/unlock once.  In fact, I didn't
find possible use cases other than above.

And, we should add batching in __swap_entry_free().  That will help
free_swap_and_cache_nr() too.

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 1/5] mm: swap: introduce swap_free_nr() for batched swap_free()
  2024-04-15  8:19       ` Huang, Ying
@ 2024-04-15  8:34         ` Barry Song
  2024-04-15  8:51           ` Huang, Ying
  0 siblings, 1 reply; 54+ messages in thread
From: Barry Song @ 2024-04-15  8:34 UTC (permalink / raw)
  To: Huang, Ying
  Cc: akpm, linux-mm, baolin.wang, chrisl, david, hanchuanhua, hannes,
	hughd, kasong, ryan.roberts, surenb, v-songbaohua, willy, xiang,
	yosryahmed, yuzhao, ziy, linux-kernel

On Mon, Apr 15, 2024 at 8:21 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > On Mon, Apr 15, 2024 at 6:19 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Barry Song <21cnbao@gmail.com> writes:
> >>
> >> > From: Chuanhua Han <hanchuanhua@oppo.com>
> >> >
> >> > While swapping in a large folio, we need to free swaps related to the whole
> >> > folio. To avoid frequently acquiring and releasing swap locks, it is better
> >> > to introduce an API for batched free.
> >> >
> >> > Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> >> > Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> >> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> >> > ---
> >> >  include/linux/swap.h |  5 +++++
> >> >  mm/swapfile.c        | 51 ++++++++++++++++++++++++++++++++++++++++++++
> >> >  2 files changed, 56 insertions(+)
> >> >
> >> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> >> > index 11c53692f65f..b7a107e983b8 100644
> >> > --- a/include/linux/swap.h
> >> > +++ b/include/linux/swap.h
> >> > @@ -483,6 +483,7 @@ extern void swap_shmem_alloc(swp_entry_t);
> >> >  extern int swap_duplicate(swp_entry_t);
> >> >  extern int swapcache_prepare(swp_entry_t);
> >> >  extern void swap_free(swp_entry_t);
> >> > +extern void swap_free_nr(swp_entry_t entry, int nr_pages);
> >> >  extern void swapcache_free_entries(swp_entry_t *entries, int n);
> >> >  extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
> >> >  int swap_type_of(dev_t device, sector_t offset);
> >> > @@ -564,6 +565,10 @@ static inline void swap_free(swp_entry_t swp)
> >> >  {
> >> >  }
> >> >
> >> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
> >> > +{
> >> > +}
> >> > +
> >> >  static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
> >> >  {
> >> >  }
> >> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> >> > index 28642c188c93..f4c65aeb088d 100644
> >> > --- a/mm/swapfile.c
> >> > +++ b/mm/swapfile.c
> >> > @@ -1356,6 +1356,57 @@ void swap_free(swp_entry_t entry)
> >> >               __swap_entry_free(p, entry);
> >> >  }
> >> >
> >> > +/*
> >> > + * Free up the maximum number of swap entries at once to limit the
> >> > + * maximum kernel stack usage.
> >> > + */
> >> > +#define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 512 ? 512 : SWAPFILE_CLUSTER)
> >> > +
> >> > +/*
> >> > + * Called after swapping in a large folio,
> >>
> >> IMHO, it's not good to document the caller in the function definition.
> >> Because this will discourage function reusing.
> >
> > ok. right now there is only one user that is why it is added. but i agree
> > we can actually remove this.
> >
> >>
> >> > batched free swap entries
> >> > + * for this large folio, entry should be for the first subpage and
> >> > + * its offset is aligned with nr_pages
> >>
> >> Why do we need this?
> >
> > This is a fundamental requirement for the existing kernel, folio's
> > swap offset is naturally aligned from the first moment add_to_swap
> > to add swapcache's xa. so this comment is describing the existing
> > fact. In the future, if we want to support swap-out folio to discontiguous
> > and not-aligned offsets, we can't pass entry as the parameter, we should
> > instead pass ptep or another different data struct which can connect
> > multiple discontiguous swap offsets.
> >
> > I feel like we only need "for this large folio, entry should be for
> > the first subpage" and drop "and its offset is aligned with nr_pages",
> > the latter is not important to this context at all.
>
> IIUC, all these are requirements of the only caller now, not the
> function itself.  If only part of the all swap entries of a mTHP are
> called with swap_free_nr(), can swap_free_nr() still do its work?  If
> so, why not make swap_free_nr() as general as possible?

right , i believe we can make swap_free_nr() as general as possible.

>
> >>
> >> > + */
> >> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
> >> > +{
> >> > +     int i, j;
> >> > +     struct swap_cluster_info *ci;
> >> > +     struct swap_info_struct *p;
> >> > +     unsigned int type = swp_type(entry);
> >> > +     unsigned long offset = swp_offset(entry);
> >> > +     int batch_nr, remain_nr;
> >> > +     DECLARE_BITMAP(usage, SWAP_BATCH_NR) = { 0 };
> >> > +
> >> > +     /* all swap entries are within a cluster for mTHP */
> >> > +     VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
> >> > +
> >> > +     if (nr_pages == 1) {
> >> > +             swap_free(entry);
> >> > +             return;
> >> > +     }
> >>
> >> Is it possible to unify swap_free() and swap_free_nr() into one function
> >> with acceptable performance?  IIUC, the general rule in mTHP effort is
> >> to avoid duplicate functions between mTHP and normal small folio.
> >> Right?
> >
> > I don't see why.
>
> Because duplicated implementation are hard to maintain in the long term.

sorry, i actually meant "I don't see why not",  for some reason, the "not"
was missed. Obviously I meant "why not", there was a "but" after it :-)

>
> > but we have lots of places calling swap_free(), we may
> > have to change them all to call swap_free_nr(entry, 1); the other possible
> > way is making swap_free() a wrapper of swap_free_nr() always using
> > 1 as the argument. In both cases, we are changing the semantics of
> > swap_free_nr() to partially freeing large folio cases and have to drop
> > "entry should be for the first subpage" then.
> >
> > Right now, the semantics is
> > * swap_free_nr() for an entire large folio;
> > * swap_free() for one entry of either a large folio or a small folio
>
> As above, I don't think the these semantics are important for
> swap_free_nr() implementation.

right. I agree. If we are ready to change all those callers, nothing
can stop us from removing swap_free().

>
> >>
> >> > +
> >> > +     remain_nr = nr_pages;
> >> > +     p = _swap_info_get(entry);
> >> > +     if (p) {
> >> > +             for (i = 0; i < nr_pages; i += batch_nr) {
> >> > +                     batch_nr = min_t(int, SWAP_BATCH_NR, remain_nr);
> >> > +
> >> > +                     ci = lock_cluster_or_swap_info(p, offset);
> >> > +                     for (j = 0; j < batch_nr; j++) {
> >> > +                             if (__swap_entry_free_locked(p, offset + i * SWAP_BATCH_NR + j, 1))
> >> > +                                     __bitmap_set(usage, j, 1);
> >> > +                     }
> >> > +                     unlock_cluster_or_swap_info(p, ci);
> >> > +
> >> > +                     for_each_clear_bit(j, usage, batch_nr)
> >> > +                             free_swap_slot(swp_entry(type, offset + i * SWAP_BATCH_NR + j));
> >> > +
> >> > +                     bitmap_clear(usage, 0, SWAP_BATCH_NR);
> >> > +                     remain_nr -= batch_nr;
> >> > +             }
> >> > +     }
> >> > +}
> >> > +
> >> >  /*
> >> >   * Called after dropping swapcache to decrease refcnt to swap entries.
> >> >   */
> >>
> >> put_swap_folio() implements batching in another method.  Do you think
> >> that it's good to use the batching method in that function here?  It
> >> avoids to use bitmap operations and stack space.
> >
> > Chuanhua has strictly limited the maximum stack usage to several
> > unsigned long,
>
> 512 / 8 = 64 bytes.
>
> So, not trivial.
>
> > so this should be safe. on the other hand, i believe this
> > implementation is more efficient, as  put_swap_folio() might lock/
> > unlock much more often whenever __swap_entry_free_locked returns
> > 0.
>
> There are 2 most common use cases,
>
> - all swap entries have usage count == 0
> - all swap entries have usage count != 0
>
> In both cases, we only need to lock/unlock once.  In fact, I didn't
> find possible use cases other than above.

i guess the point is free_swap_slot() shouldn't be called within
lock_cluster_or_swap_info? so when we are freeing nr_pages slots,
we'll have to unlock and lock nr_pages times?  and this is the most
common scenario.

void put_swap_folio(struct folio *folio, swp_entry_t entry)
{
        ...

        ci = lock_cluster_or_swap_info(si, offset);
        ...
        for (i = 0; i < size; i++, entry.val++) {
                if (!__swap_entry_free_locked(si, offset + i, SWAP_HAS_CACHE)) {
                        unlock_cluster_or_swap_info(si, ci);
                        free_swap_slot(entry);
                        if (i == size - 1)
                                return;
                        lock_cluster_or_swap_info(si, offset);
                }
        }
        unlock_cluster_or_swap_info(si, ci);
}

>
> And, we should add batching in __swap_entry_free().  That will help
> free_swap_and_cache_nr() too.
>
> --
> Best Regards,
> Huang, Ying

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 4/5] mm: swap: entirely map large folios found in swapcache
  2024-04-09  8:26 ` [PATCH v2 4/5] mm: swap: entirely map large folios found in swapcache Barry Song
  2024-04-11 15:33   ` Ryan Roberts
@ 2024-04-15  8:37   ` Huang, Ying
  2024-04-15  8:53     ` Barry Song
  1 sibling, 1 reply; 54+ messages in thread
From: Huang, Ying @ 2024-04-15  8:37 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, baolin.wang, chrisl, david, hanchuanhua, hannes,
	hughd, kasong, ryan.roberts, surenb, v-songbaohua, willy, xiang,
	yosryahmed, yuzhao, ziy, linux-kernel

Barry Song <21cnbao@gmail.com> writes:

> From: Chuanhua Han <hanchuanhua@oppo.com>
>
> When a large folio is found in the swapcache, the current implementation
> requires calling do_swap_page() nr_pages times, resulting in nr_pages
> page faults. This patch opts to map the entire large folio at once to
> minimize page faults. Additionally, redundant checks and early exits
> for ARM64 MTE restoring are removed.
>
> Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  mm/memory.c | 64 +++++++++++++++++++++++++++++++++++++++++++----------
>  1 file changed, 52 insertions(+), 12 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index c4a52e8d740a..9818dc1893c8 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3947,6 +3947,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	pte_t pte;
>  	vm_fault_t ret = 0;
>  	void *shadow = NULL;
> +	int nr_pages = 1;
> +	unsigned long start_address = vmf->address;
> +	pte_t *start_pte = vmf->pte;

IMHO, it's better to rename the above 2 local variables to "address" and
"ptep".  Just my personal opinion.  Feel free to ignore the comments.

> +	bool any_swap_shared = false;
>  
>  	if (!pte_unmap_same(vmf))
>  		goto out;
> @@ -4137,6 +4141,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	 */
>  	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
>  			&vmf->ptl);

We should move pte check here.  That is,

  	if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
  		goto out_nomap;

This will simplify the situation for large folio.

> +
> +	/* We hit large folios in swapcache */

The comments seems unnecessary because the code tells that already.

> +	if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio)) {
> +		int nr = folio_nr_pages(folio);
> +		int idx = folio_page_idx(folio, page);
> +		unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
> +		unsigned long folio_end = folio_start + nr * PAGE_SIZE;
> +		pte_t *folio_ptep;
> +		pte_t folio_pte;
> +
> +		if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
> +			goto check_pte;
> +		if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
> +			goto check_pte;
> +
> +		folio_ptep = vmf->pte - idx;
> +		folio_pte = ptep_get(folio_ptep);

It's better to construct pte based on fault PTE via generalizing
pte_next_swp_offset() (may be pte_move_swp_offset()).  Then we can find
inconsistent PTEs quicker.

> +		if (!is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte)) ||
> +		    swap_pte_batch(folio_ptep, nr, folio_pte, &any_swap_shared) != nr)
> +			goto check_pte;
> +
> +		start_address = folio_start;
> +		start_pte = folio_ptep;
> +		nr_pages = nr;
> +		entry = folio->swap;
> +		page = &folio->page;
> +	}
> +
> +check_pte:
>  	if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
>  		goto out_nomap;
>  
> @@ -4190,6 +4223,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  			 */
>  			exclusive = false;
>  		}
> +
> +		/* Reuse the whole large folio iff all entries are exclusive */
> +		if (nr_pages > 1 && any_swap_shared)
> +			exclusive = false;
>  	}
>  
>  	/*
> @@ -4204,12 +4241,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	 * We're already holding a reference on the page but haven't mapped it
>  	 * yet.
>  	 */
> -	swap_free(entry);
> +	swap_free_nr(entry, nr_pages);
>  	if (should_try_to_free_swap(folio, vma, vmf->flags))
>  		folio_free_swap(folio);
>  
> -	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> -	dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
> +	folio_ref_add(folio, nr_pages - 1);
> +	add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> +	add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> +
>  	pte = mk_pte(page, vma->vm_page_prot);
>  
>  	/*
> @@ -4219,33 +4258,34 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	 * exclusivity.
>  	 */
>  	if (!folio_test_ksm(folio) &&
> -	    (exclusive || folio_ref_count(folio) == 1)) {
> +	    (exclusive || (folio_ref_count(folio) == nr_pages &&
> +			   folio_nr_pages(folio) == nr_pages))) {
>  		if (vmf->flags & FAULT_FLAG_WRITE) {
>  			pte = maybe_mkwrite(pte_mkdirty(pte), vma);
>  			vmf->flags &= ~FAULT_FLAG_WRITE;
>  		}
>  		rmap_flags |= RMAP_EXCLUSIVE;
>  	}
> -	flush_icache_page(vma, page);
> +	flush_icache_pages(vma, page, nr_pages);
>  	if (pte_swp_soft_dirty(vmf->orig_pte))
>  		pte = pte_mksoft_dirty(pte);
>  	if (pte_swp_uffd_wp(vmf->orig_pte))
>  		pte = pte_mkuffd_wp(pte);
> -	vmf->orig_pte = pte;
>  
>  	/* ksm created a completely new copy */
>  	if (unlikely(folio != swapcache && swapcache)) {
> -		folio_add_new_anon_rmap(folio, vma, vmf->address);
> +		folio_add_new_anon_rmap(folio, vma, start_address);
>  		folio_add_lru_vma(folio, vma);
>  	} else {
> -		folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
> -					rmap_flags);
> +		folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, start_address,
> +					 rmap_flags);
>  	}
>  
>  	VM_BUG_ON(!folio_test_anon(folio) ||
>  			(pte_write(pte) && !PageAnonExclusive(page)));
> -	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
> -	arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
> +	set_ptes(vma->vm_mm, start_address, start_pte, pte, nr_pages);
> +	vmf->orig_pte = ptep_get(vmf->pte);
> +	arch_do_swap_page(vma->vm_mm, vma, start_address, pte, pte);

Do we need to call arch_do_swap_page() for each subpage?  IIUC, the
corresponding arch_unmap_one() will be called for each subpage.

>  	folio_unlock(folio);
>  	if (folio != swapcache && swapcache) {
> @@ -4269,7 +4309,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	}
>  
>  	/* No need to invalidate - it was non-present before */
> -	update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
> +	update_mmu_cache_range(vmf, vma, start_address, start_pte, nr_pages);
>  unlock:
>  	if (vmf->pte)
>  		pte_unmap_unlock(vmf->pte, vmf->ptl);

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 1/5] mm: swap: introduce swap_free_nr() for batched swap_free()
  2024-04-15  8:34         ` Barry Song
@ 2024-04-15  8:51           ` Huang, Ying
  2024-04-15  9:01             ` Barry Song
  0 siblings, 1 reply; 54+ messages in thread
From: Huang, Ying @ 2024-04-15  8:51 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, baolin.wang, chrisl, david, hanchuanhua, hannes,
	hughd, kasong, ryan.roberts, surenb, v-songbaohua, willy, xiang,
	yosryahmed, yuzhao, ziy, linux-kernel

Barry Song <21cnbao@gmail.com> writes:

> On Mon, Apr 15, 2024 at 8:21 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Barry Song <21cnbao@gmail.com> writes:
>>
>> > On Mon, Apr 15, 2024 at 6:19 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Barry Song <21cnbao@gmail.com> writes:
>> >>
>> >> > From: Chuanhua Han <hanchuanhua@oppo.com>
>> >> >
>> >> > While swapping in a large folio, we need to free swaps related to the whole
>> >> > folio. To avoid frequently acquiring and releasing swap locks, it is better
>> >> > to introduce an API for batched free.
>> >> >
>> >> > Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
>> >> > Co-developed-by: Barry Song <v-songbaohua@oppo.com>
>> >> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>> >> > ---
>> >> >  include/linux/swap.h |  5 +++++
>> >> >  mm/swapfile.c        | 51 ++++++++++++++++++++++++++++++++++++++++++++
>> >> >  2 files changed, 56 insertions(+)
>> >> >
>> >> > diff --git a/include/linux/swap.h b/include/linux/swap.h
>> >> > index 11c53692f65f..b7a107e983b8 100644
>> >> > --- a/include/linux/swap.h
>> >> > +++ b/include/linux/swap.h
>> >> > @@ -483,6 +483,7 @@ extern void swap_shmem_alloc(swp_entry_t);
>> >> >  extern int swap_duplicate(swp_entry_t);
>> >> >  extern int swapcache_prepare(swp_entry_t);
>> >> >  extern void swap_free(swp_entry_t);
>> >> > +extern void swap_free_nr(swp_entry_t entry, int nr_pages);
>> >> >  extern void swapcache_free_entries(swp_entry_t *entries, int n);
>> >> >  extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
>> >> >  int swap_type_of(dev_t device, sector_t offset);
>> >> > @@ -564,6 +565,10 @@ static inline void swap_free(swp_entry_t swp)
>> >> >  {
>> >> >  }
>> >> >
>> >> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
>> >> > +{
>> >> > +}
>> >> > +
>> >> >  static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
>> >> >  {
>> >> >  }
>> >> > diff --git a/mm/swapfile.c b/mm/swapfile.c
>> >> > index 28642c188c93..f4c65aeb088d 100644
>> >> > --- a/mm/swapfile.c
>> >> > +++ b/mm/swapfile.c
>> >> > @@ -1356,6 +1356,57 @@ void swap_free(swp_entry_t entry)
>> >> >               __swap_entry_free(p, entry);
>> >> >  }
>> >> >
>> >> > +/*
>> >> > + * Free up the maximum number of swap entries at once to limit the
>> >> > + * maximum kernel stack usage.
>> >> > + */
>> >> > +#define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 512 ? 512 : SWAPFILE_CLUSTER)
>> >> > +
>> >> > +/*
>> >> > + * Called after swapping in a large folio,
>> >>
>> >> IMHO, it's not good to document the caller in the function definition.
>> >> Because this will discourage function reusing.
>> >
>> > ok. right now there is only one user that is why it is added. but i agree
>> > we can actually remove this.
>> >
>> >>
>> >> > batched free swap entries
>> >> > + * for this large folio, entry should be for the first subpage and
>> >> > + * its offset is aligned with nr_pages
>> >>
>> >> Why do we need this?
>> >
>> > This is a fundamental requirement for the existing kernel, folio's
>> > swap offset is naturally aligned from the first moment add_to_swap
>> > to add swapcache's xa. so this comment is describing the existing
>> > fact. In the future, if we want to support swap-out folio to discontiguous
>> > and not-aligned offsets, we can't pass entry as the parameter, we should
>> > instead pass ptep or another different data struct which can connect
>> > multiple discontiguous swap offsets.
>> >
>> > I feel like we only need "for this large folio, entry should be for
>> > the first subpage" and drop "and its offset is aligned with nr_pages",
>> > the latter is not important to this context at all.
>>
>> IIUC, all these are requirements of the only caller now, not the
>> function itself.  If only part of the all swap entries of a mTHP are
>> called with swap_free_nr(), can swap_free_nr() still do its work?  If
>> so, why not make swap_free_nr() as general as possible?
>
> right , i believe we can make swap_free_nr() as general as possible.
>
>>
>> >>
>> >> > + */
>> >> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
>> >> > +{
>> >> > +     int i, j;
>> >> > +     struct swap_cluster_info *ci;
>> >> > +     struct swap_info_struct *p;
>> >> > +     unsigned int type = swp_type(entry);
>> >> > +     unsigned long offset = swp_offset(entry);
>> >> > +     int batch_nr, remain_nr;
>> >> > +     DECLARE_BITMAP(usage, SWAP_BATCH_NR) = { 0 };
>> >> > +
>> >> > +     /* all swap entries are within a cluster for mTHP */
>> >> > +     VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
>> >> > +
>> >> > +     if (nr_pages == 1) {
>> >> > +             swap_free(entry);
>> >> > +             return;
>> >> > +     }
>> >>
>> >> Is it possible to unify swap_free() and swap_free_nr() into one function
>> >> with acceptable performance?  IIUC, the general rule in mTHP effort is
>> >> to avoid duplicate functions between mTHP and normal small folio.
>> >> Right?
>> >
>> > I don't see why.
>>
>> Because duplicated implementation are hard to maintain in the long term.
>
> sorry, i actually meant "I don't see why not",  for some reason, the "not"
> was missed. Obviously I meant "why not", there was a "but" after it :-)
>
>>
>> > but we have lots of places calling swap_free(), we may
>> > have to change them all to call swap_free_nr(entry, 1); the other possible
>> > way is making swap_free() a wrapper of swap_free_nr() always using
>> > 1 as the argument. In both cases, we are changing the semantics of
>> > swap_free_nr() to partially freeing large folio cases and have to drop
>> > "entry should be for the first subpage" then.
>> >
>> > Right now, the semantics is
>> > * swap_free_nr() for an entire large folio;
>> > * swap_free() for one entry of either a large folio or a small folio
>>
>> As above, I don't think the these semantics are important for
>> swap_free_nr() implementation.
>
> right. I agree. If we are ready to change all those callers, nothing
> can stop us from removing swap_free().
>
>>
>> >>
>> >> > +
>> >> > +     remain_nr = nr_pages;
>> >> > +     p = _swap_info_get(entry);
>> >> > +     if (p) {
>> >> > +             for (i = 0; i < nr_pages; i += batch_nr) {
>> >> > +                     batch_nr = min_t(int, SWAP_BATCH_NR, remain_nr);
>> >> > +
>> >> > +                     ci = lock_cluster_or_swap_info(p, offset);
>> >> > +                     for (j = 0; j < batch_nr; j++) {
>> >> > +                             if (__swap_entry_free_locked(p, offset + i * SWAP_BATCH_NR + j, 1))
>> >> > +                                     __bitmap_set(usage, j, 1);
>> >> > +                     }
>> >> > +                     unlock_cluster_or_swap_info(p, ci);
>> >> > +
>> >> > +                     for_each_clear_bit(j, usage, batch_nr)
>> >> > +                             free_swap_slot(swp_entry(type, offset + i * SWAP_BATCH_NR + j));
>> >> > +
>> >> > +                     bitmap_clear(usage, 0, SWAP_BATCH_NR);
>> >> > +                     remain_nr -= batch_nr;
>> >> > +             }
>> >> > +     }
>> >> > +}
>> >> > +
>> >> >  /*
>> >> >   * Called after dropping swapcache to decrease refcnt to swap entries.
>> >> >   */
>> >>
>> >> put_swap_folio() implements batching in another method.  Do you think
>> >> that it's good to use the batching method in that function here?  It
>> >> avoids to use bitmap operations and stack space.
>> >
>> > Chuanhua has strictly limited the maximum stack usage to several
>> > unsigned long,
>>
>> 512 / 8 = 64 bytes.
>>
>> So, not trivial.
>>
>> > so this should be safe. on the other hand, i believe this
>> > implementation is more efficient, as  put_swap_folio() might lock/
>> > unlock much more often whenever __swap_entry_free_locked returns
>> > 0.
>>
>> There are 2 most common use cases,
>>
>> - all swap entries have usage count == 0
>> - all swap entries have usage count != 0
>>
>> In both cases, we only need to lock/unlock once.  In fact, I didn't
>> find possible use cases other than above.
>
> i guess the point is free_swap_slot() shouldn't be called within
> lock_cluster_or_swap_info? so when we are freeing nr_pages slots,
> we'll have to unlock and lock nr_pages times?  and this is the most
> common scenario.

No.  In put_swap_folio(), free_entries is either SWAPFILE_CLUSTER (that
is, nr_pages) or 0.  These are the most common cases.

> void put_swap_folio(struct folio *folio, swp_entry_t entry)
> {
>         ...
>
>         ci = lock_cluster_or_swap_info(si, offset);
>         ...
>         for (i = 0; i < size; i++, entry.val++) {
>                 if (!__swap_entry_free_locked(si, offset + i, SWAP_HAS_CACHE)) {
>                         unlock_cluster_or_swap_info(si, ci);
>                         free_swap_slot(entry);
>                         if (i == size - 1)
>                                 return;
>                         lock_cluster_or_swap_info(si, offset);
>                 }
>         }
>         unlock_cluster_or_swap_info(si, ci);
> }
>
>>
>> And, we should add batching in __swap_entry_free().  That will help
>> free_swap_and_cache_nr() too.

Please consider this too.

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 4/5] mm: swap: entirely map large folios found in swapcache
  2024-04-15  8:37   ` Huang, Ying
@ 2024-04-15  8:53     ` Barry Song
  2024-04-16  2:25       ` Huang, Ying
  0 siblings, 1 reply; 54+ messages in thread
From: Barry Song @ 2024-04-15  8:53 UTC (permalink / raw)
  To: Huang, Ying
  Cc: akpm, linux-mm, baolin.wang, chrisl, david, hanchuanhua, hannes,
	hughd, kasong, ryan.roberts, surenb, v-songbaohua, willy, xiang,
	yosryahmed, yuzhao, ziy, linux-kernel

On Mon, Apr 15, 2024 at 8:39 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > From: Chuanhua Han <hanchuanhua@oppo.com>
> >
> > When a large folio is found in the swapcache, the current implementation
> > requires calling do_swap_page() nr_pages times, resulting in nr_pages
> > page faults. This patch opts to map the entire large folio at once to
> > minimize page faults. Additionally, redundant checks and early exits
> > for ARM64 MTE restoring are removed.
> >
> > Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> > Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > ---
> >  mm/memory.c | 64 +++++++++++++++++++++++++++++++++++++++++++----------
> >  1 file changed, 52 insertions(+), 12 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index c4a52e8d740a..9818dc1893c8 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3947,6 +3947,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >       pte_t pte;
> >       vm_fault_t ret = 0;
> >       void *shadow = NULL;
> > +     int nr_pages = 1;
> > +     unsigned long start_address = vmf->address;
> > +     pte_t *start_pte = vmf->pte;
>
> IMHO, it's better to rename the above 2 local variables to "address" and
> "ptep".  Just my personal opinion.  Feel free to ignore the comments.

fine.

>
> > +     bool any_swap_shared = false;
> >
> >       if (!pte_unmap_same(vmf))
> >               goto out;
> > @@ -4137,6 +4141,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >        */
> >       vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> >                       &vmf->ptl);
>
> We should move pte check here.  That is,
>
>         if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
>                 goto out_nomap;
>
> This will simplify the situation for large folio.

the plan is moving the whole code block

if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio))

after
        if (unlikely(!folio_test_uptodate(folio))) {
                ret = VM_FAULT_SIGBUS;
                goto out_nomap;
        }

though we couldn't be !folio_test_uptodate(folio)) for hitting
swapcache but it seems
logically better for future use.

>
> > +
> > +     /* We hit large folios in swapcache */
>
> The comments seems unnecessary because the code tells that already.
>
> > +     if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio)) {
> > +             int nr = folio_nr_pages(folio);
> > +             int idx = folio_page_idx(folio, page);
> > +             unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
> > +             unsigned long folio_end = folio_start + nr * PAGE_SIZE;
> > +             pte_t *folio_ptep;
> > +             pte_t folio_pte;
> > +
> > +             if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
> > +                     goto check_pte;
> > +             if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
> > +                     goto check_pte;
> > +
> > +             folio_ptep = vmf->pte - idx;
> > +             folio_pte = ptep_get(folio_ptep);
>
> It's better to construct pte based on fault PTE via generalizing
> pte_next_swp_offset() (may be pte_move_swp_offset()).  Then we can find
> inconsistent PTEs quicker.

it seems your point is getting the pte of page0 by pte_next_swp_offset()
unfortunately pte_next_swp_offset can't go back. on the other hand,
we have to check the real pte value of the 0nd entry right now because
swap_pte_batch() only really reads pte from the 1st entry. it assumes
pte argument is the real value for the 0nd pte entry.

static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte)
{
        pte_t expected_pte = pte_next_swp_offset(pte);
        const pte_t *end_ptep = start_ptep + max_nr;
        pte_t *ptep = start_ptep + 1;

        VM_WARN_ON(max_nr < 1);
        VM_WARN_ON(!is_swap_pte(pte));
        VM_WARN_ON(non_swap_entry(pte_to_swp_entry(pte)));

        while (ptep < end_ptep) {
                pte = ptep_get(ptep);

                if (!pte_same(pte, expected_pte))
                        break;

                expected_pte = pte_next_swp_offset(expected_pte);
                ptep++;
        }

        return ptep - start_ptep;
}

>
> > +             if (!is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte)) ||
> > +                 swap_pte_batch(folio_ptep, nr, folio_pte, &any_swap_shared) != nr)
> > +                     goto check_pte;
> > +
> > +             start_address = folio_start;
> > +             start_pte = folio_ptep;
> > +             nr_pages = nr;
> > +             entry = folio->swap;
> > +             page = &folio->page;
> > +     }
> > +
> > +check_pte:
> >       if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
> >               goto out_nomap;
> >
> > @@ -4190,6 +4223,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >                        */
> >                       exclusive = false;
> >               }
> > +
> > +             /* Reuse the whole large folio iff all entries are exclusive */
> > +             if (nr_pages > 1 && any_swap_shared)
> > +                     exclusive = false;
> >       }
> >
> >       /*
> > @@ -4204,12 +4241,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >        * We're already holding a reference on the page but haven't mapped it
> >        * yet.
> >        */
> > -     swap_free(entry);
> > +     swap_free_nr(entry, nr_pages);
> >       if (should_try_to_free_swap(folio, vma, vmf->flags))
> >               folio_free_swap(folio);
> >
> > -     inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> > -     dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
> > +     folio_ref_add(folio, nr_pages - 1);
> > +     add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> > +     add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> > +
> >       pte = mk_pte(page, vma->vm_page_prot);
> >
> >       /*
> > @@ -4219,33 +4258,34 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >        * exclusivity.
> >        */
> >       if (!folio_test_ksm(folio) &&
> > -         (exclusive || folio_ref_count(folio) == 1)) {
> > +         (exclusive || (folio_ref_count(folio) == nr_pages &&
> > +                        folio_nr_pages(folio) == nr_pages))) {
> >               if (vmf->flags & FAULT_FLAG_WRITE) {
> >                       pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> >                       vmf->flags &= ~FAULT_FLAG_WRITE;
> >               }
> >               rmap_flags |= RMAP_EXCLUSIVE;
> >       }
> > -     flush_icache_page(vma, page);
> > +     flush_icache_pages(vma, page, nr_pages);
> >       if (pte_swp_soft_dirty(vmf->orig_pte))
> >               pte = pte_mksoft_dirty(pte);
> >       if (pte_swp_uffd_wp(vmf->orig_pte))
> >               pte = pte_mkuffd_wp(pte);
> > -     vmf->orig_pte = pte;
> >
> >       /* ksm created a completely new copy */
> >       if (unlikely(folio != swapcache && swapcache)) {
> > -             folio_add_new_anon_rmap(folio, vma, vmf->address);
> > +             folio_add_new_anon_rmap(folio, vma, start_address);
> >               folio_add_lru_vma(folio, vma);
> >       } else {
> > -             folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
> > -                                     rmap_flags);
> > +             folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, start_address,
> > +                                      rmap_flags);
> >       }
> >
> >       VM_BUG_ON(!folio_test_anon(folio) ||
> >                       (pte_write(pte) && !PageAnonExclusive(page)));
> > -     set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
> > -     arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
> > +     set_ptes(vma->vm_mm, start_address, start_pte, pte, nr_pages);
> > +     vmf->orig_pte = ptep_get(vmf->pte);
> > +     arch_do_swap_page(vma->vm_mm, vma, start_address, pte, pte);
>
> Do we need to call arch_do_swap_page() for each subpage?  IIUC, the
> corresponding arch_unmap_one() will be called for each subpage.

i actually thought about this very carefully, right now, the only one who
needs this is sparc and it doesn't support THP_SWAPOUT at all. and
there is no proof doing restoration one by one won't really break sparc.
so i'd like to defer this to when sparc really needs THP_SWAPOUT.
on the other hand, it seems really bad we have both

arch_swap_restore  - for this, arm64 has moved to using folio
and
arch_do_swap_page

we should somehow unify them later if sparc wants THP_SWPOUT.

>
> >       folio_unlock(folio);
> >       if (folio != swapcache && swapcache) {
> > @@ -4269,7 +4309,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >       }
> >
> >       /* No need to invalidate - it was non-present before */
> > -     update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
> > +     update_mmu_cache_range(vmf, vma, start_address, start_pte, nr_pages);
> >  unlock:
> >       if (vmf->pte)
> >               pte_unmap_unlock(vmf->pte, vmf->ptl);
>
> --
> Best Regards,
> Huang, Ying

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 1/5] mm: swap: introduce swap_free_nr() for batched swap_free()
  2024-04-15  8:51           ` Huang, Ying
@ 2024-04-15  9:01             ` Barry Song
  2024-04-16  1:40               ` Huang, Ying
  0 siblings, 1 reply; 54+ messages in thread
From: Barry Song @ 2024-04-15  9:01 UTC (permalink / raw)
  To: Huang, Ying
  Cc: akpm, linux-mm, baolin.wang, chrisl, david, hanchuanhua, hannes,
	hughd, kasong, ryan.roberts, surenb, v-songbaohua, willy, xiang,
	yosryahmed, yuzhao, ziy, linux-kernel

On Mon, Apr 15, 2024 at 8:53 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > On Mon, Apr 15, 2024 at 8:21 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Barry Song <21cnbao@gmail.com> writes:
> >>
> >> > On Mon, Apr 15, 2024 at 6:19 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Barry Song <21cnbao@gmail.com> writes:
> >> >>
> >> >> > From: Chuanhua Han <hanchuanhua@oppo.com>
> >> >> >
> >> >> > While swapping in a large folio, we need to free swaps related to the whole
> >> >> > folio. To avoid frequently acquiring and releasing swap locks, it is better
> >> >> > to introduce an API for batched free.
> >> >> >
> >> >> > Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> >> >> > Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> >> >> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> >> >> > ---
> >> >> >  include/linux/swap.h |  5 +++++
> >> >> >  mm/swapfile.c        | 51 ++++++++++++++++++++++++++++++++++++++++++++
> >> >> >  2 files changed, 56 insertions(+)
> >> >> >
> >> >> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> >> >> > index 11c53692f65f..b7a107e983b8 100644
> >> >> > --- a/include/linux/swap.h
> >> >> > +++ b/include/linux/swap.h
> >> >> > @@ -483,6 +483,7 @@ extern void swap_shmem_alloc(swp_entry_t);
> >> >> >  extern int swap_duplicate(swp_entry_t);
> >> >> >  extern int swapcache_prepare(swp_entry_t);
> >> >> >  extern void swap_free(swp_entry_t);
> >> >> > +extern void swap_free_nr(swp_entry_t entry, int nr_pages);
> >> >> >  extern void swapcache_free_entries(swp_entry_t *entries, int n);
> >> >> >  extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
> >> >> >  int swap_type_of(dev_t device, sector_t offset);
> >> >> > @@ -564,6 +565,10 @@ static inline void swap_free(swp_entry_t swp)
> >> >> >  {
> >> >> >  }
> >> >> >
> >> >> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
> >> >> > +{
> >> >> > +}
> >> >> > +
> >> >> >  static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
> >> >> >  {
> >> >> >  }
> >> >> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> >> >> > index 28642c188c93..f4c65aeb088d 100644
> >> >> > --- a/mm/swapfile.c
> >> >> > +++ b/mm/swapfile.c
> >> >> > @@ -1356,6 +1356,57 @@ void swap_free(swp_entry_t entry)
> >> >> >               __swap_entry_free(p, entry);
> >> >> >  }
> >> >> >
> >> >> > +/*
> >> >> > + * Free up the maximum number of swap entries at once to limit the
> >> >> > + * maximum kernel stack usage.
> >> >> > + */
> >> >> > +#define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 512 ? 512 : SWAPFILE_CLUSTER)
> >> >> > +
> >> >> > +/*
> >> >> > + * Called after swapping in a large folio,
> >> >>
> >> >> IMHO, it's not good to document the caller in the function definition.
> >> >> Because this will discourage function reusing.
> >> >
> >> > ok. right now there is only one user that is why it is added. but i agree
> >> > we can actually remove this.
> >> >
> >> >>
> >> >> > batched free swap entries
> >> >> > + * for this large folio, entry should be for the first subpage and
> >> >> > + * its offset is aligned with nr_pages
> >> >>
> >> >> Why do we need this?
> >> >
> >> > This is a fundamental requirement for the existing kernel, folio's
> >> > swap offset is naturally aligned from the first moment add_to_swap
> >> > to add swapcache's xa. so this comment is describing the existing
> >> > fact. In the future, if we want to support swap-out folio to discontiguous
> >> > and not-aligned offsets, we can't pass entry as the parameter, we should
> >> > instead pass ptep or another different data struct which can connect
> >> > multiple discontiguous swap offsets.
> >> >
> >> > I feel like we only need "for this large folio, entry should be for
> >> > the first subpage" and drop "and its offset is aligned with nr_pages",
> >> > the latter is not important to this context at all.
> >>
> >> IIUC, all these are requirements of the only caller now, not the
> >> function itself.  If only part of the all swap entries of a mTHP are
> >> called with swap_free_nr(), can swap_free_nr() still do its work?  If
> >> so, why not make swap_free_nr() as general as possible?
> >
> > right , i believe we can make swap_free_nr() as general as possible.
> >
> >>
> >> >>
> >> >> > + */
> >> >> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
> >> >> > +{
> >> >> > +     int i, j;
> >> >> > +     struct swap_cluster_info *ci;
> >> >> > +     struct swap_info_struct *p;
> >> >> > +     unsigned int type = swp_type(entry);
> >> >> > +     unsigned long offset = swp_offset(entry);
> >> >> > +     int batch_nr, remain_nr;
> >> >> > +     DECLARE_BITMAP(usage, SWAP_BATCH_NR) = { 0 };
> >> >> > +
> >> >> > +     /* all swap entries are within a cluster for mTHP */
> >> >> > +     VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
> >> >> > +
> >> >> > +     if (nr_pages == 1) {
> >> >> > +             swap_free(entry);
> >> >> > +             return;
> >> >> > +     }
> >> >>
> >> >> Is it possible to unify swap_free() and swap_free_nr() into one function
> >> >> with acceptable performance?  IIUC, the general rule in mTHP effort is
> >> >> to avoid duplicate functions between mTHP and normal small folio.
> >> >> Right?
> >> >
> >> > I don't see why.
> >>
> >> Because duplicated implementation are hard to maintain in the long term.
> >
> > sorry, i actually meant "I don't see why not",  for some reason, the "not"
> > was missed. Obviously I meant "why not", there was a "but" after it :-)
> >
> >>
> >> > but we have lots of places calling swap_free(), we may
> >> > have to change them all to call swap_free_nr(entry, 1); the other possible
> >> > way is making swap_free() a wrapper of swap_free_nr() always using
> >> > 1 as the argument. In both cases, we are changing the semantics of
> >> > swap_free_nr() to partially freeing large folio cases and have to drop
> >> > "entry should be for the first subpage" then.
> >> >
> >> > Right now, the semantics is
> >> > * swap_free_nr() for an entire large folio;
> >> > * swap_free() for one entry of either a large folio or a small folio
> >>
> >> As above, I don't think the these semantics are important for
> >> swap_free_nr() implementation.
> >
> > right. I agree. If we are ready to change all those callers, nothing
> > can stop us from removing swap_free().
> >
> >>
> >> >>
> >> >> > +
> >> >> > +     remain_nr = nr_pages;
> >> >> > +     p = _swap_info_get(entry);
> >> >> > +     if (p) {
> >> >> > +             for (i = 0; i < nr_pages; i += batch_nr) {
> >> >> > +                     batch_nr = min_t(int, SWAP_BATCH_NR, remain_nr);
> >> >> > +
> >> >> > +                     ci = lock_cluster_or_swap_info(p, offset);
> >> >> > +                     for (j = 0; j < batch_nr; j++) {
> >> >> > +                             if (__swap_entry_free_locked(p, offset + i * SWAP_BATCH_NR + j, 1))
> >> >> > +                                     __bitmap_set(usage, j, 1);
> >> >> > +                     }
> >> >> > +                     unlock_cluster_or_swap_info(p, ci);
> >> >> > +
> >> >> > +                     for_each_clear_bit(j, usage, batch_nr)
> >> >> > +                             free_swap_slot(swp_entry(type, offset + i * SWAP_BATCH_NR + j));
> >> >> > +
> >> >> > +                     bitmap_clear(usage, 0, SWAP_BATCH_NR);
> >> >> > +                     remain_nr -= batch_nr;
> >> >> > +             }
> >> >> > +     }
> >> >> > +}
> >> >> > +
> >> >> >  /*
> >> >> >   * Called after dropping swapcache to decrease refcnt to swap entries.
> >> >> >   */
> >> >>
> >> >> put_swap_folio() implements batching in another method.  Do you think
> >> >> that it's good to use the batching method in that function here?  It
> >> >> avoids to use bitmap operations and stack space.
> >> >
> >> > Chuanhua has strictly limited the maximum stack usage to several
> >> > unsigned long,
> >>
> >> 512 / 8 = 64 bytes.
> >>
> >> So, not trivial.
> >>
> >> > so this should be safe. on the other hand, i believe this
> >> > implementation is more efficient, as  put_swap_folio() might lock/
> >> > unlock much more often whenever __swap_entry_free_locked returns
> >> > 0.
> >>
> >> There are 2 most common use cases,
> >>
> >> - all swap entries have usage count == 0
> >> - all swap entries have usage count != 0
> >>
> >> In both cases, we only need to lock/unlock once.  In fact, I didn't
> >> find possible use cases other than above.
> >
> > i guess the point is free_swap_slot() shouldn't be called within
> > lock_cluster_or_swap_info? so when we are freeing nr_pages slots,
> > we'll have to unlock and lock nr_pages times?  and this is the most
> > common scenario.
>
> No.  In put_swap_folio(), free_entries is either SWAPFILE_CLUSTER (that
> is, nr_pages) or 0.  These are the most common cases.
>

i am actually talking about the below code path,

void put_swap_folio(struct folio *folio, swp_entry_t entry)
{
        ...

        ci = lock_cluster_or_swap_info(si, offset);
        ...
        for (i = 0; i < size; i++, entry.val++) {
                if (!__swap_entry_free_locked(si, offset + i, SWAP_HAS_CACHE)) {
                        unlock_cluster_or_swap_info(si, ci);
                        free_swap_slot(entry);
                        if (i == size - 1)
                                return;
                        lock_cluster_or_swap_info(si, offset);
                }
        }
        unlock_cluster_or_swap_info(si, ci);
}

but i guess you are talking about the below code path:

void put_swap_folio(struct folio *folio, swp_entry_t entry)
{
        ...

        ci = lock_cluster_or_swap_info(si, offset);
        if (size == SWAPFILE_CLUSTER) {
                map = si->swap_map + offset;
                for (i = 0; i < SWAPFILE_CLUSTER; i++) {
                        val = map[i];
                        VM_BUG_ON(!(val & SWAP_HAS_CACHE));
                        if (val == SWAP_HAS_CACHE)
                                free_entries++;
                }
                if (free_entries == SWAPFILE_CLUSTER) {
                        unlock_cluster_or_swap_info(si, ci);
                        spin_lock(&si->lock);
                        mem_cgroup_uncharge_swap(entry, SWAPFILE_CLUSTER);
                        swap_free_cluster(si, idx);
                        spin_unlock(&si->lock);
                        return;
                }
        }
}

we are mTHP, so we can't assume our size is SWAPFILE_CLUSTER?
or you want to check free_entries == "1 << swap_entry_order(folio_order(folio))"
instead of SWAPFILE_CLUSTER for the "for (i = 0; i < size; i++, entry.val++)"
path?


> >>
> >> And, we should add batching in __swap_entry_free().  That will help
> >> free_swap_and_cache_nr() too.

Chris Li and I actually discussed it before, while I completely
agree this can be batched. but i'd like to defer this as an incremental
patchset later to keep this swapcache-refault small.

>
> Please consider this too.
>
> --
> Best Regards,
> Huang, Ying

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 1/5] mm: swap: introduce swap_free_nr() for batched swap_free()
  2024-04-15  9:01             ` Barry Song
@ 2024-04-16  1:40               ` Huang, Ying
  2024-04-16  2:08                 ` Barry Song
  0 siblings, 1 reply; 54+ messages in thread
From: Huang, Ying @ 2024-04-16  1:40 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, baolin.wang, chrisl, david, hanchuanhua, hannes,
	hughd, kasong, ryan.roberts, surenb, v-songbaohua, willy, xiang,
	yosryahmed, yuzhao, ziy, linux-kernel

Barry Song <21cnbao@gmail.com> writes:

> On Mon, Apr 15, 2024 at 8:53 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Barry Song <21cnbao@gmail.com> writes:
>>
>> > On Mon, Apr 15, 2024 at 8:21 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Barry Song <21cnbao@gmail.com> writes:
>> >>
>> >> > On Mon, Apr 15, 2024 at 6:19 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >>
>> >> >> Barry Song <21cnbao@gmail.com> writes:
>> >> >>
>> >> >> > From: Chuanhua Han <hanchuanhua@oppo.com>
>> >> >> >
>> >> >> > While swapping in a large folio, we need to free swaps related to the whole
>> >> >> > folio. To avoid frequently acquiring and releasing swap locks, it is better
>> >> >> > to introduce an API for batched free.
>> >> >> >
>> >> >> > Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
>> >> >> > Co-developed-by: Barry Song <v-songbaohua@oppo.com>
>> >> >> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>> >> >> > ---
>> >> >> >  include/linux/swap.h |  5 +++++
>> >> >> >  mm/swapfile.c        | 51 ++++++++++++++++++++++++++++++++++++++++++++
>> >> >> >  2 files changed, 56 insertions(+)
>> >> >> >
>> >> >> > diff --git a/include/linux/swap.h b/include/linux/swap.h
>> >> >> > index 11c53692f65f..b7a107e983b8 100644
>> >> >> > --- a/include/linux/swap.h
>> >> >> > +++ b/include/linux/swap.h
>> >> >> > @@ -483,6 +483,7 @@ extern void swap_shmem_alloc(swp_entry_t);
>> >> >> >  extern int swap_duplicate(swp_entry_t);
>> >> >> >  extern int swapcache_prepare(swp_entry_t);
>> >> >> >  extern void swap_free(swp_entry_t);
>> >> >> > +extern void swap_free_nr(swp_entry_t entry, int nr_pages);
>> >> >> >  extern void swapcache_free_entries(swp_entry_t *entries, int n);
>> >> >> >  extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
>> >> >> >  int swap_type_of(dev_t device, sector_t offset);
>> >> >> > @@ -564,6 +565,10 @@ static inline void swap_free(swp_entry_t swp)
>> >> >> >  {
>> >> >> >  }
>> >> >> >
>> >> >> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
>> >> >> > +{
>> >> >> > +}
>> >> >> > +
>> >> >> >  static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
>> >> >> >  {
>> >> >> >  }
>> >> >> > diff --git a/mm/swapfile.c b/mm/swapfile.c
>> >> >> > index 28642c188c93..f4c65aeb088d 100644
>> >> >> > --- a/mm/swapfile.c
>> >> >> > +++ b/mm/swapfile.c
>> >> >> > @@ -1356,6 +1356,57 @@ void swap_free(swp_entry_t entry)
>> >> >> >               __swap_entry_free(p, entry);
>> >> >> >  }
>> >> >> >
>> >> >> > +/*
>> >> >> > + * Free up the maximum number of swap entries at once to limit the
>> >> >> > + * maximum kernel stack usage.
>> >> >> > + */
>> >> >> > +#define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 512 ? 512 : SWAPFILE_CLUSTER)
>> >> >> > +
>> >> >> > +/*
>> >> >> > + * Called after swapping in a large folio,
>> >> >>
>> >> >> IMHO, it's not good to document the caller in the function definition.
>> >> >> Because this will discourage function reusing.
>> >> >
>> >> > ok. right now there is only one user that is why it is added. but i agree
>> >> > we can actually remove this.
>> >> >
>> >> >>
>> >> >> > batched free swap entries
>> >> >> > + * for this large folio, entry should be for the first subpage and
>> >> >> > + * its offset is aligned with nr_pages
>> >> >>
>> >> >> Why do we need this?
>> >> >
>> >> > This is a fundamental requirement for the existing kernel, folio's
>> >> > swap offset is naturally aligned from the first moment add_to_swap
>> >> > to add swapcache's xa. so this comment is describing the existing
>> >> > fact. In the future, if we want to support swap-out folio to discontiguous
>> >> > and not-aligned offsets, we can't pass entry as the parameter, we should
>> >> > instead pass ptep or another different data struct which can connect
>> >> > multiple discontiguous swap offsets.
>> >> >
>> >> > I feel like we only need "for this large folio, entry should be for
>> >> > the first subpage" and drop "and its offset is aligned with nr_pages",
>> >> > the latter is not important to this context at all.
>> >>
>> >> IIUC, all these are requirements of the only caller now, not the
>> >> function itself.  If only part of the all swap entries of a mTHP are
>> >> called with swap_free_nr(), can swap_free_nr() still do its work?  If
>> >> so, why not make swap_free_nr() as general as possible?
>> >
>> > right , i believe we can make swap_free_nr() as general as possible.
>> >
>> >>
>> >> >>
>> >> >> > + */
>> >> >> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
>> >> >> > +{
>> >> >> > +     int i, j;
>> >> >> > +     struct swap_cluster_info *ci;
>> >> >> > +     struct swap_info_struct *p;
>> >> >> > +     unsigned int type = swp_type(entry);
>> >> >> > +     unsigned long offset = swp_offset(entry);
>> >> >> > +     int batch_nr, remain_nr;
>> >> >> > +     DECLARE_BITMAP(usage, SWAP_BATCH_NR) = { 0 };
>> >> >> > +
>> >> >> > +     /* all swap entries are within a cluster for mTHP */
>> >> >> > +     VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
>> >> >> > +
>> >> >> > +     if (nr_pages == 1) {
>> >> >> > +             swap_free(entry);
>> >> >> > +             return;
>> >> >> > +     }
>> >> >>
>> >> >> Is it possible to unify swap_free() and swap_free_nr() into one function
>> >> >> with acceptable performance?  IIUC, the general rule in mTHP effort is
>> >> >> to avoid duplicate functions between mTHP and normal small folio.
>> >> >> Right?
>> >> >
>> >> > I don't see why.
>> >>
>> >> Because duplicated implementation are hard to maintain in the long term.
>> >
>> > sorry, i actually meant "I don't see why not",  for some reason, the "not"
>> > was missed. Obviously I meant "why not", there was a "but" after it :-)
>> >
>> >>
>> >> > but we have lots of places calling swap_free(), we may
>> >> > have to change them all to call swap_free_nr(entry, 1); the other possible
>> >> > way is making swap_free() a wrapper of swap_free_nr() always using
>> >> > 1 as the argument. In both cases, we are changing the semantics of
>> >> > swap_free_nr() to partially freeing large folio cases and have to drop
>> >> > "entry should be for the first subpage" then.
>> >> >
>> >> > Right now, the semantics is
>> >> > * swap_free_nr() for an entire large folio;
>> >> > * swap_free() for one entry of either a large folio or a small folio
>> >>
>> >> As above, I don't think the these semantics are important for
>> >> swap_free_nr() implementation.
>> >
>> > right. I agree. If we are ready to change all those callers, nothing
>> > can stop us from removing swap_free().
>> >
>> >>
>> >> >>
>> >> >> > +
>> >> >> > +     remain_nr = nr_pages;
>> >> >> > +     p = _swap_info_get(entry);
>> >> >> > +     if (p) {
>> >> >> > +             for (i = 0; i < nr_pages; i += batch_nr) {
>> >> >> > +                     batch_nr = min_t(int, SWAP_BATCH_NR, remain_nr);
>> >> >> > +
>> >> >> > +                     ci = lock_cluster_or_swap_info(p, offset);
>> >> >> > +                     for (j = 0; j < batch_nr; j++) {
>> >> >> > +                             if (__swap_entry_free_locked(p, offset + i * SWAP_BATCH_NR + j, 1))
>> >> >> > +                                     __bitmap_set(usage, j, 1);
>> >> >> > +                     }
>> >> >> > +                     unlock_cluster_or_swap_info(p, ci);
>> >> >> > +
>> >> >> > +                     for_each_clear_bit(j, usage, batch_nr)
>> >> >> > +                             free_swap_slot(swp_entry(type, offset + i * SWAP_BATCH_NR + j));
>> >> >> > +
>> >> >> > +                     bitmap_clear(usage, 0, SWAP_BATCH_NR);
>> >> >> > +                     remain_nr -= batch_nr;
>> >> >> > +             }
>> >> >> > +     }
>> >> >> > +}
>> >> >> > +
>> >> >> >  /*
>> >> >> >   * Called after dropping swapcache to decrease refcnt to swap entries.
>> >> >> >   */
>> >> >>
>> >> >> put_swap_folio() implements batching in another method.  Do you think
>> >> >> that it's good to use the batching method in that function here?  It
>> >> >> avoids to use bitmap operations and stack space.
>> >> >
>> >> > Chuanhua has strictly limited the maximum stack usage to several
>> >> > unsigned long,
>> >>
>> >> 512 / 8 = 64 bytes.
>> >>
>> >> So, not trivial.
>> >>
>> >> > so this should be safe. on the other hand, i believe this
>> >> > implementation is more efficient, as  put_swap_folio() might lock/
>> >> > unlock much more often whenever __swap_entry_free_locked returns
>> >> > 0.
>> >>
>> >> There are 2 most common use cases,
>> >>
>> >> - all swap entries have usage count == 0
>> >> - all swap entries have usage count != 0
>> >>
>> >> In both cases, we only need to lock/unlock once.  In fact, I didn't
>> >> find possible use cases other than above.
>> >
>> > i guess the point is free_swap_slot() shouldn't be called within
>> > lock_cluster_or_swap_info? so when we are freeing nr_pages slots,
>> > we'll have to unlock and lock nr_pages times?  and this is the most
>> > common scenario.
>>
>> No.  In put_swap_folio(), free_entries is either SWAPFILE_CLUSTER (that
>> is, nr_pages) or 0.  These are the most common cases.
>>
>
> i am actually talking about the below code path,
>
> void put_swap_folio(struct folio *folio, swp_entry_t entry)
> {
>         ...
>
>         ci = lock_cluster_or_swap_info(si, offset);
>         ...
>         for (i = 0; i < size; i++, entry.val++) {
>                 if (!__swap_entry_free_locked(si, offset + i, SWAP_HAS_CACHE)) {
>                         unlock_cluster_or_swap_info(si, ci);
>                         free_swap_slot(entry);
>                         if (i == size - 1)
>                                 return;
>                         lock_cluster_or_swap_info(si, offset);
>                 }
>         }
>         unlock_cluster_or_swap_info(si, ci);
> }
>
> but i guess you are talking about the below code path:
>
> void put_swap_folio(struct folio *folio, swp_entry_t entry)
> {
>         ...
>
>         ci = lock_cluster_or_swap_info(si, offset);
>         if (size == SWAPFILE_CLUSTER) {
>                 map = si->swap_map + offset;
>                 for (i = 0; i < SWAPFILE_CLUSTER; i++) {
>                         val = map[i];
>                         VM_BUG_ON(!(val & SWAP_HAS_CACHE));
>                         if (val == SWAP_HAS_CACHE)
>                                 free_entries++;
>                 }
>                 if (free_entries == SWAPFILE_CLUSTER) {
>                         unlock_cluster_or_swap_info(si, ci);
>                         spin_lock(&si->lock);
>                         mem_cgroup_uncharge_swap(entry, SWAPFILE_CLUSTER);
>                         swap_free_cluster(si, idx);
>                         spin_unlock(&si->lock);
>                         return;
>                 }
>         }
> }

I am talking about both code paths.  In 2 most common cases,
__swap_entry_free_locked() will return 0 or !0 for all entries in range.

> we are mTHP, so we can't assume our size is SWAPFILE_CLUSTER?
> or you want to check free_entries == "1 << swap_entry_order(folio_order(folio))"
> instead of SWAPFILE_CLUSTER for the "for (i = 0; i < size; i++, entry.val++)"
> path?

Just replace SWAPFILE_CLUSTER with "nr_pages" in your code.

>
>> >>
>> >> And, we should add batching in __swap_entry_free().  That will help
>> >> free_swap_and_cache_nr() too.
>
> Chris Li and I actually discussed it before, while I completely
> agree this can be batched. but i'd like to defer this as an incremental
> patchset later to keep this swapcache-refault small.

OK.

>>
>> Please consider this too.

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 1/5] mm: swap: introduce swap_free_nr() for batched swap_free()
  2024-04-16  1:40               ` Huang, Ying
@ 2024-04-16  2:08                 ` Barry Song
  2024-04-16  3:11                   ` Huang, Ying
  0 siblings, 1 reply; 54+ messages in thread
From: Barry Song @ 2024-04-16  2:08 UTC (permalink / raw)
  To: Huang, Ying
  Cc: akpm, linux-mm, baolin.wang, chrisl, david, hanchuanhua, hannes,
	hughd, kasong, ryan.roberts, surenb, v-songbaohua, willy, xiang,
	yosryahmed, yuzhao, ziy, linux-kernel

On Tue, Apr 16, 2024 at 1:42 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > On Mon, Apr 15, 2024 at 8:53 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Barry Song <21cnbao@gmail.com> writes:
> >>
> >> > On Mon, Apr 15, 2024 at 8:21 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Barry Song <21cnbao@gmail.com> writes:
> >> >>
> >> >> > On Mon, Apr 15, 2024 at 6:19 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >>
> >> >> >> Barry Song <21cnbao@gmail.com> writes:
> >> >> >>
> >> >> >> > From: Chuanhua Han <hanchuanhua@oppo.com>
> >> >> >> >
> >> >> >> > While swapping in a large folio, we need to free swaps related to the whole
> >> >> >> > folio. To avoid frequently acquiring and releasing swap locks, it is better
> >> >> >> > to introduce an API for batched free.
> >> >> >> >
> >> >> >> > Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> >> >> >> > Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> >> >> >> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> >> >> >> > ---
> >> >> >> >  include/linux/swap.h |  5 +++++
> >> >> >> >  mm/swapfile.c        | 51 ++++++++++++++++++++++++++++++++++++++++++++
> >> >> >> >  2 files changed, 56 insertions(+)
> >> >> >> >
> >> >> >> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> >> >> >> > index 11c53692f65f..b7a107e983b8 100644
> >> >> >> > --- a/include/linux/swap.h
> >> >> >> > +++ b/include/linux/swap.h
> >> >> >> > @@ -483,6 +483,7 @@ extern void swap_shmem_alloc(swp_entry_t);
> >> >> >> >  extern int swap_duplicate(swp_entry_t);
> >> >> >> >  extern int swapcache_prepare(swp_entry_t);
> >> >> >> >  extern void swap_free(swp_entry_t);
> >> >> >> > +extern void swap_free_nr(swp_entry_t entry, int nr_pages);
> >> >> >> >  extern void swapcache_free_entries(swp_entry_t *entries, int n);
> >> >> >> >  extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
> >> >> >> >  int swap_type_of(dev_t device, sector_t offset);
> >> >> >> > @@ -564,6 +565,10 @@ static inline void swap_free(swp_entry_t swp)
> >> >> >> >  {
> >> >> >> >  }
> >> >> >> >
> >> >> >> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
> >> >> >> > +{
> >> >> >> > +}
> >> >> >> > +
> >> >> >> >  static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
> >> >> >> >  {
> >> >> >> >  }
> >> >> >> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> >> >> >> > index 28642c188c93..f4c65aeb088d 100644
> >> >> >> > --- a/mm/swapfile.c
> >> >> >> > +++ b/mm/swapfile.c
> >> >> >> > @@ -1356,6 +1356,57 @@ void swap_free(swp_entry_t entry)
> >> >> >> >               __swap_entry_free(p, entry);
> >> >> >> >  }
> >> >> >> >
> >> >> >> > +/*
> >> >> >> > + * Free up the maximum number of swap entries at once to limit the
> >> >> >> > + * maximum kernel stack usage.
> >> >> >> > + */
> >> >> >> > +#define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 512 ? 512 : SWAPFILE_CLUSTER)
> >> >> >> > +
> >> >> >> > +/*
> >> >> >> > + * Called after swapping in a large folio,
> >> >> >>
> >> >> >> IMHO, it's not good to document the caller in the function definition.
> >> >> >> Because this will discourage function reusing.
> >> >> >
> >> >> > ok. right now there is only one user that is why it is added. but i agree
> >> >> > we can actually remove this.
> >> >> >
> >> >> >>
> >> >> >> > batched free swap entries
> >> >> >> > + * for this large folio, entry should be for the first subpage and
> >> >> >> > + * its offset is aligned with nr_pages
> >> >> >>
> >> >> >> Why do we need this?
> >> >> >
> >> >> > This is a fundamental requirement for the existing kernel, folio's
> >> >> > swap offset is naturally aligned from the first moment add_to_swap
> >> >> > to add swapcache's xa. so this comment is describing the existing
> >> >> > fact. In the future, if we want to support swap-out folio to discontiguous
> >> >> > and not-aligned offsets, we can't pass entry as the parameter, we should
> >> >> > instead pass ptep or another different data struct which can connect
> >> >> > multiple discontiguous swap offsets.
> >> >> >
> >> >> > I feel like we only need "for this large folio, entry should be for
> >> >> > the first subpage" and drop "and its offset is aligned with nr_pages",
> >> >> > the latter is not important to this context at all.
> >> >>
> >> >> IIUC, all these are requirements of the only caller now, not the
> >> >> function itself.  If only part of the all swap entries of a mTHP are
> >> >> called with swap_free_nr(), can swap_free_nr() still do its work?  If
> >> >> so, why not make swap_free_nr() as general as possible?
> >> >
> >> > right , i believe we can make swap_free_nr() as general as possible.
> >> >
> >> >>
> >> >> >>
> >> >> >> > + */
> >> >> >> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
> >> >> >> > +{
> >> >> >> > +     int i, j;
> >> >> >> > +     struct swap_cluster_info *ci;
> >> >> >> > +     struct swap_info_struct *p;
> >> >> >> > +     unsigned int type = swp_type(entry);
> >> >> >> > +     unsigned long offset = swp_offset(entry);
> >> >> >> > +     int batch_nr, remain_nr;
> >> >> >> > +     DECLARE_BITMAP(usage, SWAP_BATCH_NR) = { 0 };
> >> >> >> > +
> >> >> >> > +     /* all swap entries are within a cluster for mTHP */
> >> >> >> > +     VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
> >> >> >> > +
> >> >> >> > +     if (nr_pages == 1) {
> >> >> >> > +             swap_free(entry);
> >> >> >> > +             return;
> >> >> >> > +     }
> >> >> >>
> >> >> >> Is it possible to unify swap_free() and swap_free_nr() into one function
> >> >> >> with acceptable performance?  IIUC, the general rule in mTHP effort is
> >> >> >> to avoid duplicate functions between mTHP and normal small folio.
> >> >> >> Right?
> >> >> >
> >> >> > I don't see why.
> >> >>
> >> >> Because duplicated implementation are hard to maintain in the long term.
> >> >
> >> > sorry, i actually meant "I don't see why not",  for some reason, the "not"
> >> > was missed. Obviously I meant "why not", there was a "but" after it :-)
> >> >
> >> >>
> >> >> > but we have lots of places calling swap_free(), we may
> >> >> > have to change them all to call swap_free_nr(entry, 1); the other possible
> >> >> > way is making swap_free() a wrapper of swap_free_nr() always using
> >> >> > 1 as the argument. In both cases, we are changing the semantics of
> >> >> > swap_free_nr() to partially freeing large folio cases and have to drop
> >> >> > "entry should be for the first subpage" then.
> >> >> >
> >> >> > Right now, the semantics is
> >> >> > * swap_free_nr() for an entire large folio;
> >> >> > * swap_free() for one entry of either a large folio or a small folio
> >> >>
> >> >> As above, I don't think the these semantics are important for
> >> >> swap_free_nr() implementation.
> >> >
> >> > right. I agree. If we are ready to change all those callers, nothing
> >> > can stop us from removing swap_free().
> >> >
> >> >>
> >> >> >>
> >> >> >> > +
> >> >> >> > +     remain_nr = nr_pages;
> >> >> >> > +     p = _swap_info_get(entry);
> >> >> >> > +     if (p) {
> >> >> >> > +             for (i = 0; i < nr_pages; i += batch_nr) {
> >> >> >> > +                     batch_nr = min_t(int, SWAP_BATCH_NR, remain_nr);
> >> >> >> > +
> >> >> >> > +                     ci = lock_cluster_or_swap_info(p, offset);
> >> >> >> > +                     for (j = 0; j < batch_nr; j++) {
> >> >> >> > +                             if (__swap_entry_free_locked(p, offset + i * SWAP_BATCH_NR + j, 1))
> >> >> >> > +                                     __bitmap_set(usage, j, 1);
> >> >> >> > +                     }
> >> >> >> > +                     unlock_cluster_or_swap_info(p, ci);
> >> >> >> > +
> >> >> >> > +                     for_each_clear_bit(j, usage, batch_nr)
> >> >> >> > +                             free_swap_slot(swp_entry(type, offset + i * SWAP_BATCH_NR + j));
> >> >> >> > +
> >> >> >> > +                     bitmap_clear(usage, 0, SWAP_BATCH_NR);
> >> >> >> > +                     remain_nr -= batch_nr;
> >> >> >> > +             }
> >> >> >> > +     }
> >> >> >> > +}
> >> >> >> > +
> >> >> >> >  /*
> >> >> >> >   * Called after dropping swapcache to decrease refcnt to swap entries.
> >> >> >> >   */
> >> >> >>
> >> >> >> put_swap_folio() implements batching in another method.  Do you think
> >> >> >> that it's good to use the batching method in that function here?  It
> >> >> >> avoids to use bitmap operations and stack space.
> >> >> >
> >> >> > Chuanhua has strictly limited the maximum stack usage to several
> >> >> > unsigned long,
> >> >>
> >> >> 512 / 8 = 64 bytes.
> >> >>
> >> >> So, not trivial.
> >> >>
> >> >> > so this should be safe. on the other hand, i believe this
> >> >> > implementation is more efficient, as  put_swap_folio() might lock/
> >> >> > unlock much more often whenever __swap_entry_free_locked returns
> >> >> > 0.
> >> >>
> >> >> There are 2 most common use cases,
> >> >>
> >> >> - all swap entries have usage count == 0
> >> >> - all swap entries have usage count != 0
> >> >>
> >> >> In both cases, we only need to lock/unlock once.  In fact, I didn't
> >> >> find possible use cases other than above.
> >> >
> >> > i guess the point is free_swap_slot() shouldn't be called within
> >> > lock_cluster_or_swap_info? so when we are freeing nr_pages slots,
> >> > we'll have to unlock and lock nr_pages times?  and this is the most
> >> > common scenario.
> >>
> >> No.  In put_swap_folio(), free_entries is either SWAPFILE_CLUSTER (that
> >> is, nr_pages) or 0.  These are the most common cases.
> >>
> >
> > i am actually talking about the below code path,
> >
> > void put_swap_folio(struct folio *folio, swp_entry_t entry)
> > {
> >         ...
> >
> >         ci = lock_cluster_or_swap_info(si, offset);
> >         ...
> >         for (i = 0; i < size; i++, entry.val++) {
> >                 if (!__swap_entry_free_locked(si, offset + i, SWAP_HAS_CACHE)) {
> >                         unlock_cluster_or_swap_info(si, ci);
> >                         free_swap_slot(entry);
> >                         if (i == size - 1)
> >                                 return;
> >                         lock_cluster_or_swap_info(si, offset);
> >                 }
> >         }
> >         unlock_cluster_or_swap_info(si, ci);
> > }
> >
> > but i guess you are talking about the below code path:
> >
> > void put_swap_folio(struct folio *folio, swp_entry_t entry)
> > {
> >         ...
> >
> >         ci = lock_cluster_or_swap_info(si, offset);
> >         if (size == SWAPFILE_CLUSTER) {
> >                 map = si->swap_map + offset;
> >                 for (i = 0; i < SWAPFILE_CLUSTER; i++) {
> >                         val = map[i];
> >                         VM_BUG_ON(!(val & SWAP_HAS_CACHE));
> >                         if (val == SWAP_HAS_CACHE)
> >                                 free_entries++;
> >                 }
> >                 if (free_entries == SWAPFILE_CLUSTER) {
> >                         unlock_cluster_or_swap_info(si, ci);
> >                         spin_lock(&si->lock);
> >                         mem_cgroup_uncharge_swap(entry, SWAPFILE_CLUSTER);
> >                         swap_free_cluster(si, idx);
> >                         spin_unlock(&si->lock);
> >                         return;
> >                 }
> >         }
> > }
>
> I am talking about both code paths.  In 2 most common cases,
> __swap_entry_free_locked() will return 0 or !0 for all entries in range.

I grasp your point, but if conditions involving 0 or non-0 values fail, we'll
end up repeatedly unlocking and locking. Picture a scenario with a large
folio shared by multiple processes. One process might unmap a portion
while another still holds an entire mapping to it. This could lead to situations
where free_entries doesn't equal 0 and free_entries doesn't equal
nr_pages, resulting in multiple unlock and lock operations.

Chuanhua has invested significant effort in following Ryan's suggestion
for the current approach, which generally handles all cases, especially
partial unmapping. Additionally, the widespread use of swap_free_nr()
as you suggested across various scenarios is noteworthy.

Unless there's evidence indicating performance issues or bugs, I believe
the current approach remains preferable.

>
> > we are mTHP, so we can't assume our size is SWAPFILE_CLUSTER?
> > or you want to check free_entries == "1 << swap_entry_order(folio_order(folio))"
> > instead of SWAPFILE_CLUSTER for the "for (i = 0; i < size; i++, entry.val++)"
> > path?
>
> Just replace SWAPFILE_CLUSTER with "nr_pages" in your code.
>
> >
> >> >>
> >> >> And, we should add batching in __swap_entry_free().  That will help
> >> >> free_swap_and_cache_nr() too.
> >
> > Chris Li and I actually discussed it before, while I completely
> > agree this can be batched. but i'd like to defer this as an incremental
> > patchset later to keep this swapcache-refault small.
>
> OK.
>
> >>
> >> Please consider this too.
>
> --
> Best Regards,
> Huang, Ying

Thanks
Barry

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 4/5] mm: swap: entirely map large folios found in swapcache
  2024-04-15  8:53     ` Barry Song
@ 2024-04-16  2:25       ` Huang, Ying
  2024-04-16  2:36         ` Barry Song
  0 siblings, 1 reply; 54+ messages in thread
From: Huang, Ying @ 2024-04-16  2:25 UTC (permalink / raw)
  To: Barry Song, Khalid Aziz
  Cc: akpm, linux-mm, baolin.wang, chrisl, david, hanchuanhua, hannes,
	hughd, kasong, ryan.roberts, surenb, v-songbaohua, willy, xiang,
	yosryahmed, yuzhao, ziy, linux-kernel


Added Khalid for arch_do_swap_page().

Barry Song <21cnbao@gmail.com> writes:

> On Mon, Apr 15, 2024 at 8:39 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Barry Song <21cnbao@gmail.com> writes:

[snip]

>>
>> > +     bool any_swap_shared = false;
>> >
>> >       if (!pte_unmap_same(vmf))
>> >               goto out;
>> > @@ -4137,6 +4141,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> >        */
>> >       vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
>> >                       &vmf->ptl);
>>
>> We should move pte check here.  That is,
>>
>>         if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
>>                 goto out_nomap;
>>
>> This will simplify the situation for large folio.
>
> the plan is moving the whole code block
>
> if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio))
>
> after
>         if (unlikely(!folio_test_uptodate(folio))) {
>                 ret = VM_FAULT_SIGBUS;
>                 goto out_nomap;
>         }
>
> though we couldn't be !folio_test_uptodate(folio)) for hitting
> swapcache but it seems
> logically better for future use.

LGTM, Thanks!

>>
>> > +
>> > +     /* We hit large folios in swapcache */
>>
>> The comments seems unnecessary because the code tells that already.
>>
>> > +     if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio)) {
>> > +             int nr = folio_nr_pages(folio);
>> > +             int idx = folio_page_idx(folio, page);
>> > +             unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
>> > +             unsigned long folio_end = folio_start + nr * PAGE_SIZE;
>> > +             pte_t *folio_ptep;
>> > +             pte_t folio_pte;
>> > +
>> > +             if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
>> > +                     goto check_pte;
>> > +             if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
>> > +                     goto check_pte;
>> > +
>> > +             folio_ptep = vmf->pte - idx;
>> > +             folio_pte = ptep_get(folio_ptep);
>>
>> It's better to construct pte based on fault PTE via generalizing
>> pte_next_swp_offset() (may be pte_move_swp_offset()).  Then we can find
>> inconsistent PTEs quicker.
>
> it seems your point is getting the pte of page0 by pte_next_swp_offset()
> unfortunately pte_next_swp_offset can't go back. on the other hand,
> we have to check the real pte value of the 0nd entry right now because
> swap_pte_batch() only really reads pte from the 1st entry. it assumes
> pte argument is the real value for the 0nd pte entry.
>
> static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte)
> {
>         pte_t expected_pte = pte_next_swp_offset(pte);
>         const pte_t *end_ptep = start_ptep + max_nr;
>         pte_t *ptep = start_ptep + 1;
>
>         VM_WARN_ON(max_nr < 1);
>         VM_WARN_ON(!is_swap_pte(pte));
>         VM_WARN_ON(non_swap_entry(pte_to_swp_entry(pte)));
>
>         while (ptep < end_ptep) {
>                 pte = ptep_get(ptep);
>
>                 if (!pte_same(pte, expected_pte))
>                         break;
>
>                 expected_pte = pte_next_swp_offset(expected_pte);
>                 ptep++;
>         }
>
>         return ptep - start_ptep;
> }

Yes.  You are right.

But we may check whether the pte of page0 is same as "vmf->orig_pte -
folio_page_idx()" (fake code).

You need to check the pte of page 0 anyway.

>>
>> > +             if (!is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte)) ||
>> > +                 swap_pte_batch(folio_ptep, nr, folio_pte, &any_swap_shared) != nr)
>> > +                     goto check_pte;
>> > +
>> > +             start_address = folio_start;
>> > +             start_pte = folio_ptep;
>> > +             nr_pages = nr;
>> > +             entry = folio->swap;
>> > +             page = &folio->page;
>> > +     }
>> > +
>> > +check_pte:
>> >       if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
>> >               goto out_nomap;
>> >
>> > @@ -4190,6 +4223,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> >                        */
>> >                       exclusive = false;
>> >               }
>> > +
>> > +             /* Reuse the whole large folio iff all entries are exclusive */
>> > +             if (nr_pages > 1 && any_swap_shared)
>> > +                     exclusive = false;
>> >       }
>> >
>> >       /*
>> > @@ -4204,12 +4241,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> >        * We're already holding a reference on the page but haven't mapped it
>> >        * yet.
>> >        */
>> > -     swap_free(entry);
>> > +     swap_free_nr(entry, nr_pages);
>> >       if (should_try_to_free_swap(folio, vma, vmf->flags))
>> >               folio_free_swap(folio);
>> >
>> > -     inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
>> > -     dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
>> > +     folio_ref_add(folio, nr_pages - 1);
>> > +     add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
>> > +     add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
>> > +
>> >       pte = mk_pte(page, vma->vm_page_prot);
>> >
>> >       /*
>> > @@ -4219,33 +4258,34 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> >        * exclusivity.
>> >        */
>> >       if (!folio_test_ksm(folio) &&
>> > -         (exclusive || folio_ref_count(folio) == 1)) {
>> > +         (exclusive || (folio_ref_count(folio) == nr_pages &&
>> > +                        folio_nr_pages(folio) == nr_pages))) {
>> >               if (vmf->flags & FAULT_FLAG_WRITE) {
>> >                       pte = maybe_mkwrite(pte_mkdirty(pte), vma);
>> >                       vmf->flags &= ~FAULT_FLAG_WRITE;
>> >               }
>> >               rmap_flags |= RMAP_EXCLUSIVE;
>> >       }
>> > -     flush_icache_page(vma, page);
>> > +     flush_icache_pages(vma, page, nr_pages);
>> >       if (pte_swp_soft_dirty(vmf->orig_pte))
>> >               pte = pte_mksoft_dirty(pte);
>> >       if (pte_swp_uffd_wp(vmf->orig_pte))
>> >               pte = pte_mkuffd_wp(pte);
>> > -     vmf->orig_pte = pte;
>> >
>> >       /* ksm created a completely new copy */
>> >       if (unlikely(folio != swapcache && swapcache)) {
>> > -             folio_add_new_anon_rmap(folio, vma, vmf->address);
>> > +             folio_add_new_anon_rmap(folio, vma, start_address);
>> >               folio_add_lru_vma(folio, vma);
>> >       } else {
>> > -             folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
>> > -                                     rmap_flags);
>> > +             folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, start_address,
>> > +                                      rmap_flags);
>> >       }
>> >
>> >       VM_BUG_ON(!folio_test_anon(folio) ||
>> >                       (pte_write(pte) && !PageAnonExclusive(page)));
>> > -     set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
>> > -     arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
>> > +     set_ptes(vma->vm_mm, start_address, start_pte, pte, nr_pages);
>> > +     vmf->orig_pte = ptep_get(vmf->pte);
>> > +     arch_do_swap_page(vma->vm_mm, vma, start_address, pte, pte);
>>
>> Do we need to call arch_do_swap_page() for each subpage?  IIUC, the
>> corresponding arch_unmap_one() will be called for each subpage.
>
> i actually thought about this very carefully, right now, the only one who
> needs this is sparc and it doesn't support THP_SWAPOUT at all. and
> there is no proof doing restoration one by one won't really break sparc.
> so i'd like to defer this to when sparc really needs THP_SWAPOUT.

Let's ask SPARC developer (Cced) for this.

IMHO, even if we cannot get help, we need to change code with our
understanding instead of deferring it.

> on the other hand, it seems really bad we have both
> arch_swap_restore  - for this, arm64 has moved to using folio
> and
> arch_do_swap_page
>
> we should somehow unify them later if sparc wants THP_SWPOUT.
>
>>
>> >       folio_unlock(folio);
>> >       if (folio != swapcache && swapcache) {
>> > @@ -4269,7 +4309,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> >       }
>> >
>> >       /* No need to invalidate - it was non-present before */
>> > -     update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
>> > +     update_mmu_cache_range(vmf, vma, start_address, start_pte, nr_pages);
>> >  unlock:
>> >       if (vmf->pte)
>> >               pte_unmap_unlock(vmf->pte, vmf->ptl);

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 4/5] mm: swap: entirely map large folios found in swapcache
  2024-04-16  2:25       ` Huang, Ying
@ 2024-04-16  2:36         ` Barry Song
  2024-04-16  2:39           ` Huang, Ying
  2024-04-18  9:55           ` Barry Song
  0 siblings, 2 replies; 54+ messages in thread
From: Barry Song @ 2024-04-16  2:36 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Khalid Aziz, akpm, linux-mm, baolin.wang, chrisl, david,
	hanchuanhua, hannes, hughd, kasong, ryan.roberts, surenb,
	v-songbaohua, willy, xiang, yosryahmed, yuzhao, ziy,
	linux-kernel

On Tue, Apr 16, 2024 at 2:27 PM Huang, Ying <ying.huang@intel.com> wrote:
>
>
> Added Khalid for arch_do_swap_page().
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > On Mon, Apr 15, 2024 at 8:39 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Barry Song <21cnbao@gmail.com> writes:
>
> [snip]
>
> >>
> >> > +     bool any_swap_shared = false;
> >> >
> >> >       if (!pte_unmap_same(vmf))
> >> >               goto out;
> >> > @@ -4137,6 +4141,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >> >        */
> >> >       vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> >> >                       &vmf->ptl);
> >>
> >> We should move pte check here.  That is,
> >>
> >>         if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
> >>                 goto out_nomap;
> >>
> >> This will simplify the situation for large folio.
> >
> > the plan is moving the whole code block
> >
> > if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio))
> >
> > after
> >         if (unlikely(!folio_test_uptodate(folio))) {
> >                 ret = VM_FAULT_SIGBUS;
> >                 goto out_nomap;
> >         }
> >
> > though we couldn't be !folio_test_uptodate(folio)) for hitting
> > swapcache but it seems
> > logically better for future use.
>
> LGTM, Thanks!
>
> >>
> >> > +
> >> > +     /* We hit large folios in swapcache */
> >>
> >> The comments seems unnecessary because the code tells that already.
> >>
> >> > +     if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio)) {
> >> > +             int nr = folio_nr_pages(folio);
> >> > +             int idx = folio_page_idx(folio, page);
> >> > +             unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
> >> > +             unsigned long folio_end = folio_start + nr * PAGE_SIZE;
> >> > +             pte_t *folio_ptep;
> >> > +             pte_t folio_pte;
> >> > +
> >> > +             if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
> >> > +                     goto check_pte;
> >> > +             if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
> >> > +                     goto check_pte;
> >> > +
> >> > +             folio_ptep = vmf->pte - idx;
> >> > +             folio_pte = ptep_get(folio_ptep);
> >>
> >> It's better to construct pte based on fault PTE via generalizing
> >> pte_next_swp_offset() (may be pte_move_swp_offset()).  Then we can find
> >> inconsistent PTEs quicker.
> >
> > it seems your point is getting the pte of page0 by pte_next_swp_offset()
> > unfortunately pte_next_swp_offset can't go back. on the other hand,
> > we have to check the real pte value of the 0nd entry right now because
> > swap_pte_batch() only really reads pte from the 1st entry. it assumes
> > pte argument is the real value for the 0nd pte entry.
> >
> > static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte)
> > {
> >         pte_t expected_pte = pte_next_swp_offset(pte);
> >         const pte_t *end_ptep = start_ptep + max_nr;
> >         pte_t *ptep = start_ptep + 1;
> >
> >         VM_WARN_ON(max_nr < 1);
> >         VM_WARN_ON(!is_swap_pte(pte));
> >         VM_WARN_ON(non_swap_entry(pte_to_swp_entry(pte)));
> >
> >         while (ptep < end_ptep) {
> >                 pte = ptep_get(ptep);
> >
> >                 if (!pte_same(pte, expected_pte))
> >                         break;
> >
> >                 expected_pte = pte_next_swp_offset(expected_pte);
> >                 ptep++;
> >         }
> >
> >         return ptep - start_ptep;
> > }
>
> Yes.  You are right.
>
> But we may check whether the pte of page0 is same as "vmf->orig_pte -
> folio_page_idx()" (fake code).

right, that is why we are reading and checking PTE0 before calling
swap_pte_batch()
right now.

  folio_ptep = vmf->pte - idx;
  folio_pte = ptep_get(folio_ptep);
  if (!is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte)) ||
      swap_pte_batch(folio_ptep, nr, folio_pte, &any_swap_shared) != nr)
   goto check_pte;

So, if I understand correctly, you're proposing that we should directly check
PTE0 in swap_pte_batch(). Personally, I don't have any objections to this idea.
However, I'd also like to hear the feedback from Ryan and David :-)

>
> You need to check the pte of page 0 anyway.
>
> >>
> >> > +             if (!is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte)) ||
> >> > +                 swap_pte_batch(folio_ptep, nr, folio_pte, &any_swap_shared) != nr)
> >> > +                     goto check_pte;
> >> > +
> >> > +             start_address = folio_start;
> >> > +             start_pte = folio_ptep;
> >> > +             nr_pages = nr;
> >> > +             entry = folio->swap;
> >> > +             page = &folio->page;
> >> > +     }
> >> > +
> >> > +check_pte:
> >> >       if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
> >> >               goto out_nomap;
> >> >
> >> > @@ -4190,6 +4223,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >> >                        */
> >> >                       exclusive = false;
> >> >               }
> >> > +
> >> > +             /* Reuse the whole large folio iff all entries are exclusive */
> >> > +             if (nr_pages > 1 && any_swap_shared)
> >> > +                     exclusive = false;
> >> >       }
> >> >
> >> >       /*
> >> > @@ -4204,12 +4241,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >> >        * We're already holding a reference on the page but haven't mapped it
> >> >        * yet.
> >> >        */
> >> > -     swap_free(entry);
> >> > +     swap_free_nr(entry, nr_pages);
> >> >       if (should_try_to_free_swap(folio, vma, vmf->flags))
> >> >               folio_free_swap(folio);
> >> >
> >> > -     inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> >> > -     dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
> >> > +     folio_ref_add(folio, nr_pages - 1);
> >> > +     add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> >> > +     add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> >> > +
> >> >       pte = mk_pte(page, vma->vm_page_prot);
> >> >
> >> >       /*
> >> > @@ -4219,33 +4258,34 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >> >        * exclusivity.
> >> >        */
> >> >       if (!folio_test_ksm(folio) &&
> >> > -         (exclusive || folio_ref_count(folio) == 1)) {
> >> > +         (exclusive || (folio_ref_count(folio) == nr_pages &&
> >> > +                        folio_nr_pages(folio) == nr_pages))) {
> >> >               if (vmf->flags & FAULT_FLAG_WRITE) {
> >> >                       pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> >> >                       vmf->flags &= ~FAULT_FLAG_WRITE;
> >> >               }
> >> >               rmap_flags |= RMAP_EXCLUSIVE;
> >> >       }
> >> > -     flush_icache_page(vma, page);
> >> > +     flush_icache_pages(vma, page, nr_pages);
> >> >       if (pte_swp_soft_dirty(vmf->orig_pte))
> >> >               pte = pte_mksoft_dirty(pte);
> >> >       if (pte_swp_uffd_wp(vmf->orig_pte))
> >> >               pte = pte_mkuffd_wp(pte);
> >> > -     vmf->orig_pte = pte;
> >> >
> >> >       /* ksm created a completely new copy */
> >> >       if (unlikely(folio != swapcache && swapcache)) {
> >> > -             folio_add_new_anon_rmap(folio, vma, vmf->address);
> >> > +             folio_add_new_anon_rmap(folio, vma, start_address);
> >> >               folio_add_lru_vma(folio, vma);
> >> >       } else {
> >> > -             folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
> >> > -                                     rmap_flags);
> >> > +             folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, start_address,
> >> > +                                      rmap_flags);
> >> >       }
> >> >
> >> >       VM_BUG_ON(!folio_test_anon(folio) ||
> >> >                       (pte_write(pte) && !PageAnonExclusive(page)));
> >> > -     set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
> >> > -     arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
> >> > +     set_ptes(vma->vm_mm, start_address, start_pte, pte, nr_pages);
> >> > +     vmf->orig_pte = ptep_get(vmf->pte);
> >> > +     arch_do_swap_page(vma->vm_mm, vma, start_address, pte, pte);
> >>
> >> Do we need to call arch_do_swap_page() for each subpage?  IIUC, the
> >> corresponding arch_unmap_one() will be called for each subpage.
> >
> > i actually thought about this very carefully, right now, the only one who
> > needs this is sparc and it doesn't support THP_SWAPOUT at all. and
> > there is no proof doing restoration one by one won't really break sparc.
> > so i'd like to defer this to when sparc really needs THP_SWAPOUT.
>
> Let's ask SPARC developer (Cced) for this.
>
> IMHO, even if we cannot get help, we need to change code with our
> understanding instead of deferring it.

ok. Thanks for Ccing sparc developers.

>
> > on the other hand, it seems really bad we have both
> > arch_swap_restore  - for this, arm64 has moved to using folio
> > and
> > arch_do_swap_page
> >
> > we should somehow unify them later if sparc wants THP_SWPOUT.
> >
> >>
> >> >       folio_unlock(folio);
> >> >       if (folio != swapcache && swapcache) {
> >> > @@ -4269,7 +4309,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >> >       }
> >> >
> >> >       /* No need to invalidate - it was non-present before */
> >> > -     update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
> >> > +     update_mmu_cache_range(vmf, vma, start_address, start_pte, nr_pages);
> >> >  unlock:
> >> >       if (vmf->pte)
> >> >               pte_unmap_unlock(vmf->pte, vmf->ptl);
>
> --
> Best Regards,
> Huang, Ying

Thanks
Barry

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 4/5] mm: swap: entirely map large folios found in swapcache
  2024-04-16  2:36         ` Barry Song
@ 2024-04-16  2:39           ` Huang, Ying
  2024-04-16  2:52             ` Barry Song
  2024-04-18  9:55           ` Barry Song
  1 sibling, 1 reply; 54+ messages in thread
From: Huang, Ying @ 2024-04-16  2:39 UTC (permalink / raw)
  To: Barry Song
  Cc: Khalid Aziz, akpm, linux-mm, baolin.wang, chrisl, david,
	hanchuanhua, hannes, hughd, kasong, ryan.roberts, surenb,
	v-songbaohua, willy, xiang, yosryahmed, yuzhao, ziy,
	linux-kernel

Barry Song <21cnbao@gmail.com> writes:

> On Tue, Apr 16, 2024 at 2:27 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>>
>> Added Khalid for arch_do_swap_page().
>>
>> Barry Song <21cnbao@gmail.com> writes:
>>
>> > On Mon, Apr 15, 2024 at 8:39 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Barry Song <21cnbao@gmail.com> writes:
>>
>> [snip]
>>
>> >>
>> >> > +     bool any_swap_shared = false;
>> >> >
>> >> >       if (!pte_unmap_same(vmf))
>> >> >               goto out;
>> >> > @@ -4137,6 +4141,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> >> >        */
>> >> >       vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
>> >> >                       &vmf->ptl);
>> >>
>> >> We should move pte check here.  That is,
>> >>
>> >>         if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
>> >>                 goto out_nomap;
>> >>
>> >> This will simplify the situation for large folio.
>> >
>> > the plan is moving the whole code block
>> >
>> > if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio))
>> >
>> > after
>> >         if (unlikely(!folio_test_uptodate(folio))) {
>> >                 ret = VM_FAULT_SIGBUS;
>> >                 goto out_nomap;
>> >         }
>> >
>> > though we couldn't be !folio_test_uptodate(folio)) for hitting
>> > swapcache but it seems
>> > logically better for future use.
>>
>> LGTM, Thanks!
>>
>> >>
>> >> > +
>> >> > +     /* We hit large folios in swapcache */
>> >>
>> >> The comments seems unnecessary because the code tells that already.
>> >>
>> >> > +     if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio)) {
>> >> > +             int nr = folio_nr_pages(folio);
>> >> > +             int idx = folio_page_idx(folio, page);
>> >> > +             unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
>> >> > +             unsigned long folio_end = folio_start + nr * PAGE_SIZE;
>> >> > +             pte_t *folio_ptep;
>> >> > +             pte_t folio_pte;
>> >> > +
>> >> > +             if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
>> >> > +                     goto check_pte;
>> >> > +             if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
>> >> > +                     goto check_pte;
>> >> > +
>> >> > +             folio_ptep = vmf->pte - idx;
>> >> > +             folio_pte = ptep_get(folio_ptep);
>> >>
>> >> It's better to construct pte based on fault PTE via generalizing
>> >> pte_next_swp_offset() (may be pte_move_swp_offset()).  Then we can find
>> >> inconsistent PTEs quicker.
>> >
>> > it seems your point is getting the pte of page0 by pte_next_swp_offset()
>> > unfortunately pte_next_swp_offset can't go back. on the other hand,
>> > we have to check the real pte value of the 0nd entry right now because
>> > swap_pte_batch() only really reads pte from the 1st entry. it assumes
>> > pte argument is the real value for the 0nd pte entry.
>> >
>> > static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte)
>> > {
>> >         pte_t expected_pte = pte_next_swp_offset(pte);
>> >         const pte_t *end_ptep = start_ptep + max_nr;
>> >         pte_t *ptep = start_ptep + 1;
>> >
>> >         VM_WARN_ON(max_nr < 1);
>> >         VM_WARN_ON(!is_swap_pte(pte));
>> >         VM_WARN_ON(non_swap_entry(pte_to_swp_entry(pte)));
>> >
>> >         while (ptep < end_ptep) {
>> >                 pte = ptep_get(ptep);
>> >
>> >                 if (!pte_same(pte, expected_pte))
>> >                         break;
>> >
>> >                 expected_pte = pte_next_swp_offset(expected_pte);
>> >                 ptep++;
>> >         }
>> >
>> >         return ptep - start_ptep;
>> > }
>>
>> Yes.  You are right.
>>
>> But we may check whether the pte of page0 is same as "vmf->orig_pte -
>> folio_page_idx()" (fake code).
>
> right, that is why we are reading and checking PTE0 before calling
> swap_pte_batch()
> right now.
>
>   folio_ptep = vmf->pte - idx;
>   folio_pte = ptep_get(folio_ptep);
>   if (!is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte)) ||
>       swap_pte_batch(folio_ptep, nr, folio_pte, &any_swap_shared) != nr)
>    goto check_pte;
>
> So, if I understand correctly, you're proposing that we should directly check
> PTE0 in swap_pte_batch(). Personally, I don't have any objections to this idea.
> However, I'd also like to hear the feedback from Ryan and David :-)

I mean that we can replace 

        !is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte))

in above code with pte_same() with constructed expected first pte.

>>
>> You need to check the pte of page 0 anyway.
>>
>> >>
>> >> > +             if (!is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte)) ||
>> >> > +                 swap_pte_batch(folio_ptep, nr, folio_pte, &any_swap_shared) != nr)
>> >> > +                     goto check_pte;
>> >> > +
>> >> > +             start_address = folio_start;
>> >> > +             start_pte = folio_ptep;
>> >> > +             nr_pages = nr;
>> >> > +             entry = folio->swap;
>> >> > +             page = &folio->page;
>> >> > +     }
>> >> > +
>> >> > +check_pte:
>> >> >       if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
>> >> >               goto out_nomap;
>> >> >
>> >> > @@ -4190,6 +4223,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> >> >                        */
>> >> >                       exclusive = false;
>> >> >               }
>> >> > +
>> >> > +             /* Reuse the whole large folio iff all entries are exclusive */
>> >> > +             if (nr_pages > 1 && any_swap_shared)
>> >> > +                     exclusive = false;
>> >> >       }
>> >> >

[snip]

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 4/5] mm: swap: entirely map large folios found in swapcache
  2024-04-16  2:39           ` Huang, Ying
@ 2024-04-16  2:52             ` Barry Song
  2024-04-16  3:17               ` Huang, Ying
  0 siblings, 1 reply; 54+ messages in thread
From: Barry Song @ 2024-04-16  2:52 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Khalid Aziz, akpm, linux-mm, baolin.wang, chrisl, david,
	hanchuanhua, hannes, hughd, kasong, ryan.roberts, surenb,
	v-songbaohua, willy, xiang, yosryahmed, yuzhao, ziy,
	linux-kernel

On Tue, Apr 16, 2024 at 2:41 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > On Tue, Apr 16, 2024 at 2:27 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >>
> >> Added Khalid for arch_do_swap_page().
> >>
> >> Barry Song <21cnbao@gmail.com> writes:
> >>
> >> > On Mon, Apr 15, 2024 at 8:39 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Barry Song <21cnbao@gmail.com> writes:
> >>
> >> [snip]
> >>
> >> >>
> >> >> > +     bool any_swap_shared = false;
> >> >> >
> >> >> >       if (!pte_unmap_same(vmf))
> >> >> >               goto out;
> >> >> > @@ -4137,6 +4141,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >> >> >        */
> >> >> >       vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> >> >> >                       &vmf->ptl);
> >> >>
> >> >> We should move pte check here.  That is,
> >> >>
> >> >>         if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
> >> >>                 goto out_nomap;
> >> >>
> >> >> This will simplify the situation for large folio.
> >> >
> >> > the plan is moving the whole code block
> >> >
> >> > if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio))
> >> >
> >> > after
> >> >         if (unlikely(!folio_test_uptodate(folio))) {
> >> >                 ret = VM_FAULT_SIGBUS;
> >> >                 goto out_nomap;
> >> >         }
> >> >
> >> > though we couldn't be !folio_test_uptodate(folio)) for hitting
> >> > swapcache but it seems
> >> > logically better for future use.
> >>
> >> LGTM, Thanks!
> >>
> >> >>
> >> >> > +
> >> >> > +     /* We hit large folios in swapcache */
> >> >>
> >> >> The comments seems unnecessary because the code tells that already.
> >> >>
> >> >> > +     if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio)) {
> >> >> > +             int nr = folio_nr_pages(folio);
> >> >> > +             int idx = folio_page_idx(folio, page);
> >> >> > +             unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
> >> >> > +             unsigned long folio_end = folio_start + nr * PAGE_SIZE;
> >> >> > +             pte_t *folio_ptep;
> >> >> > +             pte_t folio_pte;
> >> >> > +
> >> >> > +             if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
> >> >> > +                     goto check_pte;
> >> >> > +             if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
> >> >> > +                     goto check_pte;
> >> >> > +
> >> >> > +             folio_ptep = vmf->pte - idx;
> >> >> > +             folio_pte = ptep_get(folio_ptep);
> >> >>
> >> >> It's better to construct pte based on fault PTE via generalizing
> >> >> pte_next_swp_offset() (may be pte_move_swp_offset()).  Then we can find
> >> >> inconsistent PTEs quicker.
> >> >
> >> > it seems your point is getting the pte of page0 by pte_next_swp_offset()
> >> > unfortunately pte_next_swp_offset can't go back. on the other hand,
> >> > we have to check the real pte value of the 0nd entry right now because
> >> > swap_pte_batch() only really reads pte from the 1st entry. it assumes
> >> > pte argument is the real value for the 0nd pte entry.
> >> >
> >> > static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte)
> >> > {
> >> >         pte_t expected_pte = pte_next_swp_offset(pte);
> >> >         const pte_t *end_ptep = start_ptep + max_nr;
> >> >         pte_t *ptep = start_ptep + 1;
> >> >
> >> >         VM_WARN_ON(max_nr < 1);
> >> >         VM_WARN_ON(!is_swap_pte(pte));
> >> >         VM_WARN_ON(non_swap_entry(pte_to_swp_entry(pte)));
> >> >
> >> >         while (ptep < end_ptep) {
> >> >                 pte = ptep_get(ptep);
> >> >
> >> >                 if (!pte_same(pte, expected_pte))
> >> >                         break;
> >> >
> >> >                 expected_pte = pte_next_swp_offset(expected_pte);
> >> >                 ptep++;
> >> >         }
> >> >
> >> >         return ptep - start_ptep;
> >> > }
> >>
> >> Yes.  You are right.
> >>
> >> But we may check whether the pte of page0 is same as "vmf->orig_pte -
> >> folio_page_idx()" (fake code).
> >
> > right, that is why we are reading and checking PTE0 before calling
> > swap_pte_batch()
> > right now.
> >
> >   folio_ptep = vmf->pte - idx;
> >   folio_pte = ptep_get(folio_ptep);
> >   if (!is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte)) ||
> >       swap_pte_batch(folio_ptep, nr, folio_pte, &any_swap_shared) != nr)
> >    goto check_pte;
> >
> > So, if I understand correctly, you're proposing that we should directly check
> > PTE0 in swap_pte_batch(). Personally, I don't have any objections to this idea.
> > However, I'd also like to hear the feedback from Ryan and David :-)
>
> I mean that we can replace
>
>         !is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte))
>
> in above code with pte_same() with constructed expected first pte.

Got it. It could be quite tricky, especially with considerations like
pte_swp_soft_dirty, pte_swp_exclusive, and pte_swp_uffd_wp. We might
require a helper function similar to pte_next_swp_offset() but capable of
moving both forward and backward. For instance:

pte_move_swp_offset(pte_t pte, long delta)

pte_next_swp_offset can insteadly call it by:
pte_move_swp_offset(pte, 1);

Is it what you are proposing?

>
> >>
> >> You need to check the pte of page 0 anyway.
> >>
> >> >>
> >> >> > +             if (!is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte)) ||
> >> >> > +                 swap_pte_batch(folio_ptep, nr, folio_pte, &any_swap_shared) != nr)
> >> >> > +                     goto check_pte;
> >> >> > +
> >> >> > +             start_address = folio_start;
> >> >> > +             start_pte = folio_ptep;
> >> >> > +             nr_pages = nr;
> >> >> > +             entry = folio->swap;
> >> >> > +             page = &folio->page;
> >> >> > +     }
> >> >> > +
> >> >> > +check_pte:
> >> >> >       if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
> >> >> >               goto out_nomap;
> >> >> >
> >> >> > @@ -4190,6 +4223,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >> >> >                        */
> >> >> >                       exclusive = false;
> >> >> >               }
> >> >> > +
> >> >> > +             /* Reuse the whole large folio iff all entries are exclusive */
> >> >> > +             if (nr_pages > 1 && any_swap_shared)
> >> >> > +                     exclusive = false;
> >> >> >       }
> >> >> >
>
> [snip]
>
> --
> Best Regards,
> Huang, Ying

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 1/5] mm: swap: introduce swap_free_nr() for batched swap_free()
  2024-04-16  2:08                 ` Barry Song
@ 2024-04-16  3:11                   ` Huang, Ying
  2024-04-16  4:32                     ` Barry Song
  0 siblings, 1 reply; 54+ messages in thread
From: Huang, Ying @ 2024-04-16  3:11 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, baolin.wang, chrisl, david, hanchuanhua, hannes,
	hughd, kasong, ryan.roberts, surenb, v-songbaohua, willy, xiang,
	yosryahmed, yuzhao, ziy, linux-kernel

Barry Song <21cnbao@gmail.com> writes:

> On Tue, Apr 16, 2024 at 1:42 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Barry Song <21cnbao@gmail.com> writes:
>>
>> > On Mon, Apr 15, 2024 at 8:53 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Barry Song <21cnbao@gmail.com> writes:
>> >>
>> >> > On Mon, Apr 15, 2024 at 8:21 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >>
>> >> >> Barry Song <21cnbao@gmail.com> writes:
>> >> >>
>> >> >> > On Mon, Apr 15, 2024 at 6:19 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >> >>
>> >> >> >> Barry Song <21cnbao@gmail.com> writes:
>> >> >> >>
>> >> >> >> > From: Chuanhua Han <hanchuanhua@oppo.com>
>> >> >> >> >
>> >> >> >> > While swapping in a large folio, we need to free swaps related to the whole
>> >> >> >> > folio. To avoid frequently acquiring and releasing swap locks, it is better
>> >> >> >> > to introduce an API for batched free.
>> >> >> >> >
>> >> >> >> > Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
>> >> >> >> > Co-developed-by: Barry Song <v-songbaohua@oppo.com>
>> >> >> >> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>> >> >> >> > ---
>> >> >> >> >  include/linux/swap.h |  5 +++++
>> >> >> >> >  mm/swapfile.c        | 51 ++++++++++++++++++++++++++++++++++++++++++++
>> >> >> >> >  2 files changed, 56 insertions(+)
>> >> >> >> >
>> >> >> >> > diff --git a/include/linux/swap.h b/include/linux/swap.h
>> >> >> >> > index 11c53692f65f..b7a107e983b8 100644
>> >> >> >> > --- a/include/linux/swap.h
>> >> >> >> > +++ b/include/linux/swap.h
>> >> >> >> > @@ -483,6 +483,7 @@ extern void swap_shmem_alloc(swp_entry_t);
>> >> >> >> >  extern int swap_duplicate(swp_entry_t);
>> >> >> >> >  extern int swapcache_prepare(swp_entry_t);
>> >> >> >> >  extern void swap_free(swp_entry_t);
>> >> >> >> > +extern void swap_free_nr(swp_entry_t entry, int nr_pages);
>> >> >> >> >  extern void swapcache_free_entries(swp_entry_t *entries, int n);
>> >> >> >> >  extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
>> >> >> >> >  int swap_type_of(dev_t device, sector_t offset);
>> >> >> >> > @@ -564,6 +565,10 @@ static inline void swap_free(swp_entry_t swp)
>> >> >> >> >  {
>> >> >> >> >  }
>> >> >> >> >
>> >> >> >> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
>> >> >> >> > +{
>> >> >> >> > +}
>> >> >> >> > +
>> >> >> >> >  static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
>> >> >> >> >  {
>> >> >> >> >  }
>> >> >> >> > diff --git a/mm/swapfile.c b/mm/swapfile.c
>> >> >> >> > index 28642c188c93..f4c65aeb088d 100644
>> >> >> >> > --- a/mm/swapfile.c
>> >> >> >> > +++ b/mm/swapfile.c
>> >> >> >> > @@ -1356,6 +1356,57 @@ void swap_free(swp_entry_t entry)
>> >> >> >> >               __swap_entry_free(p, entry);
>> >> >> >> >  }
>> >> >> >> >
>> >> >> >> > +/*
>> >> >> >> > + * Free up the maximum number of swap entries at once to limit the
>> >> >> >> > + * maximum kernel stack usage.
>> >> >> >> > + */
>> >> >> >> > +#define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 512 ? 512 : SWAPFILE_CLUSTER)
>> >> >> >> > +
>> >> >> >> > +/*
>> >> >> >> > + * Called after swapping in a large folio,
>> >> >> >>
>> >> >> >> IMHO, it's not good to document the caller in the function definition.
>> >> >> >> Because this will discourage function reusing.
>> >> >> >
>> >> >> > ok. right now there is only one user that is why it is added. but i agree
>> >> >> > we can actually remove this.
>> >> >> >
>> >> >> >>
>> >> >> >> > batched free swap entries
>> >> >> >> > + * for this large folio, entry should be for the first subpage and
>> >> >> >> > + * its offset is aligned with nr_pages
>> >> >> >>
>> >> >> >> Why do we need this?
>> >> >> >
>> >> >> > This is a fundamental requirement for the existing kernel, folio's
>> >> >> > swap offset is naturally aligned from the first moment add_to_swap
>> >> >> > to add swapcache's xa. so this comment is describing the existing
>> >> >> > fact. In the future, if we want to support swap-out folio to discontiguous
>> >> >> > and not-aligned offsets, we can't pass entry as the parameter, we should
>> >> >> > instead pass ptep or another different data struct which can connect
>> >> >> > multiple discontiguous swap offsets.
>> >> >> >
>> >> >> > I feel like we only need "for this large folio, entry should be for
>> >> >> > the first subpage" and drop "and its offset is aligned with nr_pages",
>> >> >> > the latter is not important to this context at all.
>> >> >>
>> >> >> IIUC, all these are requirements of the only caller now, not the
>> >> >> function itself.  If only part of the all swap entries of a mTHP are
>> >> >> called with swap_free_nr(), can swap_free_nr() still do its work?  If
>> >> >> so, why not make swap_free_nr() as general as possible?
>> >> >
>> >> > right , i believe we can make swap_free_nr() as general as possible.
>> >> >
>> >> >>
>> >> >> >>
>> >> >> >> > + */
>> >> >> >> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
>> >> >> >> > +{
>> >> >> >> > +     int i, j;
>> >> >> >> > +     struct swap_cluster_info *ci;
>> >> >> >> > +     struct swap_info_struct *p;
>> >> >> >> > +     unsigned int type = swp_type(entry);
>> >> >> >> > +     unsigned long offset = swp_offset(entry);
>> >> >> >> > +     int batch_nr, remain_nr;
>> >> >> >> > +     DECLARE_BITMAP(usage, SWAP_BATCH_NR) = { 0 };
>> >> >> >> > +
>> >> >> >> > +     /* all swap entries are within a cluster for mTHP */
>> >> >> >> > +     VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
>> >> >> >> > +
>> >> >> >> > +     if (nr_pages == 1) {
>> >> >> >> > +             swap_free(entry);
>> >> >> >> > +             return;
>> >> >> >> > +     }
>> >> >> >>
>> >> >> >> Is it possible to unify swap_free() and swap_free_nr() into one function
>> >> >> >> with acceptable performance?  IIUC, the general rule in mTHP effort is
>> >> >> >> to avoid duplicate functions between mTHP and normal small folio.
>> >> >> >> Right?
>> >> >> >
>> >> >> > I don't see why.
>> >> >>
>> >> >> Because duplicated implementation are hard to maintain in the long term.
>> >> >
>> >> > sorry, i actually meant "I don't see why not",  for some reason, the "not"
>> >> > was missed. Obviously I meant "why not", there was a "but" after it :-)
>> >> >
>> >> >>
>> >> >> > but we have lots of places calling swap_free(), we may
>> >> >> > have to change them all to call swap_free_nr(entry, 1); the other possible
>> >> >> > way is making swap_free() a wrapper of swap_free_nr() always using
>> >> >> > 1 as the argument. In both cases, we are changing the semantics of
>> >> >> > swap_free_nr() to partially freeing large folio cases and have to drop
>> >> >> > "entry should be for the first subpage" then.
>> >> >> >
>> >> >> > Right now, the semantics is
>> >> >> > * swap_free_nr() for an entire large folio;
>> >> >> > * swap_free() for one entry of either a large folio or a small folio
>> >> >>
>> >> >> As above, I don't think the these semantics are important for
>> >> >> swap_free_nr() implementation.
>> >> >
>> >> > right. I agree. If we are ready to change all those callers, nothing
>> >> > can stop us from removing swap_free().
>> >> >
>> >> >>
>> >> >> >>
>> >> >> >> > +
>> >> >> >> > +     remain_nr = nr_pages;
>> >> >> >> > +     p = _swap_info_get(entry);
>> >> >> >> > +     if (p) {
>> >> >> >> > +             for (i = 0; i < nr_pages; i += batch_nr) {
>> >> >> >> > +                     batch_nr = min_t(int, SWAP_BATCH_NR, remain_nr);
>> >> >> >> > +
>> >> >> >> > +                     ci = lock_cluster_or_swap_info(p, offset);
>> >> >> >> > +                     for (j = 0; j < batch_nr; j++) {
>> >> >> >> > +                             if (__swap_entry_free_locked(p, offset + i * SWAP_BATCH_NR + j, 1))
>> >> >> >> > +                                     __bitmap_set(usage, j, 1);
>> >> >> >> > +                     }
>> >> >> >> > +                     unlock_cluster_or_swap_info(p, ci);
>> >> >> >> > +
>> >> >> >> > +                     for_each_clear_bit(j, usage, batch_nr)
>> >> >> >> > +                             free_swap_slot(swp_entry(type, offset + i * SWAP_BATCH_NR + j));
>> >> >> >> > +
>> >> >> >> > +                     bitmap_clear(usage, 0, SWAP_BATCH_NR);
>> >> >> >> > +                     remain_nr -= batch_nr;
>> >> >> >> > +             }
>> >> >> >> > +     }
>> >> >> >> > +}
>> >> >> >> > +
>> >> >> >> >  /*
>> >> >> >> >   * Called after dropping swapcache to decrease refcnt to swap entries.
>> >> >> >> >   */
>> >> >> >>
>> >> >> >> put_swap_folio() implements batching in another method.  Do you think
>> >> >> >> that it's good to use the batching method in that function here?  It
>> >> >> >> avoids to use bitmap operations and stack space.
>> >> >> >
>> >> >> > Chuanhua has strictly limited the maximum stack usage to several
>> >> >> > unsigned long,
>> >> >>
>> >> >> 512 / 8 = 64 bytes.
>> >> >>
>> >> >> So, not trivial.
>> >> >>
>> >> >> > so this should be safe. on the other hand, i believe this
>> >> >> > implementation is more efficient, as  put_swap_folio() might lock/
>> >> >> > unlock much more often whenever __swap_entry_free_locked returns
>> >> >> > 0.
>> >> >>
>> >> >> There are 2 most common use cases,
>> >> >>
>> >> >> - all swap entries have usage count == 0
>> >> >> - all swap entries have usage count != 0
>> >> >>
>> >> >> In both cases, we only need to lock/unlock once.  In fact, I didn't
>> >> >> find possible use cases other than above.
>> >> >
>> >> > i guess the point is free_swap_slot() shouldn't be called within
>> >> > lock_cluster_or_swap_info? so when we are freeing nr_pages slots,
>> >> > we'll have to unlock and lock nr_pages times?  and this is the most
>> >> > common scenario.
>> >>
>> >> No.  In put_swap_folio(), free_entries is either SWAPFILE_CLUSTER (that
>> >> is, nr_pages) or 0.  These are the most common cases.
>> >>
>> >
>> > i am actually talking about the below code path,
>> >
>> > void put_swap_folio(struct folio *folio, swp_entry_t entry)
>> > {
>> >         ...
>> >
>> >         ci = lock_cluster_or_swap_info(si, offset);
>> >         ...
>> >         for (i = 0; i < size; i++, entry.val++) {
>> >                 if (!__swap_entry_free_locked(si, offset + i, SWAP_HAS_CACHE)) {
>> >                         unlock_cluster_or_swap_info(si, ci);
>> >                         free_swap_slot(entry);
>> >                         if (i == size - 1)
>> >                                 return;
>> >                         lock_cluster_or_swap_info(si, offset);
>> >                 }
>> >         }
>> >         unlock_cluster_or_swap_info(si, ci);
>> > }
>> >
>> > but i guess you are talking about the below code path:
>> >
>> > void put_swap_folio(struct folio *folio, swp_entry_t entry)
>> > {
>> >         ...
>> >
>> >         ci = lock_cluster_or_swap_info(si, offset);
>> >         if (size == SWAPFILE_CLUSTER) {
>> >                 map = si->swap_map + offset;
>> >                 for (i = 0; i < SWAPFILE_CLUSTER; i++) {
>> >                         val = map[i];
>> >                         VM_BUG_ON(!(val & SWAP_HAS_CACHE));
>> >                         if (val == SWAP_HAS_CACHE)
>> >                                 free_entries++;
>> >                 }
>> >                 if (free_entries == SWAPFILE_CLUSTER) {
>> >                         unlock_cluster_or_swap_info(si, ci);
>> >                         spin_lock(&si->lock);
>> >                         mem_cgroup_uncharge_swap(entry, SWAPFILE_CLUSTER);
>> >                         swap_free_cluster(si, idx);
>> >                         spin_unlock(&si->lock);
>> >                         return;
>> >                 }
>> >         }
>> > }
>>
>> I am talking about both code paths.  In 2 most common cases,
>> __swap_entry_free_locked() will return 0 or !0 for all entries in range.
>
> I grasp your point, but if conditions involving 0 or non-0 values fail, we'll
> end up repeatedly unlocking and locking. Picture a scenario with a large
> folio shared by multiple processes. One process might unmap a portion
> while another still holds an entire mapping to it. This could lead to situations
> where free_entries doesn't equal 0 and free_entries doesn't equal
> nr_pages, resulting in multiple unlock and lock operations.

This is impossible in current caller, because the folio is in the swap
cache.  But if we move the change to __swap_entry_free_nr(), we may run
into this situation.

> Chuanhua has invested significant effort in following Ryan's suggestion
> for the current approach, which generally handles all cases, especially
> partial unmapping. Additionally, the widespread use of swap_free_nr()
> as you suggested across various scenarios is noteworthy.
>
> Unless there's evidence indicating performance issues or bugs, I believe
> the current approach remains preferable.

TBH, I don't like the large stack space usage (64 bytes).  How about use
a "unsigned long" as bitmap?  Then, we use much less stack space, use
bitmap == 0 and bitmap == (unsigned long)(-1) to check the most common
use cases.  And, we have enough batching.

>>
>> > we are mTHP, so we can't assume our size is SWAPFILE_CLUSTER?
>> > or you want to check free_entries == "1 << swap_entry_order(folio_order(folio))"
>> > instead of SWAPFILE_CLUSTER for the "for (i = 0; i < size; i++, entry.val++)"
>> > path?
>>
>> Just replace SWAPFILE_CLUSTER with "nr_pages" in your code.
>>
>> >
>> >> >>
>> >> >> And, we should add batching in __swap_entry_free().  That will help
>> >> >> free_swap_and_cache_nr() too.
>> >
>> > Chris Li and I actually discussed it before, while I completely
>> > agree this can be batched. but i'd like to defer this as an incremental
>> > patchset later to keep this swapcache-refault small.
>>
>> OK.
>>
>> >>
>> >> Please consider this too.

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 4/5] mm: swap: entirely map large folios found in swapcache
  2024-04-16  2:52             ` Barry Song
@ 2024-04-16  3:17               ` Huang, Ying
  2024-04-16  4:40                 ` Barry Song
  0 siblings, 1 reply; 54+ messages in thread
From: Huang, Ying @ 2024-04-16  3:17 UTC (permalink / raw)
  To: Barry Song
  Cc: Khalid Aziz, akpm, linux-mm, baolin.wang, chrisl, david,
	hanchuanhua, hannes, hughd, kasong, ryan.roberts, surenb,
	v-songbaohua, willy, xiang, yosryahmed, yuzhao, ziy,
	linux-kernel

Barry Song <21cnbao@gmail.com> writes:

> On Tue, Apr 16, 2024 at 2:41 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Barry Song <21cnbao@gmail.com> writes:
>>
>> > On Tue, Apr 16, 2024 at 2:27 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >>
>> >> Added Khalid for arch_do_swap_page().
>> >>
>> >> Barry Song <21cnbao@gmail.com> writes:
>> >>
>> >> > On Mon, Apr 15, 2024 at 8:39 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >>
>> >> >> Barry Song <21cnbao@gmail.com> writes:
>> >>
>> >> [snip]
>> >>
>> >> >>
>> >> >> > +     bool any_swap_shared = false;
>> >> >> >
>> >> >> >       if (!pte_unmap_same(vmf))
>> >> >> >               goto out;
>> >> >> > @@ -4137,6 +4141,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> >> >> >        */
>> >> >> >       vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
>> >> >> >                       &vmf->ptl);
>> >> >>
>> >> >> We should move pte check here.  That is,
>> >> >>
>> >> >>         if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
>> >> >>                 goto out_nomap;
>> >> >>
>> >> >> This will simplify the situation for large folio.
>> >> >
>> >> > the plan is moving the whole code block
>> >> >
>> >> > if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio))
>> >> >
>> >> > after
>> >> >         if (unlikely(!folio_test_uptodate(folio))) {
>> >> >                 ret = VM_FAULT_SIGBUS;
>> >> >                 goto out_nomap;
>> >> >         }
>> >> >
>> >> > though we couldn't be !folio_test_uptodate(folio)) for hitting
>> >> > swapcache but it seems
>> >> > logically better for future use.
>> >>
>> >> LGTM, Thanks!
>> >>
>> >> >>
>> >> >> > +
>> >> >> > +     /* We hit large folios in swapcache */
>> >> >>
>> >> >> The comments seems unnecessary because the code tells that already.
>> >> >>
>> >> >> > +     if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio)) {
>> >> >> > +             int nr = folio_nr_pages(folio);
>> >> >> > +             int idx = folio_page_idx(folio, page);
>> >> >> > +             unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
>> >> >> > +             unsigned long folio_end = folio_start + nr * PAGE_SIZE;
>> >> >> > +             pte_t *folio_ptep;
>> >> >> > +             pte_t folio_pte;
>> >> >> > +
>> >> >> > +             if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
>> >> >> > +                     goto check_pte;
>> >> >> > +             if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
>> >> >> > +                     goto check_pte;
>> >> >> > +
>> >> >> > +             folio_ptep = vmf->pte - idx;
>> >> >> > +             folio_pte = ptep_get(folio_ptep);
>> >> >>
>> >> >> It's better to construct pte based on fault PTE via generalizing
>> >> >> pte_next_swp_offset() (may be pte_move_swp_offset()).  Then we can find
>> >> >> inconsistent PTEs quicker.
>> >> >
>> >> > it seems your point is getting the pte of page0 by pte_next_swp_offset()
>> >> > unfortunately pte_next_swp_offset can't go back. on the other hand,
>> >> > we have to check the real pte value of the 0nd entry right now because
>> >> > swap_pte_batch() only really reads pte from the 1st entry. it assumes
>> >> > pte argument is the real value for the 0nd pte entry.
>> >> >
>> >> > static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte)
>> >> > {
>> >> >         pte_t expected_pte = pte_next_swp_offset(pte);
>> >> >         const pte_t *end_ptep = start_ptep + max_nr;
>> >> >         pte_t *ptep = start_ptep + 1;
>> >> >
>> >> >         VM_WARN_ON(max_nr < 1);
>> >> >         VM_WARN_ON(!is_swap_pte(pte));
>> >> >         VM_WARN_ON(non_swap_entry(pte_to_swp_entry(pte)));
>> >> >
>> >> >         while (ptep < end_ptep) {
>> >> >                 pte = ptep_get(ptep);
>> >> >
>> >> >                 if (!pte_same(pte, expected_pte))
>> >> >                         break;
>> >> >
>> >> >                 expected_pte = pte_next_swp_offset(expected_pte);
>> >> >                 ptep++;
>> >> >         }
>> >> >
>> >> >         return ptep - start_ptep;
>> >> > }
>> >>
>> >> Yes.  You are right.
>> >>
>> >> But we may check whether the pte of page0 is same as "vmf->orig_pte -
>> >> folio_page_idx()" (fake code).
>> >
>> > right, that is why we are reading and checking PTE0 before calling
>> > swap_pte_batch()
>> > right now.
>> >
>> >   folio_ptep = vmf->pte - idx;
>> >   folio_pte = ptep_get(folio_ptep);
>> >   if (!is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte)) ||
>> >       swap_pte_batch(folio_ptep, nr, folio_pte, &any_swap_shared) != nr)
>> >    goto check_pte;
>> >
>> > So, if I understand correctly, you're proposing that we should directly check
>> > PTE0 in swap_pte_batch(). Personally, I don't have any objections to this idea.
>> > However, I'd also like to hear the feedback from Ryan and David :-)
>>
>> I mean that we can replace
>>
>>         !is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte))
>>
>> in above code with pte_same() with constructed expected first pte.
>
> Got it. It could be quite tricky, especially with considerations like
> pte_swp_soft_dirty, pte_swp_exclusive, and pte_swp_uffd_wp. We might
> require a helper function similar to pte_next_swp_offset() but capable of
> moving both forward and backward. For instance:
>
> pte_move_swp_offset(pte_t pte, long delta)
>
> pte_next_swp_offset can insteadly call it by:
> pte_move_swp_offset(pte, 1);
>
> Is it what you are proposing?

Yes.  Exactly.

--
Best Regards,
Huang, Ying

>>
>> >>
>> >> You need to check the pte of page 0 anyway.
>> >>
>> >> >>
>> >> >> > +             if (!is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte)) ||
>> >> >> > +                 swap_pte_batch(folio_ptep, nr, folio_pte, &any_swap_shared) != nr)
>> >> >> > +                     goto check_pte;
>> >> >> > +
>> >> >> > +             start_address = folio_start;
>> >> >> > +             start_pte = folio_ptep;
>> >> >> > +             nr_pages = nr;
>> >> >> > +             entry = folio->swap;
>> >> >> > +             page = &folio->page;
>> >> >> > +     }
>> >> >> > +
>> >> >> > +check_pte:
>> >> >> >       if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
>> >> >> >               goto out_nomap;
>> >> >> >
>> >> >> > @@ -4190,6 +4223,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> >> >> >                        */
>> >> >> >                       exclusive = false;
>> >> >> >               }
>> >> >> > +
>> >> >> > +             /* Reuse the whole large folio iff all entries are exclusive */
>> >> >> > +             if (nr_pages > 1 && any_swap_shared)
>> >> >> > +                     exclusive = false;
>> >> >> >       }
>> >> >> >
>>
>> [snip]
>>
>> --
>> Best Regards,
>> Huang, Ying

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 1/5] mm: swap: introduce swap_free_nr() for batched swap_free()
  2024-04-16  3:11                   ` Huang, Ying
@ 2024-04-16  4:32                     ` Barry Song
  2024-04-17  0:32                       ` Huang, Ying
  0 siblings, 1 reply; 54+ messages in thread
From: Barry Song @ 2024-04-16  4:32 UTC (permalink / raw)
  To: Huang, Ying
  Cc: akpm, linux-mm, baolin.wang, chrisl, david, hanchuanhua, hannes,
	hughd, kasong, ryan.roberts, surenb, v-songbaohua, willy, xiang,
	yosryahmed, yuzhao, ziy, linux-kernel

On Tue, Apr 16, 2024 at 3:13 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > On Tue, Apr 16, 2024 at 1:42 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Barry Song <21cnbao@gmail.com> writes:
> >>
> >> > On Mon, Apr 15, 2024 at 8:53 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Barry Song <21cnbao@gmail.com> writes:
> >> >>
> >> >> > On Mon, Apr 15, 2024 at 8:21 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >>
> >> >> >> Barry Song <21cnbao@gmail.com> writes:
> >> >> >>
> >> >> >> > On Mon, Apr 15, 2024 at 6:19 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >> >>
> >> >> >> >> Barry Song <21cnbao@gmail.com> writes:
> >> >> >> >>
> >> >> >> >> > From: Chuanhua Han <hanchuanhua@oppo.com>
> >> >> >> >> >
> >> >> >> >> > While swapping in a large folio, we need to free swaps related to the whole
> >> >> >> >> > folio. To avoid frequently acquiring and releasing swap locks, it is better
> >> >> >> >> > to introduce an API for batched free.
> >> >> >> >> >
> >> >> >> >> > Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> >> >> >> >> > Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> >> >> >> >> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> >> >> >> >> > ---
> >> >> >> >> >  include/linux/swap.h |  5 +++++
> >> >> >> >> >  mm/swapfile.c        | 51 ++++++++++++++++++++++++++++++++++++++++++++
> >> >> >> >> >  2 files changed, 56 insertions(+)
> >> >> >> >> >
> >> >> >> >> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> >> >> >> >> > index 11c53692f65f..b7a107e983b8 100644
> >> >> >> >> > --- a/include/linux/swap.h
> >> >> >> >> > +++ b/include/linux/swap.h
> >> >> >> >> > @@ -483,6 +483,7 @@ extern void swap_shmem_alloc(swp_entry_t);
> >> >> >> >> >  extern int swap_duplicate(swp_entry_t);
> >> >> >> >> >  extern int swapcache_prepare(swp_entry_t);
> >> >> >> >> >  extern void swap_free(swp_entry_t);
> >> >> >> >> > +extern void swap_free_nr(swp_entry_t entry, int nr_pages);
> >> >> >> >> >  extern void swapcache_free_entries(swp_entry_t *entries, int n);
> >> >> >> >> >  extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
> >> >> >> >> >  int swap_type_of(dev_t device, sector_t offset);
> >> >> >> >> > @@ -564,6 +565,10 @@ static inline void swap_free(swp_entry_t swp)
> >> >> >> >> >  {
> >> >> >> >> >  }
> >> >> >> >> >
> >> >> >> >> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
> >> >> >> >> > +{
> >> >> >> >> > +}
> >> >> >> >> > +
> >> >> >> >> >  static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
> >> >> >> >> >  {
> >> >> >> >> >  }
> >> >> >> >> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> >> >> >> >> > index 28642c188c93..f4c65aeb088d 100644
> >> >> >> >> > --- a/mm/swapfile.c
> >> >> >> >> > +++ b/mm/swapfile.c
> >> >> >> >> > @@ -1356,6 +1356,57 @@ void swap_free(swp_entry_t entry)
> >> >> >> >> >               __swap_entry_free(p, entry);
> >> >> >> >> >  }
> >> >> >> >> >
> >> >> >> >> > +/*
> >> >> >> >> > + * Free up the maximum number of swap entries at once to limit the
> >> >> >> >> > + * maximum kernel stack usage.
> >> >> >> >> > + */
> >> >> >> >> > +#define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 512 ? 512 : SWAPFILE_CLUSTER)
> >> >> >> >> > +
> >> >> >> >> > +/*
> >> >> >> >> > + * Called after swapping in a large folio,
> >> >> >> >>
> >> >> >> >> IMHO, it's not good to document the caller in the function definition.
> >> >> >> >> Because this will discourage function reusing.
> >> >> >> >
> >> >> >> > ok. right now there is only one user that is why it is added. but i agree
> >> >> >> > we can actually remove this.
> >> >> >> >
> >> >> >> >>
> >> >> >> >> > batched free swap entries
> >> >> >> >> > + * for this large folio, entry should be for the first subpage and
> >> >> >> >> > + * its offset is aligned with nr_pages
> >> >> >> >>
> >> >> >> >> Why do we need this?
> >> >> >> >
> >> >> >> > This is a fundamental requirement for the existing kernel, folio's
> >> >> >> > swap offset is naturally aligned from the first moment add_to_swap
> >> >> >> > to add swapcache's xa. so this comment is describing the existing
> >> >> >> > fact. In the future, if we want to support swap-out folio to discontiguous
> >> >> >> > and not-aligned offsets, we can't pass entry as the parameter, we should
> >> >> >> > instead pass ptep or another different data struct which can connect
> >> >> >> > multiple discontiguous swap offsets.
> >> >> >> >
> >> >> >> > I feel like we only need "for this large folio, entry should be for
> >> >> >> > the first subpage" and drop "and its offset is aligned with nr_pages",
> >> >> >> > the latter is not important to this context at all.
> >> >> >>
> >> >> >> IIUC, all these are requirements of the only caller now, not the
> >> >> >> function itself.  If only part of the all swap entries of a mTHP are
> >> >> >> called with swap_free_nr(), can swap_free_nr() still do its work?  If
> >> >> >> so, why not make swap_free_nr() as general as possible?
> >> >> >
> >> >> > right , i believe we can make swap_free_nr() as general as possible.
> >> >> >
> >> >> >>
> >> >> >> >>
> >> >> >> >> > + */
> >> >> >> >> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
> >> >> >> >> > +{
> >> >> >> >> > +     int i, j;
> >> >> >> >> > +     struct swap_cluster_info *ci;
> >> >> >> >> > +     struct swap_info_struct *p;
> >> >> >> >> > +     unsigned int type = swp_type(entry);
> >> >> >> >> > +     unsigned long offset = swp_offset(entry);
> >> >> >> >> > +     int batch_nr, remain_nr;
> >> >> >> >> > +     DECLARE_BITMAP(usage, SWAP_BATCH_NR) = { 0 };
> >> >> >> >> > +
> >> >> >> >> > +     /* all swap entries are within a cluster for mTHP */
> >> >> >> >> > +     VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
> >> >> >> >> > +
> >> >> >> >> > +     if (nr_pages == 1) {
> >> >> >> >> > +             swap_free(entry);
> >> >> >> >> > +             return;
> >> >> >> >> > +     }
> >> >> >> >>
> >> >> >> >> Is it possible to unify swap_free() and swap_free_nr() into one function
> >> >> >> >> with acceptable performance?  IIUC, the general rule in mTHP effort is
> >> >> >> >> to avoid duplicate functions between mTHP and normal small folio.
> >> >> >> >> Right?
> >> >> >> >
> >> >> >> > I don't see why.
> >> >> >>
> >> >> >> Because duplicated implementation are hard to maintain in the long term.
> >> >> >
> >> >> > sorry, i actually meant "I don't see why not",  for some reason, the "not"
> >> >> > was missed. Obviously I meant "why not", there was a "but" after it :-)
> >> >> >
> >> >> >>
> >> >> >> > but we have lots of places calling swap_free(), we may
> >> >> >> > have to change them all to call swap_free_nr(entry, 1); the other possible
> >> >> >> > way is making swap_free() a wrapper of swap_free_nr() always using
> >> >> >> > 1 as the argument. In both cases, we are changing the semantics of
> >> >> >> > swap_free_nr() to partially freeing large folio cases and have to drop
> >> >> >> > "entry should be for the first subpage" then.
> >> >> >> >
> >> >> >> > Right now, the semantics is
> >> >> >> > * swap_free_nr() for an entire large folio;
> >> >> >> > * swap_free() for one entry of either a large folio or a small folio
> >> >> >>
> >> >> >> As above, I don't think the these semantics are important for
> >> >> >> swap_free_nr() implementation.
> >> >> >
> >> >> > right. I agree. If we are ready to change all those callers, nothing
> >> >> > can stop us from removing swap_free().
> >> >> >
> >> >> >>
> >> >> >> >>
> >> >> >> >> > +
> >> >> >> >> > +     remain_nr = nr_pages;
> >> >> >> >> > +     p = _swap_info_get(entry);
> >> >> >> >> > +     if (p) {
> >> >> >> >> > +             for (i = 0; i < nr_pages; i += batch_nr) {
> >> >> >> >> > +                     batch_nr = min_t(int, SWAP_BATCH_NR, remain_nr);
> >> >> >> >> > +
> >> >> >> >> > +                     ci = lock_cluster_or_swap_info(p, offset);
> >> >> >> >> > +                     for (j = 0; j < batch_nr; j++) {
> >> >> >> >> > +                             if (__swap_entry_free_locked(p, offset + i * SWAP_BATCH_NR + j, 1))
> >> >> >> >> > +                                     __bitmap_set(usage, j, 1);
> >> >> >> >> > +                     }
> >> >> >> >> > +                     unlock_cluster_or_swap_info(p, ci);
> >> >> >> >> > +
> >> >> >> >> > +                     for_each_clear_bit(j, usage, batch_nr)
> >> >> >> >> > +                             free_swap_slot(swp_entry(type, offset + i * SWAP_BATCH_NR + j));
> >> >> >> >> > +
> >> >> >> >> > +                     bitmap_clear(usage, 0, SWAP_BATCH_NR);
> >> >> >> >> > +                     remain_nr -= batch_nr;
> >> >> >> >> > +             }
> >> >> >> >> > +     }
> >> >> >> >> > +}
> >> >> >> >> > +
> >> >> >> >> >  /*
> >> >> >> >> >   * Called after dropping swapcache to decrease refcnt to swap entries.
> >> >> >> >> >   */
> >> >> >> >>
> >> >> >> >> put_swap_folio() implements batching in another method.  Do you think
> >> >> >> >> that it's good to use the batching method in that function here?  It
> >> >> >> >> avoids to use bitmap operations and stack space.
> >> >> >> >
> >> >> >> > Chuanhua has strictly limited the maximum stack usage to several
> >> >> >> > unsigned long,
> >> >> >>
> >> >> >> 512 / 8 = 64 bytes.
> >> >> >>
> >> >> >> So, not trivial.
> >> >> >>
> >> >> >> > so this should be safe. on the other hand, i believe this
> >> >> >> > implementation is more efficient, as  put_swap_folio() might lock/
> >> >> >> > unlock much more often whenever __swap_entry_free_locked returns
> >> >> >> > 0.
> >> >> >>
> >> >> >> There are 2 most common use cases,
> >> >> >>
> >> >> >> - all swap entries have usage count == 0
> >> >> >> - all swap entries have usage count != 0
> >> >> >>
> >> >> >> In both cases, we only need to lock/unlock once.  In fact, I didn't
> >> >> >> find possible use cases other than above.
> >> >> >
> >> >> > i guess the point is free_swap_slot() shouldn't be called within
> >> >> > lock_cluster_or_swap_info? so when we are freeing nr_pages slots,
> >> >> > we'll have to unlock and lock nr_pages times?  and this is the most
> >> >> > common scenario.
> >> >>
> >> >> No.  In put_swap_folio(), free_entries is either SWAPFILE_CLUSTER (that
> >> >> is, nr_pages) or 0.  These are the most common cases.
> >> >>
> >> >
> >> > i am actually talking about the below code path,
> >> >
> >> > void put_swap_folio(struct folio *folio, swp_entry_t entry)
> >> > {
> >> >         ...
> >> >
> >> >         ci = lock_cluster_or_swap_info(si, offset);
> >> >         ...
> >> >         for (i = 0; i < size; i++, entry.val++) {
> >> >                 if (!__swap_entry_free_locked(si, offset + i, SWAP_HAS_CACHE)) {
> >> >                         unlock_cluster_or_swap_info(si, ci);
> >> >                         free_swap_slot(entry);
> >> >                         if (i == size - 1)
> >> >                                 return;
> >> >                         lock_cluster_or_swap_info(si, offset);
> >> >                 }
> >> >         }
> >> >         unlock_cluster_or_swap_info(si, ci);
> >> > }
> >> >
> >> > but i guess you are talking about the below code path:
> >> >
> >> > void put_swap_folio(struct folio *folio, swp_entry_t entry)
> >> > {
> >> >         ...
> >> >
> >> >         ci = lock_cluster_or_swap_info(si, offset);
> >> >         if (size == SWAPFILE_CLUSTER) {
> >> >                 map = si->swap_map + offset;
> >> >                 for (i = 0; i < SWAPFILE_CLUSTER; i++) {
> >> >                         val = map[i];
> >> >                         VM_BUG_ON(!(val & SWAP_HAS_CACHE));
> >> >                         if (val == SWAP_HAS_CACHE)
> >> >                                 free_entries++;
> >> >                 }
> >> >                 if (free_entries == SWAPFILE_CLUSTER) {
> >> >                         unlock_cluster_or_swap_info(si, ci);
> >> >                         spin_lock(&si->lock);
> >> >                         mem_cgroup_uncharge_swap(entry, SWAPFILE_CLUSTER);
> >> >                         swap_free_cluster(si, idx);
> >> >                         spin_unlock(&si->lock);
> >> >                         return;
> >> >                 }
> >> >         }
> >> > }
> >>
> >> I am talking about both code paths.  In 2 most common cases,
> >> __swap_entry_free_locked() will return 0 or !0 for all entries in range.
> >
> > I grasp your point, but if conditions involving 0 or non-0 values fail, we'll
> > end up repeatedly unlocking and locking. Picture a scenario with a large
> > folio shared by multiple processes. One process might unmap a portion
> > while another still holds an entire mapping to it. This could lead to situations
> > where free_entries doesn't equal 0 and free_entries doesn't equal
> > nr_pages, resulting in multiple unlock and lock operations.
>
> This is impossible in current caller, because the folio is in the swap
> cache.  But if we move the change to __swap_entry_free_nr(), we may run
> into this situation.

I don't understand why it is impossible, after try_to_unmap_one() has done
on one process, mprotect and munmap called on a part of the large folio
pte entries which now have been swap entries, we are removing the PTE
for this part. Another process can entirely hit the swapcache and have
all swap entries mapped there, and we call swap_free_nr(entry, nr_pages) in
do_swap_page. Within those swap entries, some have swapcount=1 and others
have swapcount > 1. Am I missing something?

>
> > Chuanhua has invested significant effort in following Ryan's suggestion
> > for the current approach, which generally handles all cases, especially
> > partial unmapping. Additionally, the widespread use of swap_free_nr()
> > as you suggested across various scenarios is noteworthy.
> >
> > Unless there's evidence indicating performance issues or bugs, I believe
> > the current approach remains preferable.
>
> TBH, I don't like the large stack space usage (64 bytes).  How about use
> a "unsigned long" as bitmap?  Then, we use much less stack space, use
> bitmap == 0 and bitmap == (unsigned long)(-1) to check the most common
> use cases.  And, we have enough batching.

that is quite a straightforward modification like,

- #define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 512 ? 512 : SWAPFILE_CLUSTER)
+ #define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 64 ? 64 : SWAPFILE_CLUSTER)

there is no necessity to remove the bitmap API and move to raw
unsigned long operations.
as bitmap is exactly some unsigned long. on 64bit CPU, we are now one
unsigned long,
on 32bit CPU, it is now two unsigned long.

>
> >>
> >> > we are mTHP, so we can't assume our size is SWAPFILE_CLUSTER?
> >> > or you want to check free_entries == "1 << swap_entry_order(folio_order(folio))"
> >> > instead of SWAPFILE_CLUSTER for the "for (i = 0; i < size; i++, entry.val++)"
> >> > path?
> >>
> >> Just replace SWAPFILE_CLUSTER with "nr_pages" in your code.
> >>
> >> >
> >> >> >>
> >> >> >> And, we should add batching in __swap_entry_free().  That will help
> >> >> >> free_swap_and_cache_nr() too.
> >> >
> >> > Chris Li and I actually discussed it before, while I completely
> >> > agree this can be batched. but i'd like to defer this as an incremental
> >> > patchset later to keep this swapcache-refault small.
> >>
> >> OK.
> >>
> >> >>
> >> >> Please consider this too.
>
> --
> Best Regards,
> Huang, Ying

Thanks
Barry

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 4/5] mm: swap: entirely map large folios found in swapcache
  2024-04-16  3:17               ` Huang, Ying
@ 2024-04-16  4:40                 ` Barry Song
  0 siblings, 0 replies; 54+ messages in thread
From: Barry Song @ 2024-04-16  4:40 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Khalid Aziz, akpm, linux-mm, baolin.wang, chrisl, david,
	hanchuanhua, hannes, hughd, kasong, ryan.roberts, surenb,
	v-songbaohua, willy, xiang, yosryahmed, yuzhao, ziy,
	linux-kernel

On Tue, Apr 16, 2024 at 3:19 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > On Tue, Apr 16, 2024 at 2:41 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Barry Song <21cnbao@gmail.com> writes:
> >>
> >> > On Tue, Apr 16, 2024 at 2:27 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >>
> >> >> Added Khalid for arch_do_swap_page().
> >> >>
> >> >> Barry Song <21cnbao@gmail.com> writes:
> >> >>
> >> >> > On Mon, Apr 15, 2024 at 8:39 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >>
> >> >> >> Barry Song <21cnbao@gmail.com> writes:
> >> >>
> >> >> [snip]
> >> >>
> >> >> >>
> >> >> >> > +     bool any_swap_shared = false;
> >> >> >> >
> >> >> >> >       if (!pte_unmap_same(vmf))
> >> >> >> >               goto out;
> >> >> >> > @@ -4137,6 +4141,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >> >> >> >        */
> >> >> >> >       vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> >> >> >> >                       &vmf->ptl);
> >> >> >>
> >> >> >> We should move pte check here.  That is,
> >> >> >>
> >> >> >>         if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
> >> >> >>                 goto out_nomap;
> >> >> >>
> >> >> >> This will simplify the situation for large folio.
> >> >> >
> >> >> > the plan is moving the whole code block
> >> >> >
> >> >> > if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio))
> >> >> >
> >> >> > after
> >> >> >         if (unlikely(!folio_test_uptodate(folio))) {
> >> >> >                 ret = VM_FAULT_SIGBUS;
> >> >> >                 goto out_nomap;
> >> >> >         }
> >> >> >
> >> >> > though we couldn't be !folio_test_uptodate(folio)) for hitting
> >> >> > swapcache but it seems
> >> >> > logically better for future use.
> >> >>
> >> >> LGTM, Thanks!
> >> >>
> >> >> >>
> >> >> >> > +
> >> >> >> > +     /* We hit large folios in swapcache */
> >> >> >>
> >> >> >> The comments seems unnecessary because the code tells that already.
> >> >> >>
> >> >> >> > +     if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio)) {
> >> >> >> > +             int nr = folio_nr_pages(folio);
> >> >> >> > +             int idx = folio_page_idx(folio, page);
> >> >> >> > +             unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
> >> >> >> > +             unsigned long folio_end = folio_start + nr * PAGE_SIZE;
> >> >> >> > +             pte_t *folio_ptep;
> >> >> >> > +             pte_t folio_pte;
> >> >> >> > +
> >> >> >> > +             if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
> >> >> >> > +                     goto check_pte;
> >> >> >> > +             if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
> >> >> >> > +                     goto check_pte;
> >> >> >> > +
> >> >> >> > +             folio_ptep = vmf->pte - idx;
> >> >> >> > +             folio_pte = ptep_get(folio_ptep);
> >> >> >>
> >> >> >> It's better to construct pte based on fault PTE via generalizing
> >> >> >> pte_next_swp_offset() (may be pte_move_swp_offset()).  Then we can find
> >> >> >> inconsistent PTEs quicker.
> >> >> >
> >> >> > it seems your point is getting the pte of page0 by pte_next_swp_offset()
> >> >> > unfortunately pte_next_swp_offset can't go back. on the other hand,
> >> >> > we have to check the real pte value of the 0nd entry right now because
> >> >> > swap_pte_batch() only really reads pte from the 1st entry. it assumes
> >> >> > pte argument is the real value for the 0nd pte entry.
> >> >> >
> >> >> > static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte)
> >> >> > {
> >> >> >         pte_t expected_pte = pte_next_swp_offset(pte);
> >> >> >         const pte_t *end_ptep = start_ptep + max_nr;
> >> >> >         pte_t *ptep = start_ptep + 1;
> >> >> >
> >> >> >         VM_WARN_ON(max_nr < 1);
> >> >> >         VM_WARN_ON(!is_swap_pte(pte));
> >> >> >         VM_WARN_ON(non_swap_entry(pte_to_swp_entry(pte)));
> >> >> >
> >> >> >         while (ptep < end_ptep) {
> >> >> >                 pte = ptep_get(ptep);
> >> >> >
> >> >> >                 if (!pte_same(pte, expected_pte))
> >> >> >                         break;
> >> >> >
> >> >> >                 expected_pte = pte_next_swp_offset(expected_pte);
> >> >> >                 ptep++;
> >> >> >         }
> >> >> >
> >> >> >         return ptep - start_ptep;
> >> >> > }
> >> >>
> >> >> Yes.  You are right.
> >> >>
> >> >> But we may check whether the pte of page0 is same as "vmf->orig_pte -
> >> >> folio_page_idx()" (fake code).
> >> >
> >> > right, that is why we are reading and checking PTE0 before calling
> >> > swap_pte_batch()
> >> > right now.
> >> >
> >> >   folio_ptep = vmf->pte - idx;
> >> >   folio_pte = ptep_get(folio_ptep);
> >> >   if (!is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte)) ||
> >> >       swap_pte_batch(folio_ptep, nr, folio_pte, &any_swap_shared) != nr)
> >> >    goto check_pte;
> >> >
> >> > So, if I understand correctly, you're proposing that we should directly check
> >> > PTE0 in swap_pte_batch(). Personally, I don't have any objections to this idea.
> >> > However, I'd also like to hear the feedback from Ryan and David :-)
> >>
> >> I mean that we can replace
> >>
> >>         !is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte))
> >>
> >> in above code with pte_same() with constructed expected first pte.
> >
> > Got it. It could be quite tricky, especially with considerations like
> > pte_swp_soft_dirty, pte_swp_exclusive, and pte_swp_uffd_wp. We might
> > require a helper function similar to pte_next_swp_offset() but capable of
> > moving both forward and backward. For instance:
> >
> > pte_move_swp_offset(pte_t pte, long delta)
> >
> > pte_next_swp_offset can insteadly call it by:
> > pte_move_swp_offset(pte, 1);
> >
> > Is it what you are proposing?
>
> Yes.  Exactly.

Great. I agree that this appears to be much cleaner than the current code.

>
> --
> Best Regards,
> Huang, Ying
>
> >>
> >> >>
> >> >> You need to check the pte of page 0 anyway.
> >> >>
> >> >> >>
> >> >> >> > +             if (!is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte)) ||
> >> >> >> > +                 swap_pte_batch(folio_ptep, nr, folio_pte, &any_swap_shared) != nr)
> >> >> >> > +                     goto check_pte;
> >> >> >> > +
> >> >> >> > +             start_address = folio_start;
> >> >> >> > +             start_pte = folio_ptep;
> >> >> >> > +             nr_pages = nr;
> >> >> >> > +             entry = folio->swap;
> >> >> >> > +             page = &folio->page;
> >> >> >> > +     }
> >> >> >> > +
> >> >> >> > +check_pte:
> >> >> >> >       if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
> >> >> >> >               goto out_nomap;
> >> >> >> >
> >> >> >> > @@ -4190,6 +4223,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >> >> >> >                        */
> >> >> >> >                       exclusive = false;
> >> >> >> >               }
> >> >> >> > +
> >> >> >> > +             /* Reuse the whole large folio iff all entries are exclusive */
> >> >> >> > +             if (nr_pages > 1 && any_swap_shared)
> >> >> >> > +                     exclusive = false;
> >> >> >> >       }
> >> >> >> >
> >>
> >> [snip]
> >>
> >> --
> >> Best Regards,
> >> Huang, Ying

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 1/5] mm: swap: introduce swap_free_nr() for batched swap_free()
  2024-04-16  4:32                     ` Barry Song
@ 2024-04-17  0:32                       ` Huang, Ying
  2024-04-17  1:35                         ` Barry Song
  0 siblings, 1 reply; 54+ messages in thread
From: Huang, Ying @ 2024-04-17  0:32 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, baolin.wang, chrisl, david, hanchuanhua, hannes,
	hughd, kasong, ryan.roberts, surenb, v-songbaohua, willy, xiang,
	yosryahmed, yuzhao, ziy, linux-kernel

Barry Song <21cnbao@gmail.com> writes:

> On Tue, Apr 16, 2024 at 3:13 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Barry Song <21cnbao@gmail.com> writes:
>>
>> > On Tue, Apr 16, 2024 at 1:42 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Barry Song <21cnbao@gmail.com> writes:
>> >>
>> >> > On Mon, Apr 15, 2024 at 8:53 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >>
>> >> >> Barry Song <21cnbao@gmail.com> writes:
>> >> >>
>> >> >> > On Mon, Apr 15, 2024 at 8:21 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >> >>
>> >> >> >> Barry Song <21cnbao@gmail.com> writes:
>> >> >> >>
>> >> >> >> > On Mon, Apr 15, 2024 at 6:19 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >> >> >>
>> >> >> >> >> Barry Song <21cnbao@gmail.com> writes:
>> >> >> >> >>
>> >> >> >> >> > From: Chuanhua Han <hanchuanhua@oppo.com>
>> >> >> >> >> >
>> >> >> >> >> > While swapping in a large folio, we need to free swaps related to the whole
>> >> >> >> >> > folio. To avoid frequently acquiring and releasing swap locks, it is better
>> >> >> >> >> > to introduce an API for batched free.
>> >> >> >> >> >
>> >> >> >> >> > Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
>> >> >> >> >> > Co-developed-by: Barry Song <v-songbaohua@oppo.com>
>> >> >> >> >> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>> >> >> >> >> > ---
>> >> >> >> >> >  include/linux/swap.h |  5 +++++
>> >> >> >> >> >  mm/swapfile.c        | 51 ++++++++++++++++++++++++++++++++++++++++++++
>> >> >> >> >> >  2 files changed, 56 insertions(+)
>> >> >> >> >> >
>> >> >> >> >> > diff --git a/include/linux/swap.h b/include/linux/swap.h
>> >> >> >> >> > index 11c53692f65f..b7a107e983b8 100644
>> >> >> >> >> > --- a/include/linux/swap.h
>> >> >> >> >> > +++ b/include/linux/swap.h
>> >> >> >> >> > @@ -483,6 +483,7 @@ extern void swap_shmem_alloc(swp_entry_t);
>> >> >> >> >> >  extern int swap_duplicate(swp_entry_t);
>> >> >> >> >> >  extern int swapcache_prepare(swp_entry_t);
>> >> >> >> >> >  extern void swap_free(swp_entry_t);
>> >> >> >> >> > +extern void swap_free_nr(swp_entry_t entry, int nr_pages);
>> >> >> >> >> >  extern void swapcache_free_entries(swp_entry_t *entries, int n);
>> >> >> >> >> >  extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
>> >> >> >> >> >  int swap_type_of(dev_t device, sector_t offset);
>> >> >> >> >> > @@ -564,6 +565,10 @@ static inline void swap_free(swp_entry_t swp)
>> >> >> >> >> >  {
>> >> >> >> >> >  }
>> >> >> >> >> >
>> >> >> >> >> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
>> >> >> >> >> > +{
>> >> >> >> >> > +}
>> >> >> >> >> > +
>> >> >> >> >> >  static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
>> >> >> >> >> >  {
>> >> >> >> >> >  }
>> >> >> >> >> > diff --git a/mm/swapfile.c b/mm/swapfile.c
>> >> >> >> >> > index 28642c188c93..f4c65aeb088d 100644
>> >> >> >> >> > --- a/mm/swapfile.c
>> >> >> >> >> > +++ b/mm/swapfile.c
>> >> >> >> >> > @@ -1356,6 +1356,57 @@ void swap_free(swp_entry_t entry)
>> >> >> >> >> >               __swap_entry_free(p, entry);
>> >> >> >> >> >  }
>> >> >> >> >> >
>> >> >> >> >> > +/*
>> >> >> >> >> > + * Free up the maximum number of swap entries at once to limit the
>> >> >> >> >> > + * maximum kernel stack usage.
>> >> >> >> >> > + */
>> >> >> >> >> > +#define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 512 ? 512 : SWAPFILE_CLUSTER)
>> >> >> >> >> > +
>> >> >> >> >> > +/*
>> >> >> >> >> > + * Called after swapping in a large folio,
>> >> >> >> >>
>> >> >> >> >> IMHO, it's not good to document the caller in the function definition.
>> >> >> >> >> Because this will discourage function reusing.
>> >> >> >> >
>> >> >> >> > ok. right now there is only one user that is why it is added. but i agree
>> >> >> >> > we can actually remove this.
>> >> >> >> >
>> >> >> >> >>
>> >> >> >> >> > batched free swap entries
>> >> >> >> >> > + * for this large folio, entry should be for the first subpage and
>> >> >> >> >> > + * its offset is aligned with nr_pages
>> >> >> >> >>
>> >> >> >> >> Why do we need this?
>> >> >> >> >
>> >> >> >> > This is a fundamental requirement for the existing kernel, folio's
>> >> >> >> > swap offset is naturally aligned from the first moment add_to_swap
>> >> >> >> > to add swapcache's xa. so this comment is describing the existing
>> >> >> >> > fact. In the future, if we want to support swap-out folio to discontiguous
>> >> >> >> > and not-aligned offsets, we can't pass entry as the parameter, we should
>> >> >> >> > instead pass ptep or another different data struct which can connect
>> >> >> >> > multiple discontiguous swap offsets.
>> >> >> >> >
>> >> >> >> > I feel like we only need "for this large folio, entry should be for
>> >> >> >> > the first subpage" and drop "and its offset is aligned with nr_pages",
>> >> >> >> > the latter is not important to this context at all.
>> >> >> >>
>> >> >> >> IIUC, all these are requirements of the only caller now, not the
>> >> >> >> function itself.  If only part of the all swap entries of a mTHP are
>> >> >> >> called with swap_free_nr(), can swap_free_nr() still do its work?  If
>> >> >> >> so, why not make swap_free_nr() as general as possible?
>> >> >> >
>> >> >> > right , i believe we can make swap_free_nr() as general as possible.
>> >> >> >
>> >> >> >>
>> >> >> >> >>
>> >> >> >> >> > + */
>> >> >> >> >> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
>> >> >> >> >> > +{
>> >> >> >> >> > +     int i, j;
>> >> >> >> >> > +     struct swap_cluster_info *ci;
>> >> >> >> >> > +     struct swap_info_struct *p;
>> >> >> >> >> > +     unsigned int type = swp_type(entry);
>> >> >> >> >> > +     unsigned long offset = swp_offset(entry);
>> >> >> >> >> > +     int batch_nr, remain_nr;
>> >> >> >> >> > +     DECLARE_BITMAP(usage, SWAP_BATCH_NR) = { 0 };
>> >> >> >> >> > +
>> >> >> >> >> > +     /* all swap entries are within a cluster for mTHP */
>> >> >> >> >> > +     VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
>> >> >> >> >> > +
>> >> >> >> >> > +     if (nr_pages == 1) {
>> >> >> >> >> > +             swap_free(entry);
>> >> >> >> >> > +             return;
>> >> >> >> >> > +     }
>> >> >> >> >>
>> >> >> >> >> Is it possible to unify swap_free() and swap_free_nr() into one function
>> >> >> >> >> with acceptable performance?  IIUC, the general rule in mTHP effort is
>> >> >> >> >> to avoid duplicate functions between mTHP and normal small folio.
>> >> >> >> >> Right?
>> >> >> >> >
>> >> >> >> > I don't see why.
>> >> >> >>
>> >> >> >> Because duplicated implementation are hard to maintain in the long term.
>> >> >> >
>> >> >> > sorry, i actually meant "I don't see why not",  for some reason, the "not"
>> >> >> > was missed. Obviously I meant "why not", there was a "but" after it :-)
>> >> >> >
>> >> >> >>
>> >> >> >> > but we have lots of places calling swap_free(), we may
>> >> >> >> > have to change them all to call swap_free_nr(entry, 1); the other possible
>> >> >> >> > way is making swap_free() a wrapper of swap_free_nr() always using
>> >> >> >> > 1 as the argument. In both cases, we are changing the semantics of
>> >> >> >> > swap_free_nr() to partially freeing large folio cases and have to drop
>> >> >> >> > "entry should be for the first subpage" then.
>> >> >> >> >
>> >> >> >> > Right now, the semantics is
>> >> >> >> > * swap_free_nr() for an entire large folio;
>> >> >> >> > * swap_free() for one entry of either a large folio or a small folio
>> >> >> >>
>> >> >> >> As above, I don't think the these semantics are important for
>> >> >> >> swap_free_nr() implementation.
>> >> >> >
>> >> >> > right. I agree. If we are ready to change all those callers, nothing
>> >> >> > can stop us from removing swap_free().
>> >> >> >
>> >> >> >>
>> >> >> >> >>
>> >> >> >> >> > +
>> >> >> >> >> > +     remain_nr = nr_pages;
>> >> >> >> >> > +     p = _swap_info_get(entry);
>> >> >> >> >> > +     if (p) {
>> >> >> >> >> > +             for (i = 0; i < nr_pages; i += batch_nr) {
>> >> >> >> >> > +                     batch_nr = min_t(int, SWAP_BATCH_NR, remain_nr);
>> >> >> >> >> > +
>> >> >> >> >> > +                     ci = lock_cluster_or_swap_info(p, offset);
>> >> >> >> >> > +                     for (j = 0; j < batch_nr; j++) {
>> >> >> >> >> > +                             if (__swap_entry_free_locked(p, offset + i * SWAP_BATCH_NR + j, 1))
>> >> >> >> >> > +                                     __bitmap_set(usage, j, 1);
>> >> >> >> >> > +                     }
>> >> >> >> >> > +                     unlock_cluster_or_swap_info(p, ci);
>> >> >> >> >> > +
>> >> >> >> >> > +                     for_each_clear_bit(j, usage, batch_nr)
>> >> >> >> >> > +                             free_swap_slot(swp_entry(type, offset + i * SWAP_BATCH_NR + j));
>> >> >> >> >> > +
>> >> >> >> >> > +                     bitmap_clear(usage, 0, SWAP_BATCH_NR);
>> >> >> >> >> > +                     remain_nr -= batch_nr;
>> >> >> >> >> > +             }
>> >> >> >> >> > +     }
>> >> >> >> >> > +}
>> >> >> >> >> > +
>> >> >> >> >> >  /*
>> >> >> >> >> >   * Called after dropping swapcache to decrease refcnt to swap entries.
>> >> >> >> >> >   */
>> >> >> >> >>
>> >> >> >> >> put_swap_folio() implements batching in another method.  Do you think
>> >> >> >> >> that it's good to use the batching method in that function here?  It
>> >> >> >> >> avoids to use bitmap operations and stack space.
>> >> >> >> >
>> >> >> >> > Chuanhua has strictly limited the maximum stack usage to several
>> >> >> >> > unsigned long,
>> >> >> >>
>> >> >> >> 512 / 8 = 64 bytes.
>> >> >> >>
>> >> >> >> So, not trivial.
>> >> >> >>
>> >> >> >> > so this should be safe. on the other hand, i believe this
>> >> >> >> > implementation is more efficient, as  put_swap_folio() might lock/
>> >> >> >> > unlock much more often whenever __swap_entry_free_locked returns
>> >> >> >> > 0.
>> >> >> >>
>> >> >> >> There are 2 most common use cases,
>> >> >> >>
>> >> >> >> - all swap entries have usage count == 0
>> >> >> >> - all swap entries have usage count != 0
>> >> >> >>
>> >> >> >> In both cases, we only need to lock/unlock once.  In fact, I didn't
>> >> >> >> find possible use cases other than above.
>> >> >> >
>> >> >> > i guess the point is free_swap_slot() shouldn't be called within
>> >> >> > lock_cluster_or_swap_info? so when we are freeing nr_pages slots,
>> >> >> > we'll have to unlock and lock nr_pages times?  and this is the most
>> >> >> > common scenario.
>> >> >>
>> >> >> No.  In put_swap_folio(), free_entries is either SWAPFILE_CLUSTER (that
>> >> >> is, nr_pages) or 0.  These are the most common cases.
>> >> >>
>> >> >
>> >> > i am actually talking about the below code path,
>> >> >
>> >> > void put_swap_folio(struct folio *folio, swp_entry_t entry)
>> >> > {
>> >> >         ...
>> >> >
>> >> >         ci = lock_cluster_or_swap_info(si, offset);
>> >> >         ...
>> >> >         for (i = 0; i < size; i++, entry.val++) {
>> >> >                 if (!__swap_entry_free_locked(si, offset + i, SWAP_HAS_CACHE)) {
>> >> >                         unlock_cluster_or_swap_info(si, ci);
>> >> >                         free_swap_slot(entry);
>> >> >                         if (i == size - 1)
>> >> >                                 return;
>> >> >                         lock_cluster_or_swap_info(si, offset);
>> >> >                 }
>> >> >         }
>> >> >         unlock_cluster_or_swap_info(si, ci);
>> >> > }
>> >> >
>> >> > but i guess you are talking about the below code path:
>> >> >
>> >> > void put_swap_folio(struct folio *folio, swp_entry_t entry)
>> >> > {
>> >> >         ...
>> >> >
>> >> >         ci = lock_cluster_or_swap_info(si, offset);
>> >> >         if (size == SWAPFILE_CLUSTER) {
>> >> >                 map = si->swap_map + offset;
>> >> >                 for (i = 0; i < SWAPFILE_CLUSTER; i++) {
>> >> >                         val = map[i];
>> >> >                         VM_BUG_ON(!(val & SWAP_HAS_CACHE));
>> >> >                         if (val == SWAP_HAS_CACHE)
>> >> >                                 free_entries++;
>> >> >                 }
>> >> >                 if (free_entries == SWAPFILE_CLUSTER) {
>> >> >                         unlock_cluster_or_swap_info(si, ci);
>> >> >                         spin_lock(&si->lock);
>> >> >                         mem_cgroup_uncharge_swap(entry, SWAPFILE_CLUSTER);
>> >> >                         swap_free_cluster(si, idx);
>> >> >                         spin_unlock(&si->lock);
>> >> >                         return;
>> >> >                 }
>> >> >         }
>> >> > }
>> >>
>> >> I am talking about both code paths.  In 2 most common cases,
>> >> __swap_entry_free_locked() will return 0 or !0 for all entries in range.
>> >
>> > I grasp your point, but if conditions involving 0 or non-0 values fail, we'll
>> > end up repeatedly unlocking and locking. Picture a scenario with a large
>> > folio shared by multiple processes. One process might unmap a portion
>> > while another still holds an entire mapping to it. This could lead to situations
>> > where free_entries doesn't equal 0 and free_entries doesn't equal
>> > nr_pages, resulting in multiple unlock and lock operations.
>>
>> This is impossible in current caller, because the folio is in the swap
>> cache.  But if we move the change to __swap_entry_free_nr(), we may run
>> into this situation.
>
> I don't understand why it is impossible, after try_to_unmap_one() has done
> on one process, mprotect and munmap called on a part of the large folio
> pte entries which now have been swap entries, we are removing the PTE
> for this part. Another process can entirely hit the swapcache and have
> all swap entries mapped there, and we call swap_free_nr(entry, nr_pages) in
> do_swap_page. Within those swap entries, some have swapcount=1 and others
> have swapcount > 1. Am I missing something?

For swap entries with swapcount=1, its sis->swap_map[] will be

1 | SWAP_HAS_CACHE

so, __swap_entry_free_locked() will return SWAP_HAS_CACHE instead of 0.

The swap entries will be free in

folio_free_swap
  delete_from_swap_cache
    put_swap_folio

>> > Chuanhua has invested significant effort in following Ryan's suggestion
>> > for the current approach, which generally handles all cases, especially
>> > partial unmapping. Additionally, the widespread use of swap_free_nr()
>> > as you suggested across various scenarios is noteworthy.
>> >
>> > Unless there's evidence indicating performance issues or bugs, I believe
>> > the current approach remains preferable.
>>
>> TBH, I don't like the large stack space usage (64 bytes).  How about use
>> a "unsigned long" as bitmap?  Then, we use much less stack space, use
>> bitmap == 0 and bitmap == (unsigned long)(-1) to check the most common
>> use cases.  And, we have enough batching.
>
> that is quite a straightforward modification like,
>
> - #define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 512 ? 512 : SWAPFILE_CLUSTER)
> + #define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 64 ? 64 : SWAPFILE_CLUSTER)
>
> there is no necessity to remove the bitmap API and move to raw
> unsigned long operations.
> as bitmap is exactly some unsigned long. on 64bit CPU, we are now one
> unsigned long,
> on 32bit CPU, it is now two unsigned long.

Yes.  We can still use most bitmap APIs if we use "unsigned long" as
bitmap.  The advantage of "unsigned long" is to guarantee that
bitmap_empty() and bitmap_full() is trivial.  We can use that for
optimization.  For example, we can skip unlock/lock if bitmap_empty().

>>
>> >>
>> >> > we are mTHP, so we can't assume our size is SWAPFILE_CLUSTER?
>> >> > or you want to check free_entries == "1 << swap_entry_order(folio_order(folio))"
>> >> > instead of SWAPFILE_CLUSTER for the "for (i = 0; i < size; i++, entry.val++)"
>> >> > path?
>> >>
>> >> Just replace SWAPFILE_CLUSTER with "nr_pages" in your code.
>> >>
>> >> >
>> >> >> >>
>> >> >> >> And, we should add batching in __swap_entry_free().  That will help
>> >> >> >> free_swap_and_cache_nr() too.
>> >> >
>> >> > Chris Li and I actually discussed it before, while I completely
>> >> > agree this can be batched. but i'd like to defer this as an incremental
>> >> > patchset later to keep this swapcache-refault small.
>> >>
>> >> OK.
>> >>
>> >> >>
>> >> >> Please consider this too.

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 5/5] mm: add per-order mTHP swpin_refault counter
  2024-04-09  8:26 ` [PATCH v2 5/5] mm: add per-order mTHP swpin_refault counter Barry Song
  2024-04-10 23:15   ` SeongJae Park
  2024-04-11 15:53   ` Ryan Roberts
@ 2024-04-17  0:45   ` Huang, Ying
  2024-04-17  1:16     ` Barry Song
  2 siblings, 1 reply; 54+ messages in thread
From: Huang, Ying @ 2024-04-17  0:45 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, baolin.wang, chrisl, david, hanchuanhua, hannes,
	hughd, kasong, ryan.roberts, surenb, v-songbaohua, willy, xiang,
	yosryahmed, yuzhao, ziy, linux-kernel

Barry Song <21cnbao@gmail.com> writes:

> From: Barry Song <v-songbaohua@oppo.com>
>
> Currently, we are handling the scenario where we've hit a
> large folio in the swapcache, and the reclaiming process
> for this large folio is still ongoing.
>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  include/linux/huge_mm.h | 1 +
>  mm/huge_memory.c        | 2 ++
>  mm/memory.c             | 1 +
>  3 files changed, 4 insertions(+)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index c8256af83e33..b67294d5814f 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -269,6 +269,7 @@ enum mthp_stat_item {
>  	MTHP_STAT_ANON_ALLOC_FALLBACK,
>  	MTHP_STAT_ANON_SWPOUT,
>  	MTHP_STAT_ANON_SWPOUT_FALLBACK,
> +	MTHP_STAT_ANON_SWPIN_REFAULT,

This is different from the refault concept used in other place in mm
subystem.  Please check the following code

	if (shadow)
		workingset_refault(folio, shadow);

in __read_swap_cache_async().

>  	__MTHP_STAT_COUNT
>  };

--
Best Regards,
Huang, Ying

> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index d8d2ed80b0bf..fb95345b0bde 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -556,12 +556,14 @@ DEFINE_MTHP_STAT_ATTR(anon_alloc, MTHP_STAT_ANON_ALLOC);
>  DEFINE_MTHP_STAT_ATTR(anon_alloc_fallback, MTHP_STAT_ANON_ALLOC_FALLBACK);
>  DEFINE_MTHP_STAT_ATTR(anon_swpout, MTHP_STAT_ANON_SWPOUT);
>  DEFINE_MTHP_STAT_ATTR(anon_swpout_fallback, MTHP_STAT_ANON_SWPOUT_FALLBACK);
> +DEFINE_MTHP_STAT_ATTR(anon_swpin_refault, MTHP_STAT_ANON_SWPIN_REFAULT);
>  
>  static struct attribute *stats_attrs[] = {
>  	&anon_alloc_attr.attr,
>  	&anon_alloc_fallback_attr.attr,
>  	&anon_swpout_attr.attr,
>  	&anon_swpout_fallback_attr.attr,
> +	&anon_swpin_refault_attr.attr,
>  	NULL,
>  };
>  
> diff --git a/mm/memory.c b/mm/memory.c
> index 9818dc1893c8..acc023795a4d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4167,6 +4167,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  		nr_pages = nr;
>  		entry = folio->swap;
>  		page = &folio->page;
> +		count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_SWPIN_REFAULT);
>  	}
>  
>  check_pte:

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 5/5] mm: add per-order mTHP swpin_refault counter
  2024-04-17  0:45   ` Huang, Ying
@ 2024-04-17  1:16     ` Barry Song
  2024-04-17  1:38       ` Huang, Ying
  0 siblings, 1 reply; 54+ messages in thread
From: Barry Song @ 2024-04-17  1:16 UTC (permalink / raw)
  To: Huang, Ying
  Cc: akpm, linux-mm, baolin.wang, chrisl, david, hanchuanhua, hannes,
	hughd, kasong, ryan.roberts, surenb, v-songbaohua, willy, xiang,
	yosryahmed, yuzhao, ziy, linux-kernel

On Wed, Apr 17, 2024 at 8:47 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > Currently, we are handling the scenario where we've hit a
> > large folio in the swapcache, and the reclaiming process
> > for this large folio is still ongoing.
> >
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > ---
> >  include/linux/huge_mm.h | 1 +
> >  mm/huge_memory.c        | 2 ++
> >  mm/memory.c             | 1 +
> >  3 files changed, 4 insertions(+)
> >
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index c8256af83e33..b67294d5814f 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -269,6 +269,7 @@ enum mthp_stat_item {
> >       MTHP_STAT_ANON_ALLOC_FALLBACK,
> >       MTHP_STAT_ANON_SWPOUT,
> >       MTHP_STAT_ANON_SWPOUT_FALLBACK,
> > +     MTHP_STAT_ANON_SWPIN_REFAULT,
>
> This is different from the refault concept used in other place in mm
> subystem.  Please check the following code
>
>         if (shadow)
>                 workingset_refault(folio, shadow);
>
> in __read_swap_cache_async().

right. it is slightly different as refault can also cover the case folios
have been entirely released and a new page fault happens soon
after it.
Do you have a better name for this?
MTHP_STAT_ANON_SWPIN_UNDER_RECLAIM
or
MTHP_STAT_ANON_SWPIN_RECLAIMING ?

>
> >       __MTHP_STAT_COUNT
> >  };
>
> --
> Best Regards,
> Huang, Ying
>
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index d8d2ed80b0bf..fb95345b0bde 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -556,12 +556,14 @@ DEFINE_MTHP_STAT_ATTR(anon_alloc, MTHP_STAT_ANON_ALLOC);
> >  DEFINE_MTHP_STAT_ATTR(anon_alloc_fallback, MTHP_STAT_ANON_ALLOC_FALLBACK);
> >  DEFINE_MTHP_STAT_ATTR(anon_swpout, MTHP_STAT_ANON_SWPOUT);
> >  DEFINE_MTHP_STAT_ATTR(anon_swpout_fallback, MTHP_STAT_ANON_SWPOUT_FALLBACK);
> > +DEFINE_MTHP_STAT_ATTR(anon_swpin_refault, MTHP_STAT_ANON_SWPIN_REFAULT);
> >
> >  static struct attribute *stats_attrs[] = {
> >       &anon_alloc_attr.attr,
> >       &anon_alloc_fallback_attr.attr,
> >       &anon_swpout_attr.attr,
> >       &anon_swpout_fallback_attr.attr,
> > +     &anon_swpin_refault_attr.attr,
> >       NULL,
> >  };
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 9818dc1893c8..acc023795a4d 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4167,6 +4167,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >               nr_pages = nr;
> >               entry = folio->swap;
> >               page = &folio->page;
> > +             count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_SWPIN_REFAULT);
> >       }
> >
> >  check_pte:

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 1/5] mm: swap: introduce swap_free_nr() for batched swap_free()
  2024-04-17  0:32                       ` Huang, Ying
@ 2024-04-17  1:35                         ` Barry Song
  2024-04-18  5:27                           ` Barry Song
  0 siblings, 1 reply; 54+ messages in thread
From: Barry Song @ 2024-04-17  1:35 UTC (permalink / raw)
  To: Huang, Ying
  Cc: akpm, linux-mm, baolin.wang, chrisl, david, hanchuanhua, hannes,
	hughd, kasong, ryan.roberts, surenb, v-songbaohua, willy, xiang,
	yosryahmed, yuzhao, ziy, linux-kernel

On Wed, Apr 17, 2024 at 12:34 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > On Tue, Apr 16, 2024 at 3:13 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Barry Song <21cnbao@gmail.com> writes:
> >>
> >> > On Tue, Apr 16, 2024 at 1:42 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Barry Song <21cnbao@gmail.com> writes:
> >> >>
> >> >> > On Mon, Apr 15, 2024 at 8:53 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >>
> >> >> >> Barry Song <21cnbao@gmail.com> writes:
> >> >> >>
> >> >> >> > On Mon, Apr 15, 2024 at 8:21 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >> >>
> >> >> >> >> Barry Song <21cnbao@gmail.com> writes:
> >> >> >> >>
> >> >> >> >> > On Mon, Apr 15, 2024 at 6:19 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >> >> >>
> >> >> >> >> >> Barry Song <21cnbao@gmail.com> writes:
> >> >> >> >> >>
> >> >> >> >> >> > From: Chuanhua Han <hanchuanhua@oppo.com>
> >> >> >> >> >> >
> >> >> >> >> >> > While swapping in a large folio, we need to free swaps related to the whole
> >> >> >> >> >> > folio. To avoid frequently acquiring and releasing swap locks, it is better
> >> >> >> >> >> > to introduce an API for batched free.
> >> >> >> >> >> >
> >> >> >> >> >> > Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> >> >> >> >> >> > Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> >> >> >> >> >> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> >> >> >> >> >> > ---
> >> >> >> >> >> >  include/linux/swap.h |  5 +++++
> >> >> >> >> >> >  mm/swapfile.c        | 51 ++++++++++++++++++++++++++++++++++++++++++++
> >> >> >> >> >> >  2 files changed, 56 insertions(+)
> >> >> >> >> >> >
> >> >> >> >> >> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> >> >> >> >> >> > index 11c53692f65f..b7a107e983b8 100644
> >> >> >> >> >> > --- a/include/linux/swap.h
> >> >> >> >> >> > +++ b/include/linux/swap.h
> >> >> >> >> >> > @@ -483,6 +483,7 @@ extern void swap_shmem_alloc(swp_entry_t);
> >> >> >> >> >> >  extern int swap_duplicate(swp_entry_t);
> >> >> >> >> >> >  extern int swapcache_prepare(swp_entry_t);
> >> >> >> >> >> >  extern void swap_free(swp_entry_t);
> >> >> >> >> >> > +extern void swap_free_nr(swp_entry_t entry, int nr_pages);
> >> >> >> >> >> >  extern void swapcache_free_entries(swp_entry_t *entries, int n);
> >> >> >> >> >> >  extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
> >> >> >> >> >> >  int swap_type_of(dev_t device, sector_t offset);
> >> >> >> >> >> > @@ -564,6 +565,10 @@ static inline void swap_free(swp_entry_t swp)
> >> >> >> >> >> >  {
> >> >> >> >> >> >  }
> >> >> >> >> >> >
> >> >> >> >> >> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
> >> >> >> >> >> > +{
> >> >> >> >> >> > +}
> >> >> >> >> >> > +
> >> >> >> >> >> >  static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
> >> >> >> >> >> >  {
> >> >> >> >> >> >  }
> >> >> >> >> >> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> >> >> >> >> >> > index 28642c188c93..f4c65aeb088d 100644
> >> >> >> >> >> > --- a/mm/swapfile.c
> >> >> >> >> >> > +++ b/mm/swapfile.c
> >> >> >> >> >> > @@ -1356,6 +1356,57 @@ void swap_free(swp_entry_t entry)
> >> >> >> >> >> >               __swap_entry_free(p, entry);
> >> >> >> >> >> >  }
> >> >> >> >> >> >
> >> >> >> >> >> > +/*
> >> >> >> >> >> > + * Free up the maximum number of swap entries at once to limit the
> >> >> >> >> >> > + * maximum kernel stack usage.
> >> >> >> >> >> > + */
> >> >> >> >> >> > +#define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 512 ? 512 : SWAPFILE_CLUSTER)
> >> >> >> >> >> > +
> >> >> >> >> >> > +/*
> >> >> >> >> >> > + * Called after swapping in a large folio,
> >> >> >> >> >>
> >> >> >> >> >> IMHO, it's not good to document the caller in the function definition.
> >> >> >> >> >> Because this will discourage function reusing.
> >> >> >> >> >
> >> >> >> >> > ok. right now there is only one user that is why it is added. but i agree
> >> >> >> >> > we can actually remove this.
> >> >> >> >> >
> >> >> >> >> >>
> >> >> >> >> >> > batched free swap entries
> >> >> >> >> >> > + * for this large folio, entry should be for the first subpage and
> >> >> >> >> >> > + * its offset is aligned with nr_pages
> >> >> >> >> >>
> >> >> >> >> >> Why do we need this?
> >> >> >> >> >
> >> >> >> >> > This is a fundamental requirement for the existing kernel, folio's
> >> >> >> >> > swap offset is naturally aligned from the first moment add_to_swap
> >> >> >> >> > to add swapcache's xa. so this comment is describing the existing
> >> >> >> >> > fact. In the future, if we want to support swap-out folio to discontiguous
> >> >> >> >> > and not-aligned offsets, we can't pass entry as the parameter, we should
> >> >> >> >> > instead pass ptep or another different data struct which can connect
> >> >> >> >> > multiple discontiguous swap offsets.
> >> >> >> >> >
> >> >> >> >> > I feel like we only need "for this large folio, entry should be for
> >> >> >> >> > the first subpage" and drop "and its offset is aligned with nr_pages",
> >> >> >> >> > the latter is not important to this context at all.
> >> >> >> >>
> >> >> >> >> IIUC, all these are requirements of the only caller now, not the
> >> >> >> >> function itself.  If only part of the all swap entries of a mTHP are
> >> >> >> >> called with swap_free_nr(), can swap_free_nr() still do its work?  If
> >> >> >> >> so, why not make swap_free_nr() as general as possible?
> >> >> >> >
> >> >> >> > right , i believe we can make swap_free_nr() as general as possible.
> >> >> >> >
> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >> > + */
> >> >> >> >> >> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
> >> >> >> >> >> > +{
> >> >> >> >> >> > +     int i, j;
> >> >> >> >> >> > +     struct swap_cluster_info *ci;
> >> >> >> >> >> > +     struct swap_info_struct *p;
> >> >> >> >> >> > +     unsigned int type = swp_type(entry);
> >> >> >> >> >> > +     unsigned long offset = swp_offset(entry);
> >> >> >> >> >> > +     int batch_nr, remain_nr;
> >> >> >> >> >> > +     DECLARE_BITMAP(usage, SWAP_BATCH_NR) = { 0 };
> >> >> >> >> >> > +
> >> >> >> >> >> > +     /* all swap entries are within a cluster for mTHP */
> >> >> >> >> >> > +     VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
> >> >> >> >> >> > +
> >> >> >> >> >> > +     if (nr_pages == 1) {
> >> >> >> >> >> > +             swap_free(entry);
> >> >> >> >> >> > +             return;
> >> >> >> >> >> > +     }
> >> >> >> >> >>
> >> >> >> >> >> Is it possible to unify swap_free() and swap_free_nr() into one function
> >> >> >> >> >> with acceptable performance?  IIUC, the general rule in mTHP effort is
> >> >> >> >> >> to avoid duplicate functions between mTHP and normal small folio.
> >> >> >> >> >> Right?
> >> >> >> >> >
> >> >> >> >> > I don't see why.
> >> >> >> >>
> >> >> >> >> Because duplicated implementation are hard to maintain in the long term.
> >> >> >> >
> >> >> >> > sorry, i actually meant "I don't see why not",  for some reason, the "not"
> >> >> >> > was missed. Obviously I meant "why not", there was a "but" after it :-)
> >> >> >> >
> >> >> >> >>
> >> >> >> >> > but we have lots of places calling swap_free(), we may
> >> >> >> >> > have to change them all to call swap_free_nr(entry, 1); the other possible
> >> >> >> >> > way is making swap_free() a wrapper of swap_free_nr() always using
> >> >> >> >> > 1 as the argument. In both cases, we are changing the semantics of
> >> >> >> >> > swap_free_nr() to partially freeing large folio cases and have to drop
> >> >> >> >> > "entry should be for the first subpage" then.
> >> >> >> >> >
> >> >> >> >> > Right now, the semantics is
> >> >> >> >> > * swap_free_nr() for an entire large folio;
> >> >> >> >> > * swap_free() for one entry of either a large folio or a small folio
> >> >> >> >>
> >> >> >> >> As above, I don't think the these semantics are important for
> >> >> >> >> swap_free_nr() implementation.
> >> >> >> >
> >> >> >> > right. I agree. If we are ready to change all those callers, nothing
> >> >> >> > can stop us from removing swap_free().
> >> >> >> >
> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >> > +
> >> >> >> >> >> > +     remain_nr = nr_pages;
> >> >> >> >> >> > +     p = _swap_info_get(entry);
> >> >> >> >> >> > +     if (p) {
> >> >> >> >> >> > +             for (i = 0; i < nr_pages; i += batch_nr) {
> >> >> >> >> >> > +                     batch_nr = min_t(int, SWAP_BATCH_NR, remain_nr);
> >> >> >> >> >> > +
> >> >> >> >> >> > +                     ci = lock_cluster_or_swap_info(p, offset);
> >> >> >> >> >> > +                     for (j = 0; j < batch_nr; j++) {
> >> >> >> >> >> > +                             if (__swap_entry_free_locked(p, offset + i * SWAP_BATCH_NR + j, 1))
> >> >> >> >> >> > +                                     __bitmap_set(usage, j, 1);
> >> >> >> >> >> > +                     }
> >> >> >> >> >> > +                     unlock_cluster_or_swap_info(p, ci);
> >> >> >> >> >> > +
> >> >> >> >> >> > +                     for_each_clear_bit(j, usage, batch_nr)
> >> >> >> >> >> > +                             free_swap_slot(swp_entry(type, offset + i * SWAP_BATCH_NR + j));
> >> >> >> >> >> > +
> >> >> >> >> >> > +                     bitmap_clear(usage, 0, SWAP_BATCH_NR);
> >> >> >> >> >> > +                     remain_nr -= batch_nr;
> >> >> >> >> >> > +             }
> >> >> >> >> >> > +     }
> >> >> >> >> >> > +}
> >> >> >> >> >> > +
> >> >> >> >> >> >  /*
> >> >> >> >> >> >   * Called after dropping swapcache to decrease refcnt to swap entries.
> >> >> >> >> >> >   */
> >> >> >> >> >>
> >> >> >> >> >> put_swap_folio() implements batching in another method.  Do you think
> >> >> >> >> >> that it's good to use the batching method in that function here?  It
> >> >> >> >> >> avoids to use bitmap operations and stack space.
> >> >> >> >> >
> >> >> >> >> > Chuanhua has strictly limited the maximum stack usage to several
> >> >> >> >> > unsigned long,
> >> >> >> >>
> >> >> >> >> 512 / 8 = 64 bytes.
> >> >> >> >>
> >> >> >> >> So, not trivial.
> >> >> >> >>
> >> >> >> >> > so this should be safe. on the other hand, i believe this
> >> >> >> >> > implementation is more efficient, as  put_swap_folio() might lock/
> >> >> >> >> > unlock much more often whenever __swap_entry_free_locked returns
> >> >> >> >> > 0.
> >> >> >> >>
> >> >> >> >> There are 2 most common use cases,
> >> >> >> >>
> >> >> >> >> - all swap entries have usage count == 0
> >> >> >> >> - all swap entries have usage count != 0
> >> >> >> >>
> >> >> >> >> In both cases, we only need to lock/unlock once.  In fact, I didn't
> >> >> >> >> find possible use cases other than above.
> >> >> >> >
> >> >> >> > i guess the point is free_swap_slot() shouldn't be called within
> >> >> >> > lock_cluster_or_swap_info? so when we are freeing nr_pages slots,
> >> >> >> > we'll have to unlock and lock nr_pages times?  and this is the most
> >> >> >> > common scenario.
> >> >> >>
> >> >> >> No.  In put_swap_folio(), free_entries is either SWAPFILE_CLUSTER (that
> >> >> >> is, nr_pages) or 0.  These are the most common cases.
> >> >> >>
> >> >> >
> >> >> > i am actually talking about the below code path,
> >> >> >
> >> >> > void put_swap_folio(struct folio *folio, swp_entry_t entry)
> >> >> > {
> >> >> >         ...
> >> >> >
> >> >> >         ci = lock_cluster_or_swap_info(si, offset);
> >> >> >         ...
> >> >> >         for (i = 0; i < size; i++, entry.val++) {
> >> >> >                 if (!__swap_entry_free_locked(si, offset + i, SWAP_HAS_CACHE)) {
> >> >> >                         unlock_cluster_or_swap_info(si, ci);
> >> >> >                         free_swap_slot(entry);
> >> >> >                         if (i == size - 1)
> >> >> >                                 return;
> >> >> >                         lock_cluster_or_swap_info(si, offset);
> >> >> >                 }
> >> >> >         }
> >> >> >         unlock_cluster_or_swap_info(si, ci);
> >> >> > }
> >> >> >
> >> >> > but i guess you are talking about the below code path:
> >> >> >
> >> >> > void put_swap_folio(struct folio *folio, swp_entry_t entry)
> >> >> > {
> >> >> >         ...
> >> >> >
> >> >> >         ci = lock_cluster_or_swap_info(si, offset);
> >> >> >         if (size == SWAPFILE_CLUSTER) {
> >> >> >                 map = si->swap_map + offset;
> >> >> >                 for (i = 0; i < SWAPFILE_CLUSTER; i++) {
> >> >> >                         val = map[i];
> >> >> >                         VM_BUG_ON(!(val & SWAP_HAS_CACHE));
> >> >> >                         if (val == SWAP_HAS_CACHE)
> >> >> >                                 free_entries++;
> >> >> >                 }
> >> >> >                 if (free_entries == SWAPFILE_CLUSTER) {
> >> >> >                         unlock_cluster_or_swap_info(si, ci);
> >> >> >                         spin_lock(&si->lock);
> >> >> >                         mem_cgroup_uncharge_swap(entry, SWAPFILE_CLUSTER);
> >> >> >                         swap_free_cluster(si, idx);
> >> >> >                         spin_unlock(&si->lock);
> >> >> >                         return;
> >> >> >                 }
> >> >> >         }
> >> >> > }
> >> >>
> >> >> I am talking about both code paths.  In 2 most common cases,
> >> >> __swap_entry_free_locked() will return 0 or !0 for all entries in range.
> >> >
> >> > I grasp your point, but if conditions involving 0 or non-0 values fail, we'll
> >> > end up repeatedly unlocking and locking. Picture a scenario with a large
> >> > folio shared by multiple processes. One process might unmap a portion
> >> > while another still holds an entire mapping to it. This could lead to situations
> >> > where free_entries doesn't equal 0 and free_entries doesn't equal
> >> > nr_pages, resulting in multiple unlock and lock operations.
> >>
> >> This is impossible in current caller, because the folio is in the swap
> >> cache.  But if we move the change to __swap_entry_free_nr(), we may run
> >> into this situation.
> >
> > I don't understand why it is impossible, after try_to_unmap_one() has done
> > on one process, mprotect and munmap called on a part of the large folio
> > pte entries which now have been swap entries, we are removing the PTE
> > for this part. Another process can entirely hit the swapcache and have
> > all swap entries mapped there, and we call swap_free_nr(entry, nr_pages) in
> > do_swap_page. Within those swap entries, some have swapcount=1 and others
> > have swapcount > 1. Am I missing something?
>
> For swap entries with swapcount=1, its sis->swap_map[] will be
>
> 1 | SWAP_HAS_CACHE
>
> so, __swap_entry_free_locked() will return SWAP_HAS_CACHE instead of 0.
>
> The swap entries will be free in
>
> folio_free_swap
>   delete_from_swap_cache
>     put_swap_folio
>

Yes. I realized this after replying to you yesterday.

> >> > Chuanhua has invested significant effort in following Ryan's suggestion
> >> > for the current approach, which generally handles all cases, especially
> >> > partial unmapping. Additionally, the widespread use of swap_free_nr()
> >> > as you suggested across various scenarios is noteworthy.
> >> >
> >> > Unless there's evidence indicating performance issues or bugs, I believe
> >> > the current approach remains preferable.
> >>
> >> TBH, I don't like the large stack space usage (64 bytes).  How about use
> >> a "unsigned long" as bitmap?  Then, we use much less stack space, use
> >> bitmap == 0 and bitmap == (unsigned long)(-1) to check the most common
> >> use cases.  And, we have enough batching.
> >
> > that is quite a straightforward modification like,
> >
> > - #define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 512 ? 512 : SWAPFILE_CLUSTER)
> > + #define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 64 ? 64 : SWAPFILE_CLUSTER)
> >
> > there is no necessity to remove the bitmap API and move to raw
> > unsigned long operations.
> > as bitmap is exactly some unsigned long. on 64bit CPU, we are now one
> > unsigned long,
> > on 32bit CPU, it is now two unsigned long.
>
> Yes.  We can still use most bitmap APIs if we use "unsigned long" as
> bitmap.  The advantage of "unsigned long" is to guarantee that
> bitmap_empty() and bitmap_full() is trivial.  We can use that for
> optimization.  For example, we can skip unlock/lock if bitmap_empty().

anyway we have avoided lock_cluster_or_swap_info and unlock_cluster_or_swap_info
for each individual swap entry.

if bitma_empty(), we won't call free_swap_slot() so no chance to
further take any lock,
right?

the optimization of bitmap_full() seems to be more useful only after we have
void free_swap_slot(swp_entry_t entry, int nr)

in which we can avoid many spin_lock_irq(&cache->free_lock);

On the other hand, it seems we can directly call
swapcache_free_entries() to skip cache if
nr_pages >= SWAP_BATCH(64) this might be an optimization as we are now
having a bitmap exactly equals 64.

>
> >>
> >> >>
> >> >> > we are mTHP, so we can't assume our size is SWAPFILE_CLUSTER?
> >> >> > or you want to check free_entries == "1 << swap_entry_order(folio_order(folio))"
> >> >> > instead of SWAPFILE_CLUSTER for the "for (i = 0; i < size; i++, entry.val++)"
> >> >> > path?
> >> >>
> >> >> Just replace SWAPFILE_CLUSTER with "nr_pages" in your code.
> >> >>
> >> >> >
> >> >> >> >>
> >> >> >> >> And, we should add batching in __swap_entry_free().  That will help
> >> >> >> >> free_swap_and_cache_nr() too.
> >> >> >
> >> >> > Chris Li and I actually discussed it before, while I completely
> >> >> > agree this can be batched. but i'd like to defer this as an incremental
> >> >> > patchset later to keep this swapcache-refault small.
> >> >>
> >> >> OK.
> >> >>
> >> >> >>
> >> >> >> Please consider this too.
>
> --
> Best Regards,
> Huang, Ying

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 5/5] mm: add per-order mTHP swpin_refault counter
  2024-04-17  1:16     ` Barry Song
@ 2024-04-17  1:38       ` Huang, Ying
  2024-04-17  1:48         ` Barry Song
  0 siblings, 1 reply; 54+ messages in thread
From: Huang, Ying @ 2024-04-17  1:38 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, baolin.wang, chrisl, david, hanchuanhua, hannes,
	hughd, kasong, ryan.roberts, surenb, v-songbaohua, willy, xiang,
	yosryahmed, yuzhao, ziy, linux-kernel

Barry Song <21cnbao@gmail.com> writes:

> On Wed, Apr 17, 2024 at 8:47 AM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Barry Song <21cnbao@gmail.com> writes:
>>
>> > From: Barry Song <v-songbaohua@oppo.com>
>> >
>> > Currently, we are handling the scenario where we've hit a
>> > large folio in the swapcache, and the reclaiming process
>> > for this large folio is still ongoing.
>> >
>> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>> > ---
>> >  include/linux/huge_mm.h | 1 +
>> >  mm/huge_memory.c        | 2 ++
>> >  mm/memory.c             | 1 +
>> >  3 files changed, 4 insertions(+)
>> >
>> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> > index c8256af83e33..b67294d5814f 100644
>> > --- a/include/linux/huge_mm.h
>> > +++ b/include/linux/huge_mm.h
>> > @@ -269,6 +269,7 @@ enum mthp_stat_item {
>> >       MTHP_STAT_ANON_ALLOC_FALLBACK,
>> >       MTHP_STAT_ANON_SWPOUT,
>> >       MTHP_STAT_ANON_SWPOUT_FALLBACK,
>> > +     MTHP_STAT_ANON_SWPIN_REFAULT,
>>
>> This is different from the refault concept used in other place in mm
>> subystem.  Please check the following code
>>
>>         if (shadow)
>>                 workingset_refault(folio, shadow);
>>
>> in __read_swap_cache_async().
>
> right. it is slightly different as refault can also cover the case folios
> have been entirely released and a new page fault happens soon
> after it.
> Do you have a better name for this?
> MTHP_STAT_ANON_SWPIN_UNDER_RECLAIM
> or
> MTHP_STAT_ANON_SWPIN_RECLAIMING ?

TBH, I don't think we need this counter.  It's important for you during
implementation.  But I don't think it's important for end users.

--
Best Regards,
Huang, Ying

>>
>> >       __MTHP_STAT_COUNT
>> >  };
>>
>> --
>> Best Regards,
>> Huang, Ying
>>
>> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> > index d8d2ed80b0bf..fb95345b0bde 100644
>> > --- a/mm/huge_memory.c
>> > +++ b/mm/huge_memory.c
>> > @@ -556,12 +556,14 @@ DEFINE_MTHP_STAT_ATTR(anon_alloc, MTHP_STAT_ANON_ALLOC);
>> >  DEFINE_MTHP_STAT_ATTR(anon_alloc_fallback, MTHP_STAT_ANON_ALLOC_FALLBACK);
>> >  DEFINE_MTHP_STAT_ATTR(anon_swpout, MTHP_STAT_ANON_SWPOUT);
>> >  DEFINE_MTHP_STAT_ATTR(anon_swpout_fallback, MTHP_STAT_ANON_SWPOUT_FALLBACK);
>> > +DEFINE_MTHP_STAT_ATTR(anon_swpin_refault, MTHP_STAT_ANON_SWPIN_REFAULT);
>> >
>> >  static struct attribute *stats_attrs[] = {
>> >       &anon_alloc_attr.attr,
>> >       &anon_alloc_fallback_attr.attr,
>> >       &anon_swpout_attr.attr,
>> >       &anon_swpout_fallback_attr.attr,
>> > +     &anon_swpin_refault_attr.attr,
>> >       NULL,
>> >  };
>> >
>> > diff --git a/mm/memory.c b/mm/memory.c
>> > index 9818dc1893c8..acc023795a4d 100644
>> > --- a/mm/memory.c
>> > +++ b/mm/memory.c
>> > @@ -4167,6 +4167,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> >               nr_pages = nr;
>> >               entry = folio->swap;
>> >               page = &folio->page;
>> > +             count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_SWPIN_REFAULT);
>> >       }
>> >
>> >  check_pte:

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 5/5] mm: add per-order mTHP swpin_refault counter
  2024-04-17  1:38       ` Huang, Ying
@ 2024-04-17  1:48         ` Barry Song
  0 siblings, 0 replies; 54+ messages in thread
From: Barry Song @ 2024-04-17  1:48 UTC (permalink / raw)
  To: Huang, Ying
  Cc: akpm, linux-mm, baolin.wang, chrisl, david, hanchuanhua, hannes,
	hughd, kasong, ryan.roberts, surenb, v-songbaohua, willy, xiang,
	yosryahmed, yuzhao, ziy, linux-kernel

On Wed, Apr 17, 2024 at 1:40 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > On Wed, Apr 17, 2024 at 8:47 AM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Barry Song <21cnbao@gmail.com> writes:
> >>
> >> > From: Barry Song <v-songbaohua@oppo.com>
> >> >
> >> > Currently, we are handling the scenario where we've hit a
> >> > large folio in the swapcache, and the reclaiming process
> >> > for this large folio is still ongoing.
> >> >
> >> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> >> > ---
> >> >  include/linux/huge_mm.h | 1 +
> >> >  mm/huge_memory.c        | 2 ++
> >> >  mm/memory.c             | 1 +
> >> >  3 files changed, 4 insertions(+)
> >> >
> >> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> >> > index c8256af83e33..b67294d5814f 100644
> >> > --- a/include/linux/huge_mm.h
> >> > +++ b/include/linux/huge_mm.h
> >> > @@ -269,6 +269,7 @@ enum mthp_stat_item {
> >> >       MTHP_STAT_ANON_ALLOC_FALLBACK,
> >> >       MTHP_STAT_ANON_SWPOUT,
> >> >       MTHP_STAT_ANON_SWPOUT_FALLBACK,
> >> > +     MTHP_STAT_ANON_SWPIN_REFAULT,
> >>
> >> This is different from the refault concept used in other place in mm
> >> subystem.  Please check the following code
> >>
> >>         if (shadow)
> >>                 workingset_refault(folio, shadow);
> >>
> >> in __read_swap_cache_async().
> >
> > right. it is slightly different as refault can also cover the case folios
> > have been entirely released and a new page fault happens soon
> > after it.
> > Do you have a better name for this?
> > MTHP_STAT_ANON_SWPIN_UNDER_RECLAIM
> > or
> > MTHP_STAT_ANON_SWPIN_RECLAIMING ?
>
> TBH, I don't think we need this counter.  It's important for you during
> implementation.  But I don't think it's important for end users.

Okay. If we can't find a shared interest between the
implementer and user, I'm perfectly fine with keeping it
local only for debugging purposes.

>
> --
> Best Regards,
> Huang, Ying
>
> >>
> >> >       __MTHP_STAT_COUNT
> >> >  };
> >>
> >> --
> >> Best Regards,
> >> Huang, Ying
> >>
> >> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >> > index d8d2ed80b0bf..fb95345b0bde 100644
> >> > --- a/mm/huge_memory.c
> >> > +++ b/mm/huge_memory.c
> >> > @@ -556,12 +556,14 @@ DEFINE_MTHP_STAT_ATTR(anon_alloc, MTHP_STAT_ANON_ALLOC);
> >> >  DEFINE_MTHP_STAT_ATTR(anon_alloc_fallback, MTHP_STAT_ANON_ALLOC_FALLBACK);
> >> >  DEFINE_MTHP_STAT_ATTR(anon_swpout, MTHP_STAT_ANON_SWPOUT);
> >> >  DEFINE_MTHP_STAT_ATTR(anon_swpout_fallback, MTHP_STAT_ANON_SWPOUT_FALLBACK);
> >> > +DEFINE_MTHP_STAT_ATTR(anon_swpin_refault, MTHP_STAT_ANON_SWPIN_REFAULT);
> >> >
> >> >  static struct attribute *stats_attrs[] = {
> >> >       &anon_alloc_attr.attr,
> >> >       &anon_alloc_fallback_attr.attr,
> >> >       &anon_swpout_attr.attr,
> >> >       &anon_swpout_fallback_attr.attr,
> >> > +     &anon_swpin_refault_attr.attr,
> >> >       NULL,
> >> >  };
> >> >
> >> > diff --git a/mm/memory.c b/mm/memory.c
> >> > index 9818dc1893c8..acc023795a4d 100644
> >> > --- a/mm/memory.c
> >> > +++ b/mm/memory.c
> >> > @@ -4167,6 +4167,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >> >               nr_pages = nr;
> >> >               entry = folio->swap;
> >> >               page = &folio->page;
> >> > +             count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_SWPIN_REFAULT);
> >> >       }
> >> >
> >> >  check_pte:

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 1/5] mm: swap: introduce swap_free_nr() for batched swap_free()
  2024-04-17  1:35                         ` Barry Song
@ 2024-04-18  5:27                           ` Barry Song
  2024-04-18  8:55                             ` Huang, Ying
  0 siblings, 1 reply; 54+ messages in thread
From: Barry Song @ 2024-04-18  5:27 UTC (permalink / raw)
  To: Huang, Ying
  Cc: akpm, linux-mm, baolin.wang, chrisl, david, hanchuanhua, hannes,
	hughd, kasong, ryan.roberts, surenb, v-songbaohua, willy, xiang,
	yosryahmed, yuzhao, ziy, linux-kernel

On Wed, Apr 17, 2024 at 1:35 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Wed, Apr 17, 2024 at 12:34 PM Huang, Ying <ying.huang@intel.com> wrote:
> >
> > Barry Song <21cnbao@gmail.com> writes:
> >
> > > On Tue, Apr 16, 2024 at 3:13 PM Huang, Ying <ying.huang@intel.com> wrote:
> > >>
> > >> Barry Song <21cnbao@gmail.com> writes:
> > >>
> > >> > On Tue, Apr 16, 2024 at 1:42 PM Huang, Ying <ying.huang@intel.com> wrote:
> > >> >>
> > >> >> Barry Song <21cnbao@gmail.com> writes:
> > >> >>
> > >> >> > On Mon, Apr 15, 2024 at 8:53 PM Huang, Ying <ying.huang@intel.com> wrote:
> > >> >> >>
> > >> >> >> Barry Song <21cnbao@gmail.com> writes:
> > >> >> >>
> > >> >> >> > On Mon, Apr 15, 2024 at 8:21 PM Huang, Ying <ying.huang@intel.com> wrote:
> > >> >> >> >>
> > >> >> >> >> Barry Song <21cnbao@gmail.com> writes:
> > >> >> >> >>
> > >> >> >> >> > On Mon, Apr 15, 2024 at 6:19 PM Huang, Ying <ying.huang@intel.com> wrote:
> > >> >> >> >> >>
> > >> >> >> >> >> Barry Song <21cnbao@gmail.com> writes:
> > >> >> >> >> >>
> > >> >> >> >> >> > From: Chuanhua Han <hanchuanhua@oppo.com>
> > >> >> >> >> >> >
> > >> >> >> >> >> > While swapping in a large folio, we need to free swaps related to the whole
> > >> >> >> >> >> > folio. To avoid frequently acquiring and releasing swap locks, it is better
> > >> >> >> >> >> > to introduce an API for batched free.
> > >> >> >> >> >> >
> > >> >> >> >> >> > Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> > >> >> >> >> >> > Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> > >> >> >> >> >> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > >> >> >> >> >> > ---
> > >> >> >> >> >> >  include/linux/swap.h |  5 +++++
> > >> >> >> >> >> >  mm/swapfile.c        | 51 ++++++++++++++++++++++++++++++++++++++++++++
> > >> >> >> >> >> >  2 files changed, 56 insertions(+)
> > >> >> >> >> >> >
> > >> >> >> >> >> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > >> >> >> >> >> > index 11c53692f65f..b7a107e983b8 100644
> > >> >> >> >> >> > --- a/include/linux/swap.h
> > >> >> >> >> >> > +++ b/include/linux/swap.h
> > >> >> >> >> >> > @@ -483,6 +483,7 @@ extern void swap_shmem_alloc(swp_entry_t);
> > >> >> >> >> >> >  extern int swap_duplicate(swp_entry_t);
> > >> >> >> >> >> >  extern int swapcache_prepare(swp_entry_t);
> > >> >> >> >> >> >  extern void swap_free(swp_entry_t);
> > >> >> >> >> >> > +extern void swap_free_nr(swp_entry_t entry, int nr_pages);
> > >> >> >> >> >> >  extern void swapcache_free_entries(swp_entry_t *entries, int n);
> > >> >> >> >> >> >  extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
> > >> >> >> >> >> >  int swap_type_of(dev_t device, sector_t offset);
> > >> >> >> >> >> > @@ -564,6 +565,10 @@ static inline void swap_free(swp_entry_t swp)
> > >> >> >> >> >> >  {
> > >> >> >> >> >> >  }
> > >> >> >> >> >> >
> > >> >> >> >> >> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
> > >> >> >> >> >> > +{
> > >> >> >> >> >> > +}
> > >> >> >> >> >> > +
> > >> >> >> >> >> >  static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
> > >> >> >> >> >> >  {
> > >> >> >> >> >> >  }
> > >> >> >> >> >> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > >> >> >> >> >> > index 28642c188c93..f4c65aeb088d 100644
> > >> >> >> >> >> > --- a/mm/swapfile.c
> > >> >> >> >> >> > +++ b/mm/swapfile.c
> > >> >> >> >> >> > @@ -1356,6 +1356,57 @@ void swap_free(swp_entry_t entry)
> > >> >> >> >> >> >               __swap_entry_free(p, entry);
> > >> >> >> >> >> >  }
> > >> >> >> >> >> >
> > >> >> >> >> >> > +/*
> > >> >> >> >> >> > + * Free up the maximum number of swap entries at once to limit the
> > >> >> >> >> >> > + * maximum kernel stack usage.
> > >> >> >> >> >> > + */
> > >> >> >> >> >> > +#define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 512 ? 512 : SWAPFILE_CLUSTER)
> > >> >> >> >> >> > +
> > >> >> >> >> >> > +/*
> > >> >> >> >> >> > + * Called after swapping in a large folio,
> > >> >> >> >> >>
> > >> >> >> >> >> IMHO, it's not good to document the caller in the function definition.
> > >> >> >> >> >> Because this will discourage function reusing.
> > >> >> >> >> >
> > >> >> >> >> > ok. right now there is only one user that is why it is added. but i agree
> > >> >> >> >> > we can actually remove this.
> > >> >> >> >> >
> > >> >> >> >> >>
> > >> >> >> >> >> > batched free swap entries
> > >> >> >> >> >> > + * for this large folio, entry should be for the first subpage and
> > >> >> >> >> >> > + * its offset is aligned with nr_pages
> > >> >> >> >> >>
> > >> >> >> >> >> Why do we need this?
> > >> >> >> >> >
> > >> >> >> >> > This is a fundamental requirement for the existing kernel, folio's
> > >> >> >> >> > swap offset is naturally aligned from the first moment add_to_swap
> > >> >> >> >> > to add swapcache's xa. so this comment is describing the existing
> > >> >> >> >> > fact. In the future, if we want to support swap-out folio to discontiguous
> > >> >> >> >> > and not-aligned offsets, we can't pass entry as the parameter, we should
> > >> >> >> >> > instead pass ptep or another different data struct which can connect
> > >> >> >> >> > multiple discontiguous swap offsets.
> > >> >> >> >> >
> > >> >> >> >> > I feel like we only need "for this large folio, entry should be for
> > >> >> >> >> > the first subpage" and drop "and its offset is aligned with nr_pages",
> > >> >> >> >> > the latter is not important to this context at all.
> > >> >> >> >>
> > >> >> >> >> IIUC, all these are requirements of the only caller now, not the
> > >> >> >> >> function itself.  If only part of the all swap entries of a mTHP are
> > >> >> >> >> called with swap_free_nr(), can swap_free_nr() still do its work?  If
> > >> >> >> >> so, why not make swap_free_nr() as general as possible?
> > >> >> >> >
> > >> >> >> > right , i believe we can make swap_free_nr() as general as possible.
> > >> >> >> >
> > >> >> >> >>
> > >> >> >> >> >>
> > >> >> >> >> >> > + */
> > >> >> >> >> >> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
> > >> >> >> >> >> > +{
> > >> >> >> >> >> > +     int i, j;
> > >> >> >> >> >> > +     struct swap_cluster_info *ci;
> > >> >> >> >> >> > +     struct swap_info_struct *p;
> > >> >> >> >> >> > +     unsigned int type = swp_type(entry);
> > >> >> >> >> >> > +     unsigned long offset = swp_offset(entry);
> > >> >> >> >> >> > +     int batch_nr, remain_nr;
> > >> >> >> >> >> > +     DECLARE_BITMAP(usage, SWAP_BATCH_NR) = { 0 };
> > >> >> >> >> >> > +
> > >> >> >> >> >> > +     /* all swap entries are within a cluster for mTHP */
> > >> >> >> >> >> > +     VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
> > >> >> >> >> >> > +
> > >> >> >> >> >> > +     if (nr_pages == 1) {
> > >> >> >> >> >> > +             swap_free(entry);
> > >> >> >> >> >> > +             return;
> > >> >> >> >> >> > +     }
> > >> >> >> >> >>
> > >> >> >> >> >> Is it possible to unify swap_free() and swap_free_nr() into one function
> > >> >> >> >> >> with acceptable performance?  IIUC, the general rule in mTHP effort is
> > >> >> >> >> >> to avoid duplicate functions between mTHP and normal small folio.
> > >> >> >> >> >> Right?
> > >> >> >> >> >
> > >> >> >> >> > I don't see why.
> > >> >> >> >>
> > >> >> >> >> Because duplicated implementation are hard to maintain in the long term.
> > >> >> >> >
> > >> >> >> > sorry, i actually meant "I don't see why not",  for some reason, the "not"
> > >> >> >> > was missed. Obviously I meant "why not", there was a "but" after it :-)
> > >> >> >> >
> > >> >> >> >>
> > >> >> >> >> > but we have lots of places calling swap_free(), we may
> > >> >> >> >> > have to change them all to call swap_free_nr(entry, 1); the other possible
> > >> >> >> >> > way is making swap_free() a wrapper of swap_free_nr() always using
> > >> >> >> >> > 1 as the argument. In both cases, we are changing the semantics of
> > >> >> >> >> > swap_free_nr() to partially freeing large folio cases and have to drop
> > >> >> >> >> > "entry should be for the first subpage" then.
> > >> >> >> >> >
> > >> >> >> >> > Right now, the semantics is
> > >> >> >> >> > * swap_free_nr() for an entire large folio;
> > >> >> >> >> > * swap_free() for one entry of either a large folio or a small folio
> > >> >> >> >>
> > >> >> >> >> As above, I don't think the these semantics are important for
> > >> >> >> >> swap_free_nr() implementation.
> > >> >> >> >
> > >> >> >> > right. I agree. If we are ready to change all those callers, nothing
> > >> >> >> > can stop us from removing swap_free().
> > >> >> >> >
> > >> >> >> >>
> > >> >> >> >> >>
> > >> >> >> >> >> > +
> > >> >> >> >> >> > +     remain_nr = nr_pages;
> > >> >> >> >> >> > +     p = _swap_info_get(entry);
> > >> >> >> >> >> > +     if (p) {
> > >> >> >> >> >> > +             for (i = 0; i < nr_pages; i += batch_nr) {
> > >> >> >> >> >> > +                     batch_nr = min_t(int, SWAP_BATCH_NR, remain_nr);
> > >> >> >> >> >> > +
> > >> >> >> >> >> > +                     ci = lock_cluster_or_swap_info(p, offset);
> > >> >> >> >> >> > +                     for (j = 0; j < batch_nr; j++) {
> > >> >> >> >> >> > +                             if (__swap_entry_free_locked(p, offset + i * SWAP_BATCH_NR + j, 1))
> > >> >> >> >> >> > +                                     __bitmap_set(usage, j, 1);
> > >> >> >> >> >> > +                     }
> > >> >> >> >> >> > +                     unlock_cluster_or_swap_info(p, ci);
> > >> >> >> >> >> > +
> > >> >> >> >> >> > +                     for_each_clear_bit(j, usage, batch_nr)
> > >> >> >> >> >> > +                             free_swap_slot(swp_entry(type, offset + i * SWAP_BATCH_NR + j));
> > >> >> >> >> >> > +
> > >> >> >> >> >> > +                     bitmap_clear(usage, 0, SWAP_BATCH_NR);
> > >> >> >> >> >> > +                     remain_nr -= batch_nr;
> > >> >> >> >> >> > +             }
> > >> >> >> >> >> > +     }
> > >> >> >> >> >> > +}
> > >> >> >> >> >> > +
> > >> >> >> >> >> >  /*
> > >> >> >> >> >> >   * Called after dropping swapcache to decrease refcnt to swap entries.
> > >> >> >> >> >> >   */
> > >> >> >> >> >>
> > >> >> >> >> >> put_swap_folio() implements batching in another method.  Do you think
> > >> >> >> >> >> that it's good to use the batching method in that function here?  It
> > >> >> >> >> >> avoids to use bitmap operations and stack space.
> > >> >> >> >> >
> > >> >> >> >> > Chuanhua has strictly limited the maximum stack usage to several
> > >> >> >> >> > unsigned long,
> > >> >> >> >>
> > >> >> >> >> 512 / 8 = 64 bytes.
> > >> >> >> >>
> > >> >> >> >> So, not trivial.
> > >> >> >> >>
> > >> >> >> >> > so this should be safe. on the other hand, i believe this
> > >> >> >> >> > implementation is more efficient, as  put_swap_folio() might lock/
> > >> >> >> >> > unlock much more often whenever __swap_entry_free_locked returns
> > >> >> >> >> > 0.
> > >> >> >> >>
> > >> >> >> >> There are 2 most common use cases,
> > >> >> >> >>
> > >> >> >> >> - all swap entries have usage count == 0
> > >> >> >> >> - all swap entries have usage count != 0
> > >> >> >> >>
> > >> >> >> >> In both cases, we only need to lock/unlock once.  In fact, I didn't
> > >> >> >> >> find possible use cases other than above.
> > >> >> >> >
> > >> >> >> > i guess the point is free_swap_slot() shouldn't be called within
> > >> >> >> > lock_cluster_or_swap_info? so when we are freeing nr_pages slots,
> > >> >> >> > we'll have to unlock and lock nr_pages times?  and this is the most
> > >> >> >> > common scenario.
> > >> >> >>
> > >> >> >> No.  In put_swap_folio(), free_entries is either SWAPFILE_CLUSTER (that
> > >> >> >> is, nr_pages) or 0.  These are the most common cases.
> > >> >> >>
> > >> >> >
> > >> >> > i am actually talking about the below code path,
> > >> >> >
> > >> >> > void put_swap_folio(struct folio *folio, swp_entry_t entry)
> > >> >> > {
> > >> >> >         ...
> > >> >> >
> > >> >> >         ci = lock_cluster_or_swap_info(si, offset);
> > >> >> >         ...
> > >> >> >         for (i = 0; i < size; i++, entry.val++) {
> > >> >> >                 if (!__swap_entry_free_locked(si, offset + i, SWAP_HAS_CACHE)) {
> > >> >> >                         unlock_cluster_or_swap_info(si, ci);
> > >> >> >                         free_swap_slot(entry);
> > >> >> >                         if (i == size - 1)
> > >> >> >                                 return;
> > >> >> >                         lock_cluster_or_swap_info(si, offset);
> > >> >> >                 }
> > >> >> >         }
> > >> >> >         unlock_cluster_or_swap_info(si, ci);
> > >> >> > }
> > >> >> >
> > >> >> > but i guess you are talking about the below code path:
> > >> >> >
> > >> >> > void put_swap_folio(struct folio *folio, swp_entry_t entry)
> > >> >> > {
> > >> >> >         ...
> > >> >> >
> > >> >> >         ci = lock_cluster_or_swap_info(si, offset);
> > >> >> >         if (size == SWAPFILE_CLUSTER) {
> > >> >> >                 map = si->swap_map + offset;
> > >> >> >                 for (i = 0; i < SWAPFILE_CLUSTER; i++) {
> > >> >> >                         val = map[i];
> > >> >> >                         VM_BUG_ON(!(val & SWAP_HAS_CACHE));
> > >> >> >                         if (val == SWAP_HAS_CACHE)
> > >> >> >                                 free_entries++;
> > >> >> >                 }
> > >> >> >                 if (free_entries == SWAPFILE_CLUSTER) {
> > >> >> >                         unlock_cluster_or_swap_info(si, ci);
> > >> >> >                         spin_lock(&si->lock);
> > >> >> >                         mem_cgroup_uncharge_swap(entry, SWAPFILE_CLUSTER);
> > >> >> >                         swap_free_cluster(si, idx);
> > >> >> >                         spin_unlock(&si->lock);
> > >> >> >                         return;
> > >> >> >                 }
> > >> >> >         }
> > >> >> > }
> > >> >>
> > >> >> I am talking about both code paths.  In 2 most common cases,
> > >> >> __swap_entry_free_locked() will return 0 or !0 for all entries in range.
> > >> >
> > >> > I grasp your point, but if conditions involving 0 or non-0 values fail, we'll
> > >> > end up repeatedly unlocking and locking. Picture a scenario with a large
> > >> > folio shared by multiple processes. One process might unmap a portion
> > >> > while another still holds an entire mapping to it. This could lead to situations
> > >> > where free_entries doesn't equal 0 and free_entries doesn't equal
> > >> > nr_pages, resulting in multiple unlock and lock operations.
> > >>
> > >> This is impossible in current caller, because the folio is in the swap
> > >> cache.  But if we move the change to __swap_entry_free_nr(), we may run
> > >> into this situation.
> > >
> > > I don't understand why it is impossible, after try_to_unmap_one() has done
> > > on one process, mprotect and munmap called on a part of the large folio
> > > pte entries which now have been swap entries, we are removing the PTE
> > > for this part. Another process can entirely hit the swapcache and have
> > > all swap entries mapped there, and we call swap_free_nr(entry, nr_pages) in
> > > do_swap_page. Within those swap entries, some have swapcount=1 and others
> > > have swapcount > 1. Am I missing something?
> >
> > For swap entries with swapcount=1, its sis->swap_map[] will be
> >
> > 1 | SWAP_HAS_CACHE
> >
> > so, __swap_entry_free_locked() will return SWAP_HAS_CACHE instead of 0.
> >
> > The swap entries will be free in
> >
> > folio_free_swap
> >   delete_from_swap_cache
> >     put_swap_folio
> >
>
> Yes. I realized this after replying to you yesterday.
>
> > >> > Chuanhua has invested significant effort in following Ryan's suggestion
> > >> > for the current approach, which generally handles all cases, especially
> > >> > partial unmapping. Additionally, the widespread use of swap_free_nr()
> > >> > as you suggested across various scenarios is noteworthy.
> > >> >
> > >> > Unless there's evidence indicating performance issues or bugs, I believe
> > >> > the current approach remains preferable.
> > >>
> > >> TBH, I don't like the large stack space usage (64 bytes).  How about use
> > >> a "unsigned long" as bitmap?  Then, we use much less stack space, use
> > >> bitmap == 0 and bitmap == (unsigned long)(-1) to check the most common
> > >> use cases.  And, we have enough batching.
> > >
> > > that is quite a straightforward modification like,
> > >
> > > - #define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 512 ? 512 : SWAPFILE_CLUSTER)
> > > + #define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 64 ? 64 : SWAPFILE_CLUSTER)
> > >
> > > there is no necessity to remove the bitmap API and move to raw
> > > unsigned long operations.
> > > as bitmap is exactly some unsigned long. on 64bit CPU, we are now one
> > > unsigned long,
> > > on 32bit CPU, it is now two unsigned long.
> >
> > Yes.  We can still use most bitmap APIs if we use "unsigned long" as
> > bitmap.  The advantage of "unsigned long" is to guarantee that
> > bitmap_empty() and bitmap_full() is trivial.  We can use that for
> > optimization.  For example, we can skip unlock/lock if bitmap_empty().
>
> anyway we have avoided lock_cluster_or_swap_info and unlock_cluster_or_swap_info
> for each individual swap entry.
>
> if bitma_empty(), we won't call free_swap_slot() so no chance to
> further take any lock,
> right?
>
> the optimization of bitmap_full() seems to be more useful only after we have
> void free_swap_slot(swp_entry_t entry, int nr)
>
> in which we can avoid many spin_lock_irq(&cache->free_lock);
>
> On the other hand, it seems we can directly call
> swapcache_free_entries() to skip cache if
> nr_pages >= SWAP_BATCH(64) this might be an optimization as we are now
> having a bitmap exactly equals 64.

Hi ying,
considering the below code which has changed bitmap to 64 and generally support
different nr_pages(1 and ever cross cluster),

#define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 64 ? 64 : SWAPFILE_CLUSTER)

void swap_free_nr(swp_entry_t entry, int nr_pages)
{
        int i = 0, j;
        struct swap_cluster_info *ci;
        struct swap_info_struct *p;
        unsigned int type = swp_type(entry);
        unsigned long offset = swp_offset(entry);
        int batch_nr, remain_nr;
        DECLARE_BITMAP(usage, SWAP_BATCH_NR) = { 0 };

        remain_nr = nr_pages;
        p = _swap_info_get(entry);
        if (!p)
                return;

        for ( ; ; ) {
                batch_nr = min3(SWAP_BATCH_NR, remain_nr,
                                (int)(SWAPFILE_CLUSTER - (offset %
SWAPFILE_CLUSTER)));

                ci = lock_cluster_or_swap_info(p, offset);
                for (j = 0; j < batch_nr; j++) {
                        if (__swap_entry_free_locked(p, offset + i *
SWAP_BATCH_NR + j, 1))
                                __bitmap_set(usage, j, 1);
                }
                unlock_cluster_or_swap_info(p, ci);

                for_each_clear_bit(j, usage, batch_nr)
                        free_swap_slot(swp_entry(type, offset + i *
SWAP_BATCH_NR + j));

                i += batch_nr;
                if (i >= nr_pages)
                        break;

                bitmap_clear(usage, 0, SWAP_BATCH_NR);
                remain_nr -= batch_nr;
        }
}

I still don't see the benefits of using bitmap_full and bitmap_empty over simple
for_each_clear_bit() unless we begin to support free_swap_slot_nr(), which,
I believe, needs a separate incremental patchset.

using bitmap_empty and full, if we want to free all slots, we need
if (bitmap_empty(usage))
{
    for (i=0;i<batch_nr;i++)
              free_swap_slot(swp_entry(type, offset + i * SWAP_BATCH_NR + j));
}
This seems just a game replacing for_each_clear_bit by
bitmap_empty()+another for loop.

if we don't want to free any one, we need
if(bitmap_full(usage)
       do_nothing.

in the for_each_clear_bit() case, the loop just simply ends.

What's your proposal code to use bitmap_empty and bitmap_full here?
Am I missing something?

>
> >
> > >>
> > >> >>
> > >> >> > we are mTHP, so we can't assume our size is SWAPFILE_CLUSTER?
> > >> >> > or you want to check free_entries == "1 << swap_entry_order(folio_order(folio))"
> > >> >> > instead of SWAPFILE_CLUSTER for the "for (i = 0; i < size; i++, entry.val++)"
> > >> >> > path?
> > >> >>
> > >> >> Just replace SWAPFILE_CLUSTER with "nr_pages" in your code.
> > >> >>
> > >> >> >
> > >> >> >> >>
> > >> >> >> >> And, we should add batching in __swap_entry_free().  That will help
> > >> >> >> >> free_swap_and_cache_nr() too.
> > >> >> >
> > >> >> > Chris Li and I actually discussed it before, while I completely
> > >> >> > agree this can be batched. but i'd like to defer this as an incremental
> > >> >> > patchset later to keep this swapcache-refault small.
> > >> >>
> > >> >> OK.
> > >> >>
> > >> >> >>
> > >> >> >> Please consider this too.
> >
> > --
> > Best Regards,
> > Huang, Ying

Thanks
Barry

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 1/5] mm: swap: introduce swap_free_nr() for batched swap_free()
  2024-04-18  5:27                           ` Barry Song
@ 2024-04-18  8:55                             ` Huang, Ying
  2024-04-18  9:14                               ` Barry Song
  0 siblings, 1 reply; 54+ messages in thread
From: Huang, Ying @ 2024-04-18  8:55 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, baolin.wang, chrisl, david, hanchuanhua, hannes,
	hughd, kasong, ryan.roberts, surenb, v-songbaohua, willy, xiang,
	yosryahmed, yuzhao, ziy, linux-kernel

Barry Song <21cnbao@gmail.com> writes:

> On Wed, Apr 17, 2024 at 1:35 PM Barry Song <21cnbao@gmail.com> wrote:
>>
>> On Wed, Apr 17, 2024 at 12:34 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >
>> > Barry Song <21cnbao@gmail.com> writes:
>> >
>> > > On Tue, Apr 16, 2024 at 3:13 PM Huang, Ying <ying.huang@intel.com> wrote:
>> > >>
>> > >> Barry Song <21cnbao@gmail.com> writes:
>> > >>
>> > >> > On Tue, Apr 16, 2024 at 1:42 PM Huang, Ying <ying.huang@intel.com> wrote:
>> > >> >>
>> > >> >> Barry Song <21cnbao@gmail.com> writes:
>> > >> >>
>> > >> >> > On Mon, Apr 15, 2024 at 8:53 PM Huang, Ying <ying.huang@intel.com> wrote:
>> > >> >> >>
>> > >> >> >> Barry Song <21cnbao@gmail.com> writes:
>> > >> >> >>
>> > >> >> >> > On Mon, Apr 15, 2024 at 8:21 PM Huang, Ying <ying.huang@intel.com> wrote:
>> > >> >> >> >>
>> > >> >> >> >> Barry Song <21cnbao@gmail.com> writes:
>> > >> >> >> >>
>> > >> >> >> >> > On Mon, Apr 15, 2024 at 6:19 PM Huang, Ying <ying.huang@intel.com> wrote:
>> > >> >> >> >> >>
>> > >> >> >> >> >> Barry Song <21cnbao@gmail.com> writes:
>> > >> >> >> >> >>
>> > >> >> >> >> >> > From: Chuanhua Han <hanchuanhua@oppo.com>
>> > >> >> >> >> >> >
>> > >> >> >> >> >> > While swapping in a large folio, we need to free swaps related to the whole
>> > >> >> >> >> >> > folio. To avoid frequently acquiring and releasing swap locks, it is better
>> > >> >> >> >> >> > to introduce an API for batched free.
>> > >> >> >> >> >> >
>> > >> >> >> >> >> > Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
>> > >> >> >> >> >> > Co-developed-by: Barry Song <v-songbaohua@oppo.com>
>> > >> >> >> >> >> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>> > >> >> >> >> >> > ---
>> > >> >> >> >> >> >  include/linux/swap.h |  5 +++++
>> > >> >> >> >> >> >  mm/swapfile.c        | 51 ++++++++++++++++++++++++++++++++++++++++++++
>> > >> >> >> >> >> >  2 files changed, 56 insertions(+)
>> > >> >> >> >> >> >
>> > >> >> >> >> >> > diff --git a/include/linux/swap.h b/include/linux/swap.h
>> > >> >> >> >> >> > index 11c53692f65f..b7a107e983b8 100644
>> > >> >> >> >> >> > --- a/include/linux/swap.h
>> > >> >> >> >> >> > +++ b/include/linux/swap.h
>> > >> >> >> >> >> > @@ -483,6 +483,7 @@ extern void swap_shmem_alloc(swp_entry_t);
>> > >> >> >> >> >> >  extern int swap_duplicate(swp_entry_t);
>> > >> >> >> >> >> >  extern int swapcache_prepare(swp_entry_t);
>> > >> >> >> >> >> >  extern void swap_free(swp_entry_t);
>> > >> >> >> >> >> > +extern void swap_free_nr(swp_entry_t entry, int nr_pages);
>> > >> >> >> >> >> >  extern void swapcache_free_entries(swp_entry_t *entries, int n);
>> > >> >> >> >> >> >  extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
>> > >> >> >> >> >> >  int swap_type_of(dev_t device, sector_t offset);
>> > >> >> >> >> >> > @@ -564,6 +565,10 @@ static inline void swap_free(swp_entry_t swp)
>> > >> >> >> >> >> >  {
>> > >> >> >> >> >> >  }
>> > >> >> >> >> >> >
>> > >> >> >> >> >> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
>> > >> >> >> >> >> > +{
>> > >> >> >> >> >> > +}
>> > >> >> >> >> >> > +
>> > >> >> >> >> >> >  static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
>> > >> >> >> >> >> >  {
>> > >> >> >> >> >> >  }
>> > >> >> >> >> >> > diff --git a/mm/swapfile.c b/mm/swapfile.c
>> > >> >> >> >> >> > index 28642c188c93..f4c65aeb088d 100644
>> > >> >> >> >> >> > --- a/mm/swapfile.c
>> > >> >> >> >> >> > +++ b/mm/swapfile.c
>> > >> >> >> >> >> > @@ -1356,6 +1356,57 @@ void swap_free(swp_entry_t entry)
>> > >> >> >> >> >> >               __swap_entry_free(p, entry);
>> > >> >> >> >> >> >  }
>> > >> >> >> >> >> >
>> > >> >> >> >> >> > +/*
>> > >> >> >> >> >> > + * Free up the maximum number of swap entries at once to limit the
>> > >> >> >> >> >> > + * maximum kernel stack usage.
>> > >> >> >> >> >> > + */
>> > >> >> >> >> >> > +#define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 512 ? 512 : SWAPFILE_CLUSTER)
>> > >> >> >> >> >> > +
>> > >> >> >> >> >> > +/*
>> > >> >> >> >> >> > + * Called after swapping in a large folio,
>> > >> >> >> >> >>
>> > >> >> >> >> >> IMHO, it's not good to document the caller in the function definition.
>> > >> >> >> >> >> Because this will discourage function reusing.
>> > >> >> >> >> >
>> > >> >> >> >> > ok. right now there is only one user that is why it is added. but i agree
>> > >> >> >> >> > we can actually remove this.
>> > >> >> >> >> >
>> > >> >> >> >> >>
>> > >> >> >> >> >> > batched free swap entries
>> > >> >> >> >> >> > + * for this large folio, entry should be for the first subpage and
>> > >> >> >> >> >> > + * its offset is aligned with nr_pages
>> > >> >> >> >> >>
>> > >> >> >> >> >> Why do we need this?
>> > >> >> >> >> >
>> > >> >> >> >> > This is a fundamental requirement for the existing kernel, folio's
>> > >> >> >> >> > swap offset is naturally aligned from the first moment add_to_swap
>> > >> >> >> >> > to add swapcache's xa. so this comment is describing the existing
>> > >> >> >> >> > fact. In the future, if we want to support swap-out folio to discontiguous
>> > >> >> >> >> > and not-aligned offsets, we can't pass entry as the parameter, we should
>> > >> >> >> >> > instead pass ptep or another different data struct which can connect
>> > >> >> >> >> > multiple discontiguous swap offsets.
>> > >> >> >> >> >
>> > >> >> >> >> > I feel like we only need "for this large folio, entry should be for
>> > >> >> >> >> > the first subpage" and drop "and its offset is aligned with nr_pages",
>> > >> >> >> >> > the latter is not important to this context at all.
>> > >> >> >> >>
>> > >> >> >> >> IIUC, all these are requirements of the only caller now, not the
>> > >> >> >> >> function itself.  If only part of the all swap entries of a mTHP are
>> > >> >> >> >> called with swap_free_nr(), can swap_free_nr() still do its work?  If
>> > >> >> >> >> so, why not make swap_free_nr() as general as possible?
>> > >> >> >> >
>> > >> >> >> > right , i believe we can make swap_free_nr() as general as possible.
>> > >> >> >> >
>> > >> >> >> >>
>> > >> >> >> >> >>
>> > >> >> >> >> >> > + */
>> > >> >> >> >> >> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
>> > >> >> >> >> >> > +{
>> > >> >> >> >> >> > +     int i, j;
>> > >> >> >> >> >> > +     struct swap_cluster_info *ci;
>> > >> >> >> >> >> > +     struct swap_info_struct *p;
>> > >> >> >> >> >> > +     unsigned int type = swp_type(entry);
>> > >> >> >> >> >> > +     unsigned long offset = swp_offset(entry);
>> > >> >> >> >> >> > +     int batch_nr, remain_nr;
>> > >> >> >> >> >> > +     DECLARE_BITMAP(usage, SWAP_BATCH_NR) = { 0 };
>> > >> >> >> >> >> > +
>> > >> >> >> >> >> > +     /* all swap entries are within a cluster for mTHP */
>> > >> >> >> >> >> > +     VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
>> > >> >> >> >> >> > +
>> > >> >> >> >> >> > +     if (nr_pages == 1) {
>> > >> >> >> >> >> > +             swap_free(entry);
>> > >> >> >> >> >> > +             return;
>> > >> >> >> >> >> > +     }
>> > >> >> >> >> >>
>> > >> >> >> >> >> Is it possible to unify swap_free() and swap_free_nr() into one function
>> > >> >> >> >> >> with acceptable performance?  IIUC, the general rule in mTHP effort is
>> > >> >> >> >> >> to avoid duplicate functions between mTHP and normal small folio.
>> > >> >> >> >> >> Right?
>> > >> >> >> >> >
>> > >> >> >> >> > I don't see why.
>> > >> >> >> >>
>> > >> >> >> >> Because duplicated implementation are hard to maintain in the long term.
>> > >> >> >> >
>> > >> >> >> > sorry, i actually meant "I don't see why not",  for some reason, the "not"
>> > >> >> >> > was missed. Obviously I meant "why not", there was a "but" after it :-)
>> > >> >> >> >
>> > >> >> >> >>
>> > >> >> >> >> > but we have lots of places calling swap_free(), we may
>> > >> >> >> >> > have to change them all to call swap_free_nr(entry, 1); the other possible
>> > >> >> >> >> > way is making swap_free() a wrapper of swap_free_nr() always using
>> > >> >> >> >> > 1 as the argument. In both cases, we are changing the semantics of
>> > >> >> >> >> > swap_free_nr() to partially freeing large folio cases and have to drop
>> > >> >> >> >> > "entry should be for the first subpage" then.
>> > >> >> >> >> >
>> > >> >> >> >> > Right now, the semantics is
>> > >> >> >> >> > * swap_free_nr() for an entire large folio;
>> > >> >> >> >> > * swap_free() for one entry of either a large folio or a small folio
>> > >> >> >> >>
>> > >> >> >> >> As above, I don't think the these semantics are important for
>> > >> >> >> >> swap_free_nr() implementation.
>> > >> >> >> >
>> > >> >> >> > right. I agree. If we are ready to change all those callers, nothing
>> > >> >> >> > can stop us from removing swap_free().
>> > >> >> >> >
>> > >> >> >> >>
>> > >> >> >> >> >>
>> > >> >> >> >> >> > +
>> > >> >> >> >> >> > +     remain_nr = nr_pages;
>> > >> >> >> >> >> > +     p = _swap_info_get(entry);
>> > >> >> >> >> >> > +     if (p) {
>> > >> >> >> >> >> > +             for (i = 0; i < nr_pages; i += batch_nr) {
>> > >> >> >> >> >> > +                     batch_nr = min_t(int, SWAP_BATCH_NR, remain_nr);
>> > >> >> >> >> >> > +
>> > >> >> >> >> >> > +                     ci = lock_cluster_or_swap_info(p, offset);
>> > >> >> >> >> >> > +                     for (j = 0; j < batch_nr; j++) {
>> > >> >> >> >> >> > +                             if (__swap_entry_free_locked(p, offset + i * SWAP_BATCH_NR + j, 1))
>> > >> >> >> >> >> > +                                     __bitmap_set(usage, j, 1);
>> > >> >> >> >> >> > +                     }
>> > >> >> >> >> >> > +                     unlock_cluster_or_swap_info(p, ci);
>> > >> >> >> >> >> > +
>> > >> >> >> >> >> > +                     for_each_clear_bit(j, usage, batch_nr)
>> > >> >> >> >> >> > +                             free_swap_slot(swp_entry(type, offset + i * SWAP_BATCH_NR + j));
>> > >> >> >> >> >> > +
>> > >> >> >> >> >> > +                     bitmap_clear(usage, 0, SWAP_BATCH_NR);
>> > >> >> >> >> >> > +                     remain_nr -= batch_nr;
>> > >> >> >> >> >> > +             }
>> > >> >> >> >> >> > +     }
>> > >> >> >> >> >> > +}
>> > >> >> >> >> >> > +
>> > >> >> >> >> >> >  /*
>> > >> >> >> >> >> >   * Called after dropping swapcache to decrease refcnt to swap entries.
>> > >> >> >> >> >> >   */
>> > >> >> >> >> >>
>> > >> >> >> >> >> put_swap_folio() implements batching in another method.  Do you think
>> > >> >> >> >> >> that it's good to use the batching method in that function here?  It
>> > >> >> >> >> >> avoids to use bitmap operations and stack space.
>> > >> >> >> >> >
>> > >> >> >> >> > Chuanhua has strictly limited the maximum stack usage to several
>> > >> >> >> >> > unsigned long,
>> > >> >> >> >>
>> > >> >> >> >> 512 / 8 = 64 bytes.
>> > >> >> >> >>
>> > >> >> >> >> So, not trivial.
>> > >> >> >> >>
>> > >> >> >> >> > so this should be safe. on the other hand, i believe this
>> > >> >> >> >> > implementation is more efficient, as  put_swap_folio() might lock/
>> > >> >> >> >> > unlock much more often whenever __swap_entry_free_locked returns
>> > >> >> >> >> > 0.
>> > >> >> >> >>
>> > >> >> >> >> There are 2 most common use cases,
>> > >> >> >> >>
>> > >> >> >> >> - all swap entries have usage count == 0
>> > >> >> >> >> - all swap entries have usage count != 0
>> > >> >> >> >>
>> > >> >> >> >> In both cases, we only need to lock/unlock once.  In fact, I didn't
>> > >> >> >> >> find possible use cases other than above.
>> > >> >> >> >
>> > >> >> >> > i guess the point is free_swap_slot() shouldn't be called within
>> > >> >> >> > lock_cluster_or_swap_info? so when we are freeing nr_pages slots,
>> > >> >> >> > we'll have to unlock and lock nr_pages times?  and this is the most
>> > >> >> >> > common scenario.
>> > >> >> >>
>> > >> >> >> No.  In put_swap_folio(), free_entries is either SWAPFILE_CLUSTER (that
>> > >> >> >> is, nr_pages) or 0.  These are the most common cases.
>> > >> >> >>
>> > >> >> >
>> > >> >> > i am actually talking about the below code path,
>> > >> >> >
>> > >> >> > void put_swap_folio(struct folio *folio, swp_entry_t entry)
>> > >> >> > {
>> > >> >> >         ...
>> > >> >> >
>> > >> >> >         ci = lock_cluster_or_swap_info(si, offset);
>> > >> >> >         ...
>> > >> >> >         for (i = 0; i < size; i++, entry.val++) {
>> > >> >> >                 if (!__swap_entry_free_locked(si, offset + i, SWAP_HAS_CACHE)) {
>> > >> >> >                         unlock_cluster_or_swap_info(si, ci);
>> > >> >> >                         free_swap_slot(entry);
>> > >> >> >                         if (i == size - 1)
>> > >> >> >                                 return;
>> > >> >> >                         lock_cluster_or_swap_info(si, offset);
>> > >> >> >                 }
>> > >> >> >         }
>> > >> >> >         unlock_cluster_or_swap_info(si, ci);
>> > >> >> > }
>> > >> >> >
>> > >> >> > but i guess you are talking about the below code path:
>> > >> >> >
>> > >> >> > void put_swap_folio(struct folio *folio, swp_entry_t entry)
>> > >> >> > {
>> > >> >> >         ...
>> > >> >> >
>> > >> >> >         ci = lock_cluster_or_swap_info(si, offset);
>> > >> >> >         if (size == SWAPFILE_CLUSTER) {
>> > >> >> >                 map = si->swap_map + offset;
>> > >> >> >                 for (i = 0; i < SWAPFILE_CLUSTER; i++) {
>> > >> >> >                         val = map[i];
>> > >> >> >                         VM_BUG_ON(!(val & SWAP_HAS_CACHE));
>> > >> >> >                         if (val == SWAP_HAS_CACHE)
>> > >> >> >                                 free_entries++;
>> > >> >> >                 }
>> > >> >> >                 if (free_entries == SWAPFILE_CLUSTER) {
>> > >> >> >                         unlock_cluster_or_swap_info(si, ci);
>> > >> >> >                         spin_lock(&si->lock);
>> > >> >> >                         mem_cgroup_uncharge_swap(entry, SWAPFILE_CLUSTER);
>> > >> >> >                         swap_free_cluster(si, idx);
>> > >> >> >                         spin_unlock(&si->lock);
>> > >> >> >                         return;
>> > >> >> >                 }
>> > >> >> >         }
>> > >> >> > }
>> > >> >>
>> > >> >> I am talking about both code paths.  In 2 most common cases,
>> > >> >> __swap_entry_free_locked() will return 0 or !0 for all entries in range.
>> > >> >
>> > >> > I grasp your point, but if conditions involving 0 or non-0 values fail, we'll
>> > >> > end up repeatedly unlocking and locking. Picture a scenario with a large
>> > >> > folio shared by multiple processes. One process might unmap a portion
>> > >> > while another still holds an entire mapping to it. This could lead to situations
>> > >> > where free_entries doesn't equal 0 and free_entries doesn't equal
>> > >> > nr_pages, resulting in multiple unlock and lock operations.
>> > >>
>> > >> This is impossible in current caller, because the folio is in the swap
>> > >> cache.  But if we move the change to __swap_entry_free_nr(), we may run
>> > >> into this situation.
>> > >
>> > > I don't understand why it is impossible, after try_to_unmap_one() has done
>> > > on one process, mprotect and munmap called on a part of the large folio
>> > > pte entries which now have been swap entries, we are removing the PTE
>> > > for this part. Another process can entirely hit the swapcache and have
>> > > all swap entries mapped there, and we call swap_free_nr(entry, nr_pages) in
>> > > do_swap_page. Within those swap entries, some have swapcount=1 and others
>> > > have swapcount > 1. Am I missing something?
>> >
>> > For swap entries with swapcount=1, its sis->swap_map[] will be
>> >
>> > 1 | SWAP_HAS_CACHE
>> >
>> > so, __swap_entry_free_locked() will return SWAP_HAS_CACHE instead of 0.
>> >
>> > The swap entries will be free in
>> >
>> > folio_free_swap
>> >   delete_from_swap_cache
>> >     put_swap_folio
>> >
>>
>> Yes. I realized this after replying to you yesterday.
>>
>> > >> > Chuanhua has invested significant effort in following Ryan's suggestion
>> > >> > for the current approach, which generally handles all cases, especially
>> > >> > partial unmapping. Additionally, the widespread use of swap_free_nr()
>> > >> > as you suggested across various scenarios is noteworthy.
>> > >> >
>> > >> > Unless there's evidence indicating performance issues or bugs, I believe
>> > >> > the current approach remains preferable.
>> > >>
>> > >> TBH, I don't like the large stack space usage (64 bytes).  How about use
>> > >> a "unsigned long" as bitmap?  Then, we use much less stack space, use
>> > >> bitmap == 0 and bitmap == (unsigned long)(-1) to check the most common
>> > >> use cases.  And, we have enough batching.
>> > >
>> > > that is quite a straightforward modification like,
>> > >
>> > > - #define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 512 ? 512 : SWAPFILE_CLUSTER)
>> > > + #define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 64 ? 64 : SWAPFILE_CLUSTER)
>> > >
>> > > there is no necessity to remove the bitmap API and move to raw
>> > > unsigned long operations.
>> > > as bitmap is exactly some unsigned long. on 64bit CPU, we are now one
>> > > unsigned long,
>> > > on 32bit CPU, it is now two unsigned long.
>> >
>> > Yes.  We can still use most bitmap APIs if we use "unsigned long" as
>> > bitmap.  The advantage of "unsigned long" is to guarantee that
>> > bitmap_empty() and bitmap_full() is trivial.  We can use that for
>> > optimization.  For example, we can skip unlock/lock if bitmap_empty().
>>
>> anyway we have avoided lock_cluster_or_swap_info and unlock_cluster_or_swap_info
>> for each individual swap entry.
>>
>> if bitma_empty(), we won't call free_swap_slot() so no chance to
>> further take any lock,
>> right?
>>
>> the optimization of bitmap_full() seems to be more useful only after we have
>> void free_swap_slot(swp_entry_t entry, int nr)
>>
>> in which we can avoid many spin_lock_irq(&cache->free_lock);
>>
>> On the other hand, it seems we can directly call
>> swapcache_free_entries() to skip cache if
>> nr_pages >= SWAP_BATCH(64) this might be an optimization as we are now
>> having a bitmap exactly equals 64.
>
> Hi ying,
> considering the below code which has changed bitmap to 64 and generally support
> different nr_pages(1 and ever cross cluster),
>
> #define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 64 ? 64 : SWAPFILE_CLUSTER)
>
> void swap_free_nr(swp_entry_t entry, int nr_pages)
> {
>         int i = 0, j;
>         struct swap_cluster_info *ci;
>         struct swap_info_struct *p;
>         unsigned int type = swp_type(entry);
>         unsigned long offset = swp_offset(entry);
>         int batch_nr, remain_nr;
>         DECLARE_BITMAP(usage, SWAP_BATCH_NR) = { 0 };
>
>         remain_nr = nr_pages;
>         p = _swap_info_get(entry);
>         if (!p)
>                 return;
>
>         for ( ; ; ) {
>                 batch_nr = min3(SWAP_BATCH_NR, remain_nr,
>                                 (int)(SWAPFILE_CLUSTER - (offset %
> SWAPFILE_CLUSTER)));
>
>                 ci = lock_cluster_or_swap_info(p, offset);
>                 for (j = 0; j < batch_nr; j++) {
>                         if (__swap_entry_free_locked(p, offset + i *
> SWAP_BATCH_NR + j, 1))
>                                 __bitmap_set(usage, j, 1);
>                 }
>                 unlock_cluster_or_swap_info(p, ci);
>
>                 for_each_clear_bit(j, usage, batch_nr)
>                         free_swap_slot(swp_entry(type, offset + i *
> SWAP_BATCH_NR + j));
>
>                 i += batch_nr;
>                 if (i >= nr_pages)
>                         break;
>
>                 bitmap_clear(usage, 0, SWAP_BATCH_NR);
>                 remain_nr -= batch_nr;
>         }
> }
>
> I still don't see the benefits of using bitmap_full and bitmap_empty over simple
> for_each_clear_bit() unless we begin to support free_swap_slot_nr(), which,
> I believe, needs a separate incremental patchset.
>
> using bitmap_empty and full, if we want to free all slots, we need
> if (bitmap_empty(usage))
> {
>     for (i=0;i<batch_nr;i++)
>               free_swap_slot(swp_entry(type, offset + i * SWAP_BATCH_NR + j));
> }
> This seems just a game replacing for_each_clear_bit by
> bitmap_empty()+another for loop.
>
> if we don't want to free any one, we need
> if(bitmap_full(usage)
>        do_nothing.
>
> in the for_each_clear_bit() case, the loop just simply ends.
>
> What's your proposal code to use bitmap_empty and bitmap_full here?
> Am I missing something?

My idea is something as below.  It's only build tested.

static void cluster_swap_free_nr(struct swap_info_struct *sis,
				 unsigned long offset, int nr_pages)
{
	struct swap_cluster_info *ci;
	DECLARE_BITMAP(to_free, BITS_PER_LONG) = { 0 };
	int i, nr;

	ci = lock_cluster_or_swap_info(sis, offset);
	while (nr_pages) {
		nr = min(BITS_PER_LONG, nr_pages);
		for (i = 0; i < nr; i++) {
			if (!__swap_entry_free_locked(sis, offset + i, 1))
				bitmap_set(to_free, i, 1);
		}
		if (!bitmap_empty(to_free, BITS_PER_LONG)) {
			unlock_cluster_or_swap_info(sis, ci);
			for_each_set_bit(i, to_free, BITS_PER_LONG)
				free_swap_slot(swp_entry(sis->type, offset + i));
			if (nr == nr_pages)
				return;
			bitmap_clear(to_free, 0, BITS_PER_LONG);
			ci = lock_cluster_or_swap_info(sis, offset);
		}
		offset += nr;
		nr_pages -= nr;
	}
	unlock_cluster_or_swap_info(sis, ci);
}

void swap_free_nr(swp_entry_t entry, int nr_pages)
{
	int nr;
	struct swap_info_struct *sis;
	unsigned long offset = swp_offset(entry);

	sis = _swap_info_get(entry);
	if (!sis)
		return;

	while (nr_pages >= 0) {
		nr = min_t(int, nr_pages, SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER);
		cluster_swap_free_nr(sis, offset, nr);
		offset += nr;
		nr_pages -= nr;
	}
}

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 1/5] mm: swap: introduce swap_free_nr() for batched swap_free()
  2024-04-18  8:55                             ` Huang, Ying
@ 2024-04-18  9:14                               ` Barry Song
  2024-05-02 23:05                                 ` Barry Song
  0 siblings, 1 reply; 54+ messages in thread
From: Barry Song @ 2024-04-18  9:14 UTC (permalink / raw)
  To: Huang, Ying
  Cc: akpm, linux-mm, baolin.wang, chrisl, david, hanchuanhua, hannes,
	hughd, kasong, ryan.roberts, surenb, v-songbaohua, willy, xiang,
	yosryahmed, yuzhao, ziy, linux-kernel

On Thu, Apr 18, 2024 at 8:57 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > On Wed, Apr 17, 2024 at 1:35 PM Barry Song <21cnbao@gmail.com> wrote:
> >>
> >> On Wed, Apr 17, 2024 at 12:34 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >
> >> > Barry Song <21cnbao@gmail.com> writes:
> >> >
> >> > > On Tue, Apr 16, 2024 at 3:13 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> > >>
> >> > >> Barry Song <21cnbao@gmail.com> writes:
> >> > >>
> >> > >> > On Tue, Apr 16, 2024 at 1:42 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> > >> >>
> >> > >> >> Barry Song <21cnbao@gmail.com> writes:
> >> > >> >>
> >> > >> >> > On Mon, Apr 15, 2024 at 8:53 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> > >> >> >>
> >> > >> >> >> Barry Song <21cnbao@gmail.com> writes:
> >> > >> >> >>
> >> > >> >> >> > On Mon, Apr 15, 2024 at 8:21 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> > >> >> >> >>
> >> > >> >> >> >> Barry Song <21cnbao@gmail.com> writes:
> >> > >> >> >> >>
> >> > >> >> >> >> > On Mon, Apr 15, 2024 at 6:19 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> > >> >> >> >> >>
> >> > >> >> >> >> >> Barry Song <21cnbao@gmail.com> writes:
> >> > >> >> >> >> >>
> >> > >> >> >> >> >> > From: Chuanhua Han <hanchuanhua@oppo.com>
> >> > >> >> >> >> >> >
> >> > >> >> >> >> >> > While swapping in a large folio, we need to free swaps related to the whole
> >> > >> >> >> >> >> > folio. To avoid frequently acquiring and releasing swap locks, it is better
> >> > >> >> >> >> >> > to introduce an API for batched free.
> >> > >> >> >> >> >> >
> >> > >> >> >> >> >> > Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> >> > >> >> >> >> >> > Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> >> > >> >> >> >> >> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> >> > >> >> >> >> >> > ---
> >> > >> >> >> >> >> >  include/linux/swap.h |  5 +++++
> >> > >> >> >> >> >> >  mm/swapfile.c        | 51 ++++++++++++++++++++++++++++++++++++++++++++
> >> > >> >> >> >> >> >  2 files changed, 56 insertions(+)
> >> > >> >> >> >> >> >
> >> > >> >> >> >> >> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> >> > >> >> >> >> >> > index 11c53692f65f..b7a107e983b8 100644
> >> > >> >> >> >> >> > --- a/include/linux/swap.h
> >> > >> >> >> >> >> > +++ b/include/linux/swap.h
> >> > >> >> >> >> >> > @@ -483,6 +483,7 @@ extern void swap_shmem_alloc(swp_entry_t);
> >> > >> >> >> >> >> >  extern int swap_duplicate(swp_entry_t);
> >> > >> >> >> >> >> >  extern int swapcache_prepare(swp_entry_t);
> >> > >> >> >> >> >> >  extern void swap_free(swp_entry_t);
> >> > >> >> >> >> >> > +extern void swap_free_nr(swp_entry_t entry, int nr_pages);
> >> > >> >> >> >> >> >  extern void swapcache_free_entries(swp_entry_t *entries, int n);
> >> > >> >> >> >> >> >  extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
> >> > >> >> >> >> >> >  int swap_type_of(dev_t device, sector_t offset);
> >> > >> >> >> >> >> > @@ -564,6 +565,10 @@ static inline void swap_free(swp_entry_t swp)
> >> > >> >> >> >> >> >  {
> >> > >> >> >> >> >> >  }
> >> > >> >> >> >> >> >
> >> > >> >> >> >> >> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
> >> > >> >> >> >> >> > +{
> >> > >> >> >> >> >> > +}
> >> > >> >> >> >> >> > +
> >> > >> >> >> >> >> >  static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
> >> > >> >> >> >> >> >  {
> >> > >> >> >> >> >> >  }
> >> > >> >> >> >> >> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> >> > >> >> >> >> >> > index 28642c188c93..f4c65aeb088d 100644
> >> > >> >> >> >> >> > --- a/mm/swapfile.c
> >> > >> >> >> >> >> > +++ b/mm/swapfile.c
> >> > >> >> >> >> >> > @@ -1356,6 +1356,57 @@ void swap_free(swp_entry_t entry)
> >> > >> >> >> >> >> >               __swap_entry_free(p, entry);
> >> > >> >> >> >> >> >  }
> >> > >> >> >> >> >> >
> >> > >> >> >> >> >> > +/*
> >> > >> >> >> >> >> > + * Free up the maximum number of swap entries at once to limit the
> >> > >> >> >> >> >> > + * maximum kernel stack usage.
> >> > >> >> >> >> >> > + */
> >> > >> >> >> >> >> > +#define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 512 ? 512 : SWAPFILE_CLUSTER)
> >> > >> >> >> >> >> > +
> >> > >> >> >> >> >> > +/*
> >> > >> >> >> >> >> > + * Called after swapping in a large folio,
> >> > >> >> >> >> >>
> >> > >> >> >> >> >> IMHO, it's not good to document the caller in the function definition.
> >> > >> >> >> >> >> Because this will discourage function reusing.
> >> > >> >> >> >> >
> >> > >> >> >> >> > ok. right now there is only one user that is why it is added. but i agree
> >> > >> >> >> >> > we can actually remove this.
> >> > >> >> >> >> >
> >> > >> >> >> >> >>
> >> > >> >> >> >> >> > batched free swap entries
> >> > >> >> >> >> >> > + * for this large folio, entry should be for the first subpage and
> >> > >> >> >> >> >> > + * its offset is aligned with nr_pages
> >> > >> >> >> >> >>
> >> > >> >> >> >> >> Why do we need this?
> >> > >> >> >> >> >
> >> > >> >> >> >> > This is a fundamental requirement for the existing kernel, folio's
> >> > >> >> >> >> > swap offset is naturally aligned from the first moment add_to_swap
> >> > >> >> >> >> > to add swapcache's xa. so this comment is describing the existing
> >> > >> >> >> >> > fact. In the future, if we want to support swap-out folio to discontiguous
> >> > >> >> >> >> > and not-aligned offsets, we can't pass entry as the parameter, we should
> >> > >> >> >> >> > instead pass ptep or another different data struct which can connect
> >> > >> >> >> >> > multiple discontiguous swap offsets.
> >> > >> >> >> >> >
> >> > >> >> >> >> > I feel like we only need "for this large folio, entry should be for
> >> > >> >> >> >> > the first subpage" and drop "and its offset is aligned with nr_pages",
> >> > >> >> >> >> > the latter is not important to this context at all.
> >> > >> >> >> >>
> >> > >> >> >> >> IIUC, all these are requirements of the only caller now, not the
> >> > >> >> >> >> function itself.  If only part of the all swap entries of a mTHP are
> >> > >> >> >> >> called with swap_free_nr(), can swap_free_nr() still do its work?  If
> >> > >> >> >> >> so, why not make swap_free_nr() as general as possible?
> >> > >> >> >> >
> >> > >> >> >> > right , i believe we can make swap_free_nr() as general as possible.
> >> > >> >> >> >
> >> > >> >> >> >>
> >> > >> >> >> >> >>
> >> > >> >> >> >> >> > + */
> >> > >> >> >> >> >> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
> >> > >> >> >> >> >> > +{
> >> > >> >> >> >> >> > +     int i, j;
> >> > >> >> >> >> >> > +     struct swap_cluster_info *ci;
> >> > >> >> >> >> >> > +     struct swap_info_struct *p;
> >> > >> >> >> >> >> > +     unsigned int type = swp_type(entry);
> >> > >> >> >> >> >> > +     unsigned long offset = swp_offset(entry);
> >> > >> >> >> >> >> > +     int batch_nr, remain_nr;
> >> > >> >> >> >> >> > +     DECLARE_BITMAP(usage, SWAP_BATCH_NR) = { 0 };
> >> > >> >> >> >> >> > +
> >> > >> >> >> >> >> > +     /* all swap entries are within a cluster for mTHP */
> >> > >> >> >> >> >> > +     VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
> >> > >> >> >> >> >> > +
> >> > >> >> >> >> >> > +     if (nr_pages == 1) {
> >> > >> >> >> >> >> > +             swap_free(entry);
> >> > >> >> >> >> >> > +             return;
> >> > >> >> >> >> >> > +     }
> >> > >> >> >> >> >>
> >> > >> >> >> >> >> Is it possible to unify swap_free() and swap_free_nr() into one function
> >> > >> >> >> >> >> with acceptable performance?  IIUC, the general rule in mTHP effort is
> >> > >> >> >> >> >> to avoid duplicate functions between mTHP and normal small folio.
> >> > >> >> >> >> >> Right?
> >> > >> >> >> >> >
> >> > >> >> >> >> > I don't see why.
> >> > >> >> >> >>
> >> > >> >> >> >> Because duplicated implementation are hard to maintain in the long term.
> >> > >> >> >> >
> >> > >> >> >> > sorry, i actually meant "I don't see why not",  for some reason, the "not"
> >> > >> >> >> > was missed. Obviously I meant "why not", there was a "but" after it :-)
> >> > >> >> >> >
> >> > >> >> >> >>
> >> > >> >> >> >> > but we have lots of places calling swap_free(), we may
> >> > >> >> >> >> > have to change them all to call swap_free_nr(entry, 1); the other possible
> >> > >> >> >> >> > way is making swap_free() a wrapper of swap_free_nr() always using
> >> > >> >> >> >> > 1 as the argument. In both cases, we are changing the semantics of
> >> > >> >> >> >> > swap_free_nr() to partially freeing large folio cases and have to drop
> >> > >> >> >> >> > "entry should be for the first subpage" then.
> >> > >> >> >> >> >
> >> > >> >> >> >> > Right now, the semantics is
> >> > >> >> >> >> > * swap_free_nr() for an entire large folio;
> >> > >> >> >> >> > * swap_free() for one entry of either a large folio or a small folio
> >> > >> >> >> >>
> >> > >> >> >> >> As above, I don't think the these semantics are important for
> >> > >> >> >> >> swap_free_nr() implementation.
> >> > >> >> >> >
> >> > >> >> >> > right. I agree. If we are ready to change all those callers, nothing
> >> > >> >> >> > can stop us from removing swap_free().
> >> > >> >> >> >
> >> > >> >> >> >>
> >> > >> >> >> >> >>
> >> > >> >> >> >> >> > +
> >> > >> >> >> >> >> > +     remain_nr = nr_pages;
> >> > >> >> >> >> >> > +     p = _swap_info_get(entry);
> >> > >> >> >> >> >> > +     if (p) {
> >> > >> >> >> >> >> > +             for (i = 0; i < nr_pages; i += batch_nr) {
> >> > >> >> >> >> >> > +                     batch_nr = min_t(int, SWAP_BATCH_NR, remain_nr);
> >> > >> >> >> >> >> > +
> >> > >> >> >> >> >> > +                     ci = lock_cluster_or_swap_info(p, offset);
> >> > >> >> >> >> >> > +                     for (j = 0; j < batch_nr; j++) {
> >> > >> >> >> >> >> > +                             if (__swap_entry_free_locked(p, offset + i * SWAP_BATCH_NR + j, 1))
> >> > >> >> >> >> >> > +                                     __bitmap_set(usage, j, 1);
> >> > >> >> >> >> >> > +                     }
> >> > >> >> >> >> >> > +                     unlock_cluster_or_swap_info(p, ci);
> >> > >> >> >> >> >> > +
> >> > >> >> >> >> >> > +                     for_each_clear_bit(j, usage, batch_nr)
> >> > >> >> >> >> >> > +                             free_swap_slot(swp_entry(type, offset + i * SWAP_BATCH_NR + j));
> >> > >> >> >> >> >> > +
> >> > >> >> >> >> >> > +                     bitmap_clear(usage, 0, SWAP_BATCH_NR);
> >> > >> >> >> >> >> > +                     remain_nr -= batch_nr;
> >> > >> >> >> >> >> > +             }
> >> > >> >> >> >> >> > +     }
> >> > >> >> >> >> >> > +}
> >> > >> >> >> >> >> > +
> >> > >> >> >> >> >> >  /*
> >> > >> >> >> >> >> >   * Called after dropping swapcache to decrease refcnt to swap entries.
> >> > >> >> >> >> >> >   */
> >> > >> >> >> >> >>
> >> > >> >> >> >> >> put_swap_folio() implements batching in another method.  Do you think
> >> > >> >> >> >> >> that it's good to use the batching method in that function here?  It
> >> > >> >> >> >> >> avoids to use bitmap operations and stack space.
> >> > >> >> >> >> >
> >> > >> >> >> >> > Chuanhua has strictly limited the maximum stack usage to several
> >> > >> >> >> >> > unsigned long,
> >> > >> >> >> >>
> >> > >> >> >> >> 512 / 8 = 64 bytes.
> >> > >> >> >> >>
> >> > >> >> >> >> So, not trivial.
> >> > >> >> >> >>
> >> > >> >> >> >> > so this should be safe. on the other hand, i believe this
> >> > >> >> >> >> > implementation is more efficient, as  put_swap_folio() might lock/
> >> > >> >> >> >> > unlock much more often whenever __swap_entry_free_locked returns
> >> > >> >> >> >> > 0.
> >> > >> >> >> >>
> >> > >> >> >> >> There are 2 most common use cases,
> >> > >> >> >> >>
> >> > >> >> >> >> - all swap entries have usage count == 0
> >> > >> >> >> >> - all swap entries have usage count != 0
> >> > >> >> >> >>
> >> > >> >> >> >> In both cases, we only need to lock/unlock once.  In fact, I didn't
> >> > >> >> >> >> find possible use cases other than above.
> >> > >> >> >> >
> >> > >> >> >> > i guess the point is free_swap_slot() shouldn't be called within
> >> > >> >> >> > lock_cluster_or_swap_info? so when we are freeing nr_pages slots,
> >> > >> >> >> > we'll have to unlock and lock nr_pages times?  and this is the most
> >> > >> >> >> > common scenario.
> >> > >> >> >>
> >> > >> >> >> No.  In put_swap_folio(), free_entries is either SWAPFILE_CLUSTER (that
> >> > >> >> >> is, nr_pages) or 0.  These are the most common cases.
> >> > >> >> >>
> >> > >> >> >
> >> > >> >> > i am actually talking about the below code path,
> >> > >> >> >
> >> > >> >> > void put_swap_folio(struct folio *folio, swp_entry_t entry)
> >> > >> >> > {
> >> > >> >> >         ...
> >> > >> >> >
> >> > >> >> >         ci = lock_cluster_or_swap_info(si, offset);
> >> > >> >> >         ...
> >> > >> >> >         for (i = 0; i < size; i++, entry.val++) {
> >> > >> >> >                 if (!__swap_entry_free_locked(si, offset + i, SWAP_HAS_CACHE)) {
> >> > >> >> >                         unlock_cluster_or_swap_info(si, ci);
> >> > >> >> >                         free_swap_slot(entry);
> >> > >> >> >                         if (i == size - 1)
> >> > >> >> >                                 return;
> >> > >> >> >                         lock_cluster_or_swap_info(si, offset);
> >> > >> >> >                 }
> >> > >> >> >         }
> >> > >> >> >         unlock_cluster_or_swap_info(si, ci);
> >> > >> >> > }
> >> > >> >> >
> >> > >> >> > but i guess you are talking about the below code path:
> >> > >> >> >
> >> > >> >> > void put_swap_folio(struct folio *folio, swp_entry_t entry)
> >> > >> >> > {
> >> > >> >> >         ...
> >> > >> >> >
> >> > >> >> >         ci = lock_cluster_or_swap_info(si, offset);
> >> > >> >> >         if (size == SWAPFILE_CLUSTER) {
> >> > >> >> >                 map = si->swap_map + offset;
> >> > >> >> >                 for (i = 0; i < SWAPFILE_CLUSTER; i++) {
> >> > >> >> >                         val = map[i];
> >> > >> >> >                         VM_BUG_ON(!(val & SWAP_HAS_CACHE));
> >> > >> >> >                         if (val == SWAP_HAS_CACHE)
> >> > >> >> >                                 free_entries++;
> >> > >> >> >                 }
> >> > >> >> >                 if (free_entries == SWAPFILE_CLUSTER) {
> >> > >> >> >                         unlock_cluster_or_swap_info(si, ci);
> >> > >> >> >                         spin_lock(&si->lock);
> >> > >> >> >                         mem_cgroup_uncharge_swap(entry, SWAPFILE_CLUSTER);
> >> > >> >> >                         swap_free_cluster(si, idx);
> >> > >> >> >                         spin_unlock(&si->lock);
> >> > >> >> >                         return;
> >> > >> >> >                 }
> >> > >> >> >         }
> >> > >> >> > }
> >> > >> >>
> >> > >> >> I am talking about both code paths.  In 2 most common cases,
> >> > >> >> __swap_entry_free_locked() will return 0 or !0 for all entries in range.
> >> > >> >
> >> > >> > I grasp your point, but if conditions involving 0 or non-0 values fail, we'll
> >> > >> > end up repeatedly unlocking and locking. Picture a scenario with a large
> >> > >> > folio shared by multiple processes. One process might unmap a portion
> >> > >> > while another still holds an entire mapping to it. This could lead to situations
> >> > >> > where free_entries doesn't equal 0 and free_entries doesn't equal
> >> > >> > nr_pages, resulting in multiple unlock and lock operations.
> >> > >>
> >> > >> This is impossible in current caller, because the folio is in the swap
> >> > >> cache.  But if we move the change to __swap_entry_free_nr(), we may run
> >> > >> into this situation.
> >> > >
> >> > > I don't understand why it is impossible, after try_to_unmap_one() has done
> >> > > on one process, mprotect and munmap called on a part of the large folio
> >> > > pte entries which now have been swap entries, we are removing the PTE
> >> > > for this part. Another process can entirely hit the swapcache and have
> >> > > all swap entries mapped there, and we call swap_free_nr(entry, nr_pages) in
> >> > > do_swap_page. Within those swap entries, some have swapcount=1 and others
> >> > > have swapcount > 1. Am I missing something?
> >> >
> >> > For swap entries with swapcount=1, its sis->swap_map[] will be
> >> >
> >> > 1 | SWAP_HAS_CACHE
> >> >
> >> > so, __swap_entry_free_locked() will return SWAP_HAS_CACHE instead of 0.
> >> >
> >> > The swap entries will be free in
> >> >
> >> > folio_free_swap
> >> >   delete_from_swap_cache
> >> >     put_swap_folio
> >> >
> >>
> >> Yes. I realized this after replying to you yesterday.
> >>
> >> > >> > Chuanhua has invested significant effort in following Ryan's suggestion
> >> > >> > for the current approach, which generally handles all cases, especially
> >> > >> > partial unmapping. Additionally, the widespread use of swap_free_nr()
> >> > >> > as you suggested across various scenarios is noteworthy.
> >> > >> >
> >> > >> > Unless there's evidence indicating performance issues or bugs, I believe
> >> > >> > the current approach remains preferable.
> >> > >>
> >> > >> TBH, I don't like the large stack space usage (64 bytes).  How about use
> >> > >> a "unsigned long" as bitmap?  Then, we use much less stack space, use
> >> > >> bitmap == 0 and bitmap == (unsigned long)(-1) to check the most common
> >> > >> use cases.  And, we have enough batching.
> >> > >
> >> > > that is quite a straightforward modification like,
> >> > >
> >> > > - #define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 512 ? 512 : SWAPFILE_CLUSTER)
> >> > > + #define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 64 ? 64 : SWAPFILE_CLUSTER)
> >> > >
> >> > > there is no necessity to remove the bitmap API and move to raw
> >> > > unsigned long operations.
> >> > > as bitmap is exactly some unsigned long. on 64bit CPU, we are now one
> >> > > unsigned long,
> >> > > on 32bit CPU, it is now two unsigned long.
> >> >
> >> > Yes.  We can still use most bitmap APIs if we use "unsigned long" as
> >> > bitmap.  The advantage of "unsigned long" is to guarantee that
> >> > bitmap_empty() and bitmap_full() is trivial.  We can use that for
> >> > optimization.  For example, we can skip unlock/lock if bitmap_empty().
> >>
> >> anyway we have avoided lock_cluster_or_swap_info and unlock_cluster_or_swap_info
> >> for each individual swap entry.
> >>
> >> if bitma_empty(), we won't call free_swap_slot() so no chance to
> >> further take any lock,
> >> right?
> >>
> >> the optimization of bitmap_full() seems to be more useful only after we have
> >> void free_swap_slot(swp_entry_t entry, int nr)
> >>
> >> in which we can avoid many spin_lock_irq(&cache->free_lock);
> >>
> >> On the other hand, it seems we can directly call
> >> swapcache_free_entries() to skip cache if
> >> nr_pages >= SWAP_BATCH(64) this might be an optimization as we are now
> >> having a bitmap exactly equals 64.
> >
> > Hi ying,
> > considering the below code which has changed bitmap to 64 and generally support
> > different nr_pages(1 and ever cross cluster),
> >
> > #define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 64 ? 64 : SWAPFILE_CLUSTER)
> >
> > void swap_free_nr(swp_entry_t entry, int nr_pages)
> > {
> >         int i = 0, j;
> >         struct swap_cluster_info *ci;
> >         struct swap_info_struct *p;
> >         unsigned int type = swp_type(entry);
> >         unsigned long offset = swp_offset(entry);
> >         int batch_nr, remain_nr;
> >         DECLARE_BITMAP(usage, SWAP_BATCH_NR) = { 0 };
> >
> >         remain_nr = nr_pages;
> >         p = _swap_info_get(entry);
> >         if (!p)
> >                 return;
> >
> >         for ( ; ; ) {
> >                 batch_nr = min3(SWAP_BATCH_NR, remain_nr,
> >                                 (int)(SWAPFILE_CLUSTER - (offset %
> > SWAPFILE_CLUSTER)));
> >
> >                 ci = lock_cluster_or_swap_info(p, offset);
> >                 for (j = 0; j < batch_nr; j++) {
> >                         if (__swap_entry_free_locked(p, offset + i *
> > SWAP_BATCH_NR + j, 1))
> >                                 __bitmap_set(usage, j, 1);
> >                 }
> >                 unlock_cluster_or_swap_info(p, ci);
> >
> >                 for_each_clear_bit(j, usage, batch_nr)
> >                         free_swap_slot(swp_entry(type, offset + i *
> > SWAP_BATCH_NR + j));
> >
> >                 i += batch_nr;
> >                 if (i >= nr_pages)
> >                         break;
> >
> >                 bitmap_clear(usage, 0, SWAP_BATCH_NR);
> >                 remain_nr -= batch_nr;
> >         }
> > }
> >
> > I still don't see the benefits of using bitmap_full and bitmap_empty over simple
> > for_each_clear_bit() unless we begin to support free_swap_slot_nr(), which,
> > I believe, needs a separate incremental patchset.
> >
> > using bitmap_empty and full, if we want to free all slots, we need
> > if (bitmap_empty(usage))
> > {
> >     for (i=0;i<batch_nr;i++)
> >               free_swap_slot(swp_entry(type, offset + i * SWAP_BATCH_NR + j));
> > }
> > This seems just a game replacing for_each_clear_bit by
> > bitmap_empty()+another for loop.
> >
> > if we don't want to free any one, we need
> > if(bitmap_full(usage)
> >        do_nothing.
> >
> > in the for_each_clear_bit() case, the loop just simply ends.
> >
> > What's your proposal code to use bitmap_empty and bitmap_full here?
> > Am I missing something?
>
> My idea is something as below.  It's only build tested.
>
> static void cluster_swap_free_nr(struct swap_info_struct *sis,
>                                  unsigned long offset, int nr_pages)
> {
>         struct swap_cluster_info *ci;
>         DECLARE_BITMAP(to_free, BITS_PER_LONG) = { 0 };
>         int i, nr;
>
>         ci = lock_cluster_or_swap_info(sis, offset);
>         while (nr_pages) {
>                 nr = min(BITS_PER_LONG, nr_pages);
>                 for (i = 0; i < nr; i++) {
>                         if (!__swap_entry_free_locked(sis, offset + i, 1))
>                                 bitmap_set(to_free, i, 1);
>                 }
>                 if (!bitmap_empty(to_free, BITS_PER_LONG)) {
>                         unlock_cluster_or_swap_info(sis, ci);
>                         for_each_set_bit(i, to_free, BITS_PER_LONG)
>                                 free_swap_slot(swp_entry(sis->type, offset + i));
>                         if (nr == nr_pages)
>                                 return;
>                         bitmap_clear(to_free, 0, BITS_PER_LONG);
>                         ci = lock_cluster_or_swap_info(sis, offset);
>                 }
>                 offset += nr;
>                 nr_pages -= nr;
>         }
>         unlock_cluster_or_swap_info(sis, ci);
> }
>
> void swap_free_nr(swp_entry_t entry, int nr_pages)
> {
>         int nr;
>         struct swap_info_struct *sis;
>         unsigned long offset = swp_offset(entry);
>
>         sis = _swap_info_get(entry);
>         if (!sis)
>                 return;
>
>         while (nr_pages >= 0) {
>                 nr = min_t(int, nr_pages, SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER);
>                 cluster_swap_free_nr(sis, offset, nr);
>                 offset += nr;
>                 nr_pages -= nr;
>         }
> }

Thanks!
It seems quite promising. I guess your intention is appreciating the
advantage of
small_const_nbits().
Conversely, you've created a distinct function for entries within a cluster.

I also agree with your change to:
if (!__swap_entry_free_locked(sis, offset + i, 1))

This adjustment ensures that for_each_set_bit() can also benefit from
small_const_nbits.
The original code didn't pass a compile-time const to for_each_clear_bit.

>
> --
> Best Regards,
> Huang, Ying

Thanks
Barry

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 4/5] mm: swap: entirely map large folios found in swapcache
  2024-04-16  2:36         ` Barry Song
  2024-04-16  2:39           ` Huang, Ying
@ 2024-04-18  9:55           ` Barry Song
  1 sibling, 0 replies; 54+ messages in thread
From: Barry Song @ 2024-04-18  9:55 UTC (permalink / raw)
  To: Huang, Ying, Khalid Aziz, sparclinux
  Cc: akpm, linux-mm, baolin.wang, chrisl, david, hanchuanhua, hannes,
	hughd, kasong, ryan.roberts, surenb, v-songbaohua, willy, xiang,
	yosryahmed, yuzhao, ziy, linux-kernel

[snip]

> > >> >
> > >> >       VM_BUG_ON(!folio_test_anon(folio) ||
> > >> >                       (pte_write(pte) && !PageAnonExclusive(page)));
> > >> > -     set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
> > >> > -     arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
> > >> > +     set_ptes(vma->vm_mm, start_address, start_pte, pte, nr_pages);
> > >> > +     vmf->orig_pte = ptep_get(vmf->pte);
> > >> > +     arch_do_swap_page(vma->vm_mm, vma, start_address, pte, pte);
> > >>
> > >> Do we need to call arch_do_swap_page() for each subpage?  IIUC, the
> > >> corresponding arch_unmap_one() will be called for each subpage.
> > >
> > > i actually thought about this very carefully, right now, the only one who
> > > needs this is sparc and it doesn't support THP_SWAPOUT at all. and
> > > there is no proof doing restoration one by one won't really break sparc.
> > > so i'd like to defer this to when sparc really needs THP_SWAPOUT.
> >
> > Let's ask SPARC developer (Cced) for this.
> >
> > IMHO, even if we cannot get help, we need to change code with our
> > understanding instead of deferring it.
>
> ok. Thanks for Ccing sparc developers.

Hi Khalid & Ying (also Cced sparc maillist),

SPARC is the only platform which needs arch_do_swap_page(), right now,
its THP_SWAPOUT is not enabled. so we will not really hit a large folio
in swapcache. just in case you might need THP_SWAPOUT later, i am
changing the code as below,

@@ -4286,7 +4285,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
        VM_BUG_ON(!folio_test_anon(folio) ||
                        (pte_write(pte) && !PageAnonExclusive(page)));
        set_ptes(vma->vm_mm, start_address, start_ptep, pte, nr_pages);
-       arch_do_swap_page(vma->vm_mm, vma, start_address, pte, pte);
+       for (int i = 0; i < nr_pages; i++) {
+               arch_do_swap_page(vma->vm_mm, vma, start_address + i *
PAGE_SIZE,
+                                 pte, pte);
+               pte = pte_advance_pfn(pte, 1);
+       }

        folio_unlock(folio);
        if (folio != swapcache && swapcache) {

for sparc, nr_pages will always be 1(THP_SWAPOUT not enabled). for
arm64/x86/riscv,
it seems redundant to do a for loop "for (int i = 0; i < nr_pages; i++)".

so another option is adding a helper as below to avoid the idle loop
for arm64/x86/riscv etc.

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index e2f45e22a6d1..ea314a5f9b5e 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1085,6 +1085,28 @@ static inline void arch_do_swap_page(struct
mm_struct *mm,
 {

 }
+
+static inline void arch_do_swap_page_nr(struct mm_struct *mm,
+                                    struct vm_area_struct *vma,
+                                    unsigned long addr,
+                                    pte_t pte, pte_t oldpte,
+                                    int nr)
+{
+
+}
+#else
+static inline void arch_do_swap_page_nr(struct mm_struct *mm,
+                                    struct vm_area_struct *vma,
+                                    unsigned long addr,
+                                    pte_t pte, pte_t oldpte,
+                                    int nr)
+{
+       for (int i = 0; i < nr; i++) {
+               arch_do_swap_page(vma->vm_mm, vma, addr + i * PAGE_SIZE,
+                                pte_advance_pfn(pte, i),
+                                pte_advance_pfn(oldpte, i));
+       }
+}
 #endif

Please tell me your preference.

BTW, i found oldpte and pte are always same in do_swap_page(), is it
something wrong? does arch_do_swap_page() really need two same
arguments?


vmf->orig_pte = pte;
...
arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);


>
> >
> > > on the other hand, it seems really bad we have both
> > > arch_swap_restore  - for this, arm64 has moved to using folio
> > > and
> > > arch_do_swap_page
> > >
> > > we should somehow unify them later if sparc wants THP_SWPOUT.

Thanks
Barry

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 1/5] mm: swap: introduce swap_free_nr() for batched swap_free()
  2024-04-18  9:14                               ` Barry Song
@ 2024-05-02 23:05                                 ` Barry Song
  0 siblings, 0 replies; 54+ messages in thread
From: Barry Song @ 2024-05-02 23:05 UTC (permalink / raw)
  To: Huang, Ying
  Cc: akpm, linux-mm, baolin.wang, chrisl, david, hanchuanhua, hannes,
	hughd, kasong, ryan.roberts, surenb, v-songbaohua, willy, xiang,
	yosryahmed, yuzhao, ziy, linux-kernel

[snip]

> >
> > My idea is something as below.  It's only build tested.
> >
> > static void cluster_swap_free_nr(struct swap_info_struct *sis,
> >                                  unsigned long offset, int nr_pages)
> > {
> >         struct swap_cluster_info *ci;
> >         DECLARE_BITMAP(to_free, BITS_PER_LONG) = { 0 };
> >         int i, nr;
> >
> >         ci = lock_cluster_or_swap_info(sis, offset);
> >         while (nr_pages) {
> >                 nr = min(BITS_PER_LONG, nr_pages);
> >                 for (i = 0; i < nr; i++) {
> >                         if (!__swap_entry_free_locked(sis, offset + i, 1))
> >                                 bitmap_set(to_free, i, 1);
> >                 }
> >                 if (!bitmap_empty(to_free, BITS_PER_LONG)) {
> >                         unlock_cluster_or_swap_info(sis, ci);
> >                         for_each_set_bit(i, to_free, BITS_PER_LONG)
> >                                 free_swap_slot(swp_entry(sis->type, offset + i));
> >                         if (nr == nr_pages)
> >                                 return;
> >                         bitmap_clear(to_free, 0, BITS_PER_LONG);
> >                         ci = lock_cluster_or_swap_info(sis, offset);
> >                 }
> >                 offset += nr;
> >                 nr_pages -= nr;
> >         }
> >         unlock_cluster_or_swap_info(sis, ci);
> > }
> >
> > void swap_free_nr(swp_entry_t entry, int nr_pages)
> > {
> >         int nr;
> >         struct swap_info_struct *sis;
> >         unsigned long offset = swp_offset(entry);
> >
> >         sis = _swap_info_get(entry);
> >         if (!sis)
> >                 return;
> >
> >         while (nr_pages >= 0) {

this should be "while (nr_pages) " exactly like
"cluster_swap_free_nr", otherwise, we get

[  383.652632] EXT4-fs (vda): error count since last fsck: 85
[  383.653453] EXT4-fs (vda): initial error at time 1709536044:
mb_free_blocks:1937: block 704527
[  383.654947] EXT4-fs (vda): last error at time 1714689579:
ext4_mb_generate_buddy:1213
[  398.564002] watchdog: BUG: soft lockup - CPU#1 stuck for 26s! [a.out:104]
[  398.564679] Modules linked in:
[  398.565239] irq event stamp: 0
[  398.565648] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
[  398.566888] hardirqs last disabled at (0): [<ffff8000800ade2c>]
copy_process+0x654/0x19a8
[  398.568255] softirqs last  enabled at (0): [<ffff8000800ade2c>]
copy_process+0x654/0x19a8
[  398.568813] softirqs last disabled at (0): [<0000000000000000>] 0x0
[  398.569481] CPU: 1 PID: 104 Comm: a.out Not tainted
6.9.0-rc4-g3c9251435c61-dirty #216
[  398.570076] Hardware name: linux,dummy-virt (DT)
[  398.570600] pstate: 01401005 (nzcv daif +PAN -UAO -TCO +DIT +SSBS BTYPE=--)
[  398.571148] pc : lock_acquire+0x3c/0x88
[  398.571636] lr : lock_acquire+0x3c/0x88
[  398.572075] sp : ffff800086f13af0
[  398.572484] x29: ffff800086f13af0 x28: 0000000000000000 x27: ffff0000c1096800
[  398.573349] x26: 0000000000000001 x25: ffff8000803e4d60 x24: 0000000000000000
[  398.574139] x23: 0000000000000001 x22: 0000000000000000 x21: 0000000000000000
[  398.574915] x20: 0000000000000000 x19: ffff0000c3eb0918 x18: 0000000000000000
[  398.576009] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[  398.576895] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
[  398.577732] x11: 0000000000000000 x10: 0000000000000000 x9 : ffff8000814cf7a0
[  398.578668] x8 : ffff800086f13ab8 x7 : 0000000000000000 x6 : ffff8000803e4d60
[  398.579705] x5 : 0000000000000000 x4 : 0000000000000001 x3 : ffff800082f39008
[  398.580528] x2 : 0000000000000003 x1 : 0000000000000002 x0 : 0000000000000001
[  398.581387] Call trace:
[  398.581740]  lock_acquire+0x3c/0x88
[  398.582159]  _raw_spin_lock+0x50/0x70
[  398.582556]  swap_free_nr+0x98/0x2a0
[  398.582946]  do_swap_page+0x568/0xd00
[  398.583337]  __handle_mm_fault+0x76c/0x16d0
[  398.583853]  handle_mm_fault+0x7c/0x3c8
[  398.584276]  do_page_fault+0x188/0x698
[  398.584731]  do_translation_fault+0xb4/0xd8
[  398.585142]  do_mem_abort+0x4c/0xa8
[  398.585541]  el0_da+0x58/0x128
[  398.585923]  el0t_64_sync_handler+0xe4/0x158
[  398.586355]  el0t_64_sync+0x1a4/0x1a8
[  398.651930] watchdog: BUG: soft lockup - CPU#2 stuck for 26s! [a.out:103]
[  398.652819] Modules linked in:
[  398.653682] irq event stamp: 0
[  398.654213] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
[  398.654714] hardirqs last disabled at (0): [<ffff8000800ade2c>]
copy_process+0x654/0x19a8
[  398.655242] softirqs last  enabled at (0): [<ffff8000800ade2c>]
copy_process+0x654/0x19a8
[  398.655835] softirqs last disabled at (0): [<0000000000000000>] 0x0
[  398.656657] CPU: 2 PID: 103 Comm: a.out Tainted: G             L
 6.9.0-rc4-g3c9251435c61-dirty #216
[  398.657705] Hardware name: linux,dummy-virt (DT)
[  398.658273] pstate: 01401005 (nzcv daif +PAN -UAO -TCO +DIT +SSBS BTYPE=--)
[  398.658743] pc : queued_spin_lock_slowpath+0x5c/0x528
[  398.659146] lr : do_raw_spin_lock+0xc8/0x120
[  398.659575] sp : ffff800086d03a60
[  398.659946] x29: ffff800086d03a60 x28: 0000000000000120 x27: 0800000103050003
[  398.661238] x26: ffff0000c27d5d40 x25: fffffdffc0000000 x24: ffff800082270d08
[  398.662431] x23: ffff800083f73a31 x22: ffff800086d03d48 x21: 0000ffff9bf70000
[  398.663358] x20: ffff800082f39008 x19: ffff0000c27d5d40 x18: 0000000000000000
[  398.664353] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[  398.665825] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
[  398.666715] x11: 0000000000000000 x10: 0000000000000000 x9 : ffff8000801420f8
[  398.667528] x8 : ffff800086d03a48 x7 : 0000000000000000 x6 : ffff8000803abef0
[  398.668548] x5 : 0000000000000000 x4 : 0000000000000001 x3 : ffff800082f39008
[  398.670018] x2 : ffff80012ac81000 x1 : 0000000000000000 x0 : 0000000000000001
[  398.670831] Call trace:
[  398.671136]  queued_spin_lock_slowpath+0x5c/0x528
[  398.671604]  do_raw_spin_lock+0xc8/0x120
[  398.672144]  _raw_spin_lock+0x58/0x70
[  398.672860]  __pte_offset_map_lock+0x98/0x210
[  398.673695]  filemap_map_pages+0x10c/0x7f8
[  398.674495]  __handle_mm_fault+0x11e4/0x16d0
[  398.674926]  handle_mm_fault+0x7c/0x3c8
[  398.675336]  do_page_fault+0x100/0x698
[  398.675863]  do_translation_fault+0xb4/0xd8
[  398.676425]  do_mem_abort+0x4c/0xa8
[  398.677225]  el0_ia+0x80/0x188
[  398.678004]  el0t_64_sync_handler+0x100/0x158
[  398.678778]  el0t_64_sync+0x1a4/0x1a8
[  422.563920] watchdog: BUG: soft lockup - CPU#1 stuck for 48s! [a.out:104]
[  422.564528] Modules linked in:
[  422.564991] irq event stamp: 0



> >                 nr = min_t(int, nr_pages, SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER);
> >                 cluster_swap_free_nr(sis, offset, nr);
> >                 offset += nr;
> >                 nr_pages -= nr;
> >         }
> > }
> > --
> > Best Regards,
> > Huang, Ying
>

Thanks
Barry

^ permalink raw reply	[flat|nested] 54+ messages in thread

end of thread, other threads:[~2024-05-02 23:05 UTC | newest]

Thread overview: 54+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-04-09  8:26 [PATCH v2 0/5] large folios swap-in: handle refault cases first Barry Song
2024-04-09  8:26 ` [PATCH v2 1/5] mm: swap: introduce swap_free_nr() for batched swap_free() Barry Song
2024-04-10 23:37   ` SeongJae Park
2024-04-11  1:27     ` Barry Song
2024-04-11 14:30   ` Ryan Roberts
2024-04-12  2:07     ` Chuanhua Han
2024-04-12 11:28       ` Ryan Roberts
2024-04-12 11:38         ` Chuanhua Han
2024-04-15  6:17   ` Huang, Ying
2024-04-15  7:04     ` Barry Song
2024-04-15  8:06       ` Barry Song
2024-04-15  8:19       ` Huang, Ying
2024-04-15  8:34         ` Barry Song
2024-04-15  8:51           ` Huang, Ying
2024-04-15  9:01             ` Barry Song
2024-04-16  1:40               ` Huang, Ying
2024-04-16  2:08                 ` Barry Song
2024-04-16  3:11                   ` Huang, Ying
2024-04-16  4:32                     ` Barry Song
2024-04-17  0:32                       ` Huang, Ying
2024-04-17  1:35                         ` Barry Song
2024-04-18  5:27                           ` Barry Song
2024-04-18  8:55                             ` Huang, Ying
2024-04-18  9:14                               ` Barry Song
2024-05-02 23:05                                 ` Barry Song
2024-04-09  8:26 ` [PATCH v2 2/5] mm: swap: make should_try_to_free_swap() support large-folio Barry Song
2024-04-15  7:11   ` Huang, Ying
2024-04-09  8:26 ` [PATCH v2 3/5] mm: swap_pte_batch: add an output argument to reture if all swap entries are exclusive Barry Song
2024-04-11 14:54   ` Ryan Roberts
2024-04-11 15:00     ` David Hildenbrand
2024-04-11 15:36       ` Ryan Roberts
2024-04-09  8:26 ` [PATCH v2 4/5] mm: swap: entirely map large folios found in swapcache Barry Song
2024-04-11 15:33   ` Ryan Roberts
2024-04-11 23:30     ` Barry Song
2024-04-12 11:31       ` Ryan Roberts
2024-04-15  8:37   ` Huang, Ying
2024-04-15  8:53     ` Barry Song
2024-04-16  2:25       ` Huang, Ying
2024-04-16  2:36         ` Barry Song
2024-04-16  2:39           ` Huang, Ying
2024-04-16  2:52             ` Barry Song
2024-04-16  3:17               ` Huang, Ying
2024-04-16  4:40                 ` Barry Song
2024-04-18  9:55           ` Barry Song
2024-04-09  8:26 ` [PATCH v2 5/5] mm: add per-order mTHP swpin_refault counter Barry Song
2024-04-10 23:15   ` SeongJae Park
2024-04-11  1:46     ` Barry Song
2024-04-11 16:14       ` SeongJae Park
2024-04-11 15:53   ` Ryan Roberts
2024-04-11 23:01     ` Barry Song
2024-04-17  0:45   ` Huang, Ying
2024-04-17  1:16     ` Barry Song
2024-04-17  1:38       ` Huang, Ying
2024-04-17  1:48         ` Barry Song

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.